# Prediction of Cardiovascular Diseases Using Machine Learning Classification Models

## Importing necessary packages

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.graph_objects as go  # generate graphs
from plotly.subplots import make_subplots  # create subplots

## Data Cleaning and EDA

In [2]:
cardio = pd.read_excel('cardio.xlsx', index_col=0)  # use the first column (id) as the index

In [3]:
cardio.head()

id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio
0,18393,2,168,62,110,80,1,1,0,0,1,0
1,20228,1,156,85,140,90,3,1,0,0,1,1
2,18857,1,165,64,130,70,3,1,0,0,0,1
3,17623,2,169,82,150,100,1,1,0,0,1,1
4,17474,1,156,56,100,60,1,1,0,0,0,0


In [4]:
cardio.shape

(70000, 12)

In [217]:
cardio.describe()

statistic,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio
count,70000.0,70000.0,70000.0,70000.0,70000.0,70000.0,70000.0,70000.0,70000.0,70000.0,70000.0,70000.0
mean,19468.865814,1.349571,164.359229,74.205543,128.817286,96.630414,1.366871,1.226457,0.088129,0.053771,0.803729,0.4997
std,2467.251667,0.476838,8.210126,14.395829,154.011419,188.47253,0.68025,0.57227,0.283484,0.225568,0.397179,0.500003
min,10798.0,1.0,55.0,10.0,-150.0,-70.0,1.0,1.0,0.0,0.0,0.0,0.0
25%,17664.0,1.0,159.0,65.0,120.0,80.0,1.0,1.0,0.0,0.0,1.0,0.0
50%,19703.0,1.0,165.0,72.0,120.0,80.0,1.0,1.0,0.0,0.0,1.0,0.0
75%,21327.0,2.0,170.0,82.0,140.0,90.0,2.0,1.0,0.0,0.0,1.0,1.0
max,23713.0,2.0,250.0,200.0,16020.0,11000.0,3.0,3.0,1.0,1.0,1.0,1.0


## Data description
**There are three types of input features:**

**Objective:** factual information;
**Examination:** results of a medical examination;
**Subjective:** information given by the patient.

**Features:**

| Feature | Type | Column | Values |
|---|---|---|---|
| **Age** | Objective Feature | **age** | int (days) |
| **Height** | Objective Feature | **height** | int (cm) |
| **Weight** | Objective Feature | **weight** | float (kg); stored as int64 in this file |
| **Gender** | Objective Feature | **gender** | categorical code |
| **Systolic blood pressure** | Examination Feature | **ap_hi** | int |
| **Diastolic blood pressure** | Examination Feature | **ap_lo** | int |
| **Cholesterol** | Examination Feature | **cholesterol** | 1: normal, 2: above normal, 3: well above normal |
| **Glucose** | Examination Feature | **gluc** | 1: normal, 2: above normal, 3: well above normal |
| **Smoking** | Subjective Feature | **smoke** | binary |
| **Alcohol intake** | Subjective Feature | **alco** | binary |
| **Physical activity** | Subjective Feature | **active** | binary |
| **Presence or absence of cardiovascular disease** | Target Variable | **cardio** | binary |

All of the dataset values were collected at the moment of medical examination.
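
Because **age** is recorded in days, it can be easier to read once converted to years. The cell below is a minimal sketch on a toy frame that reuses the first three `age` values from `cardio.head()`. The column name `age_years` and the 365.25-day year are assumptions made for illustration and are not part of the original dataset.

```python
import pandas as pd

# Toy frame reusing the first three `age` values shown by cardio.head().
sample = pd.DataFrame({"age": [18393, 20228, 18857]})

# Assumption: an average year length that accounts for leap years.
DAYS_PER_YEAR = 365.25

# `age_years` is a hypothetical helper column, not a column of the dataset.
sample["age_years"] = (sample["age"] / DAYS_PER_YEAR).round(1)

print(sample)
```

The same expression could be applied to `cardio["age"]` if a years-based column is wanted for plots or summaries.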

In [218]:
cardio.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 70000 entries, 0 to 99999
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   age          70000 non-null  int64
 1   gender       70000 non-null  int64
 2   height       70000 non-null  int64
 3   weight       70000 non-null  int64
 4   ap_hi        70000 non-null  int64
 5   ap_lo        70000 non-null  int64
 6   cholesterol  70000 non-null  int64
 7   gluc         70000 non-null  int64
 8   smoke        70000 non-null  int64
 9   alco         70000 non-null  int64
 10  active       70000 non-null  int64
 11  cardio       70000 non-null  int64
dtypes: int64(12)
memory usage: 6.9 MB


**Checking for missing values**

In [219]:
cardio.isnull().any()

age            False
gender         False
height         False
weight         False
ap_hi          False
ap_lo          False
cholesterol    False
gluc           False
smoke          False
alco           False
active         False
cardio         False
dtype: bool
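
`isnull().any()` only reports whether a column contains at least one missing value. When a column does have gaps, `isnull().sum()` gives the count per column. The sketch below shows both on a small toy frame with made-up values; it is an illustration only and does not use the `cardio` data.

```python
import numpy as np
import pandas as pd

# Toy frame with one deliberately missing weight (values are made up).
toy = pd.DataFrame({
    "height": [168, 156, 165],
    "weight": [62.0, np.nan, 64.0],
})

missing_flags = toy.isnull().any()   # True for each column with at least one NaN
missing_counts = toy.isnull().sum()  # number of NaN entries in each column

print(missing_flags)
print(missing_counts)
```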

**Counting the duplicated rows**

In [220]:
print("There are {} duplicated rows in the data frame".format(cardio.duplicated().sum()))

There are 24 duplicated rows in the data frame


**Displaying the duplicated rows**

In [221]:
duplicated = cardio[cardio.duplicated(keep=False)]
duplicated

id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio
1585,17493,2,169,74,120,80,1,1,0,0,1,1
1685,16793,1,165,68,120,80,1,1,0,0,1,0
2223,21945,1,165,60,120,80,1,1,0,0,1,0
2283,20293,1,162,70,110,70,1,1,0,0,1,0
3247,20495,1,165,70,120,80,1,1,0,0,1,0
3774,22077,1,175,69,120,80,1,1,0,0,1,1
9004,14552,1,158,64,120,80,1,1,0,0,1,0
11684,21778,1,160,58,120,80,1,1,0,0,1,0
14974,16937,2,170,70,120,80,1,1,0,0,0,0
15094,20495,1,165,70,120,80,1,1,0,0,1,0


In [222]:
cardio.drop_duplicates(inplace=True)  # keep the first occurrence of each duplicated row

print("There are {} duplicated rows in the data frame".format(cardio.duplicated().sum()))

There are 0 duplicated rows in the data frame


In [223]:
cardio.shape

(69976, 12)
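
The count reported by `duplicated().sum()` and the table shown with `keep=False` measure different things: the first counts only the later copies of a row, while the second marks every member of a duplicate group. Because `id` is the index, it is not compared, so two patients with different ids but identical measurements count as duplicates. A minimal sketch on a toy frame (hypothetical ids, two columns only) illustrates the difference:

```python
import pandas as pd

# Toy frame indexed by a hypothetical `id`; ids 10 and 30 share the same column values.
toy = pd.DataFrame(
    {"age": [20495, 17493, 20495], "height": [165, 169, 165]},
    index=pd.Index([10, 20, 30], name="id"),
)

# Default keep='first': only the repeats after the first occurrence are flagged.
later_copies = toy.duplicated().sum()

# keep=False: every row belonging to a duplicate group is flagged.
all_copies = toy[toy.duplicated(keep=False)]

# drop_duplicates keeps the first occurrence of each group (id 10 here).
deduplicated = toy.drop_duplicates()

print(later_copies, len(all_copies), deduplicated.index.tolist())
```

On this toy frame one row is counted as a duplicate, two rows appear in the `keep=False` view, and one row is removed. This matches the pattern above, where 24 rows are dropped and the shape goes from 70000 to 69976 rows.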