<a href="https://colab.research.google.com/github/SachinYallapurkar/Cardiovascular-Risk-Prediction/blob/main/Cardiovascular_Risk_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#**Project Title** :**Cardiovascular-Risk-Prediction**

##**Problem Description**

### The dataset is from an ongoing cardiovascular study on residents of the town of Framingham, Massachusetts. The classification goal is to predict whether a patient has a 10-year risk of future coronary heart disease (CHD). The dataset provides each patient's information and includes over 4,000 records and 15 attributes. Each attribute is a potential risk factor; there are demographic, behavioral, and medical risk factors.


## <u>**Data Description**</u><br><br>
### **Variables:**<br>
Each attribute is a potential risk factor. There are demographic, behavioral, and medical risk
factors.

###**Demographic:**<br>
* <font color = green>**Sex:**</font> male or female ("M" or "F")
* <font color = green>**Age:**</font> age of the patient (Continuous: although the recorded ages have been truncated to
whole numbers, the concept of age is continuous)

###**Behavioral:**<br>
* <font color = 'green'>**is_smoking**:</font> whether or not the patient is a current smoker ("YES" or "NO")
* <font color = 'green'>**Cigs Per Day:**</font> the number of cigarettes the person smoked on average in one day (can be
considered continuous, as one can smoke any number of cigarettes, even half a cigarette)

###**Medical (history):**<br>
* <font color = 'green'> **BP Meds:**</font> whether or not the patient was on blood pressure medication (Nominal)
* <font color = 'green'> **Prevalent Stroke:**</font> whether or not the patient had previously had a stroke (Nominal)
* <font color = 'green'> **Prevalent Hyp:**</font> whether or not the patient was hypertensive (Nominal)
* <font color = 'green'> **Diabetes:**</font> whether or not the patient had diabetes (Nominal)

###**Medical (current):**<br>
* <font color = 'green'> **Tot Chol:**</font> total cholesterol level (Continuous)
* <font color = 'green'> **Sys BP:**</font> systolic blood pressure (Continuous)
* <font color = 'green'> **Dia BP:**</font> diastolic blood pressure (Continuous)
* <font color = 'green'>**BMI:**</font> Body Mass Index (Continuous)
* <font color = 'green'>**Heart Rate:**</font> heart rate (Continuous: in medical research, variables such as heart rate, though in
fact discrete, are treated as continuous because of the large number of possible values)
* <font color = 'green'>**Glucose:**</font> glucose level (Continuous)

###**Predicted variable (desired target):**<br>
 10-year risk of <font color = 'green'>**coronary heart disease (CHD)**</font> (binary: “1” means “Yes”, “0” means “No”). This is the
dependent variable (DV).

## **Importing Libraries**

In [None]:
#importing necessary libraries
import warnings

import numpy as np
import pandas as pd
import missingno as msno

#plotting
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go

#preprocessing, resampling and splitting
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

#models
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

#metrics
from sklearn.metrics import (precision_score, recall_score, accuracy_score, f1_score,
                             confusion_matrix, roc_auc_score, classification_report)
#plot_confusion_matrix was removed in scikit-learn 1.2; newer versions provide ConfusionMatrixDisplay instead
from sklearn.metrics import plot_confusion_matrix

warnings.filterwarnings('ignore')
%matplotlib inline
sns.set_style('darkgrid')

## **Data Inspection**

In [None]:
#Mounting Google Drive
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


In [None]:
df = pd.read_csv('/content/drive/MyDrive/Cardiovascular-Risk-Prediction/data_cardiovascular_risk.csv')

In [None]:
#first five rows
df.head()

Unnamed: 0,id,age,education,sex,is_smoking,cigsPerDay,BPMeds,prevalentStroke,prevalentHyp,diabetes,totChol,sysBP,diaBP,BMI,heartRate,glucose,TenYearCHD
0,0,64,2.0,F,YES,3.0,0.0,0,0,0,221.0,148.0,85.0,,90.0,80.0,1
1,1,36,4.0,M,NO,0.0,0.0,0,1,0,212.0,168.0,98.0,29.77,72.0,75.0,0
2,2,46,1.0,F,YES,10.0,0.0,0,0,0,250.0,116.0,71.0,20.35,88.0,94.0,0
3,3,50,1.0,M,YES,20.0,0.0,0,1,0,233.0,158.0,88.0,28.26,68.0,94.0,1
4,4,64,1.0,F,YES,30.0,0.0,0,0,0,241.0,136.5,85.0,26.42,70.0,77.0,0


In [None]:
#last five rows
df.tail()

In [None]:
#shape
df.shape

(3390, 17)

In [None]:
#info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3390 entries, 0 to 3389
Data columns (total 17 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   id               3390 non-null   int64  
 1   age              3390 non-null   int64  
 2   education        3303 non-null   float64
 3   sex              3390 non-null   object 
 4   is_smoking       3390 non-null   object 
 5   cigsPerDay       3368 non-null   float64
 6   BPMeds           3346 non-null   float64
 7   prevalentStroke  3390 non-null   int64  
 8   prevalentHyp     3390 non-null   int64  
 9   diabetes         3390 non-null   int64  
 10  totChol          3352 non-null   float64
 11  sysBP            3390 non-null   float64
 12  diaBP            3390 non-null   float64
 13  BMI              3376 non-null   float64
 14  heartRate        3389 non-null   float64
 15  glucose          3086 non-null   float64
 16  TenYearCHD       3390 non-null   int64  
dtypes: float64(9), int64(6), object(2)

**Checking Duplicate Data**

In [None]:
#counting duplicate rows in the dataset
df.duplicated().sum()

0

**Missing Values and Percentage**

In [None]:
#calculating no.of missing values
df1 = df.isnull().sum().reset_index().rename(columns={'index':'column_name', 0:'no.of_missing'})

#calculating missing percentage
percent_missing = df.isnull().sum() * 100 / len(df)
df2 = percent_missing.reset_index().rename(columns={'index':'column_name', 0:'missing_percentage'}).round(2)

#merging dataframes on column_name
missing_value_df = df1.merge(df2,on='column_name')

In [None]:
#missing value and their percentage
missing_value_df

Unnamed: 0,column_name,no.of_missing,missing_percentage
0,id,0,0.0
1,age,0,0.0
2,education,87,2.57
3,sex,0,0.0
4,is_smoking,0,0.0
5,cigsPerDay,22,0.65
6,BPMeds,44,1.3
7,prevalentStroke,0,0.0
8,prevalentHyp,0,0.0
9,diabetes,0,0.0
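
The same summary can be assembled in one step, without building and merging two frames. A minimal sketch, assuming a small made-up frame (`toy`) in place of the study data; the helper name `missing_summary` is hypothetical:

```python
import numpy as np
import pandas as pd


def missing_summary(frame):
    """Return per-column missing counts and percentages (hypothetical helper)."""
    counts = frame.isnull().sum()
    return pd.DataFrame({
        "column_name": counts.index,
        "no.of_missing": counts.values,
        "missing_percentage": (counts.values * 100 / len(frame)).round(2),
    })


# Made-up rows for illustration, not the study dataset
toy = pd.DataFrame({
    "age": [64, 36, 46, 50],
    "BMI": [np.nan, 29.77, 20.35, 28.26],
    "glucose": [80.0, np.nan, np.nan, 94.0],
})

toy_missing = missing_summary(toy)
print(toy_missing)
```

Calling `missing_summary(df)` on the real frame should give the same columns as `missing_value_df`.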


**Dropping columns**

In [None]:
#dropping unnecessary columns
df.drop(['id','education'],axis=1,inplace=True)

In [None]:
#first five rows after dropping unnecessary columns
df.head()

Unnamed: 0,age,sex,is_smoking,cigsPerDay,BPMeds,prevalentStroke,prevalentHyp,diabetes,totChol,sysBP,diaBP,BMI,heartRate,glucose,TenYearCHD
0,64,F,YES,3.0,0.0,0,0,0,221.0,148.0,85.0,,90.0,80.0,1
1,36,M,NO,0.0,0.0,0,1,0,212.0,168.0,98.0,29.77,72.0,75.0,0
2,46,F,YES,10.0,0.0,0,0,0,250.0,116.0,71.0,20.35,88.0,94.0,0
3,50,M,YES,20.0,0.0,0,1,0,233.0,158.0,88.0,28.26,68.0,94.0,1
4,64,F,YES,30.0,0.0,0,0,0,241.0,136.5,85.0,26.42,70.0,77.0,0


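**Replacing NaN Values with Median**

In [None]:
#replacing glucose NaN values with the median
#(assigning the result back avoids an in-place fill on a column selection, which newer pandas versions warn about)
df['glucose'] = df['glucose'].fillna(df['glucose'].median())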

In [None]:
df.isnull().sum()

age                 0
sex                 0
is_smoking          0
cigsPerDay         22
BPMeds             44
prevalentStroke     0
prevalentHyp        0
diabetes            0
totChol            38
sysBP               0
diaBP               0
BMI                14
heartRate           1
glucose             0
TenYearCHD          0
dtype: int64
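
Only `glucose` is imputed here; the remaining gaps are handled by dropping rows. As a sketch of the alternative, the same median fill could be applied to several continuous columns at once. The frame `toy` and the column choice below are illustrative assumptions, not the notebook's method:

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration, not the study dataset
toy = pd.DataFrame({
    "totChol": [221.0, np.nan, 250.0, 233.0],
    "BMI": [np.nan, 29.77, 20.35, 28.26],
    "heartRate": [90.0, 72.0, 88.0, np.nan],
})

# Continuous columns only; a median is a poor fit for binary flags such as BPMeds
numeric_cols = ["totChol", "BMI", "heartRate"]
medians = toy[numeric_cols].median()

# fillna with a Series fills each column with the value under its own label
toy_filled = toy.fillna(medians)
print(toy_filled.isnull().sum())
```

Whether imputing or dropping is preferable depends on how many rows are affected; here the affected rows are a small share of the data.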

**Creating copy**

In [None]:
#creating the copy of dataframe
cardiovascular_risk = df.copy()

**Dropping NaN Values**

In [None]:
#dropping rows that still contain NaN values
cardiovascular_risk.dropna(inplace=True)

In [None]:
#first five rows after dropping nan value
cardiovascular_risk.head()

Unnamed: 0,age,sex,is_smoking,cigsPerDay,BPMeds,prevalentStroke,prevalentHyp,diabetes,totChol,sysBP,diaBP,BMI,heartRate,glucose,TenYearCHD
1,36,M,NO,0.0,0.0,0,1,0,212.0,168.0,98.0,29.77,72.0,75.0,0
2,46,F,YES,10.0,0.0,0,0,0,250.0,116.0,71.0,20.35,88.0,94.0,0
3,50,M,YES,20.0,0.0,0,1,0,233.0,158.0,88.0,28.26,68.0,94.0,1
4,64,F,YES,30.0,0.0,0,0,0,241.0,136.5,85.0,26.42,70.0,77.0,0
5,61,F,NO,0.0,0.0,0,1,0,272.0,182.0,121.0,32.8,85.0,65.0,1


**Unique Values**

In [None]:
#columns list
col_lst = ['sex', 'is_smoking', 'BPMeds', 'prevalentStroke', 'prevalentHyp', 'diabetes']

In [None]:
#printing the unique values of each column in a list
def unique_element(columns):
  for col_name in columns:
    print(f'{col_name} : {cardiovascular_risk[col_name].unique()}\n')    #unique values of this column

unique_element(col_lst)                                                  #calling the function

sex : ['M' 'F']

is_smoking : ['NO' 'YES']

BPMeds : [0. 1.]

prevalentStroke : [0 1]

prevalentHyp : [1 0]

diabetes : [0 1]



**Data Description**

In [None]:
#Pandas describe() is used to view some basic statistical details like percentile, mean, std etc. of a data frame
df.describe()

Unnamed: 0,age,cigsPerDay,BPMeds,prevalentStroke,prevalentHyp,diabetes,totChol,sysBP,diaBP,BMI,heartRate,glucose,TenYearCHD
count,3390.0,3368.0,3346.0,3390.0,3390.0,3390.0,3352.0,3390.0,3390.0,3376.0,3389.0,3390.0,3390.0
mean,49.542183,9.069477,0.029886,0.00649,0.315339,0.025664,237.074284,132.60118,82.883038,25.794964,75.977279,81.720059,0.150737
std,8.592878,11.879078,0.170299,0.080309,0.464719,0.158153,45.24743,22.29203,12.023581,4.115449,11.971868,23.161265,0.357846
min,32.0,0.0,0.0,0.0,0.0,0.0,107.0,83.5,48.0,15.96,45.0,40.0,0.0
25%,42.0,0.0,0.0,0.0,0.0,0.0,206.0,117.0,74.5,23.02,68.0,72.0,0.0
50%,49.0,0.0,0.0,0.0,0.0,0.0,234.0,128.5,82.0,25.38,75.0,78.0,0.0
75%,56.0,20.0,0.0,0.0,1.0,0.0,264.0,144.0,90.0,28.04,83.0,85.0,0.0
max,70.0,70.0,1.0,1.0,1.0,1.0,696.0,295.0,142.5,56.8,143.0,394.0,1.0


## **Data Analysis**

**Target Variable**

In [None]:
#10-year risk of coronary heart disease CHD(binary: “1”, means “Yes”, “0” means “No”)
#checking target variable is balanced or not
target_class = cardiovascular_risk.TenYearCHD.value_counts()
print(f'Class0: {target_class[0]}\nClass1: {target_class[1]}\nProportion: {round((target_class[0]/target_class[1]),2)}:1')

Class0: 2784
Class1: 488
Proportion: 5.7:1


In [None]:
# sns.countplot(cardiovascular_risk.TenYearCHD,palette='BuPu_r')
# plt.title('Risk of coronary heart disease',fontsize=15)

# Plotting the pie chart for the dependent variable using plotly
fig = go.Figure([go.Pie(labels=['Not Having CVD', 'Having CVD'],values=cardiovascular_risk['TenYearCHD'].value_counts().values)])
fig.update_layout(title_text="Pie chart of Target Variable", template="plotly_white")
# donut hole plus a white outline between the slices
fig.update_traces(hole=.4, marker=dict(line=dict(color='white', width=2)))
fig.show()

**Observation:**<br>
* The dataset is clearly **imbalanced**.
* Class0: 2784<br>
  Class1: 488<br>
  Proportion: 5.7:1
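
The arithmetic behind these figures can be reproduced directly from the two class counts. A small sketch, using only the counts reported above; the variable names are illustrative:

```python
import pandas as pd

# Class counts reported above for TenYearCHD (after dropping NaN rows)
class_counts = pd.Series({0: 2784, 1: 488}, name="TenYearCHD")

ratio = class_counts[0] / class_counts[1]               # majority : minority
minority_share = class_counts[1] / class_counts.sum()   # fraction of CHD cases
baseline_accuracy = class_counts[0] / class_counts.sum()

print(f"Majority:minority ratio = {ratio:.2f}:1")
print(f"Minority share = {minority_share:.2%}")
print(f"Accuracy of always predicting class 0 = {baseline_accuracy:.2%}")
```

A model that always predicts "no CHD" would already score about 85% accuracy on this data, which is why accuracy alone is a weak guide for an imbalanced target.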

**Outlier Detection**<br>

<font color = 'red'>***Continuous variables:***</font><br>

* age
* cigsPerDay
* totChol
* sysBP
* diaBP
* BMI
* heartRate
* glucose

In [None]:
#plotting the box plot using plotly for continuous variables
fig = px.box(cardiovascular_risk, x=['age','cigsPerDay','totChol','sysBP','diaBP','BMI','heartRate','glucose'])
fig.show()

**Observations:**
* `age` has no outliers.
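
The box-plot reading can be checked with the usual 1.5 × IQR rule. The sketch below plugs in the `age` quartiles from the `df.describe()` output earlier (42 and 56, with a range of 32 to 70). Those figures come from `df` before the NaN rows were dropped, so treat them as an approximation for `cardiovascular_risk`. The helper `iqr_fences` is a hypothetical name:

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the lower and upper fences of the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles and range of age taken from the describe() output
lower, upper = iqr_fences(42.0, 56.0)
age_min, age_max = 32.0, 70.0

# A point is flagged as an outlier only if it falls outside the fences
has_outliers = age_min < lower or age_max > upper
print(lower, upper, has_outliers)
```

With an IQR of 14 the fences sit at 21 and 77, so the whole observed age range lies inside them.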

In [None]:
#Continuous variables
continuous_variables = ['cigsPerDay','totChol','sysBP','diaBP','BMI','heartRate','glucose']

In [None]:
#handling outliers by capping each variable at its 5th and 95th percentiles
for col_name in continuous_variables:
  upper_lim = cardiovascular_risk[col_name].quantile(.95)
  lower_lim = cardiovascular_risk[col_name].quantile(.05)
  cardiovascular_risk.loc[(cardiovascular_risk[col_name] > upper_lim),col_name] = upper_lim
  cardiovascular_risk.loc[(cardiovascular_risk[col_name] < lower_lim),col_name] = lower_lim
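
The loop above caps each variable at its 5th and 95th percentiles (often called winsorizing). A compact sketch of the same idea using `Series.clip`, on made-up values rather than the study data; `cap_to_quantiles` is a hypothetical helper:

```python
import pandas as pd


def cap_to_quantiles(series, lower_q=0.05, upper_q=0.95):
    """Cap values below/above the given quantiles at those quantiles."""
    lower_lim = series.quantile(lower_q)
    upper_lim = series.quantile(upper_q)
    return series.clip(lower=lower_lim, upper=upper_lim)


# Made-up glucose-like values with one extreme reading at each end
glucose_toy = pd.Series([40.0, 72.0, 75.0, 78.0, 80.0, 85.0, 94.0, 394.0])
capped = cap_to_quantiles(glucose_toy)

print(capped.min(), capped.max())
```

Capping keeps every row, unlike dropping outliers, but it does flatten the tails of each distribution.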

In [None]:
#plotting the box plot using plotly for continuous variables after handling outliers
fig = px.box(cardiovascular_risk, x=['age','cigsPerDay','totChol','sysBP','diaBP','BMI','heartRate','glucose'])
fig.show()

**Age**

In [None]:
#plotting the countplot for age variable with target class
plt.figure(figsize=(20,10))
sns.countplot(data=cardiovascular_risk, x="age", hue="TenYearCHD",palette="Set2")
plt.title('Variation of Age for each target class',fontsize=15)

**Observation:**
* The number of CHD cases rises between ages 51 and 63, then declines.
* The age group *34 < age < 51* is at lower risk of cardiovascular disease.
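
Raw counts per age depend on how many patients of each age are in the sample, so a rate per age band can make the trend easier to read. A sketch using `pd.cut` on invented rows; the band edges and the `toy` frame are assumptions for illustration only:

```python
import pandas as pd

# Invented rows for illustration, not the study dataset
toy = pd.DataFrame({
    "age": [36, 40, 46, 50, 52, 58, 61, 64, 66, 70],
    "TenYearCHD": [0, 0, 0, 1, 0, 1, 1, 0, 1, 0],
})

# Right-closed bands: (30, 50], (50, 63], (63, 70]
bands = pd.cut(toy["age"], bins=[30, 50, 63, 70], labels=["31-50", "51-63", "64-70"])

# Mean of a 0/1 target is the share of CHD cases in each band
chd_rate = toy.groupby(bands, observed=True)["TenYearCHD"].mean()
print(chd_rate)
```

On the real data the same grouping would use `cardiovascular_risk` in place of `toy`.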

**Sex**

In [None]:
#two stacked panels in one figure
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 8))

#plotting the countplot for sex variable with target class
sns.countplot(data=cardiovascular_risk, x="sex", hue="TenYearCHD", palette='magma', ax=ax1)
ax1.set_title("Sex Vs Target class", fontsize=15)

#plotting the barplot of mean age by sex with target class
sns.barplot(data=cardiovascular_risk, x="sex", y='age', hue="TenYearCHD", palette='magma', ax=ax2)
ax2.set_title("Distributions of Age Vs Sex with Target class", fontsize=15)
fig.tight_layout()

**Observation:**
* The count plot shows fewer CHD cases among females than among males.
* The bar plot suggests that males develop CHD at a younger age, on average, than females.
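
The two plots can be backed by numbers with a pair of `groupby` calls. A minimal sketch on invented rows (the `toy` frame is an assumption, not the study data), computing the CHD share per sex and the mean age per sex and target class, which is the quantity `sns.barplot` displays by default:

```python
import pandas as pd

# Invented rows for illustration, not the study dataset
toy = pd.DataFrame({
    "sex": ["M", "M", "M", "F", "F", "F"],
    "age": [50, 58, 44, 64, 61, 46],
    "TenYearCHD": [1, 1, 0, 1, 0, 0],
})

# Share of patients with CHD within each sex
chd_rate_by_sex = toy.groupby("sex")["TenYearCHD"].mean()

# Mean age for each combination of sex and target class
mean_age = toy.groupby(["sex", "TenYearCHD"])["age"].mean()

print(chd_rate_by_sex)
print(mean_age)
```

Comparing rates rather than raw counts guards against reading a difference in group size as a difference in risk.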