<a href="https://colab.research.google.com/github/Faraz-Khan02/Cardiovascular-Risk-Prediction/blob/main/Cardiovascular_Risk_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - **Cardiovascular Risk Prediction**



##### **Project Type**    - Classification
##### **Contribution**    - Individual
##### **Name -** Faraz Faisal Khan


# **Project Summary -**

Write the summary here within 500-600 words.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


**The dataset is from an ongoing cardiovascular study on residents of the town of Framingham, Massachusetts.
The classification goal is to predict whether the patient has a 10-year risk of future coronary heart disease (CHD).
The dataset provides the patients’ information. It includes over 4,000 records and 15 attributes. Each attribute is a potential risk factor. There are demographic, behavioral, and medical risk factors.**

# ***Let's Begin !***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
from scipy.stats import chi2, chi2_contingency, f_oneway
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score, recall_score, precision_score,\
 accuracy_score, roc_curve, auc, classification_report, confusion_matrix
from xgboost import XGBClassifier
from imblearn.over_sampling import RandomOverSampler, SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.neighbors import KNeighborsClassifier

### Dataset Loading

In [None]:
# Mounting Drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load Dataset
df = pd.read_csv("/content/drive/MyDrive/Capstone Project-3/data_cardiovascular_risk.csv")

### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape


Our Dataset contains 3390 rows and 17 columns.

### Dataset Information

In [None]:
# Dataset Info
df.info()

Our dataset contains many features/columns from which we have to predict whether a patient has a 10-year risk of future coronary heart disease (CHD).

The columns are:

*   **id** : A unique ID for each patient, numbered from 0.
*   **age** : The age of the patient.
*   **education** : The education level of the patient (1, 2, 3 or 4).
*   **sex** : The gender of the patient, either Male (M) or Female (F).
*   **is_smoking** : Whether the patient is a smoker or not. The values are either YES or NO.
*   **cigsPerDay** : The number of cigarettes the patient smokes per day.
*   **BPMeds** : Whether the patient is taking BP medicines or not. Here, 1 means the patient is taking BP medicines and 0 means the patient is not.
*   **prevalentStroke** : Whether the patient has a history of stroke or not. Here, 1 means yes and 0 means no.
*   **prevalentHyp** : Whether the patient has a history of hypertension or not. Here, 1 means the patient has had hypertension and 0 means the patient has not.
*   **diabetes** : Whether the patient has diabetes or not. Here, 1 means the patient has diabetes and 0 means the patient does not.
*   **totChol** : The total cholesterol level of the patient.
*   **sysBP** : The systolic blood pressure of the patient.
*   **diaBP** : The diastolic blood pressure of the patient.
*   **BMI** : The Body Mass Index of the patient.
*   **heartRate** : The heart rate of the patient.
*   **glucose** : The glucose level of the patient.
*   **TenYearCHD** : Whether the patient has a 10-year risk of future coronary heart disease (CHD). This is the target variable.

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
duplicate = df.duplicated()
print(duplicate.value_counts())


All values are False, which means our dataset does not contain any duplicate rows.
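The duplicate check can also be reduced to a single number. The snippet below is a minimal sketch on a small made-up frame (`toy` and its values are hypothetical, not taken from the dataset); in the notebook the same call would be applied to `df`.

```python
import pandas as pd

# Hypothetical toy frame; rows 0 and 2 are identical on purpose.
toy = pd.DataFrame({"age": [50, 61, 50], "sex": ["M", "F", "M"]})

# duplicated() flags every repeat of an earlier row as True,
# so the sum of the flags is the number of duplicate rows.
n_duplicates = int(toy.duplicated().sum())
print(f"Duplicate rows: {n_duplicates}")
```

A result of 0 on the real data would correspond to the all-False output described above.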

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

There are many null values in our dataset. We can clearly see that education contains 87 null values, cigsPerDay contains 22, BPMeds contains 44, totChol contains 38, BMI contains 14, heartRate contains 1 and glucose contains 304.
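Raw counts are easier to judge as a share of the rows: with 3390 rows, the 304 missing glucose values are roughly 9% of the data. The following is a small sketch of such a summary on made-up values (`toy` and `summary` are hypothetical names); on the real data `toy` would be replaced by `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few missing entries.
toy = pd.DataFrame({
    "glucose": [80.0, np.nan, 95.0, np.nan],
    "BMI": [25.1, 27.3, np.nan, 30.0],
    "age": [40, 52, 61, 39],
})

# Count the missing values per column and express them as a percentage of all rows.
missing = toy.isnull().sum()
summary = pd.DataFrame({
    "missing": missing,
    "percent": (100 * missing / len(toy)).round(2),
})

# Keep only the columns that actually have gaps, largest first.
summary = summary[summary["missing"] > 0].sort_values("missing", ascending=False)
print(summary)
```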

In [None]:
# Visualizing the missing values
plt.figure(figsize=(15, 8))
sns.heatmap(df.isnull(), cbar=True, yticklabels=False)
plt.xlabel("Column_name", size=12, weight="bold")
plt.title("Missing values",fontweight="bold",size=15)
plt.show()

From the missing-values heatmap we can see that education, cigsPerDay, BPMeds, totChol, BMI, heartRate and glucose are the columns that have missing values.

# ***Data Cleaning***

In [None]:
# Copying data to preserve the original dataset
new_df = df.copy()

In [None]:
# Dropping 'id' as it is not required
new_df.drop(columns=['id'],inplace=True)

In [None]:
# Encoding the binary columns
new_df['sex'] = np.where(new_df['sex'] == 'M',1,0)
new_df['is_smoking'] = np.where(new_df['is_smoking'] == 'YES',1,0)

In [None]:
new_df.head()



*   We have dropped the 'id' column, which is not required.
*   We have encoded the sex column as Male = 1 and Female = 0.
*   We have encoded the is_smoking column as YES = 1 and NO = 0.





### **Replacing the missing values in Education**

In [None]:
# Replacing the missing values in the Education columns with its mode
new_df['education'] = new_df['education'].fillna(new_df['education'].mode()[0])


### **Replacing the missing values in BPMeds**

In [None]:
# Replacing the missing values in the BPMeds columns with its mode
new_df['BPMeds'] = new_df['BPMeds'].fillna(new_df['BPMeds'].mode()[0])

### **Replacing the missing values in cigsPerDay**

In [None]:
# All missing values in the cigsPerDay column
new_df[new_df['cigsPerDay'].isna()]

From this we can say that every row with a missing cigsPerDay value belongs to a smoker.
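Instead of reading this off the displayed rows, the observation can be checked programmatically. This is a minimal sketch on made-up data (`toy`, `missing_rows` and `all_smokers` are hypothetical names); the same expression would apply to `new_df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: the two rows with a missing count are both smokers.
toy = pd.DataFrame({
    "is_smoking": [1, 1, 0, 1, 0],
    "cigsPerDay": [20.0, np.nan, 0.0, np.nan, 0.0],
})

# Select the rows where cigsPerDay is missing and test whether all of them are smokers.
missing_rows = toy[toy["cigsPerDay"].isna()]
all_smokers = bool((missing_rows["is_smoking"] == 1).all())
print(all_smokers)
```

If the check holds, filling the gaps with a statistic computed on smokers only is a reasonable choice, since non-smokers would pull that statistic toward zero.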

In [None]:
# Distribution of no. of cigarettes per day for smokers
# (histplot with kde=True replaces the deprecated distplot)
smokers_cigs = new_df[new_df['is_smoking']==1]['cigsPerDay']
plt.figure(figsize=(8,4))
sns.histplot(smokers_cigs, kde=True, stat='density')
plt.axvline(smokers_cigs.mean(), color='magenta', linestyle='dashed', linewidth=2)    # mean
plt.axvline(smokers_cigs.median(), color='cyan', linestyle='dashed', linewidth=2)     # median
plt.title('Cigarettes per day distribution')
plt.show()

From the above visualization we can see that the mean and median are close to each other, so we check for outliers before deciding how to impute the missing values.

In [None]:
# Box plot for the no. of cigarettes per day for smokers
plt.figure(figsize=(8,4))
sns.boxplot(x=new_df[new_df['is_smoking']==1]['cigsPerDay'])

The box plot shows some outliers in this column, so we impute the missing values with the median, which is less sensitive to outliers than the mean.
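To make the reasoning concrete, the sketch below uses five made-up values (not taken from the dataset) to show how a single extreme value shifts the mean but not the median, and how the usual 1.5 × IQR rule behind a box plot's whiskers flags that value. The names `values`, `lower`, `upper` and `outliers` are illustrative only.

```python
import pandas as pd

# Hypothetical cigarettes-per-day values with one extreme entry.
values = pd.Series([5.0, 10.0, 20.0, 20.0, 90.0])

# Interquartile range and the 1.5 * IQR fences used by a standard box plot.
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Points outside the fences are the ones drawn as outliers.
outliers = values[(values < lower) | (values > upper)]

print("mean:", values.mean())      # pulled upward by the extreme value
print("median:", values.median())  # unaffected by the extreme value
print("outliers:", outliers.tolist())
```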

In [None]:
# Imputing the missing values in the cigsPerDay 
new_df['cigsPerDay'] = new_df['cigsPerDay'].fillna(new_df[new_df['is_smoking']==1]['cigsPerDay'].median())

### **Replacing the missing values in totChol**

In [None]:
# Distribution of total cholesterol of the patient
# (histplot with kde=True replaces the deprecated distplot)
plt.figure(figsize=(8,4))
sns.histplot(new_df['totChol'], kde=True, stat='density')
plt.axvline(new_df['totChol'].mean(), color='magenta', linestyle='dashed', linewidth=2)    # mean
plt.axvline(new_df['totChol'].median(), color='cyan', linestyle='dashed', linewidth=2)     # median
plt.title('Total Cholesterol of the Patient')
plt.show()

Now, we will check outliers in the column.

In [None]:
# Box plot for total cholesterol of the patient
plt.figure(figsize=(8,4))
sns.boxplot(x=new_df['totChol'])

Here, we can see that the totChol column contains outliers, so we will impute the missing values with the median.

In [None]:
# Imputing missing values in totChol with the median value
new_df['totChol'] = new_df['totChol'].fillna(new_df['totChol'].median())

### **Replacing the missing values in BMI**

In [None]:
# Distribution of BMI of the patient
# (histplot with kde=True replaces the deprecated distplot)
plt.figure(figsize=(8,4))
sns.histplot(new_df['BMI'], kde=True, stat='density')
plt.axvline(new_df['BMI'].mean(), color='magenta', linestyle='dashed', linewidth=2)    # mean
plt.axvline(new_df['BMI'].median(), color='cyan', linestyle='dashed', linewidth=2)     # median
plt.title('BMI of the Patient')
plt.show()

The mean and median are very close, so we check for outliers in the column.

In [None]:
# Box plot for BMI of the patient
plt.figure(figsize=(8,4))
sns.boxplot(x=new_df['BMI'])

From this box plot we can see that there are a lot of outliers, so we will impute the missing values with the median.

In [None]:
# Imputing missing values in BMI with the median value
new_df['BMI'] = new_df['BMI'].fillna(new_df['BMI'].median())

### **Dropping the missing value in heartRate**

Since heartRate contains only 1 missing value, we can simply drop that row; removing a single record should have a negligible effect on the model.

In [None]:
# Dropping the missing values from column heartRate
new_df=new_df[new_df['heartRate'].notna()]

### **Replacing the missing values in glucose**

In [None]:
# Distribution of glucose
# (histplot with kde=True replaces the deprecated distplot)
plt.figure(figsize=(8,4))
sns.histplot(new_df['glucose'], kde=True, stat='density')
plt.axvline(new_df['glucose'].mean(), color='magenta', linestyle='dashed', linewidth=2)    # mean
plt.axvline(new_df['glucose'].median(), color='cyan', linestyle='dashed', linewidth=2)     # median
plt.title('Glucose distribution')
plt.show()

The mean and median are fairly close, but the distribution is positively skewed, so we check for outliers.
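Skewness can also be quantified rather than read from the plot. The sketch below uses made-up glucose-like values (the numbers and the name `values` are hypothetical); `Series.skew()` returns a positive number when the right tail is longer, which also shows up as a mean above the median.

```python
import pandas as pd

# Hypothetical values with one large reading forming a right tail.
values = pd.Series([70.0, 75.0, 80.0, 85.0, 300.0])

skewness = values.skew()  # sample skewness; > 0 indicates a longer right tail
print("skewness:", round(skewness, 2))
print("mean:", values.mean(), "median:", values.median())
```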

In [None]:
# Box plot for glucose
plt.figure(figsize=(8,4))
sns.boxplot(x=new_df['glucose'])

The glucose column contains a lot of outliers, so the median would be the natural choice for imputation. However, glucose has 304 missing values, which is a large share of the data, and filling all of them with a single median value could bias the column's distribution.

To avoid this, we impute the missing values with a KNN imputer.
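The difference between the two strategies can be seen on a tiny made-up example (the frame `toy` and its values are hypothetical, and `n_neighbors=1` is chosen only to keep the result easy to follow). `KNNImputer` fills a gap from the rows that are closest on the observed features, whereas the median ignores the other columns entirely.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy frame: the patient aged 41 has no glucose reading.
toy = pd.DataFrame({
    "age": [40.0, 41.0, 60.0, 61.0],
    "glucose": [80.0, np.nan, 120.0, 118.0],
})

# With one neighbour, the gap is filled from the closest row by age (the 40-year-old).
imputer = KNNImputer(n_neighbors=1)
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print("KNN-imputed glucose:", filled.loc[1, "glucose"])
print("Median glucose:", toy["glucose"].median())
```

Two points are worth keeping in mind when applying this to the full frame: the distances are computed on raw values, so features with large ranges dominate unless the data is scaled first, and including the target column among the features lets it influence the imputed values.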

In [None]:
# Using KNN imputer with K=10
imputer = KNNImputer(n_neighbors=10)
imputed = imputer.fit_transform(new_df)
new_df = pd.DataFrame(imputed, columns=new_df.columns)

In [None]:
# Checking our dataset after applying KNN imputer
new_df.info()

Here, we can see that the KNN imputer has converted every column to float, so we will convert the columns back to their appropriate types.

In [None]:
# Changing datatypes of the following columns
new_df = new_df.astype({'age': int, 'education':int,'sex':int,'is_smoking':int,'cigsPerDay':int,
               'BPMeds':int,'prevalentStroke':int,'prevalentHyp':int,'diabetes':int,
               'totChol':float,'sysBP':float,'diaBP':float,
               'BMI':float,'heartRate':float,'glucose':float,'TenYearCHD':int})
     

In [None]:
# Checking for missing values
new_df.isna().sum()

All missing values have now been handled and we are ready to do EDA.

# **Exploratory Data Analysis**

## **Univariate Analysis**

In [None]:
# Understanding the distribution of each feature (new_df, i.e. after imputation)
fig = plt.figure(figsize = (15,20))
ax = fig.gca()
new_df.hist(ax = ax)
plt.show()

Here, BPMeds, prevalentStroke and diabetes are highly imbalanced.

In [None]:
# Categorizing the features into dependent, continuous and categorical variables.
dependent_var = ['TenYearCHD']
continuous_var = ['age','cigsPerDay','totChol','sysBP', 'diaBP', 'BMI', 'heartRate', 'glucose']
categorical_var = ['education', 'sex', 'is_smoking','BPMeds','prevalentStroke', 'prevalentHyp', 'diabetes']

## **Distribution of Dependent Variable**

In [None]:
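# Distribution of dependent variable
plt.figure(figsize=(8,4))
# Pass the column as x=; recent seaborn versions treat the first positional argument as data
sns.countplot(x=df[dependent_var[0]])
plt.xlabel(dependent_var[0])
plt.title(dependent_var[0]+' distribution')
plt.show()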

Here, we can see that TenYearCHD is highly imbalanced.
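The degree of imbalance can be stated as class proportions. The following sketch uses a made-up target with an 80/20 split (the values are hypothetical and are not the dataset's actual ratio); on the real data the same call would be applied to `new_df['TenYearCHD']`.

```python
import pandas as pd

# Hypothetical target column: 8 negatives and 2 positives.
target = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 1, 1], name="TenYearCHD")

# normalize=True returns the share of each class instead of the raw count.
share = target.value_counts(normalize=True)
print(share)
```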

## **Distribution of the Discrete Independent Variables**

In [None]:
# Analysing the distribution of categorical variables in the dataset
for i in categorical_var:
  plt.figure(figsize=(8,4))
  p = sns.countplot(x=df[i])
  plt.xlabel(i)
  plt.title(i+' distribution')
  # Annotate each bar with its count (separate loop variable so the column name i is not overwritten)
  for bar in p.patches:
    p.annotate(f'{bar.get_height():.0f}', (bar.get_x() + bar.get_width() / 2., bar.get_height()), ha = 'center', va = 'center', xytext = (0, 10), textcoords = 'offset points')
  plt.show()

From the above visualisations we draw the following inferences:

*   Education level 1 is the most common, followed by 2, 3 and 4.
*   There are more female patients than male patients.
*   Almost half of the patients are smokers.
*   Only 100 patients are taking BP medicines.
*   Only 22 patients have experienced a stroke.
*   1069 patients have hypertension.
*   87 patients have diabetes.









## **Bivariate Analysis**

## **Analyzing the relationship between the Dependent Variable and the Continuous Variables**


In [None]:
# Relationship between the dependent variable and continuous independent variables
for i in continuous_var:
  plt.figure(figsize=(8,4))
  # violinplot draws on the figure created above; catplot opens its own figure and leaves this one empty
  sns.violinplot(x=dependent_var[0], y=i, data=new_df)
  plt.ylabel(i)
  plt.xlabel(dependent_var[0])
  plt.title(dependent_var[0]+' vs '+i)
  plt.show()


From these plots we can say that the chances of CHD increase with age.
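A numeric companion to the violin plots is the average of each continuous feature per outcome class. The sketch below uses made-up ages (the frame `toy` and its values are hypothetical); in the notebook the same `groupby` would run on `new_df` with any column from `continuous_var`.

```python
import pandas as pd

# Hypothetical toy frame in which the CHD cases are the older patients.
toy = pd.DataFrame({
    "age": [40, 45, 50, 55, 60, 65],
    "TenYearCHD": [0, 0, 0, 1, 1, 1],
})

# Mean age within each outcome class; a higher mean for class 1
# points in the same direction as the violin plot.
mean_age = toy.groupby("TenYearCHD")["age"].mean()
print(mean_age)
```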

## **Analyzing the relationship between the Dependent Variable and the Discrete Variables**

In [None]:
# Analyzing the relationship between the dependent variable and categorical independent variables
for i in categorical_var:
  plt.figure(figsize=(8,4))
  sns.histplot(x=i, hue=dependent_var[0], data=new_df, stat="count", multiple="stack")
  plt.title('Risk of CHD by: '+i)
  plt.show()

Because the categories are unevenly distributed, these count plots do not give any conclusive inference.

In [None]:
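# Stacked bar chart: percentage of each CHD class within every category

for i in categorical_var:
    x_var, y_var = i, dependent_var[0]
    # fillna(0) covers categories in which one of the classes never occurs
    df_grouped = new_df.groupby(x_var)[y_var].value_counts(normalize=True).unstack(y_var).fillna(0)*100
    # Let pandas create the figure; a separate plt.figure() call would stay empty
    df_grouped.plot.barh(stacked=True, figsize=(8,4))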
    plt.legend(
        bbox_to_anchor=(1.05, 1),
        loc="upper left",
        title=y_var)

    plt.title("% of patients at the risk of CHD by: "+i)
    for ix, row in df_grouped.reset_index(drop=True).iterrows():
        # print(ix, row)
        cumulative = 0
        for element in row:
            if element > 0.1:
                plt.text(
                    cumulative + element / 2,
                    ix,
                    f"{int(element)} %",
                    va="center",
                    ha="center",
                )
            cumulative += element
    plt.show()

From the above charts we draw the following inferences:

*   The percentage of CHD by education level is 1 (18%), 2 (11%), 3 (12%) and 4 (14%).
*   Males have a relatively higher chance of CHD (18%) than females (12%).
*   16% of smokers are at risk of CHD.
*   33% of patients taking BP medicines are at risk of CHD.
*   Patients who have had a stroke in the past have a high chance of CHD (45%).
*   Patients with hypertension have a higher chance of CHD (23%).
*   Patients with diabetes are more prone to CHD (37%).
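The percentages above can be computed directly with a row-normalized cross-tabulation, and a chi-square test of independence is one common way to check whether such a difference is more than chance. The sketch below runs on made-up data (the frame `toy` and its counts are hypothetical, and the test is shown only as an illustration, not as a result of this analysis); `chi2_contingency` comes from `scipy.stats`, which the notebook already imports.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical toy frame: 1 of 8 non-diabetic and 6 of 8 diabetic patients have CHD.
toy = pd.DataFrame({
    "diabetes":   [0] * 8 + [1] * 8,
    "TenYearCHD": [0] * 7 + [1] * 1 + [0] * 2 + [1] * 6,
})

# Raw counts and the percentage of each CHD class within every diabetes group.
table = pd.crosstab(toy["diabetes"], toy["TenYearCHD"])
rate = pd.crosstab(toy["diabetes"], toy["TenYearCHD"], normalize="index") * 100

# Chi-square test of independence on the count table.
chi2_stat, p_value, dof, expected = chi2_contingency(table)

print(rate)
print("p-value:", p_value)
```

With very small counts, as for prevalentStroke in this dataset, the chi-square approximation becomes less reliable, so the result should be read with caution.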











# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***