<a href="https://colab.research.google.com/github/Faraz-Khan02/Cardiovascular-Risk-Prediction/blob/main/Cardiovascular_Risk_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - **Cardiovascular Risk Prediction**



##### **Project Type**    - Classification
##### **Contribution**    - Individual
##### **Name -** Faraz Faisal Khan


# **Project Summary -**

Write the summary here within 500-600 words.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


**The dataset is from an ongoing cardiovascular study on residents of the town of Framingham, Massachusetts.
The classification goal is to predict whether the patient has a 10-year risk of future coronary heart disease (CHD).
The dataset provides the patients' information. It includes over 4,000 records and 15 attributes. Each attribute is a potential risk factor. There are demographic, behavioral, and medical risk factors.**

# ***Let's Begin !***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
from scipy.stats import chi2, chi2_contingency, f_oneway
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score, recall_score, precision_score,\
 accuracy_score, roc_curve, auc, classification_report, confusion_matrix
from xgboost import XGBClassifier
from imblearn.over_sampling import RandomOverSampler, SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from tensorflow.keras import Sequential, layers
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.neighbors import KNeighborsClassifier

### Dataset Loading

In [None]:
# Mounting Drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load Dataset
df = pd.read_csv("/content/drive/MyDrive/Capstone Project-3/data_cardiovascular_risk.csv")

### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape


Our Dataset contains 3390 rows and 17 columns.

### Dataset Information

In [None]:
# Dataset Info
df.info()

Our dataset contains several features (columns) from which we have to predict whether a patient has a 10-year risk of future coronary heart disease (CHD).

The columns are as follows:


*   **id** : Unique ID of each patient, from 0 to 3390.
*   **age** : Age of the patient for whom we have to predict CHD.
*   **education** : Education level of the patient, coded as 1, 2, 3, or 4.
*   **sex** : Gender of the patient, either Male or Female.
*   **is_smoking** : Whether the patient smokes or not. The values are either YES or NO.
*   **cigsPerDay** : Number of cigarettes the patient smokes per day.
*   **BPMeds** : Whether the patient is taking blood pressure (BP) medication. Here, 1 means the patient is taking BP medication and 0 means the patient is not.
*   **prevalentStroke** : Whether the patient has a history of stroke. Here, 1 means yes and 0 means no.
*   **prevalentHyp** : Whether the patient has a history of hypertension. Here, 1 means the patient has had hypertension and 0 means the patient has not.
*   **diabetes** : Whether the patient has diabetes. Here, 1 means the patient has diabetes and 0 means the patient does not.
*   **totChol** : Total cholesterol measurement of the patient.
*   **sysBP** : Systolic blood pressure of the patient.
*   **diaBP** : Diastolic blood pressure of the patient.
*   **BMI** : Body Mass Index of the patient.
*   **heartRate** : Heart rate of the patient.
*   **glucose** : Glucose level of the patient.
*   **TenYearCHD** : Whether the patient has a 10-year risk of future coronary heart disease (CHD). This is the target variable.

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
duplicate = df.duplicated()
print(duplicate.value_counts())


The result contains only False, which means our dataset does not contain any duplicate rows.

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

There are many null values in our dataset: education has 87, cigsPerDay has 22, BPMeds has 44, totChol has 38, BMI has 14, heartRate has 1, and glucose has 304.
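Raw counts are easier to judge as a share of the rows. For example, glucose is missing in 304 of 3390 rows, roughly 9%. The following is a minimal sketch of such a summary; the helper name `missing_summary` and the toy frame are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values per column, largest first."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing_count": counts,
        "missing_pct": (counts / len(frame) * 100).round(2),
    })
    # Keep only the columns that actually have gaps
    summary = summary[summary["missing_count"] > 0]
    return summary.sort_values("missing_count", ascending=False)

# Toy frame standing in for df (values are illustrative, not from the dataset)
toy = pd.DataFrame({
    "glucose": [80.0, np.nan, 95.0, np.nan],
    "BMI": [25.1, 27.3, np.nan, 22.0],
    "age": [40, 52, 61, 38],
})
toy_summary = missing_summary(toy)
print(toy_summary)
```

In the notebook, the same helper could be called as `missing_summary(df)`.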

In [None]:
# Visualizing the missing values
plt.figure(figsize=(15, 8))
sns.heatmap(df.isnull(), cbar=True, yticklabels=False)
plt.xlabel("Column_name", size=12, weight="bold")
plt.title("Missing values",fontweight="bold",size=15)
plt.show()

From the missing-value heatmap, education, cigsPerDay, BPMeds, totChol, and glucose stand out as the columns with missing values. BMI and heartRate also have a few, as the counts above show, but they are hard to see at this scale.

# ***Data Cleaning***

In [None]:
# Copying data to preserve the original dataset
new_df = df.copy()

In [None]:
# Dropping 'id' as it is not required
new_df.drop(columns=['id'],inplace=True)

In [None]:
# Encoding the binary columns
new_df['sex'] = np.where(new_df['sex'] == 'M',1,0)
new_df['is_smoking'] = np.where(new_df['is_smoking'] == 'YES',1,0)

In [None]:
new_df.head()



*   We have dropped the 'id' column, which is not required.
*   We have encoded the sex column as Male = 1 and Female = 0.
*   We have encoded the is_smoking column as YES = 1 and NO = 0.
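Note that `np.where(condition, 1, 0)` sends every value that fails the condition to 0, including typos or missing entries. A stricter variant, sketched below, maps labels explicitly and fails on anything unexpected. The helper name `encode_binary`, the female label `'F'`, and the toy data are assumptions made for illustration.

```python
import pandas as pd

def encode_binary(series: pd.Series, mapping: dict) -> pd.Series:
    """Map labels to integers, raising if any value is not covered by the mapping."""
    encoded = series.map(mapping)  # values missing from the mapping become NaN
    unexpected = series[encoded.isna()].unique()
    if len(unexpected) > 0:
        raise ValueError(f"Unmapped values in {series.name!r}: {list(unexpected)}")
    return encoded.astype(int)

# Toy frame standing in for new_df (labels assumed to be 'M'/'F' and 'YES'/'NO')
toy = pd.DataFrame({"sex": ["M", "F", "F"], "is_smoking": ["YES", "NO", "YES"]})
toy["sex"] = encode_binary(toy["sex"], {"M": 1, "F": 0})
toy["is_smoking"] = encode_binary(toy["is_smoking"], {"YES": 1, "NO": 0})
print(toy)
```

If the columns are known to be clean, the `np.where` version is shorter and gives the same result.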





### **Replacing the missing value in Education**

In [None]:
# Replacing the missing values in the education column with its mode
new_df['education'] = new_df['education'].fillna(new_df['education'].mode()[0])


### **Replacing the missing value in BPMeds**

In [None]:
# Replacing the missing values in the BPMeds column with its mode
new_df['BPMeds'] = new_df['BPMeds'].fillna(new_df['BPMeds'].mode()[0])

### **Replacing the missing values in cigsPerDay**

In [None]:
# All missing values in the cigsPerDay column
new_df[new_df['cigsPerDay'].isna()]

From this we can see that every row with a missing cigsPerDay value belongs to a smoker (is_smoking = 1).

In [None]:
# Distribution of the number of cigarettes per day for smokers
smoker_cigs = new_df.loc[new_df['is_smoking'] == 1, 'cigsPerDay']
plt.figure(figsize=(10,5))
# sns.distplot is deprecated; histplot with kde=True draws a comparable histogram and density curve
sns.histplot(smoker_cigs, kde=True)
plt.axvline(smoker_cigs.mean(), color='magenta', linestyle='dashed', linewidth=2, label='mean')
plt.axvline(smoker_cigs.median(), color='cyan', linestyle='dashed', linewidth=2, label='median')
plt.title('Cigarettes per day distribution')
plt.legend()
plt.show()

From the above visualization we can see that the mean and median are close to each other, so the distribution alone does not settle which one to use. We therefore check for outliers before deciding how to impute the missing values.
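The outlier check can also be done numerically with the 1.5 × IQR rule, which is the default whisker rule for seaborn box plots. The sketch below uses a hypothetical helper `iqr_outlier_mask` and made-up values; quartile conventions can differ slightly between libraries, so treat the exact cut-offs as approximate.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]; NaN entries are never flagged."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Toy values standing in for the smokers' cigsPerDay column (not real data)
toy_cigs = pd.Series([5, 10, 15, 20, 20, 20, 30, 70], dtype=float)
mask = iqr_outlier_mask(toy_cigs)
print(toy_cigs[mask])
```

In the notebook, the series passed in would be `new_df.loc[new_df['is_smoking'] == 1, 'cigsPerDay']`.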

In [None]:
# Box plot of the number of cigarettes per day for smokers
plt.figure(figsize=(10,5))
sns.boxplot(x=new_df.loc[new_df['is_smoking'] == 1, 'cigsPerDay'])
plt.show()

From the above box plot we can see that there are some outliers in this column. Because the median is less sensitive to outliers than the mean, we will impute the missing values with the median.
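The median imputation described above could look like the following sketch. It assumes, as observed earlier, that every row with a missing cigsPerDay belongs to a smoker, so the median is computed over smokers only. The helper name `impute_cigs_with_smoker_median` and the toy frame are illustrative, not the author's code.

```python
import numpy as np
import pandas as pd

def impute_cigs_with_smoker_median(frame: pd.DataFrame) -> pd.DataFrame:
    """Fill missing cigsPerDay with the median computed over smokers only."""
    out = frame.copy()  # leave the caller's frame untouched
    smoker_median = out.loc[out["is_smoking"] == 1, "cigsPerDay"].median()
    out["cigsPerDay"] = out["cigsPerDay"].fillna(smoker_median)
    return out

# Toy frame standing in for new_df (values are illustrative)
toy = pd.DataFrame({
    "is_smoking": [1, 1, 1, 1, 0],
    "cigsPerDay": [10.0, 20.0, 60.0, np.nan, 0.0],
})
filled = impute_cigs_with_smoker_median(toy)
print(filled)
```

In the notebook this would be applied as `new_df = impute_cigs_with_smoker_median(new_df)`. If any non-smoker had a missing value, filling it with the smoker median would be misleading, so that assumption is worth re-checking first.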

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***