<a href="https://colab.research.google.com/github/Aman1647/Cardiovascular-Risk-Prediction/blob/main/Cardiovascular_Risk_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - 



##### **Project Type**    - Classification
##### **Contribution**    - Individual

The dataset contained information on over 3000 patients with different medical histories. For better interpretability, age was divided into three classes: 'young, middle, old age'. Blood pressure readings were classified by hypertension severity into 'Normal, Elevated, Stage_1, Stage_2, Crisis, Isolated systolic, Isolated diastolic' on the basis of systolic and diastolic blood pressure, and cholesterol levels were classified into 'Normal, Elevated, High risk'. Several features were interlinked through their usage and measures, which was further understood by visualizing their effect on CHD risk for patients.

Null values were removed, since we can't and should not make assumptions about the medical conditions of patients. Some features were heavily influenced by outliers, which can affect model predictions, so they were handled using the inter-quartile range (IQR) method. Some categorical features were label encoded, while the rest were one-hot encoded because they contained multiple values, which are better understood when converted into dummies. To check and handle multicollinearity, the variance inflation factor (VIF) was used to remove correlated independent variables. Since the data was heavily imbalanced, the Synthetic Minority Oversampling Technique (SMOTE) was used to generate synthetic samples from the minority class.

Before model implementation, the dataset was standardized using StandardScaler, since the data is mostly normally distributed and standardization maintains information about outliers while making the model less sensitive to them. Six models were implemented, and some with below-par scores underwent hyperparameter tuning for better results. Random Forest made the best predictions, with an F1-score and test accuracy of 90%. Age and heart rate were among the top features used by the model.


# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


**The dataset is from an ongoing cardiovascular study on residents of the town of Framingham, Massachusetts. The classification goal is to predict whether the patient has a 10-year risk of future coronary heart disease (CHD). The dataset provides the patients' information. It includes over 4000 records and 15 attributes. Each attribute is a potential risk factor. There are demographic, behavioral and medical risk factors.**

# ***Let's Begin !***

## ***1. Know Your Data***

In [None]:
from google.colab import drive
drive.mount('/content/drive')

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.preprocessing import MinMaxScaler,StandardScaler
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report, roc_curve
from sklearn.metrics import make_scorer, recall_score, f1_score, accuracy_score, precision_score, roc_auc_score

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from xgboost import XGBClassifier

from scipy.stats import zscore
from statsmodels.stats.outliers_influence import variance_inflation_factor
from imblearn.over_sampling import SMOTE
from imblearn.combine import SMOTETomek
from collections import Counter
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline

import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Load Dataset
df = pd.read_csv("/content/drive/MyDrive/Almabetter/Capstone Projects/Classification/data_cardiovascular_risk.csv")
cvd_df = df.copy()

### Dataset First View

In [None]:
# Dataset First Look
cvd_df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print("(Rows, columns) : " + str(cvd_df.shape))

### Dataset Information

In [None]:
# Dataset Info
cvd_df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
cvd_df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
cvd_df.isnull().sum()

In [None]:
# Visualizing the missing values
missing_values = cvd_df.isnull()
plt.figure(figsize=(12,10))
sns.heatmap(missing_values)

### What did you know about your dataset?

The dataset contains around 3390 rows. Some of the columns contain many null values, which need to be handled for better interpretability of the dataset and better model predictions.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
cvd_df.columns

In [None]:
# Dataset Describe
cvd_df.describe(include='all').T

### Variables Description 

From the variable descriptions, it can be seen that the dataset contains information on people older than 32, is mostly focused on the age group of 45 and above, and is mostly female-centric.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for col in cvd_df:
    print(col ,":\n", cvd_df[col].unique())

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
cvd_df.drop(['id'], axis=1, inplace=True)

In [None]:
cvd_df['TenYearCHD'].value_counts()

In [None]:
#check target variable value proportion
plt.figure(figsize=(10,5))
sns.countplot(x = 'TenYearCHD', data = cvd_df)

In [None]:
cvd_df['TenYearCHD'].value_counts(normalize=True)

In [None]:
#defining age group to dataset

def age_range(row):
    if row>=29 and row<40:
        return 'Young Age'
    elif row>=40 and row<55:
        return 'Middle Age'
    elif row>=55:   # 55 and above
        return 'Elder Age'

In [None]:
#Applying converted data into our dataset with new column - Age_Group

cvd_df['Age_Group']= cvd_df['age'].apply(age_range)
cvd_df.head()

In [None]:
categorical_features = [i for i in cvd_df.columns if cvd_df[i].nunique()<=4]
categorical_features

In [None]:
numerical_features = [i for i in cvd_df.columns if i not in categorical_features]
numerical_features

In [None]:
plt.figure(figsize=(15,15))
for n, col in enumerate(numerical_features[1:-1]):
  plt.subplot(4,3, n+1)
  sns.histplot(cvd_df[col], kde=True)   # histplot replaces the deprecated distplot
  plt.title(col.title())
plt.tight_layout()

In [None]:
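def heart_rate(sysBP, diaBP):
  # Classify a blood-pressure reading (mm Hg) into a hypertension category.
  # Branches are evaluated top to bottom and the first match wins.
  if sysBP <= 120 and diaBP <=80:
    return 'Normal BP'
  elif sysBP <= 129 and diaBP <= 80:
    return 'Elevated BP'
  elif sysBP <= 139 and diaBP <=90:
    return 'Stage_1'
  elif sysBP <= 179 and diaBP <120:
    return 'Stage_2'
  elif sysBP >= 180 or diaBP >=120:
    return 'Crisis'
  # As ordered, the two branches below are reached only when
  # 179 < sysBP < 180 and diaBP < 120; all other readings match earlier.
  elif sysBP >=125 and diaBP >= 90:
    return 'Isolated Diastolic'
  elif sysBP >= 140 and diaBP < 90:
    return 'Isolated Systolic'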


In [None]:
cvd_df['Hypertension']= cvd_df.apply(lambda  x: heart_rate(x['sysBP'], x['diaBP']), axis=1 )

In [None]:
cvd_df['Hypertension'].value_counts()

Diabetes V/s Cholestrol

In [None]:
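def cholestrol_lvl(totChol):
  # Missing readings stay missing instead of falling through to 'High_Risk'
  if pd.isna(totChol):
    return np.nan
  elif totChol < 200:
    return 'Normal'
  elif totChol < 240:   # 200 to 239, including exactly 200
    return 'Elevated'
  else:
    return 'High_Risk'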

In [None]:
cvd_df['Cholestrol'] = cvd_df['totChol'].apply(cholestrol_lvl)

In [None]:
cvd_df.head()

### What all manipulations have you done and insights you found?

15% of the population have a 10-year risk of heart disease. The age column is divided into 3 groups, blood pressure readings are classified by hypertension risk, and cholesterol levels are binned into risk categories for better interpretability of the data.
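
The binning described above can also be expressed with `pd.cut`. The following is a minimal sketch rather than the notebook's own method: it assumes the intended age boundaries are 29 to 39, 40 to 54, and 55 and above, and the ages are made-up toy values, not rows from the dataset.

```python
import numpy as np
import pandas as pd

# Toy ages (hypothetical values, not taken from the dataset).
ages = pd.Series([32, 40, 54, 55, 70])

# right=False makes each bin closed on the left: [29, 40), [40, 55), [55, inf)
age_group = pd.cut(
    ages,
    bins=[29, 40, 55, np.inf],
    labels=['Young Age', 'Middle Age', 'Elder Age'],
    right=False,
)

print(age_group.tolist())
# ['Young Age', 'Middle Age', 'Middle Age', 'Elder Age', 'Elder Age']
```

Because every boundary appears exactly once in `bins`, no age between the lowest and highest edge can fall through without a label, which is the main reason to prefer this form over a chain of `if`/`elif` comparisons.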

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code

#Swarm Plot Creation of Gender, Age based Heart disease risk 

plt.figure(figsize=(12,7))
sns.swarmplot(x='TenYearCHD', y='age', hue='sex', data=cvd_df, palette='Oranges_r')
plt.title('Gender, Age V/s Heart Disease', fontsize=17)
plt.xlabel('Heart Disease', fontsize=15)
plt.ylabel('Age', fontsize=15)
plt.show()

##### 1. Why did you pick the specific chart?

Since the age groups are further subdivided into male and female, a swarm plot presents them in an interpretable manner.

##### 2. What is/are the insight(s) found from the chart?

The Elder_Age group has a higher risk of heart disease than the other groups, and within it females are at greater risk than males.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Elderly patients should receive thorough check-ups, with particular focus on female patients.

#### Chart - 2

In [None]:
# Chart - 2 visualization code

# Diabetes v/s heart disease
plt.figure(figsize = (10,10))
sns.countplot(x = cvd_df['diabetes'], hue = cvd_df['TenYearCHD'])
plt.title("Diabetics V/s Heart Disease", size =15)
plt.xlabel("Diabetes", size =12)
plt.ylabel("People_Count", size =12)
plt.legend(['No Risk','At Risk'])
plt.show()

In [None]:
plt.figure(figsize =(10, 8))
sns.swarmplot(x='diabetes', y='age', hue='TenYearCHD', data=cvd_df, palette='Oranges_r')
plt.title("Diabetes V/s CVD Risk wrt Age", size=15)
plt.xlabel("Diabetes", size=12)
plt.ylabel("Age", size=12)
plt.show()

##### 1. Why did you pick the specific chart?

To check the relation between diabetes and heart disease. A count plot gives a detailed view of it.

##### 2. What is/are the insight(s) found from the chart?

Very few people in the population have diabetes, and non-diabetic patients are mostly not at risk of CHD.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

For a diabetic patient, cholesterol levels should be checked before drawing further conclusions.

#### Chart - 3

In [None]:
# Chart - 3 visualization code

# Smokers V/s CVD Risk
plt.figure(figsize = (10,8))
sns.countplot(x = cvd_df['is_smoking'], hue = cvd_df['TenYearCHD'])
plt.title("Smokers V/s CVD risk")
plt.show()

In [None]:
# Cigarettes per day V/s CVD risk
plt.figure(figsize = (10,8))
sns.swarmplot(x='TenYearCHD', y='cigsPerDay', hue='is_smoking', data=cvd_df, palette='Oranges_r')
plt.title("Cigarettes V/s CVD risk")
plt.show()

##### 1. Why did you pick the specific chart?

To check the relation between smoking and CHD risk. Since both are binary, a count plot describes the values well.

##### 2. What is/are the insight(s) found from the chart?

Only a small share of the population are regular smokers. While smokers are definitely at risk of heart disease, a considerable number of non-smokers also have CHD risk.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Smoking should not be linked directly with heart disease. Other factors should be checked before a final conclusion.

#### Chart - 4

In [None]:
# Chart - 4 visualization code

plt.figure(figsize = (10,10))
sns.countplot(x = cvd_df['BPMeds'], hue = cvd_df['TenYearCHD'])
plt.title("BP meds V/s CHD", size=15)
plt.xlabel("BP meds taken or not", size=12)
plt.show()

In [None]:
# Systolic BP V/s CVD risk
plt.figure(figsize = (10,8))
sns.barplot(x = cvd_df['TenYearCHD'], y = cvd_df['sysBP'])
plt.title("Systolic BP V/s CHD risk", size=15)
plt.show()

In [None]:
# Diastolic BP V/s CVD risk
plt.figure(figsize = (10,8))
sns.barplot(x = cvd_df['TenYearCHD'], y = cvd_df['diaBP'])
plt.title("Diastolic BP V/s CHD risk", size=15)
plt.show()

In [None]:
cvd_df['Hypertension'].value_counts()

##### 1. Why did you pick the specific chart?

To check the relation between BP medication, blood pressure and CVD risk. Count plots and bar plots describe these values best.

##### 2. What is/are the insight(s) found from the chart?

There is no significant relation between BP medication and heart disease, though patients with systolic BP > 140 and diastolic BP > 90 have an extremely high risk of CHD.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

The dataset contains a significant number of patients needing immediate medical attention, based on their systolic and diastolic BP.

#### Chart - 5

In [None]:
# Chart - 5 visualization code

# Prevalent Hypertension V/s CVD risk
plt.figure(figsize = (10,10))
sns.countplot(hue = cvd_df['Hypertension'], x = cvd_df['TenYearCHD'])
plt.title("Blood Pressure levels wrt CHD risk", size =15)
plt.show()

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(15, 10))

sns.countplot(ax=axes[0], x = cvd_df['prevalentStroke'], hue = cvd_df['TenYearCHD'])
axes[0].set_title("Prevalent Stroke V/s CHD risk", size=15)

sns.countplot(ax=axes[1], x = cvd_df['prevalentHyp'], hue = cvd_df['TenYearCHD'])
axes[1].set_title("Prevalent Hypertension V/s CHD risk", size=15)

##### 1. Why did you pick the specific chart?

To check the relation between prevalent hypertension and CVD risk. Count plots describe the values best.

##### 2. What is/are the insight(s) found from the chart?

While a significant number of patients are not at risk of CHD, high blood pressure levels are linked with a risk of CHD.

Prevalent hypertension --> (systolic BP > 140 mm Hg & diastolic BP > 90 mm Hg).
People with prevalent hypertension have an extremely high risk of heart disease.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

From the systolic BP, diastolic BP and prevalent hypertension graphs, we can conclude that immediate treatment should be given to people with prevalent hypertension or any symptoms of it.

#### Chart - 6

In [None]:
# Chart - 6 visualization code

plt.figure(figsize=(12,8))
sns.lineplot(x='age', y='totChol', data=cvd_df, color='r')
plt.title('Cholesterol VS Age', fontsize=17)
plt.xlabel('Age', fontsize=15)
plt.ylabel('Cholesterol', fontsize=15)
plt.show()

In [None]:
# Cholesterol V/s CHD risk
plt.figure(figsize = (10,10))
sns.barplot(x = cvd_df['TenYearCHD'], y = cvd_df['totChol'])
plt.title("Cholesterol V/s CHD risk", fontsize =15)
plt.xlabel("Ten Year CHD ", fontsize =15)
plt.ylabel("Cholesterol level", fontsize =15)
plt.show()

In [None]:
plt.figure(figsize = (10,10))
sns.countplot(x = cvd_df['TenYearCHD'], hue = cvd_df['Cholestrol'])
plt.title("Cholesterol V/s CHD risk", fontsize =15)
plt.xlabel("Ten Year CHD ", fontsize =15)
plt.ylabel("Cholesterol level", fontsize =15)
plt.show()

In [None]:
cvd_df['Cholestrol'].value_counts()

##### 1. Why did you pick the specific chart?

A line plot and bar plots are used to understand the relation between cholesterol level and CHD risk.

##### 2. What is/are the insight(s) found from the chart?

Cholesterol level increases with age and, correspondingly, the risk of CHD also increases.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

People should be advised to change their lifestyle when their cholesterol level is high.

#### Chart - 7

In [None]:
# Chart - 7 visualization code

plt.figure(figsize=(12,8))
sns.barplot(y='BMI', x='TenYearCHD', data=cvd_df)
plt.title('BMI VS CHD', fontsize=17)
plt.xlabel('CHD', fontsize=15)
plt.ylabel('BMI', fontsize=15)
plt.show()

##### 1. Why did you pick the specific chart?

To examine the relationship between BMI and CHD, since one is continuous and the other is binary.

##### 2. What is/are the insight(s) found from the chart?

A high BMI is associated with an increased risk of CHD.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Patients should be advised to keep their BMI in check.

#### Chart - 8

In [None]:
cvd_df.columns

In [None]:
# Chart - 8 visualization code

plt.figure(figsize=(12,8))
sns.boxplot(x='TenYearCHD', y='heartRate', data=cvd_df)
plt.title('Heart Rate VS CHD', fontsize=17)
plt.xlabel('CHD', fontsize=15)
plt.ylabel('Heart Rate', fontsize=15)
plt.show()

##### 1. Why did you pick the specific chart?

A box plot describes the heart-rate range and its relation with CHD.

##### 2. What is/are the insight(s) found from the chart?

Though a high heart rate is usually linked with a high risk of CHD, the data is too heavily influenced by outliers to give a detailed insight.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.


#### Chart - 9

In [None]:
# Chart - 9 visualization code

plt.figure(figsize=(12,8))
sns.boxplot(x='TenYearCHD', y ='glucose', data=cvd_df)
plt.title('glucose VS CHD', fontsize=17)
plt.xlabel('CHD', fontsize=15)
plt.ylabel('glucose', fontsize=15)
plt.show()

In [None]:
plt.figure(figsize=(12,8))
sns.boxplot(x='diabetes', y ='glucose', data=cvd_df)
plt.title('Diabetes V/S Glucose', fontsize=17)
plt.xlabel('Diabetes', fontsize=15)
plt.ylabel('Glucose', fontsize=15)
plt.show()

In [None]:
plt.figure(figsize=(10,8))
sns.scatterplot(hue='TenYearCHD', y='glucose', x='diabetes', data=cvd_df)
plt.title('Diabetes V/S Glucose', fontsize=17)
plt.xlabel('Diabetes', fontsize=15)
plt.ylabel('Glucose', fontsize=15)
plt.show()

##### 1. Why did you pick the specific chart?

A box plot best describes the relation between the continuous and binary values, whereas the scatter plot sums up the relation between all 3 variables in one plot.

##### 2. What is/are the insight(s) found from the chart?

The relationship between glucose and CHD is heavily influenced by outliers, but the relation between glucose and diabetes shows that high glucose levels are associated with diabetes, and diabetes is linked with CHD risk.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Patients should be advised to keep glucose levels within the stipulated limits and to have regular check-ups for diabetes.

#### Chart - 10

In [None]:
cvd_df.columns

In [None]:
# Chart - 10 visualization code


plt.figure(figsize=(12,10))
sns.scatterplot(hue='prevalentHyp', x='diaBP', y='sysBP', data=cvd_df)
plt.title('Hypertension relation with Systolic & Diastolic BP', fontsize=17)
plt.xlabel('Diastolic BP', fontsize=15)
plt.ylabel('Systolic BP', fontsize=15)
plt.show()

##### 1. Why did you pick the specific chart?

A scatter plot shows the relationship between systolic BP, diastolic BP and prevalent hypertension.

##### 2. What is/are the insight(s) found from the chart?

High levels of systolic and diastolic BP are linked with prevalent hypertension, which is a clear risk factor for CHD.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Same treatment/medication as for High BP

#### Chart - 11 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

plt.figure(figsize = (15,12))
correlation = cvd_df.corr(numeric_only=True)   # skip the string columns (sex, Age_Group, ...)
sns.heatmap(abs(correlation), annot=True, cmap = 'flare')

##### 1. Why did you pick the specific chart?

A heatmap shows the correlation between all numeric variables at once.

##### 2. What is/are the insight(s) found from the chart?

Some features are highly correlated, due to shared measures, usage and their link with CHD risk.

## ***5 Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
# drop null values
cvd_df.dropna(inplace =True)

In [None]:
missing_val = cvd_df.isnull()
plt.figure(figsize = (12, 10))
sns.heatmap(missing_val)

#### What all missing value imputation techniques have you used and why did you use those techniques?

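No imputation is used. All rows with null values are dropped, since we should not make assumptions about the medical conditions of patients.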

### 2. Handling Outliers

In [None]:
# Lets check the discrete and continuous features
categorical_features = [i for i in cvd_df.columns if cvd_df[i].nunique()<=4]
numeric_features = [i for i in cvd_df.columns if i not in categorical_features]

print(categorical_features)
print(numeric_features)

In [None]:
plt.figure(figsize=(18,12))
for n,column in enumerate(numerical_features):
  plt.subplot(5, 4, n+1)
  sns.boxplot(cvd_df[column])
  plt.title(f'{column.title()}',weight='bold')
  plt.tight_layout()

In [None]:
for col in numerical_features:
  # Using IQR method to define the range of inliers:
  q1, q3, median = cvd_df[col].quantile([0.25,0.75,0.5])
  lower_limit = q1 - 1.5*(q3-q1)
  upper_limit = q3 + 1.5*(q3-q1)

  # Replacing Outliers with median value
  cvd_df[col] = np.where(cvd_df[col] > upper_limit, median,np.where(
                         cvd_df[col] < lower_limit,median,cvd_df[col]))

In [None]:
plt.figure(figsize=(18,12))
for n,column in enumerate(numerical_features):
  plt.subplot(5, 4, n+1)
  sns.boxplot(cvd_df[column])
  plt.title(f'{column.title()}',weight='bold')
  plt.tight_layout()

##### What all outlier treatment techniques have you used and why did you use those techniques?

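The inter-quartile range (IQR) method was used to treat outliers, since it shows how the data is distributed around the median. Values more than 1.5 times the IQR below the first quartile or above the third quartile were replaced with the column median.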
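
As an illustration of the rule used in the cell above, here is a minimal sketch on a toy series. The numbers are invented for the example and are not rows from the dataset; the logic mirrors the notebook's loop but uses `Series.mask` instead of nested `np.where`.

```python
import pandas as pd

# Toy values with one obvious outlier (made-up numbers).
values = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 100.0])

q1, q3, median = values.quantile([0.25, 0.75, 0.5])
iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Replace anything outside [lower_limit, upper_limit] with the median.
is_outlier = (values < lower_limit) | (values > upper_limit)
cleaned = values.mask(is_outlier, median)

print(lower_limit, upper_limit)
print(cleaned.tolist())
```

Here the limits work out to 8.75 and 14.75, so only the value 100 is replaced, by the median 12. Replacing with the median keeps the row in the dataset, at the cost of discarding whatever signal the extreme value carried.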

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns
categorical_features

In [None]:
for col in ['sex', 'is_smoking']:
  print(cvd_df[col].value_counts(),'\n')

In [None]:
encoder = {'sex':{'M':1, 'F':0},
           'is_smoking':{'YES':1, 'NO': 0},
           }

# Label Encoding
cvd_df = cvd_df.replace(encoder)

In [None]:
cvd_df.head()

#### What all categorical encoding techniques have you used & why did you use those techniques?

Label encoding is used to convert the two-category variables ('sex' and 'is_smoking') into binary format.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

from statsmodels.stats.outliers_influence import variance_inflation_factor
def calc_vif(X):
 
   # Calculating VIF
   vif = pd.DataFrame()
   vif["variables"] = X.columns
   vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

   #return(vif.sort_values(by='VIF',ascending=False).reset_index(drop=True))
 
   return vif

In [None]:
calc_vif(cvd_df[[i for i in cvd_df.describe().columns if i not in ['TenYearCHD']]])

In [None]:
calc_vif(cvd_df[[i for i in cvd_df.describe().columns if i not in ['TenYearCHD','glucose', 'sysBP', 'diaBP', 'education', 'id', 'BMI', 'totChol', 'heartRate']]])
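
To make the VIF values above easier to interpret, the following sketch computes the textbook definition, VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing column j on the remaining columns. It is an illustration with NumPy on synthetic data, not the statsmodels implementation; `vif_sketch` and the arrays `a`, `b`, `c` are hypothetical names. One assumption to note: this sketch adds an intercept to each regression, whereas statsmodels' `variance_inflation_factor` fits on the columns exactly as passed, so the two can differ when no constant column is included.

```python
import numpy as np

def vif_sketch(X):
    """Return 1 / (1 - R^2) for each column of X regressed on the others."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    vifs = []
    for j in range(n_cols):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Add an intercept column so R^2 is measured around the mean.
        design = np.column_stack([np.ones(n_rows), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        r_squared = 1.0 - residuals.var() / target.var()
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = a + rng.normal(scale=0.1, size=500)   # nearly a copy of `a`
c = rng.normal(size=500)                  # unrelated to both

vifs = vif_sketch(np.column_stack([a, b, c]))
print(vifs.round(1))
```

The two nearly identical columns get a large VIF while the unrelated column stays close to 1, which is the pattern used above to decide which features to leave out.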

In [None]:
plt.figure(figsize = (15,12))
correlation = cvd_df.corr(numeric_only=True)   # skip the remaining string columns
sns.heatmap(abs(correlation), annot=True, cmap = 'flare')

#### 2. Feature Selection

In [None]:
cvd_df.columns

In [None]:
pair_df = [cvd_df[['age','sex', 'heartRate','prevalentStroke','is_smoking', 'cigsPerDay', 'BPMeds',]],
           pd.get_dummies(cvd_df[['Age_Group', 'Hypertension','Cholestrol']], drop_first=False), 
           cvd_df['TenYearCHD']
           ]
one_hot_df = pd.concat(pair_df, axis=1)
one_hot_df.head()

In [None]:
# Select your features wisely to avoid overfitting

one_hot_df.columns

In [None]:
features = [i for i in one_hot_df.columns if i not in ['TenYearCHD']]

In [None]:
features

In [None]:
X = one_hot_df[features]
y = one_hot_df['TenYearCHD']

##### What all feature selection methods have you used  and why?

One-hot encoding (`pd.get_dummies`) is used to convert the multi-category features ('Age_Group', 'Hypertension', 'Cholestrol') into binary columns, which are then combined with the selected features into `one_hot_df`.
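
A small sketch of what `pd.get_dummies` produces for one of these columns. The three-row frame is a toy example, not the real data.

```python
import pandas as pd

# Toy frame with a single multi-category column (hypothetical rows).
toy = pd.DataFrame({'Age_Group': ['Young Age', 'Elder Age', 'Middle Age']})

# One indicator column per category; each row has exactly one active column.
dummies = pd.get_dummies(toy, drop_first=False)

print(dummies.columns.tolist())
# ['Age_Group_Elder Age', 'Age_Group_Middle Age', 'Age_Group_Young Age']
```

With `drop_first=False`, as in the notebook, the indicator columns of one feature always sum to 1, so they are perfectly collinear with an intercept. Passing `drop_first=True` removes one column per feature if that matters for the chosen model.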


### 6. Data Scaling

In [None]:
# Feature Scaling
# X_train is created in the Data Splitting cell below, so run that cell first.
scaler = StandardScaler()
scaler.fit(X_train)
scaled_features = scaler.transform(X_train)

##### Which method have you used to scale your data and why?

StandardScaler is used because our data is mostly normally distributed. Standardization maintains useful information about outliers and makes the algorithm less sensitive to them.
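
A minimal sketch of how standardization is usually combined with a train/test split: the scaler learns the mean and standard deviation from the training rows only, and the same statistics are reused for the test rows. The arrays here are synthetic, and the names `X_toy`, `X_tr` and `X_te` are placeholders rather than the notebook's variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_toy = rng.normal(loc=50.0, scale=10.0, size=(200, 3))   # synthetic features
y_toy = rng.integers(0, 2, size=200)                      # synthetic labels

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0
)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)   # learn mean/std from training rows only
X_te_scaled = scaler.transform(X_te)       # reuse the training statistics

print(X_tr_scaled.mean(axis=0).round(6), X_tr_scaled.std(axis=0).round(6))
```

After fitting, each training column has mean 0 and standard deviation 1. The test columns will be close to, but not exactly, those values, which is expected because they were not used to fit the scaler. To have an effect on the models, the scaled arrays are what would need to be passed to `fit` and `predict`.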

### 8. Data Splitting

In [None]:
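# Split your data to train and test. Choose Splitting ratio wisely.
# X_smote and y_smote come from the Handling Imbalanced Dataset cell below, so run that cell first.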

X_train, X_test, y_train, y_test = train_test_split(X_smote, y_smote, test_size = 0.3, random_state=0)

##### What data splitting ratio have you used and why? 

A 70-30 splitting ratio was used. Since the data concerns the medical condition of patients, more of it is devoted to training the model well.

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Class 0 contained 2398 values, whereas Class 1 contained only 430 values. This is because not many people are at risk of CHD. Hence, the data is imbalanced.

In [None]:
# the numbers before SMOTE
num_before = dict(Counter(y))

#perform SMOTE

# define pipeline
under = RandomUnderSampler()
over = SMOTE()
steps = [('o', over), ('u', under)]
pipeline = Pipeline(steps=steps)

# transform the dataset
X_smote, y_smote = pipeline.fit_resample(X, y)


#the numbers after SMOTE
num_after =dict(Counter(y_smote))
print(num_before, num_after)

##### What technique did you use to handle the imbalanced dataset and why? (If needed to be balanced)

SMOTE (Synthetic Minority Oversampling Technique) was used to generate synthetic samples from the minority class, chained with `RandomUnderSampler` in an imblearn `Pipeline`, to balance the two classes before training.
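
To show the idea behind SMOTE, the sketch below generates synthetic minority points by interpolating between a minority sample and one of its nearest minority neighbours. This is a simplified NumPy illustration of the core step, not imblearn's implementation; `smote_like_samples` and the toy points are hypothetical, and the sketch assumes the minority points contain no exact duplicates.

```python
import numpy as np

def smote_like_samples(minority, n_new, k=3, seed=0):
    """Create n_new points by interpolating between minority-class neighbours."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    synthetic = []
    for _ in range(n_new):
        base = minority[rng.integers(len(minority))]
        # Distances from the chosen point to every minority point.
        dists = np.linalg.norm(minority - base, axis=1)
        # Position 0 of the sort is the point itself (distance 0), so skip it.
        neighbour_idx = np.argsort(dists)[1:k + 1]
        neighbour = minority[rng.choice(neighbour_idx)]
        gap = rng.random()  # uniform in [0, 1)
        synthetic.append(base + gap * (neighbour - base))
    return np.array(synthetic)

# Toy minority class with two features (made-up coordinates).
minority_points = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])
new_points = smote_like_samples(minority_points, n_new=6)
print(new_points.shape)   # (6, 2)
```

Every synthetic point lies on a line segment between two real minority samples, so the new rows stay inside the region the minority class already occupies instead of simply duplicating existing rows.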

## ***6. ML Model Implementation***

### ML Model - 1: Logistic Regression

In [None]:
lr = LogisticRegression(fit_intercept=True, max_iter=10000)
lr.fit(X_train, y_train)

In [None]:
print("Training set score: {:.2f}".format(lr.score(X_train, y_train)))
print("Test set score: {:.2f}".format(lr.score(X_test, y_test)))

In [None]:
y_train_pred = lr.predict(X_train)

In [None]:
y_pred_lr = lr.predict(X_test)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
print(classification_report(y_test, y_pred_lr))

In [None]:
print(roc_auc_score(y_test, y_pred_lr))

In [None]:
labels = ['No CHD', 'CHD']
cm = confusion_matrix(y_train, y_train_pred)
print(cm)

ax= plt.subplot()
sns.heatmap(cm, annot=True, ax = ax) #annot=True to annotate cells

# labels, title and ticks
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)

In [None]:
labels = ['No CHD', 'CHD']
cm = confusion_matrix(y_test, y_pred_lr)
print(cm)

ax= plt.subplot()
sns.heatmap(cm, annot=True, ax = ax) #annot=True to annotate cells

# labels, title and ticks
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)

In [None]:
# Importing all necessary libraries
from sklearn.metrics import roc_curve, auc

class_probabilities = lr.predict_proba(X_test)
preds = class_probabilities[:, 1]

fpr, tpr, threshold = roc_curve(y_test, preds)
roc_auc = auc(fpr, tpr)

# Printing AUC
print(f"AUC for our classifier is: {roc_auc}")

# Plotting the ROC
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

In [None]:
# Model Performance (scikit-learn metrics expect y_true first, then y_pred)
lr_accuracy = accuracy_score(y_test, y_pred_lr)
lr_precision = precision_score(y_test, y_pred_lr, average ='weighted')
lr_recall = recall_score(y_test, y_pred_lr, average ='weighted')
lr_f1_score = f1_score(y_test, y_pred_lr, average ='weighted')
print(' Accuracy:',lr_accuracy,'\n','Precision :' ,lr_precision, '\n', 'Recall :',lr_recall, '\n', 'F1_score :', lr_f1_score)
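
For reference, the scores printed above all derive from the four cells of the confusion matrix. The sketch below computes them by hand for the positive (CHD) class on a tiny made-up example. The label arrays are hypothetical, and note that the notebook cell uses `average='weighted'`, which averages the per-class scores weighted by class support, so its numbers are not directly the positive-class values shown here.

```python
import numpy as np

# Toy labels (hypothetical): 1 = CHD, 0 = No CHD.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_hat = np.array([0, 0, 0, 1, 1, 1, 0, 0])

tp = int(np.sum((y_true == 1) & (y_hat == 1)))   # CHD correctly flagged
fp = int(np.sum((y_true == 0) & (y_hat == 1)))   # healthy flagged as CHD
fn = int(np.sum((y_true == 1) & (y_hat == 0)))   # CHD missed
tn = int(np.sum((y_true == 0) & (y_hat == 0)))   # healthy correctly cleared

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(tp, fp, fn, tn)
print(accuracy, round(precision, 3), recall, round(f1, 3))
```

In a medical screening setting, recall on the CHD class (how many at-risk patients are caught) is often the number to watch, since a missed case is usually costlier than a false alarm.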

### ML Model - 2:Naive Bayes Classification

In [None]:
nbc = GaussianNB()
nbc.fit(X_train, y_train)

In [None]:
print("Training set score: {:.2f}".format(nbc.score(X_train, y_train)))
print("Test set score: {:.2f}".format(nbc.score(X_test, y_test)))

In [None]:
pred_train_nbc = nbc.predict(X_train)

In [None]:
y_pred_nbc = nbc.predict(X_test)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
print(classification_report(y_test, y_pred_nbc))

In [None]:
labels = ['No CHD', 'CHD']
cm = confusion_matrix(y_train, pred_train_nbc)
print(cm)

ax= plt.subplot()
sns.heatmap(cm, annot=True, ax = ax) #annot=True to annotate cells

# labels, title and ticks
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)

In [None]:
labels = ['No CHD', 'CHD']
cm = confusion_matrix(y_test, y_pred_nbc)
print(cm)

ax= plt.subplot()
sns.heatmap(cm, annot=True, ax = ax) #annot=True to annotate cells

# labels, title and ticks
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)

In [None]:
# Importing all necessary libraries
from sklearn.metrics import roc_curve, auc

class_probabilities = nbc.predict_proba(X_test)
preds = class_probabilities[:, 1]

fpr, tpr, threshold = roc_curve(y_test, preds)
roc_auc = auc(fpr, tpr)

# Printing AUC
print(f"AUC for our classifier is: {roc_auc}")

# Plotting the ROC
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# var_smoothing adds a fraction of the largest feature variance to every variance,
# which smooths the curve to account for samples that are further away from the mean.
param_grid_nb = {
    # 100 values spaced evenly on a log scale, from 10**0 down to 10**-9
    'var_smoothing': np.logspace(0,-9, num=100)
}

In [None]:
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import GridSearchCV
nb_grid = GridSearchCV(estimator= nbc, 
                       param_grid=param_grid_nb, 
                       verbose=1, 
                       cv=10, 
                       n_jobs=-1,
                       scoring = 'f1_micro')
nb_grid.fit(X_train, y_train)
print(nb_grid.best_estimator_)

In [None]:
print("Training set score: {:.2f}".format(nb_grid.score(X_train, y_train)))
print("Test set score: {:.2f}".format(nb_grid.score(X_test, y_test)))

In [None]:
pred_train_nbc = nb_grid.predict(X_train)

In [None]:
y_pred_gnbc= nb_grid.predict(X_test)

In [None]:
print(classification_report(y_test, y_pred_gnbc))

In [None]:
labels = ['No CHD', 'CHD']
cm = confusion_matrix(y_train, pred_train_nbc)
print(cm)

ax= plt.subplot()
sns.heatmap(cm, annot=True, ax = ax) #annot=True to annotate cells

# labels, title and ticks
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)

In [None]:
labels = ['No CHD', 'CHD']
cm = confusion_matrix(y_test, y_pred_gnbc)
print(cm)

ax= plt.subplot()
sns.heatmap(cm, annot=True, ax = ax) #annot=True to annotate cells

# labels, title and ticks
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)

In [None]:
# Accuracy, Precision, Recall, F1 score for the model (scikit-learn expects y_true first, then y_pred)
nb_accuracy = accuracy_score(y_test, y_pred_gnbc)
nb_precision = precision_score(y_test, y_pred_gnbc, average='weighted')
nb_recall = recall_score(y_test, y_pred_gnbc, average='weighted')
nb_f1_score = f1_score(y_test, y_pred_gnbc, average='weighted')
print(' Accuracy:', nb_accuracy, '\n', 'Precision :', nb_precision, '\n', 'Recall :', nb_recall, '\n', 'F1_score :', nb_f1_score)
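
scikit-learn's metric functions are called as `metric(y_true, y_pred)`. Accuracy is symmetric in its two arguments, but precision and recall are not. The dependency-free sketch below uses made-up labels and hypothetical helper names, and it computes the plain positive-class metrics rather than the `weighted` average used in this notebook. It shows that swapping the arguments turns precision into recall.

```python
def precision(y_true, y_pred, positive=1):
    """TP / (TP + FP): share of predicted positives that are truly positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    predicted_positive = sum(1 for p in y_pred if p == positive)
    return tp / predicted_positive if predicted_positive else 0.0


def recall(y_true, y_pred, positive=1):
    """TP / (TP + FN): share of actual positives that were found."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    actual_positive = sum(1 for t in y_true if t == positive)
    return tp / actual_positive if actual_positive else 0.0


# Made-up labels: 4 actual positives, 3 predicted positives, 2 of them correct
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

p_correct = precision(y_true, y_pred)   # 2 / 3
r_correct = recall(y_true, y_pred)      # 2 / 4
p_swapped = precision(y_pred, y_true)   # arguments swapped: equals the recall

print(p_correct, r_correct, p_swapped)
```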

### ML Model - 3

In [None]:
svm = SVC(random_state=0, probability =True)
svm.fit(X_train, y_train)

In [None]:
print("Training set score: {:.2f}".format(svm.score(X_train, y_train)))
print("Test set score: {:.2f}".format(svm.score(X_test, y_test)))

In [None]:
pred_train_svm = svm.predict(X_train)

In [None]:
y_pred_svm = svm.predict(X_test)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
print(classification_report(y_test, y_pred_svm))

In [None]:
labels = ['No CHD', 'CHD']
cm = confusion_matrix(y_train, pred_train_svm)
print(cm)

ax= plt.subplot()
sns.heatmap(cm, annot=True, ax = ax) #annot=True to annotate cells

# labels, title and ticks
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)

In [None]:
labels = ['No CHD', 'CHD']
cm = confusion_matrix(y_test, y_pred_svm)
print(cm)

ax= plt.subplot()
sns.heatmap(cm, annot=True, ax = ax) #annot=True to annotate cells

# labels, title and ticks
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)

In [None]:
class_probabilities = svm.predict_proba(X_test)
preds = class_probabilities[:, 1]

fpr, tpr, threshold = roc_curve(y_test, preds)
roc_auc = auc(fpr, tpr)

# Printing AUC
print(f"AUC for our classifier is: {roc_auc}")

# Plotting the ROC
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
param_grid = {'C': [0.1, 1, 10, 100, 1000], 
              'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
              'kernel': ['rbf']} 
  
grid_svm = GridSearchCV(SVC(), param_grid, refit = True, verbose = 3)
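
To gauge the cost of this search before running it, the candidate combinations can be counted. The sketch below assumes the 5-fold cross-validation that `GridSearchCV` applies by default when `cv` is not given (scikit-learn 0.22 and later); `n_folds` and `candidates` are names introduced here for illustration.

```python
from itertools import product

param_grid = {'C': [0.1, 1, 10, 100, 1000],
              'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
              'kernel': ['rbf']}

# Every combination of one value per hyperparameter is one candidate model
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]

n_folds = 5  # assumed default number of folds
total_fits = len(candidates) * n_folds

print(len(candidates), total_fits)  # 25 candidates, 125 cross-validation fits
```

With `refit=True`, one additional fit of the best candidate on the full training set follows the search.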

In [None]:
grid_svm.fit(X_train, y_train)

In [None]:
# print best parameter after tuning
print(grid_svm.best_params_)
  
# print how our model looks after hyper-parameter tuning
print(grid_svm.best_estimator_)

In [None]:
print("Training set score: {:.2f}".format(grid_svm.score(X_train, y_train)))
print("Test set score: {:.2f}".format(grid_svm.score(X_test, y_test)))

In [None]:
pred_train_svm = grid_svm.predict(X_train)

In [None]:
y_pred_gsvm = grid_svm.predict(X_test)

In [None]:
print(classification_report(y_test, y_pred_gsvm))

In [None]:
labels = ['No CHD', 'CHD']
cm = confusion_matrix(y_train, pred_train_svm)
print(cm)

ax= plt.subplot()
sns.heatmap(cm, annot=True, ax = ax) #annot=True to annotate cells

# labels, title and ticks
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)

In [None]:
labels = ['No CHD', 'CHD']
cm = confusion_matrix(y_test, y_pred_gsvm)
print(cm)

ax= plt.subplot()
sns.heatmap(cm, annot=True, ax = ax) #annot=True to annotate cells

# labels, title and ticks
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)

In [None]:
# Accuracy, Precision, Recall, F1 score for the model (scikit-learn expects y_true first, then y_pred)
svm_accuracy = accuracy_score(y_test, y_pred_gsvm)
svm_precision = precision_score(y_test, y_pred_gsvm, average='weighted')
svm_recall = recall_score(y_test, y_pred_gsvm, average='weighted')
svm_f1_score = f1_score(y_test, y_pred_gsvm, average='weighted')
print(' Accuracy:', svm_accuracy, '\n', 'Precision :', svm_precision, '\n', 'Recall :', svm_recall, '\n', 'F1_score :', svm_f1_score)

# Model -4 : Random Forest Classification

In [None]:
rfc = RandomForestClassifier(n_estimators=100,max_depth=15,random_state=30,min_samples_split=3,criterion='entropy')
rfc.fit(X_train, y_train)

In [None]:
print("Training set score: {:.2f}".format(rfc.score(X_train, y_train)))
print("Test set score: {:.2f}".format(rfc.score(X_test, y_test)))

In [None]:
pred_train_rfc = rfc.predict(X_train)

In [None]:
y_pred_rfc = rfc.predict(X_test)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.


In [None]:
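# ROC AUC is computed from the predicted probability of the positive class rather than from hard labels
print(roc_auc_score(y_test, rfc.predict_proba(X_test)[:, 1]))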

In [None]:
print(classification_report(y_test, y_pred_rfc))

In [None]:
labels = ['No CHD', 'CHD']
cm = confusion_matrix(y_train, pred_train_rfc)
print(cm)

ax= plt.subplot()
sns.heatmap(cm, annot=True, ax = ax) #annot=True to annotate cells

# labels, title and ticks
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)

In [None]:
labels = ['No CHD', 'CHD']
cm = confusion_matrix(y_test, y_pred_rfc)
print(cm)

ax= plt.subplot()
sns.heatmap(cm, annot=True, ax = ax) #annot=True to annotate cells

# labels, title and ticks
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)

In [None]:
class_probabilities = rfc.predict_proba(X_test)
preds = class_probabilities[:, 1]

fpr, tpr, threshold = roc_curve(y_test, preds)
roc_auc = auc(fpr, tpr)

# Printing AUC
print(f"AUC for our classifier is: {roc_auc}")

# Plotting the ROC
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

In [None]:
# Accuracy, Precision, Recall, F1 score for the model (scikit-learn expects y_true first, then y_pred)
rf_accuracy = accuracy_score(y_test, y_pred_rfc)
rf_precision = precision_score(y_test, y_pred_rfc, average='weighted')
rf_recall = recall_score(y_test, y_pred_rfc, average='weighted')
rf_f1_score = f1_score(y_test, y_pred_rfc, average='weighted')
print(' Accuracy:', rf_accuracy, '\n', 'Precision :', rf_precision, '\n', 'Recall :', rf_recall, '\n', 'F1_score :', rf_f1_score)

# Model -5 : XG-Boost Classification

In [None]:
xgb = XGBClassifier(use_label_encoder=False, eval_metric='mlogloss')
xgb.fit(X_train, y_train)

In [None]:
print("Training set score: {:.2f}".format(xgb.score(X_train, y_train)))
print("Test set score: {:.2f}".format(xgb.score(X_test, y_test)))

In [None]:
pred_train_xgb = xgb.predict(X_train)

In [None]:
y_pred_xgb = xgb.predict(X_test)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
print(classification_report(y_test, y_pred_xgb))

In [None]:
labels = ['No CHD', 'CHD']
cm = confusion_matrix(y_train, pred_train_xgb)
print(cm)

ax= plt.subplot()
sns.heatmap(cm, annot=True, ax = ax) #annot=True to annotate cells

# labels, title and ticks
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)

In [None]:
labels = ['No CHD', 'CHD']
cm = confusion_matrix(y_test, y_pred_xgb)
print(cm)

ax= plt.subplot()
sns.heatmap(cm, annot=True, ax = ax) #annot=True to annotate cells

# labels, title and ticks
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)

In [None]:
class_probabilities = xgb.predict_proba(X_test)
preds = class_probabilities[:, 1]

fpr, tpr, threshold = roc_curve(y_test, preds)
roc_auc = auc(fpr, tpr)

# Printing AUC
print(f"AUC for our classifier is: {roc_auc}")

# Plotting the ROC
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

In [None]:
# Accuracy, Precision, Recall, F1 score for the model (scikit-learn expects y_true first, then y_pred)
xgb_accuracy = accuracy_score(y_test, y_pred_xgb)
xgb_precision = precision_score(y_test, y_pred_xgb, average='weighted')
xgb_recall = recall_score(y_test, y_pred_xgb, average='weighted')
xgb_f1_score = f1_score(y_test, y_pred_xgb, average='weighted')
print(' Accuracy:', xgb_accuracy, '\n', 'Precision :', xgb_precision, '\n', 'Recall :', xgb_recall, '\n', 'F1_score :', xgb_f1_score)

# Model-6 : KNN Classification

In [None]:
# calculate accuracy score for first 10 neighbors
# Setup arrays to store training and test accuracies
neighbors = 10
train_accuracy = np.empty(neighbors)
test_accuracy = np.empty(neighbors)

for i in range(1, neighbors + 1):
    # Setup a knn classifier with k neighbors
    knn = KNeighborsClassifier(n_neighbors=i)
    # Fit the model
    knn.fit(X_train, y_train)
    # Compute accuracy on the training set
    train_accuracy[i - 1] = knn.score(X_train, y_train)
    # Compute accuracy on the test set
    test_accuracy[i - 1] = knn.score(X_test, y_test) 

In [None]:
# print accuracy
print("Train Accuracy: ", train_accuracy)
print("Test Accuracy: ", test_accuracy)

In [None]:
# Generate plot
plt.title('k-NN Score with varying number of neighbors')
x_axis = [i for i in range(1,11)]
plt.plot(x_axis, test_accuracy, label='Testing Accuracy')
plt.plot(x_axis, train_accuracy, label='Training accuracy')
plt.legend()
plt.xlabel('Number of neighbors')
plt.ylabel('Accuracy')
plt.show()
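
The curve can also be summarized numerically by reading off the k with the highest test accuracy. In the sketch below the accuracy values are made-up placeholders, not results from this notebook, and `best_k` is a hypothetical helper. Choosing k directly on the test set leaks information, so a cross-validated search on the training data is the safer way to make the final choice.

```python
def best_k(accuracies):
    """Return (k, accuracy) for the highest accuracy; index 0 corresponds to k = 1."""
    best_index = max(range(len(accuracies)), key=lambda i: accuracies[i])
    return best_index + 1, accuracies[best_index]


# Placeholder test accuracies for k = 1..10 (illustrative values only)
example_test_accuracy = [0.84, 0.80, 0.81, 0.79, 0.80, 0.78, 0.79, 0.77, 0.78, 0.77]

k, acc = best_k(example_test_accuracy)
print(f"best k = {k}, test accuracy = {acc:.2f}")
```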
     

In [None]:
param_grid = {'n_neighbors':np.arange(1,50),
               'weights' : ['uniform','distance'],
               'metric' : ['minkowski','euclidean','manhattan']
              }

In [None]:
knn = KNeighborsClassifier()
knn_cv= GridSearchCV(knn,param_grid,cv=3, verbose=1, n_jobs=-1)
knn_cv.fit(X_train,y_train)

In [None]:
# best score and parameters from the above options
print('best_score :',knn_cv.best_score_)
print(knn_cv.best_params_)

In [None]:
print("Training set score: {:.2f}".format(knn_cv.score(X_train, y_train)))
print("Test set score: {:.2f}".format(knn_cv.score(X_test, y_test)))

In [None]:
pred_train_knn = knn_cv.predict(X_train)

In [None]:
y_pred_knn = knn_cv.predict(X_test)

In [None]:
print(classification_report(y_test, y_pred_knn))

In [None]:
labels = ['No CHD', 'CHD']
cm = confusion_matrix(y_train, pred_train_knn)
print(cm)

ax= plt.subplot()
sns.heatmap(cm, annot=True, ax = ax) #annot=True to annotate cells

# labels, title and ticks
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)

In [None]:
labels = ['No CHD', 'CHD']
cm = confusion_matrix(y_test, y_pred_knn)
print(cm)

ax= plt.subplot()
sns.heatmap(cm, annot=True, ax = ax) #annot=True to annotate cells

# labels, title and ticks
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)

In [None]:
class_probabilities = knn_cv.predict_proba(X_test)
preds = class_probabilities[:, 1]

fpr, tpr, threshold = roc_curve(y_test, preds)
roc_auc = auc(fpr, tpr)

# Printing AUC
print(f"AUC for our classifier is: {roc_auc}")

# Plotting the ROC
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

In [None]:
# Accuracy, Precision, Recall, F1 score for the model (scikit-learn expects y_true first, then y_pred)
knn_accuracy = accuracy_score(y_test, y_pred_knn)
knn_precision = precision_score(y_test, y_pred_knn, average='weighted')
knn_recall = recall_score(y_test, y_pred_knn, average='weighted')
knn_f1_score = f1_score(y_test, y_pred_knn, average='weighted')
print(' Accuracy:', knn_accuracy, '\n', 'Precision :', knn_precision, '\n', 'Recall :', knn_recall, '\n', 'F1_score :', knn_f1_score)

### 1. Which Evaluation metrics did you consider for a positive business impact and why?


Accuracy is not the best metric to use when evaluating imbalanced datasets as it can be misleading.
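
A small worked example makes this concrete. With made-up labels in which 15% of patients have CHD, a trivial classifier that always predicts "No CHD" reaches 85% accuracy while detecting no positive case at all. The numbers are illustrative and are not taken from this dataset.

```python
# Made-up, imbalanced labels: 85 negatives (0 = No CHD) and 15 positives (1 = CHD)
y_true = [0] * 85 + [1] * 15

# A trivial "classifier" that always predicts the majority class
y_pred = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall = true_positives / sum(y_true)  # sum(y_true) counts the actual positives

print(f"accuracy = {accuracy:.2f}, recall = {recall:.2f}")
```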

The metrics considered for evaluation are as follows:

* **Confusion Matrix**: a table showing correct predictions and types of incorrect predictions.
* **Precision**: the number of true positives divided by all positive predictions. Precision is also called Positive Predictive Value. It is a measure of a classifier’s exactness. Low precision indicates a high number of false positives.
* **Recall**: the number of true positives divided by the number of positive values in the test data. The recall is also called Sensitivity or the True Positive Rate. It is a measure of a classifier’s completeness. Low recall indicates a high number of false negatives.
* **F1 Score**: the harmonic mean of precision and recall.
* **Area Under ROC Curve (AUC-ROC)**: AUC-ROC represents the likelihood of your model distinguishing observations from two classes. In other words, if you randomly select one observation from each class, what’s the probability that your model will be able to “rank” them correctly?
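
These definitions can be sketched in plain Python. The confusion-matrix counts and the scores below are made-up numbers for illustration, and the helper names are hypothetical; in the notebook itself scikit-learn's functions do this work.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts (assumes non-zero denominators)."""
    precision = tp / (tp + fp)                           # exactness
    recall = tp / (tp + fn)                              # completeness
    f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two
    return precision, recall, f1


def pairwise_auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs where the positive gets the higher score; ties count half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Made-up counts: 40 true positives, 10 false positives, 20 false negatives
precision, recall, f1 = precision_recall_f1(tp=40, fp=10, fn=20)

# Made-up predicted probabilities for three CHD and three non-CHD patients
auc_value = pairwise_auc(pos_scores=[0.9, 0.8, 0.4], neg_scores=[0.7, 0.3, 0.2])

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} auc={auc_value:.3f}")
```

The pairwise function is the ranking interpretation of AUC spelled out literally: it is quadratic in the number of samples, so it serves as an explanation rather than a replacement for `roc_curve` and `auc`.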

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

*Considering the values in the table above and the ROC-AUC curves, Random Forest gives the best result.*

### 3. Explain feature importance

In [None]:
importances = pd.DataFrame({'Features': X.columns, 
                                'Importances': rfc.feature_importances_})
    
importances.sort_values(by=['Importances'], axis='index', ascending=False, inplace=True)
fig = plt.figure(figsize=(14, 4))
sns.barplot(x='Features', y='Importances', data=importances)
plt.xticks(rotation='vertical')
plt.title('Feature importance score w.r.t. Random Forest model')
plt.show()
     

*Age is the most important factor when diagnosing a patient for CHD risk, followed by heart rate. Hence, more focus should be given to these factors.*
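
The scores plotted above are the impurity-based importances that a Random Forest exposes through `feature_importances_`. A common cross-check is permutation importance, which measures how much accuracy drops when the values of one feature are shuffled. The sketch below illustrates the idea on toy data, with a hypothetical rule-based model standing in for the Random Forest. It is not the method used in this notebook, and every name and value in it is an assumption made for illustration.

```python
import random


def toy_model(row):
    """Hypothetical stand-in model: predicts CHD (1) when age (column 0) is at least 50."""
    return 1 if row[0] >= 50 else 0


def accuracy(model, X, y):
    """Share of rows for which the model's prediction matches the label."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(model, X, y, column, n_repeats=20, seed=0):
    """Mean drop in accuracy after shuffling the values of one column of X."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    drops = []
    for _ in range(n_repeats):
        shuffled = [row[column] for row in X]
        rng.shuffle(shuffled)
        # Rebuild the rows with only the chosen column replaced by shuffled values
        X_perm = [row[:column] + [value] + row[column + 1:] for row, value in zip(X, shuffled)]
        drops.append(baseline - accuracy(model, X_perm, y))
    return sum(drops) / n_repeats


# Toy data: columns are [age, heart_rate]; the labels follow the age rule exactly
X = [[35, 70], [42, 88], [48, 65], [51, 90], [58, 72], [63, 80], [67, 75], [39, 95]]
y = [0, 0, 0, 1, 1, 1, 1, 0]

age_importance = permutation_importance(toy_model, X, y, column=0)
heart_rate_importance = permutation_importance(toy_model, X, y, column=1)
print(age_importance, heart_rate_importance)
```

Because the toy model ignores heart rate, shuffling that column changes nothing and its importance is exactly zero, while shuffling age degrades accuracy.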

# **Conclusion**

Six different machine learning algorithms were trained on the training dataset.

Random Forest provided the best results, classifying patients with an F1-score and test accuracy of 90%.

To provide immediate treatment to patients at risk of CHD, Type II error (false negatives) should be low, i.e. a high recall value is desired.

To avoid spending time on CHD treatment for patients who do not actually have the condition, Type I error (false positives) should be low, i.e. high precision is desired.

To treat patients with an actual risk of CHD, there should be a balance between precision and recall, i.e. a high F1-score is desirable, since it penalizes extreme values more heavily.
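
The sense in which the F1-score penalizes extreme values can be seen by comparing it with a plain average. The numbers below are illustrative and are not results from the models above; the function names are introduced here for the example.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def arithmetic_mean(precision, recall):
    """Plain average, shown only for comparison."""
    return (precision + recall) / 2


balanced = (0.6, 0.6)
lopsided = (1.0, 0.2)  # same arithmetic mean as `balanced`, but very low recall

for p, r in (balanced, lopsided):
    print(f"precision={p}, recall={r}: mean={arithmetic_mean(p, r):.2f}, F1={f1(p, r):.2f}")
```

Both pairs average to 0.60, yet the lopsided pair has an F1 of about 0.33, so a model cannot reach a high F1-score by excelling at only one of the two metrics.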

The models which provide the above desired results are as follows:

* Recall: k-Nearest Neighbors
* Precision: Support Vector Machines
* F1-score: Random Forest
* Test Accuracy: Random Forest

Age is the deciding factor for the patients in the dataset, followed by heart rate, cigarettes per day and cholesterol level.

More focus should be given to these features for patients attending for treatment.

Although patients not suffering from CHD outnumber those at risk, the number of patients at risk of CHD is still high.
