<a href="https://colab.research.google.com/github/121deepti/Cardiovascular_Risk_Prediction/blob/main/Cardiovascular_Risk_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Cardiovascular Risk Prediction



##### **Project Type**    - Classification
##### **Contribution**    - Individual

# **Project Summary -**

The project aims to predict the 10-year risk of future coronary heart disease (CHD) for patients from Framingham, Massachusetts. A dataset containing demographic, behavioral, and medical risk factors for 3,390 patients is used to build a predictive model. The model uses machine learning techniques to analyze this information and predict CHD risk. The goal of the project is to develop a tool for early detection and prevention of CHD, a significant public health concern. The outcome is a predictive model that healthcare providers can use to make informed decisions about patient care.

The dataset contains 3,390 records and 17 attributes.
We started by importing the dataset and the necessary libraries, and then conducted exploratory data analysis (EDA).
Null values were either dropped or imputed, and outliers were clipped. The data were then transformed to make them compatible with machine learning models.
We handled the target class imbalance using SMOTE.
Finally, the cleaned and scaled data were passed to 8 different models, evaluation metrics were computed for each, and hyperparameters were tuned to select suitable parameter values.
When developing a machine learning model, it is generally recommended to track multiple metrics because each one highlights a distinct aspect of model performance. We focus mainly on the recall score and the F1 score.
In this setting, a false negative (classifying an at-risk patient as healthy) is the most costly error, which is why we prioritize recall.
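
To make the preference for recall concrete, here is a minimal sketch in plain Python that computes recall, precision and F1 from binary labels. The labels are hypothetical and the helper names are illustrative, not part of the notebook. It shows that a model can reach 70% accuracy while still missing half of the at-risk patients.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels, where 1 means at risk of CHD."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def recall_precision_f1(y_true, y_pred):
    """Recall = TP / (TP + FN); precision = TP / (TP + FP); F1 = harmonic mean."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return recall, precision, f1


# Hypothetical labels: 4 at-risk patients and 6 healthy ones
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

recall, precision, f1 = recall_precision_f1(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy={accuracy:.2f} recall={recall:.2f} "
      f"precision={precision:.2f} f1={f1:.2f}")
```

The two false negatives lower recall but barely affect accuracy, which is why accuracy alone is not a sufficient metric for this problem.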

# **GitHub Link -**

https://github.com/121deepti/Cardiovascular_Risk_Prediction/blob/main/Cardiovascular_Risk_Prediction.ipynb

# **Problem Statement**


**What exactly are cardiovascular diseases?**

Cardiovascular diseases are a group of conditions affecting the heart and blood vessels. They include coronary heart disease, which affects the blood vessels that supply the heart muscle. Heart attacks and strokes are typically sudden events, most often caused by a blockage that prevents blood from flowing to the heart or brain. The most common cause of such a blockage is a buildup of fatty deposits on the inner walls of the blood vessels that supply the heart or brain.

The goal of the classification is to predict the 10-year risk of future coronary heart disease (CHD) for patients. Coronary heart disease is a significant public health concern, and early prediction of CHD risk is crucial for preventive measures. The dataset is from an ongoing cardiovascular study on residents of Framingham, Massachusetts. It includes 3,390 records and 17 attributes, covering demographic, behavioral, and medical risk factors.

**WHY DO WE NEED CARDIOVASCULAR RISK PREDICTION?**

Accurately predicting and diagnosing heart disease is one of the major challenges facing the medical industry, because heart disease is influenced by numerous factors.
Heart disease is often referred to as a "silent killer" because it can be fatal without showing any obvious symptoms beforehand.
When high-risk patients are identified early, it is easier to make lifestyle changes, which in turn lowers the risk of complications.
Based on people's current lifestyle and health indicators, machine learning can help predict the likelihood of heart disease in the coming years.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML model used and its performance using an evaluation metric score chart.


*   Cross-validation & hyperparameter tuning

*   Have you seen any improvement? Note down the improvement with an updated evaluation metric score chart.

*   Explain what each evaluation metric indicates for the business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings('ignore')

from statsmodels.stats.outliers_influence import variance_inflation_factor

from sklearn.preprocessing import MinMaxScaler

from sklearn.metrics import  make_scorer,f1_score,roc_curve,accuracy_score,classification_report,confusion_matrix,roc_auc_score

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBRFClassifier
from sklearn.svm import SVC

from sklearn.model_selection import train_test_split

from imblearn.combine import SMOTETomek

### Dataset Loading

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load Dataset
df=pd.read_csv('/content/drive/MyDrive/my_data/data_cardiovascular_risk.csv')

### Dataset First View

In [None]:
# Dataset First Look
df.sample(3)

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isna().sum().sort_values(ascending=False)

In [None]:
#Missing values percentage
round((df.isna().sum().sort_values(ascending=False))*100/df.shape[0],2)

In [None]:
# Visualizing the missing values
sns.heatmap(df.isnull(),cbar=False)

In [None]:
# pandas-profiling has been renamed to ydata-profiling
!pip install ydata-profiling

In [None]:
from ydata_profiling import ProfileReport
profile = ProfileReport(df, title="Pandas Profiling Report")
profile

### What did you know about your dataset?

There are 3390 rows and 17 columns in the dataset. No duplicates are found. Some null values are observed in several features.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe(include='all')

### Variables Description

The dataset provides the patients’ information. It includes 3,390 records and 17 attributes (of which TenYearCHD is the target column). Each attribute is a potential risk factor. There are demographic, behavioral, and medical risk factors.

Demographic:
*   Sex: male or female ("M" or "F")
*   Age: age of the patient (continuous; although the recorded ages have been truncated to whole numbers, the concept of age is continuous)
*   Education: the level of education of the patient (categorical values: 1, 2, 3, 4)

Behavioral:
*   is_smoking: whether or not the patient is a current smoker ("YES" or "NO")
*   Cigs Per Day: the number of cigarettes that the person smoked on average in one day

Medical (history):
*   BP Meds: whether or not the patient was on blood pressure medication
*   Prevalent Stroke: whether or not the patient had previously had a stroke
*   Prevalent Hyp: whether or not the patient was hypertensive
*   Diabetes: whether or not the patient had diabetes (nominal)

Medical (current):
*   Tot Chol: total cholesterol level
*   Sys BP: systolic blood pressure
*   Dia BP: diastolic blood pressure
*   BMI: Body Mass Index
*   Heart Rate: heart rate
*   Glucose: glucose level

Target:
*   TenYearCHD (**Target Variable**): 10-year risk of coronary heart disease (binary: "1" means "Yes", "0" means "No")

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for i in df.columns:
  print("No of Unique Values in ", i, " is:", df[i].nunique())

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
#create a copy of dataset
df_eda=df.copy()

In [None]:
#Prevalent stroke effect on heart disease
pd.crosstab(df_eda.prevalentStroke,df_eda.TenYearCHD).apply(lambda r: round((r/r.sum()),2), axis=1)

In [None]:
#Prevalent Hypertension impact on heart disease
pd.crosstab(df_eda.prevalentHyp,df_eda.TenYearCHD).apply(lambda r: round((r/r.sum()),2), axis=1)

In [None]:
#Creating age bins and impact on heart disease
df_eda['age_bins'] = pd.cut(x=df_eda['age'], bins=[30, 35, 40, 45,50,55,60,65,70])
# We can check the frequency of each bin
pd.crosstab(df_eda.age_bins,df_eda.TenYearCHD).apply(lambda r: round((r/r.sum()),2), axis=1)

In [None]:
#Creating BMI bins and finding its relation with heart disease
df_eda['BMI_bins'] = pd.cut(x=df_eda['BMI'], bins=[15,25,35,45,55,65])
print(pd.crosstab(df_eda.BMI_bins,df_eda.TenYearCHD))
print('\n')
pd.crosstab(df_eda.BMI_bins,df_eda.TenYearCHD).apply(lambda r: round((r/r.sum()),2), axis=1)

In [None]:
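#Creating cigarette bins and finding their relation with heart disease
#bins start at 0 so that non-smokers and light smokers are not dropped as NaN
df_eda['cig_bins'] = pd.cut(x=df_eda['cigsPerDay'], bins=[0,10,20,30,40,50,60,70], include_lowest=True)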

print(pd.crosstab(df_eda.cig_bins,df_eda.TenYearCHD))
print('\n')
pd.crosstab(df_eda.cig_bins,df_eda.TenYearCHD).apply(lambda r: round((r/r.sum()),2), axis=1)

In [None]:
#Sex impact on heart disease
pd.crosstab(df_eda.sex,df_eda.TenYearCHD).apply(lambda r: round((r/r.sum()),2), axis=1)

In [None]:
#Diabetes impact on heart disease
pd.crosstab(df_eda.diabetes,df_eda.TenYearCHD).apply(lambda r: round((r/r.sum()),2), axis=1)


In [None]:
#BP medication impact on heart disease
pd.crosstab(df_eda.BPMeds,df_eda.TenYearCHD).apply(lambda r: round((r/r.sum()),2), axis=1)

In [None]:
#Creating cholesterol bins and finding their relation with heart disease
df_eda['chol_bins'] = pd.cut(x=df_eda['totChol'], bins=[100, 200,300,400,500,600,700])
print(pd.crosstab(df_eda.chol_bins,df_eda.TenYearCHD))
print('\n')
pd.crosstab(df_eda.chol_bins,df_eda.TenYearCHD).apply(lambda r: round((r/r.sum()),2), axis=1)

In [None]:
#Creating heartRate bins and finding its relation with heart disease
df_eda['heartRate_bins'] = pd.cut(x=df_eda['heartRate'], bins=[40,60,80,100,120,140,160])
print(pd.crosstab(df_eda.heartRate_bins,df_eda.TenYearCHD))
print('\n')
pd.crosstab(df_eda.heartRate_bins,df_eda.TenYearCHD).apply(lambda r: round((r/r.sum()),2), axis=1)

In [None]:
#Creating Glucose bins and finding its relation with heart disease
df_eda['Glucose_bins'] = pd.cut(x=df_eda['glucose'], bins=[40,150,250,350,450])
# We can check the frequency of each bin
print(pd.crosstab(df_eda.Glucose_bins,df_eda.TenYearCHD))
print('\n')
pd.crosstab(df_eda.Glucose_bins,df_eda.TenYearCHD).apply(lambda r: round((r/r.sum()),2), axis=1)

In [None]:
df.glucose.min()

In [None]:
df.head(2)

In [None]:
df1=df[df['TenYearCHD']==0]
df1[['age','cigsPerDay','totChol','sysBP','diaBP','BMI','heartRate','glucose']].describe()

In [None]:
df2=df[df['TenYearCHD']==1]
df2[['age','cigsPerDay','totChol','sysBP','diaBP','BMI','heartRate','glucose']].describe()

In [None]:
print(len(df[df['TenYearCHD']==1]))
print(len(df[df['TenYearCHD']==0]))

In [None]:
print(pd.crosstab(df_eda.is_smoking,df_eda.TenYearCHD))
print('\n')
pd.crosstab(df_eda.is_smoking,df_eda.TenYearCHD).apply(lambda r: round((r/r.sum()),2), axis=1)

### What all manipulations have you done and insights you found?

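We created a working copy of the data (`df_eda`) and binned the continuous variables age, BMI, cigsPerDay, totChol, heartRate and glucose. Each binned or categorical feature was then cross-tabulated against `TenYearCHD` to obtain the row-wise proportion of CHD cases in every group. We also compared the summary statistics of the numeric features for the two target classes and counted the records in each class.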
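
The row-wise proportions computed in the cells above can also be obtained with the `normalize="index"` argument of `pd.crosstab`. The sketch below uses a small hypothetical frame (`toy`) in place of `df_eda`, with coarser age bins chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for df_eda
toy = pd.DataFrame({
    "age": [33, 38, 44, 52, 58, 63, 66, 47],
    "TenYearCHD": [0, 0, 0, 1, 0, 1, 1, 0],
})

# Bin the continuous variable; intervals are right-closed, e.g. (30, 40]
toy["age_bins"] = pd.cut(toy["age"], bins=[30, 40, 50, 60, 70])

# Raw counts per bin and target class
counts = pd.crosstab(toy["age_bins"], toy["TenYearCHD"])

# Share of each target class within every bin (each row sums to 1)
rates = pd.crosstab(toy["age_bins"], toy["TenYearCHD"], normalize="index").round(2)

print(counts)
print(rates)
```

Column `1` of `rates` is the observed CHD rate per bin, which makes groups of very different sizes directly comparable.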

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

### Univariate Analysis

In [None]:
numeric_cols=['age','cigsPerDay','totChol','sysBP','diaBP','BMI','heartRate','glucose']
cate_cols=['sex','education','is_smoking','BPMeds','prevalentStroke', 'prevalentHyp', 'diabetes','TenYearCHD']

#### Chart - 1

In [None]:
# Chart - 1 Count plot of Categorical columns
fig = plt.figure(figsize=(12, 10))
for index,item in enumerate(cate_cols):
  ax = plt.subplot(3,3,index+1)
  sns.countplot(x=item, data=df, ax=ax)
  # annotate each bar with its count
  for p in ax.patches:
    ax.annotate('{:.0f}'.format(p.get_height()), (p.get_x()+0.25, p.get_height()+0.01))
# adjust the layout once, after all subplots are drawn
plt.tight_layout()

##### 1. Why did you pick the specific chart?

This chart shows the count of each category for the categorical features in the dataset.

##### 2. What is/are the insight(s) found from the chart?

The findings are:
*   There are more females than males.
*   Smokers and non-smokers are almost equal in number.
*   Far more patients are not on BP medication than are on it.
*   Very few patients have a history of stroke.
*   Hypertensive and diabetic patients are in the minority.
*   Most patients are in the low-risk class.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

#### Chart - 2

In [None]:
# Chart - 2 Pie charts for Categorical Columns
fig = plt.figure(figsize=(12, 10))
for index,item in enumerate(cate_cols):
  plt.subplot(3,3,index+1)
  ax = fig.gca()
  df[item].value_counts().plot(kind='pie',autopct='%1.0f%%')
  plt.tight_layout()

##### 1. Why did you pick the specific chart?

This chart shows the percentage distribution of each categorical feature.

##### 2. What is/are the insight(s) found from the chart?

The proportions are as follows:
*   57% females and 43% males
*   50% smokers and 50% non-smokers
*   97% of patients are not on BP medication and 97% are non-diabetic; 3% are on BP medication and 3% are diabetic
*   99% have no stroke history
*   68% are non-hypertensive
*   85% are in the low-risk class
##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 3

In [None]:
# Chart - 3 Histogram for Numerical columns
fig = plt.figure(figsize=(12, 10))
for index,item in enumerate(numeric_cols):
  ax = plt.subplot(3,3,index+1)
  # distplot is deprecated in recent seaborn versions; histplot with a KDE overlay replaces it
  sns.histplot(df[item], kde=True, stat='density', ax=ax)
  sk=round(df[item].skew(),2)
  # dashed lines mark the mean (magenta) and the median (cyan)
  ax.axvline(df[item].mean(), color='magenta', linestyle='dashed', linewidth=2)
  ax.axvline(df[item].median(), color='cyan', linestyle='dashed', linewidth=2)
  ax.set_title(item+' skewness: '+str(sk))
plt.tight_layout()


##### 1. Why did you pick the specific chart?

This chart displays the estimated probability density (PDF) of each numerical variable along with its skewness.

##### 2. What is/are the insight(s) found from the chart?

Here are some important findings:
*   Except for cigsPerDay, all variables follow a normal or approximately normal distribution.
*   All features are positively skewed. cigsPerDay and BMI are highly skewed, and glucose is the most skewed (~6.14).

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 4

In [None]:
# Chart - 4 Log Transformation of skewed variables
fig = plt.figure(figsize=(8, 3))
for index,item in enumerate(['BMI','cigsPerDay','glucose']):
  ax = plt.subplot(1,3,index+1)
  log_values = np.log1p(df[item])
  sns.histplot(log_values, kde=True, stat='density', ax=ax)
  # skewness of the log-transformed values (not the log of the original skewness)
  sk=round(log_values.skew(),2)
  ax.set_title(item+" skewness "+str(sk))
plt.tight_layout()

##### 1. Why did you pick the specific chart?

This chart shows the distribution of the skewed variables after log transformation.

##### 2. What is/are the insight(s) found from the chart?

After applying the log transformation, BMI, cigsPerDay and glucose are visibly less skewed; the skewness of each transformed variable is shown in its subplot title.
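
As a small illustration of how a log transformation affects skewness, the sketch below uses synthetic log-normal data (an assumption standing in for a right-skewed feature such as glucose, not the notebook's data). The skewness has to be computed on the transformed values; taking the logarithm of the original skewness gives a different quantity.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed values, standing in for a feature such as glucose
values = pd.Series(rng.lognormal(mean=4.4, sigma=0.4, size=5000))

skew_raw = values.skew()                 # skewness before the transformation
skew_log = np.log1p(values).skew()       # skewness of the log-transformed data
log_of_skew = np.log1p(skew_raw)         # NOT the skewness of the transformed data

print("raw skewness:            ", round(skew_raw, 2))
print("skewness after log1p:    ", round(skew_log, 2))
print("log1p of raw skewness:   ", round(log_of_skew, 2))
```

On data like this, the transformed values are close to symmetric, while `log_of_skew` remains well above zero.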

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 5

In [None]:
# Chart - 5 Box plot Analysis
fig = plt.figure(figsize=(6, 4))
for index,item in enumerate(numeric_cols):
  plt.subplot(3,3,index+1)
  ax = fig.gca()
  plt.xlabel(item)
  sns.boxplot(x =item, data = df)
  plt.tight_layout()

##### 1. Why did you pick the specific chart?

This chart analyses the outliers present in numerical features.

##### 2. What is/are the insight(s) found from the chart?

Here are the outcomes of this chart:
*   "age": no outliers are present.
*   "cigsPerDay": two outliers are present beyond the upper limit.
*   "totChol", "diaBP", "heartRate" and "glucose": many outliers are present beyond the upper limit, and some below the lower limit too.
*   "sysBP" and "BMI": many outliers are present beyond the upper limit.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Outliers can distort the analysis and the model, which can send wrong signals to the business.

### Bivariate Analysis

#### Chart - 6

In [None]:
# Chart - 6 Relation of SystolicBP and diastolicBP
print (numeric_cols)
print(cate_cols)
fig = plt.figure(figsize=(6, 4))
sns.scatterplot(data=df,x='sysBP',y='diaBP')

##### 1. Why did you pick the specific chart?

This chart shows the relationship between sysBP and diaBP.

##### 2. What is/are the insight(s) found from the chart?

There is a roughly linear, positive relationship between sysBP and diaBP.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

sysBP and diaBP are crucial factors in determining the risk of heart disease, so they should be kept under control.

#### Chart - 7

In [None]:
# Chart - 7 Bar graphs for categorical vs numerical features analysis
fig = plt.figure(figsize=(8, 6))
plt.subplot(3,4,1)
ax = fig.gca()
sns.barplot(data=df,x='BPMeds',y='sysBP',hue='prevalentStroke')

plt.subplot(3,4,2)
ax = fig.gca()
sns.barplot(data=df,x='prevalentStroke',y='BMI',hue='sex')

plt.subplot(3,4,3)
ax = fig.gca()
sns.barplot(data=df,x='prevalentHyp',y='totChol',hue='TenYearCHD')

plt.subplot(3,4,4)
ax = fig.gca()
sns.barplot(data=df,x='diabetes',y='glucose',hue='TenYearCHD')

plt.subplot(3,4,5)
ax = fig.gca()
sns.barplot(data=df,x='TenYearCHD',y='cigsPerDay')

plt.subplot(3,4,6)
ax = fig.gca()
sns.barplot(data=df,x='TenYearCHD',y='totChol')

plt.subplot(3,4,7)
ax = fig.gca()
sns.barplot(data=df,x='TenYearCHD',y='sysBP')

plt.subplot(3,4,8)
ax = fig.gca()
sns.barplot(data=df,x='TenYearCHD',y='BMI')

plt.subplot(3,4,9)
ax = fig.gca()
sns.barplot(data=df,x='TenYearCHD',y='glucose',hue='diabetes')

plt.subplot(3,4,10)
ax = fig.gca()
sns.barplot(data=df,x='diabetes',y='heartRate')

plt.subplot(3,4,11)
ax = fig.gca()
sns.barplot(data=df,x='prevalentStroke',y='heartRate')

plt.subplot(3,4,12)
ax = fig.gca()
#sns.barplot(data=df,x='is_smoking',y='heartRate')
#sns.barplot(data=df,x='TenYearCHD',y='age')

plt.tight_layout()


##### 1. Why did you pick the specific chart?

This chart reflects the relationship between numerical and categorical variables.

##### 2. What is/are the insight(s) found from the chart?

The following are the insights:
*   Among persons on BP medication, those with a stroke history have lower sysBP than those without one, while among persons not on BP medication, those with a stroke history have higher sysBP than those without one. This suggests that BP medication plays an important role.
*   Females with a stroke history have higher BMI than males, whereas among persons with no stroke history males have higher BMI. This suggests that increased BMI may increase the risk of stroke.
*   Persons with a stroke history also have higher cholesterol levels, which increases the risk of heart disease.
*   Diabetic persons are at high risk of heart disease.
*   Smoking, cholesterol, BP, BMI and glucose level all influence the risk.
*   Diabetic persons and smokers have comparatively higher heart rates, but patients with a prevalent stroke have significantly lower heart rates.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This analysis helps in recognizing the crucial factors that have a strong influence in determining heart disease.

#### Chart - 8

In [None]:
df.is_smoking.value_counts()

In [None]:
# Chart - 8 visualization code
fig = plt.figure(figsize=(10, 12))
plt.subplot(5,4,1)
ax = fig.gca()
sns.boxplot(data=df,x='cigsPerDay',y='is_smoking')

plt.subplot(5,4,2)
ax = fig.gca()
sns.boxplot(data=df,y='glucose',x='diabetes',hue='TenYearCHD')

plt.subplot(5,4,3)
ax = fig.gca()
sns.boxplot(data=df,y='sysBP',x='BPMeds')

x=['prevalentStroke','prevalentHyp','diabetes']
for i,item in enumerate(x):
  plt.subplot(5,4,i+4)
  ax = fig.gca()
  sns.boxplot(data=df,y='sysBP',x=item)

x=['prevalentStroke','prevalentHyp','diabetes','TenYearCHD','BPMeds','sex']
for i,item in enumerate(x):
  plt.subplot(5,4,i+7)
  ax = fig.gca()
  sns.boxplot(data=df,y='BMI',x=item)

x=['prevalentStroke','prevalentHyp','diabetes','TenYearCHD']
for i,item in enumerate(x):
  plt.subplot(5,4,i+13)
  ax = fig.gca()
  sns.boxplot(data=df,y='heartRate',x=item)
plt.tight_layout()

##### 1. Why did you pick the specific chart?

This chart helps in understanding the presence of outliers in various scenarios.

##### 2. What is/are the insight(s) found from the chart?

These are the insights as per my understanding:
*   Some persons smoke far more cigarettes per day than most smokers.
*   Among non-diabetic persons, many outliers are present in the glucose level.
*   For sysBP, persons not on BP medication, with no stroke history, or non-diabetic show a wide range of outliers, while both hypertensive and non-hypertensive patients show a wide range of outliers.
*   For BMI, a wide range of outliers is present in both high-risk and low-risk patients. Patients on BP medication and males have comparatively fewer outliers, while patients with no stroke history and non-diabetic patients have more. Both hypertensive and non-hypertensive patients show a wide range of outliers.
*   For heartRate, patients with no stroke history and non-diabetic patients have a wider range of outliers. Both hypertensive and non-hypertensive patients show a wide range of outliers, and the same holds for both risk classes.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Persons who are not on BP medication, are non-diabetic, or have no stroke history show many outliers, so these must be inspected carefully; otherwise they can lead to wrong predictions.

#### Chart - 9

In [None]:
# Chart - 9 Relation between Education and Sex with TenYearCHD
pd.crosstab(df.education,df.TenYearCHD).plot(kind='pie',autopct='%1.0f%%',subplots=True)
pd.crosstab(df.sex,df.TenYearCHD).plot(kind='pie',autopct='%1.0f%%',subplots=True)

##### 1. Why did you pick the specific chart?

These charts show how heart disease risk relates to education level and sex.

##### 2. What is/are the insight(s) found from the chart?

As the education level increases, the percentage of high-risk patients decreases or remains equal.
The majority of the patients belong to education level 1, followed by 2, 3, and 4 respectively.
Males have a higher percentage of heart disease, while females are in the safer zone.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Hypertensive heart disease is a long-term condition that develops over many years in people who have high blood pressure, which increases the risk of cardiovascular diseases.

#### Chart - 10

In [None]:
# Chart - 10 visualization code
sns.pairplot(df)

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 11

In [None]:
# Chart - 11 correlation heatmap
plt.figure(figsize=(12,12))
correlation = df.corr(numeric_only=True)  # skip non-numeric columns such as sex and is_smoking
sns.heatmap(abs(correlation), annot=True)

##### 1. Why did you pick the specific chart?

This chart is important to know the correlation between various dependent and independent features.

##### 2. What is/are the insight(s) found from the chart?

Some important insights are-
*   sysBP and diaBP are highly correlated with prevalent hypertension and BMI.
*   glucose is highly correlated with diabetes.
*   sysBP and diaBP are highly correlated with each other.
*   age is correlated with prevalent hypertension, cholesterol, sysBP and diaBP as well.
*   BPMeds is correlated with sysBP, diaBP and hypertension.
*   TenYearCHD is correlated with age, sysBP, diaBP, glucose and hypertension.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Knowing these correlations is important, as it helps us avoid pitfalls such as multicollinearity and a poor bias-variance trade-off.
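
One common way to quantify multicollinearity is the variance inflation factor (VIF). The sketch below is an illustration on synthetic data, not the notebook's own computation: it uses the textbook identity that the VIF of each feature equals the corresponding diagonal entry of the inverse correlation matrix. The frame `toy` and the helper `vif_from_correlation` are hypothetical names.

```python
import numpy as np
import pandas as pd


def vif_from_correlation(frame):
    """VIF of each column, read off the diagonal of the inverse correlation matrix."""
    corr = frame.corr().to_numpy()
    return pd.Series(np.diag(np.linalg.inv(corr)), index=frame.columns)


rng = np.random.default_rng(42)
n = 500
sys_bp = rng.normal(130, 20, n)

# Synthetic features: diaBP is built to correlate with sysBP, heartRate is independent
toy = pd.DataFrame({
    "sysBP": sys_bp,
    "diaBP": 0.5 * sys_bp + rng.normal(0, 5, n),
    "heartRate": rng.normal(75, 10, n),
})

vif = vif_from_correlation(toy)
print(vif.round(2))
```

A VIF close to 1 indicates a feature that is nearly uncorrelated with the others, while larger values flag features that are largely explained by the rest, as with the two blood pressure columns here.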

#### Chart - 12

In [None]:
# Chart - 12 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 13

In [None]:
# Chart - 13 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Answer Here.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

## ***6. Feature Engineering & Data Pre-processing***

In [None]:
#creating copy of data frame
df_cpy=df.copy()

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
# features with less than 5% null values
nan_columns = ['education', 'cigsPerDay', 'BPMeds', 'totChol', 'BMI', 'heartRate']

# dropping null values
df_cpy.dropna(subset=nan_columns, inplace=True)

#glucose has ~8% null values
df_cpy['glucose'] = df_cpy.glucose.fillna(df_cpy.glucose.median())

df_cpy.isna().sum().sum()

#### What all missing value imputation techniques have you used and why did you use those techniques?

Typically, null values are replaced using information from other records. However, the entries in this dataset are person-specific medical measurements that vary from person to person. Consequently, removing rows with null values is the most conservative way of dealing with them.

We cannot take unnecessary risks with this prediction: if we impute null values, even with advanced methods, the imputed values may be incorrect and may affect the outcome.

In the healthcare industry, every piece of data is crucial, so we do not want to discard too many records either. We therefore set a threshold: if a feature has less than 5% null values, we drop the affected rows; otherwise we impute the feature, which affects prediction but not significantly. Glucose has ~8% null values and contains outliers, so we replace its null values with the median.
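
The threshold rule described above can be sketched as a small helper. This is an illustration under stated assumptions: `handle_missing` and the frame `toy` are hypothetical, all columns are assumed numeric, and the 5% threshold is passed as a parameter rather than hard-coded per column as in the notebook.

```python
import numpy as np
import pandas as pd


def handle_missing(frame, threshold=0.05):
    """Drop rows for sparsely missing columns, median-impute the others."""
    null_share = frame.isna().mean()

    # Columns with some, but fewer than `threshold`, missing values: drop the rows
    drop_cols = null_share[(null_share > 0) & (null_share < threshold)].index.tolist()
    # Columns at or above the threshold: impute with the median
    impute_cols = null_share[null_share >= threshold].index.tolist()

    cleaned = frame.dropna(subset=drop_cols).copy() if drop_cols else frame.copy()
    for col in impute_cols:
        cleaned[col] = cleaned[col].fillna(cleaned[col].median())
    return cleaned


# Hypothetical data: BMI has 1 of 40 values missing (2.5%), glucose has 4 of 40 (10%)
toy = pd.DataFrame({
    "BMI": [25.0] * 39 + [np.nan],
    "glucose": [np.nan] * 4 + [80.0] * 36,
})

cleaned = handle_missing(toy)
print(len(toy), "->", len(cleaned), "rows;",
      int(cleaned.isna().sum().sum()), "nulls left")
```

The median is used for imputation because it is less sensitive to outliers than the mean.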


### 2. Handling Outliers

In [None]:
df_cpy.describe().T

As can be seen in the statistical summary of the numerical features, there is a significant difference between the 75th percentile and the maximum value, indicating that the dataset contains skewness and outliers.

In [None]:
# figsize
plt.figure(figsize=(10,5))
# boxplot of numerical features
sns.boxplot(data=df_cpy[numeric_cols])
plt.show()

Many outliers are present, and since we have a limited number of data points, we do not simply remove them; instead, we use the clipping method.

In [None]:
# Handling Outliers & Outlier treatments
def clip_outliers(risk_df):
    for col in risk_df[numeric_cols]:
        # using IQR method to define range of upper and lower limit.
        q1 = risk_df[col].quantile(0.25)
        q3 = risk_df[col].quantile(0.75)
        iqr = q3 - q1
        lower_bound = q1 - 1.5 * iqr
        upper_bound = q3 + 1.5 * iqr

        # replacing the outliers with upper and lower bound
        risk_df[col] = risk_df[col].clip(lower_bound, upper_bound)
    return risk_df

In [None]:
# using the function to treat outliers
df_cpy = clip_outliers(df_cpy)

In [None]:
#BoxPlot after clipping outliers
plt.figure(figsize=(10,5))
sns.boxplot(data=df_cpy[numeric_cols])
plt.show()

##### What all outlier treatment techniques have you used and why did you use those techniques?

We have implemented the **clipping method**. In this method, we put a cap on the outliers: any value that falls outside a specified range is treated as an outlier and replaced with the nearest bound of that range, i.e. the lower limit or the upper limit.
Here we have set the lower limit to `Q1 - 1.5 * IQR` and the upper limit to `Q3 + 1.5 * IQR`, where Q1 and Q3 are the 25th and 75th percentiles and `IQR = Q3 - Q1`.
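As a small worked example of these IQR limits, the sketch below clips a made-up series. The values are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
import pandas as pd

# Hypothetical values with one obvious outlier (100)
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100], dtype=float)

q1 = values.quantile(0.25)        # 3.0
q3 = values.quantile(0.75)        # 7.0
iqr = q3 - q1                     # 4.0
lower_bound = q1 - 1.5 * iqr      # -3.0
upper_bound = q3 + 1.5 * iqr      # 13.0

# Values outside [lower_bound, upper_bound] are replaced by the nearest bound
clipped = values.clip(lower_bound, upper_bound)
print(clipped.tolist())
```

Only the value 100 is changed (to 13.0); every other point lies inside the range and is kept as-is, so no rows are lost.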

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns
df_cpy['sex']=df_cpy['sex'].map({'M':1,'F':0})
df_cpy['is_smoking']=df_cpy['is_smoking'].map({'YES':1,'NO':0})

In [None]:
# check the datatypes of each column in the DataFrame
df_cpy.dtypes

In [None]:
# one-hot encode the 'education' feature
education_onehot = pd.get_dummies(df_cpy['education'], prefix='education',drop_first=True)

# drop the original education feature
df_cpy.drop('education', axis=1, inplace=True)

# concatenate the one-hot encoded education feature with the rest of the data
df_cpy = pd.concat([df_cpy, education_onehot], axis=1)
df_cpy.head(3)

#### What all categorical encoding techniques have you used & why did you use those techniques?

'BPMeds', 'prevalentStroke', 'prevalentHyp', 'diabetes' and 'TenYearCHD' are categorical features that already hold numeric values. We mapped the "sex" and "is_smoking" columns to 0/1. As the "education" column has 4 unique values, we treated it as a categorical feature and one-hot encoded it, dropping the first level to avoid a redundant column.

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)<BR>
**As in dataset we have no textual data so we have skipped this step.**

#### 1. Expand Contraction

In [None]:
# Expand Contraction

#### 2. Lower Casing

In [None]:
# Lower Casing

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

#### 4. Removing URLs & Removing words and digits contain digits.

In [None]:
# Remove URLs & Remove words and digits contain digits

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

In [None]:
# Remove White spaces

#### 6. Rephrase Text

In [None]:
# Rephrase Text

#### 7. Tokenization

In [None]:
# Tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

##### Which text normalization technique have you used and why?

Answer Here.

#### 9. Part of speech tagging

In [None]:
# POS Taging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features
# Create a new feature, Pulse_Pressure (systolic minus diastolic blood pressure), from sysBP and diaBP
df_cpy['Pulse_Pressure']=df_cpy['sysBP']-df_cpy['diaBP']
# Dropping the sysBP and diaBP columns
df_cpy.drop(['sysBP','diaBP'],axis=1,inplace=True)

In [None]:
# Checking whether the provided information is consistent (smokers recorded with 0 cigarettes per day)
df_cpy[(df_cpy.is_smoking == 1) & (df_cpy.cigsPerDay == 0)]
# Dropping the is_smoking column due to multicollinearity
df_cpy.drop('is_smoking', axis=1, inplace=True)

In [None]:
#Dropping ID column as not relevant
df_cpy.drop('id',axis=1,inplace=True)

In [None]:
#updating the numeric and categorical columns list
numeric_cols.remove('sysBP')
numeric_cols.remove('diaBP')
cate_cols.remove('is_smoking')
numeric_cols.append('Pulse_Pressure')


#### 2. Feature Selection

In [None]:
# plotting correlation heatmap to check multicollinearity.
plt.figure(figsize=(15,4))
sns.heatmap(df_cpy.corr(),annot=True)

In [None]:
# Calculating VIF
def calc_vif(X):
   vif = pd.DataFrame()
   vif["variables"] = X.columns
   vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
   return(vif)

In [None]:
#Checking VIF of all the columns after excluding high VIF columns
calc_vif(df_cpy[[i for i in df_cpy.describe().columns  if i not in['glucose','BMI','heartRate','totChol','Pulse_Pressure']]])

In [None]:
#Removing high VIF columns and create a new data frame named df_removed
df_removed=df_cpy.drop(['Pulse_Pressure','glucose','BMI','totChol','heartRate'],axis=1)

In [None]:
#updating the numeric column list
del numeric_cols[2:]
numeric_cols

In [None]:
df_removed.shape

##### What all feature selection methods have you used  and why?

First, we checked the VIF of all features and removed the features that have a high VIF (greater than 10) and are less important with respect to the target variable. This step is necessary because removing multicollinearity helps protect the model from overfitting.
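To make the VIF idea concrete, here is a minimal NumPy sketch that computes the VIF of each column as 1 / (1 - R²), where R² comes from regressing that column on all the others. It is an illustration rather than the `calc_vif` function used in this notebook: the helper `vif_scores` and the random data are hypothetical, and the sketch adds an intercept itself, which may differ from how `variance_inflation_factor` is applied above.

```python
import numpy as np

def vif_scores(X):
    """Return the VIF of each column of a 2-D array as 1 / (1 - R^2)."""
    X = np.asarray(X, dtype=float)
    scores = []
    for i in range(X.shape[1]):
        target = X[:, i]
        others = np.delete(X, i, axis=1)
        # Regress column i on the remaining columns plus an intercept
        design = np.column_stack([np.ones(len(X)), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residual = target - design @ coef
        r2 = 1.0 - residual.var() / target.var()
        scores.append(1.0 / (1.0 - r2))
    return np.array(scores)

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)                  # independent of a
c = a + 0.1 * rng.normal(size=200)        # nearly a copy of a

vifs = vif_scores(np.column_stack([a, b, c]))
print(np.round(vifs, 1))
```

The two nearly identical columns get a large VIF, while the independent column stays close to 1, which is the pattern used to decide which features to drop.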

##### Which all features you found important and why?

After removing collinear features, the data frame is left with 11 features, which are used for building the model. Of these, "age" is the most important feature, being highly correlated with the target variable.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data
# Plot the distribution of each numeric column to inspect its shape and skewness
for col in numeric_cols:
  fig=plt.figure(figsize=(4,5))
  ax=fig.gca()
  feature=df_removed[col]
  # histplot with a KDE overlay replaces the deprecated sns.distplot
  sns.histplot(feature, kde=True, stat='density', ax=ax)
  ax.axvline(feature.mean(),color='magenta', linestyle='dashed', linewidth=2)
  ax.axvline(feature.median(),color='cyan', linestyle='dashed', linewidth=2)
  ax.set_title(col+' '+str(feature.skew()))
plt.show()

In [None]:
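# Log-transform cigsPerDay (log1p handles the zero values)
df_removed['cigsPerDay']=np.log1p(df_removed['cigsPerDay'])
# Plot the transformed column from df_removed, so the plot matches the skewness in the title
fig=plt.figure(figsize=(4,5))
ax=fig.gca()
sns.histplot(df_removed['cigsPerDay'], kde=True, stat='density', ax=ax)
ax.axvline(df_removed['cigsPerDay'].mean(),color='magenta', linestyle='dashed', linewidth=2)
ax.axvline(df_removed['cigsPerDay'].median(),color='cyan', linestyle='dashed', linewidth=2)
ax.set_title('cigsPerDay'+' '+str(df_removed['cigsPerDay'].skew()))
plt.show()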

After outlier clipping, all the numeric features approximately follow a Gaussian distribution, with skewness below 0.5, which is close to normal.
"cigsPerDay", however, has skewness > 1, so I applied a log transformation (`log1p`) to bring it closer to a Gaussian distribution; its skewness is thereby reduced to less than 0.5.
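The effect of the log transformation on skewness can be checked on a tiny made-up sample. The values below are hypothetical (a zero-heavy, right-skewed column loosely resembling a cigarettes-per-day count) and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed sample with several zeros
raw = pd.Series([0, 0, 0, 1, 2, 3, 5, 10, 20, 40], dtype=float)

# log1p computes log(1 + x), so zeros stay at zero instead of becoming -inf
transformed = np.log1p(raw)

skew_before = raw.skew()
skew_after = transformed.skew()
print(round(skew_before, 2), round(skew_after, 2))
```

The long right tail is compressed far more than the small values, which is why the skewness drops after the transformation.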

### 6. Data Scaling

In [None]:
#Created X and y dataset
#creating X(independent features) and y(target feature)
X_cols=df_removed.copy()
y=df_removed.TenYearCHD
X_cols.drop('TenYearCHD',axis=1,inplace=True)

In [None]:
# Scaling your data
from sklearn.preprocessing import StandardScaler
# Note: the scaler is fitted on the full feature set, before the train/test split
scalar=StandardScaler()
X=scalar.fit_transform(X_cols)

##### Which method have you used to scale you data and why?

I have used StandardScaler to scale the independent features, as all the numeric features approximately follow a Gaussian distribution.
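As a side note, a commonly used alternative ordering is to split first and fit the scaler on the training portion only, so that test-set statistics do not influence the scaling. The sketch below shows that variant on random data; it is not the ordering used in this notebook, and all names (`X_toy`, `y_toy`, `scaler`) are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_toy = rng.normal(loc=50, scale=10, size=(100, 3))   # made-up features
y_toy = np.array([0] * 85 + [1] * 15)                 # made-up 85:15 target

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0, stratify=y_toy)

# Learn mean and standard deviation from the training rows only
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
# Apply the same training statistics to the test rows
X_te_scaled = scaler.transform(X_te)

print(X_tr_scaled.mean(axis=0).round(6), X_tr_scaled.std(axis=0).round(6))
```

The training columns end up with mean 0 and standard deviation 1, while the test columns are only approximately standardized, which is expected because they were not used to fit the scaler.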

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

As per my knowledge, for this dataset dimensionality reduction is not required.

Essentially, dimensionality reduction is needed where high dimensionality is a problem, or where reducing dimensions is itself a core step of the algorithm.

Hard rules are hard to state, other than “after you have tried it, did it improve matters”, which isn’t always the most useful guidance.

Instead, looking at why we might want to do this we can get a bit of insight. Admittedly some of the following might blur together a bit at the edges but the aim is to give a flavour.

1. Our data are too big. 4 million rows. 50,000 columns… is there a lot of redundancy there? Building a model on this could be very expensive. Even relatively simple dimension reduction techniques like PCA can capture almost all of the information in a fraction of the memory if there are strong relationships (that can be linearly approximated) in the data.

2. We are over-fitting. If you build a model with tens of thousands of degrees of freedom but don’t have a lot of examples you can easily overfit. Dimension reduction is one way of handling this, though often not the best.

3. We want to bring in external data. OK, this is a bit different but worth a note. In applications like word2vec we want to build a classifier using an embedding. We may want to classify some text into different categories but with only a limited number of examples. The complexity of free text is vast but a low dimension embedding is much smaller and will not overfit so badly in a classifier. Building a low dimensional embedding on external text, applying it to the text to be classified then building a classifier is using dimension reduction to bring in external data.

4. We suffer from the curse of dimensionality. Consider something like a nearest neighbour search. As the number of dimensions gets large we see some unwanted behaviour, especially if we are looking at things like Euclidean distances. Projecting your data to a lower dimensional space for nearest neighbour, clustering or outlier detection can be both more robust and more meaningful.

5. Some tools are all about this. Collaborative filtering through matrix factorisation is an example. Can we approximately describe behaviour as a linear combination of a smaller number of preferences/behaviours?
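To illustrate the first point in the list above (redundant columns that PCA can compress), here is a minimal sketch on synthetic data. It is not applied to this dataset; the array `X_redundant` is made up so that its three columns are noisy linear copies of one underlying signal.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(500, 1))

# Three columns driven by the same signal, plus a little independent noise
X_redundant = np.hstack([base, 2 * base, -base]) + 0.05 * rng.normal(size=(500, 3))

pca = PCA(n_components=3).fit(X_redundant)
explained = pca.explained_variance_ratio_
print(np.round(explained, 4))
```

Almost all of the variance sits in the first component, so one dimension would be enough here. With only a handful of features that are not strongly related, as in this notebook, there is much less to gain.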

In [None]:
# Dimensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
# split into 70:30 ratio
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.3, random_state = 0,stratify=y)


##### What data splitting ratio have you used and why?

There are two competing concerns: with less training data, your parameter estimates have greater variance. With less testing data, your performance statistic will have greater variance. Broadly speaking you should be concerned with dividing data such that neither variance is too high, which is more to do with the absolute number of instances in each category rather than the percentage.

If you have a total of 100 instances, you're probably stuck with cross validation as no single split is going to give you satisfactory variance in your estimates. If you have 100,000 instances, it doesn't really matter whether you choose an 80:20 split or a 90:10 split (indeed you may choose to use less training data if your method is particularly computationally intensive).

You'd be surprised to find out that 80/20 is quite a commonly occurring ratio, often referred to as the Pareto principle. It's usually a safe bet if you use that ratio.

In this case the dataset is small, which is why I chose a 70:30 ratio.
As my target variable is highly imbalanced, I have used stratified sampling so that the training and testing sets get the same proportion of the 1 and 0 classes.
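The effect of `stratify` can be verified on a made-up 85:15 target. The arrays below are hypothetical and only mimic the class ratio reported for this dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical target with 85% zeros and 15% ones
y_demo = np.array([0] * 850 + [1] * 150)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)   # placeholder feature

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo)

# Share of the positive class in each split
train_share = y_tr.mean()
test_share = y_te.mean()
print(len(y_tr), len(y_te), train_share, test_share)
```

Both splits keep roughly 15% positives, so metrics computed on the test set are based on the same class balance the model will be judged against.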

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

In [None]:
# Chart - 1 visualization code
# Dependant Column Value Counts
print(df_removed.TenYearCHD.value_counts())
print(" ")
# Dependant Variable Column Visualization
df_removed['TenYearCHD'].value_counts().plot(kind='pie',
                              figsize=(15,6),
                               autopct="%1.1f%%",
                               startangle=90,
                               shadow=True,
                               labels=['Not at Risk(%)','at Risk(%)'],
                               colors=['skyblue','red'],
                               explode=[0,0]
                              )

Here one can easily see that ~85% of persons are not at risk and only ~15% are at risk, so the ratio is 85:15, which indicates an imbalanced dataset.

In [None]:
# Handling Imbalanced Dataset (If needed)
# Handling class imbalance by SMOTE oversampling followed by removing Tomek links
X_smote, y_smote = SMOTETomek(random_state=42).fit_resample(X_train, y_train)
# Checking value counts for both classes before and after handling class imbalance:
for target,label in [[y_train,"Before"],[y_smote,'After']]:
  print(label+' Handling Class Imbalance:')
  print(target.value_counts(),'\n')

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

I have used SMOTE (Synthetic Minority Over-sampling Technique) followed by Tomek link removal to balance the 85:15 dataset.

SMOTE is a technique in machine learning for dealing with issues that arise when working with an unbalanced data set. In practice, unbalanced data sets are common and most ML algorithms are sensitive to class imbalance, so we need to improve their performance by using techniques like SMOTE.

To address this disparity, balancing schemes that augment the data to make it more balanced before training the classifier were proposed. Oversampling the minority class by duplicating minority samples or undersampling the majority class is the simplest balancing method.

The idea of incorporating synthetic minority samples into tabular data was first proposed in SMOTE, where synthetic minority samples are generated by interpolating pairs of original minority points.

SMOTE is a data augmentation algorithm that creates synthetic data points from the original data. It can be thought of as a more sophisticated version of oversampling.

SMOTE has the advantage of not creating duplicate data points, but rather synthetic data points that differ slightly from the original ones.

Because of these advantages, I have used the SMOTE technique to balance the dataset.
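The interpolation idea behind SMOTE can be sketched in a few lines of NumPy. This is a simplified illustration only and not the `imblearn` implementation used above: the helper `interpolate_minority`, its parameters and the four minority points are all hypothetical.

```python
import numpy as np

def interpolate_minority(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic points between minority samples and their neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))                    # pick a minority point
        dists = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbours = np.argsort(dists)[1:k + 1]         # k nearest, skipping itself
        j = rng.choice(neighbours)                      # pick one neighbour
        lam = rng.random()                              # position along the segment
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(synthetic)

# Made-up minority-class points in two dimensions
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.5], [3.0, 3.0]])
new_points = interpolate_minority(minority, n_new=5)
print(new_points.round(2))
```

Each synthetic point lies on the line segment between two existing minority points, so it is new but stays inside the region already occupied by the minority class.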

## ***7. ML Model Implementation***

In [None]:
# Defining a function to train the input model and plot its evaluation metrics
def analyse_model(model, X_train, X_test, y_train, y_test):

  '''Takes a classifier model and the train/test splits as input, plots the
  evaluation metrics and returns the fitted model'''

  # Fitting the model
  model.fit(X_train,y_train)

  # Feature importances
  try:
    try:
      importance = model.feature_importances_
      feature = features
    except:
      importance = np.abs(model.coef_[0])
      feature = features
    indices = np.argsort(importance)
    indices = indices[::-1]
  except:
    pass

  # Plotting Evaluation Metrics for train and test dataset
  for x, act, label in ((X_train, y_train, 'Train-Set'),(X_test, y_test, "Test-Set")):

    # Getting required metrics
    pred = model.predict(x)
    pred_proba = model.predict_proba(x)[:,1]
    report = pd.DataFrame(classification_report(y_true=act,y_pred=pred, output_dict=True))
    fpr, tpr, thresholds = roc_curve(act, pred_proba)

    # Classification report
    plt.figure(figsize=(18,3))
    plt.subplot(1,3,1)
    sns.heatmap(report.iloc[:-1, :-1].T, annot=True, cmap='coolwarm')
    plt.title(f'{label} Report')

    # Confusion Matrix
    plt.subplot(1,3,2)
    sns.heatmap(confusion_matrix(y_true=act, y_pred=pred), annot=True, cmap='coolwarm')
    plt.title(f'{label} Confusion Matrix')
    plt.xlabel('Predicted labels')
    plt.ylabel('Actual labels')

    # AUC_ROC Curve
    plt.subplot(1,3,3)
    plt.plot([0,1],[0,1],'k--')
    plt.plot(fpr,tpr,label=f'AUC = {np.round(np.trapz(tpr,fpr),3)}')
    plt.legend(loc=4)
    plt.title(f'{label} AUC_ROC Curve')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.tight_layout()

  # Plotting Feature Importance
  try:
    plt.figure(figsize=(21,3))
    plt.bar(range(len(indices)),importance[indices])
    plt.xticks(range(len(indices)), [feature[i] for i in indices])
    plt.title('Feature Importance')
    plt.tight_layout()
  except:
    pass
  plt.show()

  return model

### ML Model - 1

In [None]:
# ML Model - 1 Implementation
lr = LogisticRegression(fit_intercept=True, max_iter=10000)
analyse_model(lr, X_smote, X_test, y_smote, y_test)

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
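# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
# Define the hyperparameter grid
# The grid is split by penalty because only 'liblinear' and 'saga' support l1;
# pairing l1 with 'newton-cg', 'lbfgs' or 'sag' makes those fits fail

C_values = [100,10,1,0.1,0.01,0.001,0.0001]
param_grid = [{'C': C_values, 'penalty': ['l1'], 'solver': ['liblinear', 'saga']},
              {'C': C_values, 'penalty': ['l2'],
               'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']}]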

# Initializing the logistic regression model
logreg = LogisticRegression(fit_intercept=True, max_iter=10000, random_state=0)
# Using GridSearchCV to tune the hyperparameters using cross-validation
grid = GridSearchCV(logreg, param_grid, cv=5)
# Fit the Algorithm
grid.fit(X_smote, y_smote)
# Select the best hyperparameters found by GridSearchCV
best_params = grid.best_params_
print("Best hyperparameters: ", best_params)
# Predict on the model
# Initiate model with best parameters
lr_model2 = LogisticRegression(C=best_params['C'],
                                  penalty=best_params['penalty'],
                                  solver=best_params['solver'],
                                  max_iter=10000, random_state=0)
analyse_model(lr_model2, X_smote, X_test, y_smote, y_test)

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

In [None]:
# ML Model - 2 Implementation

rf_model = RandomForestClassifier(random_state=2)
# Fit the Algorithm
# Predict on the model
# Making predictions on train and test data
analyse_model(rf_model, X_smote, X_test, y_smote, y_test)

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart


#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
# Hyperparameter Grid
grid = {'n_estimators' : [100,150],
        'max_depth' : [4,6,8],
        'min_samples_split' : [50,80],
        'min_samples_leaf' : [46,60]}

# GridSearch to find the best parameters
rf = GridSearchCV(rf_model, param_grid = grid, scoring = scoring, cv=5)
# Fit the Algorithm
rf.fit(X_smote, y_smote)

# Analysing the model with best set of parametes
analyse_model(rf.best_estimator_, X_smote, X_test, y_smote, y_test)
# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact pf the ML model used.

Answer Here.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation
from xgboost import XGBClassifier
clf = XGBClassifier(random_state=2)
# Fit the Algorithm
# Predict on the model
# Making predictions on train and test data
analyse_model(clf, X_smote, X_test, y_smote, y_test)

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
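# Visualizing evaluation Metric Score chart
# Class predictions and positive-class probabilities from the XGBoost model fitted above
train_class_preds = clf.predict(X_smote)
test_class_preds = clf.predict(X_test)
train_proba = clf.predict_proba(X_smote)[:,1]
test_proba = clf.predict_proba(X_test)[:,1]

# Calculating accuracy on train and test
train_accuracy = accuracy_score(y_smote,train_class_preds)
test_accuracy = accuracy_score(y_test,test_class_preds)
print("Train accuracy:", train_accuracy)
print("Test accuracy:", test_accuracy)

# classification_report expects (y_true, y_pred) in that order
print(classification_report(y_smote, train_class_preds))
print(" ")
print("roc_auc_score")
print(roc_auc_score(y_smote, train_proba))

print(classification_report(y_test, test_class_preds))
print(" ")
print("roc_auc_score")
print(roc_auc_score(y_test, test_proba))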

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
# HYperparameter Grid
grid = {'n_estimators' : [50,80,100],
        'max_depth' : [4,6,8],
        'eta' : [0.05,0.08,0.1]
        }

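# Fit the Algorithm
# GridSearch to find the best parameters
# scoring takes a scorer name such as 'roc_auc'; the raw roc_auc_score metric
# function does not have the (estimator, X, y) signature a scorer needs
xgb = GridSearchCV(clf, param_grid = grid, scoring = 'roc_auc', cv=5,verbose=2)
xgb.fit(X_smote, y_smote)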
# Predict on the model
# Analysing the model with best set of parametes
analyse_model(xgb.best_estimator_, X_smote, X_test, y_smote, y_test)


##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.

### ML Model - 4

In [None]:
# ML Model - 4 Implementation
# SVM algorithm
clf = SVC(random_state= 0,probability=True)
# Fit the Algorithm
# Predict on the model
# Making predictions on train and test data
analyse_model(clf, X_smote, X_test, y_smote, y_test)

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 4 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
# Hyperparameter Grid
grid = {'kernel': ["linear","rbf","poly","sigmoid"],
        'C': [0.1, 1, 10, 100],
        'max_iter' : [100,1000]}

# GridSearch to find the best parameters
svc = GridSearchCV(clf, param_grid = grid, scoring = scoring, cv=5,verbose=2)
svc.fit(X_smote, y_smote)
# Analysing the model with best set of parametes
analyse_model(svc.best_estimator_, X_smote, X_test, y_smote, y_test)

### ML Model - 5

In [None]:
# Classifier
from sklearn.neighbors import KNeighborsClassifier
knn_clf = KNeighborsClassifier()
analyse_model(knn_clf, X_smote, X_test, y_smote, y_test)

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# HYperparameter Grid
grid = {'n_neighbors' : [5,7,9],
        'metric' : ['minkowski','euclidean','manhattan']}

# GridSearch to find the best parameters
knn = GridSearchCV(knn_clf, param_grid = grid, scoring = scoring, cv=5,verbose=1)
knn.fit(X_smote, y_smote)

# Analysing the model with best set of parametes
analyse_model(knn.best_estimator_, X_smote, X_test, y_smote, y_test)


In [None]:
# Fitting Naive Bayes Classifier
from sklearn.naive_bayes import GaussianNB
nbc = GaussianNB()
analyse_model(nbc, X_smote, X_test, y_smote, y_test)

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
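A minimal sketch of the save-and-reload round trip described in the two steps above is given below. It uses a stand-in model trained on random data, since the final model is not named in this section; `demo_model`, the file name `demo_model.pkl` and the arrays are all hypothetical.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in model on made-up data (not the notebook's final model)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 4))
y_demo = (X_demo[:, 0] > 0).astype(int)
demo_model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")
    # Save the fitted model to disk
    with open(path, "wb") as fh:
        pickle.dump(demo_model, fh)
    # Load it back as a separate object
    with open(path, "rb") as fh:
        restored = pickle.load(fh)

# Sanity check: both objects must agree on unseen rows
X_unseen = rng.normal(size=(5, 4))
same_predictions = np.array_equal(demo_model.predict(X_unseen),
                                  restored.predict(X_unseen))
print(same_predictions)
```

Any preprocessing applied before training (for example the fitted scaler) would need to be saved and reapplied in the same way before predicting on new records.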

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***