
# **Project Name**    - Cardiovascular Risk Prediction



##### **Project Type**    - Classification
##### **Contribution**    - Individual
##### **Name**            - Srishti Singhal

# **Project Summary -**

The aim of this project was to build a predictive system to identify individuals at risk of developing Coronary Heart Disease (CHD) within 10 years, using clinical and lifestyle data. The dataset included demographic, behavioral, and health-related features such as age, blood pressure, cholesterol, BMI, glucose, smoking status, and more.

Key Steps in the Project:

1. Data Preprocessing:
* Handled missing values using context-specific imputation techniques (mode, median, and conditional logic based on smoking status).
* Encoded categorical features using Label Encoding (for binary variables like sex, is_smoking).
* Scaled numerical features to standardize model input.

2. Exploratory Data Analysis:
* Visualized relationships between CHD and age, gender, smoking habits, blood pressure, glucose levels, and BMI using Plotly charts.
* Key insights revealed strong links between CHD and high blood pressure, smoking, diabetes, and obesity.

3. Handling Class Imbalance:
* The dataset had imbalance in the target variable (TenYearCHD), which was addressed using SMOTE (Synthetic Minority Oversampling Technique) to balance the classes before training.

4. Model Building & Evaluation:
* Trained multiple models: Logistic Regression, Random Forest, XGBoost and Decision Tree.
* Performed Hyperparameter Tuning using GridSearchCV for optimal model performance.
* Used metrics like Precision, Recall, F1 Score, and ROC-AUC to evaluate model performance.

5. Model Interpretability:
* Used SHAP (SHapley Additive exPlanations) to interpret feature importance and understand model decisions.
* Identified age, cigsPerDay, systolic blood pressure, diastolic blood pressure, prevalent hypertension, and cholesterol as key predictors of CHD.
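The evaluation metrics named in step 4 can be made concrete with a small sketch. The confusion-matrix counts below are hypothetical (they are not results from this project), and `classification_metrics` is an illustrative helper, not part of the notebook, which imports the equivalent scikit-learn functions instead.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute precision, recall, F1 and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Hypothetical counts for an imbalanced test set (positive class = CHD risk)
metrics = classification_metrics(tp=60, fp=90, fn=40, tn=490)
for name, value in metrics.items():
    print(f"{name:<9} {value:.3f}")
```

With counts like these, accuracy can look comfortable while precision stays low, which is why recall and F1 are tracked alongside accuracy on an imbalanced target.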

# **GitHub Link -**

https://github.com/Srishti6125/Cardiovascular-Risk-Prediction

# **Problem Statement**


Cardiovascular diseases, especially coronary heart disease (CHD), account for a large share of deaths worldwide. Timely detection is crucial to effective prevention and management. Existing diagnostic methods commonly rely on expensive and invasive procedures, which impede early detection, particularly in resource-limited settings.

The aim of this project is to build a machine-learning model that predicts whether a patient is at risk of developing CHD within the next 10 years, using readily available clinical and lifestyle characteristics. By providing interpretable, data-driven insights, the project intends to help healthcare professionals and individuals make informed, proactive decisions about preventive measures and timely intervention.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

In [152]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
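Since the guidelines ask for a notebook that runs end to end without errors, the Drive mount can be guarded so the cell also works outside Colab. This is a minimal sketch under that assumption; `running_in_colab` is a hypothetical helper name, not something the notebook defines.

```python
import importlib.util


def running_in_colab() -> bool:
    """Return True when the google.colab package can be located."""
    try:
        return importlib.util.find_spec("google.colab") is not None
    except ModuleNotFoundError:
        # The parent 'google' package is missing, so this is not Colab
        return False


if running_in_colab():
    from google.colab import drive
    drive.mount('/content/drive')
else:
    print("Not running in Colab; skipping Drive mount.")
```

Outside Colab the dataset path used later would also need to point at a local copy of the CSV.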


## ***1. Know Your Data***

### Import Libraries

In [153]:
# Import Libraries

import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE

from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import confusion_matrix

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
from xgboost import XGBClassifier
from sklearn.tree import DecisionTreeClassifier

from sklearn.model_selection import GridSearchCV

import shap

### Dataset Loading

In [154]:
# Load Dataset

df = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/dataset/Copy of data_cardiovascular_risk.csv")

### Dataset First View

In [155]:
# Dataset First Look

df.head()

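|   | id | age | education | sex | is_smoking | cigsPerDay | BPMeds | prevalentStroke | prevalentHyp | diabetes | totChol | sysBP | diaBP | BMI | heartRate | glucose | TenYearCHD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 64 | 2.0 | F | YES | 3.0 | 0.0 | 0 | 0 | 0 | 221.0 | 148.0 | 85.0 | NaN | 90.0 | 80.0 | 1 |
| 1 | 1 | 36 | 4.0 | M | NO | 0.0 | 0.0 | 0 | 1 | 0 | 212.0 | 168.0 | 98.0 | 29.77 | 72.0 | 75.0 | 0 |
| 2 | 2 | 46 | 1.0 | F | YES | 10.0 | 0.0 | 0 | 0 | 0 | 250.0 | 116.0 | 71.0 | 20.35 | 88.0 | 94.0 | 0 |
| 3 | 3 | 50 | 1.0 | M | YES | 20.0 | 0.0 | 0 | 1 | 0 | 233.0 | 158.0 | 88.0 | 28.26 | 68.0 | 94.0 | 1 |
| 4 | 4 | 64 | 1.0 | F | YES | 30.0 | 0.0 | 0 | 0 | 0 | 241.0 | 136.5 | 85.0 | 26.42 | 70.0 | 77.0 | 0 |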


### Dataset Rows & Columns count

In [156]:
# Dataset Rows & Columns count

df.shape

(3390, 17)

### Dataset Information

In [157]:
# Dataset Info

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3390 entries, 0 to 3389
Data columns (total 17 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   id               3390 non-null   int64  
 1   age              3390 non-null   int64  
 2   education        3303 non-null   float64
 3   sex              3390 non-null   object 
 4   is_smoking       3390 non-null   object 
 5   cigsPerDay       3368 non-null   float64
 6   BPMeds           3346 non-null   float64
 7   prevalentStroke  3390 non-null   int64  
 8   prevalentHyp     3390 non-null   int64  
 9   diabetes         3390 non-null   int64  
 10  totChol          3352 non-null   float64
 11  sysBP            3390 non-null   float64
 12  diaBP            3390 non-null   float64
 13  BMI              3376 non-null   float64
 14  heartRate        3389 non-null   float64
 15  glucose          3086 non-null   float64
 16  TenYearCHD       3390 non-null   int64  
dtypes: float64(9),

#### Duplicate Values

In [158]:
# Dataset Duplicate Value Count

df.duplicated().sum()

np.int64(0)

#### Missing Values/Null Values

In [159]:
# Missing Values/Null Values Count

df.isnull().sum()

Unnamed: 0,0
id,0
age,0
education,87
sex,0
is_smoking,0
cigsPerDay,22
BPMeds,44
prevalentStroke,0
prevalentHyp,0
diabetes,0


In [160]:
# Visualizing the missing values

fig = px.imshow(df.isnull(), text_auto=True, height=500, width=1300,
    color_continuous_scale='Mint', aspect='auto',
    title='<b>Missing Data Heatmap</b>')
fig.update_layout(title_x=0.5)
fig.show()

### What did you know about your dataset?

The dataset contains 3390 rows and 17 columns in total.

It consists of the following columns, grouped by data type:

* int64 - id, age, prevalentStroke, prevalentHyp, diabetes, TenYearCHD
* float64 - education, cigsPerDay, BPMeds, totChol, sysBP, diaBP, BMI, heartRate, glucose
* object - sex, is_smoking

Null values are present in the columns 'education', 'cigsPerDay', 'BPMeds', 'totChol', 'BMI', 'heartRate' and 'glucose'.

There are no duplicate values present in dataset.
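To put the null counts in proportion, the share of missing values per column can be derived from the non-null counts printed by `df.info()`. The sketch below uses only those printed numbers, so it runs without the dataset.

```python
TOTAL_ROWS = 3390

# Non-null counts copied from the df.info() output above
non_null = {
    "education": 3303, "cigsPerDay": 3368, "BPMeds": 3346, "totChol": 3352,
    "BMI": 3376, "heartRate": 3389, "glucose": 3086,
}

# Missing count and percentage per column
missing = {
    col: (TOTAL_ROWS - n, round(100 * (TOTAL_ROWS - n) / TOTAL_ROWS, 2))
    for col, n in non_null.items()
}

# Print the columns with the largest gaps first
for col, (count, pct) in sorted(missing.items(), key=lambda kv: -kv[1][0]):
    print(f"{col:<11} {count:>4} missing ({pct}%)")
```

On the loaded frame itself, `df.isnull().mean() * 100` should give the same percentages directly.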

## ***2. Understanding Your Variables***

In [161]:
# Dataset Columns

df.columns

Index(['id', 'age', 'education', 'sex', 'is_smoking', 'cigsPerDay', 'BPMeds',
       'prevalentStroke', 'prevalentHyp', 'diabetes', 'totChol', 'sysBP',
       'diaBP', 'BMI', 'heartRate', 'glucose', 'TenYearCHD'],
      dtype='object')

In [162]:
# Dataset Describe

df.describe()

Unnamed: 0,id,age,education,cigsPerDay,BPMeds,prevalentStroke,prevalentHyp,diabetes,totChol,sysBP,diaBP,BMI,heartRate,glucose,TenYearCHD
count,3390.0,3390.0,3303.0,3368.0,3346.0,3390.0,3390.0,3390.0,3352.0,3390.0,3390.0,3376.0,3389.0,3086.0,3390.0
mean,1694.5,49.542183,1.970936,9.069477,0.029886,0.00649,0.315339,0.025664,237.074284,132.60118,82.883038,25.794964,75.977279,82.08652,0.150737
std,978.753033,8.592878,1.019081,11.879078,0.170299,0.080309,0.464719,0.158153,45.24743,22.29203,12.023581,4.115449,11.971868,24.244753,0.357846
min,0.0,32.0,1.0,0.0,0.0,0.0,0.0,0.0,107.0,83.5,48.0,15.96,45.0,40.0,0.0
25%,847.25,42.0,1.0,0.0,0.0,0.0,0.0,0.0,206.0,117.0,74.5,23.02,68.0,71.0,0.0
50%,1694.5,49.0,2.0,0.0,0.0,0.0,0.0,0.0,234.0,128.5,82.0,25.38,75.0,78.0,0.0
75%,2541.75,56.0,3.0,20.0,0.0,0.0,1.0,0.0,264.0,144.0,90.0,28.04,83.0,87.0,0.0
max,3389.0,70.0,4.0,70.0,1.0,1.0,1.0,1.0,696.0,295.0,142.5,56.8,143.0,394.0,1.0


### Variables Description

* id - Identifier representing each unique row
* age - Age of the individual
* education - Educational level
* sex - Gender (M/F)
* is_smoking - Smoking status (yes/no)
* cigsPerDay - Cigarettes smoked per day
* BPMeds - Blood pressure medication usage
* prevalentStroke - History of stroke (0 or 1)
* prevalentHyp - Prevalent hypertension (0 or 1)
* diabetes - Diabetes status (0 or 1)
* totChol - Total cholesterol level
* sysBP - Systolic blood pressure
* diaBP - Diastolic blood pressure
* BMI - Body Mass Index
* heartRate - Heart rate (pulse)
* glucose - Glucose level
* TenYearCHD - Risk of coronary heart disease in next 10 years (0/1)

### Check Unique Values for each variable.

In [163]:
# Check Unique Values for each variable.

df.nunique()

Unnamed: 0,0
id,3390
age,39
education,4
sex,2
is_smoking,2
cigsPerDay,32
BPMeds,2
prevalentStroke,2
prevalentHyp,2
diabetes,2


## 3. ***Data Wrangling***

### Data Wrangling Code

In [164]:
# Write your code to make your dataset analysis ready.

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3390 entries, 0 to 3389
Data columns (total 17 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   id               3390 non-null   int64  
 1   age              3390 non-null   int64  
 2   education        3303 non-null   float64
 3   sex              3390 non-null   object 
 4   is_smoking       3390 non-null   object 
 5   cigsPerDay       3368 non-null   float64
 6   BPMeds           3346 non-null   float64
 7   prevalentStroke  3390 non-null   int64  
 8   prevalentHyp     3390 non-null   int64  
 9   diabetes         3390 non-null   int64  
 10  totChol          3352 non-null   float64
 11  sysBP            3390 non-null   float64
 12  diaBP            3390 non-null   float64
 13  BMI              3376 non-null   float64
 14  heartRate        3389 non-null   float64
 15  glucose          3086 non-null   float64
 16  TenYearCHD       3390 non-null   int64  
dtypes: float64(9),

In [165]:
# dropping 'id' column as it is of no use

df.drop(['id'],inplace=True,axis=1)

In [166]:
# checking for any null values if present

df.isnull().sum()

Unnamed: 0,0
age,0
education,87
sex,0
is_smoking,0
cigsPerDay,22
BPMeds,44
prevalentStroke,0
prevalentHyp,0
diabetes,0
totChol,38


In [167]:
# 'education' is a categorical column; replace its null values with the mode
# (assigning the result back avoids the pandas chained-assignment warning)

df['education'] = df['education'].fillna(df['education'].mode()[0])





In [168]:
# 'BPMeds','totChol','BMI','heartRate','glucose' are numerical columns; replace their null values with the median
# (assigning the result back avoids the pandas chained-assignment warning)

col=['BPMeds','totChol','BMI','heartRate','glucose']
for i in col:
  df[i] = df[i].fillna(df[i].median())





In [169]:
# 'cigsPerDay' nulls are filled according to 'is_smoking': 0 when 'is_smoking' is 'NO', otherwise the median among smokers

df.loc[(df['is_smoking'] == 'NO') & (df['cigsPerDay'].isnull()), 'cigsPerDay'] = 0

median_smoker = df.loc[df['is_smoking'] == 'YES', 'cigsPerDay'].median()
df['cigsPerDay'] = df['cigsPerDay'].fillna(median_smoker)
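The three imputation rules used above (mode, median, and the smoking-dependent fill) can be checked on a toy frame. The values below are hypothetical and only the column names match the dataset; the sketch assigns results back to the frame instead of using `inplace=True` on a column.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with the same column names as the dataset
toy = pd.DataFrame({
    "education":  [1.0, 1.0, 2.0, np.nan],
    "glucose":    [70.0, np.nan, 90.0, 80.0],
    "is_smoking": ["NO", "YES", "YES", "NO"],
    "cigsPerDay": [np.nan, 10.0, np.nan, 0.0],
})

# Mode for the ordinal column, median for the continuous one
toy = toy.fillna({
    "education": toy["education"].mode()[0],
    "glucose": toy["glucose"].median(),
})

# Non-smokers with a missing count get 0; remaining gaps get the smokers' median
non_smoker_gap = (toy["is_smoking"] == "NO") & toy["cigsPerDay"].isna()
toy.loc[non_smoker_gap, "cigsPerDay"] = 0
smoker_median = toy.loc[toy["is_smoking"] == "YES", "cigsPerDay"].median()
toy["cigsPerDay"] = toy["cigsPerDay"].fillna(smoker_median)

print(toy)
```

Passing a dict to `DataFrame.fillna` fills each column with its own value in one call, which keeps the per-column choices visible in one place.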

In [170]:
# checking if null values are still present

df.isnull().sum()

Unnamed: 0,0
age,0
education,0
sex,0
is_smoking,0
cigsPerDay,0
BPMeds,0
prevalentStroke,0
prevalentHyp,0
diabetes,0
totChol,0


In [171]:
# converting 'BPMeds' to integer type, as it is stored as float

df['BPMeds'] = df['BPMeds'].astype(int)

In [172]:
# youngest and oldest person with chd

print("Youngest person in records with chd : ", df[df['TenYearCHD']==1]['age'].min() )
print(" Oldest  person in records with chd : ", df[df['TenYearCHD']==1]['age'].max() )

Youngest person in records with chd :  35
 Oldest  person in records with chd :  70


In [173]:
# converting education column type into integer as it is originally in float and education cannot be half (like 1.2,...)

df['education'] = df['education'].astype(int)

In [174]:
# 'education' distribution

df['education'].value_counts()

Unnamed: 0_level_0,count
education,Unnamed: 1_level_1
1,1478
2,990
3,549
4,373


In [175]:
# 'sex' distribution

df['sex'].value_counts()

Unnamed: 0_level_0,count
sex,Unnamed: 1_level_1
F,1923
M,1467


In [176]:
# 'is_smoking' distribution

df['is_smoking'].value_counts()

Unnamed: 0_level_0,count
is_smoking,Unnamed: 1_level_1
NO,1703
YES,1687


In [177]:
# min and max no of cigg intake per day

print("Minimum number of cigarettes intake per day : ",df['cigsPerDay'].min())
print("Maximum number of cigarettes intake per day : ",df['cigsPerDay'].max())

Minimum number of cigarettes intake per day :  0.0
Maximum number of cigarettes intake per day :  70.0


In [178]:
# 'BPMeds' distribution

df['BPMeds'].value_counts()

Unnamed: 0_level_0,count
BPMeds,Unnamed: 1_level_1
0,3290
1,100


In [179]:
# 'prevalentStroke' distribution

df['prevalentStroke'].value_counts()

Unnamed: 0_level_0,count
prevalentStroke,Unnamed: 1_level_1
0,3368
1,22


In [180]:
# 'prevalentHyp' distribution

df['prevalentHyp'].value_counts()

Unnamed: 0_level_0,count
prevalentHyp,Unnamed: 1_level_1
0,2321
1,1069


In [181]:
# 'diabetes' distribution

df['diabetes'].value_counts()

Unnamed: 0_level_0,count
diabetes,Unnamed: 1_level_1
0,3303
1,87


In [182]:
# Lowest and Highest cholesterol level acc. to data

print("Lowest  cholesterol level acc. to data : ",df['totChol'].min())
print("Highest cholesterol level acc. to data : ",df['totChol'].max())

Lowest  cholesterol level acc. to data :  107.0
Highest cholesterol level acc. to data :  696.0


In [183]:
# categorizing 'sysBP' into 'BP_category' acc. to bp_category() function

def bp_category(sbp):
    if sbp < 120:
        return 'Normal'
    elif sbp < 130:
        return 'Elevated'
    elif sbp < 140:
        return 'Hypertension Stage 1'
    elif sbp < 180:
        return 'Hypertension Stage 2'
    else:
        return 'Hypertensive Crisis'

df['BP_category'] = df['sysBP'].apply(bp_category)
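As an alternative to applying a Python function row by row, the same systolic cut-offs can be expressed with `pd.cut`. This is a sketch on sample readings, not the notebook's method; `right=False` makes each bin closed on the left, which matches the `<` comparisons in `bp_category()`.

```python
import numpy as np
import pandas as pd

# Same cut-offs as bp_category(); right=False makes each bin [low, high)
bins = [-np.inf, 120, 130, 140, 180, np.inf]
labels = ['Normal', 'Elevated', 'Hypertension Stage 1',
          'Hypertension Stage 2', 'Hypertensive Crisis']

# Sample systolic readings chosen to sit on and around each boundary
sys_bp = pd.Series([83.5, 119.5, 120.0, 129.5, 130.0, 139.0, 140.0, 179.0, 180.0, 295.0])
bp_cat = pd.cut(sys_bp, bins=bins, labels=labels, right=False)

print(bp_cat.value_counts(sort=False))
```

The result is an ordered categorical, so counts and charts can keep the clinical order of the categories instead of sorting by frequency.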

In [184]:
# 'BP_category' distribution

df['BP_category'].value_counts()

Unnamed: 0_level_0,count
BP_category,Unnamed: 1_level_1
Normal,1015
Hypertension Stage 2,896
Elevated,745
Hypertension Stage 1,595
Hypertensive Crisis,139


In [185]:
# min and max values of sysBP

print("Minimum systolic BP acc. to records : ",df['sysBP'].min())
print("Maximum systolic BP acc. to records : ",df['sysBP'].max())

Minimum systolic BP acc. to records :  83.5
Maximum systolic BP acc. to records :  295.0


In [186]:
# min and max values of diaBP

print("Minimum diastolic BP acc. to records : ",df['diaBP'].min())
print("Maximum diastolic BP acc. to records : ",df['diaBP'].max())

Minimum diastolic BP acc. to records :  48.0
Maximum diastolic BP acc. to records :  142.5


In [187]:
# Lowest and Highest BMI acc. to records

print(" Lowest BMI acc. to records : ",df['BMI'].min())
print("Highest BMI  acc. to records : ",df['BMI'].max())

 Lowest BMI acc. to records :  15.96
Highest BMI  acc. to records :  56.8


In [188]:
# categorizing 'BMI' into 'BMI_category' acc. to categorize_bmi() function

def categorize_bmi(bmi):
    if bmi < 18.5:
        return 'Underweight'
    elif bmi < 25:
        return 'Normal'
    elif bmi < 30:
        return 'Overweight'
    else:
        return 'Obese'

df['BMI_category'] = df['BMI'].apply(categorize_bmi)
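One caveat with the `if`/`elif` chain: a missing BMI (`NaN`) fails every comparison and would fall through to `'Obese'`. That does not happen here because BMI is imputed earlier, but a guarded variant is sketched below in case the cell order changes. `categorize_bmi_safe` is a hypothetical name, not part of the notebook.

```python
import math
from bisect import bisect_right

BMI_CUTOFFS = [18.5, 25, 30]
BMI_LABELS = ['Underweight', 'Normal', 'Overweight', 'Obese']


def categorize_bmi_safe(bmi):
    """Same cut-offs as categorize_bmi(), but returns None for missing values."""
    if bmi is None or math.isnan(bmi):
        return None
    # bisect_right counts how many cut-offs are <= bmi, which is the label index
    return BMI_LABELS[bisect_right(BMI_CUTOFFS, bmi)]


for value in (15.96, 18.5, 25, 30, 56.8, float('nan')):
    print(value, '->', categorize_bmi_safe(value))
```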

In [189]:
# 'BMI_category' distribution

df['BMI_category'].value_counts()

Unnamed: 0_level_0,count
BMI_category,Unnamed: 1_level_1
Normal,1512
Overweight,1398
Obese,439
Underweight,41


In [190]:
# Lowest and Highest heartrate acc. to data

print(" Lowest heartrate acc. to data : ",df['heartRate'].min())
print("Highest heartrate acc. to data : ",df['heartRate'].max())

 Lowest heartrate acc. to data :  45.0
Highest heartrate acc. to data :  143.0


In [191]:
# Lowest and Highest glucose level acc. to data

print(" Lowest glucose level acc. to data : ",df['glucose'].min())
print("Highest glucose level acc. to data : ",df['glucose'].max())

 Lowest glucose level acc. to data :  40.0
Highest glucose level acc. to data :  394.0


In [192]:
# 'TenYearCHD' distribution

df['TenYearCHD'].value_counts()

Unnamed: 0_level_0,count
TenYearCHD,Unnamed: 1_level_1
0,2879
1,511


In [193]:
# mapping 'TenYearCHD' to descriptive labels (easier to read in visualizations)

df['CHD_status'] = df['TenYearCHD'].map({0: 'No CHD', 1: 'CHD Risk'})

### What all manipulations have you done and insights you found?

**MANIPULATIONS**

1. Dropped id column since it's just an identifier and holds no predictive value.

2. Converted the BPMeds and education columns to integer, as they were originally stored as float (these values cannot be fractional).

3. Handling Missing Values
* education: Filled with mode (most common value).
* BPMeds, totChol, BMI, heartRate, glucose: Filled with median to reduce outlier impact.
* cigsPerDay: Filled with 0 for non-smokers and with the median among smokers otherwise.

4. Feature Engineering & Binning
* BP_category : Based on systolic blood pressure (sysBP) : Normal, Elevated, Hypertension Stage 1, Hypertension Stage 2, Hypertensive Crisis.
* BMI_category : Based on WHO BMI cutoffs : Underweight, Normal, Overweight, Obese.

**INSIGHTS**

1. CHD Age Range -
* Youngest with CHD: 35
* Oldest with CHD: 70

Most people in the records are between 38 and 60 years old.

2. Gender Distribution -
Females make up about 56.7% of the records (1,923 of 3,390).

3. Education Level -
The majority of people in the dataset fall into education levels 1 and 2. Lower education is often associated with higher health risk, possibly due to less awareness of or access to healthcare.

4. Smoking Behavior -
* 49.76% of individuals smoke; the maximum reported intake is 70 cigarettes per day.

5. BP Ranges -
* Diastolic BP: Range 48 – 142.5
* Systolic BP: Range 83.5 – 295.0

About two-thirds of people fall into the Elevated, Hypertension Stage 1, or Hypertension Stage 2 categories, while only 100 people in the dataset take BP medication.

6. Cholesterol & Glucose -
* Total Cholesterol: Min 107, Max 696
* Glucose Levels: Min 40, Max 394

7. BMI Analysis -
* Range 15.96 – 56.8
* BMI categories show that most individuals are in the Normal or Overweight range.

8. Prevalent conditions -
31.5% of people are diagnosed with prevalent hypertension, while only a few (0.65%) have a history of stroke.

9. CHD -
* 15.07% of individuals (511 of 3,390) are labeled as at risk of coronary heart disease within the next 10 years.
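The percentages quoted in these insights can be rederived from the `value_counts()` outputs shown earlier. The sketch below only uses counts printed in this notebook, so it does not need the dataset.

```python
TOTAL = 3390  # rows in the dataset

# Counts copied from the value_counts() outputs above
counts = {
    "female": 1923,
    "smoker": 1687,
    "prevalent_hypertension": 1069,
    "prevalent_stroke": 22,
    "chd_risk": 511,
}

# Share of the dataset, in percent, for each count
shares = {name: round(100 * n / TOTAL, 2) for name, n in counts.items()}

# Share of people in the Elevated, Stage 1 and Stage 2 BP categories
mid_bp_share = round(100 * (745 + 595 + 896) / TOTAL, 2)

# Ratio of majority (no CHD) to minority (CHD risk) class
imbalance_ratio = round(2879 / 511, 2)

print(shares, mid_bp_share, imbalance_ratio)
```

The class ratio is the figure that motivates the resampling step mentioned in the project summary.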



## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [194]:
# Chart - 1 visualization code

# Age distribution

grouped_age = df.groupby('age').size().reset_index(name='count')

fig = px.area(grouped_age,x='age',y='count',title='<b>Age Distribution</b>',color_discrete_sequence=['#688984'])
fig.update_layout(template='plotly_white',font=dict(size=12), width=1000,height=500,title_x=0.5 )
fig.show()

##### 1. Why did you pick the specific chart?

An area chart was chosen because it effectively highlights the distribution and concentration of individuals across various age groups in a continuous and smooth manner. It helps visualize how population density changes with age, which is especially relevant for analyzing health-related datasets where age is a critical factor.

##### 2. What is/are the insight(s) found from the chart?

* Most individuals fall within the 40–60 age range.
* There's a noticeable dip in the very young (30s) and older population (65+), suggesting that middle-aged adults dominate the dataset.
* This aligns with the age group that's most vulnerable to CHD (Coronary Heart Disease) risks.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact:
* Understanding that most subjects are middle-aged enables targeted preventive healthcare campaigns, risk prediction models, and customized interventions for this group. It allows businesses (like insurance or health tech companies) to tailor products/services more effectively.

Negative Impact :
* The underrepresentation of younger and older individuals may limit the model's generalizability to those age groups.

#### Chart - 2

In [195]:
# Chart - 2 visualization code

# Age Distribution by CHD_status

fig = px.box(df,x='CHD_status',y='age',color='CHD_status',color_discrete_sequence=['#35625b','#9ab1ad'],title='<b>Age Distribution by CHD_status</b>')
fig.update_layout( template='plotly_white',font=dict(size=12),width=1000,height=500,title_x=0.5)
fig.show()

##### 1. Why did you pick the specific chart?

A box plot is perfect for visualizing the distribution, spread, and central tendency of age data across the two categories of CHD (Coronary Heart Disease) status. It also highlights outliers, medians, and variability, making it ideal to detect how age is associated with CHD.

##### 2. What is/are the insight(s) found from the chart?

* Individuals diagnosed with CHD tend to be older on average compared to those without CHD.
* The median age for CHD patients is clearly higher, confirming age as a significant risk factor.
* There’s also less age variability among those with CHD, suggesting it commonly occurs within a certain age band (roughly 50–65).

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact:
* These insights can inform preventive screening programs and early intervention campaigns targeting people above 50, who are statistically at higher risk.
* Healthcare providers and insurers can use this to stratify risk, prioritize older demographics, and allocate resources efficiently.


Negative Impact:
* If only the older population is  targeted, there might be missed early-risk cases in younger individuals. A balanced approach is essential to not overlook any early signs.

#### Chart - 3

In [196]:
# Chart - 3 visualization code

# Education Level Distribution

fig = px.pie(df,names='education',title='<b>Education Level Distribution</b>',
             color_discrete_sequence=px.colors.qualitative.Vivid)
fig.update_layout(title_x=0.5,height=500,width=1000)
fig.show()

##### 1. Why did you pick the specific chart?

A pie chart is ideal for showing the proportion of each education level in the dataset. Since "education" is a categorical feature, a pie chart helps quickly visualize how individuals are distributed across various education levels.

##### 2. What is/are the insight(s) found from the chart?

* The majority of individuals fall under education levels 1 and 2, indicating lower to moderate education levels dominate the dataset.
* Higher education levels (3 and 4) represent a smaller portion of the population.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact:
* This insight can guide targeted health education campaigns, especially for people with lower education, who may lack awareness about CHD risk and prevention.
* Organizations or healthcare startups can design accessible resources (like workshops in local languages) tailored to less-educated groups to improve health literacy and early diagnosis.

Negative Impact:

* Relying only on education as a filter for intervention might ignore high-risk individuals in higher education groups. So while impactful, this should be part of a multi-factor targeting strategy.

#### Chart - 4

In [197]:
# Chart - 4 visualization code

# Gender Distribution by CHD Status

gender_chd = df.groupby(['sex', 'CHD_status']).size().reset_index(name='count')

fig = px.bar( gender_chd, x='sex', y='count',
    color='CHD_status', text='count',
    title='<b>Gender Distribution by CHD Status</b>', color_discrete_sequence=['#35625b','#9ab1ad'], barmode='group' )

fig.update_layout(title_x=0.5,width=1000,height=500)
fig.update_traces(textposition='outside')
fig.show()

##### 1. Why did you pick the specific chart?

A grouped bar chart allows for side-by-side comparison between genders (male vs female) based on their CHD status. It's ideal for showing distributional differences across categories clearly and visually.

##### 2. What is/are the insight(s) found from the chart?

* Females slightly outnumber males in this dataset.
* The number of females without CHD is higher than males, but when it comes to those with CHD, the gap narrows significantly.
* This suggests that males may have a slightly higher proportion of CHD cases, or are at greater risk, even though fewer in number.
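Because the two groups differ in size, comparing proportions is more reliable than comparing raw counts. A minimal sketch of that calculation with `pd.crosstab` follows; the toy records are hypothetical, and on the real data the same call would take `df['sex']` and `df['CHD_status']`.

```python
import pandas as pd

# Hypothetical toy records: both groups have one CHD case, but group sizes differ
toy = pd.DataFrame({
    "sex": ["F"] * 6 + ["M"] * 4,
    "CHD_status": ["No CHD"] * 5 + ["CHD Risk"] + ["No CHD"] * 3 + ["CHD Risk"],
})

# normalize='index' turns each row (each sex) into proportions that sum to 1
rates = pd.crosstab(toy["sex"], toy["CHD_status"], normalize="index")
print(rates)
```

In this toy example the counts of CHD cases are equal, yet the rate is higher in the smaller group, which is the pattern the chart hints at.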

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact:

This chart can inform gender-specific health interventions. For instance:
* Promote preventive screenings and lifestyle changes for males, who may be more vulnerable despite lower representation.
* Develop education campaigns tailored to each gender—perhaps using different channels (e.g., workplace health programs for males, community health drives for females).

Negative Impact:

* Focusing only on males because of their higher CHD prevalence might overlook the absolute number of females at risk (which is still high).
* Gender bias in health service delivery might arise if insights aren’t used responsibly.

#### Chart - 5

In [198]:
# Chart - 5 visualization code

# CHD Status by Smoking Habit

df_smoke_chd = df.groupby(['is_smoking', 'CHD_status']).size().reset_index(name='count')

fig = px.bar( df_smoke_chd, x='is_smoking', y='count',
    color='CHD_status', barmode='stack', text='count',
    title='<b>CHD Status by Smoking Habit</b>', color_discrete_sequence=['#35625b','#9ab1ad'] )

fig.update_layout(title_x=0.5,width=1000,height=500)
fig.update_traces(textposition='outside')
fig.show()

##### 1. Why did you pick the specific chart?

A stacked bar chart is perfect for visualizing two categorical variables—in this case, smoking behavior (is_smoking) and CHD status. It allows us to see how CHD is distributed within smokers and non-smokers, and compare the relative risk visually.

##### 2. What is/are the insight(s) found from the chart?

* Smokers make up roughly half of the dataset.
* The number of CHD-positive cases is noticeably higher among smokers than non-smokers.
* Non-smokers still show some CHD cases, but the chart suggests that smoking is associated with a higher likelihood of CHD.
* The visual proportion of CHD-positive individuals is higher among smokers, pointing to an association between smoking and CHD risk.
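One way to quantify the comparison between smokers and non-smokers is a risk ratio. The sketch below uses hypothetical counts, not the notebook's output, and `risk_ratio` is an illustrative helper; the real counts would come from the grouped frame `df_smoke_chd`.

```python
def risk_ratio(exposed_cases, exposed_total, unexposed_cases, unexposed_total):
    """Risk in the exposed group divided by risk in the unexposed group."""
    exposed_risk = exposed_cases / exposed_total
    unexposed_risk = unexposed_cases / unexposed_total
    return exposed_risk / unexposed_risk


# Hypothetical counts, NOT taken from the notebook output
rr = risk_ratio(exposed_cases=30, exposed_total=200,
                unexposed_cases=20, unexposed_total=200)
print(f"Risk ratio (smokers vs non-smokers): {rr:.2f}")
```

A ratio above 1 indicates a higher CHD rate among smokers; it describes an association in the data and does not by itself establish causation.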

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact:
* Healthcare providers and policymakers can design anti-smoking campaigns targeted at reducing heart disease risk, with backed-up visual evidence.
* Insurance companies might use this insight to design customized health premiums or risk ratings.
* The chart supports predictive model features—"is_smoking" is clearly a high-impact variable for classification.

Negative Impact:
* Could lead to stigmatization of smokers in healthcare coverage or workplace policies if the data isn’t contextualized well, resulting in discrimination and exclusion rather than solutions.

#### Chart - 6

In [199]:
# Chart - 6 visualization code

# Cigarette Intake per Day by CHD Status

fig = px.box( df, x='CHD_status', y='cigsPerDay',
    title='<b>Cigarette Intake per Day by CHD Status</b>',
    color='CHD_status', color_discrete_sequence=['#35625b','#9ab1ad'])

fig.update_layout(template='plotly_white',font=dict(size=12), width=1000,height=500,title_x=0.5 )
fig.show()

##### 1. Why did you pick the specific chart?

Box plot was chosen to explore the relationship between cigarette consumption and CHD status. Box plots are ideal to visualize distribution, central tendency (median), and outliers, helping us understand whether higher smoking frequency aligns with increased CHD risk.

##### 2. What is/are the insight(s) found from the chart?

* People diagnosed with CHD show a wider spread and higher median for daily cigarette intake than those without CHD.
* There are significant outliers: some individuals with CHD smoke 40–60 or more cigarettes per day, which is extremely high.
* CHD cases are less clustered around lower intake values, suggesting more frequent/heavy smokers are at greater risk.
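The outliers a box plot marks are usually points beyond 1.5 times the interquartile range from the quartiles. A minimal sketch of that rule is shown below; the values are hypothetical, `iqr_outliers` is an illustrative helper, and Plotly may compute quartiles with a slightly different method, so its fences can differ a little.

```python
from statistics import quantiles


def iqr_outliers(values):
    """Return (lower_fence, upper_fence, outliers) using the 1.5 * IQR rule."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return lower, upper, outliers


# Hypothetical cigarettes-per-day values, not taken from the dataset
cigs = [0, 0, 0, 0, 5, 10, 20, 20, 30, 60]
lower, upper, outliers = iqr_outliers(cigs)
print(lower, upper, outliers)
```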

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact:

* Targeted smoking cessation campaigns can be designed for high-risk populations.
* Healthcare providers can develop early screening and preventive intervention plans based on smoking patterns.
* Insurance firms can use this info for risk stratification, provided they avoid discriminatory practices.

Negative Growth Risk:

* If interpreted without context, insurers or employers might punish all smokers, even occasional or former ones, leading to unethical exclusions.
* There’s a risk of social stigmatization instead of supporting behavioral change.



#### Chart - 7

In [200]:
# Chart - 7 visualization code

# Systolic vs Diastolic Blood Pressure

fig = px.scatter( df, x='sysBP',  y='diaBP',
    color='CHD_status', color_discrete_sequence=['#35625b','#9ab1ad'],
    title='<b>Systolic vs Diastolic Blood Pressure</b>',
    hover_data=['age', 'BPMeds', 'BP_category'] )
fig.update_traces( marker=dict(opacity=0.6) )
fig.update_layout(title_x=0.5, width=1000, height=500)
fig.show()

##### 1. Why did you pick the specific chart?

Scatter plot was chosen to explore the relationship between systolic (sysBP) and diastolic (diaBP) blood pressure, and how they relate to CHD status. It helps detect patterns, clusters, and ranges of BP that are common in people with CHD vs without CHD. Hover data like age, meds, and BP category enrich the interpretation.

##### 2. What is/are the insight(s) found from the chart?

* A positive correlation is visible between systolic and diastolic BP – as one increases, the other tends to increase too.

* CHD-positive individuals appear more frequently in higher BP ranges, especially when systolic exceeds 140–160 mmHg, indicating stages of hypertension.

* There are also some individuals on BP medications (as seen in hover data) who still show elevated values, suggesting either ineffective control or late-stage disease.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact:

* This chart can guide doctors and public health planners to focus on people with Stage 1 or 2 hypertension as a high-risk CHD group.
* BP management programs can be prioritized for such individuals, possibly integrating remote monitoring or early alert systems.
* Pharma companies can use these clusters to target drug development or marketing more effectively.

Negative Impact:

* If data is generalized, people with temporarily elevated BP might get flagged unfairly.
* Overreliance on BP alone without considering other health indicators could cause false positives, leading to resource wastage or patient anxiety.

#### Chart - 8

In [201]:
# Chart - 8 visualization code

# Prevalence of Stroke & Hypertension by CHD Status

chd_group = df.groupby('CHD_status')[['prevalentStroke', 'prevalentHyp']].sum().reset_index()

melted = pd.melt( chd_group, id_vars='CHD_status', var_name='Condition', value_name='Count')

fig = px.bar( melted, x='CHD_status', y='Count', color='Condition', color_discrete_sequence=px.colors.qualitative.Vivid,
    barmode='group',
    text='Count', title='<b>Prevalence of Stroke & Hypertension by CHD Status</b>' )

fig.update_layout(title_x=0.5, width=1000, height=500)
fig.update_traces(textposition='outside')
fig.show()

##### 1. Why did you pick the specific chart?

Grouped bar chart was chosen to compare the prevalence of stroke and hypertension between individuals with and without Coronary Heart Disease (CHD). These two conditions are widely recognized as major risk factors for CHD, so visualizing their distribution helps to understand comorbidity patterns.

##### 2. What is/are the insight(s) found from the chart?

* Both prevalent stroke and prevalent hypertension are significantly more common in individuals with CHD compared to those without.
* Hypertension in particular shows a steep difference, indicating that it is a more widespread co-condition than stroke in CHD-positive individuals.
* Stroke is also visibly more frequent in the CHD group, suggesting a potential bidirectional risk relationship.
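
The bar chart above plots raw counts (`.sum()`), and the two groups differ a lot in size (2,879 vs 511 records), so prevalence is easier to compare as a within-group rate. A minimal sketch of that idea, using a small invented frame (`toy`) in place of the notebook's `df`:

```python
import pandas as pd

# Toy data standing in for the notebook's df (values are made up for illustration).
toy = pd.DataFrame({
    "CHD_status": ["No CHD"] * 6 + ["CHD Risk"] * 4,
    "prevalentStroke": [0, 0, 0, 0, 0, 1, 0, 0, 1, 0],
    "prevalentHyp":    [1, 0, 0, 1, 0, 0, 1, 1, 0, 1],
})

# For 0/1 columns, the group mean is the share of people with the condition.
rates = (
    toy.groupby("CHD_status")[["prevalentStroke", "prevalentHyp"]]
    .mean()
    .mul(100)
    .round(1)
)
print(rates)
```

Swapping `.sum()` for `.mean()` in the chart code would plot these percentages instead of counts, which removes the effect of group size.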



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact:

* Healthcare providers can use this insight to flag patients with stroke or hypertension as high CHD risk candidates, prompting early screening or intervention.
* Insurance companies could refine their risk stratification models, leading to better premium calculations and health coverage policies.
* Health tech startups building preventive care dashboards can incorporate such comorbidity logic for better predictive modeling.

Negative Impact:

* If stroke/hypertension patients are over-categorized as CHD-prone without other clinical data, it could lead to overdiagnosis or unnecessary anxiety.
* Also, underlying root causes of stroke/hypertension might get ignored if focus shifts solely toward CHD risk.

#### Chart - 9

In [202]:
# Chart - 9 visualization code

# Glucose Levels by Diabetes Status (Split by CHD)

fig = px.box( df, x='diabetes', y='glucose',
    color='diabetes', color_discrete_sequence=px.colors.qualitative.Vivid,
    facet_col='CHD_status',
    title='<b>Glucose Levels by Diabetes Status (Split by CHD)</b>' )

fig.update_layout(title_x=0.5,height=500, width=1000)
fig.show()

##### 1. Why did you pick the specific chart?

Facet box plot was chosen to visually compare glucose levels across diabetes status and how they differ between people with and without CHD. It helps uncover whether diabetes and glucose levels have a clear pattern or correlation with CHD risk.

##### 2. What is/are the insight(s) found from the chart?

* Individuals with diabetes consistently have higher glucose levels than non-diabetics — as expected.
* Among CHD-positive individuals, diabetics show wider and more extreme glucose variability, indicating poor glucose control in this subgroup.
* Even among non-diabetics, some high glucose outliers exist, which could suggest undiagnosed or pre-diabetic cases.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact:

* Medical interventions can be tailored better: CHD patients with diabetes need tight glucose monitoring and possibly aggressive treatment plans.
* Public health campaigns can focus on early diabetes screening in patients at risk of or living with CHD.
* Health tech or wearable app developers can integrate this insight into algorithms that track glucose and flag cardiovascular risk.

Negative Impact:

* Solely relying on diabetes status without monitoring real-time glucose might miss early-stage metabolic risks.
* There's a risk of false security in “non-diabetics” who might have high glucose levels — this highlights the importance of continuous glucose monitoring.

#### Chart - 10

In [203]:
# Chart - 10 visualization code

# Cholesterol vs Age Colored by CHD Status

fig = px.scatter( df, x='age', y='totChol',
    color='CHD_status', color_discrete_sequence=['#35625b','#9ab1ad'],
    title='<b>Cholesterol vs Age Colored by CHD Status</b>',
    hover_data=['sex', 'is_smoking', 'diabetes'] )
fig.update_traces( marker=dict(opacity=0.8) )
fig.update_layout(title_x=0.5,height=500, width =1000)
fig.show()

##### 1. Why did you pick the specific chart?

Scatter plot was selected to explore the relationship between age and total cholesterol levels, and how it might differ based on CHD status. Coloring the data by CHD presence makes it easier to observe trends and clusters.

##### 2. What is/are the insight(s) found from the chart?

* No clear linear correlation between age and cholesterol is observed overall — cholesterol levels vary widely at all ages.
* However, CHD patients tend to cluster in the higher cholesterol ranges, especially in the middle to older age groups (45+).
* There are several younger individuals with high cholesterol but no CHD, indicating possible early intervention or other protective factors.
* Hover data shows many CHD patients with high cholesterol also have diabetes and/or smoke, hinting at multi-factor risk.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact:
* Reinforces the idea that age alone isn’t a reliable predictor — it must be combined with cholesterol and lifestyle indicators (smoking, diabetes).
* Healthcare providers can use this insight to build age-specific risk profiles.
* This supports preventive screening for high cholesterol even in younger adults, especially those who smoke or have diabetes.

Negative Impact:
* Ignoring younger age groups just because they "don’t fit the profile" could delay critical treatment in early CHD development stages.
* Relying too much on cholesterol alone may lead to underdiagnosis in individuals with normal cholesterol but other elevated risk factors.

#### Chart - 11

In [204]:
# Chart - 11 visualization code

# BMI Distribution

fig = px.histogram( df, x='BMI', nbins=30,
    title='<b>BMI Distribution</b>', color_discrete_sequence=['#688984'] )

fig.update_layout(title_x=0.5,height=500, width =1000)
fig.show()

##### 1. Why did you pick the specific chart?

Histogram helps visualize the overall distribution of BMI values in the dataset. It's a great way to detect skewness, outliers, and clustering in body mass index, which is a major risk factor for cardiovascular diseases like CHD.

##### 2. What is/are the insight(s) found from the chart?

* The BMI distribution is slightly right-skewed, with the majority of individuals falling in the 24–30 range.
* A notable chunk of the population is overweight (BMI > 25) and even obese (BMI > 30).
* Very few individuals have a BMI under 18.5, indicating underweight is less common in this sample.
* This distribution suggests that weight-related risk factors are prevalent and may influence CHD occurrence.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact:

* Promotes targeted health awareness campaigns around obesity and its link to CHD.
* Helps healthcare professionals and policy makers design preventive programs focused on weight management.
* Insurance companies or fitness programs can stratify clients based on risk using BMI trends from such data.

Negative Impact:

* Solely focusing on BMI may overlook “skinny-fat” individuals with low BMI but high visceral fat or poor metabolic health.
* Also, overemphasizing BMI alone might lead to unnecessary stigmatization, especially since it doesn’t consider muscle mass or bone density.

#### Chart - 12

In [205]:
# Chart - 12 visualization code

# BMI Category Breakdown by CHD Status

fig = px.sunburst( df,
    path=['BMI_category', 'CHD_status'],
    title='<b>BMI Category Breakdown by CHD Status</b>',
    color_discrete_sequence=px.colors.qualitative.Vivid)

fig.update_layout(title_x=0.5,height=500,width=1000)
fig.show()

##### 1. Why did you pick the specific chart?

A sunburst chart is perfect for visualizing hierarchical relationships, like how CHD status is distributed within each BMI category. It gives a clear nested view of both proportions and subgroups.

##### 2. What is/are the insight(s) found from the chart?

* The “Overweight” and “Obese” BMI categories dominate the inner circle, indicating a large portion of the population falls in these higher-risk weight groups.
* Within the Obese category, there’s a relatively higher proportion of CHD cases, suggesting a strong association between obesity and heart disease.
* The Normal BMI group still shows some presence of CHD, highlighting that CHD isn't exclusive to overweight individuals — other risk factors like blood pressure, cholesterol, or genetics may also play a role.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact:

* Healthcare providers can target interventions by BMI segment, especially focusing on overweight/obese groups.
* The chart gives actionable intel for insurance risk models — higher premiums or preventive care outreach could be aligned with BMI segments.
* Supports BMI as a key feature in predictive modeling for CHD.

Negative Impact:

* May inadvertently neglect CHD cases in normal BMI individuals, leading to under-screening in that group.
* If not interpreted carefully, it might promote a one-size-fits-all message about obesity, ignoring other nuanced risk contributors.



#### Chart - 13

In [206]:
# Chart - 13 visualization code

# Heart Rate Distribution by CHD Status

fig = px.violin( df, x='CHD_status', y='heartRate',
    title='<b>Heart Rate Distribution by CHD Status</b>',
    color_discrete_sequence=['#35625b','#9ab1ad'], color='CHD_status',
    box=True,  points='all',
    hover_data= ['age','sex','diabetes','is_smoking'] )

fig.update_layout( title_x=0.5,height=500, width =1000 )
fig.update_traces( marker=dict(size=5,opacity=0.6))
fig.show()

##### 1. Why did you pick the specific chart?

A violin plot was chosen because it combines the strengths of both box plots and KDE (kernel density estimation). It allows us to:
* Compare the distribution shape of heart rate for individuals with and without CHD.
* Visualize outliers, spread, median, and how densely heart rate values are clustered for each CHD status.

##### 2. What is/are the insight(s) found from the chart?

* People without CHD tend to have a more concentrated and slightly lower range of heart rates.
* The CHD group has a wider and flatter distribution, showing greater variability in heart rate and a possible shift toward higher heart rate values for some individuals.
* The median heart rate is similar in both groups, but the spread is larger in the CHD group — suggesting irregular heart rhythms or stress-related fluctuations might be more common in CHD-positive individuals.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact:

* Could support the development of heart rate monitoring tools (e.g., wearables) to flag abnormal patterns in at-risk individuals.
* Insurance and health analytics teams can use heart rate variability as an input in CHD prediction models.
* Physicians can dig deeper into abnormal heart rate trends during routine checkups, even in the absence of other symptoms.

Negative Impact:

* If models over-prioritize heart rate without considering age, medication (like BPMeds), or lifestyle factors, it may lead to false positives or unnecessary anxiety for patients.
* Might result in over-monitoring healthy individuals with naturally high heart rates (e.g., athletes).

#### Chart - 14 - Correlation Heatmap

In [207]:
# Correlation Heatmap visualization code
numeric_cols = df.select_dtypes(include=['int64', 'float64']).columns

corr_matrix = df[numeric_cols].corr()

fig = px.imshow(corr_matrix, text_auto=True, color_continuous_scale='Mint', aspect='auto', title='Correlation Heatmap of Numeric Features')
fig.update_layout( title_font_size=20, xaxis_tickangle=45, height=1000, width=1200 )
fig.show()


##### 1. Why did you pick the specific chart?

A correlation heatmap helps quantify linear relationships between variables, detect strong multicollinearity, and identify which features are positively or negatively associated with the target variable (TenYearCHD). It also guides feature selection by highlighting redundant or informative variables before modeling.

##### 2. What is/are the insight(s) found from the chart?

The correlation heatmap confirms that age, BP, glucose, and existing hypertension are among the strongest contributors to CHD risk, while also flagging potential multicollinearity between BP measures.
This insight helps in refining feature selection and building a more robust model.

#### Chart - 15 - Pair Plot

In [208]:
# Pair Plot visualization code
df['CHD_status'] = df['TenYearCHD'].map({0: 'No CHD', 1: 'CHD Risk'})

features = ['age', 'totChol', 'sysBP', 'diaBP', 'BMI', 'glucose', 'CHD_status']

fig = px.scatter_matrix( df[features], dimensions=features[:-1], color='CHD_status',
    title="Pairplot Colored by CHD Risk",
     color_discrete_sequence=['#35625b','#9ab1ad'])

fig.update_traces( marker=dict(opacity=0.6, size=5) )

fig.update_layout(
    dragmode='select', template='plotly_white',font=dict(size=12),
    width=1200,height=1000)
fig.show()

##### 1. Why did you pick the specific chart?

A pair plot helps visualize bivariate relationships between features and spot clusters or separability between the target classes (TenYearCHD: 0 or 1).
It can also reveal correlations and non-linear patterns that guide feature selection, and it helps identify outliers and distribution skewness.

##### 2. What is/are the insight(s) found from the chart?

People more likely to develop CHD in the next ten years (positive class) are often older (50+) and hypertensive, with higher cholesterol, glucose, and BMI.

## ***5. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [209]:
# Handling Missing Values & Missing Value Imputation

# already handled null values in the beginning, now no null values are present

df.isnull().sum()

Unnamed: 0,0
age,0
education,0
sex,0
is_smoking,0
cigsPerDay,0
BPMeds,0
prevalentStroke,0
prevalentHyp,0
diabetes,0
totChol,0


#### What all missing value imputation techniques have you used and why did you use those techniques?

We had already handled null values in the beginning of this project. The following are the techniques used :-

1. Mode Imputation (education) - Replace missing values with the most frequent category. Used this as it preserves distribution for non-numeric labels.

2. Median Imputation (BPMeds, totChol, BMI, heartRate, glucose) - Replaces missing values with the median of the observed values. Used this as the median is less affected by outliers than the mean.

3. Conditional Imputation (cigsPerDay) - Replace cigsPerDay with 0 if is_smoking == 'NO', else fill with the median of smokers only. Conditional logic suits behavior-dependent values: cigsPerDay depends on is_smoking.
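
The three techniques above can be sketched as follows. This is an illustration on a small invented frame (`toy`), not the notebook's original imputation cell; the column names follow the dataset, the values are made up.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps; column names follow the notebook, values are invented.
toy = pd.DataFrame({
    "education":  [1.0, 2.0, np.nan, 1.0],
    "totChol":    [200.0, np.nan, 240.0, 600.0],
    "is_smoking": ["NO", "YES", "YES", "YES"],
    "cigsPerDay": [np.nan, 10.0, np.nan, 30.0],
})

# 1. Mode imputation for the categorical-like column.
toy["education"] = toy["education"].fillna(toy["education"].mode()[0])

# 2. Median imputation for a numeric column with an extreme value.
toy["totChol"] = toy["totChol"].fillna(toy["totChol"].median())

# 3. Conditional imputation: non-smokers get 0, smokers get the smokers' median.
smoker = toy["is_smoking"] == "YES"
smoker_median = toy.loc[smoker, "cigsPerDay"].median()
toy.loc[~smoker, "cigsPerDay"] = toy.loc[~smoker, "cigsPerDay"].fillna(0)
toy.loc[smoker, "cigsPerDay"] = toy.loc[smoker, "cigsPerDay"].fillna(smoker_median)

print(toy)
```

In the toy data the 600 reading pulls the mean of `totChol` well above the median, which is why the median is the safer fill value for skewed columns.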


### 2. Handling Outliers

In [210]:
columns_to_winsorize = ['cigsPerDay', 'totChol', 'sysBP', 'diaBP', 'BMI', 'heartRate', 'glucose']

In [211]:
# describing values before handling outliers

df[columns_to_winsorize].describe()

Unnamed: 0,cigsPerDay,totChol,sysBP,diaBP,BMI,heartRate,glucose
count,3390.0,3390.0,3390.0,3390.0,3390.0,3390.0,3390.0
mean,9.140413,237.039823,132.60118,82.883038,25.793251,75.976991,81.720059
std,11.872952,44.994205,22.29203,12.023581,4.107026,11.970114,23.161265
min,0.0,107.0,83.5,48.0,15.96,45.0,40.0
25%,0.0,206.0,117.0,74.5,23.03,68.0,72.0
50%,0.0,234.0,128.5,82.0,25.38,75.0,78.0
75%,20.0,264.0,144.0,90.0,27.9975,83.0,85.0
max,70.0,696.0,295.0,142.5,56.8,143.0,394.0


In [212]:
# Handling Outliers & Outlier treatments

# Function to apply IQR-based capping (winsorization technique )
def winsorize_iqr(df, columns):
    for col in columns:
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1

        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR

        df[col] = np.where(df[col] < lower_bound, lower_bound,
                           np.where(df[col] > upper_bound, upper_bound, df[col]))
    return df

df = winsorize_iqr(df, columns_to_winsorize)

In [213]:
# describing values after handling outliers

df[columns_to_winsorize].describe()

Unnamed: 0,cigsPerDay,totChol,sysBP,diaBP,BMI,heartRate,glucose
count,3390.0,3390.0,3390.0,3390.0,3390.0,3390.0,3390.0
mean,9.110914,236.538938,132.117257,82.730236,25.705042,75.840413,79.49056
std,11.755881,42.839398,20.751844,11.52617,3.81111,11.543453,11.525738
min,0.0,119.0,83.5,51.25,15.96,45.5,52.5
25%,0.0,206.0,117.0,74.5,23.03,68.0,72.0
50%,0.0,234.0,128.5,82.0,25.38,75.0,78.0
75%,20.0,264.0,144.0,90.0,27.9975,83.0,85.0
max,50.0,351.0,184.5,113.25,35.44875,105.5,104.5


In [214]:
# dropping 'BP_category', 'BMI_category','CHD_status' columns to reduce data redundancy

df.drop(['BP_category', 'BMI_category','CHD_status' ], axis=1, inplace=True)

We drop these columns to reduce redundancy, as the category columns (BP_category, BMI_category, CHD_status) are derived directly from sysBP, BMI and TenYearCHD.

##### What all outlier treatment techniques have you used and why did you use those techniques?

Winsorization :-

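Winsorization is a technique where outliers are capped rather than removed. Here each column is capped at the IQR fences (Q1 - 1.5 × IQR and Q3 + 1.5 × IQR). We used this technique to reduce the influence of extreme values without losing records, keeping the full dataset of 3,390 rows intact. Capping overwrites the original extreme values, so a copy of the raw data should be kept if they may be needed later.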
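
As a compact illustration of the same IQR capping, the sketch below uses `Series.clip` and returns a new series instead of modifying the input. The helper name `cap_iqr` and the sample values are hypothetical.

```python
import pandas as pd

def cap_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a copy of `series` capped at the IQR fences."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Invented glucose-like values with one extreme reading.
glucose = pd.Series([70.0, 72.0, 78.0, 85.0, 90.0, 394.0])
capped = cap_iqr(glucose)
print(capped.tolist())
```

Because `glucose` itself is left untouched, the raw readings stay available for comparison after capping.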

### 3. Categorical Encoding

In [215]:
# Encode your categorical columns

In [216]:
# Label encoding 'sex' and 'is_smoking' column

df['sex'] = df['sex'].map({'F': 1, 'M': 0})
df['is_smoking'] = df['is_smoking'].map({'NO': 0, 'YES': 1})

In [217]:
df.head()

Unnamed: 0,age,education,sex,is_smoking,cigsPerDay,BPMeds,prevalentStroke,prevalentHyp,diabetes,totChol,sysBP,diaBP,BMI,heartRate,glucose,TenYearCHD
0,64,2,1,1,3.0,0,0,0,0,221.0,148.0,85.0,25.38,90.0,80.0,1
1,36,4,0,0,0.0,0,0,1,0,212.0,168.0,98.0,29.77,72.0,75.0,0
2,46,1,1,1,10.0,0,0,0,0,250.0,116.0,71.0,20.35,88.0,94.0,0
3,50,1,0,1,20.0,0,0,1,0,233.0,158.0,88.0,28.26,68.0,94.0,1
4,64,1,1,1,30.0,0,0,0,0,241.0,136.5,85.0,26.42,70.0,77.0,0


#### What all categorical encoding techniques have you used & why did you use those techniques?

Label Encoding :-

The 'sex' and 'is_smoking' columns are binary categorical features (only 2 possible values).
Label encoding is therefore a simple fit for YES/NO and M/F values, keeping the data compact and ML-model-friendly.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [218]:
# Manipulate Features to minimize feature correlation and create new features

# already done

New columns ('BP_category', 'BMI_category', 'CHD_status') were added earlier for the charts, but they have now been dropped to reduce data redundancy: they were derived from numerical columns, which are better suited for modelling.

#### 2. Feature Selection

In [219]:
# Select your features wisely to avoid overfitting

corr = df.corr(numeric_only=True)
f_corr = corr['TenYearCHD'].sort_values(ascending=False)
print(f_corr)

TenYearCHD         1.000000
age                0.224927
sysBP              0.203905
prevalentHyp       0.166544
diaBP              0.131855
diabetes           0.103681
totChol            0.091825
BPMeds             0.087349
glucose            0.070922
prevalentStroke    0.068627
cigsPerDay         0.065044
BMI                0.064416
is_smoking         0.034143
heartRate          0.019719
education         -0.051388
sex               -0.084647
Name: TenYearCHD, dtype: float64


In [220]:
# dropping TenYearCHD (target) and is_smoking (redundant with cigsPerDay); sysBP and diaBP are kept

f_corr = f_corr.drop(['TenYearCHD','is_smoking'])

In [221]:
selected_features = f_corr.index.tolist()
selected_features

['age',
 'sysBP',
 'prevalentHyp',
 'diaBP',
 'diabetes',
 'totChol',
 'BPMeds',
 'glucose',
 'prevalentStroke',
 'cigsPerDay',
 'BMI',
 'heartRate',
 'education',
 'sex']

##### What all feature selection methods have you used  and why?

1. Correlation Analysis -

Checked the Pearson correlation of every feature with the target variable TenYearCHD, which helped identify linear relationships between features and target.
Features with very weak correlation (like is_smoking, heartRate, education, sex) were marked for possible removal; of these, only is_smoking was dropped.

2. Redundancy Elimination -

Removed is_smoking, since cigsPerDay already carries this information: cigsPerDay = 0 means the person does not smoke.



##### Which all features you found important and why?

* age - Highest correlation (~0.22), CHD risk increases with age
* sysBP - Second-highest correlation (~0.20), systolic BP is a primary CHD risk factor
* diaBP - Complements systolic BP, gives a fuller view of BP health
* prevalentHyp - Indicates historical hypertension, directly related to heart conditions
* diabetes - Chronic disease linked to blood vessel damage & CHD
* totChol - High cholesterol = plaque buildup in arteries
* BPMeds - Suggests prior medical history relevant to cardiovascular risk
* glucose - High glucose signals prediabetes/diabetes, a major CHD contributor
* prevalentStroke - Stroke and CHD share vascular pathways
* cigsPerDay - Smoking is a top-tier predictor of cardiovascular issues
* BMI - Obesity strains the heart and elevates BP, cholesterol, etc.
* heartRate - Weak correlation
* education - Slightly negative corr, less clinically impactful
* sex - Weak negative corr

**Features Dropped** :
* TenYearCHD - Target variable
* is_smoking - Redundant as we already have cigsPerDay


### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [222]:
# Transform Your data

No data transformation is required.

### 6. Data Scaling

In [223]:
# initializing variable x and y

x = df[selected_features]
y = df['TenYearCHD']

In [224]:
# Scaling your data

# Standardization (Z-Score Scaling)

scaler = StandardScaler()
x= scaler.fit_transform(x)

##### Which method have you used to scale your data and why?

Z-Score Scaling (Standardization method) :

This method rescales each feature to have mean = 0 and std = 1, which is appropriate for scale-sensitive models such as Logistic Regression.
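
The notebook fits the scaler on the full dataset before splitting. A common alternative is to learn the mean and standard deviation from the training rows only and reuse them for the test rows, so no test information leaks into the transform. A minimal NumPy sketch of that z-score computation, with invented values:

```python
import numpy as np

# Invented values for two features (age, sysBP); rows are people.
x_train = np.array([[40.0, 120.0], [50.0, 130.0], [60.0, 140.0]])
x_test = np.array([[55.0, 150.0]])

# Learn the mean and standard deviation from the training rows only...
mu = x_train.mean(axis=0)
sigma = x_train.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# ...then apply the same transform to both splits: z = (x - mean) / std.
z_train = (x_train - mu) / sigma
z_test = (x_test - mu) / sigma

print(z_train.mean(axis=0), z_train.std(axis=0))
print(z_test)
```

With scikit-learn the same pattern is `scaler.fit(x_train)` followed by `scaler.transform(x_train)` and `scaler.transform(x_test)`.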

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

There is no need for dimensionality reduction, as the dataset has only 14 features.

In [225]:
# Dimensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Not Applicable

### 8. Data Splitting

In [226]:
# Split your data to train and test. Choose Splitting ratio wisely.

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=10)

In [227]:
# shape of x_train, x_test, y_train, y_test

print ("Shape of x_train : ",x_train.shape)
print ("Shape of y_train : ",y_train.shape)
print ("Shape of x_test : ",x_test.shape)
print ("Shape of y_test : ",y_test.shape)

Shape of x_train :  (2712, 14)
Shape of y_train :  (2712,)
Shape of x_test :  (678, 14)
Shape of y_test :  (678,)


##### What data splitting ratio have you used and why?

We used an 80/20 train/test split, a widely used default, for two reasons:

* Enough data to learn from: 80% (2,712 rows) gives the model a solid number of examples to train on, while the remaining 20% (678 rows) is large enough to evaluate performance reliably.
* Overfitting stays detectable: giving too much data to training would leave a test set too small to reveal overfitting.
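
With a target as imbalanced as TenYearCHD, a plain random split can leave the two parts with slightly different class ratios. One option, shown here as a sketch on synthetic data (the arrays below are invented stand-ins, not the project data), is the `stratify` argument of `train_test_split`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows, 15% positives (roughly the notebook's class ratio).
rng = np.random.default_rng(10)
x = rng.normal(size=(1000, 14))
y = np.array([1] * 150 + [0] * 850)

# stratify=y keeps the class ratio the same in both splits.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=10, stratify=y
)

print(x_train.shape, x_test.shape)
print(y_train.mean(), y_test.mean())
```

Both splits keep the 15% positive rate, so test metrics are measured on the same class mix the model was trained on.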

### 9. Handling Imbalanced Dataset

In [228]:
# Handling Imbalanced Dataset (If needed)

df['TenYearCHD'].value_counts()

Unnamed: 0_level_0,count
TenYearCHD,Unnamed: 1_level_1
0,2879
1,511


In [229]:
# Using SMOTE
smt = SMOTE(random_state=10)
x, y = smt.fit_resample(x,y)

# splitting data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=10)

# shape of resampled x_train, x_test, y_train, y_test
print ("Shape of x_train : ",x_train.shape)
print ("Shape of y_train : ",y_train.shape)
print ("Shape of x_test : ",x_test.shape)
print ("Shape of y_test : ",y_test.shape)

Shape of x_train :  (4606, 14)
Shape of y_train :  (4606,)
Shape of x_test :  (1152, 14)
Shape of y_test :  (1152,)


##### Do you think the dataset is imbalanced? Explain Why.

Yes, the data is highly imbalanced: CHD = 0 (2,879 records, about 85%) is far more frequent than CHD = 1 (511 records, about 15%).

Ideally, a balanced dataset would have close to a 50-50 (or even 60-40) split.

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

SMOTE = Synthetic Minority Oversampling Technique :-

SMOTE generates synthetic samples of the minority class (TenYearCHD = 1) by interpolating between nearest neighbors, balancing the dataset by equalizing both classes.

It can also improve the model’s ability to detect true positive cases, which is critical in predicting CHD risk.
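
To make the interpolation idea concrete, here is a simplified NumPy sketch. It is not imbalanced-learn's implementation; the function name `smote_like`, its parameters, and the sample rows are hypothetical.

```python
import numpy as np

def smote_like(minority: np.ndarray, n_new: int, k: int = 2, seed: int = 10) -> np.ndarray:
    """Create n_new synthetic rows by interpolating towards nearest neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority rows.
    dists = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)            # a row is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(minority))     # pick a minority row
        nb = rng.choice(neighbours[base])      # pick one of its k neighbours
        gap = rng.random()                     # position along the segment, in [0, 1)
        synthetic[i] = minority[base] + gap * (minority[nb] - minority[base])
    return synthetic

# Invented minority-class rows (e.g. scaled age and sysBP).
minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])
new_rows = smote_like(minority, n_new=6)
print(new_rows.shape)
```

Each synthetic row lies on the line segment between two real minority rows. The notebook resamples before splitting, so its test set also contains synthetic rows; a common alternative is to resample only the training split so that evaluation uses real records only.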

## ***6. ML Model Implementation***

### ML Model - 1

In [230]:
# ML Model - 1 Implementation

# Fit the Algorithm

# Predict on the model

# initializing model
logistic_regression = LogisticRegression(solver='liblinear')

# fitting data
logistic_regression.fit(x_train, y_train)

# predicting on test data
y_train_pred_lr = logistic_regression.predict(x_train)
y_test_pred_lr = logistic_regression.predict(x_test)

# Training and testing accuracy
train_accuracy = accuracy_score(y_train, y_train_pred_lr)
test_accuracy = accuracy_score(y_test, y_test_pred_lr)
# printing training and testing accuracy
print(f"Training Accuracy: {train_accuracy}")
print("Testing Accuracy:", test_accuracy,'\n')

# Get scores - precision, recall, f1 score, roc_auc_score, confusion matrix
# precision,
lr_tp = precision_score(y_test,y_test_pred_lr)
# recall,
lr_tr = recall_score(y_test,y_test_pred_lr)
# f1 score
lr_fs= f1_score(y_test,y_test_pred_lr)
# roc_auc_score
lr_ras = roc_auc_score(y_test,y_test_pred_lr)
# confusion matrix
lr_cm = confusion_matrix(y_test,y_test_pred_lr)
# Printing all these metrics
print('Precision score of logistic model:',lr_tp)
print('Recall score of logistic model:', lr_tr)
print('F1 score of logistic model: ', lr_fs)
print('ROC AUC score of logistic model: ',lr_ras)
print('Confusion matrix of logistic model : \n',lr_cm)
labels = ['No CHD', 'CHD']

Training Accuracy: 0.6804168475900999
Testing Accuracy: 0.6545138888888888 

Precision score of logistic model: 0.6254125412541254
Recall score of logistic model: 0.6890909090909091
F1 score of logistic model:  0.6557093425605537
ROC AUC score of logistic model:  0.6560072485653881
Confusion matrix of logistic model : 
 [[375 227]
 [171 379]]


#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [231]:
# Visualizing evaluation Metric Score chart

fig = go.Figure(data=go.Heatmap ( z=lr_cm, x=labels, y=labels,
    hoverongaps=False, colorscale='Mint', showscale=True ) )

# Adding annotations
for i in range(len(lr_cm)):
    for j in range(len(lr_cm[0])):
        fig.add_annotation(
            x=labels[j],
            y=labels[i],
            text=str(lr_cm[i][j]),
            showarrow=False,
            font=dict(color="white" if lr_cm[i][j] > lr_cm.max() / 2 else "black")
        )

fig.update_layout( title="<b>Confusion Matrix - Logistic Regression</b>", height=500,width=800,
    xaxis_title="Predicted Label", yaxis_title="True Label",
    template='plotly_white',title_x=0.5 )
fig.show()

Logistic Regression :

Logistic Regression is a linear classification model used to predict the probability of a binary outcome; in this case, whether a person is likely to develop Coronary Heart Disease (CHD) in the next 10 years (TenYearCHD: 0 or 1). It works well as a baseline model and can handle imbalanced classification when combined with techniques such as SMOTE or class weights.

* Precision - Out of all predicted positives, 62.5% were actually correct, so about 37.5% of positive predictions were false alarms.
* Recall - The model caught ~69% of actual positives.
* F1 score - ~0.66, the harmonic mean of precision and recall.
* ROC AUC - ~0.66, above the 0.5 chance level, so the model has some discriminative power. Because the score was computed from hard 0/1 predictions rather than predicted probabilities, it equals the average of recall and specificity.
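
The reported scores can be re-derived by hand from the confusion matrix printed above, which is a useful sanity check. The sketch below uses only the four counts from that output:

```python
# Confusion matrix reported above: rows = true label, columns = predicted label.
tn, fp = 375, 227
fn, tp = 171, 379

precision = tp / (tp + fp)
recall = tp / (tp + fn)            # also called sensitivity
specificity = tn / (tn + fp)
f1 = 2 * precision * recall / (precision + recall)

# roc_auc_score on hard 0/1 predictions reduces to this average.
auc_from_labels = (recall + specificity) / 2

print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
print(f"specificity={specificity:.4f} auc_from_labels={auc_from_labels:.4f}")
```

To obtain a threshold-independent AUC, one option is to pass the positive-class probabilities, `logistic_regression.predict_proba(x_test)[:, 1]`, to `roc_auc_score` instead of the predicted labels.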

#### 2. Cross- Validation & Hyperparameter Tuning

In [232]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

# initializing param_grid
param_grid = {'penalty':['l1','l2'], 'C' : [0.0001,0.001, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1, 10] }   # set the parameters

# initializing grid model and fitting data
logistic_grid_model = GridSearchCV(logistic_regression, param_grid, scoring = 'recall',n_jobs = -1, verbose = 3, cv = 3)
logistic_grid_model.fit(x_train,y_train)

# predicting on test data
y_train_pred_lrg = logistic_grid_model.predict(x_train)
y_test_pred_lrg = logistic_grid_model.predict(x_test)

# Training and testing accuracy
train_accuracy = accuracy_score(y_train, y_train_pred_lrg)
test_accuracy = accuracy_score(y_test, y_test_pred_lrg)
# printing training and testing accuracy
print("Training Accuracy:", train_accuracy)
print("Testing Accuracy:", test_accuracy,'\n')

# Get scores - precision, recall, f1 score, roc_auc_score, confusion matrix
# precision,
lrg_tp = precision_score(y_test,y_test_pred_lrg)
# recall,
lrg_tr = recall_score(y_test,y_test_pred_lrg)
# f1 score
lrg_fs = f1_score(y_test,y_test_pred_lrg)
# roc_auc_score
lrg_ras = roc_auc_score(y_test,y_test_pred_lrg)
# confusion matrix
lrg_cm = confusion_matrix(y_test,y_test_pred_lrg)
# Printing all these metrics
print('Precision score of logistic grid model:',lrg_tp)
print('Recall score of logistic grid model:', lrg_tr)
print('F1 score of logistic grid model: ', lrg_fs)
print('ROC AUC score of logistic grid model: ',lrg_ras)
print('Confusion matrix of logistic grid model : \n',lrg_cm)
labels = ['No CHD', 'CHD']

Fitting 3 folds for each of 28 candidates, totalling 84 fits
Training Accuracy: 0.6732522796352584
Testing Accuracy: 0.6484375 

Precision score of logistic grid model: 0.6120556414219475
Recall score of logistic grid model: 0.72
F1 score of logistic grid model:  0.6616541353383458
ROC AUC score of logistic grid model:  0.6515282392026578
Confusion matrix of logistic grid model : 
 [[351 251]
 [154 396]]


##### Which hyperparameter optimization technique have you used and why?

GridSearchCV :-

GridSearchCV exhaustively searches over a manually specified grid of hyperparameters and evaluates every combination via cross-validation. With a grid this small (28 candidates, 84 fits), an exhaustive search is cheap.
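As a quick sanity check on the search size, the number of candidates can be reproduced from the grid itself. The sketch below only enumerates combinations of the `param_grid` used above and multiplies by `cv = 3`; it does not fit any model.

```python
from itertools import product

# Grid copied from the tuning cell above.
param_grid = {
    "penalty": ["l1", "l2"],
    "C": [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1, 10],
}
cv_folds = 3

# Every combination of values is one candidate configuration.
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
n_candidates = len(candidates)
total_fits = n_candidates * cv_folds

print(n_candidates, "candidates,", total_fits, "cross-validation fits")
```

This matches the log line "Fitting 3 folds for each of 28 candidates, totalling 84 fits".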

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

In [233]:
# Visualizing updated evaluation Metric Score chart

fig = go.Figure(data=go.Heatmap ( z=lrg_cm, x=labels, y=labels,
    hoverongaps=False, colorscale='Mint', showscale=True ) )

# Adding annotations
for i in range(len(lrg_cm)):
    for j in range(len(lrg_cm[0])):
        fig.add_annotation(
            x=labels[j],
            y=labels[i],
            text=str(lrg_cm[i][j]),
            showarrow=False,
            font=dict(color="white" if lrg_cm[i][j] > lrg_cm.max() / 2 else "black")
        )

fig.update_layout( title="<b>Confusion Matrix - Logistic Regression (Tuned)", height=500,width=800,
    xaxis_title="Predicted Label", yaxis_title="True Label",
    template='plotly_white',title_x=0.5 )
fig.show()

* Precision - Slight dip, from 62.5% to about 61.2%
* Recall - Slight improvement, from ~69% to 72%
* F1 score - Slight improvement
* ROC AUC - Slight drop

After tuning, the logistic regression model improved slightly on recall and F1 score, which matter most for a medical prediction problem, while precision and ROC AUC dipped slightly. This trade-off is acceptable when the goal is to minimize false negatives.
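To make the link between the confusion matrix and the reported scores explicit, here is a minimal sketch that recomputes the tuned logistic regression metrics by hand. It assumes the scikit-learn layout of `confusion_matrix` (rows are true labels, columns are predicted labels, in the order No CHD, CHD).

```python
# Tuned logistic regression confusion matrix printed above.
# Rows: true label, columns: predicted label, order [No CHD, CHD].
cm = [[351, 251],
      [154, 396]]

tn, fp = cm[0]
fn, tp = cm[1]

precision = tp / (tp + fp)            # share of predicted CHD that is real CHD
recall = tp / (tp + fn)               # share of real CHD that was caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f} accuracy={accuracy:.4f}")
```

The values agree with the printed output (precision 0.612, recall 0.72, F1 0.662, accuracy 0.648), and the 154 false negatives are the CHD cases the model missed.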

### ML Model - 2

In [234]:
# ML Model - 2 Implementation

# Fit the Algorithm

# Predict on the model

# initializing model
rf_classifier = RandomForestClassifier(n_estimators = 100,max_depth=10, min_samples_split=10, min_samples_leaf=5, max_features='sqrt', random_state=42,class_weight='balanced',)

# fitting data
rf_classifier.fit(x_train, y_train)

# predicting on test data
y_train_pred_rf = rf_classifier.predict(x_train)
y_test_pred_rf = rf_classifier.predict(x_test)

# Training and testing accuracy
train_accuracy = accuracy_score(y_train, y_train_pred_rf)
test_accuracy = accuracy_score(y_test, y_test_pred_rf)
# printing training and testing accuracy
print("Training Accuracy:", train_accuracy)
print("Testing Accuracy:", test_accuracy,'\n')

# Get scores - precision, recall, f1 score, roc_auc_score, confusion matrix
# precision,
rf_tp = precision_score(y_test,y_test_pred_rf)
# recall,
rf_tr = recall_score(y_test,y_test_pred_rf)
# f1 score
rf_fs= f1_score(y_test,y_test_pred_rf)
# roc_auc_score
rf_ras = roc_auc_score(y_test,y_test_pred_rf)
# confusion matrix
rf_cm = confusion_matrix(y_test,y_test_pred_rf)
# Printing all these metrics
print('Precision score of RandomForest model:',rf_tp)
print('Recall score of RandomForest model:', rf_tr)
print('F1 score of RandomForest model: ', rf_fs)
print('ROC AUC score of RandomForest model: ',rf_ras)
print('Confusion matrix of RandomForest model : \n',rf_cm)
labels = ['No CHD', 'CHD']

Training Accuracy: 0.9103343465045592
Testing Accuracy: 0.8055555555555556 

Precision score of RandomForest model: 0.7734899328859061
Recall score of RandomForest model: 0.8381818181818181
F1 score of RandomForest model:  0.8045375218150087
ROC AUC score of RandomForest model:  0.806964663243733
Confusion matrix of RandomForest model : 
 [[467 135]
 [ 89 461]]


#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [235]:
# Visualizing evaluation Metric Score chart

fig = go.Figure(data=go.Heatmap ( z=rf_cm, x=labels, y=labels,
    hoverongaps=False, colorscale='Mint', showscale=True ) )

# Adding annotations
for i in range(len(rf_cm)):
    for j in range(len(rf_cm[0])):
        fig.add_annotation(
            x=labels[j],
            y=labels[i],
            text=str(rf_cm[i][j]),
            showarrow=False,
            font=dict(color="white" if rf_cm[i][j] > rf_cm.max() / 2 else "black")
        )

fig.update_layout( title="<b>Confusion Matrix - RandomForest", height=500,width=800,
    xaxis_title="Predicted Label", yaxis_title="True Label",
    template='plotly_white',title_x=0.5 )
fig.show()

Random Forest Classifier :

Random Forest is an ensemble learning method that builds a “forest” of decision trees and combines their predictions for better generalization. It handles non-linear relationships well and is usually more resistant to overfitting than a single decision tree.

* Precision - ~77% of predicted positives were correct
* Recall - 83.8% of all true CHD cases were caught
* F1 score - 0.80, a good balance between precision and recall
* ROC AUC - 0.81, the model distinguishes the two classes well
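One detail worth noting about the ROC AUC figure: the cell above calls `roc_auc_score(y_test, y_test_pred_rf)` with hard class predictions rather than probabilities. With 0/1 predictions the ROC curve has a single interior point, so the area reduces to the mean of sensitivity and specificity (balanced accuracy). The sketch below checks this against the Random Forest confusion matrix using plain arithmetic.

```python
# Baseline Random Forest confusion matrix printed above.
cm = [[467, 135],
      [89, 461]]

tn, fp = cm[0]
fn, tp = cm[1]

tpr = tp / (tp + fn)   # sensitivity (recall)
tnr = tn / (tn + fp)   # specificity
fpr = 1 - tnr

# Area under the piecewise-linear curve (0,0) -> (fpr,tpr) -> (1,1).
auc_from_labels = 0.5 * fpr * tpr + 0.5 * (1 - fpr) * (tpr + 1)
balanced_accuracy = (tpr + tnr) / 2

print(f"label-based AUC={auc_from_labels:.6f} balanced accuracy={balanced_accuracy:.6f}")
```

Both numbers equal the reported 0.807. A probability-based AUC, for example from `rf_classifier.predict_proba(x_test)[:, 1]`, would summarize ranking quality across all thresholds and would generally give a different value.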

#### 2. Cross-Validation & Hyperparameter Tuning

In [236]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

# initializing param_grid
param_grid = {'n_estimators' : [100,150],'max_depth' : [10,20]}

# initializing grid model and fitting data
rf_classifier_grid = GridSearchCV(rf_classifier, param_grid, scoring = 'recall',n_jobs = -1, verbose = 3, cv = 3)
rf_classifier_grid.fit(x_train,y_train)

# predicting on test data
y_train_pred_rfg = rf_classifier_grid.predict(x_train)
y_test_pred_rfg = rf_classifier_grid.predict(x_test)

# Training and testing accuracy
train_accuracy = accuracy_score(y_train, y_train_pred_rfg)
test_accuracy = accuracy_score(y_test, y_test_pred_rfg)
# printing training and testing accuracy
print("Training Accuracy:", train_accuracy)
print("Testing Accuracy:", test_accuracy,'\n')

# Get scores - precision, recall, f1 score, roc_auc_score, confusion matrix
# precision,
rfg_tp = precision_score(y_test,y_test_pred_rfg)
# recall,
rfg_tr = recall_score(y_test,y_test_pred_rfg)
# f1 score
rfg_fs= f1_score(y_test,y_test_pred_rfg)
# roc_auc_score
rfg_ras = roc_auc_score(y_test,y_test_pred_rfg)
# confusion matrix
rfg_cm = confusion_matrix(y_test,y_test_pred_rfg)
# Printing all these metrics
print('Precision score of RandomForest grid model:',rfg_tp)
print('Recall score of RandomForest grid model:', rfg_tr)
print('F1 score of RandomForest grid model: ', rfg_fs)
print('ROC AUC score of RandomForest grid model: ',rfg_ras)
print('Confusion matrix of RandomForest grid model : \n',rfg_cm)
labels = ['No CHD', 'CHD']

Fitting 3 folds for each of 4 candidates, totalling 12 fits
Training Accuracy: 0.9622231871471993
Testing Accuracy: 0.8420138888888888 

Precision score of RandomForest grid model: 0.8129251700680272
Recall score of RandomForest grid model: 0.8690909090909091
F1 score of RandomForest grid model:  0.8400702987697716
ROC AUC score of RandomForest grid model:  0.8431833282996074
Confusion matrix of RandomForest grid model : 
 [[492 110]
 [ 72 478]]


##### Which hyperparameter optimization technique have you used and why?

GridSearchCV :-

GridSearchCV exhaustively searches over a manually specified grid of hyperparameters and evaluates every combination via cross-validation. Here the grid has only 4 candidates (12 fits), so an exhaustive search is cheap.
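For intuition about what `cv = 3` does, the following sketch splits a small label list into three validation folds while keeping the class ratio similar in each fold. It is a simplified illustration: the labels are hypothetical, and the round-robin assignment is not how scikit-learn's stratified splitter orders samples internally.

```python
from collections import defaultdict

def stratified_folds(labels, n_splits=3):
    """Assign each sample index to exactly one validation fold, class by class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        # Deal indices round-robin so each fold gets a similar share of the class.
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return [sorted(fold) for fold in folds]

# Hypothetical labels: 6 negatives followed by 6 positives.
labels = [0] * 6 + [1] * 6
folds = stratified_folds(labels, n_splits=3)

for k, val_idx in enumerate(folds):
    train_idx = [i for i in range(len(labels)) if i not in val_idx]
    print(f"fold {k}: train={train_idx} validate={val_idx}")
```

Each candidate in the grid is trained three times, once per fold, and scored on the held-out part; the fold scores are then averaged.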

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

In [237]:
# Visualizing updated evaluation Metric Score chart

fig = go.Figure(data=go.Heatmap ( z=rfg_cm, x=labels, y=labels,
    hoverongaps=False, colorscale='Mint', showscale=True ) )

# Adding annotations
for i in range(len(rfg_cm)):
    for j in range(len(rfg_cm[0])):
        fig.add_annotation(
            x=labels[j],
            y=labels[i],
            text=str(rfg_cm[i][j]),
            showarrow=False,
            font=dict(color="white" if rfg_cm[i][j] > rfg_cm.max() / 2 else "black")
        )

fig.update_layout( title="<b>Confusion Matrix - RandomForest (Tuned) ", height=500,width=800,
    xaxis_title="Predicted Label", yaxis_title="True Label",
    template='plotly_white',title_x=0.5 )
fig.show()

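* Precision - Improved from 77.3% to 81.3%
* Recall - Improved from 83.8% to 86.9%
* F1 score - Improved from 0.80 to 0.84
* ROC AUC - Improved from 0.81 to 0.84

After hyperparameter tuning, every test metric of the Random Forest model improved by roughly 3 to 4 percentage points: the model is more precise, more sensitive to positives (recall), better balanced overall (F1 score), and better at class separation (ROC AUC). Training accuracy also rose from 0.91 to 0.96, so the gap between training and test accuracy widened slightly.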

#### Explain each evaluation metric's indication towards business and the business impact of the ML model used.

* Recall - Most important for healthcare safety: catching true CHD cases
* Precision - Helps reduce cost and panic from false alarms
* F1 Score - Useful when balancing risk against resources
* ROC AUC - Indicates how well the model ranks patients by risk, which matters for threshold-based alert systems
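The trade-off between recall and precision listed above depends on the decision threshold applied to predicted probabilities. The sketch below uses made-up probabilities and labels (they are not taken from this dataset) to show how lowering the threshold catches more CHD cases at the cost of more false alarms.

```python
# Hypothetical predicted CHD probabilities and true labels (1 = CHD).
probs = [0.95, 0.80, 0.70, 0.55, 0.45, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1, 1, 0, 1, 1, 0, 1, 0, 0, 0]

def precision_recall_at(threshold):
    """Precision and recall when every probability >= threshold is flagged as CHD."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for pred, y in zip(preds, labels) if pred == 1 and y == 1)
    fp = sum(1 for pred, y in zip(preds, labels) if pred == 1 and y == 0)
    fn = sum(1 for pred, y in zip(preds, labels) if pred == 0 and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

for threshold in (0.5, 0.25):
    precision, recall = precision_recall_at(threshold)
    print(f"threshold={threshold}: precision={precision:.2f} recall={recall:.2f}")
```

In a screening setting, a lower threshold may be preferable if follow-up tests are cheap relative to a missed diagnosis; that choice is a clinical and cost decision rather than a modeling one.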

### ML Model - 3

In [238]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model

# Initialize the model
xgb = XGBClassifier(max_depth=3, learning_rate=0.1, verbosity=0, eval_metric='auc', use_label_encoder=False, random_state=42)

# Fit the model
xgb.fit(x_train, y_train)

# predicting on test data
y_train_pred_xgb = xgb.predict(x_train)
y_test_pred_xgb = xgb.predict(x_test)

# Training and testing accuracy
train_accuracy = accuracy_score(y_train, y_train_pred_xgb)
test_accuracy = accuracy_score(y_test, y_test_pred_xgb)
# printing training and testing accuracy
print("Training Accuracy:", train_accuracy)
print("Testing Accuracy:", test_accuracy,'\n')

# Get scores - precision, recall, f1 score, roc_auc_score, confusion matrix
# precision,
xgb_tp = precision_score(y_test,y_test_pred_xgb)
# recall,
xgb_tr = recall_score(y_test,y_test_pred_xgb)
# f1 score
xgb_fs = f1_score(y_test,y_test_pred_xgb)
# roc_auc_score
xgb_ras = roc_auc_score(y_test,y_test_pred_xgb)
# confusion matrix
xgb_cm = confusion_matrix(y_test,y_test_pred_xgb)
# Printing all these metrics
print('Precision score of XGBoost model:',xgb_tp)
print('Recall score of XGBoost model:', xgb_tr)
print('F1 score of XGBoost model: ', xgb_fs)
print('ROC AUC score of XGBoost model: ',xgb_ras)
print('Confusion matrix of XGBoost model : \n',xgb_cm)
labels = ['No CHD', 'CHD']

Training Accuracy: 0.8415110725141121
Testing Accuracy: 0.8081597222222222 

Precision score of XGBoost model: 0.7911504424778761
Recall score of XGBoost model: 0.8127272727272727
F1 score of XGBoost model:  0.8017937219730942
ROC AUC score of XGBoost model:  0.8083569918453639
Confusion matrix of XGBoost model : 
 [[484 118]
 [103 447]]


#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [239]:
# Visualizing evaluation Metric Score chart

fig = go.Figure(data=go.Heatmap ( z=xgb_cm, x=labels, y=labels,
    hoverongaps=False, colorscale='Mint', showscale=True ) )

# Adding annotations
for i in range(len(xgb_cm)):
    for j in range(len(xgb_cm[0])):
        fig.add_annotation(
            x=labels[j],
            y=labels[i],
            text=str(xgb_cm[i][j]),
            showarrow=False,
            font=dict(color="white" if xgb_cm[i][j] > xgb_cm.max() / 2 else "black")
        )

fig.update_layout( title="<b>Confusion Matrix - XGBoost", height=500,width=800,
    xaxis_title="Predicted Label", yaxis_title="True Label",
    template='plotly_white',title_x=0.5 )
fig.show()

XGBoost :

XGBoost (Extreme Gradient Boosting) is a powerful ensemble algorithm based on decision trees, optimized for both speed and performance. It uses gradient boosting, meaning each new tree corrects errors made by the previous ones.

* Precision - 79.1% of predicted CHD cases were actually positive
* Recall - 81.3% of actual CHD cases were caught
* F1 score - 0.80, a good balance between precision and recall
* ROC AUC - 0.81, good class separation
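The idea that each new tree corrects the errors of the previous ones can be illustrated with a toy example. The sketch below is not XGBoost: it is a bare-bones gradient boosting loop for squared error on a single feature, using one-split stumps, with made-up data and no regularization. All function names are hypothetical.

```python
def fit_stump(xs, residuals):
    """Best single-split regression stump on a 1-D feature (squared error)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost(xs, ys, n_rounds=5, learning_rate=0.5):
    """Fit stumps sequentially, each one to the residuals left by the ensemble so far."""
    preds = [sum(ys) / len(ys)] * len(xs)  # start from the mean
    mse_history = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
        mse_history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return preds, mse_history

# Made-up data with three plateaus.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9, 5.2, 5.0]
preds, mse_history = boost(xs, ys)
print([round(m, 4) for m in mse_history])
```

The training error shrinks round by round because every stump targets what the current ensemble still gets wrong. `learning_rate` and the tree depth (`max_depth` in the notebook) control how aggressively that happens.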

#### 2. Cross-Validation & Hyperparameter Tuning

In [240]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

# initializing param_grid
param_grid = {
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.3],
    'n_estimators': [50, 100, 150],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0],
    'scale_pos_weight': [1, (len(y_train) - sum(y_train)) / sum(y_train)]  # balance for imbalanced class
}

# initializing grid model and fitting data
xgb_grid_model = GridSearchCV(xgb, param_grid, scoring = 'recall',n_jobs = -1, verbose = 2, cv = 3)
xgb_grid_model.fit(x_train,y_train)

# predicting on test data
y_train_pred_xgbg = xgb_grid_model.predict(x_train)
y_test_pred_xgbg = xgb_grid_model.predict(x_test)

# Training and testing accuracy
train_accuracy = accuracy_score(y_train, y_train_pred_xgbg)
test_accuracy = accuracy_score(y_test, y_test_pred_xgbg)
# printing training and testing accuracy
print("Training Accuracy:", train_accuracy)
print("Testing Accuracy:", test_accuracy,'\n')

# Get scores - precision, recall, f1 score, roc_auc_score, confusion matrix
# precision,
xgbg_tp = precision_score(y_test,y_test_pred_xgbg)
# recall,
xgbg_tr = recall_score(y_test,y_test_pred_xgbg)
# f1 score
xgbg_fs = f1_score(y_test,y_test_pred_xgbg)
# roc_auc_score
xgbg_ras = roc_auc_score(y_test,y_test_pred_xgbg)
# confusion matrix
xgbg_cm = confusion_matrix(y_test,y_test_pred_xgbg)
# Printing all these metrics
print('Precision score of XGBoost grid model:',xgbg_tp)
print('Recall score of XGBoost grid model:', xgbg_tr)
print('F1 score of XGBoost grid model: ', xgbg_fs)
print('ROC AUC score of XGBoost grid model: ',xgbg_ras)
print('Confusion matrix of XGBoost grid model : \n',xgbg_cm)
labels = ['No CHD', 'CHD']

Fitting 3 folds for each of 216 candidates, totalling 648 fits
Training Accuracy: 1.0
Testing Accuracy: 0.8854166666666666 

Precision score of XGBoost grid model: 0.8758992805755396
Recall score of XGBoost grid model: 0.8854545454545455
F1 score of XGBoost grid model:  0.8806509945750453
ROC AUC score of XGBoost grid model:  0.885418302627605
Confusion matrix of XGBoost grid model : 
 [[533  69]
 [ 63 487]]


##### Which hyperparameter optimization technique have you used and why?

GridSearchCV :-

GridSearchCV exhaustively searches over a manually specified grid of hyperparameters and evaluates every combination via cross-validation. The XGBoost grid is larger (216 candidates, 648 fits) but still tractable for an exhaustive search.
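The grid above includes a `scale_pos_weight` candidate computed as `(len(y_train) - sum(y_train)) / sum(y_train)`, that is, the ratio of negative to positive samples. The sketch below evaluates the same expression on two hypothetical label lists to show what it yields before and after balancing.

```python
def scale_pos_weight(labels):
    """Ratio of negative to positive samples, as used in the XGBoost grid above."""
    positives = sum(labels)
    negatives = len(labels) - positives
    return negatives / positives

# Hypothetical label distributions (not the notebook's actual counts).
imbalanced = [0] * 85 + [1] * 15   # roughly 85/15 before any resampling
balanced = [0] * 50 + [1] * 50     # after oversampling the minority class

print(round(scale_pos_weight(imbalanced), 3))
print(scale_pos_weight(balanced))
```

If `y_train` has already been balanced with SMOTE, as the project summary describes, this ratio would be close to 1, so the two `scale_pos_weight` candidates in the grid would be nearly identical.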

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

In [241]:
# Visualizing updated evaluation Metric Score chart

fig = go.Figure(data=go.Heatmap ( z=xgbg_cm, x=labels, y=labels,
    hoverongaps=False, colorscale='Mint', showscale=True ) )

# Adding annotations
for i in range(len(xgbg_cm)):
    for j in range(len(xgbg_cm[0])):
        fig.add_annotation(
            x=labels[j],
            y=labels[i],
            text=str(xgbg_cm[i][j]),
            showarrow=False,
            font=dict(color="white" if xgbg_cm[i][j] > xgbg_cm.max() / 2 else "black")
        )

fig.update_layout( title="<b>Confusion Matrix - XGBoost (tuned)", height=500,width=800,
    xaxis_title="Predicted Label", yaxis_title="True Label",
    template='plotly_white',title_x=0.5 )
fig.show()

* Precision - Improved from 79.1% to 87.6%
* Recall - Improved from 81.3% to 88.5%
* F1 score - Improved from 0.80 to 0.88
* ROC AUC - Improved from 0.81 to 0.89

After tuning, the XGBoost model improved on every metric. It is now one of the strongest models, achieving the best recall, precision, F1, and AUC, which makes it a strong choice when recall and balanced performance are the priority. The training accuracy of 1.0 against a test accuracy of 0.885 does indicate some overfitting, however.

### ML Model - 4

In [242]:
# ML Model - 4 Implementation

# Fit the Algorithm

# Predict on the model

# Initialize the model
dt = DecisionTreeClassifier( criterion='gini', max_depth=5, min_samples_split=10, min_samples_leaf=5, class_weight='balanced', random_state=42 )

# Fit the model
dt.fit(x_train, y_train)

# predicting on test data
y_train_pred_dt = dt.predict(x_train)
y_test_pred_dt = dt.predict(x_test)

# Training and testing accuracy
train_accuracy = accuracy_score(y_train, y_train_pred_dt)
test_accuracy = accuracy_score(y_test, y_test_pred_dt)
# printing training and testing accuracy
print("Training Accuracy:", train_accuracy)
print("Testing Accuracy:", test_accuracy,'\n')

# Get scores - precision, recall, f1 score, roc_auc_score, confusion matrix
# precision,
dt_tp = precision_score(y_test,y_test_pred_dt)
# recall,
dt_tr = recall_score(y_test,y_test_pred_dt)
# f1 score
dt_fs = f1_score(y_test,y_test_pred_dt)
# roc_auc_score
dt_ras = roc_auc_score(y_test,y_test_pred_dt)
# confusion matrix
dt_cm = confusion_matrix(y_test,y_test_pred_dt)
# Printing all these metrics
print('Precision score of Decision Tree model:',dt_tp)
print('Recall score of Decision Tree model:',dt_tr)
print('F1 score of Decision Tree model: ', dt_fs)
print('ROC AUC score of Decision Tree model: ',dt_ras)
print('Confusion matrix of Decision Tree model : \n',dt_cm)
labels = ['No CHD', 'CHD']

Training Accuracy: 0.7073382544507164
Testing Accuracy: 0.6848958333333334 

Precision score of Decision Tree model: 0.6896551724137931
Recall score of Decision Tree model: 0.6181818181818182
F1 score of Decision Tree model:  0.6519654841802492
ROC AUC score of Decision Tree model:  0.6820144971307762
Confusion matrix of Decision Tree model : 
 [[449 153]
 [210 340]]


#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [243]:
# Visualizing evaluation Metric Score chart

fig = go.Figure(data=go.Heatmap ( z=dt_cm, x=labels, y=labels,
    hoverongaps=False, colorscale='Mint', showscale=True ) )

# Adding annotations
for i in range(len(dt_cm)):
    for j in range(len(dt_cm[0])):
        fig.add_annotation(
            x=labels[j],
            y=labels[i],
            text=str(dt_cm[i][j]),
            showarrow=False,
            font=dict(color="white" if dt_cm[i][j] > dt_cm.max() / 2 else "black")
        )

fig.update_layout( title="<b>Confusion Matrix - Decision Tree", height=500,width=800,
    xaxis_title="Predicted Label", yaxis_title="True Label",
    template='plotly_white',title_x=0.5 )
fig.show()

Decision Tree Classifier :

A Decision Tree is a flowchart-like structure in which internal nodes represent decisions based on features, branches represent outcomes, and leaves represent class predictions. It is interpretable, quick to train, and handles non-linear patterns well, which makes it a good starting point for predictive modeling.

* Precision - ~69% of predicted CHD cases were actually correct
* Recall - Only 61.8% of actual CHD cases were caught (210 of 550 were missed)
* F1 score - 0.65, a fair balance of precision and recall
* ROC AUC - 0.68, some class separation ability, with room for improvement
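A decision tree picks its splits by reducing node impurity. The model above uses `criterion='gini'`, and the tuning grid also tries `'entropy'`. The sketch below computes both impurity measures for a hypothetical node and one candidate split; the class counts are invented for illustration.

```python
from math import log2

def gini(counts):
    """Gini impurity of a node given its per-class sample counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def entropy(counts):
    """Shannon entropy (in bits) of a node given its per-class sample counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def weighted_impurity(children, impurity):
    """Impurity of a split: child impurities weighted by child size."""
    total = sum(sum(child) for child in children)
    return sum(sum(child) / total * impurity(child) for child in children)

# Hypothetical node with counts [No CHD, CHD] and one candidate split.
parent = [50, 50]
children = [[40, 10], [10, 40]]

gini_gain = gini(parent) - weighted_impurity(children, gini)
entropy_gain = entropy(parent) - weighted_impurity(children, entropy)
print(f"gini gain={gini_gain:.4f} entropy gain={entropy_gain:.4f}")
```

The tree evaluates many candidate splits like this one and keeps the split with the largest impurity reduction, subject to constraints such as `max_depth`, `min_samples_split`, and `min_samples_leaf`.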

#### 2. Cross-Validation & Hyperparameter Tuning

In [244]:
# ML Model - 4 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

# initializing param_grid
param_grid = { 'max_depth': [3, 5, 7, 9], 'min_samples_split': [2, 5, 10], 'min_samples_leaf': [1, 3, 5], 'criterion': ['gini', 'entropy']}

# initializing grid model and fitting data
dt_grid_model = GridSearchCV(dt, param_grid, scoring = 'recall',n_jobs = -1, verbose = 2, cv = 3)
dt_grid_model.fit(x_train,y_train)

# predicting on test data
y_train_pred_dtg = dt_grid_model.predict(x_train)
y_test_pred_dtg = dt_grid_model.predict(x_test)

# Training and testing accuracy
train_accuracy = accuracy_score(y_train, y_train_pred_dtg)
test_accuracy = accuracy_score(y_test, y_test_pred_dtg)
# printing training and testing accuracy
print("Training Accuracy:", train_accuracy)
print("Testing Accuracy:", test_accuracy,'\n')

# Get scores - precision, recall, f1 score, roc_auc_score, confusion matrix
# precision,
dtg_tp = precision_score(y_test,y_test_pred_dtg)
# recall,
dtg_tr = recall_score(y_test,y_test_pred_dtg)
# f1 score
dtg_fs = f1_score(y_test,y_test_pred_dtg)
# roc_auc_score
dtg_ras = roc_auc_score(y_test,y_test_pred_dtg)
# confusion matrix
dtg_cm = confusion_matrix(y_test,y_test_pred_dtg)
# Printing all these metrics
print('Precision score of Decision Tree grid model:',dtg_tp)
print('Recall score of Decision Tree grid model:', dtg_tr)
print('F1 score of Decision Tree grid model: ', dtg_fs)
print('ROC AUC score of Decision Tree grid model: ',dtg_ras)
print('Confusion matrix of Decision Tree grid model : \n',dtg_cm)
labels = ['No CHD', 'CHD']

Fitting 3 folds for each of 72 candidates, totalling 216 fits
Training Accuracy: 0.8017802865827182
Testing Accuracy: 0.7361111111111112 

Precision score of Decision Tree grid model: 0.7383720930232558
Recall score of Decision Tree grid model: 0.6927272727272727
F1 score of Decision Tree grid model:  0.7148217636022514
ROC AUC score of Decision Tree grid model:  0.7342373905164603
Confusion matrix of Decision Tree grid model : 
 [[467 135]
 [169 381]]


##### Which hyperparameter optimization technique have you used and why?

GridSearchCV :-

GridSearchCV exhaustively searches over a manually specified grid of hyperparameters and evaluates every combination via cross-validation. The Decision Tree grid has 72 candidates (216 fits), which is quick to search exhaustively.
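Once every candidate has been scored on each fold, the search keeps the candidate with the best mean score. Since `scoring = 'recall'` is used here, that score is recall. The sketch below mimics this selection step with invented fold scores for three candidates; the numbers are hypothetical and do not come from the actual search.

```python
# Hypothetical per-fold recall for three (criterion, max_depth) candidates with cv=3.
cv_recall = {
    ("gini", 3): [0.60, 0.62, 0.61],
    ("gini", 7): [0.70, 0.66, 0.68],
    ("entropy", 7): [0.69, 0.70, 0.62],
}

# Average the fold scores, then pick the candidate with the highest mean recall.
mean_recall = {params: sum(scores) / len(scores) for params, scores in cv_recall.items()}
best_params = max(mean_recall, key=mean_recall.get)

print(best_params, round(mean_recall[best_params], 3))
```

In the notebook, the corresponding values are available after fitting as `dt_grid_model.best_params_` and `dt_grid_model.best_score_`. Because only recall is optimized, the selected candidate is not guaranteed to be the best on precision or F1.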

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

In [245]:
# Visualizing updated evaluation Metric Score chart

fig = go.Figure(data=go.Heatmap ( z=dtg_cm, x=labels, y=labels,
    hoverongaps=False, colorscale='Mint', showscale=True ) )

# Adding annotations
for i in range(len(dtg_cm)):
    for j in range(len(dtg_cm[0])):
        fig.add_annotation(
            x=labels[j],
            y=labels[i],
            text=str(dtg_cm[i][j]),
            showarrow=False,
            font=dict(color="white" if dtg_cm[i][j] > dtg_cm.max() / 2 else "black")
        )

fig.update_layout( title="<b>Confusion Matrix - Decision Tree (Tuned)", height=500,width=800,
    xaxis_title="Predicted Label", yaxis_title="True Label",
    template='plotly_white',title_x=0.5 )
fig.show()

* Precision - Improved from 69.0% to 73.8%, with fewer false positives
* Recall - Improved from 61.8% to 69.3%, catching more real CHD cases
* F1 score - Improved from 0.65 to 0.71
* ROC AUC - Improved from 0.68 to 0.73

After tuning, the Decision Tree model became more accurate, more sensitive, and overall more dependable for CHD prediction, although it is still not as strong as the tuned Random Forest or XGBoost models.
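To put the four tuned models side by side, the sketch below recomputes precision, recall, and F1 from the tuned confusion matrices printed in this notebook and ranks the models by recall. Only the matrices are taken from the outputs above; the helper name `summarize` is hypothetical.

```python
# Tuned-model confusion matrices copied from the outputs above ([No CHD, CHD] order).
tuned_cms = {
    "Logistic Regression": [[351, 251], [154, 396]],
    "Random Forest": [[492, 110], [72, 478]],
    "XGBoost": [[533, 69], [63, 487]],
    "Decision Tree": [[467, 135], [169, 381]],
}

def summarize(cm):
    """Precision, recall, F1 and missed CHD cases from a 2x2 confusion matrix."""
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1, "false_negatives": fn}

summary = {name: summarize(cm) for name, cm in tuned_cms.items()}
ranked = sorted(summary, key=lambda name: summary[name]["recall"], reverse=True)

for name in ranked:
    s = summary[name]
    print(f"{name:20s} recall={s['recall']:.3f} precision={s['precision']:.3f} "
          f"f1={s['f1']:.3f} missed={s['false_negatives']}")
```

By recall, the order is XGBoost, Random Forest, Logistic Regression, Decision Tree. The tuned Decision Tree misses 169 of 550 CHD cases, compared with 63 for the tuned XGBoost.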

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

* Precision -

Measures the accuracy of positive predictions. It is important because too many false positives (low precision) can cause unnecessary stress and testing, waste medical resources, and increase costs.

* Recall -

Measures the ability of the model to correctly identify all actual positive cases. It is important because, in a healthcare setting, missing a true CHD patient (a false negative) could lead to a missed diagnosis, delayed treatment, and severe consequences.

* F1 score -

The harmonic mean of precision and recall. Because the original dataset is imbalanced (fewer CHD cases), the F1 score gives a balanced view of how well the model performs on the minority class (CHD) without favoring only precision or recall.

* ROC AUC score -

Measures the model’s ability to differentiate between classes across all thresholds. When computed from predicted probabilities it is threshold-independent, which makes it a useful global metric: even if precision and recall fluctuate at a given threshold, AUC reflects the underlying separation power.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

**XGBoost (After Tuning)**

Although the training accuracy of 100% is a sign of overfitting, the model still achieves the best test performance across all metrics, which suggests that it generalizes reasonably well on this split. Its performance should still be monitored on new, real-world data.

In a healthcare setting, recall is crucial: we need to catch as many true CHD patients as possible, and XGBoost has the highest recall of the models tested.

After evaluating multiple machine learning models, the tuned XGBoost classifier was selected as the final model. It achieved the highest testing accuracy (88.5%), F1 score (0.88), and recall (0.885), with the fewest false negatives (63), making it the most effective choice among the models compared for predicting Coronary Heart Disease (CHD). Its strong performance on both precision and recall supports its use for CHD risk prediction.
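Since overfitting is part of the selection argument, it helps to compare the train and test accuracy gap across models. The sketch below uses only the accuracies printed in the outputs above; the baseline logistic regression is omitted because its output is not shown in this section.

```python
# (training accuracy, testing accuracy) copied from the outputs above.
accuracies = {
    "Logistic Regression (tuned)": (0.6732522796352584, 0.6484375),
    "Random Forest": (0.9103343465045592, 0.8055555555555556),
    "Random Forest (tuned)": (0.9622231871471993, 0.8420138888888888),
    "XGBoost": (0.8415110725141121, 0.8081597222222222),
    "XGBoost (tuned)": (1.0, 0.8854166666666666),
    "Decision Tree": (0.7073382544507164, 0.6848958333333334),
    "Decision Tree (tuned)": (0.8017802865827182, 0.7361111111111112),
}

# A larger gap between training and testing accuracy suggests more overfitting.
gaps = {name: train - test for name, (train, test) in accuracies.items()}

for name, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    train, test = accuracies[name]
    print(f"{name:28s} train={train:.3f} test={test:.3f} gap={gap:.3f}")
```

The tuned Random Forest and tuned XGBoost show the largest gaps (about 0.12 and 0.11), while the tuned XGBoost still has the highest test accuracy. That pattern is consistent with the recommendation to keep monitoring the model on unseen data.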

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

XGBoost (Extreme Gradient Boosting) is a high-performance, scalable implementation of gradient boosting that builds decision trees sequentially, each one learning from the errors of the previous ones. It is popular because:
* It is fast, thanks to parallelized tree construction
* It has built-in regularization
* It handles missing values natively and supports class weighting for imbalanced data (for example via `scale_pos_weight`)
* It can overfit if not tuned, but is very powerful when controlled

In [247]:
# Use best estimator from GridSearchCV
best_xgb_model = xgb_grid_model.best_estimator_

# Create SHAP explainer
explainer = shap.Explainer(best_xgb_model, x_train_df)

# Calculate SHAP values on test data
shap_values = explainer(x_test_df)

# Convert SHAP values to DataFrame
shap_df = pd.DataFrame(shap_values.values, columns=x_train_df.columns)

# Calculate mean absolute SHAP values
mean_abs_shap = shap_df.abs().mean().sort_values(ascending=True).reset_index()
mean_abs_shap.columns = ['Feature', 'Mean |SHAP Value|']

# Plot using Plotly
fig = px.bar(
    mean_abs_shap,
    x='Mean |SHAP Value|',
    y='Feature',
    orientation='h',
    title='XGBoost Feature Importance (SHAP)',
    color='Mean |SHAP Value|',
    color_continuous_scale='Mint'
)

fig.update_layout(title_x=0.5, yaxis=dict(categoryorder='total ascending'))
fig.show()




**SHAP** was used to explain the tuned XGBoost model’s decisions. Key features influencing CHD predictions were age, cigarettes per day, and systolic blood pressure. SHAP helped visualize both the direction and magnitude of each feature's contribution to individual predictions, enhancing model interpretability in a healthcare context.
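For readers unfamiliar with what a SHAP value represents, the sketch below computes exact Shapley values for a toy three-feature risk score by averaging each feature's marginal contribution over all feature orderings. The score function, its weights, and the feature names are hypothetical, and the brute-force enumeration is for illustration only; the `shap` library uses much faster algorithms for tree models.

```python
from itertools import permutations

def shapley_values(features, value):
    """Exact Shapley values: mean marginal contribution over all feature orderings."""
    totals = {feature: 0.0 for feature in features}
    orderings = list(permutations(features))
    for order in orderings:
        present = set()
        for feature in order:
            before = value(frozenset(present))
            present.add(feature)
            after = value(frozenset(present))
            totals[feature] += after - before
    return {feature: total / len(orderings) for feature, total in totals.items()}

def toy_risk(present):
    """Hypothetical risk score: additive effects plus an age x smoking interaction."""
    score = 0.0
    if "age" in present:
        score += 0.30
    if "cigs_per_day" in present:
        score += 0.10
    if "sys_bp" in present:
        score += 0.20
    if {"age", "cigs_per_day"} <= present:
        score += 0.10  # interaction term, shared equally between the two features
    return score

features = ["age", "cigs_per_day", "sys_bp"]
phi = shapley_values(features, toy_risk)
print({name: round(val, 3) for name, val in phi.items()})
```

The contributions sum to the difference between the full score and the empty baseline, which is the property that lets SHAP decompose a single prediction into per-feature effects. The bar chart above then averages the absolute values of those effects over the test set.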

# **Conclusion**

The aim of this project was to develop a machine learning model capable of predicting the risk of Coronary Heart Disease (CHD) based on individual health and lifestyle factors.

On this dataset, extensive preprocessing was performed, including handling missing values, scaling features, and balancing the dataset with SMOTE to address class imbalance. Several classification models were implemented and compared, including Logistic Regression, Decision Tree, Random Forest, and XGBoost, with both baseline and hyperparameter-tuned versions evaluated.

After a thorough analysis, the tuned XGBoost model emerged as the best-performing model, achieving a testing accuracy of 88.5%, precision of 87.6%, recall of 88.5%, F1 score of 88.0%, and ROC AUC of 88.5%. Its ability to minimize false negatives made it particularly suitable for the healthcare domain, where missing a CHD case could be critical.

To interpret the model’s predictions and ensure transparency, SHAP (SHapley Additive exPlanations) was used to analyze feature importance. The SHAP summary revealed that age, education, cigarettes per day, heart rate, systolic blood pressure, sex, glucose levels, diastolic blood pressure, BMI, and total cholesterol were among the most influential features contributing to CHD risk.

These insights not only reinforced the model’s reliability but also aligned well with established medical understanding. The project successfully demonstrated how machine learning, when combined with interpretability tools, can offer powerful and trustworthy solutions in the healthcare space. Overall, the tuned XGBoost model proved to be both accurate and explainable, making it an excellent candidate for real-world applications in early CHD detection and preventive care.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***