<a href="https://colab.research.google.com/github/AkshayAI007/Cardiovascular-disease-risk-prediction-using-Machine-learning/blob/main/Cardiovascular_Risk_Prediction_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Cardiovascular Risk Prediction



##### **Project Type**    - Classification
##### **Contribution**    - Individual
##### **Team Member 1 -** Akshay Bawaliwale


# **Project Summary -**

Cardiovascular disease is a leading cause of death worldwide, and early prediction of cardiovascular risk can help in timely intervention and prevention of the disease. Machine learning techniques have shown promising results in predicting cardiovascular risk by analyzing various risk factors.

The goal of this project is to develop a machine learning model to predict the 10-year risk of cardiovascular disease in individuals using a dataset of demographic, clinical, and laboratory data.

The dataset used in this project is the Framingham Heart Study dataset, which is a widely used dataset for cardiovascular risk prediction. It contains data on 3,390 participants, who were followed up for ten years to track cardiovascular events. The dataset includes 17 variables such as age, sex, blood pressure, cholesterol levels, smoking status, and diabetes status.

The first step in this project is to perform data preprocessing, which includes handling missing values, encoding categorical variables, and scaling numerical variables. After preprocessing, the dataset is split into training and testing sets using a 70:20 ratio.

Several machine learning algorithms are applied to the training data: logistic regression, KNN, XGBoost, SVC, random forest, and a Naive Bayes classifier. These algorithms were chosen because they have been shown to perform well in cardiovascular risk prediction. Each is trained on the training set, and its performance is evaluated on the testing set.

The evaluation metrics used in this project include accuracy, precision, recall, and area under the receiver operating characteristic curve (AUC-ROC). These metrics help in assessing the performance of the machine learning model.
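
To make these four metrics concrete, here is a minimal hand-rolled sketch on invented labels and scores (they are illustrative only, not results from this project). It assumes a 0.5 decision threshold and computes AUC-ROC as the fraction of (positive, negative) pairs in which the positive case receives the higher score, with ties counted as half.

```python
# Toy ground truth and predicted probabilities, invented for illustration
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_score = [0.9, 0.2, 0.6, 0.4, 0.5, 0.1, 0.8, 0.3]

# Turn scores into hard predictions with an assumed 0.5 threshold
y_pred = [1 if s >= 0.5 else 0 for s in y_score]

# Confusion-matrix counts
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

accuracy = (tp + tn) / len(y_true)   # share of all predictions that are correct
precision = tp / (tp + fp)           # share of predicted positives that are real
recall = tp / (tp + fn)              # share of real positives that are found

# AUC-ROC as a pairwise ranking probability (ties count as half)
pos = [s for t, s in zip(y_true, y_score) if t == 1]
neg = [s for t, s in zip(y_true, y_score) if t == 0]
auc = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg) / (len(pos) * len(neg))

print(accuracy, precision, recall, auc)
```

In the notebook itself, the `accuracy_score`, `precision_score`, `recall_score`, and `roc_auc_score` functions from `sklearn.metrics` compute the same quantities.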

The results show that XGBoost performs best, with an accuracy of 0.89, precision of 0.92, recall of 0.85, and AUC-ROC of 0.89. This indicates good overall performance in predicting cardiovascular risk.

Further analysis identifies the most important features in the dataset. The feature importance plot shows that age, education, prevalentHyp, and cigarettes per day are the most important features for predicting cardiovascular risk. This information can help in identifying high-risk individuals and implementing preventive measures.

In conclusion, this project demonstrates the effectiveness of machine learning techniques in predicting cardiovascular risk using the Framingham Heart Study dataset. The developed machine learning model can be used by healthcare professionals to identify individuals at high risk of cardiovascular disease and take preventive measures to reduce the risk.

# **GitHub Link -**

https://github.com/AkshayAI007/Cardiovascular-disease-risk-prediction-using-Machine-learning.git

# **Problem Statement**


**Cardiovascular disease is a major cause of morbidity and mortality worldwide. Early identification and management of individuals at high risk of developing cardiovascular disease is crucial for the prevention of the disease. Traditional risk prediction models, such as the Framingham Risk Score, have limitations in their accuracy and do not account for the complex interactions between various risk factors. Machine learning techniques have shown promising results in improving the accuracy of cardiovascular risk prediction by integrating various risk factors and identifying non-linear interactions. However, there is a need for developing and validating machine learning models that can accurately predict cardiovascular risk using demographic, clinical, and laboratory data. The goal of this project is to address this need by developing and evaluating a machine learning model for predicting the 10-year risk of cardiovascular disease using the Framingham Heart Study dataset.**

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that, for each and every chart, the following format is answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
*   Will the gained insights help creating a positive business impact? Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML model used and its performance using an evaluation metric score chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with the updated evaluation metric score chart.

*   Explain what each evaluation metric indicates for the business and the business impact of the ML model used.


# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
#compatible versions of modules
!sudo apt-get install python3.9
!pip install scikit-learn==1.1.2

In [None]:
# Import Libraries

## Data Manipulation Libraries
import numpy as np
import pandas as pd

## Data Visualisation Libraries
import matplotlib.pyplot as plt
%matplotlib inline
import pylab
import seaborn as sns

## Machine Learning
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from xgboost import XGBClassifier
from sklearn.svm import SVC

## Importing essential libraries to check the accuracy
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.metrics import roc_auc_score, classification_report
from sklearn.metrics import plot_confusion_matrix
from sklearn.metrics import roc_curve, auc
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.metrics import plot_precision_recall_curve, plot_roc_curve

## Warnings
import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Mount Google Drive to access the dataset
from google.colab import drive
drive.mount('/content/drive')

# Load Dataset
path='/content/drive/MyDrive/Projects/Cardiovascular_disease_risk_prediction/data_cardiovascular_risk.csv'
df = pd.read_csv(path)

### Dataset First View

In [None]:
# Dataset First Look
df.head()

In [None]:
#Last 5 entries
df.tail()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print("Dataset Size")
print("Rows = {} and  Columns = {}".format(df.shape[0], df.shape[1]))

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
len(df[df.duplicated()])

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print(df.isnull().sum())

In [None]:
# Visualizing the missing values
sns.heatmap(df.isnull(), cbar = False)

### What did you know about your dataset?

The dataset covers a range of risk factors that can influence an individual's likelihood of developing cardiovascular disease: age, sex, blood pressure, cholesterol levels, smoking habits, and medical history, along with variables such as body mass index and diabetes status. Several columns contain missing values, most notably glucose and education.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe()

### Variables Description

**Demographic:**

1) Age: Age of the patient.

2) Sex: male or female("M" or "F")

**Behavioral:**

3) is_smoking: whether or not the patient is a current smoker ("YES" or "NO").

4) CigsPerDay: the number of cigarettes the person smoked on average in one day (treated as a continuous feature, since a person can smoke any number of cigarettes a day).

**Medical(history):**

5) BPMeds: whether or not the patient was on blood pressure medication.

6) Prevalent Stroke: whether or not the patient had previously had a stroke.

7) Prevalent Hyp: whether or not the patient was hypertensive.

8) Diabetes: whether or not the patient had diabetes.

**Medical(current):**

9) Tot Chol: total cholesterol level.

10) Sys BP: systolic blood pressure.

11) Dia BP: diastolic blood pressure.

12) BMI: Body Mass Index.

13) Heart Rate: heart rate.

14) Glucose: glucose level.

**Target feature(class of risk):**

15) TenYearCHD: 10-year risk of coronary heart disease CHD (“1”, means “Yes”, “0” means “No”)

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for i in df.columns.tolist():
  print("No. of unique values in ", i , "is" , df[i].nunique(), ".")

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# Separate the categorical and continuous variables and store them in lists
categorical_variable=[]
continous_variable=[]

for i in df.columns:
  if i == 'id':
    pass
  elif df[i].nunique() <5:
    categorical_variable.append(i)
  elif df[i].nunique() >= 5:
    continous_variable.append(i)

print(categorical_variable)
print(continous_variable)

In [None]:
# Summing null values
print('Missing Data Count')
df.isna().sum()[df.isna().sum() > 0].sort_values(ascending=False)

In [None]:
print('Missing Data Percentage')
print(round(df.isna().sum()[df.isna().sum() > 0].sort_values(ascending=False)/len(df)*100,2))

In [None]:
# storing the column that contains null values
null_column_list= ['glucose','education','BPMeds','totChol','cigsPerDay','BMI','heartRate']
# plotting box plot
plt.figure(figsize=(15,8))
df[null_column_list].boxplot()

In [None]:
# Define a list of colors
colors = sns.color_palette("husl", len(null_column_list))

# Create a figure with 8 subplots (2 rows, 4 columns)
fig, axes = plt.subplots(nrows=2, ncols=4, figsize=(16, 8))

# Flatten the axes array to make it easier to iterate over
axes = axes.flatten()

# Iterate over the null column list and plot each column's distribution
for i, column in enumerate(null_column_list):
    # Select the current axis
    ax = axes[i]
    # Plot a histogram with a KDE overlay in a distinct color
    # (histplot replaces distplot, which is deprecated in recent seaborn releases)
    sns.histplot(df[column], kde=True, ax=ax, color=colors[i])
    # Add a title to the plot
    ax.set_title(column)

# Remove any unused subplots
for j in range(len(null_column_list), len(axes)):
    axes[j].remove()

# Display the plots
plt.show()

It is a well-known fact that the appropriate measure of central tendency depends on the nature of the data. Typically, the mean is used for data that follows a normal distribution and does not contain any outliers. On the other hand, when dealing with numerical, continuous data that contains extreme values or outliers, the median is the preferred measure of central tendency. For categorical data, the mode is used.

Based on the outliers and distribution of the data, we have determined that the following measures of central tendency are appropriate for imputing the null values in the following columns:

**"education", "BPMeds"** -> mode: Since "education" and "BPMeds" are categorical variables, the mode is the most appropriate measure of central tendency. The mode is the most frequently occurring value in the distribution, for example the most common level of education in the dataset.

**"glucose", "totChol", "cigsPerDay", "BMI", "heartRate"** -> median: Since these are numerical, continuous variables that contain extreme values or outliers, we chose the median. The median is less sensitive to extreme values than the mean and gives a representative value for the center of the distribution.
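
As a small illustration of why the median is the safer fill value when outliers are present, the sketch below uses an invented glucose-like series with one extreme reading and an invented education-like series (neither comes from the project data).

```python
import pandas as pd

# Invented readings: four typical values, one extreme value, two gaps
glucose_toy = pd.Series([78.0, 80.0, 82.0, 85.0, 394.0, None, None])

mean_value = glucose_toy.mean()      # pulled upward by the single extreme reading
median_value = glucose_toy.median()  # stays close to the bulk of the data
print(mean_value, median_value)

# Fill the gaps with the median, as done for the continuous columns
glucose_filled = glucose_toy.fillna(median_value)

# For a categorical column, the mode (most frequent value) is the fill value
education_toy = pd.Series([1.0, 1.0, 2.0, 3.0, None])
education_filled = education_toy.fillna(education_toy.mode()[0])
```

Here the single extreme reading drags the mean well above every typical value, while the median remains among them, so a median fill does not plant artificial high readings in the gaps.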

In [None]:
# Imputing missing values with median or mode
df.fillna({'glucose': df['glucose'].median(),
           'education': df['education'].mode()[0],
           'BPMeds': df['BPMeds'].mode()[0],
           'totChol': df['totChol'].median(),
           'cigsPerDay': df['cigsPerDay'].median(),
           'BMI': df['BMI'].median(),
           'heartRate': df['heartRate'].median()}, inplace=True)

### What all manipulations have you done and insights you found?

We addressed the issue of missing data by employing a combined approach of imputation using median and mode values. Specifically, for the glucose and totChol columns, as well as cigsPerDay, BMI, and heartRate, we substituted the missing values with the median of the available non-missing values. Conversely, for the education and BPMeds columns, we filled in missing values with the mode, which represents the most frequently occurring value among the non-missing data points.

These imputation methods, using median and mode values, are widely recognized and commonly employed to handle missing data. Median imputation is preferred for continuous variables because it is more robust to outliers than mean imputation. Mode imputation is commonly used for categorical variables or discrete variables with a limited number of possible values.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### **Chart - 1**
 **Which age group is more susceptible to developing coronary heart disease?**

In [None]:
# Chart - 1 visualization code

# Set the figure size
fig, ax = plt.subplots(figsize=(10, 10))
# Create a boxplot to compare the age distribution of patients by sex and CHD risk level
sns.boxplot(x="sex", y="age", hue="TenYearCHD", data= df, ax=ax)
# Set the title and labels
ax.set_title("Age Distribution of Patients by Sex and CHD Risk Level")
ax.set_xlabel("Sex")
ax.set_ylabel("Age")
# Adding a legend with appropriate labels
handles, labels = ax.get_legend_handles_labels()
ax.legend(handles, ["No Risk", "At Risk"], loc="best")
# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?

This boxplot visualizes the age distribution of patients by sex and CHD (coronary heart disease) risk level. We chose it to see how age, sex, and CHD risk level relate to one another in this dataset.

##### 2. What is/are the insight(s) found from the chart?

There is a noticeable difference in the age distribution of patients who are at risk for CHD compared to those who are not at risk. Patients at risk for CHD tend to be older than those who are not at risk, regardless of sex.
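
The visual impression from the boxplot can be backed by a quick numeric check. The sketch below is a minimal example on an invented toy frame that reuses the dataset's column names (`sex`, `age`, `TenYearCHD`); the values are made up, so only the pattern of the computation carries over.

```python
import pandas as pd

# Invented toy data with the same column names as the dataset
toy = pd.DataFrame({
    "sex": ["M", "M", "M", "M", "F", "F", "F", "F"],
    "age": [45, 50, 60, 64, 40, 48, 58, 62],
    "TenYearCHD": [0, 0, 1, 1, 0, 0, 1, 1],
})

# Median age per sex and risk class: rows are sex, columns are TenYearCHD
median_age = (
    toy.groupby(["sex", "TenYearCHD"])["age"]
    .median()
    .unstack("TenYearCHD")
)
print(median_age)

# True for a sex when its at-risk group has the higher median age
older_when_at_risk = median_age[1] > median_age[0]
```

Running the same `groupby` on `df` would show whether the at-risk group is older for both sexes, as the boxplot suggests.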

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The information derived from this chart could prove valuable for businesses operating within the healthcare industry. For instance, companies specializing in the manufacture of medical equipment or medications for coronary heart disease (CHD) might contemplate tailoring their products towards older patients or individuals with a heightened risk of CHD. Nonetheless, it is crucial to emphasize that this chart in isolation may lack the depth required for making informed business decisions. A more comprehensive analysis is necessary to gain a thorough understanding of the interplay between age, gender, CHD risk, and other pertinent variables.
It's important to highlight that there are no indications of adverse growth trends evident in this chart.

#### **Chart - 2**
**Does gender affect the risk of coronary heart disease in the dataset?**

In [None]:
# Chart - 2 visualization code

plt.figure(figsize=(8,8))
sns.countplot(x='sex', hue='TenYearCHD', data= df)
plt.title('Frequency of CHD cases by gender')
plt.legend(['No Risk', 'At Risk'])
plt.show()

##### 1. Why did you pick the specific chart?

This countplot visualizes the frequency of CHD (coronary heart disease) cases by gender. We chose it to investigate whether gender affects the risk of CHD in this dataset.

##### 2. What is/are the insight(s) found from the chart?

The chart shows more CHD cases among men than among women in the dataset, although the difference is not drastic. It also shows more no-risk cases among women than among men.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The findings derived from this chart hold potential value for healthcare service and product providers. For instance, businesses involved in the manufacturing of medical devices or medications for coronary heart disease (CHD) might find it advantageous to target both genders. However, a more substantial emphasis on men may be warranted, given their seemingly higher risk for CHD as indicated by this dataset.

#### **Chart - 3**

**Do smokers have a higher risk of developing coronary heart disease?**

In [None]:
# Chart - 3 visualization code

plt.figure(figsize=(8,8))
sns.countplot(x='is_smoking', hue='TenYearCHD', data= df)
plt.title('A Comparison of Smokers and Non-Smokers')
plt.legend(['No Risk', 'At Risk'])
plt.show()

##### 1. Why did you pick the specific chart?

This countplot visualizes the frequency of CHD (coronary heart disease) cases among smokers and non-smokers. We chose it to see how smoking relates to the risk of CHD in this dataset.

##### 2. What is/are the insight(s) found from the chart?

Within this dataset, smokers appear to have a higher risk of coronary heart disease (CHD) than non-smokers: a greater percentage of smokers than non-smokers are identified as being at risk. This suggests that smoking may influence CHD risk in this dataset.
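
Because a countplot shows raw counts, a percentage claim is easier to verify with a per-group rate. The sketch below is a minimal example on invented data using the dataset's `is_smoking` and `TenYearCHD` column names; the mean of a 0/1 target within a group is that group's at-risk share.

```python
import pandas as pd

# Invented toy data: five smokers and five non-smokers
toy = pd.DataFrame({
    "is_smoking": ["YES"] * 5 + ["NO"] * 5,
    "TenYearCHD": [1, 1, 0, 0, 0, 1, 0, 0, 0, 0],
})

# Count of at-risk patients, group size, and at-risk share per smoking status
summary = toy.groupby("is_smoking")["TenYearCHD"].agg(
    at_risk="sum", total="count", rate="mean"
)
print(summary)
```

Applied to `df`, the `rate` column gives the share of smokers and non-smokers at risk, which is the quantity the insight above relies on.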

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This chart does not reveal any indications of adverse growth trends. Its focus is solely on depicting the incidence of coronary heart disease (CHD) cases among both smokers and non-smokers, omitting insights into other potentially pertinent factors like age or various lifestyle variables. Furthermore, it's worth noting that the dataset's representativeness may be limited, which could curtail the applicability of the insights gleaned from this chart to the broader population.

#### **Chart - 4**
**How much does smoking affect coronary heart disease?**

In [None]:
# Chart - 4 visualization code
plt.figure(figsize=(15,10))
sns.countplot(x= df['cigsPerDay'],hue= df['TenYearCHD'])
plt.title('How much does smoking affect CHD?')
plt.legend(['No Risk','At Risk'])
plt.show()

##### 1. Why did you pick the specific chart?

This countplot shows how daily cigarette consumption relates to the risk of coronary heart disease (CHD) in this dataset. We chose it to better understand the potential link between smoking intensity and CHD risk.

##### 2. What is/are the insight(s) found from the chart?

The chart suggests that individuals who either smoke many cigarettes per day or do not smoke at all face a heightened risk of coronary heart disease (CHD) compared with those who smoke fewer cigarettes daily. Concretely, a larger percentage of individuals who smoke 20 or more cigarettes per day are identified as being at risk for CHD than of those who smoke fewer. These observations imply that smoking intensity may influence CHD risk within this dataset.
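
With one bar per distinct `cigsPerDay` value, percentages are hard to read off the countplot. One option is to group consumption into bands and compute the at-risk share per band. The sketch below uses invented data, and the band edges (none, 1-19, 20+) are an assumption chosen to match the "20 or more" threshold mentioned above.

```python
import pandas as pd

# Invented toy data with the dataset's column names
toy = pd.DataFrame({
    "cigsPerDay": [0, 0, 0, 0, 5, 10, 15, 20, 30, 40],
    "TenYearCHD": [0, 0, 0, 1, 0, 0, 1, 1, 0, 1],
})

# Assumed bands: (-1, 0] = none, (0, 19] = 1-19, (19, 100] = 20+
toy["smoking_band"] = pd.cut(
    toy["cigsPerDay"], bins=[-1, 0, 19, 100], labels=["none", "1-19", "20+"]
)

# Group size and at-risk share within each band
summary = toy.groupby("smoking_band", observed=True)["TenYearCHD"].agg(["count", "mean"])
print(summary)
```

Reporting the group size next to the rate matters here, because bands with few patients produce unstable percentages.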

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Businesses involved in the production of smoking cessation products or medications for coronary heart disease (CHD) may find it advantageous to contemplate a focus on heavy smokers, given their apparent elevated risk for CHD within this dataset.

#### **Chart - 5**
**Do patients taking medication for blood pressure have a higher risk of developing coronary heart disease?**


In [None]:
# Chart - 5 visualization code

# Compute the cross-tabulation of BP medication and CHD risk
ct = pd.crosstab(df['BPMeds'], df['TenYearCHD'], normalize='index')
# Plot a stacked bar chart
ct.plot(kind='bar', stacked=True, figsize=(8, 8))
plt.title('Relationship between BP Medication and CHD Risk')
plt.xlabel('BP Medication')
plt.xticks(rotation=0)
plt.ylabel('Proportion')
plt.legend(['No Risk', 'At Risk'])
plt.show()

##### 1. Why did you pick the specific chart?

This stacked bar chart shows the relationship between patients' use of blood pressure medication and their risk of coronary heart disease (CHD). We chose it to explore the potential association between blood pressure medication and CHD risk in this dataset.

##### 2. What is/are the insight(s) found from the chart?

The chart indicates that individuals who are prescribed blood pressure medication appear to exhibit an elevated risk of coronary heart disease (CHD) when compared to those who do not receive such medication. More precisely, there is a noticeable disparity in the proportion of individuals at risk for CHD between those who are on blood pressure medication and those who are not. These observations imply that the usage of blood pressure medication may play a substantial role in influencing the CHD risk within this dataset.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Firms specializing in the development of blood pressure (BP) medication or other remedies for hypertension may find it beneficial to focus on individuals with elevated blood pressure levels who also face a risk of coronary heart disease (CHD), irrespective of their current use of BP medication. This strategy could aid in the identification of patients who could benefit from more intensive treatment to mitigate their CHD risk.

#### **Chart - 6**
**Is a person who has had a stroke more susceptible to coronary heart disease?**

In [None]:
# Chart - 6 visualization code

plt.figure(figsize=(10,10))
sns.countplot(x=df['prevalentStroke'], hue=df['TenYearCHD'])
plt.title('Are people who had a stroke earlier more prone to CHD?')
plt.legend(['No Risk', 'At Risk'], loc='best')
plt.show()

##### 1. Why did you pick the specific chart?

This countplot compares CHD risk levels between patients with a prior stroke and those without. We chose it to explore a potential link between having had a stroke and an increased susceptibility to CHD.

##### 2. What is/are the insight(s) found from the chart?

The chart illustrates an association between a prior history of stroke and an elevated risk of coronary heart disease (CHD) within this dataset. More precisely, the percentage of patients at risk for CHD is notably higher among those with a history of stroke than among those without. These observations indicate a potential link between experiencing a stroke and an increased susceptibility to CHD within the dataset.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The information gleaned from this chart holds potential significance for businesses operating within the realm of healthcare services or products associated with stroke or coronary heart disease (CHD). For instance, manufacturers of medications or treatments for stroke or CHD could contemplate directing their efforts towards patients who have experienced a stroke, recognizing them as a high-risk demographic for CHD.

Furthermore, healthcare providers may consider implementing screening protocols for CHD risk among individuals who have a history of stroke. This could enable the delivery of targeted preventative measures or treatments to address potential risks and promote better patient outcomes.

#### **Chart - 7**
**Does having hypertension increase the risk of developing coronary heart disease?**

In [None]:
# Chart - 7 visualization code

plt.figure(figsize=(8,8))
sns.countplot(x=df['prevalentHyp'], hue=df['TenYearCHD'])
plt.title('Are hypertensive patients at more risk of CHD?')
plt.legend(title='CHD Risk', labels=['No Risk', 'At Risk'])
plt.show()

##### 1. Why did you pick the specific chart?

We selected this chart to visually represent the correlation between the presence of hypertension and the likelihood of developing coronary heart disease within the dataset.

##### 2. What is/are the insight(s) found from the chart?

The chart points to an association between prevalent hypertension and an increased likelihood of developing coronary heart disease (CHD). Relative to the size of each group, at-risk patients make up a larger share of those with prevalent hypertension than of those without it.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The association between prevalent hypertension and an elevated probability of developing coronary heart disease (CHD) suggests that hypertensive patients are a natural group for closer monitoring and preventive care. Nothing in this chart points to negative growth.

#### **Chart - 8**
**Do individuals with diabetes have a higher risk of developing coronary heart disease?**

In [None]:
# Chart - 8 visualization code

plt.figure(figsize=(8,8))
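# Note: each bar is the share of the whole dataset (len(df)) in that
# diabetes/CHD combination, not the share within its diabetes group
sns.barplot(x=df['diabetes'], y=df['TenYearCHD'], hue=df['TenYearCHD'], estimator=lambda x: len(x) / len(df) * 100)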
plt.title('Proportion of patients with and without diabetes at CHD risk')
plt.xlabel('Diabetes')
plt.ylabel('Percentage')
plt.legend(title='CHD Risk', labels=['No Risk', 'At Risk'])
plt.show()

##### 1. Why did you pick the specific chart?

We selected this chart to represent the distribution of individuals in the dataset, categorizing them based on the presence or absence of diabetes, and examining their respective risks of developing coronary heart disease.

##### 2. What is/are the insight(s) found from the chart?

The chart indicates that individuals with diabetes have a higher likelihood of being susceptible to coronary heart disease in contrast to those who do not have diabetes.
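
The barplot's estimator, `len(x) / len(df) * 100`, measures each bar against the whole dataset, whereas the claim about likelihood concerns the share within each diabetes group. The sketch below contrasts the two denominators on invented data (two diabetic and eight non-diabetic patients), using `pd.crosstab` with `normalize="all"` and `normalize="index"`.

```python
import pandas as pd

# Invented toy data with the dataset's column names
toy = pd.DataFrame({
    "diabetes":   [1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    "TenYearCHD": [1, 0, 1, 0, 0, 0, 0, 0, 0, 0],
})

# Percentage of ALL patients in each (diabetes, CHD) cell
share_of_all = pd.crosstab(toy["diabetes"], toy["TenYearCHD"], normalize="all") * 100

# Percentage WITHIN each diabetes group (each row sums to 100)
share_in_group = pd.crosstab(toy["diabetes"], toy["TenYearCHD"], normalize="index") * 100

print(share_of_all)
print(share_in_group)
```

In this toy example the at-risk cells look identical when measured against the whole dataset, yet the within-group view shows a much higher at-risk share among diabetic patients, so the within-group table is the one that supports a statement about likelihood.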

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Indeed, the insights obtained can assist healthcare enterprises and practitioners in identifying patients with diabetes at an elevated risk level, necessitating additional evaluation, ongoing monitoring, and comprehensive management to mitigate the onset or advancement of coronary heart disease.


#### **Chart - 9**
**Is there a correlation between total cholesterol levels and coronary heart disease?**

In [None]:
# Chart - 9 visualization code
plt.figure(figsize=(8,8))
sns.boxplot(x='TenYearCHD', y='totChol', data=df)
plt.title('Total Cholesterol Levels and CHD')
# No legend here: the boxplot has no hue, so the x-axis label explains the classes
plt.xlabel('TenYearCHD (0 = No Risk, 1 = At Risk)')
plt.ylabel('Total Cholesterol Levels')
plt.show()

##### 1. Why did you pick the specific chart?

We chose a box plot to examine whether total cholesterol levels are related to the risk of developing coronary heart disease.

##### 2. What is/are the insight(s) found from the chart?

The box plot reveals that individuals at risk of developing coronary heart disease tend to have slightly higher median total cholesterol levels than those not at risk. However, the cholesterol ranges of the two groups overlap considerably.
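
The degree of overlap can be quantified by comparing the quartiles of the two groups. The sketch below is a minimal example on invented cholesterol values that reuses the `totChol` and `TenYearCHD` column names; it checks whether the upper quartile of the no-risk group exceeds the lower quartile of the at-risk group.

```python
import pandas as pd

# Invented toy data with the dataset's column names
toy = pd.DataFrame({
    "totChol": [180, 200, 220, 240, 260, 210, 230, 250, 270, 290],
    "TenYearCHD": [0] * 5 + [1] * 5,
})

# Quartiles per risk class: rows are TenYearCHD, columns are 0.25 / 0.5 / 0.75
quartiles = (
    toy.groupby("TenYearCHD")["totChol"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
print(quartiles)

# The boxes overlap when the no-risk upper quartile exceeds the at-risk lower quartile
boxes_overlap = quartiles.loc[0, 0.75] > quartiles.loc[1, 0.25]
```

When the boxes overlap like this, cholesterol alone separates the classes poorly even if the medians differ.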

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights obtained can help healthcare providers assess the influence of total cholesterol levels on their patients' susceptibility to coronary heart disease (CHD). Identifying individuals with elevated cholesterol levels enables targeted interventions to reduce their risk of developing CHD. Such proactive measures can improve patient health outcomes and may reduce costs for healthcare providers over time.

#### **Chart - 10**
**What is the pairwise relationship between glucose levels, systolic blood pressure, diastolic blood pressure, and the risk of developing coronary heart disease?**

In [None]:
# Chart - 10 visualization code
# select the columns of interest
cols = ['glucose', 'sysBP', 'diaBP', 'TenYearCHD']

# create the scatter plot matrix
sns.pairplot(df[cols], hue='TenYearCHD', markers=['o', 's'])

In this pairplot, TenYearCHD is encoded as color rather than plotted on an axis. High glucose levels appear to be associated with an increased risk of developing coronary heart disease, as indicated by a higher concentration of orange (high-risk) points at higher values in the panels involving glucose.

High blood pressure levels (both systolic and diastolic) also appear to be associated with an increased risk of developing coronary heart disease, as indicated by a higher concentration of orange points toward the upper right of the sysBP vs. diaBP panels.

There may be some interaction effects between glucose and blood pressure on the risk of developing coronary heart disease, as indicated by the patterns of orange points in the glucose vs. sysBP and glucose vs. diaBP plots. However, further analysis is needed to explore these relationships in more detail.

##### 1. Why did you pick the specific chart?

This chart was chosen to visualize the pairwise relationships between four variables: glucose levels, systolic blood pressure, diastolic blood pressure, and the risk of developing coronary heart disease. A pairplot displays all pairwise scatterplots, with the distribution of each variable (split by TenYearCHD) along the diagonal.

##### 2. What is/are the insight(s) found from the chart?

The pairplot provides a visual representation of the relationships between glucose levels, systolic blood pressure, diastolic blood pressure, and the likelihood of developing coronary heart disease. The diagonal panels show the distribution of each variable, while the scatter plots show the associations between pairs of variables. The plot suggests that individuals with elevated glucose levels tend to have an increased risk of developing coronary heart disease. Likewise, individuals with elevated systolic and diastolic blood pressure appear to have an elevated risk.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights drawn from this chart do not point toward negative growth; rather, they have the potential to foster positive growth by aiding healthcare enterprises in the development of more efficacious prevention and treatment approaches. These strategies have the capacity to enhance patient outcomes and, in turn, may lead to potential reductions in healthcare expenditures

#### **Chart - 11**
**Does cigarette smoking have a differential impact on the risk of developing coronary heart disease between males and females?**

In [None]:
# Chart - 11 visualization code
# select the columns of interest
cols = ['sex', 'cigsPerDay', 'TenYearCHD']

# create a grouped scatter plot of TenYearCHD by cigsPerDay and sex
sns.scatterplot(x='cigsPerDay', y='TenYearCHD', hue='sex', data=df)

# show the plot
plt.show()

##### 1. Why did you pick the specific chart?

We selected this chart because it effectively visualizes the interplay between daily cigarette consumption, the probability of developing coronary heart disease, and how gender influences this dynamic. A scatter plot was employed to depict the data's distribution and to discern any underlying patterns or trends.

##### 2. What is/are the insight(s) found from the chart?

The chart suggests that, for both genders, the risk of developing coronary heart disease rises with daily cigarette consumption. The association between cigarette smoking and CHD risk appears more pronounced among males than females. Specifically, males who smoke more than 10 cigarettes per day show a notably higher risk of CHD than their female counterparts.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The information gleaned from this chart offers valuable insights that can inform the efforts of public health organizations and businesses in designing tailored interventions aimed at reducing smoking prevalence and mitigating the onset of coronary heart disease (CHD). Particularly, there is an opportunity to focus on male smokers who exhibit an elevated risk. For instance, public health initiatives can be strategized to heighten awareness about the health hazards associated with smoking and offer assistance and resources to individuals seeking to quit. Similarly, businesses can consider implementing smoking cessation programs for their employees, promoting improved health and overall well-being.
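Because TenYearCHD is binary, a scatter plot against cigsPerDay places every point on one of two horizontal lines, which makes rates hard to read. A minimal sketch of one alternative is to bin daily consumption and tabulate the observed CHD rate per sex. The toy data, the `'M'`/`'F'` labels, and the bin edges below are assumptions for illustration, not values from the dataset.

```python
import pandas as pd

# Toy stand-in for the notebook's df (values are made up for illustration)
toy = pd.DataFrame({
    'sex': ['M', 'M', 'M', 'M', 'F', 'F', 'F', 'F'],
    'cigsPerDay': [0, 5, 20, 30, 0, 10, 15, 25],
    'TenYearCHD': [0, 0, 1, 1, 0, 0, 1, 0],
})

# Bin daily consumption; the bin edges are an arbitrary, illustrative choice
bins = [-1, 0, 10, 100]
labels = ['none', '1-10', '>10']
toy['smoke_bin'] = pd.cut(toy['cigsPerDay'], bins=bins, labels=labels)

# The mean of a 0/1 target within each cell is the observed CHD rate
chd_rate = (
    toy.groupby(['sex', 'smoke_bin'], observed=True)['TenYearCHD']
       .mean()
       .unstack()
)
print(chd_rate)
```

A table like this can then be drawn as a grouped bar chart, which makes a male/female difference at each smoking level directly comparable.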

#### **Chart - 12**
**Are there differences in the age and sex distributions between individuals with and without prevalent stroke?**

In [None]:
# Chart - 12 visualization code
sns.violinplot(x='prevalentStroke', y='age', data=df, hue='sex', split=True, palette='rainbow')

The plot shows that most prevalent strokes occurred in patients above age 45, and most of those patients are female.

##### 1. Why did you pick the specific chart?

I opted for a violin plot because it shows the age distribution within two distinct groups (individuals with and without a prior stroke) and, with split halves, allows the sexes to be compared within each group.

##### 2. What is/are the insight(s) found from the chart?

The chart indicates that individuals who have experienced a prevalent stroke tend to be older on average than those without a history of stroke. The chart also suggests a greater presence of males in both groups, with a notably higher proportion of males among individuals with a prevalent stroke.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights obtained have the potential to assist healthcare institutions and insurance providers in formulating policy decisions concerning stroke prevention and treatment. For instance, these findings could contribute to informed choices regarding the allocation of specific preventive measures, like the use of blood thinners, or the structuring of stroke rehabilitation programs. Moreover, insurance firms may employ this data to shape their policies pertaining to stroke coverage and premium rates.

#### **Chart - 13**
**Is there a relationship between hypertension status and the number of cigarettes smoked per day?**

In [None]:
# Chart - 13 visualization code
# create a scatter plot of sysBP against cigsPerDay, colored by hypertension status
sns.scatterplot(x='cigsPerDay', y='sysBP', hue='prevalentHyp', data=df)

# add a title and axis labels
plt.title('Relationship between Systolic Blood Pressure and Cigarettes Smoked per Day, by Hypertension Status')
plt.xlabel('Cigarettes Smoked per Day')
plt.ylabel('Systolic Blood Pressure')

# display the plot
plt.show()

##### 1. Why did you pick the specific chart?

I opted for a scatterplot as it is a suitable choice for visualizing the connection between two continuous variables, aligning with our specific interest in understanding the relationship between cigsPerDay and sysBP. Furthermore, the incorporation of color to signify hypertensive status provides a straightforward means of discerning potential data patterns or trends associated with hypertension status.

##### 2. What is/are the insight(s) found from the chart?

The scatterplot suggests a positive correlation between cigsPerDay and sysBP, irrespective of hypertension status. This implies that individuals who smoke more cigarettes per day tend to have higher systolic blood pressure. In addition, individuals with prevalent hypertension generally show higher systolic blood pressure than those without hypertension.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights gleaned from this chart hold promise for application in healthcare and wellness contexts, where the need to monitor blood pressure levels and mitigate cardiovascular risk factors, such as smoking, is paramount. By discerning the positive association between smoking and systolic blood pressure, healthcare providers have an opportunity to promote smoking cessation as a means to reduce blood pressure levels and mitigate the risk of hypertension and associated cardiovascular ailments.

Notably, this chart does not reveal any indications of adverse trends. Nevertheless, should smoking cessation initiatives be implemented effectively and yield positive results, there could potentially be adverse repercussions for tobacco companies and the broader tobacco industry.

#### Chart - 14 - Correlation Heatmap

In [None]:
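# numeric_only=True skips non-numeric columns, which recent pandas versions no longer drop silently
df.corr(numeric_only=True)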

In [None]:
# Correlation Heatmap visualization code
plt.figure(figsize=(15,15))
correlation = df.corr(numeric_only=True)  # skip non-numeric columns
sns.heatmap((correlation), annot=True, cmap=sns.color_palette("mako", as_cmap=True))

##### 1. Why did you pick the specific chart?

I chose the Correlation Heatmap because it visually represents the relationships between all pairs of features in the dataset. The color scale encodes the strength of each correlation coefficient, which makes strongly correlated features quick to identify.

##### 2. What is/are the insight(s) found from the chart?

The Correlation Heatmap shows the pairwise correlation between all numerical features in the dataset.

1) From the correlation chart, age is positively correlated with TenYearCHD, with a coefficient of about 0.22. Although modest in absolute terms, this suggests that age may be an important predictor of CHD risk.

2) From the heatmap, age, systolic blood pressure, and diastolic blood pressure correlate more strongly with the TenYearCHD target variable than most other features.

3) There is a strong positive correlation between systolic and diastolic blood pressure, about 0.78.

4) Diabetes and glucose are correlated at about 0.61.

5) Prevalent hypertension is strongly correlated with systolic and diastolic blood pressure, at about 0.70 and 0.61 respectively.

6) Age is negatively correlated with education and cigarettes per day, at about -0.17 and -0.19 respectively.
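Reading individual coefficients off a large heatmap is error-prone, so it can help to rank features by their absolute correlation with the target. The sketch below uses synthetic data; the column names mirror the notebook, but the values and the `'M'`/`'F'` labels are assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the notebook's df (values are made up)
rng = np.random.default_rng(0)
n = 500
age = rng.normal(50, 8, n)
toy = pd.DataFrame({
    'age': age,
    'sysBP': 100 + 0.8 * age + rng.normal(0, 10, n),
    'sex': rng.choice(['M', 'F'], n),  # non-numeric column, excluded below
})
toy['TenYearCHD'] = (age + rng.normal(0, 10, n) > 58).astype(int)

# Pearson correlations over numeric columns only
corr = toy.corr(numeric_only=True)

# Rank the features by absolute correlation with the target
target_corr = (
    corr['TenYearCHD']
    .drop('TenYearCHD')
    .sort_values(key=np.abs, ascending=False)
)
print(target_corr.round(2))
```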

#### **Chart - 15 - Pair Plot**

In [None]:
# Pair Plot visualization code
sns.pairplot(df[continous_variable])

##### 1. Why did you pick the specific chart?

The pair plot is a useful tool for understanding the relationships among the continuous variables in the dataset. It helps detect both linear and non-linear relationships among these variables and can reveal potential outliers or unusual patterns.

##### 2. What is/are the insight(s) found from the chart?

Observing the pair plot, it becomes evident that several variables exhibit positive correlations. Notably, age displays a positive association with systolic blood pressure, while BMI shows a similar correlation with glucose levels. Additionally, systolic blood pressure and diastolic blood pressure exhibit a linear relationship. A subtle positive correlation between cigsPerDay and sysBP is also discernible. Nevertheless, no distinct linear connection emerges between any of the variables and the target variable, TenYearCHD.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.



1. Patients with diabetes face an elevated risk of CHD compared to those without diabetes.
2. Elevated total cholesterol levels are linked to a greater likelihood of coronary heart disease (CHD).
3. The likelihood of being at risk for TenYearCHD is higher for individuals aged 50 and above.

### Hypothetical Statement - 1
Patients with diabetes face an elevated risk of CHD compared to those without diabetes.

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Null Hypothesis**: There exists no substantial disparity in the likelihood of coronary heart disease (CHD) development between individuals with diabetes and those without the condition.

**Alternative Hypothesis**: Individuals with diabetes exhibit an elevated risk of CHD development compared to those without diabetes.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
import scipy.stats as stats

# Separate the dataset into two groups based on diabetic status
diabetic = df[df['diabetes'] == 1]
non_diabetic = df[df['diabetes'] == 0]

# Perform a two-sample (Welch) t-test to compare the mean TenYearCHD rates of the two groups
t_stat, p_val = stats.ttest_ind(diabetic['TenYearCHD'], non_diabetic['TenYearCHD'], equal_var=False)

print('t_stat=%.3f, p_val=%.3f' % (t_stat, p_val))
# A large p-value means we fail to reject the null; it does not prove the null
if p_val > 0.05:
    print('Fail to Reject Null Hypothesis')
else:
    print('Reject Null Hypothesis')

# Print the unrounded p-value
print('p-value:', p_val)

The t-statistic quantifies the difference in means between patients with and without diabetes, normalized by the standard error of that difference. The p-value is the probability of observing a difference at least this large purely by chance, assuming the null hypothesis holds.

The printed p-value rounds to 0.000 at three decimal places, which is below the predetermined significance level of 0.05, so a difference of this size would be very unlikely under the null hypothesis. Consequently, we reject the null hypothesis and infer that patients with diabetes face an elevated risk of developing coronary heart disease compared to their non-diabetic counterparts.

##### Which statistical test have you done to obtain P-Value?

The two-sample t-test was used to obtain the p-value for the hypothesis "Patients with diabetes face an elevated risk of CHD compared to those without diabetes."



##### Why did you choose the specific statistical test?

We opted for the two-sample t-test because it compares the means of two independent groups, here the diabetic and non-diabetic populations. Because TenYearCHD is a binary (0/1) variable, each group mean is the proportion of individuals at risk, so the test effectively compares CHD rates between the groups. Given the relatively substantial sample sizes, the t-test is a reasonably robust choice for this comparison.
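For two binary variables, a chi-square test of independence on the 2x2 contingency table is a common alternative to the t-test. The sketch below is not the notebook's method; the counts are hypothetical and only illustrate the call to `scipy.stats.chi2_contingency`.

```python
import numpy as np
from scipy import stats

# Hypothetical 2x2 table (made-up counts):
# rows = diabetes (no, yes), columns = TenYearCHD (no, yes)
table = np.array([[900, 100],
                  [ 60,  40]])

# chi2_contingency returns the statistic, p-value, degrees of freedom,
# and the expected counts under independence
chi2, p_val, dof, expected = stats.chi2_contingency(table)
print(f'chi2={chi2:.3f}, dof={dof}, p_val={p_val:.3g}')

if p_val < 0.05:
    print('Reject Null Hypothesis')
else:
    print('Fail to Reject Null Hypothesis')
```

On the real data, the table could be built with `pd.crosstab(df['diabetes'], df['TenYearCHD'])`.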

### **Hypothetical Statement - 2**
Elevated total cholesterol levels are linked to a greater likelihood of coronary heart disease (CHD).

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Null Hypothesis (H0)** - The mean total cholesterol level does not differ between individuals with CHD and those without CHD.

**Alternate Hypothesis (H1)** - There is a statistically significant difference in the mean total cholesterol level between individuals with CHD and those without CHD.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

# Import the required statistical test module from scipy
import scipy.stats as stats

# Separate the dataset into two groups based on CHD status
chd = df[df['TenYearCHD'] == 1] # Patients with CHD
no_chd = df[df['TenYearCHD'] == 0] # Patients without CHD

# Perform a two-sample t-test to compare the mean total cholesterol levels of the two groups
t_stat, p_val = stats.ttest_ind(chd['totChol'], no_chd['totChol'], equal_var=False)

# Print the calculated t-statistic and p-value
print('t_stat=%.3f, p_val=%.3f' % (t_stat, p_val))

# Determine if the null hypothesis should be rejected based on the p-value
if p_val < 0.05:
    print('Reject the null hypothesis')
else:
    print('Fail to reject the null hypothesis')

# Print the p-value
print('p-value:', p_val)

* The observed p-value is highly significant (p_val=5.310852329016078e-07), well below the conventional significance threshold of 0.05.

* Consequently, we have sufficient evidence to reject the null hypothesis, which posits no disparity in total cholesterol levels between the CHD and non-CHD groups.

* These results imply an association between elevated total cholesterol levels and an increased susceptibility to coronary heart disease (CHD).

* Additionally, the positive t-statistic of 5.065 is consistent with this conclusion: the mean total cholesterol of the CHD group is higher than that of the non-CHD group.

##### Which statistical test have you done to obtain P-Value?

We conducted a two-sample t-test to calculate the p-value. This t-test was utilized to assess whether there exists a statistically significant distinction in the mean total cholesterol levels between two distinct groups: individuals with CHD and those without CHD.

##### Why did you choose the specific statistical test?

For the hypothesis 'Elevated total cholesterol levels are linked to an increased risk of coronary heart disease (CHD)', a two-sample t-test is a suitable analysis because we are comparing mean total cholesterol between two independent groups: individuals with CHD and those without CHD. The two-sample t-test is a widely used method for comparing the means of two independent groups. Note that the test assumes approximately normally distributed data within each group; because the code sets `equal_var=False`, it runs Welch's variant, which does not assume equal variances in the two groups.
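To see how much the variance assumption matters, one can run Welch's test, the pooled-variance (Student) test, and a rank-based Mann-Whitney U test side by side. This is a sketch on synthetic data; the group means, spreads, and sizes below are made up and are not estimates from the dataset.

```python
import numpy as np
from scipy import stats

# Synthetic cholesterol-like samples (all parameters are illustrative)
rng = np.random.default_rng(1)
chol_chd = rng.normal(245, 48, 500)
chol_no_chd = rng.normal(235, 43, 2800)

# Welch's t-test: does not assume equal variances
t_welch, p_welch = stats.ttest_ind(chol_chd, chol_no_chd, equal_var=False)

# Student's t-test: assumes equal variances (pooled estimate)
t_pooled, p_pooled = stats.ttest_ind(chol_chd, chol_no_chd, equal_var=True)

# Mann-Whitney U: rank-based, no normality assumption
u_stat, p_mw = stats.mannwhitneyu(chol_chd, chol_no_chd, alternative='two-sided')

print(f'Welch:        t={t_welch:.3f}, p={p_welch:.3g}')
print(f'Student:      t={t_pooled:.3f}, p={p_pooled:.3g}')
print(f'Mann-Whitney: U={u_stat:.0f}, p={p_mw:.3g}')
```

If the three tests agree, the conclusion is not sensitive to the distributional assumptions.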

### **Hypothetical Statement - 3**
The likelihood of being at risk for TenYearCHD is higher for individuals aged 50 and above.

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Null Hypothesis**: "There is no significant impact of age on the risk of TenYearCHD."

**Alternative Hypothesis**: "Individuals aged 50 and above exhibit a greater TenYearCHD risk compared to those below the age of 50."

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
import statsmodels.stats.proportion as smp

# Separate the dataset into two groups based on age (50 and above vs. below 50)
above_50 = df[df['age'] >= 50]
below_50 = df[df['age'] < 50]

# Count the patients with TenYearCHD in each group (integer counts, not proportion * size)
count_above_50 = above_50['TenYearCHD'].sum()
count_below_50 = below_50['TenYearCHD'].sum()

# Perform a one-tailed z-test to compare the proportions of the two groups
z_score, p_val = smp.proportions_ztest([count_above_50, count_below_50], [len(above_50), len(below_50)], alternative='larger')

print('z_score=%.3f, p_val=%.3f' % (z_score, p_val))

if p_val < 0.05:
    print('Reject Null Hypothesis')
else:
    print('Fail to Reject Null Hypothesis')

# Print the unrounded p-value
print('p-value:', p_val)

The test results indicate that the observed difference in the proportion of TenYearCHD risk between patients aged 50 and above and those below 50 is very unlikely to be due to chance.

We therefore reject the null hypothesis and conclude that patients aged 50 and above are at a significantly higher risk of TenYearCHD than those below 50 years of age.

##### Which statistical test have you done to obtain P-Value?

I conducted a one-tailed z-test for two proportions, comparing the proportion of patients with TenYearCHD between the two age groups.

##### Why did you choose the specific statistical test?

I opted for a one-tailed z-test to compare the proportions in the two groups. The question of interest is whether the percentage of patients with TenYearCHD in the older group exceeds that in the younger group, which is a directional hypothesis. A z-test is suitable when the sample size is large and the goal is to compare proportions between two distinct groups.
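The z-statistic for two proportions can also be computed by hand, which makes the role of the pooled proportion explicit. The sketch below uses the pooled-variance form, which to my understanding is what `proportions_ztest` computes by default; the counts are hypothetical.

```python
import math
from scipy.stats import norm

# Hypothetical counts: events and group sizes (made up for illustration)
x1, n1 = 60, 200   # older group: patients with TenYearCHD, group size
x2, n2 = 30, 300   # younger group

p1, p2 = x1 / n1, x2 / n2

# Pooled proportion under the null hypothesis of equal rates
p_pool = (x1 + x2) / (n1 + n2)

# Standard error of the difference under the null
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))

# One-tailed test: is the rate in group 1 larger than in group 2?
z = (p1 - p2) / se
p_val = norm.sf(z)  # upper-tail probability

print(f'z={z:.3f}, p_val={p_val:.3g}')
```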

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
df.isnull().sum()

#### What all missing value imputation techniques have you used and why did you use those techniques?

There are no remaining null values in our dataset as we have already processed and handled them in data wrangling.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments
fig, axes = plt.subplots(2, 4, figsize=(30, 15))
axes = axes.flatten()
for ax, col in zip(axes, continous_variable):
    # Pass the column as x explicitly; recent seaborn treats the first positional argument as data
    sns.boxplot(x=df[col], ax=ax)
    ax.set_title(col.title(), weight='bold')
plt.tight_layout()

In [None]:
## function to create a dataframe of total outliers and percentage of outliers
def outliers_df(df, continuous_features):
    columns = ['feature', 'lower_limit', 'upper_limit',
               'IQR', 'total_outliers', 'percentage_outliers(%)']
    rows = []
    for feature in continuous_features:
        values = df[feature]
        q1, q3 = values.quantile([0.25, 0.75])
        iqr = q3 - q1
        lower_limit = q1 - 1.5 * iqr
        upper_limit = q3 + 1.5 * iqr
        outliers = values[(values < lower_limit) | (values > upper_limit)]
        total_outliers = len(outliers)
        percentage_outliers = round(total_outliers * 100 / len(values), 2)
        # Collect rows in a list; DataFrame.append was removed in pandas 2.0
        rows.append({'feature': feature,
                     'lower_limit': lower_limit,
                     'upper_limit': upper_limit,
                     'IQR': iqr,
                     'total_outliers': total_outliers,
                     'percentage_outliers(%)': percentage_outliers})
    outlier_df = pd.DataFrame(rows, columns=columns)
    return outlier_df.sort_values(by=['percentage_outliers(%)'], ascending=False)

In [None]:
outliers_df(df,continous_variable)

* A blanket strategy of capping all outliers at the IQR-based limits may not be advisable for this dataset, because some of these extreme values may belong to critically ill patients and therefore carry real clinical information.

* There exist several techniques for handling outliers within a dataset, including outlier removal, Winsorization, utilization of robust statistical methods, and data transformation.

* Outlier removal involves the exclusion of data points identified as outliers. However, this approach comes with the drawback of potential information loss and a reduction in the sample size.

* Therefore, in this context, we opt for data transformation, which entails applying mathematical functions such as logarithmic, square root, or reciprocal transformations. This approach can aid in normalizing the data distribution and mitigating the impact of outliers.

In [None]:
# applying transformation for treating outlier
df[continous_variable] = np.log(df[continous_variable] +1 )

##### What all outlier treatment techniques have you used and why did you use those techniques?

* I applied a log transformation, log(x + 1), to the continuous features to reduce the influence of outliers.

* I chose this approach because it is simple to implement, keeps every observation in the dataset, and compresses large values more than small ones.

* Additionally, this transformation can help normalize right-skewed distributions, making them more symmetrical.
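One way to check the effect of the log transformation is to count the points outside the 1.5 * IQR fences before and after applying it. The sketch below uses a synthetic right-skewed sample as a stand-in for a feature such as glucose; `iqr_outlier_count` is a hypothetical helper, not part of the notebook.

```python
import numpy as np
import pandas as pd

def iqr_outlier_count(values: pd.Series) -> int:
    """Count points outside the 1.5 * IQR fences."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    mask = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
    return int(mask.sum())

# Synthetic right-skewed sample (parameters are illustrative only)
rng = np.random.default_rng(42)
skewed = pd.Series(rng.lognormal(mean=4.4, sigma=0.5, size=2000))

before = iqr_outlier_count(skewed)
after = iqr_outlier_count(np.log(skewed + 1))  # same transform as the notebook

print(f'outliers before: {before}, after: {after}')
```

The transform does not remove any rows; it only changes how many points the IQR rule flags.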

### 3. Categorical Encoding

In [None]:
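# Encode your categorical columns
# drop_first=True leaves a single 0/1 indicator per binary column;
# dtype=int avoids the boolean output of recent pandas versions

df['sex'] = pd.get_dummies(df['sex'], drop_first=True, dtype=int).iloc[:, 0]
df['is_smoking'] = pd.get_dummies(df['is_smoking'], drop_first=True, dtype=int).iloc[:, 0]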

In [None]:
df.info()

#### What all categorical encoding techniques have you used & why did you use those techniques?

* Categorical variables are typically non-numeric, whereas machine learning algorithms require numerical input. We therefore used one-hot (dummy) encoding to convert the categorical variables 'sex' and 'is_smoking' into binary values (0 or 1).

* More precisely, we use the **get_dummies()** function from the pandas library to generate dummy variables, where each dummy is a binary column representing one category of the variable.

* Furthermore, we set the **drop_first=True** parameter to avoid multicollinearity. For a binary variable, the two dummy columns are perfectly negatively correlated, so one of them is redundant and can be dropped.
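The effect of `drop_first=True` is easiest to see on a small example. The category labels below (`'M'`/`'F'`, `'YES'`/`'NO'`) are assumptions about how the raw columns are coded.

```python
import pandas as pd

# Toy frame with two binary categorical columns (labels are assumed)
toy = pd.DataFrame({
    'sex': ['M', 'F', 'F', 'M'],
    'is_smoking': ['YES', 'NO', 'YES', 'NO'],
})

# drop_first=True drops the first category (in sorted order) of each column,
# leaving one 0/1 indicator per binary variable
encoded = pd.get_dummies(toy, columns=['sex', 'is_smoking'],
                         drop_first=True, dtype=int)
print(encoded)
```

Here 'F' and 'NO' become the reference categories, so `sex_M` is 1 for males and `is_smoking_YES` is 1 for smokers.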

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

In the correlation heatmap we have already seen that systolic and diastolic blood pressure are highly correlated.

We therefore create a single new feature from the two measurements to summarize an individual's blood pressure.

In [None]:
df.head()


A closer look at heart-related factors indicated that pulse pressure, defined as the difference between systolic and diastolic blood pressure, has a notable influence on coronary heart disease (CHD). We therefore construct a new variable named 'pulsePressure' that combines the systolic and diastolic blood pressure measurements into a single column.

In [None]:
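# Adding pulse pressure as a column
# Note: if sysBP and diaBP were already log-transformed in the outlier step,
# this difference is a log-ratio rather than pulse pressure in mmHg
df['pulsePressure'] = df['sysBP'] - df['diaBP']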

In [None]:
df.head()

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting
# Let's see how TenYearCHD and the other features are related
for col in df.describe().columns.tolist():
    fig = plt.figure(figsize=(9, 6))
    ax = fig.gca()
    feature = df[col]
    label = df['TenYearCHD']
    correlation = feature.corr(label)
    sns.scatterplot(x=feature, y=label, color="gray")
    plt.xlabel(col)
    plt.ylabel('TenYearCHD')
    ax.set_title('TenYearCHD vs ' + col + '- correlation: ' + str(correlation))
    z = np.polyfit(df[col], df['TenYearCHD'], 1)
    y_hat = np.poly1d(z)(df[col])
    plt.plot(df[col], y_hat, "r--", lw=1)
    plt.show()

In [None]:
f,ax = plt.subplots(figsize=(12, 12))
sns.heatmap(abs(round(df.corr(),3)), annot=True, linewidths=.5, fmt= '.1f',ax=ax)
plt.show()

##### What all feature selection methods have you used  and why?

We used correlation analysis, which assesses the degree of correlation between each feature and the target variable, as well as among the features themselves. Features that correlate strongly with the target are generally regarded as useful predictors and are kept, while features that are highly correlated with each other are candidates for removal.

##### Which all features you found important and why?

* The heatmap shows a strong correlation between sysBP and diaBP. Since we have already computed a new feature, pulsePressure, from these variables, we remove both 'sysBP' and 'diaBP' from the analysis.

* We exclude the 'id' feature because it is only an identifier and carries no predictive information.

* There is a substantial correlation between the 'is_smoking' and 'cigsPerDay' columns, so we eliminate one of them, namely the one with the lesser influence on the target variable.

* Furthermore, whenever daily cigarette consumption exceeds zero, the 'is_smoking' column has a value of 1, signifying a positive smoking status. The two columns therefore convey equivalent information, so we drop the 'is_smoking' column.

### **Creating Final DataFrame**

In [None]:
df.columns

In [None]:
# Take an explicit copy so later column assignments do not raise SettingWithCopyWarning
final_df = df[['age', 'education', 'sex','cigsPerDay', 'BPMeds',
               'prevalentStroke', 'prevalentHyp', 'diabetes', 'totChol',
               'BMI', 'heartRate', 'glucose', 'pulsePressure', 'TenYearCHD']].copy()

In [None]:
# check for heatmap if anything remains to avoid multicollinearity
plt.figure(figsize=(15,15))
correlation = final_df.corr()
sns.heatmap((correlation), annot=True, cmap=sns.color_palette("mako", as_cmap=True))

The heatmap shows that 'pulsePressure', 'glucose', and 'prevalentHyp' exhibit only a moderate level of correlation with each other, so none of them needs to be dropped and all three are retained in the analysis.

### 5. Data Transformation

No further transformation is required for the existing features because we already transformed them when treating outliers.

However, we have since added a new feature, "pulsePressure", to the dataset.

So we check whether this feature needs a transformation.

In [None]:
# Checking the distribution of pulse pressure
plt.figure(figsize=(10,5))
print("Before Applying Transformation")
# histplot replaces distplot, which is deprecated in recent seaborn
sns.histplot(df['pulsePressure'], kde=True, stat='density')
plt.title('Distribution of pulsePressure')

In [None]:
#### Check whether the feature is Gaussian (normally distributed)
#### Q-Q plot
stats.probplot(df['pulsePressure'],dist='norm',plot=pylab)

In [None]:
# Creating 4 different copies to check the distribution of the variable
test_df1=final_df.copy()
test_df2=final_df.copy()
test_df3=final_df.copy()
test_df4=final_df.copy()

#### **Logarithmic Transformation**

In [None]:
# Applying transformation on the considered column
test_df1['pulsePressure']=np.log(test_df1['pulsePressure']+1)

# Checking the distribution of the transformed variable (test_df1, not the original df)
plt.figure(figsize=(10,5))
print("After Applying Transformation")
sns.histplot(test_df1['pulsePressure'], kde=True, stat='density')
plt.title('Distribution of pulsePressure')

In [None]:
#### Q-Q plot
stats.probplot(test_df1['pulsePressure'],dist='norm',plot=pylab)

#### **Reciprocal Transformation**

In [None]:
# Applying transformation on the considered column
test_df2['pulsePressure']=1/(test_df2['pulsePressure']+1)

# Checking the distribution of the transformed variable (test_df2, not the original df)
plt.figure(figsize=(10,5))
print("After Applying Transformation")
sns.histplot(test_df2['pulsePressure'], kde=True, stat='density')
plt.title('Distribution of pulsePressure')

In [None]:
#### Q-Q plot
stats.probplot(test_df2['pulsePressure'],dist='norm',plot=pylab)

#### **Square Root Transformation**

In [None]:
# Applying transformation on the considered column
test_df3['pulsePressure']=(test_df3['pulsePressure'])**(1/2)

# Checking the distribution of the transformed variable (test_df3, not the original df)
plt.figure(figsize=(10,5))
print("After Applying Transformation")
sns.histplot(test_df3['pulsePressure'], kde=True, stat='density')
plt.title('Distribution of pulsePressure')

In [None]:
#### Q-Q plot
stats.probplot(test_df3['pulsePressure'],dist='norm',plot=pylab)

#### **Exponential (Power) Transformation**

In [None]:
# Applying transformation on the considered column (power of 1/1.2)
test_df4['pulsePressure']=(test_df4['pulsePressure'])**(1/1.2)

# Checking the distribution of the transformed variable (test_df4, not the original df)
plt.figure(figsize=(10,5))
print("After Applying Transformation")
sns.histplot(test_df4['pulsePressure'], kde=True, stat='density')
plt.title('Distribution of pulsePressure')

In [None]:
#### Q-Q plot
stats.probplot(test_df4['pulsePressure'],dist='norm',plot=pylab)

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

Yes. We initially applied a logarithmic transformation to the continuous features to address outliers. For the new 'pulsePressure' feature, we compared several candidate transformations and found that a logarithmic transformation also gave the best results.
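The visual comparison of candidate transformations can be backed by a simple number such as sample skewness, where values closer to 0 indicate a more symmetric distribution. The sketch below applies the same four transformations to a synthetic right-skewed series; the data and the selection rule (smallest absolute skewness) are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed stand-in for a positive feature (values are made up)
rng = np.random.default_rng(7)
pp = pd.Series(rng.lognormal(mean=3.8, sigma=0.35, size=3000))

# Same candidate transformations as in the cells above
candidates = {
    'none': pp,
    'log': np.log(pp + 1),
    'reciprocal': 1 / (pp + 1),
    'sqrt': pp ** 0.5,
    'power_1_over_1.2': pp ** (1 / 1.2),
}

# Sample skewness per candidate
skews = pd.Series({name: s.skew() for name, s in candidates.items()})
best = skews.abs().idxmin()

print(skews.round(3))
print('smallest absolute skewness:', best)
```

Skewness is only one criterion; the Q-Q plots remain useful for judging the tails.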

**Applying transformation**

In [None]:
# Transform Your data
# Applying transformation on the considered column
## Logarithmic transformation
final_df['pulsePressure']=np.log(final_df['pulsePressure']+1)

### 6. Data Scaling

In [None]:
# Scaling your data


x= final_df.drop('TenYearCHD',axis=1)
y= final_df[['TenYearCHD']]
print(x.shape)
print(y.shape)


# Creating object
std_regressor= StandardScaler()

# Fit and Transform
x= std_regressor.fit_transform(x)

##### Which method have you used to scale you data and why?

StandardScaler transforms each feature so that its mean becomes 0 and its standard deviation becomes 1. Because this is a linear rescaling, it preserves the shape of each feature's distribution. It suits most machine learning algorithms, especially those that rely on distance-based metrics, and it is valuable when features differ substantially in scale, as it makes them more comparable.
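The cell above fits the scaler on the full feature matrix, before the train/test split in section 8. A common alternative, sketched here on synthetic data, is to split first and fit the scaler on the training portion only, so that test-set statistics do not influence the scaling. The variable names below are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features and binary target (values are made up)
rng = np.random.default_rng(0)
X = rng.normal(loc=50, scale=10, size=(200, 3))
y = rng.integers(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
# Learn mean and standard deviation from the training data only
X_train_scaled = scaler.fit_transform(X_train)
# Reuse the training statistics for the test data
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0).round(6))
print(X_train_scaled.std(axis=0).round(6))
```

The test set will not have exactly zero mean and unit variance after this step, which is expected.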



### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

No, it is not a required step in this context.

In the context of the cardiovascular risk prediction dataset, there is no imperative need for dimensionality reduction. The dataset exhibits a relatively small number of features in comparison to the sample size, mitigating the risk of overfitting. Moreover, the dataset's size is modest, so the computational training time for machine learning models does not present a significant concern.

In [None]:
# Dimensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Not needed

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.

Now let's split the data in an 80:20 ratio, with 80% in the training set and 20% in the testing set, using the train_test_split function from the sklearn library.

In [None]:
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=42)

In [None]:
print(f'The shape of x_train is: {x_train.shape}')
print(f'The shape of y_train is: {y_train.shape}')
print(f'The shape of x_test is: {x_test.shape}')
print(f'The shape of y_test is: {y_test.shape}')

##### What data splitting ratio have you used and why?

To estimate how well the model generalizes and to detect overfitting, we partitioned the data into a training set comprising 80% of the data and a testing set containing the remaining 20%. We used the `train_test_split` function from the scikit-learn library, a widely adopted way to train and evaluate a model on distinct data subsets.
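
When the target is imbalanced, `train_test_split` also accepts a `stratify` argument that keeps the class proportions the same in both subsets. The sketch below illustrates this on hypothetical labels with roughly an 85:15 split. `X_demo` and `y_demo` are placeholders, not the project data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels with roughly an 85:15 class imbalance
y_demo = np.array([0] * 850 + [1] * 150)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(f"positive rate, train: {y_tr.mean():.3f}")
print(f"positive rate, test:  {y_te.mean():.3f}")
```

Without `stratify`, the positive rate in a small test set can drift away from the overall rate purely by chance.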

### 9. Handling Imbalanced Dataset

In [None]:
print(df.TenYearCHD.value_counts())

In [None]:
# calculate value counts of 'TenYearCHD' column
counts =df['TenYearCHD'].value_counts()

# set labels and colors for the pie chart
labels = ['NO','YES']
colors = ['skyblue','red']

# create pie chart
plt.figure(figsize=(15,6))
plt.pie(counts, labels=labels, colors=colors, autopct= "%1.1f%%",
        startangle=90, shadow=True, explode=[0,0])


#add title to the chart
plt.title('TenYearCHD Distribution', fontsize=16)

#display the chart
plt.show()

##### Do you think the dataset is imbalanced? Explain Why.

YES

The pie chart clearly indicates that the target variable, which is the 10-year risk of coronary heart disease (CHD), is highly imbalanced. Out of the total sample population, 84.9% or 2879 individuals do not have the risk of CHD, while only 15.1% or 511 individuals are at risk. This significant class imbalance in the data could lead to biased predictions and can negatively impact the performance of machine learning models. Therefore, it is necessary to balance the data by applying appropriate techniques such as undersampling or oversampling to improve the accuracy and reliability of the models.
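
To make the effect of this imbalance concrete, the short calculation below uses the class counts quoted above to score a trivial baseline that always predicts "no CHD risk". It is only an illustration of why accuracy alone can mislead on imbalanced data.

```python
# Class counts quoted above for TenYearCHD
n_negative, n_positive = 2879, 511
total = n_negative + n_positive

# A trivial "model" that always predicts the majority class (no CHD risk)
true_negatives, false_negatives = n_negative, n_positive
true_positives, false_positives = 0, 0

baseline_accuracy = (true_positives + true_negatives) / total
baseline_recall = true_positives / (true_positives + false_negatives)

print(f"baseline accuracy: {baseline_accuracy:.3f}")
print(f"baseline recall:   {baseline_recall:.3f}")
```

The baseline reaches about 85% accuracy while identifying none of the at-risk individuals, which is why precision, recall, and ROC AUC are reported alongside accuracy.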

In [None]:
# Handling Imbalanced Dataset (If needed)
from imblearn.over_sampling import SMOTE

# Initialize SMOTE
smote = SMOTE(random_state=42)

# Fit and apply SMOTE to the data
# Note: resampling before the train/test split lets synthetic points derived
# from test rows reach the training set, so test scores may be optimistic
x_resampled, y_resampled = smote.fit_resample(x, y)

# Print the original and resampled feature-matrix shapes
print('Original dataset shape:', x.shape)
print('Resampled dataset shape:', x_resampled.shape)

# Count the number of samples in each class in the resampled dataset
print('Class distribution in the resampled dataset:', y_resampled.value_counts())

In [None]:
x_train,x_test,y_train,y_test = train_test_split(x_resampled, y_resampled,test_size=0.2,random_state=42)

In [None]:
print(f'The shape of x_train is: {x_train.shape}')
print(f'The shape of y_train is: {y_train.shape}')
print(f'The shape of x_test is: {x_test.shape}')
print(f'The shape of y_test is: {y_test.shape}')

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

I employed the Synthetic Minority Over-sampling Technique (SMOTE) to address the issue of imbalanced dataset. SMOTE is an oversampling method that creates synthetic data points for the minority class by interpolating new instances between the existing ones. This approach serves to balance the distribution of classes, mitigating the bias typically observed towards the majority class in imbalanced datasets. As a result, it can enhance the effectiveness of machine learning models when dealing with imbalanced datasets.
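
The interpolation idea can be sketched in a few lines of NumPy. The function below is a simplified illustration, not the `imblearn` implementation: `smote_like_samples` and its parameters are hypothetical names, and the minority points are random stand-ins. In practice, resampling is often applied to the training split only, so that the test set contains real observations alone.

```python
import numpy as np

def smote_like_samples(minority, n_new, k=5, seed=0):
    """Simplified SMOTE-style oversampling (illustrative only)."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    synthetic = np.empty((n_new, minority.shape[1]))
    for row in range(n_new):
        i = rng.integers(len(minority))
        # Euclidean distance from the chosen point to every minority point
        distances = np.linalg.norm(minority - minority[i], axis=1)
        neighbours = np.argsort(distances)[1:k + 1]  # index 0 is the point itself
        j = rng.choice(neighbours)
        gap = rng.random()                           # position along the segment, in [0, 1)
        synthetic[row] = minority[i] + gap * (minority[j] - minority[i])
    return synthetic

rng = np.random.default_rng(1)
minority_points = rng.normal(size=(20, 3))  # hypothetical minority-class rows
new_points = smote_like_samples(minority_points, n_new=30)
print(new_points.shape)
```

Each synthetic row lies on the line segment between a minority point and one of its nearest minority neighbours, so new samples stay inside the region already occupied by the minority class.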

## ***7. ML Model Implementation***

In [None]:
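def model_metrics(y_train, y_test, train_preds, test_preds):
    """Print train/test accuracy, precision, recall and ROC AUC, then plot both confusion matrices."""
    train_accuracy = accuracy_score(y_train, train_preds)
    test_accuracy = accuracy_score(y_test, test_preds)
    train_precision = precision_score(y_train, train_preds)
    test_precision = precision_score(y_test, test_preds)
    train_recall = recall_score(y_train, train_preds)
    test_recall = recall_score(y_test, test_preds)
    # ROC AUC is computed from hard class labels here, so it equals balanced accuracy;
    # pass predicted probabilities instead for a threshold-independent AUC
    train_roc_auc = roc_auc_score(y_train, train_preds)
    test_roc_auc = roc_auc_score(y_test, test_preds)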

    print(f"{'Train Accuracy':<20}{train_accuracy:.4f}")
    print(f"{'Test Accuracy':<20}{test_accuracy:.4f}")
    print(f"{'Train Precision':<20}{train_precision:.4f}")
    print(f"{'Test Precision':<20}{test_precision:.4f}")
    print(f"{'Train Recall':<20}{train_recall:.4f}")
    print(f"{'Test Recall':<20}{test_recall:.4f}")
    print(f"{'Train ROC AUC':<20}{train_roc_auc:.4f}")
    print(f"{'Test ROC AUC':<20}{test_roc_auc:.4f}")
    print("-"*50)

    train_confusion_matrix = confusion_matrix(y_train, train_preds)
    test_confusion_matrix = confusion_matrix(y_test, test_preds)

    fig, axes = plt.subplots(1, 2, figsize=(12, 4))
    labels = ['0', '1']
    sns.heatmap(train_confusion_matrix, annot=True, cmap='Blues', ax=axes[0], fmt="d", xticklabels=labels, yticklabels=labels)
    axes[0].set_xlabel('Predicted labels')
    axes[0].set_ylabel('True labels')
    axes[0].set_title('Train Confusion Matrix')
    sns.heatmap(test_confusion_matrix, annot=True, cmap='Blues', ax=axes[1], fmt="d", xticklabels=labels, yticklabels=labels)
    axes[1].set_xlabel('Predicted labels')
    axes[1].set_ylabel('True labels')
    axes[1].set_title('Test Confusion Matrix')

    plt.show()
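
One detail of `model_metrics` is worth illustrating: it passes hard class predictions to `roc_auc_score`. For binary labels this yields the balanced accuracy, which can differ from the AUC computed on predicted probabilities. The sketch below shows the difference on a small set of made-up scores.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, roc_auc_score

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
# Hypothetical predicted probabilities for the positive class
y_prob = np.array([0.10, 0.30, 0.45, 0.60, 0.40, 0.55, 0.70, 0.90])
y_pred = (y_prob >= 0.5).astype(int)  # hard labels at a 0.5 threshold

auc_from_labels = roc_auc_score(y_true, y_pred)
auc_from_scores = roc_auc_score(y_true, y_prob)
balanced_acc = balanced_accuracy_score(y_true, y_pred)

print(f"AUC from hard labels:   {auc_from_labels:.4f}")
print(f"balanced accuracy:      {balanced_acc:.4f}")
print(f"AUC from probabilities: {auc_from_scores:.4f}")
```

Passing the output of `predict_proba(...)[:, 1]` to `roc_auc_score` gives the threshold-independent value that matches the ROC curves plotted later.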

### **ML Model 1 - Logistic Regression**

In [None]:
# ML Model - 1 Implementation
logistic_classifier= LogisticRegression()
# Fit the Algorithm
logistic_classifier.fit(x_train,y_train)
# Predict on the model
y_train_logistic_pred= logistic_classifier.predict(x_train)
y_test_logistic_pred= logistic_classifier.predict(x_test)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
model_metrics(y_train, y_test, y_train_logistic_pred, y_test_logistic_pred)

In [None]:
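# plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay replaces it
from sklearn.metrics import RocCurveDisplay
RocCurveDisplay.from_estimator(logistic_classifier, x_test, y_test)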

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
from sklearn.model_selection import cross_val_score

# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
logistic_regression = LogisticRegression()
# set up the parameter grid for hyperparameter tuning
param_grid = {'penalty': ['l1', 'l2'],
              'C': [0.1, 1.0, 10.0],
              'solver': ['liblinear', 'saga']}
# Fit the Algorithm
grid_search = GridSearchCV(logistic_regression, param_grid, cv=5)
grid_search.fit(x_train, y_train)
# get the best hyperparameters and print them
best_params = grid_search.best_params_
print('Best hyperparameters:', best_params)
# use the best hyperparameters to fit the model and make predictions
logistic_regression_best = LogisticRegression(**best_params)
# perform cross-validation on the model with the best hyperparameters
cv_scores = cross_val_score(logistic_regression_best, x_train, y_train, cv=5)
print('Cross-validation scores:', cv_scores)
# fit the final model using all the training data and the best hyperparameters
logistic_regression_best.fit(x_train, y_train)
y_train_logistic_pred_cv = logistic_regression_best.predict(x_train)
y_test_logistic_pred_cv  = logistic_regression_best.predict(x_test)
y_score_logistic_pred_cv = logistic_regression_best.predict_proba(x_test)[:, 1]

##### Which hyperparameter optimization technique have you used and why?

GridSearchCV is an exhaustive method for tuning the hyperparameters of machine learning models. It evaluates every combination of values in the supplied parameter grid using cross-validation and selects the combination with the best mean validation score, which for a classifier is accuracy by default.
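
To see what "every combination" means in practice, the sketch below enumerates the logistic regression grid used above with the standard library and counts the model fits a 5-fold search implies. It only mirrors the bookkeeping and does not train anything.

```python
from itertools import product

# The parameter grid used for logistic regression above
param_grid = {'penalty': ['l1', 'l2'],
              'C': [0.1, 1.0, 10.0],
              'solver': ['liblinear', 'saga']}
cv_folds = 5

names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

n_candidates = len(candidates)
n_fits = n_candidates * cv_folds  # one fit per candidate per fold, final refit not counted

print(n_candidates, n_fits)
for candidate in candidates[:3]:
    print(candidate)
```

The cost grows multiplicatively with each added hyperparameter, which is worth keeping in mind for the larger grids used with the later models.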

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

We used GridSearchCV to search the hyperparameter grid, testing every combination of values in it and selecting the one with the highest cross-validated accuracy.

Despite this, tuning produced no significant improvement: the **test accuracy was 67.97%**, and **the test precision and recall stood at 66.55% and 69.27%, respectively**. The area under the curve **(ROC AUC) score was 0.68**, which falls short of our desired benchmark.

As a result, we plan to explore alternative models such as **Random Forest and XGBoost in pursuit of improved accuracy and a higher AUC score**.

### **ML Model 2 - Random Forest Classifier**

In [None]:
# ML Model - 2  Implementation
random_forest = RandomForestClassifier(n_estimators=100, max_depth=10, min_samples_split=2, min_samples_leaf=1)

# Fit the Algorithm
random_forest.fit(x_train, y_train)

# Predict on the model
y_train_rf_pred = random_forest.predict(x_train)
y_test_rf_pred = random_forest.predict(x_test)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
model_metrics(y_train, y_test, y_train_rf_pred, y_test_rf_pred)

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
random_forest = RandomForestClassifier()
param_grid = {'n_estimators': [100, 200, 300],
              'max_depth': [5, 10, 15, None],
              'min_samples_split': [2, 5, 10],
              'min_samples_leaf': [1, 2, 4]}
# Fit the Algorithm
grid_search = GridSearchCV(random_forest, param_grid, cv=5)
grid_search.fit(x_train, y_train)
# get the best hyperparameters and print them
best_params = grid_search.best_params_
print('Best hyperparameters:', best_params)
# use the best hyperparameters to fit the model to the training data
random_forest_best = RandomForestClassifier(**best_params)
random_forest_best.fit(x_train, y_train)
# Predict on the model
y_train_rf_pred_gs = random_forest_best.predict(x_train)
y_test_rf_pred_gs  = random_forest_best.predict(x_test)
y_score_rf_pred_gs = random_forest_best.predict_proba(x_test)[:, 1]

In [None]:
# Visualizing evaluation Metric Score chart
model_metrics(y_train, y_test, y_train_rf_pred_gs, y_test_rf_pred_gs)

##### Which hyperparameter optimization technique have you used and why?

GridSearchCV is a systematic method for tuning the hyperparameters of machine learning models. By searching every combination of values in the supplied grid and scoring each with cross-validation, it identifies the combination that performs best on the held-out folds.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

* Using GridSearchCV, we improved the model by identifying better hyperparameters. GridSearchCV evaluates every combination in the grid and selects the values that maximize cross-validated performance.

* Following hyperparameter tuning, we determined the best parameters to be **'min_samples_leaf': 1, 'min_samples_split': 2, and 'n_estimators': 200.**

* Tuning yielded a 100% train accuracy that did not carry over fully to the test data, a sign that the model overfits the training set. Even so, the test accuracy rose significantly, from **83.07% to 88.89%**.

* The **ROC AUC score also improved, increasing from 0.8311 to 0.8890**.
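
A gap between train and held-out accuracy like the one noted above can be measured directly by comparing the training score with a cross-validated score. The sketch below does this for a random forest on synthetic data from `make_classification`. The data and settings are assumptions chosen for illustration, not the project's.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic, slightly noisy classification data (stand-in for the real features)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=15, n_informative=5, flip_y=0.1, random_state=42
)

forest = RandomForestClassifier(n_estimators=100, random_state=42)

# Accuracy on folds the model has not seen during fitting
cv_accuracy = cross_val_score(forest, X_demo, y_demo, cv=5, scoring='accuracy').mean()

# Accuracy on the very rows the model was fitted on
train_accuracy = forest.fit(X_demo, y_demo).score(X_demo, y_demo)

print(f"train accuracy: {train_accuracy:.3f}")
print(f"5-fold CV accuracy: {cv_accuracy:.3f}")
```

Fully grown trees tend to memorize the training rows, so the training score is an optimistic figure. The cross-validated or test score is the one to report.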

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Evaluating an ML model is essential for gauging how reliable its predictions are. We used a range of metrics, namely Accuracy, Precision, Recall, and the ROC AUC score, to measure how closely the predicted values match the ground truth. The tuned model reached a test accuracy of roughly 88.89% in predicting TenYearCHD. This level of accuracy matters because the target variable, TenYearCHD, is directly relevant to business operations.

### **ML Model 3 - XGBoost Classifier**

In [None]:
# ML Model - 3 Implementation
xgb = XGBClassifier()
# Fit the Algorithm
xgb.fit(x_train, y_train)
# Predict on the model
y_train_xgb_pred = xgb.predict(x_train)
y_test_xgb_pred = xgb.predict(x_test)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
model_metrics(y_train, y_test, y_train_xgb_pred, y_test_xgb_pred)

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
# set up the parameter grid for hyperparameter tuning
param_grid = {'max_depth': [3, 5, 7],
              'learning_rate': [0.01, 0.1, 0.3],
              'n_estimators': [50, 100, 200]}
# Fit the Algorithm
grid_search = GridSearchCV(xgb, param_grid, cv=5, n_jobs=-1)
grid_search.fit(x_train, y_train)
# print the best hyperparameters
print('Best hyperparameters:', grid_search.best_params_)
# Predict on the model
best_estimator = grid_search.best_estimator_
y_train_xgb_pred_gs = best_estimator.predict(x_train)
y_test_xgb_pred_gs  = best_estimator.predict(x_test)
y_score_xgb_pred_gs = best_estimator.predict_proba(x_test)[:, 1]

In [None]:
# Visualizing evaluation Metric Score chart
model_metrics(y_train, y_test, y_train_xgb_pred_gs, y_test_xgb_pred_gs)

##### Which hyperparameter optimization technique have you used and why?

To optimize the hyperparameters of the model, we used GridSearchCV. It evaluates every combination of hyperparameter values in the supplied grid and selects the one with the best cross-validated performance.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

We used GridSearchCV to evaluate every hyperparameter combination in the grid and keep the best one, which improved the model's test performance.

After hyperparameter tuning, the best parameters were 'learning_rate': 0.1, 'max_depth': 7, 'n_estimators': 200.

The accuracy of our model improved significantly, from 82.55% to 89.67%. Precision and recall rose to 92.69% and 85.61%, respectively, and the ROC AUC score improved to 0.8958, which is considered good.

### **ML Model 4 - K-Nearest Neighbors (KNN)**

In [None]:
# ML Model - 4 Implementation
knn = KNeighborsClassifier(n_neighbors=5)

# Fit the Algorithm
knn.fit(x_train, y_train)

# Predict on the model
y_train_knn_pred = knn.predict(x_train)
y_test_knn_pred = knn.predict(x_test)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
model_metrics(y_train, y_test, y_train_knn_pred, y_test_knn_pred)

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 4  Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
# set up the parameter grid for hyperparameter tuning
param_grid = {'n_neighbors': [3, 5, 7],
              'weights': ['uniform', 'distance']}
# Fit the Algorithm
grid_search = GridSearchCV(knn, param_grid, cv=5)
grid_search.fit(x_train, y_train)
# get the best hyperparameters and print them
best_params = grid_search.best_params_
print('Best hyperparameters:', best_params)
# train the classifier with the best hyperparameters on the full training set
knn_best = KNeighborsClassifier(**best_params)
knn_best.fit(x_train, y_train)
# Predict on the model
y_test_knn_pred_gs  = knn_best.predict(x_test)
y_train_knn_pred_gs = knn_best.predict(x_train)
y_score_knn_pred_gs = knn_best.predict_proba(x_test)[:, 1]

In [None]:
# Visualizing evaluation Metric Score chart
model_metrics(y_train, y_test, y_train_knn_pred_gs, y_test_knn_pred_gs)

##### Which hyperparameter optimization technique have you used and why?

To improve the performance of the model, we used GridSearchCV to tune the hyperparameters. It evaluates every combination of values in the supplied grid and selects the combination with the best cross-validated performance.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Yes. Searching the grid with GridSearchCV and refitting with the selected values improved the model's performance.

For the KNN model, accuracy improved from 78.56% to 82.12% after hyperparameter tuning, with a precision of 74.02%, a recall of 97.69%, and a ROC AUC score of 0.8246. The ROC AUC is higher than before tuning but lower than that of the previous model (XGBoost).

### **ML Model 5 - Support Vector Machine Classifier (SVC)**

In [None]:
# ML Model - 5 Implementation
svc = SVC(kernel='rbf', C=1, gamma='scale')

# Fit the Algorithm
svc.fit(x_train, y_train)

# Predict on the model
y_train_svc_pred = svc.predict(x_train)
y_test_svc_pred = svc.predict(x_test)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
model_metrics(y_train, y_test, y_train_svc_pred, y_test_svc_pred)

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 5  Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
svc = SVC(probability=True)
# set up the parameter grid for hyperparameter tuning
param_grid = {'C': [0.1, 1, 10],
              'kernel': ['linear', 'rbf'],
              'gamma': ['scale', 'auto']}
# perform a grid search with 5-fold cross-validation to find the best hyperparameters
grid_search = GridSearchCV(svc, param_grid, cv=5)
grid_search.fit(x_train, y_train)
# get the best hyperparameters and print them
best_params = grid_search.best_params_
print('Best hyperparameters:', best_params)
# train the classifier with the best hyperparameters on the full training set
svc_best = SVC(**best_params, probability=True)
svc_best.fit(x_train, y_train)
# Predict on the model
y_test_svc_pred_gs = svc_best.predict(x_test)
y_train_svc_pred_gs = svc_best.predict(x_train)
y_score_svc_pred_gs = svc_best.predict_proba(x_test)[:, 1]

In [None]:
# Visualizing evaluation Metric Score chart
model_metrics(y_train, y_test, y_train_svc_pred_gs, y_test_svc_pred_gs)

##### Which hyperparameter optimization technique have you used and why?

We used GridSearchCV to tune the hyperparameters of the model. It evaluates every combination of values in the supplied grid with 5-fold cross-validation and identifies the combination that performs best.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Yes. After hyperparameter tuning, the model's performance improved: accuracy increased from 70.14% to 76.74%, precision from 68.65% to 73.30%, and recall from 71.58% to 82.42%. The ROC AUC score was 0.7686.

### **ML Model 6 - Naive Bayes Classifier**

In [None]:
# ML Model - 6 Implementation
# create an instance of the Gaussian Naive Bayes classifier
nb = GaussianNB()

# Fit the Algorithm
nb.fit(x_train, y_train)

# Predict on the model
y_train_nb_pred = nb.predict(x_train)
y_test_nb_pred = nb.predict(x_test)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
model_metrics(y_train, y_test, y_train_nb_pred, y_test_nb_pred)

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 6  Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
from sklearn.model_selection import cross_val_score
# create an instance of the Gaussian Naive Bayes classifier
nb = GaussianNB()
# set up the parameter grid for hyperparameter tuning
param_grid = {'var_smoothing': [1e-9, 1e-8, 1e-7, 1e-6, 1e-5]}
# perform a grid search with cross-validation to find the best hyperparameters
grid_search = GridSearchCV(nb, param_grid, cv=5)
grid_search.fit(x_train, y_train)
# get the best hyperparameters and print them
best_params = grid_search.best_params_
print('Best hyperparameters:', best_params)
# create a new instance of the classifier using the best hyperparameters
nb_best = GaussianNB(**best_params)
# evaluate the classifier using cross-validation
scores = cross_val_score(nb_best, x_train, y_train, cv=5)
# print the cross-validation scores
print('Cross-validation scores:', scores)
# train the classifier on the entire training set using the best hyperparameters
nb_best.fit(x_train, y_train)
# make predictions on the training and test sets
y_train_nb_pred_gs = nb_best.predict(x_train)
y_test_nb_pred_gs = nb_best.predict(x_test)
y_score_nb_pred_gs = nb_best.predict_proba(x_test)[:, 1]

In [None]:
# Visualizing evaluation Metric Score chart
model_metrics(y_train, y_test, y_train_nb_pred_gs, y_test_nb_pred_gs)

##### Which hyperparameter optimization technique have you used and why?

We used GridSearchCV to tune the hyperparameters of the model. For Gaussian Naive Bayes the grid covers a single parameter, `var_smoothing`, and GridSearchCV evaluated each candidate value with 5-fold cross-validation.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

No. Performance after tuning is almost the same as before, with no significant improvement.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Predicting TenYearCHD is treated as a classification problem: the objective is to predict an outcome variable (TenYearCHD) from one or more predictor variables.

Evaluation metrics commonly employed for assessing the performance of a classification model include:

**Accuracy**: This metric measures the proportion of correctly classified instances among all instances. A higher accuracy score signifies the model's superior ability to correctly predict the class for each instance.

**Precision:** Precision quantifies the ratio of true positive predictions to all positive predictions made by the model. It is computed by dividing the number of true positives by the sum of true positives and false positives. A higher precision score indicates a lower rate of false positives, which holds significance in applications where the cost of false positives is substantial.

**Recall:** Also known as sensitivity or true positive rate, recall measures the proportion of true positive predictions among all instances genuinely belonging to the positive class. It is determined by dividing the number of true positives by the sum of true positives and false negatives. A higher recall score signifies a lower rate of false negatives, which is particularly important in situations where false negatives carry a high cost.

**AUC ROC (Area Under the Receiver Operating Characteristic Curve):** AUC ROC serves as a metric to assess the performance of binary classification models. It evaluates the model's capability to differentiate between positive and negative classes at various probability thresholds. The AUC ROC score ranges from 0 to 1, with 0.5 indicating a random model and 1 signifying a perfect model. A higher AUC ROC score indicates the model's superior ability to distinguish between positive and negative classes.
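
The first three definitions reduce to simple ratios of confusion-matrix counts. The sketch below computes them for a set of hypothetical counts. The `classification_metrics` helper and the numbers are illustrative assumptions, not results from this project.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision and recall from the four confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Hypothetical counts for 200 test patients, 30 of whom truly develop CHD
accuracy, precision, recall = classification_metrics(tp=24, fp=16, fn=6, tn=154)

print(f"accuracy:  {accuracy:.2f}")   # (24 + 154) / 200
print(f"precision: {precision:.2f}")  # 24 / (24 + 16)
print(f"recall:    {recall:.2f}")     # 24 / (24 + 6)
```

In this example the accuracy is high even though 6 of the 30 at-risk patients are missed, which is the kind of error recall is designed to expose.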

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

In [None]:
# define the classifiers
classifiers = [ ("Logistic Regression", LogisticRegression()),
                ("Random Forest Classifier", RandomForestClassifier()),
                ("XGB Classifier", XGBClassifier()),
                ("KNN", KNeighborsClassifier()),
                ("SVC", SVC(probability=True)),
                ("NB Classifier", GaussianNB())]

# iterate through classifiers and plot ROC curves
plt.figure(figsize=(8, 6))
for name, classifier in classifiers:
    classifier.fit(x_train, y_train)
    y_score = classifier.predict_proba(x_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, y_score)
    roc_auc = auc(fpr, tpr)
    plt.plot(fpr, tpr, label=f"{name} (AUC = {roc_auc:.2f})")
plt.plot([0, 1], [0, 1], color='gray', linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()

**After cross validation and hyperparameter tuning**

In [None]:
# Storing metrics in order to make dataframe
# (after cross validation and hyperparameter tuning)
Model = ["Logistic Regression", "Random Forest Classifier", "XGBoost", "KNN", "SVC","NBClassifier"]
Y_SCORE = [y_score_logistic_pred_cv, y_score_rf_pred_gs, y_score_xgb_pred_gs,
           y_score_knn_pred_gs, y_score_svc_pred_gs,y_score_nb_pred_gs]

# Create dataframe from the lists
data = {'MODEL': Model, 'Y_SCORE': Y_SCORE}
Metric_df = pd.DataFrame(data)

# plot the ROC curves for each model
plt.figure(figsize=(8, 6))
for i, row in Metric_df.iterrows():
    fpr, tpr, _ = roc_curve(y_test, row['Y_SCORE'])
    roc_auc = auc(fpr, tpr)
    plt.plot(fpr, tpr, label=f"{row['MODEL']} (AUC = {roc_auc:.2f})", alpha=0.8)
plt.plot([0, 1], [0, 1], color='grey', linestyle='--', label='Random Guess')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc="lower right")
plt.show()
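
The AUC values shown in the legends have a direct interpretation: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The sketch below computes AUC that way for a tiny made-up example. `pairwise_auc` is a hypothetical helper written for illustration and is far slower than `roc_curve` plus `auc` on real data.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for s, t in zip(y_score, y_true) if t == 1]
    negatives = [s for s, t in zip(y_score, y_true) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]  # made-up predicted probabilities

auc_value = pairwise_auc(labels, scores)
print(auc_value)  # 3 of the 4 positive/negative pairs are ordered correctly -> 0.75
```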

In [None]:
# Storing metrics in order to make dataframe of metrics
# (after cross validation and hyperparameter tuning)
Model          = ["Logistic Regression", "Random Forest Classifier", "XGBoost", "KNN", "SVC", "NBClassifier"]
Test_Accuracy  = [0.6797,0.8889,0.8967,0.8212,0.7674,0.5885]
Test_Precision = [0.6655,0.8796,0.9269,0.7402,0.7330,0.7330]
Test_Recall    = [0.6927,0.8952,0.8561,0.9769,0.8242,0.2487]
Test_ROC_AUC   = [0.6800,0.8890,0.8958,0.8246,0.7686,0.5810]
# Create dataframe from the lists
data = {'Model' : Model,
        'Test_Accuracy'  : Test_Accuracy,
        'Test_Precision' : Test_Precision,
        'Test_Recall'    : Test_Recall,
        'Test_ROC_AUC'   : Test_ROC_AUC}
Metric_df = pd.DataFrame(data)

# Printing dataframe
Metric_df

* Considering the outcomes of the models evaluated in this cardiovascular risk prediction project, the Random Forest Classifier and XGBoost models emerge as the most favorable choices for the final prediction model. They achieve the highest test accuracy scores, 0.8889 and 0.8967 respectively, together with strong precision and recall scores, indicating that they predict both positive and negative cases well.

* Although the KNN model demonstrates a relatively high recall score, its accuracy and precision metrics fall short of those achieved by the Random Forest Classifier and XGBoost models. Similarly, the SVC model displays lower accuracy and ROC AUC scores, implying its potential unsuitability for this specific classification task.

* However, it's noteworthy that the XGBoost model outperforms the Random Forest Classifier marginally, boasting superior test accuracy and precision scores. This suggests that the XGBoost model may be the superior choice for forecasting cardiovascular risk.

* Furthermore, the XGBoost model exhibits a higher ROC AUC score, indicating its enhanced ability to discriminate between positive and negative cases. Therefore, based on these findings, the XGBoost model stands out as the most fitting classification model for predicting cardiovascular risk in this project.
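
The comparison above can be reproduced from the reported table with a few lines of plain Python. The sketch below copies the test-set figures listed in the metrics dataframe and reports which model leads on each metric. `best_by` is a hypothetical helper name.

```python
# Test-set metrics as reported in the table above (after tuning)
reported = {
    "Logistic Regression":      {"accuracy": 0.6797, "precision": 0.6655, "recall": 0.6927, "roc_auc": 0.6800},
    "Random Forest Classifier": {"accuracy": 0.8889, "precision": 0.8796, "recall": 0.8952, "roc_auc": 0.8890},
    "XGBoost":                  {"accuracy": 0.8967, "precision": 0.9269, "recall": 0.8561, "roc_auc": 0.8958},
    "KNN":                      {"accuracy": 0.8212, "precision": 0.7402, "recall": 0.9769, "roc_auc": 0.8246},
    "SVC":                      {"accuracy": 0.7674, "precision": 0.7330, "recall": 0.8242, "roc_auc": 0.7686},
    "NBClassifier":             {"accuracy": 0.5885, "precision": 0.7330, "recall": 0.2487, "roc_auc": 0.5810},
}

def best_by(metric):
    """Return the model name with the highest value for the given metric."""
    return max(reported, key=lambda model: reported[model][metric])

for metric in ("accuracy", "precision", "recall", "roc_auc"):
    winner = best_by(metric)
    print(f"{metric:<10} {winner} ({reported[winner][metric]:.4f})")
```

XGBoost leads on accuracy, precision, and ROC AUC, while KNN has the highest recall, so the final choice depends on how heavily missed positive cases are weighted.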








### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Although tree-based algorithms can be less interpretable, interpretability can be improved using tools like LIME and SHAP.

Model interpretability can be approached globally and locally.

1. Global interpretability refers to understanding the overall relationship between features and prediction results, e.g., the coefficients of a linear regression.

2. Local interpretability focuses on understanding the individual impact of each feature on a specific prediction, e.g., the explanations produced by SHAP and LIME.
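
As a model-agnostic illustration of global interpretability, the sketch below uses scikit-learn's `permutation_importance`, a different technique from SHAP or LIME, swapped in here because it needs no extra packages. The data is synthetic, with only the first feature carrying signal. All names are assumptions for this example.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)  # only feature 0 determines the label

model = LogisticRegression().fit(X_demo, y_demo)

# Shuffle one column at a time and record how much the accuracy drops
result = permutation_importance(model, X_demo, y_demo, n_repeats=10, random_state=0)

most_important = int(np.argmax(result.importances_mean))
print("mean importance per feature:", result.importances_mean.round(3))
print("most important feature index:", most_important)
```

In a real analysis the importance would usually be computed on held-out data, so that it reflects what the model relies on when generalizing.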

**Global Explainability**

In [None]:
# Plotting the barplot to determine which feature is contributing the most
# Use only the predictor columns so labels line up with the importances
features = final_df.drop('TenYearCHD', axis=1).columns
importances = best_estimator.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(8,10))
plt.grid(zorder=0)
plt.title('Feature Importances', fontsize=20)
plt.barh(range(len(indices)), importances[indices], align='center')
plt.yticks(range(len(indices)), features[indices])
plt.xlabel('Relative Importance')
plt.show()

## ***8.*** ***Future Work (Optional)***

The ultimate model instance will be saved using the 'pickle' module for future reference. Pickling involves the conversion of a Python object into a byte stream, while unpickling is the reverse operation, which transforms a byte stream back into a Python object. This procedure is commonly referred to as serialization, marshalling, or flattening. The 'pickle' module facilitates the implementation of binary protocols to serialize and deserialize a Python object structure, rendering it valuable for purposes such as storing objects in a file, maintaining program state across sessions, or transmitting data across a network.
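
The round trip described above can be tried with the standard library alone. In the sketch below a plain dictionary stands in for a fitted model (an assumption for illustration), and the file is written to a temporary directory.

```python
import os
import pickle
import tempfile

# A plain dictionary stands in for a fitted model; any picklable object works the same way
model_stub = {"name": "demo-classifier", "threshold": 0.5, "weights": [0.2, -1.3, 0.7]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model_stub.pkl")
    with open(path, "wb") as fh:   # 'wb' = write bytes (serialize)
        pickle.dump(model_stub, fh)
    with open(path, "rb") as fh:   # 'rb' = read bytes (deserialize)
        restored = pickle.load(fh)

print(restored == model_stub)   # equal content
print(restored is model_stub)   # but a new object
```

Pickle files can execute arbitrary code when loaded, so only files from a trusted source should be unpickled.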

### 1. Save the best performing ML model in a pickle file or joblib file format for deployment process.


In [None]:
# Importing pickle module
import pickle
# Save the File
filename='Cardiovascular_Risk_Prediction_Classification.pkl'
# serialize process (wb=write byte); the with-block closes the file afterwards
with open(filename,'wb') as model_file:
    pickle.dump(best_estimator,model_file)

### 2. Load the saved model file again and predict on unseen data as a sanity check.


In [None]:
# Load the file and predict unseen data.

# Deserialize the model (rb = read bytes); the with block closes the file afterwards
with open(filename,'rb') as model_file:
    Classification_model= pickle.load(model_file)

# Predicting the unseen data (test set)
Classification_model.predict(x_test)

In [None]:
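# Checking that the reloaded model reproduces the predictions made before pickling
# (assumes y_test_xgb_pred_gs holds the original model's class predictions on x_test)
np.array_equal(Classification_model.predict(x_test), y_test_xgb_pred_gs)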
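
As a self-contained illustration of this sanity check, the sketch below trains a small model, saves it to a temporary file, reloads it, and compares predictions programmatically. It uses synthetic data and a logistic regression purely as stand-ins, not the project's Framingham data or its tuned XGBoost estimator, and the file name is hypothetical.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the project's dataset).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit a simple stand-in model and record its predictions before saving.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
preds_before = model.predict(X_test)

# Save to a temporary file and load it back; files are closed by the with blocks.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "demo_model.pkl"  # hypothetical file name
    with open(model_path, "wb") as f:
        pickle.dump(model, f)
    with open(model_path, "rb") as f:
        reloaded = pickle.load(f)

# The reloaded model should reproduce the original predictions exactly.
preds_after = reloaded.predict(X_test)
same = np.array_equal(preds_before, preds_after)
print("identical predictions:", same)
```

Comparing with `np.array_equal` returns a single boolean, which is easier to verify than comparing two printed arrays by eye.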

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

**Conclusion From EDA**

* Age emerges as a noteworthy factor influencing the risk of coronary heart disease (CHD).

* The dataset suggests a higher predisposition for CHD among men compared to women.

* Smoking emerges as a risk factor for CHD, with the intensity of smoking playing a role in determining this risk.

* Patients with elevated blood pressure or hypertension, a history of stroke, or diabetes exhibit a heightened risk for CHD.

* Total cholesterol levels show a modest elevation in patients at risk for CHD.

* Noteworthy positive correlations are observed between specific variables, such as age and systolic blood pressure, as well as BMI and glucose levels.

**Conclusion From Model Implementation**

1. Out of the six models that underwent testing, the Random Forest Classifier and XGBoost models demonstrated superior performance, boasting notably high accuracy, precision, and recall scores.

2. While the KNN model exhibited a relatively commendable recall score, its accuracy and precision scores fell below those achieved by the Random Forest Classifier and XGBoost models.

3. The SVC model, with its lower accuracy and ROC AUC score, appears less well-suited for this specific classification task.

4. Comparatively, the XGBoost model outperformed the Random Forest Classifier, yielding slightly higher test accuracy and precision scores along with a superior ROC AUC score, implying its potential as a more effective choice for cardiovascular risk prediction.

5. Based on the outcomes presented, the **XGBoost model** was selected as the optimal classification model for the cardiovascular risk prediction dataset, delivering an **impressive accuracy rate of 89.67%**.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***