<a href="https://colab.research.google.com/github/AkshayAI007/Cardiovascular-disease-risk-prediction-using-Machine-learning/blob/main/Cardiovascular_Risk_Prediction_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Cardiovascular Risk Prediction



##### **Project Type**    - Classification
##### **Contribution**    - Individual
##### **Team Member 1 -** Akshay Bawaliwale


# **Project Summary -**

Cardiovascular disease is a leading cause of death worldwide, and early prediction of cardiovascular risk can help in timely intervention and prevention of the disease. Machine learning techniques have shown promising results in predicting cardiovascular risk by analyzing various risk factors.

The goal of this project is to develop a machine learning model to predict the 10-year risk of cardiovascular disease in individuals using a dataset of demographic, clinical, and laboratory data.

The dataset used in this project is the Framingham Heart Study dataset, which is a widely used dataset for cardiovascular risk prediction. It contains data on 3,390 participants, who were followed up for ten years to track cardiovascular events. The dataset includes 17 variables such as age, sex, blood pressure, cholesterol levels, smoking status, and diabetes status.

The first step in this project is to perform data preprocessing, which includes handling missing values, encoding categorical variables, and scaling numerical variables. After preprocessing, the dataset is split into training and testing sets using a 70:20 ratio.

Various machine learning algorithms are applied to the training data, including logistic regression, KNN, XGBoost, SVC, random forest, and a Naive Bayes classifier. These algorithms are chosen as they have been shown to perform well in cardiovascular risk prediction. The algorithms are trained on the training data, and their performance is evaluated using the testing data.

The evaluation metrics used in this project include accuracy, precision, recall, and area under the receiver operating characteristic curve (AUC-ROC). These metrics help in assessing the performance of the machine learning model.
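As a rough illustration of how these four metrics are computed with scikit-learn, the sketch below uses a handful of made-up labels and scores. The values, the variable names, and the 0.5 decision threshold are assumptions for illustration only; they are not results from this project.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

# Made-up ground truth and predicted probabilities (illustration only)
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_prob = [0.1, 0.4, 0.2, 0.6, 0.8, 0.7, 0.3, 0.05]

# Turn probabilities into class labels; the 0.5 threshold is an assumption
y_pred = [int(p >= 0.5) for p in y_prob]

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    # AUC-ROC is computed from the scores, not from the thresholded labels
    "auc_roc": roc_auc_score(y_true, y_prob),
}
print(metrics)
```

Accuracy, precision, and recall depend on the chosen threshold, while AUC-ROC summarizes ranking quality across all thresholds.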

The results show that XGBoost performs best, with an accuracy of 0.89, precision of 0.92, recall of 0.85, and AUC-ROC of 0.89. This indicates that the model has a good overall performance in predicting cardiovascular risk.

Further analysis is performed to identify the most important features in the dataset. The feature importance plot shows that age, education, prevalentHyp, and cigarettes per day are the most important features in predicting cardiovascular risk. This information can help in identifying high-risk individuals and implementing preventive measures.

In conclusion, this project demonstrates the effectiveness of machine learning techniques in predicting cardiovascular risk using the Framingham Heart Study dataset. The developed machine learning model can be used by healthcare professionals to identify individuals at high risk of cardiovascular disease and take preventive measures to reduce the risk.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


**Cardiovascular disease is a major cause of morbidity and mortality worldwide. Early identification and management of individuals at high risk of developing cardiovascular disease is crucial for the prevention of the disease. Traditional risk prediction models, such as the Framingham Risk Score, have limitations in their accuracy and do not account for the complex interactions between various risk factors. Machine learning techniques have shown promising results in improving the accuracy of cardiovascular risk prediction by integrating various risk factors and identifying non-linear interactions. However, there is a need for developing and validating machine learning models that can accurately predict cardiovascular risk using demographic, clinical, and laboratory data. The goal of this project is to address this need by developing and evaluating a machine learning model for predicting the 10-year risk of cardiovascular disease using the Framingham Heart Study dataset.**

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
#compatible versions of modules
!sudo apt-get install python3.9
!pip install scikit-learn==1.1.2

In [None]:
# Import Libraries

## Data Manipulation Libraries
import numpy as np
import pandas as pd

## Data Visualisation Libraries
import matplotlib.pyplot as plt
%matplotlib inline
import pylab
import seaborn as sns

## Machine Learning
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from xgboost import XGBClassifier
from sklearn.svm import SVC

## Importing essential libraries to check the accuracy
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.metrics import roc_auc_score, classification_report
from sklearn.metrics import roc_curve, auc
from sklearn.metrics import precision_score, recall_score, f1_score
# The plot_* helpers below were removed in scikit-learn 1.2;
# they require the scikit-learn 1.1.2 release pinned above
from sklearn.metrics import plot_confusion_matrix
from sklearn.metrics import plot_precision_recall_curve, plot_roc_curve

## Warnings
import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Mount Google Drive so the dataset file can be read
from google.colab import drive
drive.mount('/content/drive')

# Load Dataset
path='/content/drive/MyDrive/Projects/Cardiovascular_disease_risk_prediction/data_cardiovascular_risk.csv'
df = pd.read_csv(path)

### Dataset First View

In [None]:
# Dataset First Look
df.head()

In [None]:
#Last 5 entries
df.tail()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print("Dataset Size")
print("Rows = {} and  Columns = {}".format(df.shape[0], df.shape[1]))

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
len(df[df.duplicated()])

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print(df.isnull().sum())

In [None]:
# Visualizing the missing values
sns.heatmap(df.isnull(), cbar = False)

### What did you know about your dataset?

The dataset covers demographic, behavioral, and medical risk factors that can influence an individual's likelihood of developing cardiovascular disease: age, sex, smoking habits, blood pressure, cholesterol levels, body mass index, and diabetes status, along with medical history such as a prior stroke or hypertension. Several columns contain missing values (glucose, education, BPMeds, totChol, cigsPerDay, BMI, and heartRate), with glucose and education being the most notable in this regard.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe()

### Variables Description

**Demographic:**

1) Age: Age of the patient.

2) Sex: male or female ("M" or "F").

**Behavioral:**

3) is_smoking: whether or not the patient is a current smoker ("YES" or "NO").

4) CigsPerDay: the number of cigarettes that the person smoked on average in one day (treated as a continuous feature, since a person can smoke any number of cigarettes a day).

**Medical(history):**

5) BPMeds: whether or not the patient was on blood pressure medication.

6) Prevalent Stroke: whether or not the patient had previously had a stroke.

7) Prevalent Hyp: whether or not the patient was hypertensive.

8) Diabetes: whether or not the patient had diabetes.

**Medical(current):**

9) Tot Chol: total cholesterol level.

10) Sys BP: systolic blood pressure.

11) Dia BP: diastolic blood pressure.

12) BMI: Body Mass Index.

13) Heart Rate: heart rate.

14) Glucose: glucose level.

**Target feature(class of risk):**

15) TenYearCHD: 10-year risk of coronary heart disease CHD (“1”, means “Yes”, “0” means “No”)

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for i in df.columns.tolist():
  print("No. of unique values in ", i , "is" , df[i].nunique(), ".")

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# Separating the categorical and continuous variables and storing them
# (columns with fewer than 5 unique values are treated as categorical)
categorical_variable=[]
continous_variable=[]

for i in df.columns:
  if i == 'id':
    pass
  elif df[i].nunique() <5:
    categorical_variable.append(i)
  elif df[i].nunique() >= 5:
    continous_variable.append(i)

print(categorical_variable)
print(continous_variable)

In [None]:
# Summing null values
print('Missing Data Count')
df.isna().sum()[df.isna().sum() > 0].sort_values(ascending=False)

In [None]:
print('Missing Data Percentage')
print(round(df.isna().sum()[df.isna().sum() > 0].sort_values(ascending=False)/len(df)*100,2))

In [None]:
# storing the column that contains null values
null_column_list= ['glucose','education','BPMeds','totChol','cigsPerDay','BMI','heartRate']
# plotting box plot
plt.figure(figsize=(15,8))
df[null_column_list].boxplot()

In [None]:
# Define a list of colors
colors = sns.color_palette("husl", len(null_column_list))

# Create a figure with 8 subplots (2 rows, 4 columns)
fig, axes = plt.subplots(nrows=2, ncols=4, figsize=(16, 8))

# Flatten the axes array to make it easier to iterate over
axes = axes.flatten()

# Iterate over the null column list and plot each column's distribution
for i, column in enumerate(null_column_list):
    # Select the current axis
    ax = axes[i]
    # Plot a histogram with a KDE overlay in a different color
    # (sns.distplot is deprecated in recent seaborn releases)
    sns.histplot(df[column], kde=True, stat="density", ax=ax, color=colors[i])
    # Add a title to the plot
    ax.set_title(column)

# Remove any unused subplots
for j in range(len(null_column_list), len(axes)):
    axes[j].remove()

# Display the plots
plt.show()

It is a well-known fact that the appropriate measure of central tendency depends on the nature of the data. Typically, the mean is used for data that follows a normal distribution and does not contain any outliers. On the other hand, when dealing with numerical, continuous data that contains extreme values or outliers, the median is the preferred measure of central tendency. For categorical data, the mode is used.

Based on the outliers and distribution of the data, we have determined that the following measures of central tendency are appropriate for imputing the null values in the following columns:

**"education", "BPMeds"** -> mode: As "education" and "BPMeds" are categorical variables, the mode is the most appropriate measure of central tendency. The mode is the most frequently occurring value in the distribution, for example the most common level of education in the dataset.

**"glucose", "totChol", "cigsPerDay", "BMI", "heartRate"** -> median: Since these are numerical, continuous variables that contain extreme values or outliers, we have chosen the median as the appropriate measure of central tendency. The median is less sensitive to extreme values than the mean and provides a representative value for the central tendency of the distribution.
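To make the mean-versus-median argument concrete, here is a minimal sketch on a made-up series with one extreme value and one missing entry. The numbers are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Made-up glucose-like readings: one extreme value and one missing entry
s = pd.Series([78.0, 82.0, 85.0, 90.0, 394.0, None])

# The mean is pulled upward by the single extreme value
print("mean:", s.mean())
# The median stays close to the bulk of the readings
print("median:", s.median())

# Filling the gap with the median keeps the imputed value typical
filled = s.fillna(s.median())
print(filled.tolist())
```

With a single extreme reading, the mean lands far above most observations, which is why the median is the safer fill value for skewed columns.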

In [None]:
# Imputing missing values with median or mode
df.fillna({'glucose': df['glucose'].median(),
           'education': df['education'].mode()[0],
           'BPMeds': df['BPMeds'].mode()[0],
           'totChol': df['totChol'].median(),
           'cigsPerDay': df['cigsPerDay'].median(),
           'BMI': df['BMI'].median(),
           'heartRate': df['heartRate'].median()}, inplace=True)

### What all manipulations have you done and insights you found?

We addressed the issue of missing data by employing a combined approach of imputation using median and mode values. Specifically, for the glucose and totChol columns, as well as cigsPerDay, BMI, and heartRate, we substituted the missing values with the median of the available non-missing values. Conversely, for the education and BPMeds columns, we filled in missing values with the mode, which represents the most frequently occurring value among the non-missing data points.

These methods of imputation, utilizing median and mode values, are widely recognized and commonly employed to handle missing data. Median imputation is a preferred choice for continuous variables due to its robustness against outliers when compared to mean imputation. On the other hand, mode imputation is commonly used for categorical variables or discrete variables with a limited number of possible values.
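One design point worth considering: the imputation above computes medians and modes on the full DataFrame. An alternative, sketched below on made-up rows, is to split first and learn the fill values from the training rows only, so that no information from the test rows leaks into preprocessing. This is not what the notebook does; the toy data, `test_size`, and `random_state` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Made-up rows reusing two of the notebook's column names
toy = pd.DataFrame({
    "glucose":   [80, 85, np.nan, 90, 300, 75, np.nan, 88],
    "education": [1, 2, 1, np.nan, 1, 3, 1, 2],
})

# Split first (test_size and random_state are arbitrary choices here)
train, test = train_test_split(toy, test_size=0.25, random_state=0)
train, test = train.copy(), test.copy()

# Learn each fill value from the training rows, then reuse it on the test rows
for column, strategy in [("glucose", "median"), ("education", "most_frequent")]:
    imputer = SimpleImputer(strategy=strategy)
    train[[column]] = imputer.fit_transform(train[[column]])
    test[[column]] = imputer.transform(test[[column]])

print(train.isna().sum().sum(), test.isna().sum().sum())
```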

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### **Chart - 1**
 **Which age group is more susceptible to developing coronary heart disease?**

In [None]:
# Chart - 1 visualization code

# Set the figure size
fig, ax = plt.subplots(figsize=(10, 10))
# Create a boxplot to compare the age distribution of patients by sex and CHD risk level
sns.boxplot(x="sex", y="age", hue="TenYearCHD", data= df, ax=ax)
# Set the title and labels
ax.set_title("Age Distribution of Patients by Sex and CHD Risk Level")
ax.set_xlabel("Sex")
ax.set_ylabel("Age")
# Adding a legend with appropriate labels
handles, labels = ax.get_legend_handles_labels()
ax.legend(handles, ["No Risk", "At Risk"], loc="best")
# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?

A boxplot was chosen because it shows the age distribution of patients split by sex and CHD (coronary heart disease) risk level in a single view, which makes it easy to see how age, sex, and CHD risk level are related in this dataset.

##### 2. What is/are the insight(s) found from the chart?

There is a noticeable difference in the age distribution of patients who are at risk for CHD compared to those who are not at risk. Patients at risk for CHD tend to be older than those who are not at risk, regardless of sex.
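The visual impression from a boxplot can be backed by the group medians it draws. A minimal sketch, using made-up rows that reuse the notebook's column names (the values are not from the dataset):

```python
import pandas as pd

# Made-up patients (illustration only)
toy = pd.DataFrame({
    "sex":        ["M", "M", "M", "M", "F", "F", "F", "F"],
    "age":        [45, 50, 60, 62, 40, 48, 58, 64],
    "TenYearCHD": [0, 0, 1, 1, 0, 0, 1, 1],
})

# Median age per sex and risk level; unstack puts the risk levels side by side
summary = toy.groupby(["sex", "TenYearCHD"])["age"].median().unstack()
print(summary)
```

Running the same `groupby` on `df` would give the numbers behind the center lines of the boxes.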

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The information derived from this chart could prove valuable for businesses operating within the healthcare industry. For instance, companies specializing in the manufacture of medical equipment or medications for coronary heart disease (CHD) might contemplate tailoring their products towards older patients or individuals with a heightened risk of CHD. Nonetheless, it is crucial to emphasize that this chart in isolation may lack the depth required for making informed business decisions. A more comprehensive analysis is necessary to gain a thorough understanding of the interplay between age, gender, CHD risk, and other pertinent variables.
It's important to highlight that there are no indications of adverse growth trends evident in this chart.

#### **Chart - 2**
**Does gender affect the risk of coronary heart disease in the dataset?**

In [None]:
# Chart - 2 visualization code

plt.figure(figsize=(8,8))
sns.countplot(x='sex', hue='TenYearCHD', data= df)
plt.title('Frequency of CHD cases by gender')
plt.legend(['No Risk', 'At Risk'])
plt.show()

##### 1. Why did you pick the specific chart?

A countplot was chosen because it shows the frequency of CHD (coronary heart disease) cases by gender, which helps investigate whether gender affects the risk of CHD in the dataset.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that there are more cases of CHD among men than women in the dataset. However, this difference is not drastic, as the number of cases of CHD is relatively similar between men and women. Additionally, the chart shows that there are more cases of no risk for CHD among women compared to men.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The findings derived from this chart hold potential value for healthcare service and product providers. For instance, businesses involved in the manufacturing of medical devices or medications for coronary heart disease (CHD) might find it advantageous to target both genders. However, a more substantial emphasis on men may be warranted, given their seemingly higher risk for CHD as indicated by this dataset.

#### **Chart - 3**

**Do smokers have a higher risk of developing coronary heart disease?**

In [None]:
# Chart - 3 visualization code

plt.figure(figsize=(8,8))
sns.countplot(x='is_smoking', hue='TenYearCHD', data= df)
plt.title('A Comparison of Smokers and Non-Smokers')
plt.legend(['No Risk', 'At Risk'])
plt.show()

##### 1. Why did you pick the specific chart?

A countplot was chosen because it shows the frequency of CHD (coronary heart disease) cases among smokers and non-smokers, which gives a first view of how smoking may be related to the risk of CHD in this dataset.

##### 2. What is/are the insight(s) found from the chart?

The chart illustrates that individuals who engage in smoking appear to exhibit a heightened risk of coronary heart disease (CHD) compared to their non-smoking counterparts within this dataset. Precisely, a greater percentage of smoking individuals are identified as being at risk for CHD when contrasted with those who abstain from smoking. These observations indicate that smoking may play a role in influencing the CHD risk within this dataset.
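Since a countplot shows counts, a percentage claim is easier to check with a small table of group sizes and at-risk shares. The sketch below uses made-up rows with the notebook's column names; the numbers are illustrative and not results from the dataset.

```python
import pandas as pd

# Made-up rows reusing the notebook's column names
toy = pd.DataFrame({
    "is_smoking": ["YES", "YES", "YES", "YES", "NO", "NO", "NO", "NO", "NO", "NO"],
    "TenYearCHD": [1, 1, 0, 0, 1, 0, 0, 0, 0, 0],
})

# For a 0/1 target, the group mean is the share of at-risk patients
rates = toy.groupby("is_smoking")["TenYearCHD"].agg(n="size", at_risk="sum", rate="mean")
print(rates)
```

Reporting `n` next to `rate` shows how many patients each percentage rests on.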

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This chart does not reveal any indications of adverse growth trends. Its focus is solely on depicting the incidence of coronary heart disease (CHD) cases among both smokers and non-smokers, omitting insights into other potentially pertinent factors like age or various lifestyle variables. Furthermore, it's worth noting that the dataset's representativeness may be limited, which could curtail the applicability of the insights gleaned from this chart to the broader population.

#### **Chart - 4**
**How much does smoking affect coronary heart disease?**

In [None]:
# Chart - 4 visualization code
plt.figure(figsize=(15,10))
sns.countplot(x= df['cigsPerDay'],hue= df['TenYearCHD'])
plt.title('How much does smoking affect CHD?')
plt.legend(['No Risk','At Risk'])
plt.show()

##### 1. Why did you pick the specific chart?

This countplot shows the number of at-risk and not-at-risk patients for each value of daily cigarette consumption. It was chosen to better understand the potential link between the intensity of smoking and the likelihood of CHD risk.

##### 2. What is/are the insight(s) found from the chart?

The chart suggests that individuals who either smoke a greater number of cigarettes per day or do not smoke at all seem to face a heightened risk of coronary heart disease (CHD) in comparison to those who smoke fewer cigarettes daily. Concretely, a larger share of individuals who smoke 20 or more cigarettes per day are identified as being at risk for CHD when contrasted with those who smoke fewer cigarettes per day. Note that a countplot shows raw counts, so bar heights also reflect how many patients fall into each category; proportions have to be read from the ratio of the two bars. With that caveat, these observations imply that the intensity of smoking may have a role in influencing the risk of CHD within this dataset.
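Because a countplot has one bar pair per distinct `cigsPerDay` value, rates are easier to compare after grouping the values into a few bands. A minimal sketch with `pd.cut` follows; the bin edges, the `cigs_bin` column, and the data are assumptions for illustration, not clinical cut-offs or dataset results.

```python
import pandas as pd

# Made-up rows reusing the notebook's column names
toy = pd.DataFrame({
    "cigsPerDay": [0, 0, 0, 0, 5, 10, 15, 20, 20, 30, 40, 43],
    "TenYearCHD": [0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1],
})

# Illustrative bands: non-smokers, 1-10, 11-20, and more than 20 cigarettes a day
bins = [-1, 0, 10, 20, 100]
labels = ["0", "1-10", "11-20", "21+"]
toy["cigs_bin"] = pd.cut(toy["cigsPerDay"], bins=bins, labels=labels)

# Group size and at-risk share per band
rate_by_bin = toy.groupby("cigs_bin", observed=True)["TenYearCHD"].agg(n="size", rate="mean")
print(rate_by_bin)
```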

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Businesses involved in the production of smoking cessation products or medications for coronary heart disease (CHD) may find it advantageous to contemplate a focus on heavy smokers, given their apparent elevated risk for CHD within this dataset.

#### **Chart - 5**
**Do patients taking medication for blood pressure have a higher risk of developing coronary heart disease?**


In [None]:
# Chart - 5 visualization code

# Compute the cross-tabulation of BP medication and CHD risk
ct = pd.crosstab(df['BPMeds'], df['TenYearCHD'], normalize='index')
# Plot a stacked bar chart
ct.plot(kind='bar', stacked=True, figsize=(8, 8))
plt.title('Relationship between BP Medication and CHD Risk')
plt.xlabel('BP Medication')
plt.xticks(rotation=0)
plt.ylabel('Proportion')
plt.legend(['No Risk', 'At Risk'])
plt.show()

##### 1. Why did you pick the specific chart?

This stacked bar chart shows the proportion of at-risk and not-at-risk patients among those who do and do not take blood pressure medication. It was selected to explore the potential association between the usage of blood pressure medication and CHD risk within this dataset.

##### 2. What is/are the insight(s) found from the chart?

The chart indicates that individuals who are prescribed blood pressure medication appear to exhibit an elevated risk of coronary heart disease (CHD) when compared to those who do not receive such medication. More precisely, there is a noticeable disparity in the proportion of individuals at risk for CHD between those who are on blood pressure medication and those who are not. These observations imply that the usage of blood pressure medication may play a substantial role in influencing the CHD risk within this dataset.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Firms specializing in the development of blood pressure (BP) medication or other remedies for hypertension may find it beneficial to focus on individuals with elevated blood pressure levels who also face a risk of coronary heart disease (CHD), irrespective of their current use of BP medication. This strategy could aid in the identification of patients who could benefit from more intensive treatment to mitigate their CHD risk.

#### **Chart - 6**
**Is a person who has had a stroke more susceptible to coronary heart disease?**

In [None]:
# Chart - 6 visualization code

plt.figure(figsize=(10,10))
sns.countplot(x=df['prevalentStroke'], hue=df['TenYearCHD'])
plt.title('Are people who had a stroke earlier more prone to CHD?')
plt.legend(['No Risk', 'At Risk'], loc='best')
plt.show()

##### 1. Why did you pick the specific chart?

This countplot compares CHD risk levels among patients with a prior stroke history and those without such a history. It was chosen to explore a potential link between experiencing a stroke and an increased susceptibility to CHD.

##### 2. What is/are the insight(s) found from the chart?

The chart illustrates an association between a prior history of stroke and an elevated risk of coronary heart disease (CHD) within this dataset. More precisely, the percentage of patients at risk for CHD is notably higher among those with a history of stroke compared to those without. These observations indicate a potential link between experiencing a stroke and an increased susceptibility to CHD within the dataset.
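When one of the two groups is small, percentages alone can be fragile, and an exact test on the 2x2 table is one way to check the association. The sketch below applies `scipy.stats.fisher_exact` to made-up counts; the data and the choice of test are assumptions for illustration and are not part of the notebook's analysis.

```python
import pandas as pd
from scipy.stats import fisher_exact

# Made-up rows: 16 patients without a prior stroke, 4 with one
toy = pd.DataFrame({
    "prevalentStroke": [0] * 16 + [1] * 4,
    "TenYearCHD":      [0] * 13 + [1] * 3 + [0] * 1 + [1] * 3,
})

# 2x2 contingency table: rows are stroke history, columns are CHD risk
table = pd.crosstab(toy["prevalentStroke"], toy["TenYearCHD"])
print(table)

# Fisher's exact test returns the sample odds ratio and a p-value
odds_ratio, p_value = fisher_exact(table.to_numpy())
print("odds ratio:", odds_ratio, "p-value:", p_value)
```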

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The information gleaned from this chart holds potential significance for businesses operating within the realm of healthcare services or products associated with stroke or coronary heart disease (CHD). For instance, manufacturers of medications or treatments for stroke or CHD could contemplate directing their efforts towards patients who have experienced a stroke, recognizing them as a high-risk demographic for CHD.

Furthermore, healthcare providers may consider implementing screening protocols for CHD risk among individuals who have a history of stroke. This could enable the delivery of targeted preventative measures or treatments to address potential risks and promote better patient outcomes.

#### **Chart - 7**
**Does having hypertension increase the risk of developing coronary heart disease?**

In [None]:
# Chart - 7 visualization code

plt.figure(figsize=(8,8))
sns.countplot(x=df['prevalentHyp'], hue=df['TenYearCHD'])
plt.title('Are hypertensive patients at more risk of CHD?')
plt.legend(title='CHD Risk', labels=['No Risk', 'At Risk'])
plt.show()

##### 1. Why did you pick the specific chart?

We selected this chart to visually represent the correlation between the presence of hypertension and the likelihood of developing coronary heart disease within the dataset.

##### 2. What is/are the insight(s) found from the chart?

The chart illustrates an association between prevalent hypertension and an increased likelihood of developing coronary heart disease (CHD) when compared to individuals without hypertension. More precisely, it indicates that the proportion of patients at risk for CHD is higher among those with prevalent hypertension than among those without this condition.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The chart points to a connection between prevalent hypertension and an elevated probability of developing coronary heart disease (CHD): the share of patients at risk for CHD is higher among those with prevalent hypertension than among those without it. This does not indicate negative growth; it suggests that hypertensive patients are a sensible priority for CHD screening and preventive care.

#### **Chart - 8**
**Do individuals with diabetes have a higher risk of developing coronary heart disease?**

In [None]:
# Chart - 8 visualization code

plt.figure(figsize=(8,8))
# The estimator returns each (diabetes, CHD) group's share of the whole dataset, in percent
sns.barplot(x=df['diabetes'], y=df['TenYearCHD'], hue=df['TenYearCHD'], estimator=lambda x: len(x) / len(df) * 100)
plt.title('Proportion of patients with and without diabetes at CHD risk')
plt.xlabel('Diabetes')
plt.ylabel('Percentage')
plt.legend(title='CHD Risk', labels=['No Risk', 'At Risk'])
plt.show()

##### 1. Why did you pick the specific chart?

We selected this chart to represent the distribution of individuals in the dataset, categorizing them based on the presence or absence of diabetes, and examining their respective risks of developing coronary heart disease.

##### 2. What is/are the insight(s) found from the chart?

The chart indicates that individuals with diabetes have a higher likelihood of being susceptible to coronary heart disease in contrast to those who do not have diabetes.
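The bar heights in this chart are each group's share of the whole dataset, which is different from the share of at-risk patients within the diabetic and non-diabetic groups. A minimal sketch of the two views on made-up rows (the numbers are illustrative, not dataset results):

```python
import pandas as pd

# Made-up rows reusing the notebook's column names
toy = pd.DataFrame({
    "diabetes":   [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
    "TenYearCHD": [0, 0, 0, 0, 0, 0, 1, 1, 0, 1],
})

# Share of the whole dataset in each (diabetes, CHD) cell
pct_of_total = pd.crosstab(toy["diabetes"], toy["TenYearCHD"], normalize="all") * 100

# Share within each diabetes group (each row sums to 100)
pct_within = pd.crosstab(toy["diabetes"], toy["TenYearCHD"], normalize="index") * 100

print(pct_of_total)
print(pct_within)
```

The within-group table is the one that answers whether patients with diabetes are more likely to be at risk.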

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Indeed, the insights obtained can assist healthcare enterprises and practitioners in identifying patients with diabetes at an elevated risk level, necessitating additional evaluation, ongoing monitoring, and comprehensive management to mitigate the onset or advancement of coronary heart disease.


#### **Chart - 9**
**Is there a correlation between total cholesterol levels and coronary heart disease?**

In [None]:
# Chart - 9 visualization code
plt.figure(figsize=(8,8))
sns.boxplot(x='TenYearCHD', y='totChol', data=df)
plt.title('Total Cholesterol Levels and CHD')
plt.xlabel('TenYearCHD')
plt.ylabel('Total Cholesterol Levels')
# No hue is used, so label the two x categories instead of adding a legend
plt.xticks([0, 1], ['No Risk', 'At Risk'])
plt.show()

##### 1. Why did you pick the specific chart?

We selected a box plot to examine whether total cholesterol levels are related to the susceptibility to coronary heart disease development.

##### 2. What is/are the insight(s) found from the chart?

The box plot reveals that individuals at risk of developing coronary heart disease tend to exhibit slightly elevated median total cholesterol levels compared to those not at risk. However, it's important to note that there is a degree of overlap in the cholesterol level ranges between these two groups.
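The overlap between the two boxes can be checked numerically by comparing quartiles per group. A minimal sketch on made-up cholesterol values (illustrative only, not dataset results):

```python
import pandas as pd

# Made-up rows reusing the notebook's column names
toy = pd.DataFrame({
    "totChol":    [200, 210, 220, 230, 240, 205, 225, 235, 245, 255],
    "TenYearCHD": [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
})

# First quartile, median, and third quartile for each risk group
quartiles = toy.groupby("TenYearCHD")["totChol"].quantile([0.25, 0.5, 0.75]).unstack()
print(quartiles)

# The boxes overlap when the at-risk first quartile lies below the no-risk third quartile
boxes_overlap = quartiles.loc[1, 0.25] < quartiles.loc[0, 0.75]
print("interquartile ranges overlap:", boxes_overlap)
```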

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights obtained can assist healthcare providers in assessing the influence of total cholesterol levels on the susceptibility to coronary heart disease (CHD) among their patients. The identification of individuals with elevated cholesterol levels enables the implementation of targeted interventions to mitigate their risk of CHD development. Such proactive measures can yield favorable effects on patient health outcomes and potentially result in cost reductions for healthcare providers over time.

#### **Chart - 10**
**What is the pairwise relationship between glucose levels, systolic blood pressure, diastolic blood pressure, and the risk of developing coronary heart disease?**

In [None]:
# Chart - 10 visualization code
# select the columns of interest
cols = ['glucose', 'sysBP', 'diaBP', 'TenYearCHD']

# create the scatter plot matrix
sns.pairplot(df[cols], hue='TenYearCHD', markers=['o', 's'])

Because TenYearCHD is passed as `hue`, it is used only to color the points and does not get its own row or column in the grid.

High glucose levels appear to be associated with an increased risk of developing coronary heart disease, as indicated by a higher concentration of orange (at-risk) points toward the high-glucose end of the panels that involve glucose.

High blood pressure levels (both systolic and diastolic) also appear to be associated with an increased risk of developing coronary heart disease, as indicated by a higher concentration of orange points in the upper right region of the sysBP vs. diaBP panels.

There may be some interaction effects between glucose and blood pressure on the risk of developing coronary heart disease, as indicated by the patterns of orange points in the glucose vs. sysBP and glucose vs. diaBP plots. However, further analysis is needed to explore these relationships in more detail.

##### 1. Why did you pick the specific chart?

This chart was chosen to visualize the pairwise relationships between glucose levels, systolic blood pressure, and diastolic blood pressure, with points colored by the risk of developing coronary heart disease. A pairplot displays every pairwise scatter plot, with the distribution of each variable along the diagonal.

##### 2. What is/are the insight(s) found from the chart?

The pairplot shows how glucose levels, systolic blood pressure, and diastolic blood pressure relate to one another and to the likelihood of developing coronary heart disease. The diagonal panels show the distribution of each variable, while the scatter plots show the association between each pair of variables. Individuals with elevated glucose levels tend to show an increased risk of developing coronary heart disease. Likewise, individuals with elevated systolic and diastolic blood pressure appear to have an elevated risk.
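The visual impression from the pairplot can be cross-checked numerically by summarizing each measurement within the two outcome groups. The sketch below is only illustrative: `toy` is a small made-up frame standing in for the notebook's `df`, and the numbers in it are not taken from the dataset.

```python
import pandas as pd

# Made-up rows standing in for the notebook's `df` (illustrative values only).
toy = pd.DataFrame({
    "glucose":    [75, 80, 85, 90, 110, 150],
    "sysBP":      [110, 120, 125, 130, 150, 170],
    "diaBP":      [70, 75, 80, 85, 95, 100],
    "TenYearCHD": [0, 0, 0, 0, 1, 1],
})

cols = ["glucose", "sysBP", "diaBP"]

# Median and mean of each measurement within each outcome group.
summary = toy.groupby("TenYearCHD")[cols].agg(["median", "mean"])
print(summary)
```

On the real data, replacing `toy` with `df` would show whether the high-risk group really has higher typical glucose and blood pressure values, and by how much.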

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights drawn from this chart do not point toward negative growth. Rather, they can support positive growth by helping healthcare organizations develop more effective prevention and treatment approaches. These strategies can improve patient outcomes and, in turn, may reduce healthcare expenditures.

#### **Chart - 11**
**Does cigarette smoking have a differential impact on the risk of developing coronary heart disease between males and females?**

In [None]:
# Chart - 11 visualization code
# select the columns of interest
cols = ['sex', 'cigsPerDay', 'TenYearCHD']

# create a grouped scatter plot of TenYearCHD by cigsPerDay and sex
sns.scatterplot(x='cigsPerDay', y='TenYearCHD', hue='sex', data=df)

# show the plot
plt.show()

##### 1. Why did you pick the specific chart?

We selected this chart because it effectively visualizes the interplay between daily cigarette consumption, the probability of developing coronary heart disease, and how gender influences this dynamic. A scatter plot was employed to depict the data's distribution and to discern any underlying patterns or trends.

##### 2. What is/are the insight(s) found from the chart?

The chart suggests that, for both genders, the risk of developing coronary heart disease rises with daily cigarette consumption. However, the association between smoking and CHD risk appears more pronounced among males than females. In particular, males who smoke more than 10 cigarettes per day show a notably higher risk of CHD than their female counterparts.
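Because TenYearCHD is binary, a scatter plot places every point at 0 or 1, so the comparison between sexes is easier to read as a CHD rate per smoking band. The following is a minimal sketch on made-up data: the band edges, the `smoking_band` column name, and the `"M"`/`"F"` coding of `sex` are assumptions, not something the notebook defines.

```python
import pandas as pd

# Made-up rows standing in for the notebook's `df` (illustrative values only).
toy = pd.DataFrame({
    "sex":        ["M", "M", "M", "M", "F", "F", "F", "F"],
    "cigsPerDay": [0, 5, 20, 30, 0, 5, 20, 30],
    "TenYearCHD": [0, 0, 1, 1, 0, 0, 0, 1],
})

# Assumed bands: non-smokers, 1-10 cigarettes per day, more than 10.
bins = [-1, 0, 10, 100]
labels = ["0", "1-10", ">10"]
toy["smoking_band"] = pd.cut(toy["cigsPerDay"], bins=bins, labels=labels)

# Share of individuals with TenYearCHD == 1 in each sex / smoking band cell.
rate = (
    toy.groupby(["sex", "smoking_band"], observed=True)["TenYearCHD"]
    .mean()
    .unstack("smoking_band")
)
print(rate)
```

Applied to the real `df`, a table like this would show directly whether heavy-smoking males have a higher CHD rate than heavy-smoking females.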

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The information gleaned from this chart offers valuable insights that can inform the efforts of public health organizations and businesses in designing tailored interventions aimed at reducing smoking prevalence and mitigating the onset of coronary heart disease (CHD). Particularly, there is an opportunity to focus on male smokers who exhibit an elevated risk. For instance, public health initiatives can be strategized to heighten awareness about the health hazards associated with smoking and offer assistance and resources to individuals seeking to quit. Similarly, businesses can consider implementing smoking cessation programs for their employees, promoting improved health and overall well-being.

#### **Chart - 12**
**Are there differences in the age and sex distributions between individuals with and without prevalent stroke?**

In [None]:
# Chart - 12 visualization code
# split expects a boolean, not the string 'True'
sns.violinplot(x='prevalentStroke', y='age', data=df, hue='sex', split=True, palette='rainbow')

The plot indicates that most prevalent strokes occurred in patients above age 45, and that most of those patients are female.

##### 1. Why did you pick the specific chart?

I opted for a violin plot because it is an efficient way to visualize the age distribution within two distinct groups (individuals with and without a prior stroke). It also makes it easy to compare the gender distributions within each of these groups.

##### 2. What is/are the insight(s) found from the chart?

The chart indicates that individuals who have experienced a prevalent stroke tend to have a higher average age compared to those without a history of stroke. Furthermore, the chart reveals a greater presence of males in both groups, with a notably higher proportion of males observed among individuals with a prevalent stroke.
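Violin shapes make it hard to read off exact proportions, so the age and sex claims are worth confirming with a small table. This is a hedged sketch on made-up rows; `toy` stands in for the notebook's `df`, and the `"M"`/`"F"` coding of `sex` is an assumption.

```python
import pandas as pd

# Made-up rows standing in for the notebook's `df` (illustrative values only).
toy = pd.DataFrame({
    "prevalentStroke": [0, 0, 0, 0, 1, 1],
    "sex":             ["F", "M", "F", "M", "F", "M"],
    "age":             [40, 45, 50, 55, 60, 66],
})

# Group size and typical age for people with and without a prior stroke.
age_stats = toy.groupby("prevalentStroke")["age"].agg(["count", "mean", "median"])

# Share of each sex within each stroke group (each row sums to 1).
sex_share = pd.crosstab(toy["prevalentStroke"], toy["sex"], normalize="index")

print(age_stats)
print(sex_share)
```

The `count` column matters here: if very few participants have a prevalent stroke, the proportions in that group should be interpreted with caution.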

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights obtained have the potential to assist healthcare institutions and insurance providers in formulating policy decisions concerning stroke prevention and treatment. For instance, these findings could contribute to informed choices regarding the allocation of specific preventive measures, like the use of blood thinners, or the structuring of stroke rehabilitation programs. Moreover, insurance firms may employ this data to shape their policies pertaining to stroke coverage and premium rates.

#### **Chart - 13**
**Is there a relationship between hypertension status and the number of cigarettes smoked per day?**

In [None]:
# Chart - 13 visualization code
# create a scatter plot of sysBP against cigsPerDay, colored by hypertension status
sns.scatterplot(x='cigsPerDay', y='sysBP', hue='prevalentHyp', data=df)

# add a title and axis labels
plt.title('Relationship between Systolic Blood Pressure and Cigarettes Smoked per Day, by Hypertension Status')
plt.xlabel('Cigarettes Smoked per Day')
plt.ylabel('Systolic Blood Pressure')

# display the plot
plt.show()

##### 1. Why did you pick the specific chart?

I opted for a scatterplot as it is a suitable choice for visualizing the connection between two continuous variables, aligning with our specific interest in understanding the relationship between cigsPerDay and sysBP. Furthermore, the incorporation of color to signify hypertensive status provides a straightforward means of discerning potential data patterns or trends associated with hypertension status.

##### 2. What is/are the insight(s) found from the chart?

The scatterplot reveals a noticeable positive correlation between cigsPerDay and sysBP, irrespective of hypertension status. This implies that individuals who smoke a greater number of cigarettes per day tend to exhibit elevated systolic blood pressure levels. Furthermore, it is evident that individuals with prevalent hypertension generally demonstrate higher systolic blood pressure levels in comparison to those who do not have hypertension.
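A correlation read off a scatter plot can be quantified with the Pearson coefficient, both overall and within each hypertension group. The sketch below uses made-up values in `toy` as a stand-in for `df`, so the printed coefficients say nothing about the real dataset.

```python
import pandas as pd

# Made-up rows standing in for the notebook's `df` (illustrative values only).
toy = pd.DataFrame({
    "cigsPerDay":   [0, 10, 20, 30, 0, 10, 20, 30],
    "sysBP":        [110, 112, 115, 118, 150, 149, 153, 155],
    "prevalentHyp": [0, 0, 0, 0, 1, 1, 1, 1],
})

# Pearson correlation across all rows.
overall_r = toy["cigsPerDay"].corr(toy["sysBP"])

# Pearson correlation computed separately within each hypertension group.
by_group = {
    status: group["cigsPerDay"].corr(group["sysBP"])
    for status, group in toy.groupby("prevalentHyp")
}

print(f"overall r = {overall_r:.2f}")
for status, r in by_group.items():
    print(f"prevalentHyp = {status}: r = {r:.2f}")
```

Comparing the overall coefficient with the within-group ones helps separate a genuine smoking effect from the large difference in blood pressure between the two hypertension groups.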

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights gleaned from this chart hold promise for application in healthcare and wellness contexts, where the need to monitor blood pressure levels and mitigate cardiovascular risk factors, such as smoking, is paramount. By discerning the positive association between smoking and systolic blood pressure, healthcare providers have an opportunity to promote smoking cessation as a means to reduce blood pressure levels and mitigate the risk of hypertension and associated cardiovascular ailments.

Notably, this chart does not reveal any indications of adverse trends. Nevertheless, should smoking cessation initiatives be implemented effectively and yield positive results, there could potentially be adverse repercussions for tobacco companies and the broader tobacco industry.

#### **Chart - 14 - Correlation Heatmap**

In [None]:
# numeric_only=True (pandas >= 1.5) skips string columns such as 'sex'
df.corr(numeric_only=True)

In [None]:
# Correlation Heatmap visualization code
plt.figure(figsize=(15, 15))
# numeric_only=True (pandas >= 1.5) skips string columns such as 'sex'
correlation = df.corr(numeric_only=True)
sns.heatmap(correlation, annot=True, cmap=sns.color_palette("mako", as_cmap=True))

##### 1. Why did you pick the specific chart?

I chose the Correlation Heatmap because it is an efficient way to visualize the relationships between all pairs of features in a dataset. The color scale encodes the strength of each correlation coefficient, which makes strongly correlated features quick to spot.

##### 2. What is/are the insight(s) found from the chart?

The Correlation Heatmap shows the pairwise correlation between all numerical features in the dataset.

1) From the correlation chart, age is positively correlated with TenYearCHD (coefficient of about 0.22). Although this correlation is weak in absolute terms, it suggests that age may be an important predictor of CHD risk.

2) From the heatmap, we can see that age, systolic blood pressure, and diastolic blood pressure have a relatively strong correlation with the TenYearCHD target variable.

3) There is a strong positive correlation between systolic and diastolic blood pressure (about 0.78).

4) Diabetes and glucose are also correlated (about 0.61).

5) Prevalent hypertension is strongly correlated with systolic and diastolic blood pressure (about 0.70 and 0.61, respectively).

6) Age is negatively correlated with education and cigarettes per day (about -0.17 and -0.19, respectively).
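Rather than reading coefficients off the heatmap one cell at a time, the feature pairs can be ranked by absolute correlation. This is a sketch under assumptions: `toy` is a made-up frame, not the notebook's `df`, and only the upper triangle of the matrix is kept so that each pair appears once.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for the notebook's `df` (illustrative values only).
toy = pd.DataFrame({
    "age":        [40, 50, 60, 70, 45, 65],
    "sysBP":      [110, 125, 140, 160, 118, 150],
    "diaBP":      [70, 80, 88, 98, 75, 92],
    "cigsPerDay": [20, 10, 5, 0, 15, 3],
})

corr = toy.corr()

# Boolean mask selecting the cells strictly above the diagonal.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# One row per feature pair, strongest absolute correlation first.
pairs = (
    corr.where(mask)
    .stack()
    .dropna()
    .sort_values(key=abs, ascending=False)
)
print(pairs)
```

On the real data, the top of this list would be expected to contain the sysBP/diaBP and prevalentHyp/sysBP pairs mentioned above, which are candidates for feature combination or removal.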

#### **Chart - 15 - Pair Plot**

In [None]:
# Pair Plot visualization code
sns.pairplot(df[continous_variable])

##### 1. Why did you pick the specific chart?

The pair plot is a useful tool for understanding the relationships among the continuous variables in the dataset. It helps detect both linear and non-linear relationships among these variables and highlights potential outliers or unusual patterns.

##### 2. What is/are the insight(s) found from the chart?

Observing the pair plot, it becomes evident that several variables exhibit positive correlations. Notably, age displays a positive association with systolic blood pressure, while BMI shows a similar correlation with glucose levels. Additionally, systolic blood pressure and diastolic blood pressure exhibit a linear relationship. A subtle positive correlation between cigsPerDay and sysBP is also discernible. Nevertheless, no distinct linear connection emerges between any of the variables and the target variable, TenYearCHD.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Answer Here.
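As one possible way to approach such a statement, the sketch below tests whether mean age differs between the two TenYearCHD groups, a question suggested by the correlation heatmap. The hypothesis itself, the two age samples, and the 0.05 significance level are all assumptions made for illustration; Welch's t-test (`scipy.stats.ttest_ind` with `equal_var=False`) is used because it does not assume equal variances in the two groups.

```python
from scipy import stats

# Made-up ages for the two outcome groups (illustrative values only).
age_chd = [55, 60, 62, 58, 65, 61]      # TenYearCHD == 1
age_no_chd = [42, 48, 50, 45, 52, 47]   # TenYearCHD == 0

# H0: the mean age is the same in both groups.
# H1: the mean age differs between the groups.
t_stat, p_value = stats.ttest_ind(age_chd, age_no_chd, equal_var=False)

alpha = 0.05  # assumed significance level
reject_null = p_value < alpha

print(f"t = {t_stat:.2f}, p = {p_value:.4f}, reject H0: {reject_null}")
```

With the real data, the two lists would be replaced by `df.loc[df["TenYearCHD"] == 1, "age"]` and `df.loc[df["TenYearCHD"] == 0, "age"]`. For two categorical variables, a chi-square test of independence would be the more appropriate choice.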

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation

#### What all missing value imputation techniques have you used and why did you use those techniques?

Answer Here.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments

##### What all outlier treatment techniques have you used and why did you use those techniques?

Answer Here.

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns

#### What all categorical encoding techniques have you used & why did you use those techniques?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting

##### What all feature selection methods have you used  and why?

Answer Here.

##### Which all features you found important and why?

Answer Here.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data

### 6. Data Scaling

In [None]:
# Scaling your data

##### Which method have you used to scale your data and why?

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
# Dimensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.

##### What data splitting ratio have you used and why?

Answer Here.

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.
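One simple way to answer this question is to look at the share of each class in the target. The sketch below uses a made-up label series in place of `df["TenYearCHD"]`, so the proportions shown are not those of the real dataset.

```python
import pandas as pd

# Made-up labels standing in for df["TenYearCHD"] (illustrative values only).
target = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 1, 1], name="TenYearCHD")

counts = target.value_counts()                 # rows per class
shares = target.value_counts(normalize=True)   # fraction of rows per class

# Ratio of the majority class size to the minority class size.
imbalance_ratio = counts.max() / counts.min()

print(shares)
print(f"majority : minority = {imbalance_ratio:.1f} : 1")
```

If the minority class turns out to be a small fraction of the rows, accuracy alone becomes a misleading metric, and stratified splitting or resampling is worth considering.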

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalanced dataset and why? (If needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
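The save-and-reload sanity check can be sketched with the standard `pickle` module. Everything model-related here is hypothetical: `model` is a plain dictionary of made-up coefficients standing in for a fitted estimator, `predict_score` is an illustrative helper, and the file name `best_model.pkl` is arbitrary. A fitted scikit-learn estimator can be passed to `pickle.dump` and `pickle.load` in the same way.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted model: made-up coefficients, not trained on any data.
model = {"intercept": -5.0, "coef": {"age": 0.06, "sysBP": 0.02}}


def predict_score(m, row):
    """Linear score for one record, using the stand-in coefficients."""
    return m["intercept"] + sum(w * row[name] for name, w in m["coef"].items())


# Save the model to disk.
path = os.path.join(tempfile.mkdtemp(), "best_model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)

# Load it back and confirm it scores an unseen record identically.
with open(path, "rb") as f:
    loaded = pickle.load(f)

unseen = {"age": 50, "sysBP": 130}
same_prediction = predict_score(loaded, unseen) == predict_score(model, unseen)
print("round trip OK:", same_prediction)
```

Pickle files should only be loaded from trusted sources, and the library versions used for saving and loading a real model should match.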

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***