# **Project Name**    - Cardiovascular Risk Prediction




##### **Project Type**    - Classification
##### **Contribution**    - Individual


# **Project Summary -**

Write the summary here within 500-600 words.

# **GitHub Link -**

https://github.com/Akash1141/Cardiovascular-Risk-Prediction-Classification-Project

# **Problem Statement**


**The goal of this project is to develop a predictive model that can accurately classify whether a patient from the Framingham Massachusetts area has a 10-year risk of developing coronary heart disease (CHD). The dataset consists of over 4,000 records, each containing 15 attributes representing potential risk factors including demographic, behavioral, and medical information. By leveraging this dataset, our objective is to create a model that can effectively analyze the provided patient information and accurately predict the likelihood of future CHD in order to aid in early identification and proactive management of cardiovascular health in the Framingham population.**

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np # NumPy is used for scientific computing and provides functions for efficient array operations, linear algebra, and mathematical calculations.
import pandas as pd # Pandas is used for data manipulation and analysis. It provides data structures and functions to work with structured data, such as data frames.
import math # Python's standard math module provides basic mathematical functions. The old `from numpy import math` was only an alias of it and no longer works in NumPy 2.0.
from scipy.stats import * # The scipy.stats module provides a wide range of statistical functions and distributions for statistical analysis and hypothesis testing.
from numpy import loadtxt # The loadtxt function from NumPy is used to load data from a text file into an array or variables.

from sklearn.preprocessing import MinMaxScaler # MinMaxScaler is used for scaling numerical features to a specific range, typically between 0 and 1, to ensure that all features have a similar scale.
from sklearn.model_selection import train_test_split # train_test_split is used to split the dataset into training and testing sets for model evaluation and validation.
from sklearn.linear_model import LinearRegression # LinearRegression is used to perform linear regression analysis and build linear regression models.
from sklearn.metrics import r2_score # r2_score is used to calculate the coefficient of determination (R-squared) to evaluate the performance of regression models.
from sklearn.metrics import mean_squared_error # mean_squared_error is used to calculate the mean squared error (MSE) to measure the performance of regression models.


import matplotlib.pyplot as plt # Matplotlib is a plotting library used to create visualizations and graphs. The %matplotlib inline command is used in Jupyter Notebook to display plots inline.
%matplotlib inline

import seaborn as sns # Seaborn is a data visualization library based on Matplotlib. It provides a high-level interface for creating informative and visually appealing statistical graphics.

from sklearn.linear_model import Ridge, RidgeCV # Ridge and Lasso are regularization techniques used in linear regression to reduce overfitting.
from sklearn.linear_model import Lasso, LassoCV # RidgeCV and LassoCV are versions of Ridge and Lasso with built-in cross-validation for hyperparameter tuning.
from sklearn.preprocessing import StandardScaler # StandardScaler is used to standardize numerical features by removing the mean and scaling to unit variance.
from imblearn.over_sampling import SMOTE # SMOTE (Synthetic Minority Over-sampling Technique) is used for oversampling the minority class in imbalanced datasets to address class imbalance issues.
from sklearn.linear_model import LogisticRegression # LogisticRegression is used for logistic regression analysis and building logistic regression models for classification tasks.
from sklearn.ensemble import RandomForestClassifier # RandomForestClassifier is an ensemble learning method that combines multiple decision trees to build a classification model.
from sklearn.metrics import accuracy_score, confusion_matrix # accuracy_score is used to calculate the accuracy of classification models. confusion_matrix is used to compute the confusion matrix to evaluate classification model performance.
from sklearn import metrics # metrics provides various metrics for model evaluation.
from sklearn.metrics import roc_curve # roc_curve is used to plot the receiver operating characteristic (ROC) curve for binary classification models.
from sklearn.model_selection import GridSearchCV # GridSearchCV is used for hyperparameter tuning by exhaustively searching the specified parameter values.
from sklearn.model_selection import RepeatedStratifiedKFold # RepeatedStratifiedKFold is a cross-validation strategy that ensures stratification and repeated sampling of data during model evaluation.
from xgboost import XGBClassifier # XGBClassifier is an implementation of the XGBoost algorithm for classification tasks.
from xgboost import XGBRFClassifier # XGBRFClassifier is an implementation of the XGBoost algorithm for random forest-based classification tasks.
from sklearn.tree import export_graphviz # export_graphviz is used to export decision tree models in Graphviz format for visualization.

import warnings
warnings.filterwarnings('ignore') # The warnings module is used to manage warning messages. The filterwarnings function is used to ignore warnings during code execution.


### Dataset Loading

In [None]:
# Load Dataset
ml = "https://raw.githubusercontent.com/Akash1141/Cardiovascular-Risk-Prediction-Classification-Project/main/data_cardiovascular_risk.csv"

In [None]:
# Loading Dataset

df = pd.read_csv(ml, encoding = "ISO-8859-1") # CSV files are not always saved as UTF-8, so the encoding is passed explicitly.
# With encoding = "ISO-8859-1", pandas decodes the text with that encoding instead of failing on bytes that are not valid UTF-8.

### Dataset First View

In [None]:
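# Dataset First Look
# Only the last expression of a cell is rendered automatically, so display() is used to show both views.
from IPython.display import display
display(df.head(30))
display(df.tail(30))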

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
# check for duplicates
if df.duplicated().any():
    print("There are duplicates in the dataset.")
else:
    print("There are no duplicates in the dataset.")

# Check for duplicates count
duplicates = df.duplicated()
print('\nDuplicates:\n', duplicates.sum())

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print(df.isnull().sum())

In [None]:
# Visualizing the missing values

# Create a heatmap of missing/null values in the DataFrame
sns.heatmap(df.isnull(), cmap='coolwarm')

# Show the plot
plt.show()

In [None]:
# create a bar chart of the null values in the dataframe
df.isnull().sum().plot(kind='bar')

# Show the plot
plt.show()

### What did you know about your dataset?

The given dataset is for Cardiovascular Risk Prediction.

**Generally!!!**

The dataset is from an ongoing cardiovascular study on residents of the town of Framingham, Massachusetts. The classification goal is to predict whether the patient has a 10-year risk of future coronary heart disease (CHD). The dataset provides the patients' information. It includes over 4,000 records and 15 attributes. Each attribute is a potential risk factor. There are demographic, behavioral, and medical risk factors.


**Technically!!!**

From the above data operations made we get to know that:

The dataset has **NO duplicates.**

The column education has 87 missing values.

The column cigsPerDay has 22 missing values.

The column BPMeds has 44 missing values.

The column totChol has 38 missing values.

The column BMI has 14 missing values.

The column heartRate has 1 missing value.

The column glucose has 304 missing values.


From this information, we can see that several columns contain missing values, with glucose having by far the most.
We can also conclude that the dataset combines categorical and numerical variables, and that the missing values in multiple columns need to be handled appropriately during data preprocessing and analysis.
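
The per-column counts above are easier to compare when expressed as a share of all rows. The following is a minimal sketch of such a summary; the `toy` frame and the helper name `missing_summary` are hypothetical stand-ins, not part of the original notebook, and the values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset (values are made up)
toy = pd.DataFrame({
    "age": [39, 46, 48, 61, 46],
    "glucose": [77.0, np.nan, 70.0, np.nan, 85.0],
    "BMI": [26.97, 28.73, np.nan, 28.58, 23.10],
})

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column, worst first."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    # Keep only the columns that actually have missing values
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

summary = missing_summary(toy)
print(summary)
```

On the real data the same helper could be called as `missing_summary(df)`, assuming `df` is the frame loaded earlier.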

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns

print(df.columns) # This prints all the columns present in the dataset

In [None]:
# Dataset Describe

df.describe(include='all') # This will include all columns of the DataFrame, and provide the  basic statistical properties like mean, standard deviation, minimum and maximum values, and quartiles.


### Variables Description

The **"id"** column represents the unique identifier for each record.

The **"age"** column provides the age of the patients.

The **"education"** column has 87 missing values.

The values 1, 2, 3, and 4 correspond to different education categories. While the specific meanings of these categories can vary depending on the context and data collection methodology, a common interpretation is as follows:

1: Some high school education or less

2: Completed high school education

3: Some college education or vocational training

4: Completed college education or higher

The **"sex"** column represents the gender of the patients.

The **"is_smoking"** column indicates whether the patients are smoking or not.

The **"cigsPerDay"** column has 22 missing values and represents the number of cigarettes smoked per day.

The **"BPMeds"** column has 44 missing values and indicates whether the patients are taking blood pressure medications.

The **"prevalentStroke"** column represents whether the patients have had a prevalent stroke.

The **"prevalentHyp"** column represents whether the patients have prevalent hypertension.

The **"diabetes"** column indicates whether the patients have diabetes.

The **"totChol"** column has 38 missing values and represents the total cholesterol levels of the patients.

The **"sysBP"** column represents the systolic blood pressure of the patients.

The **"diaBP"** column represents the diastolic blood pressure of the patients.

The **"BMI"** column has 14 missing values and represents the Body Mass Index of the patients.

The **"heartRate"** column has 1 missing value and represents the heart rate of the patients.

The **"glucose"** column has 304 missing values and represents the glucose levels of the patients.

The **"TenYearCHD"** column is the target variable and indicates the 10-year risk of coronary heart disease (CHD).

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.

# Now we shall print all the unique values present in each column in one go using a for loop

for column in df.columns:
    unique_values = df[column].unique()
    num_unique_values = len(unique_values)
    print(f"The column '{column}' has {num_unique_values} unique values:")
    print(unique_values)
    print()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

# Creating a copy of the Dataset to keep the original data safe.

df1 = df.copy()

### 1. Checking for Null Values

In [None]:
print(df1.isnull().sum()) # Printing the null values of the copied dataset

#### **This is a medical dataset, and every input is important and hard to fill in without understanding the data. We cannot risk filling this kind of medical data manually or with the mean, the median, or other methods, because poor imputation may hurt the model's accuracy in real-world use.**

## **For the reason mentioned above we drop all rows that contain null values. The total number of null values is also small compared to the 4,000 rows of the dataset.**

In [None]:


# Drop every row that contains at least one null value
df1 = df1.dropna()

# Print the first 5 rows of the resulting dataframe
df1.head()

In [None]:
print(df1.isnull().sum()) # Checking that the null values were dropped and whether any remain

## Now We have ZERO NULL values in the DATASET
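
Because `dropna()` removes a whole row when any single column is missing, it can be useful to check how many rows the step actually costs. This is a small sketch on a hypothetical `toy` frame with made-up values; the variable names are illustrative and not from the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (values are made up for illustration)
toy = pd.DataFrame({
    "age": [39, 46, 48, 61, 46, 43],
    "glucose": [77.0, np.nan, 70.0, np.nan, 85.0, 99.0],
    "BMI": [26.97, 28.73, np.nan, 28.58, 23.10, 30.30],
})

rows_before = len(toy)
toy_clean = toy.dropna()          # drops every row with at least one NaN
rows_after = len(toy_clean)

# Share of rows lost, in percent
share_dropped = (rows_before - rows_after) / rows_before * 100
print(f"Rows before: {rows_before}, after: {rows_after}, dropped: {share_dropped:.1f}%")
```

Applied to the notebook, the same comparison would use `len(df)` and `len(df1)`, assuming those frames are defined as in the cells above.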

#### 2. Outlier Detection and Treatment:

##### Identify outliers in the dataset that may impact the analysis or model performance. Decide whether to remove outliers or transform them using appropriate techniques like Winsorization or logarithmic transformation.

In [None]:
# Finding the min and max of every column before outlier treatment
print("MAX Values Before Outlier Treatment")
print(df1.max())
print("____________________________________________________")
print("MIN Values Before Outlier Treatment")
print(df1.min())

### Judging by the min and max values in the dataset, we conclude that no outliers are present. Min and max alone are only a coarse check, though.
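
A quantile-based rule is one common way to complement a min/max inspection. The sketch below counts values outside the usual 1.5 × IQR fences for each numeric column. It is an illustration on a hypothetical `toy` frame with made-up values, and the helper name `iqr_outlier_counts` is an assumption rather than something the notebook defines.

```python
import pandas as pd

# Hypothetical toy frame (values are made up for illustration)
toy = pd.DataFrame({
    "sysBP": [110.0, 118.0, 121.0, 125.0, 130.0, 134.0, 140.0, 295.0],
    "heartRate": [60.0, 68.0, 70.0, 72.0, 75.0, 78.0, 80.0, 85.0],
})

def iqr_outlier_counts(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values per column that fall outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # Comparisons align the per-column fences with the frame's columns
    outside = (frame < lower) | (frame > upper)
    return outside.sum()

outlier_counts = iqr_outlier_counts(toy)
print(outlier_counts)
```

Whether a flagged value is an error or a genuine extreme reading is a domain decision; the rule only points at candidates.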

### 3. We shall check how many Numeric and Categorical Data do we have




In [None]:
numeric_cols = df1.select_dtypes(include=['int64', 'float64']).columns # This gets all the numeric columns
categorical_cols = df1.select_dtypes(include=['object', 'category']).columns # This gets all the categorical columns

print(f"Number of numeric columns: {len(numeric_cols)}")
print()
print(f"Numeric columns: {numeric_cols}")
print()
print()
print(f"Categorical columns: {categorical_cols}")
print()
print(f"Number of categorical columns: {len(categorical_cols)}")

### As there are 2 categorical columns, we shall convert them to numerical columns

In [None]:
df1.head()

In [None]:
# Here is a function to fetch the unique values present in a column.

def get_unique_values(df1, column_name):
    """Return the unique values of the given column, in the order in which they first appear."""
    unique_values = df1[column_name].unique()
    return unique_values

In [None]:
# Calling the function by specifying the column whose unique values we want to fetch.

unique_sex = get_unique_values(df1, 'sex')
unique_is_smoking = get_unique_values(df1, 'is_smoking')
# Print the unique values
print(unique_sex)
print(unique_is_smoking)

In [None]:
# Initialize LabelEncoder
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()

# Encode the 'sex' and 'is_smoking' columns
df1['sex'] = label_encoder.fit_transform(df1['sex'])
df1['is_smoking'] = label_encoder.fit_transform(df1['is_smoking'])
# Print the updated df1 DataFrame
df1.head()

In [None]:
# Checking the unique values of the two encoded columns

# Calling the function by specifying the column whose unique values we want to fetch.

unique_sex = get_unique_values(df1, 'sex')
unique_is_smoking = get_unique_values(df1, 'is_smoking')
# Print the unique values
print(unique_sex)
print(unique_is_smoking)
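
To cross-verify an encoding, it helps to know which original label each integer stands for. `LabelEncoder` stores the sorted labels in `classes_`, and a label's position in that array is its code. The sketch below assumes the raw labels are `"M"`/`"F"` and `"YES"`/`"NO"`; that is an assumption about the data, and the `toy` frame and the `encoders`/`mappings` names are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame; the label spellings are assumed, not confirmed by the notebook
toy = pd.DataFrame({
    "sex": ["M", "F", "F", "M"],
    "is_smoking": ["YES", "NO", "NO", "YES"],
})

# One encoder per column, so each mapping stays recoverable after fitting
encoders = {}
for column in ["sex", "is_smoking"]:
    encoder = LabelEncoder()
    toy[column] = encoder.fit_transform(toy[column])
    encoders[column] = encoder

# classes_ is sorted; the index of each label is the integer it was encoded to
mappings = {
    column: {str(label): code for code, label in enumerate(encoder.classes_)}
    for column, encoder in encoders.items()
}
print(mappings)
```

Keeping a separate encoder per column is a design choice: when a single encoder is refitted on a second column, its `classes_` only reflects the most recent fit.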

### What all manipulations have you done and insights you found?

Removal of Null Values: Null values were removed from the dataset. This step ensures that the dataset contains complete and usable data for analysis.

Conversion of Categorical Columns: Two categorical columns were converted into numerical format. By converting categorical variables into numerical representation, it becomes easier to perform mathematical operations and apply machine learning algorithms.

Cross-Verification of Conversion: After the conversion of categorical columns, cross-verification was performed to ensure the accuracy and correctness of the converted numerical values. This step is crucial to ensure the integrity of the data and the reliability of subsequent analyses.

Outlier Detection: Outliers were examined in the dataset. It was found that no outliers were present. Identifying and handling outliers is essential as they can significantly impact statistical analyses and modeling results. The absence of outliers suggests that the dataset is relatively consistent and free from extreme values.

Data Scaling: It was determined that scaling the data is not required for this medical dataset. The range of the data is considered acceptable as is. Scaling is often performed to normalize the data and bring all features to a similar scale, but if the range of values in the dataset is already appropriate for analysis, scaling may not be necessary.

Insights Obtained:
The steps described above are data preparation only, so no analytical insights have been derived from the dataset at this stage. Meaningful insights come from the exploratory data analysis, statistical tests, and predictive modeling applied afterwards.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1 - Bar Chart

In [None]:
# Chart - 1 visualization code
import matplotlib.pyplot as plt

# Count the number of occurrences of each education level
education_counts = df1['education'].value_counts()

# Create a bar chart
plt.bar(education_counts.index, education_counts.values)
plt.xlabel('Education Level')
plt.ylabel('Count')
plt.title('Distribution of Education Levels')
plt.show()


##### 1. Why did you pick the specific chart?

A bar chart is suitable for visualizing the distribution of categorical variables, such as education levels.


##### 2. What is/are the insight(s) found from the chart?

 The bar chart shows the frequency of each education level, providing an understanding of the educational background of the population.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This insight can help identify the education level composition of the population, which may be useful for targeting educational campaigns or assessing the impact of education on health outcomes.

#### Chart - 2 - Histogram

In [None]:
# Chart - 2 visualization code

plt.hist(df1['age'], bins=10)
plt.xlabel('Age')
plt.ylabel('Count')
plt.title('Distribution of Age')
plt.show()


##### 1. Why did you pick the specific chart?

A histogram is used to visualize the distribution of a continuous variable, such as age.


##### 2. What is/are the insight(s) found from the chart?

The histogram displays the frequency of different age ranges, revealing the age distribution of the population.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Understanding the age distribution can aid in developing age-specific health interventions or targeting age-specific marketing campaigns.

#### Chart - 3 - Pie Chart

In [None]:
# Chart - 3 visualization code

sex_counts = df1['sex'].value_counts()

# Create a pie chart
plt.pie(sex_counts.values, labels=sex_counts.index, autopct='%1.1f%%')
plt.title('Gender Distribution')
plt.show()


##### 1. Why did you pick the specific chart?

A pie chart is suitable for displaying the proportion or percentage of different categories, such as gender distribution.


##### 2. What is/are the insight(s) found from the chart?

The pie chart presents the percentage of males and females in the dataset, providing insights into the gender composition of the population.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Understanding the gender distribution can help tailor healthcare services or marketing strategies specific to different genders.

#### Chart - 4 - Line Plot

In [None]:
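# Chart - 4 visualization code

# Plot the mean systolic blood pressure per age; connecting the raw, unsorted rows would draw a tangled zig-zag line.
mean_sysbp_by_age = df1.groupby('age')['sysBP'].mean()
plt.plot(mean_sysbp_by_age.index, mean_sysbp_by_age.values)
plt.xlabel('Age')
plt.ylabel('Mean Systolic Blood Pressure')
plt.title('Mean Systolic Blood Pressure by Age')
plt.show()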




##### 1. Why did you pick the specific chart?

A line plot is effective for visualizing the relationship between two continuous variables over a continuous range, such as the relationship between age and systolic blood pressure.


##### 2. What is/are the insight(s) found from the chart?

The line plot shows how systolic blood pressure changes with age, indicating any potential trends or patterns.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Identifying relationships between age and blood pressure can help in determining appropriate age-specific interventions or monitoring strategies for blood pressure management.

#### Chart - 5 - Scatter Plot

In [None]:
# Chart - 5 visualization code

plt.scatter(df1['BMI'], df1['glucose'])
plt.xlabel('BMI')
plt.ylabel('Glucose')
plt.title('BMI vs. Glucose')
plt.show()


##### 1. Why did you pick the specific chart?

A scatter plot is useful for visualizing the relationship between two continuous variables, such as BMI and glucose levels, to identify potential correlations or patterns.


##### 2. What is/are the insight(s) found from the chart?

The scatter plot can reveal any associations or trends between BMI and glucose levels, helping understand their potential relationship.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

 Understanding the relationship between BMI and glucose levels can assist in managing diabetes or developing targeted interventions for individuals with specific BMI ranges.

#### Chart - 6 - Box Plot

In [None]:
# Chart - 6 visualization code

import seaborn as sns

sns.boxplot(x=df1['TenYearCHD'], y=df1['sysBP'])
plt.xlabel('Ten-Year CHD')
plt.ylabel('Systolic Blood Pressure')
plt.title('Systolic Blood Pressure by Ten-Year CHD')
plt.show()


##### 1. Why did you pick the specific chart?

A box plot is effective for visualizing the distribution of a continuous variable across different categories, such as comparing systolic blood pressure for individuals with and without a ten-year risk of coronary heart disease (CHD).


##### 2. What is/are the insight(s) found from the chart?

 The box plot provides insights into the differences in systolic blood pressure between individuals with and without a ten-year CHD risk.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

 Identifying associations between systolic blood pressure and CHD risk can aid in risk assessment and developing appropriate interventions or treatment plans.

#### Chart - 7 - Stacked Bar Chart

In [None]:
# Chart - 7 visualization code

hypertension_counts = df1.groupby('prevalentHyp')['TenYearCHD'].value_counts().unstack()

# Create a stacked bar chart
hypertension_counts.plot(kind='bar', stacked=True)
plt.xlabel('Prevalent Hypertension')
plt.ylabel('Count')
plt.title('Ten-Year CHD Count by Prevalent Hypertension')
plt.legend(title='Ten-Year CHD')
plt.show()


##### 1. Why did you pick the specific chart?

A stacked bar chart is effective for displaying the relationship between two categorical variables while representing the contribution of each category to the total count.


##### 2. What is/are the insight(s) found from the chart?

The stacked bar chart shows the count of CHD cases based on the presence or absence of prevalent hypertension.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Understanding the relationship between prevalent hypertension and CHD risk can help develop targeted interventions for individuals with hypertension and assess the impact of hypertension control on CHD prevention.
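
Raw counts in a stacked bar chart can hide differences in rates when the groups differ in size. A row-normalized cross-tabulation is one way to read the CHD share within each hypertension group directly. This is a sketch on a hypothetical `toy` frame with made-up values, not output from the real dataset.

```python
import pandas as pd

# Hypothetical toy frame (values are made up for illustration)
toy = pd.DataFrame({
    "prevalentHyp": [0, 0, 0, 0, 1, 1, 1, 1],
    "TenYearCHD":   [0, 0, 0, 1, 0, 0, 1, 1],
})

# normalize="index" turns each row into proportions that sum to 1
chd_rate = pd.crosstab(toy["prevalentHyp"], toy["TenYearCHD"], normalize="index")
print(chd_rate)
```

The column labelled `1` then holds the proportion of CHD cases within each hypertension group.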

#### Chart - 8 - Violin Plot

In [None]:
# Chart - 8 visualization code
sns.violinplot(x=df1['TenYearCHD'], y=df1['age'])
plt.xlabel('Ten-Year CHD')
plt.ylabel('Age')
plt.title('Age Distribution by Ten-Year CHD')
plt.show()


##### 1. Why did you pick the specific chart?

 A violin plot combines a box plot and a kernel density plot to visualize the distribution of a continuous variable across different categories.


##### 2. What is/are the insight(s) found from the chart?

 The violin plot showcases the age distribution for individuals with and without a ten-year CHD risk.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

 Understanding the age distribution in relation to CHD risk can aid in identifying age-specific prevention strategies or interventions.

#### Chart - 9 - Area Plot

In [None]:
# Chart - 9 visualization code
df1.groupby('education')['TenYearCHD'].mean().plot(kind='area', stacked=False)
plt.xlabel('Education Level')
plt.ylabel('Mean Ten-Year CHD')
plt.title('Mean Ten-Year CHD by Education Level')
plt.show()


##### 1. Why did you pick the specific chart?

An area plot is suitable for illustrating the distribution or variation of a numerical variable across different categories.


##### 2. What is/are the insight(s) found from the chart?

The area plot shows the mean ten-year CHD risk for each education level, providing insights into potential differences.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

 Identifying the relationship between education level and CHD risk can inform targeted health education initiatives and interventions.

#### Chart - 10 - Donut Chart

In [None]:
# Chart - 10 visualization code
is_smoking_counts = df1['is_smoking'].value_counts()

# Create a donut chart
plt.pie(is_smoking_counts.values, labels=is_smoking_counts.index, autopct='%1.1f%%', wedgeprops={'edgecolor': 'white'})
plt.title('Smoking Status')
# Draw a white circle in the middle to create a donut shape
centre_circle = plt.Circle((0,0),0.70,fc='white')
fig = plt.gcf()
fig.gca().add_artist(centre_circle)
plt.axis('equal')
plt.show()


##### 1. Why did you pick the specific chart?

A donut chart is a variation of a pie chart that includes a hole in the center. It can effectively display the proportion of different categories, such as smoking status.


##### 2. What is/are the insight(s) found from the chart?

The donut chart presents the percentage of smokers and non-smokers in the dataset, providing insights into the smoking behavior of the population.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Understanding the smoking status distribution can help in designing targeted smoking cessation programs or assessing the impact of smoking on CHD risk.

#### Chart - 11 - Violin Swarm Plot

In [None]:
# Chart - 11 visualization code
sns.violinplot(x=df1['TenYearCHD'], y=df1['BMI'], inner=None)
sns.swarmplot(x=df1['TenYearCHD'], y=df1['BMI'], color='k', alpha=0.7)
plt.xlabel('Ten-Year CHD')
plt.ylabel('BMI')
plt.title('BMI Distribution by Ten-Year CHD')
plt.show()


##### 1. Why did you pick the specific chart?

A violin swarm plot combines a violin plot and a scatter plot to display the distribution and individual data points for a numerical variable across different categories.


##### 2. What is/are the insight(s) found from the chart?

The violin swarm plot presents the BMI distribution for individuals with and without a ten-year CHD risk, as well as the individual data points.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Examining the BMI distribution in relation to CHD risk can aid in identifying potential associations and developing targeted interventions for individuals at higher risk.

#### Chart - 12 - Correlation Matrix Plot

In [None]:
# Chart - 12 visualization code
corr_matrix = df1[['age', 'cigsPerDay', 'sysBP', 'diaBP', 'BMI', 'heartRate', 'glucose']].corr()

plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)
plt.title('Correlation Matrix')
plt.show()


##### 1. Why did you pick the specific chart?

 A correlation matrix plot visualizes the correlation coefficients between multiple variables, providing insights into the strength and direction of their relationships.


##### 2. What is/are the insight(s) found from the chart?

The correlation matrix plot shows the pairwise correlations between age, cigarettes per day, blood pressure, BMI, heart rate, and glucose levels.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

 Understanding the relationships between these variables can help identify potential risk factors and guide interventions or preventive measures for CHD.

#### Chart - 13 - Violin Plot with Hue

In [None]:
# Chart - 13 visualization code
sns.violinplot(x=df1['TenYearCHD'], y=df1['age'], hue=df1['sex'], split=True)
plt.xlabel('Ten-Year CHD')
plt.ylabel('Age')
plt.title('Age Distribution by Ten-Year CHD with Gender')
plt.show()


##### 1. Why did you pick the specific chart?

A violin plot with hue allows us to compare the distribution of a numerical variable across different categories while incorporating an additional categorical variable (gender) using different colors.


##### 2. What is/are the insight(s) found from the chart?

 The violin plot displays the age distribution for individuals with and without a ten-year CHD risk, split by gender.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

 Examining the age distribution by CHD risk and gender can provide insights into potential differences, enabling the development of targeted interventions or gender-specific risk assessment strategies.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
import numpy as np

correlation_matrix = df1.corr()
mask = np.triu(np.ones_like(correlation_matrix, dtype=bool))

plt.figure(figsize=(15, 8))
sns.heatmap(correlation_matrix, annot=True, mask=mask, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()


##### 1. Why did you pick the specific chart?

A heatmap is ideal for visualizing the correlation between multiple variables in a tabular format.


##### 2. What is/are the insight(s) found from the chart?

The heatmap reveals the correlation coefficients between different variables, highlighting potential relationships.
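To go from the heatmap to a ranked list, the pairwise coefficients can be sorted by absolute value. This is a minimal sketch on a made-up `toy` frame standing in for `df1`; the values are illustrative only and are not taken from the Framingham data.

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Toy stand-in for df1; the values are invented purely for illustration.
toy = pd.DataFrame({
    "age":     [39, 46, 48, 61, 46, 43, 63, 45],
    "sysBP":   [106.0, 121.0, 127.5, 150.0, 130.0, 180.0, 138.0, 100.0],
    "glucose": [77, 76, 70, 103, 85, 99, 85, 78],
})

corr = toy.select_dtypes(include="number").corr()

# Visit each unordered pair of columns once (no diagonal, no mirrored duplicates).
pairs = pd.Series({(a, b): corr.loc[a, b] for a, b in combinations(corr.columns, 2)})

# Strongest relationships first, regardless of sign.
pairs = pairs.sort_values(key=np.abs, ascending=False)
print(pairs)
```

Sorting by absolute value keeps strong negative correlations near the top, which a plain descending sort would push to the bottom.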


Business Impact: Identifying correlations between variables can help understand the factors influencing CHD risk, allowing for targeted interventions and risk prediction models.

#### Chart - 15 - Pair Plot

In [None]:
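# Pair Plot visualization code

# pairplot returns a PairGrid; set the title on the whole figure rather than on the last subplot
g = sns.pairplot(df1[['age', 'sysBP', 'glucose', 'TenYearCHD']], hue='TenYearCHD')
g.fig.suptitle('Pairplot', y=1.02)
plt.show()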


##### 1. Why did you pick the specific chart?

A pairplot is used to visualize the relationships between multiple variables, allowing for quick comparisons and identifying patterns.


##### 2. What is/are the insight(s) found from the chart?

The pairplot displays scatter plots for different combinations of age, systolic blood pressure, glucose levels, and CHD risk.


Business Impact: Understanding the relationships between these variables can help identify potential risk factors and contribute to the development of personalized risk assessment models.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Answer Here.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.
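The statement to test is left open above, so the following is only a sketch of one possible choice: a Welch two-sample t-test on a hypothetical claim that mean age differs between patients with and without ten-year CHD. The two samples are synthetic; in the notebook they would presumably come from something like `df1.loc[df1['TenYearCHD'] == 1, 'age']`.

```python
import numpy as np
from scipy import stats

# Synthetic ages for the two groups (assumed means and spread, not real data).
rng = np.random.default_rng(0)
age_no_chd = rng.normal(loc=48, scale=8, size=200)
age_chd = rng.normal(loc=54, scale=8, size=60)

# H0: the two group means are equal. H1: they differ.
# equal_var=False selects Welch's t-test, which does not assume equal variances.
t_stat, p_value = stats.ttest_ind(age_chd, age_no_chd, equal_var=False)

alpha = 0.05  # assumed significance level
reject_null = p_value < alpha
print(f"t = {t_stat:.3f}, p = {p_value:.4g}, reject H0: {reject_null}")
```

Welch's variant is a cautious default here because the two groups are of unequal size and may have unequal variance.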

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.
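For a statement about two categorical variables, a chi-square test of independence is one option. The sketch below assumes a hypothetical claim that CHD outcome is independent of sex; the contingency counts are invented, and in the notebook the table would presumably come from `pd.crosstab(df1['sex'], df1['TenYearCHD'])`.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Invented 2x2 table: rows = sex, columns = TenYearCHD (0, 1).
observed = np.array([
    [1600, 250],
    [1300, 280],
])

# H0: the row and column variables are independent.
chi2, p_value, dof, expected = chi2_contingency(observed)

alpha = 0.05  # assumed significance level
print(f"chi2 = {chi2:.3f}, dof = {dof}, p = {p_value:.4g}, reject H0: {p_value < alpha}")
```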

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation

#### What all missing value imputation techniques have you used and why did you use those techniques?

Answer Here.
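As an illustration of what this step could look like, the sketch below fills a continuous column with its median and a category-like column with its mode. The `toy` frame and the `education` column are assumptions for the example, not taken from the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps; values are invented for illustration.
toy = pd.DataFrame({
    "glucose":   [77.0, np.nan, 70.0, 103.0, 85.0],  # continuous
    "education": [2.0, 1.0, np.nan, 1.0, 3.0],       # category-like code
})

imputed = toy.copy()

# Median is robust to the skew and outliers common in clinical measurements.
imputed["glucose"] = imputed["glucose"].fillna(imputed["glucose"].median())

# Mode keeps a category-like column on one of its existing levels.
imputed["education"] = imputed["education"].fillna(imputed["education"].mode().iloc[0])

print(imputed)
```

If a train/test split is used, computing the fill values on the training part only avoids leaking test information.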

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments

##### What all outlier treatment techniques have you used and why did you use those techniques?

Answer Here.
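One common treatment is capping values at the IQR fences instead of dropping rows. The following is a sketch under that assumption, using an invented `sysBP` series rather than the notebook's data.

```python
import pandas as pd

# Invented systolic blood pressure readings with one extreme value.
sys_bp = pd.Series([106.0, 121.0, 127.5, 150.0, 130.0, 295.0, 138.0, 100.0], name="sysBP")

q1 = sys_bp.quantile(0.25)
q3 = sys_bp.quantile(0.75)
iqr = q3 - q1

# Standard 1.5 * IQR fences.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Cap values at the fences instead of removing the rows.
capped = sys_bp.clip(lower=lower, upper=upper)
print(capped)
```

Capping keeps every patient record, which matters when the positive class is already small.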

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns

#### What all categorical encoding techniques have you used & why did you use those techniques?

Answer Here.
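For two-level columns, a plain 0/1 mapping is usually enough. The sketch below assumes `sex` holds `'M'`/`'F'` and that a hypothetical `is_smoking` column holds `'YES'`/`'NO'`; both the labels and the second column name are assumptions.

```python
import pandas as pd

# Toy frame; the category labels are assumed, not confirmed by the notebook.
toy = pd.DataFrame({
    "sex": ["M", "F", "F", "M"],
    "is_smoking": ["YES", "NO", "NO", "YES"],
})

# Binary columns need a single 0/1 column each; one-hot encoding would add a redundant column.
encoded = toy.assign(
    sex=toy["sex"].map({"F": 0, "M": 1}),
    is_smoking=toy["is_smoking"].map({"NO": 0, "YES": 1}),
)
print(encoded)
```

Any label missing from the mapping becomes `NaN`, which makes unexpected categories easy to spot.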

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction

#### 2. Lower Casing

In [None]:
# Lower Casing

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

#### 4. Removing URLs & Removing words that contain digits

In [None]:
# Remove URLs & remove words that contain digits

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

In [None]:
# Remove White spaces

#### 6. Rephrase Text

In [None]:
# Rephrase Text

#### 7. Tokenization

In [None]:
# Tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

##### Which text normalization technique have you used and why?

Answer Here.

#### 9. Part of speech tagging

In [None]:
# POS Tagging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting

##### Which feature selection methods have you used and why?

Answer Here.

##### Which features did you find important and why?

Answer Here.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data

### 6. Data Scaling

In [None]:
# Scaling your data

##### Which method have you used to scale your data and why?

### 7. Dimensionality Reduction
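One option is standardization. The sketch below assumes scikit-learn's `StandardScaler` and invented feature arrays; the point it illustrates is fitting the scaler on the training data only and reusing it for the test data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Invented (age, sysBP) rows standing in for the real train/test features.
X_train = np.array([[39.0, 106.0], [46.0, 121.0], [48.0, 127.5], [61.0, 150.0]])
X_test = np.array([[46.0, 130.0], [43.0, 180.0]])

scaler = StandardScaler()

# Learn mean and standard deviation from the training data only.
X_train_scaled = scaler.fit_transform(X_train)

# Apply the same training statistics to the test data.
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
```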

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
# Dimensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.

##### What data splitting ratio have you used and why?

Answer Here.
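A sketch of a stratified split, assuming an 80/20 ratio and a synthetic target with roughly the kind of class imbalance a CHD label tends to show. The ratio, the seed and the data are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and an imbalanced binary target (85 negatives, 15 positives).
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 85 + [1] * 15)

# stratify=y keeps the class proportions the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(y_train.mean(), y_test.mean())
```

Without stratification, a small test set can end up with very few positive cases, which makes recall estimates unstable.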

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

Answer Here.
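If balancing is needed, random oversampling of the minority class is one simple option among several (SMOTE or class weights are alternatives). The sketch below assumes scikit-learn's `resample` and synthetic training data; it is applied to the training part only so the test set keeps its natural class ratio.

```python
import numpy as np
from sklearn.utils import resample

# Synthetic training data: 16 negatives, 4 positives.
X_train = np.arange(40).reshape(20, 2)
y_train = np.array([0] * 16 + [1] * 4)

X_major = X_train[y_train == 0]
X_minor = X_train[y_train == 1]

# Draw minority rows with replacement until both classes have the same size.
X_minor_up = resample(X_minor, replace=True, n_samples=len(X_major), random_state=42)

X_balanced = np.vstack([X_major, X_minor_up])
y_balanced = np.concatenate([
    np.zeros(len(X_major), dtype=int),
    np.ones(len(X_minor_up), dtype=int),
])

print(np.bincount(y_balanced))
```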

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.
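One possible shape for the tuning step, sketched with logistic regression and `GridSearchCV`. The model, the parameter grid, the recall scoring and the synthetic data from `make_classification` are all assumptions for illustration and do not reflect the notebook's final choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced binary problem standing in for the CHD data.
X, y = make_classification(n_samples=400, n_features=8, weights=[0.85, 0.15], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Scaling lives inside the pipeline so each CV fold is scaled on its own training part.
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)

# Small assumed grid over the regularization strength.
param_grid = {"logisticregression__C": [0.01, 0.1, 1, 10]}

# Fit the algorithm with 5-fold cross-validation, optimizing recall.
search = GridSearchCV(pipe, param_grid, scoring="recall", cv=5)
search.fit(X_train, y_train)

# Predict with the best estimator found.
y_pred = search.predict(X_test)
print(search.best_params_, round(search.best_score_, 3))
```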

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.
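For a screening-style problem such as CHD risk, recall on the positive class is often weighed heavily, because a missed at-risk patient is usually costlier than a false alarm. That priority is an assumption here, not a conclusion of this notebook. The sketch below computes the relevant metrics on invented labels.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Invented ground truth and predictions (1 = ten-year CHD risk).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

recall = recall_score(y_true, y_pred)        # tp / (tp + fn): share of at-risk patients caught
precision = precision_score(y_true, y_pred)  # tp / (tp + fp): share of alerts that are correct

print(f"tn={tn} fp={fp} fn={fn} tp={tp} recall={recall:.2f} precision={precision:.2f}")
```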

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ML model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Load the saved model file again and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
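The save and load steps can be sketched together as a round trip with `pickle`. The model, the synthetic data and the file name `chd_model.pkl` are hypothetical placeholders; the check at the end confirms that the restored model predicts the same as the original on held-back rows.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Placeholder model trained on synthetic data (not the notebook's final model).
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X[:150], y[:150])

# Save the model to a temporary directory; the file name is hypothetical.
model_path = os.path.join(tempfile.mkdtemp(), "chd_model.pkl")
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# Load it back and predict on rows the model was not trained on.
with open(model_path, "rb") as f:
    restored = pickle.load(f)

unseen = X[150:]
same_predictions = bool((restored.predict(unseen) == model.predict(unseen)).all())
print("Round trip preserved predictions:", same_predictions)
```

Pickle files should only be loaded from trusted sources, and with the same scikit-learn version that wrote them where possible.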

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***