# Coronary Heart Disease Machine Learning Analysis

## Introduction

In this Jupyter Notebook, we train and compare several machine learning classifiers that predict coronary heart disease (CHD) risk from the patient records in `data/Suivi.csv`. We'll go through the following steps:

- Importing the required packages
- Importing and displaying an overview of the dataset
- Preprocessing the data
- Training and evaluating different machine learning models

### 1. Importing the Required Packages


Let's start by importing the necessary Python packages for our analysis.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
# Class-rebalancing utilities; imported here but not used in the cells shown below
from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import make_pipeline

### 2. Importing & Displaying an Overview of the Dataset

Now, we'll load and take a quick look at the dataset.

In [2]:
df = pd.read_csv('data/Suivi.csv')
df.head()

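| Unnamed: 0 | Sex | Age | isSmoker | CigsPerDay | BloodPressureMeds | PrevStroke | PrevHypertension | Diabetes | Cholesterol | SysBloodPressure | DiaBloodPressure | BMI | HeartRate | Glucose | RiskCHD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 39 | 0 | 0.0 | 0.0 | 0 | 0 | 0 | 195.0 | 106.0 | 70.0 | 26.97 | 80.0 | 77.0 | 0 |
| 1 | 0 | 46 | 0 | 0.0 | 0.0 | 0 | 0 | 0 | 250.0 | 121.0 | 81.0 | 28.73 | 95.0 | 76.0 | 0 |
| 2 | 1 | 48 | 1 | 20.0 | 0.0 | 0 | 0 | 0 | 245.0 | 127.5 | 80.0 | 25.34 | 75.0 | 70.0 | 0 |
| 3 | 0 | 46 | 1 | 23.0 | 0.0 | 0 | 0 | 0 | 285.0 | 130.0 | 84.0 | 23.1 | 85.0 | 85.0 | 0 |
| 4 | 0 | 43 | 0 | 0.0 | 0.0 | 0 | 1 | 0 | 228.0 | 180.0 | 110.0 | 30.3 | 77.0 | 99.0 | 0 |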
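
Beyond `df.head()`, a quick overview usually also checks for missing values and for the balance of the target classes. The following is a minimal sketch on a hypothetical five-row frame, not the real `Suivi.csv`: the column names come from the dataset, but the values (including the deliberately missing `Glucose` entry and the single positive `RiskCHD` label) are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real CSV: three of the dataset's columns,
# with one missing value and one positive label added on purpose.
sample = pd.DataFrame({
    "Age": [39, 46, 48, 46, 43],
    "Glucose": [77.0, 76.0, np.nan, 85.0, 99.0],
    "RiskCHD": [0, 0, 0, 1, 0],
})

# Number of missing values per column
missing_per_column = sample.isna().sum()

# Share of rows in each target class
class_share = sample["RiskCHD"].value_counts(normalize=True)

print(missing_per_column)
print(class_share)
```

On the real data, the same two calls could be run directly on `df`.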


### 3. Preprocessing Data

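We'll preprocess the data in three steps: select the feature columns (every column except the `Unnamed: 0` index column and `PrevHypertension`, which the selection below leaves out), split the rows 70/30 into training and testing sets, and standardize the features using statistics computed on the training set only.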

In [3]:
# Target column
y = df['RiskCHD']
# Feature columns
X = df[['Sex', 'Age', 'isSmoker', 'CigsPerDay', 'BloodPressureMeds', 'PrevStroke',
        'Diabetes', 'Cholesterol', 'SysBloodPressure', 'DiaBloodPressure', 'BMI',
        'HeartRate', 'Glucose']]

In [4]:
# Hold out 30% of the rows for testing; random_state=42 makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

In [5]:
# Standardize the features: fit the scaler on the training set only, then apply it to the test set
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
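
The split-then-scale steps above can also be expressed as a single pipeline, which makes it harder to fit the scaler on test rows by accident. This is a minimal sketch under stated assumptions: it uses synthetic data from `make_classification` rather than the CHD dataset, scikit-learn's own `make_pipeline` rather than the `imblearn` one imported earlier, and a `stratify` argument that the notebook's split does not use.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for (X, y); 13 features to mirror the notebook's feature count
X_demo, y_demo = make_classification(n_samples=500, n_features=13, random_state=42)

# stratify=y_demo keeps the class proportions similar in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=42, stratify=y_demo
)

# fit() scales using training-set statistics only, then fits the classifier;
# score() reuses those same statistics to transform the test rows
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(random_state=4))
scaled_lr.fit(X_tr, y_tr)
test_accuracy = scaled_lr.score(X_te, y_te)
print("Test accuracy:", test_accuracy)
```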

### 4. Model Training and Evaluation

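We will train five classifiers with default hyperparameters on the scaled training data and evaluate each one on the held-out test set using accuracy, precision, recall, and F1 score. Gradient Boosting is then tuned with GridSearchCV. The models are: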

#### Logistic Regression Model

In [6]:
lr_classifier = LogisticRegression(random_state=4)
lr_classifier.fit(X_train_scaled, y_train)
y_pred_lr = lr_classifier.predict(X_test_scaled)

# Evaluate Logistic Regression
print("\nLogistic Regression Classifier:")
print("Accuracy:", accuracy_score(y_test, y_pred_lr))
print("Precision:", precision_score(y_test, y_pred_lr))
print("Recall:", recall_score(y_test, y_pred_lr))
print("F1 Score:", f1_score(y_test, y_pred_lr))


Logistic Regression Classifier:
Accuracy: 0.6838487972508591
Precision: 0.6701421800947868
Recall: 0.7048853439680958
F1 Score: 0.6870748299319728
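
As a sanity check on how these metrics relate, F1 is the harmonic mean of precision and recall. The short sketch below recomputes it from the precision and recall printed above; the variable names are illustrative only.

```python
import math

# Values copied from the Logistic Regression output above
precision = 0.6701421800947868
recall = 0.7048853439680958
reported_f1 = 0.6870748299319728

# F1 is the harmonic mean of precision and recall
f1_from_pr = 2 * precision * recall / (precision + recall)

print("Recomputed F1:", f1_from_pr)
print("Matches reported F1:", math.isclose(f1_from_pr, reported_f1, rel_tol=1e-9))
```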


#### Support Vector Machine Model

In [7]:
svm_classifier = SVC(random_state=4)
svm_classifier.fit(X_train_scaled, y_train)
y_pred_svm = svm_classifier.predict(X_test_scaled)

# Evaluate Support Vector Machine
print("\nSupport Vector Machine Classifier:")
print("Accuracy:", accuracy_score(y_test, y_pred_svm))
print("Precision:", precision_score(y_test, y_pred_svm))
print("Recall:", recall_score(y_test, y_pred_svm))
print("F1 Score:", f1_score(y_test, y_pred_svm))


Support Vector Machine Classifier:
Accuracy: 0.7103583701521846
Precision: 0.6949952785646837
Recall: 0.7337986041874377
F1 Score: 0.7138700290979632


#### Random Forest Classifier

In [8]:
rf_classifier = RandomForestClassifier(random_state=4)
rf_classifier.fit(X_train_scaled, y_train)
y_pred_rf = rf_classifier.predict(X_test_scaled)

# Evaluate Random Forest
print("\nRandom Forest Classifier:")
print("Accuracy:", accuracy_score(y_test, y_pred_rf))
print("Precision:", precision_score(y_test, y_pred_rf))
print("Recall:", recall_score(y_test, y_pred_rf))
print("F1 Score:", f1_score(y_test, y_pred_rf))


Random Forest Classifier:
Accuracy: 0.9597447226313206
Precision: 0.9332079021636877
Recall: 0.9890329012961117
F1 Score: 0.9603097773475314
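
Scores this far above the linear models are worth a quick check before they are trusted. One thing that can inflate tree-based scores is identical rows appearing in both the training and the test set, for example if a dataset was oversampled before being split; whether that applies to `Suivi.csv` is not something the notebook confirms. The helper below is a hypothetical sketch (the function name and the toy frames are assumptions) that counts such overlapping rows.

```python
import pandas as pd

def count_test_rows_seen_in_train(X_train, X_test):
    """Count test rows whose feature values exactly match some training row."""
    # Deduplicate the training rows so each test row can match at most once
    train_unique = X_train.drop_duplicates()
    merged = X_test.merge(
        train_unique, how="left", on=list(X_test.columns), indicator=True
    )
    return int((merged["_merge"] == "both").sum())

# Hypothetical toy frames, not the notebook's data
toy_train = pd.DataFrame({"Age": [39, 46, 46], "BMI": [26.97, 28.73, 28.73]})
toy_test = pd.DataFrame({"Age": [46, 50], "BMI": [28.73, 31.0]})

overlap = count_test_rows_seen_in_train(toy_train, toy_test)
print(overlap, "of", len(toy_test), "test rows also appear in the training set")
```

In the notebook, the same function could be called on the unscaled `X_train` and `X_test` DataFrames from the split cell.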


#### Decision Tree Model

In [9]:
dt_classifier = DecisionTreeClassifier(random_state=4)
dt_classifier.fit(X_train_scaled, y_train)
y_pred_dt = dt_classifier.predict(X_test_scaled)

# Evaluate Decision Tree
print("\nDecision Tree Classifier:")
print("Accuracy:", accuracy_score(y_test, y_pred_dt))
print("Precision:", precision_score(y_test, y_pred_dt))
print("Recall:", recall_score(y_test, y_pred_dt))
print("F1 Score:", f1_score(y_test, y_pred_dt))


Decision Tree Classifier:
Accuracy: 0.8919980363279333
Precision: 0.8265221017514596
Recall: 0.9880358923230309
F1 Score: 0.9000908265213443


#### Gradient Boosting Model

In [10]:
gb_classifier = GradientBoostingClassifier(random_state=4)
gb_classifier.fit(X_train_scaled, y_train)
y_pred_gb = gb_classifier.predict(X_test_scaled)

# Evaluate Gradient Boosting
print("\nGradient Boosting Classifier:")
print("Accuracy:", accuracy_score(y_test, y_pred_gb))
print("Precision:", precision_score(y_test, y_pred_gb))
print("Recall:", recall_score(y_test, y_pred_gb))
print("F1 Score:", f1_score(y_test, y_pred_gb))


Gradient Boosting Classifier:
Accuracy: 0.738831615120275
Precision: 0.7162534435261708
Recall: 0.7776669990029911
F1 Score: 0.7456978967495221
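
The five evaluation cells above repeat the same fit, predict, and four metric calls. One way to cut that repetition is a small helper, sketched here on synthetic data; `evaluate_classifier` and the `models` dictionary are hypothetical names, not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def evaluate_classifier(model, X_train, y_train, X_test, y_test):
    """Fit the model and return the four metrics used in this notebook."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return {
        "Accuracy": accuracy_score(y_test, y_pred),
        "Precision": precision_score(y_test, y_pred),
        "Recall": recall_score(y_test, y_pred),
        "F1 Score": f1_score(y_test, y_pred),
    }

# Synthetic stand-in for the scaled CHD data
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=42
)

models = {
    "Logistic Regression": LogisticRegression(random_state=4),
    "Decision Tree": DecisionTreeClassifier(random_state=4),
}
results = {
    name: evaluate_classifier(model, X_tr, y_tr, X_te, y_te)
    for name, model in models.items()
}
for name, scores in results.items():
    print(name, scores)
```

Because each model's predictions stay local to the helper, a score cannot accidentally be computed from another model's prediction array.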


#### Gradient Boosting Model with Hyperparameter Tuning (GridSearchCV)

In [11]:
# 3 x 3 x 3 x 3 = 81 parameter combinations, each cross-validated 5 times (cv=5 below)
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

gb_classifier = GradientBoostingClassifier(random_state=4)
gb_grid = GridSearchCV(gb_classifier, param_grid, cv=5, scoring='f1', n_jobs=-1)
gb_grid.fit(X_train_scaled, y_train)
best_gb = gb_grid.best_estimator_
y_pred_gb = best_gb.predict(X_test_scaled)

# Evaluate Gradient Boosting with GridSearchCV on the tuned model's own predictions (y_pred_gb)
print("\nGradient Boosting Classifier + GridSearchCV:")
print("Accuracy:", accuracy_score(y_test, y_pred_gb))
print("Precision:", precision_score(y_test, y_pred_gb))
print("Recall:", recall_score(y_test, y_pred_gb))
print("F1 Score:", f1_score(y_test, y_pred_gb))


Gradient Boosting Classifier + GridSearchCV:
(Output not recorded: the scores originally shown here were computed from y_pred_rf and matched the Random Forest output digit for digit, so they did not describe the tuned Gradient Boosting model.)
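
After a grid search, it can help to look at which parameter combination won and at its cross-validated score before evaluating on the test set. The sketch below does this on synthetic data with a deliberately small grid so that it runs quickly; `small_grid` and `search` are illustrative names, and the grid is not the one used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the scaled training data
X_demo, y_demo = make_classification(n_samples=200, n_features=13, random_state=42)

# Deliberately small grid: 2 x 2 = 4 combinations, 3 folds each
small_grid = {"n_estimators": [20, 40], "max_depth": [2, 3]}
search = GridSearchCV(
    GradientBoostingClassifier(random_state=4), small_grid, cv=3, scoring="f1"
)
search.fit(X_demo, y_demo)

# best_params_ is the winning combination; best_score_ is its mean F1 across folds
print("Best parameters:", search.best_params_)
print("Best mean cross-validated F1:", search.best_score_)
```

In the notebook, the equivalent attributes would be `gb_grid.best_params_` and `gb_grid.best_score_`.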


### Conclusion

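In this analysis, we compared five machine learning models for predicting coronary heart disease (CHD) risk. With default hyperparameters, the Random Forest Classifier achieved the highest F1 score on the test set (0.960), followed by the Decision Tree (0.900), Gradient Boosting (0.746), the Support Vector Machine (0.714), and Logistic Regression (0.687). The GridSearchCV-tuned Gradient Boosting model has not been scored on its own predictions here, so whether tuning narrows the gap to Random Forest remains to be checked.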
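
The comparison in the conclusion can be reproduced from the F1 scores printed in section 4. The sketch below collects them into a pandas Series and sorts them; it covers only the five models trained with default hyperparameters, and the Series name and labels are illustrative.

```python
import pandas as pd

# F1 scores copied from the outputs in section 4 (default-hyperparameter models only)
reported_f1 = pd.Series(
    {
        "Logistic Regression": 0.6870748299319728,
        "Support Vector Machine": 0.7138700290979632,
        "Random Forest": 0.9603097773475314,
        "Decision Tree": 0.9000908265213443,
        "Gradient Boosting": 0.7456978967495221,
    },
    name="F1 Score",
)

# Highest F1 first
ranking = reported_f1.sort_values(ascending=False)
print(ranking.round(3))
```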