# Predicting heart disease using machine learning

This notebook uses several Python-based machine learning and data science libraries to build a model that predicts whether or not someone has heart disease based on their medical attributes.

We're going to take the following approach: 
1. Problem definition
2. Data
3. Evaluation
4. Features
5. Modelling
6. Experimentation

## 1. Problem Definition

In a statement:
> Given clinical parameters about a patient, can we predict whether or not they have heart disease?

## 2. Data

The original data come from: `https://www.kaggle.com/datasets/redwankarimsony/heart-disease-data`


## 3. Evaluation

> If we can reach 95% accuracy at predicting whether or not a patient has heart disease during the proof of concept, we'll pursue the project.

## 4. Features 
This is where you will find information about each of the features in your data.

**Create Data Dictionary**


    1. id: unique id for each patient
    2. age: age of the patient in years
    3. origin: place of study
    4. sex: Male/Female
    5. cp: chest pain type ([typical angina, atypical angina, non-anginal, asymptomatic])
    6. trestbps: resting blood pressure (in mm Hg on admission to the hospital)
    7. chol: serum cholesterol in mg/dl
    8. fbs: whether fasting blood sugar > 120 mg/dl
    9. restecg: resting electrocardiographic results
       -- Values: [normal, stt abnormality, lv hypertrophy]
    10. thalach: maximum heart rate achieved
    11. exang: exercise-induced angina (True/False)
    12. oldpeak: ST depression induced by exercise relative to rest
    13. slope: the slope of the peak exercise ST segment
    14. ca: number of major vessels (0-3) colored by fluoroscopy
    15. thal: [normal; fixed defect; reversible defect]
    16. target: the predicted attribute
 

## Preparing the tools 

We are going to use pandas, Matplotlib, seaborn and NumPy for data analysis, manipulation and plotting, and scikit-learn for modelling and evaluation.


In [None]:
# Import all the tools

# Regular EDA (exploratory data analysis) and plotting libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Models from scikit-learn
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

# Model Evaluations
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import precision_score, f1_score, recall_score
from sklearn.metrics import RocCurveDisplay

## Load Data

In [None]:
df = pd.read_csv("./data/heart-disease.csv")
df.shape # rows, columns

In [None]:
df.head()

## Data Exploration (exploratory data analysis or EDA) 

The goal here is to find out more about the data and become a subject matter expert on the dataset you are working with.

1. What questions are you trying to solve?
2. What kind of data do we have and how do we treat different types?
3. What's missing from the data and how do you deal with it?
4. Where are the outliers and why should you care about them?
5. How can you add, change or remove features to get more out of your data?
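As a rough illustration of questions 2 and 3, the sketch below inspects column types and missing values on a tiny hand-made frame. The frame `toy`, its values, and the choice of median filling plus one-hot encoding are assumptions made for illustration only; they are not taken from the heart disease dataset or from this notebook's workflow.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real data (values are made up)
toy = pd.DataFrame({
    "age": [63, 41, 57, 52],
    "sex": ["Male", "Female", "Male", "Female"],
    "chol": [233.0, np.nan, 354.0, 199.0],
    "target": [1, 0, 1, 0],
})

# Question 2: which columns are numeric and which hold categories/strings?
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

# Question 3: how many values are missing in each column?
missing_counts = toy.isna().sum()

# One possible treatment (an assumption, not the only option):
# fill numeric gaps with the column median and one-hot encode string columns
prepared = toy.copy()
prepared["chol"] = prepared["chol"].fillna(prepared["chol"].median())
prepared = pd.get_dummies(prepared, columns=categorical_cols)

print(numeric_cols, categorical_cols)
print(missing_counts)
print(prepared)
```

Whether median filling or one-hot encoding is appropriate depends on the column, so treat this as a starting point for exploration.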

In [None]:
df.tail()

In [None]:
# Let's find out how many examples of each class there are
df["target"].value_counts()

In [None]:
df["target"].value_counts().plot(kind = "bar", color=["salmon","lightblue"]);

In [None]:
df.info()

In [None]:
# Are there any missing values?
df.isna().sum()

In [None]:
df.describe()

### Heart Disease Frequency according to Sex

In [None]:
df.sex.value_counts()

In [None]:
# Compare target column with sex column 
pd.crosstab(df.target, df.sex)

In [None]:
# Create a plot of crosstab 
pd.crosstab(df.target, df.sex).plot(kind = "bar", figsize=(10,6), color=["salmon","lightblue"])

plt.title("Heart Disease Frequency for Sex")
plt.xlabel("0 = No Disease, 1 = Disease")
plt.ylabel("Amount")
plt.legend(["Female","Male"])

plt.xticks(rotation=0);

In [None]:
df.head()

In [None]:
df["thalach"].value_counts()

## Age vs Max Heart Rate for Heart Disease

In [None]:
# Create another figure
plt.figure(figsize=(10, 6))

# Scatter with positive examples
plt.scatter(df.age[df.target == 1],
            df.thalach[df.target == 1],
            c="salmon")

# Scatter with negative examples
plt.scatter(df.age[df.target == 0],
            df.thalach[df.target == 0],
            c="lightblue")

# Add some helpful info
plt.legend(["Heart Disease", "No Heart Disease"])
plt.title("Heart Disease as a Function of Age and Max Heart Rate")
plt.xlabel("Age")
plt.ylabel("Max Heart Rate");

In [None]:
# Check the distribution of the age column with a histogram 
df.age.plot.hist();

## Heart Disease Frequency per Chest Pain Type

cp: chest pain type ([typical angina, atypical angina, non-anginal, asymptomatic])

In [None]:
pd.crosstab(df.cp,df.target)

In [None]:
# Make the crosstab more visual

pd.crosstab(df.cp,df.target).plot(kind="bar",
                                  figsize=(10,6),
                                  color= ["salmon", "lightblue"])

# Add some communication 
plt.title("Heart Disease Frequency Per Chest Pain Type")
plt.xlabel("Chest Pain Type")
plt.ylabel("Amount")
plt.legend(["No Disease", "Disease"])
plt.xticks(rotation=0); 

In [None]:
df.head()

In [None]:
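# Make a correlation matrix
# numeric_only=True skips string columns, which recent pandas versions would otherwise raise on
df.corr(numeric_only=True)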

In [None]:
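# Let's make the correlation matrix a little bit prettier

# numeric_only=True skips string columns, which recent pandas versions would otherwise raise on
corr_matrix = df.corr(numeric_only=True)
fig, ax = plt.subplots(figsize=(15, 10))
ax = sns.heatmap(corr_matrix,
                 annot=True,
                 linewidths=0.5,
                 fmt=".2f",
                 cmap="YlGnBu");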



## 5. Modelling

In [None]:
from sklearn.pipeline import Pipeline

In [None]:
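# Split the data into features (X) and labels (y)
X = df.drop("target", axis=1)
y = df["target"]

# Seed NumPy's random generator so the split is reproducible
np.random.seed(42)

# Hold out 20% of the rows as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=True, test_size=0.2)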

In [None]:
X_train.shape, X_test.shape

Now that our data is split into training and test sets, it's time to build a machine learning model.
We will train it (find the patterns) on the training set,
and we will test it (use the patterns) on the test set.
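The notebook seeds NumPy globally before splitting. An alternative, sketched here on synthetic data, is to pass `random_state` directly to `train_test_split`, which ties reproducibility to the call itself. The use of `make_classification`, the `stratify` argument, and all `*_demo` names are assumptions added for illustration and are not part of the original workflow.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the heart disease features and labels (assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# random_state fixes the shuffle; stratify keeps class proportions similar
# in the training and test sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Repeating the call with the same random_state gives the same split
X_tr_again, _, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```

Stratifying is optional, but it can help on small datasets where a plain shuffle may leave the test set with an unrepresentative class balance.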

We are going to try 3 different machine learning models:
1. Logistic Regression 
2. K-Nearest Neighbours Classifier
3. Random Forest Classifier 

In [None]:
# Put models in a dictionary
models = {"Logistic Regression": LogisticRegression(),
          "KNN": KNeighborsClassifier(),
          "Random Forest": RandomForestClassifier()}

# Create a function to fit and score our models
def fit_and_score(models, X_train, X_test, y_train, y_test):
    """
    Fits and evaluates given machine learning models.
    models: a dict of different scikit-learn machine learning models
    X_train: training data (no labels)
    X_test: test data (no labels)
    y_train: training labels
    y_test: test labels
    """
    # Set random seed for reproducible results
    np.random.seed(42)

    # Make a dict to keep model scores
    model_scores = {}

    # Loop through models
    for name, model in models.items():
        # Fit the model to the data
        model.fit(X_train, y_train)
        # Evaluate the model and store its score in model_scores
        model_scores[name] = model.score(X_test, y_test)
    return model_scores

In [None]:
models_scores = fit_and_score(models,X_train,X_test,y_train,y_test) 

In [None]:
models_scores

## Model Comparison

In [None]:
model_compare = pd.DataFrame(models_scores, index = ["accuracy"])
model_compare.T.plot.bar();

In [None]:
model_compare

Now we have a baseline model, and we know a model's first predictions aren't always what we should base our next steps on. What should we do?

Let's look at the following: 

* Hyperparameter tuning
* Feature importance
* Confusion Matrix
* Cross-validation
* Precision
* Recall
* F1 score
* Classification report
* ROC curve
* Area under the curve (AUC)
  

## Hyperparameter Tuning (by hand)
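KNN classifies by distance, so features measured on very different scales can dominate the result. The loop in this section tunes `n_neighbors` on the raw features; the sketch below shows one way to tune the same parameter with scaling applied first, using the `Pipeline` class that the notebook imports. The synthetic data, the exaggerated feature scale, the `StandardScaler` step and the candidate values of `k` are all assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption); one feature is blown up in scale
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_demo[:, 0] *= 1000

# Scale features, then classify; the scaler is refit inside each CV fold
scaled_knn = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Try a few hypothetical values of n_neighbors with 5-fold cross-validation
cv_means = {}
for k in (1, 5, 11, 15):
    scaled_knn.set_params(knn__n_neighbors=k)
    cv_means[k] = cross_val_score(scaled_knn, X_demo, y_demo, cv=5).mean()

best_k = max(cv_means, key=cv_means.get)
print(cv_means)
print("Best k in this sketch:", best_k)
```

Whether scaling changes the ranking of the models on the heart disease data is something only an actual run can show.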

In [None]:
# Let's tune KNN

train_scores = []
test_scores = []

# Create a list of different values for n_neighbors
neighbors = range(1, 21)

# Set up a KNN instance
knn = KNeighborsClassifier()

# Loop through different values of n_neighbors
for i in neighbors:
    knn.set_params(n_neighbors=i)

    # Fit the algorithm
    knn.fit(X_train, y_train)

    # Update the training scores list
    train_scores.append(knn.score(X_train, y_train))

    # Update the test scores list
    test_scores.append(knn.score(X_test, y_test))
    

In [None]:
train_scores

In [None]:
test_scores

In [None]:
plt.plot(neighbors,train_scores, label="Train score")
plt.plot(neighbors,test_scores,label="Test score")
plt.xticks(np.arange(1,21,1))
plt.xlabel("Number of neighbors")
plt.ylabel("Model score")
plt.legend();

print(f"Maximum KNN score on the test data: {max(test_scores)*100:.2f}%")

In [None]:
#KNN is not the right algorithm for this problem because the other algorithms perform far better

## Hyperparameter tuning with RandomizedSearchCV

We are going to tune: 
* LogisticRegression()
* RandomForestClassifier()

... using `RandomizedSearchCV`.

In [None]:
# Create a hyperparameter grid for LogisticRegression
log_reg_grid = {
    "C": np.logspace(-4, 4, 20),
    "solver": ["liblinear"]
}

# Create a hyperparameter grid for RandomForestClassifier
rf_grid = {
    "n_estimators": np.arange(10, 1000, 50),
    "max_depth": [None, 3, 5, 10],
    "min_samples_split": np.arange(2, 20, 2),
    "min_samples_leaf": np.arange(1, 20, 2)
}

Now that we have hyperparameter grids for each of our models, let's tune them using RandomizedSearchCV.
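To get a feel for how much of each grid a randomized search with `n_iter=20` can cover, the sketch below counts the combinations in the two grids with scikit-learn's `ParameterGrid`. The grids are copied from the cell above; the variable names `n_log_reg`, `n_rf` and `rf_fraction` are introduced here for illustration.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same grids as defined in the notebook
log_reg_grid = {
    "C": np.logspace(-4, 4, 20),
    "solver": ["liblinear"],
}
rf_grid = {
    "n_estimators": np.arange(10, 1000, 50),
    "max_depth": [None, 3, 5, 10],
    "min_samples_split": np.arange(2, 20, 2),
    "min_samples_leaf": np.arange(1, 20, 2),
}

# ParameterGrid enumerates every combination; len() gives the total count
n_log_reg = len(ParameterGrid(log_reg_grid))  # 20 values of C x 1 solver
n_rf = len(ParameterGrid(rf_grid))            # 20 x 4 x 9 x 10

# Fraction of the random forest grid sampled with n_iter=20
rf_fraction = 20 / n_rf

print(n_log_reg, n_rf, f"{rf_fraction:.2%}")
```

The LogisticRegression grid has only 20 combinations, so 20 iterations are enough to cover it. The random forest grid has 7,200 combinations, so 20 iterations sample well under 1% of it, and increasing `n_iter` trades run time for broader coverage.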

In [None]:
# Tune the LogisticRegression model

np.random.seed(42)

# Setup random hyperparameter search for LogisticRegression
rs_log_reg = RandomizedSearchCV(LogisticRegression(),
                                param_distributions=log_reg_grid,
                                cv=5,
                                n_iter = 20,
                                verbose = True)

# Fit random hyperparameter search model for LogisticRegression
rs_log_reg.fit(X_train,y_train)

In [None]:
rs_log_reg.best_params_

In [None]:
rs_log_reg.score(X_test,y_test)

Now that we have tuned LogisticRegression(), let's do the same for RandomForestClassifier().

In [None]:
np.random.seed(42)

# Setup random hyperparameter search for RandomForestClassifier
rs_rf = RandomizedSearchCV(RandomForestClassifier(),
                                param_distributions=rf_grid,
                                cv=5,
                                n_iter = 20,
                                verbose = True)

# Fit random hyperparameter search model for RandomForestClassifier
rs_rf.fit(X_train,y_train)

In [None]:
rs_rf.best_params_

In [None]:
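# Score on the held-out test set so the result is comparable with the LogisticRegression score above
rs_rf.score(X_test, y_test)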

## Hyperparameter Tuning with GridSearchCV 

Since our LogisticRegression model provides the best scores so far, we will try to improve its hyperparameters further using GridSearchCV.
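Unlike a randomized search, a grid search fits every combination in the grid on every fold. The sketch below runs the same kind of search on synthetic data and reads the number of candidates and fits back from the fitted object. The synthetic data and the names `search_demo`, `n_candidates` and `n_fits` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# 30 values of C and a single solver, as in the notebook's grid
grid_demo = {"C": np.logspace(-4, 4, 30), "solver": ["liblinear"]}

search_demo = GridSearchCV(LogisticRegression(), param_grid=grid_demo, cv=5)
search_demo.fit(X_demo, y_demo)

# Every combination is evaluated once per fold
n_candidates = len(search_demo.cv_results_["params"])
n_fits = n_candidates * search_demo.n_splits_

print(n_candidates, "candidates,", n_fits, "fits")
print(search_demo.best_params_, search_demo.best_score_)
```

Note that `best_score_` is the mean cross-validated score on the training folds, which is a different quantity from the score on a held-out test set.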

In [None]:
# Different hyperparameters for our LogisticRegression model
log_reg_grid={
    "C":np.logspace(-4,4,30),
    "solver":["liblinear"]
}

# Setup grid hyperparameter search for LogisticRegression
gs_log_reg = GridSearchCV(LogisticRegression(),
                          param_grid=log_reg_grid,
                          cv = 5,
                          verbose = True)
# Fit the model
gs_log_reg.fit(X_train,y_train)

In [None]:
gs_log_reg.score(X_test,y_test)

## Evaluating our tuned machine learning classifier, beyond accuracy 

* ROC Curve and AUC Score
* Confusion Matrix
* Classification Report
* Precision
* Recall 
* F1-Score

... and it would be great if cross-validation was used where possible 

To make comparisons and evaluate our trained model, first we need to make predictions.
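Two kinds of predictions are useful here: hard class labels from `predict()` and class probabilities from `predict_proba()`. The sketch below, on synthetic data, illustrates that for a binary LogisticRegression the hard labels correspond to thresholding the positive-class probability at 0.5, and that ROC AUC is computed from the probabilities, not from the labels. The data and the `*_demo` names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
model_demo = LogisticRegression(solver="liblinear").fit(X_demo, y_demo)

# Hard labels (0 or 1) and probability of the positive class
hard_labels = model_demo.predict(X_demo)
pos_proba = model_demo.predict_proba(X_demo)[:, 1]

# Thresholding the probability at 0.5 reproduces the hard labels
labels_match = np.array_equal(hard_labels, (pos_proba > 0.5).astype(int))

# AUC ranks examples by probability, so it needs pos_proba, not hard_labels
auc_demo = roc_auc_score(y_demo, pos_proba)

print(labels_match, auc_demo)
```

This sketch scores the model on the data it was trained on, purely to keep it short; the notebook evaluates on the test set.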

In [None]:
# Make Predictions with tuned model
y_preds = gs_log_reg.predict(X_test)

In [None]:
y_preds

In [None]:
y_test

In [None]:
from sklearn import metrics

y_score = gs_log_reg.predict_proba(X_test)[:, 1]   # Probability of class 1

fpr, tpr, thresholds = metrics.roc_curve(y_test, y_score)
roc_auc = metrics.auc(fpr, tpr)

disp = metrics.RocCurveDisplay(fpr=fpr, tpr=tpr, roc_auc=roc_auc)
disp.plot()

In [None]:

y_score

In [None]:
gs_log_reg.best_params_

def plot_conf_mat(y_test, y_preds):
    """
    Plots a confusion matrix using Seaborn's heatmap().
    """
    fig, ax = plt.subplots(figsize=(3, 3))
    ax = sns.heatmap(confusion_matrix(y_test, y_preds),
                     annot=True, # Annotate the boxes
                     cbar=False)
    plt.xlabel("Predicted label") # predictions go on the x-axis
    plt.ylabel("True label") # true labels go on the y-axis 
    
plot_conf_mat(y_test, y_preds)

In [None]:
confusion_matrix(y_test, y_preds)

Now that we've got a ROC curve, an AUC metric and a confusion matrix, let's get a classification report as well as cross-validated precision, recall and F1 score.
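Precision, recall and F1 can all be derived from the four cells of a binary confusion matrix. The sketch below works through the arithmetic on a small hand-made example and compares the result with scikit-learn's own metric functions. The label lists are made up for illustration.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Made-up true labels and predictions (assumption)
y_true_demo = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred_demo = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels, ravel() yields: true neg, false pos, false neg, true pos
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

# Precision: of the predicted positives, how many were correct?
precision_demo = tp / (tp + fp)
# Recall: of the actual positives, how many were found?
recall_demo = tp / (tp + fn)
# F1: harmonic mean of precision and recall
f1_demo = 2 * precision_demo * recall_demo / (precision_demo + recall_demo)

print(tn, fp, fn, tp)
print(precision_demo, recall_demo, f1_demo)
print(precision_score(y_true_demo, y_pred_demo),
      recall_score(y_true_demo, y_pred_demo),
      f1_score(y_true_demo, y_pred_demo))
```

In this example there are 3 true positives, 1 false positive and 1 false negative, so precision, recall and F1 all come out to 0.75.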

In [None]:
print(classification_report(y_test,y_preds))

### Calculating evaluation metrics using cross-validation

We're going to calculate the precision, recall and F1 score of our model using cross-validation, and to do so we'll be using `cross_val_score()`.
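Calling `cross_val_score()` once per metric refits the model for every metric. As an alternative sketch, scikit-learn's `cross_validate()` accepts a list of scoring names and computes them all from the same folds. The synthetic data and the names `clf_demo`, `cv_results_demo` and `cv_summary` are assumptions for illustration; the notebook itself uses `cross_val_score()`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
clf_demo = LogisticRegression(solver="liblinear")

metric_names = ["accuracy", "precision", "recall", "f1"]

# One pass over 5 folds, scoring every metric on each fold
cv_results_demo = cross_validate(clf_demo, X_demo, y_demo, cv=5, scoring=metric_names)

# Per-fold scores are stored under "test_<metric>"; average them per metric
cv_summary = {name: cv_results_demo[f"test_{name}"].mean() for name in metric_names}

print(cv_summary)
```

The resulting dictionary can be passed to `pd.DataFrame` in the same way as the individual means used later in this section.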

In [None]:
# Check best Hyperparameters 
gs_log_reg.best_params_

In [None]:
# Create a new classifier with best parameters
clf = LogisticRegression(C= 0.20433597178569418, solver = "liblinear")

In [None]:
# Cross-validated accuracy
cv_acc = cross_val_score(clf,X,y,cv = 5, scoring = "accuracy")

cv_acc.mean()

In [None]:
#Cross-validated precision
cv_prec = cross_val_score(clf,X,y,cv = 5, scoring = "precision")
cv_prec.mean()

In [None]:
#Cross-validated recall
cv_rec = cross_val_score(clf,X,y,cv = 5, scoring = "recall")
cv_rec.mean()

In [None]:
# Cross-validated f1-score
cv_f1 = cross_val_score(clf,X,y,cv = 5, scoring = "f1")
cv_f1.mean()

In [None]:
# Visualize cross-validated metrics 
cv_metrics = pd.DataFrame({"Accuracy": cv_acc.mean(),
                           "Precision":cv_prec.mean(),
                           "Recall":cv_rec.mean(),
                           "F1":cv_f1.mean()},
                          index = [0])
cv_metrics.T.plot.bar(title="Cross-validated classification metrics", legend= False)

### Feature Importance 

Feature importance is another way of asking, "Which features contributed most to the outcomes of the model, and how did they contribute?"

Finding feature importance is different for each machine learning model.

Let's find the feature importance for our LogisticRegression model...
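For a logistic regression, one common reading of "importance" is the fitted coefficient of each feature: the sign gives the direction of the association with the positive class, and the exponentiated coefficient can be read as an odds ratio. The sketch below shows this on synthetic data. The data, the feature names `f0` to `f3` and the `*_demo` variables are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data with hypothetical feature names (assumption)
X_arr, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=["f0", "f1", "f2", "f3"])

model_demo = LogisticRegression(solver="liblinear").fit(X_demo, y_demo)

# One coefficient per feature, labelled with the feature (not target) columns
coefs = pd.Series(model_demo.coef_[0], index=X_demo.columns)

# exp(coef) is the multiplicative change in odds per unit increase in the feature
odds_ratios = np.exp(coefs)

# Rank features by absolute coefficient size
ranked = coefs.abs().sort_values(ascending=False)

print(coefs)
print(odds_ratios)
print(ranked)
```

Coefficient magnitudes are only comparable across features when the features are on similar scales, so rankings from unscaled data should be read with care.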

In [None]:
df.head()

In [None]:
# Fit an instance of Logistic Regression
# Create a new classifier with best parameters
clf = LogisticRegression(C= 0.20433597178569418, solver = "liblinear")

clf.fit(X_train,y_train);

In [None]:
#Check coef_
clf.coef_

In [None]:
# Match each coefficient to its feature name (X excludes the target column)
feature_dict = dict(zip(X.columns, list(clf.coef_[0])))
feature_dict

In [None]:
# Visualize Feature Importance 
feature_df = pd.DataFrame(feature_dict, index =[0])
feature_df.T.plot.bar(title="Feature importance", legend = False); 

In [None]:
pd.crosstab(df["sex"],df["target"])

In [None]:
pd.crosstab(df["slope"],df["target"]) #  slope: the slope of the peak exercise ST segment

## 6. Experimentation

If you haven't hit your evaluation metric yet... ask yourself...

* Could you collect more data?
* Could you try a better model? Like CatBoost or XGBoost?
* Could you improve the current models? (beyond what we've done so far)
* If your model is good enough (you have hit your evaluation metric), how would you export it and share it with others?
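On the question of exporting, one common option for scikit-learn models is `joblib`, which scikit-learn itself depends on. The sketch below saves a fitted model to a temporary file and loads it back. The synthetic data, the file name `heart_disease_model.joblib` and the `*_demo` names are hypothetical and chosen for illustration.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data and model (assumption)
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=0)
model_demo = LogisticRegression(solver="liblinear").fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name; any path works
    model_path = os.path.join(tmp_dir, "heart_disease_model.joblib")

    # Serialize the fitted model to disk, then load it back
    joblib.dump(model_demo, model_path)
    restored_model = joblib.load(model_path)

# The restored model should make the same predictions as the original
same_predictions = bool((restored_model.predict(X_demo) == model_demo.predict(X_demo)).all())
print(same_predictions)
```

Saved models are generally only safe to load with a compatible scikit-learn version and from a trusted source, so it helps to record the library versions alongside the file.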