# Predicting heart disease using machine learning

This notebook uses various Python-based machine learning and data science libraries to build a machine learning model that predicts whether or not someone has heart disease based on their medical attributes.

Approach:
1. Problem definition
2. Data
3. Evaluation
4. Features
5. Modelling
6. Experimentation

### 1. Problem Definition
In a statement,
> Given clinical parameters about a patient, can we predict whether or not they have heart disease?

### 2. Data
The original data came from the Cleveland data from the UCI Machine Learning Repository.
A version is also available on Kaggle: https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset

### 3. Evaluation
> If we can reach 95% accuracy at predicting whether or not a patient has heart disease during the proof of concept, we'll pursue the project.

### 4. Features
##### Create the data dictionary

        age
        sex
        chest pain type (4 values)
        resting blood pressure
        serum cholesterol in mg/dl
        fasting blood sugar > 120 mg/dl
        resting electrocardiographic results (values 0,1,2)
        maximum heart rate achieved
        exercise induced angina
        oldpeak = ST depression induced by exercise relative to rest
        the slope of the peak exercise ST segment
        number of major vessels (0-3) colored by fluoroscopy
        thal: 0 = normal; 1 = fixed defect; 2 = reversible defect
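
The same dictionary can be kept in code so it sits next to the analysis. The sketch below assumes the column names of the Kaggle CSV. Only `age`, `sex`, `cp`, `thalach` and `slope` are confirmed by the code in this notebook; the other names are assumptions, and `describe` is a hypothetical helper.

```python
# Hypothetical data dictionary: column name -> description from the list above.
# Names other than age, sex, cp, thalach and slope are assumed, not confirmed.
data_dictionary = {
    "age": "age",
    "sex": "sex",
    "cp": "chest pain type (4 values)",
    "trestbps": "resting blood pressure",
    "chol": "serum cholesterol in mg/dl",
    "fbs": "fasting blood sugar > 120 mg/dl",
    "restecg": "resting electrocardiographic results (values 0,1,2)",
    "thalach": "maximum heart rate achieved",
    "exang": "exercise induced angina",
    "oldpeak": "ST depression induced by exercise relative to rest",
    "slope": "the slope of the peak exercise ST segment",
    "ca": "number of major vessels (0-3) colored by fluoroscopy",
    "thal": "0 = normal; 1 = fixed defect; 2 = reversible defect",
}


def describe(column):
    """Return the description for a column, or a fallback if it is unknown."""
    return data_dictionary.get(column, "no description available")


print(describe("thalach"))
```
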
### Preparing the tools

In [None]:
# Import all the tools we need
# Regular EDA (exploratory data analysis) and plotting libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# We want our plots to appear inside the notebook
%matplotlib inline

# Models from scikit learn
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

# Model evaluation tools
from sklearn.model_selection import train_test_split,cross_val_score
from sklearn.model_selection import RandomizedSearchCV,GridSearchCV
from sklearn.metrics import confusion_matrix,classification_report
from sklearn.metrics import precision_score,recall_score,f1_score
from sklearn.metrics import RocCurveDisplay
# plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay replaces it

## Load data

In [None]:
df=pd.read_csv("heart-disease.csv")
df

In [None]:
df.shape  #rows and cols

## Data Exploration (EDA)
The goal here is to find out more about the data and become a subject matter expert on the dataset.

1. What question(s) are you trying to solve?
2. What kind of data do we have and how do we treat different types?
3. What's missing from the data and how do you deal with it?
4. Where are the outliers and why should you care about them?
5. How can you add, change or remove features to get more out of your data?
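
A few of these questions map directly onto pandas calls. The sketch below runs them on a tiny made-up frame rather than the real data; the `chol` column name and the 1.5 * IQR outlier rule are assumptions chosen for illustration, not something this notebook prescribes.

```python
import numpy as np
import pandas as pd

# Small synthetic frame standing in for the real data (values are made up).
toy = pd.DataFrame({
    "age": [29, 41, 54, 63, 77],
    "chol": [180.0, 210.0, np.nan, 250.0, 564.0],
    "target": [0, 1, 1, 0, 1],
})

# Question 2: what kinds of data do we have?
print(toy.dtypes)

# Question 3: what is missing?
missing_per_column = toy.isna().sum()
print(missing_per_column)


# Question 4: flag outliers with the 1.5 * IQR rule (one common heuristic).
def iqr_outliers(series):
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series[(series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)]


chol_outliers = iqr_outliers(toy["chol"].dropna())
print(chol_outliers)
```
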

In [None]:
# Count how many examples of each class there are
df["target"].value_counts()

In [None]:
df["target"].value_counts().plot(kind="bar",
                                 color=["salmon","lightblue"],
                                 xlabel="target",
                                 ylabel="count");
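
Class counts also give a useful reference point for the accuracy goal: a model that always predicts the most common class already scores the share of that class. A minimal sketch on made-up labels (the `target` series here is synthetic, not the notebook's column):

```python
import pandas as pd

# Made-up labels standing in for df["target"].
target = pd.Series([1, 1, 1, 0, 0, 1, 0, 1, 1, 0])

# Share of each class instead of raw counts.
class_share = target.value_counts(normalize=True)

# Accuracy of always predicting the most common class.
majority_baseline = class_share.max()

print(class_share)
print(f"Majority-class baseline accuracy: {majority_baseline:.2f}")
```
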

In [None]:
df.info()

In [None]:
# Finding if there are any missing values
df.isna().sum()

In [None]:
df.describe()

### Heart Disease Frequency by sex

In [None]:
df.sex.value_counts()

In [None]:
# Compare the sex column with target column
pd.crosstab(df.target,df.sex)

In [None]:
# Create a plot of crosstab
pd.crosstab(df.target,df.sex).plot(kind="bar",
                                   figsize=(10,6),ylabel="sex count",
                                   color=["salmon","lightblue"]);
plt.title("Heart Disease Frequency by Sex")
plt.xlabel("0 = No disease, 1 = Disease")
plt.legend(["Female","Male"]);
plt.xticks(rotation=0);
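
Raw counts are hard to compare when the groups differ in size. A hedged sketch of the same crosstab expressed as within-group rates, using `normalize="columns"` on a small synthetic frame (the values are made up):

```python
import pandas as pd

# Synthetic stand-in for the sex and target columns.
toy = pd.DataFrame({
    "sex":    [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
    "target": [1, 1, 1, 0, 1, 1, 1, 0, 0, 0],
})

# Each column (one per sex value) now sums to 1, so rows can be read as rates.
rates = pd.crosstab(toy["target"], toy["sex"], normalize="columns")
print(rates)
```
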

### Age vs Max Heart Rate for Heart Disease

In [None]:
df.thalach.value_counts()

In [None]:
plt.figure(figsize=(8,5))
# Scatter with positive examples
plt.scatter(df.age[df.target==1],
            df.thalach[df.target==1],
            c="green");
# Scatter with negative examples
plt.scatter(df.age[df.target==0],
            df.thalach[df.target==0],
            c="salmon");
plt.xlabel("Age");
plt.ylabel("Max Heart Rate");
plt.title("Heart Disease as a Function of Age and Max Heart Rate")
plt.legend(["Disease","No Disease"]);

In [None]:
# Checking the distribution of Age column with histogram
df.age.plot.hist();

### Heart Disease Frequency per Chest Pain Type

In [None]:
pd.crosstab(df.cp,df.target)

In [None]:
# Making the crosstab visual
pd.crosstab(df.cp,df.target).plot(kind="bar",
                                  figsize=(10,6),
                                  color=["salmon","lightblue"])
plt.title("Heart Disease Frequency per Chest Pain Type")
plt.xlabel("Chest pain Type")
plt.ylabel("Amount")
plt.legend(["No Disease","Disease"])
plt.xticks(rotation=0);

In [None]:
# Make a correlation matrix
df.corr()

In [None]:
corr_matrix = df.corr()
fig,ax = plt.subplots(figsize=(15,10))
ax = sns.heatmap(corr_matrix,
                 annot=True,
                 linewidths=0.5,
                 fmt=".2f",
                 cmap="YlGnBu"
                );
#bottom,top = ax.get_ylim()
#ax.set_ylim(bottom+0.5,top-0.5)
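
The heatmap shows every pair of columns; often only the target column matters. A small sketch, on made-up data with hypothetical column names `strong` and `weak`, of ranking features by the absolute value of their correlation with the target:

```python
import pandas as pd

# Synthetic frame: "strong" tracks the target closely, "weak" barely does.
toy = pd.DataFrame({
    "strong": [1, 2, 3, 7, 8, 9],
    "weak":   [5, 1, 4, 2, 6, 3],
    "target": [0, 0, 0, 1, 1, 1],
})

# Take the target column of the correlation matrix and drop its self-correlation.
target_corr = toy.corr()["target"].drop("target")

# Rank by strength, ignoring the sign.
ranked = target_corr.abs().sort_values(ascending=False)
print(ranked)
```
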

## 5. Modelling

In [None]:
# Split data into X and y
X = df.drop("target",axis=1)
y=df["target"]
X.head()

In [None]:
y.head()

In [None]:
# Split data into train and test sets
np.random.seed(42)
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2)

In [None]:
len(X_train),len(X_test)
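
An alternative to seeding NumPy globally is to pass `random_state` to `train_test_split`, and `stratify` keeps the class ratio the same in both splits. This is a sketch on synthetic data, not the split used in this notebook:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 60 positives and 40 negatives.
X_demo = pd.DataFrame({"feature": np.arange(100)})
y_demo = pd.Series([1] * 60 + [0] * 40)

# random_state fixes the shuffle; stratify keeps the class ratio in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(len(X_tr), len(X_te))
print(y_tr.mean(), y_te.mean())
```
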

#### Experimenting with 3 different machine learning models:
1. Logistic Regression
2. K-Nearest Neighbors Classifier
3. Random Forest Classifier
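
KNN relies on distances between rows, so a feature measured on a large scale can dominate the others. This notebook fits the models on unscaled features; the sketch below shows one possible variation, wrapping KNN in a pipeline with `StandardScaler`, on synthetic data. Whether it helps on the heart disease data is not something this sketch establishes.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the heart disease features.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The scaler is fitted on the training split only, then applied to the test split.
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier())
scaled_knn.fit(X_tr, y_tr)

scaled_knn_accuracy = scaled_knn.score(X_te, y_te)
print(f"Scaled KNN accuracy: {scaled_knn_accuracy:.2f}")
```
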

In [None]:
# Keep the models in a dictionary
models = {"Logistic Regression" : LogisticRegression(),
          "KNN" : KNeighborsClassifier(),
          "Random Forest" : RandomForestClassifier()}

# A function to fit and score models
def fit_and_score(models,X_train,X_test,y_train,y_test):
    """
    Fits and evaluates given machine learning models.
    models : a dict of different scikit-learn ML models.
    X_train and y_train are the training data and labels,
    X_test and y_test are the testing data and labels.
    """
    np.random.seed(42)
    # Make a dictionary to keep model scores
    model_scores={}
    # Loop through models for evaluating
    for name,model in models.items():
        model.fit(X_train,y_train)
        model_scores[name]=model.score(X_test,y_test)
    return model_scores

In [None]:
model_scores=fit_and_score(models = models,
                           X_train = X_train,
                           X_test = X_test,
                           y_train = y_train,
                           y_test = y_test)

model_scores

### Model Comparison

In [None]:
model_compare = pd.DataFrame(model_scores,index = ["accuracy"])
model_compare.T.plot.bar();
plt.xticks(rotation=0);
plt.ylabel("Accuracy score")
plt.title("Accuracy Measure for baseline models")

Approach for getting the best model:
1. Hyperparameter tuning
2. Feature importance
3. Confusion matrix
4. Cross-validation
5. Precision
6. Recall
7. F1 Score
8. Classification report
9. ROC curve
10. Area under the curve

### Hyperparameter Tuning (by hand)

In [None]:
# Tuning KNN model
train_scores=[]
test_scores=[]
neighbors = range(1,21)
knn = KNeighborsClassifier()
for i in neighbors:
    knn.set_params(n_neighbors=i)
    knn.fit(X_train,y_train)
    train_scores.append(knn.score(X_train,y_train))
    test_scores.append(knn.score(X_test,y_test))

In [None]:
train_scores

In [None]:
test_scores

In [None]:
plt.plot(neighbors,train_scores,label="Train Score")
plt.plot(neighbors,test_scores,label="Test Score")
plt.xlabel("Number of neighbors")
plt.ylabel("Model score")
plt.xticks(np.arange(1,21,1))
plt.legend()
print(f"Maximum KNN score on test data: {max(test_scores)*100:.2f}%")
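
Choosing `n_neighbors` by its test score risks tuning to the test set. One hedged alternative is to pick it by cross-validation on the training data only. In this sketch `X_demo` and `y_demo` are synthetic and stand in for `X_train` and `y_train`:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data standing in for the training split.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)

# Mean 5-fold accuracy for each candidate number of neighbors.
mean_cv_scores = {}
for k in range(1, 21):
    knn_demo = KNeighborsClassifier(n_neighbors=k)
    mean_cv_scores[k] = np.mean(cross_val_score(knn_demo, X_demo, y_demo, cv=5))

# The candidate with the highest mean cross-validated accuracy.
best_k = max(mean_cv_scores, key=mean_cv_scores.get)
print(f"Best k by cross-validation: {best_k}")
```
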

### Hyperparameter tuning with RandomizedSearchCV

Both LogisticRegression and RandomForestClassifier are tuned with RandomizedSearchCV.
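
To see what a randomized search tries, it helps to look at the candidates it draws: `n_iter` parameter combinations sampled from the grid. The sketch below uses scikit-learn's `ParameterSampler` on a grid of the same shape as the logistic regression one; `demo_grid` is an illustrative name.

```python
import numpy as np
from sklearn.model_selection import ParameterSampler

# Same shape of grid as used for logistic regression in this notebook.
demo_grid = {"C": np.logspace(-4, 4, 20), "solver": ["liblinear"]}

# Draw 5 random candidates; a randomized search evaluates n_iter such draws.
candidates = list(ParameterSampler(demo_grid, n_iter=5, random_state=42))
for params in candidates:
    print(params)
```

Because every entry in the grid is a list or array, the candidates are drawn without replacement, so no combination is tried twice.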

In [None]:
# Create a hyperparameter grid for LogisticRegression
log_reg_grid = {"C":np.logspace(-4,4,20),
                "solver":["liblinear"]}
# Create a hyperparameter grid for RandomForestClassifier
rf_grid={"n_estimators":np.arange(10,1000,50),
         "max_depth":[None,3,5,10],
         "min_samples_split":np.arange(2,20,2),
         "min_samples_leaf":np.arange(1,20,2)}

In [None]:
# Tuning logistic regression model
np.random.seed(42)
rs_log_reg = RandomizedSearchCV(LogisticRegression(),
                                param_distributions=log_reg_grid,
                                cv=5,
                                n_iter=20,
                                verbose=True)
rs_log_reg.fit(X_train,y_train)

In [None]:
rs_log_reg.best_params_

In [None]:
rs_log_reg.score(X_test,y_test)

In [None]:
# Tuning random forest model
np.random.seed(42)
rs_rf = RandomizedSearchCV(RandomForestClassifier(),
                           param_distributions=rf_grid,
                           cv=5,
                           n_iter=20,
                           verbose=True)
rs_rf.fit(X_train,y_train)

In [None]:
rs_rf.score(X_test,y_test)

In [None]:
rs_rf.best_params_

### Hyperparameter tuning with GridSearchCV

In [None]:
# Create a hyperparameter grid for LogisticRegression
log_reg_grid = {"C":np.logspace(-4,4,20),
                "solver":["liblinear"]}
gs_log_reg = GridSearchCV(LogisticRegression(),
                          param_grid=log_reg_grid,
                          cv=5,
                          verbose=True)
gs_log_reg.fit(X_train,y_train)


In [None]:
gs_log_reg.best_params_

In [None]:
gs_log_reg.score(X_test,y_test)

In [None]:
# rf_grid={"n_estimators":np.arange(10,1000,50),
#          "max_depth":[None,3,5,10],
#          "min_samples_split":np.arange(2,20,2),
#          "min_samples_leaf":np.arange(1,20,2)}
# gs_rf = GridSearchCV(RandomForestClassifier(),
#                      param_grid=rf_grid,
#                      cv=5,
#                      verbose=True)
# gs_rf.fit(X_train,y_train)
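
One plausible reason to leave an exhaustive search over the random forest grid commented out is its cost. The sketch below counts the candidates with scikit-learn's `ParameterGrid`; `rf_grid_demo` repeats the grid values from this notebook under a different name.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same values as rf_grid in this notebook.
rf_grid_demo = {"n_estimators": np.arange(10, 1000, 50),
                "max_depth": [None, 3, 5, 10],
                "min_samples_split": np.arange(2, 20, 2),
                "min_samples_leaf": np.arange(1, 20, 2)}

# Every combination is one candidate; each candidate is fitted once per fold.
n_candidates = len(ParameterGrid(rf_grid_demo))
n_fits = n_candidates * 5  # cv=5

print(n_candidates, n_fits)
```

This gives 7200 candidates and 36000 fits, compared with 20 candidates and 100 fits for the randomized search with `n_iter=20`.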

## Evaluating our tuned machine learning classifier, beyond accuracy
* ROC curve and AUC score
* Confusion matrix
* Classification report
* Precision
* Recall
* F1 score
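
Precision, recall and F1 all derive from the four counts in the confusion matrix. A minimal worked example on made-up labels, computing them by hand so they can be compared with scikit-learn's functions:

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Made-up true labels and predictions.
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred_demo = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

precision = tp / (tp + fp)   # of the predicted positives, how many were right
recall = tp / (tp + fn)      # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(precision, recall, f1)
print(precision_score(y_true_demo, y_pred_demo),
      recall_score(y_true_demo, y_pred_demo),
      f1_score(y_true_demo, y_pred_demo))
```
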

In [None]:
# Make predictions with the tuned model
y_preds=gs_log_reg.predict(X_test)
y_preds

In [None]:
# Plot the ROC curve and calculate the AUC metric
RocCurveDisplay.from_estimator(estimator=gs_log_reg,
                               X=X_test,
                               y=y_test);
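
The AUC shown on the plot can also be obtained as a number. A small sketch on four made-up scores, using `roc_curve` and `roc_auc_score`; here the AUC equals the fraction of (positive, negative) pairs in which the positive example gets the higher score.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Made-up labels and predicted scores for the positive class.
y_true_demo = [0, 0, 1, 1]
y_score_demo = [0.1, 0.4, 0.35, 0.8]

# False positive rate and true positive rate at each score threshold.
fpr, tpr, thresholds = roc_curve(y_true_demo, y_score_demo)

# Area under that curve; 3 of the 4 positive/negative pairs are ranked correctly.
auc_demo = roc_auc_score(y_true_demo, y_score_demo)
print(auc_demo)
```
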


In [None]:
sns.set(font_scale=1.5)
def plot_conf_matrix(y_test,y_preds):
    """
    Plots a nice looking confusion matrix using Seaborn's heatmap().
    """
    fig,ax = plt.subplots(figsize=(3,3))
    ax = sns.heatmap(confusion_matrix(y_test,y_preds),
                     annot = True,
                     cbar=False)
    # confusion_matrix puts true labels on rows and predictions on columns
    plt.xlabel("Predicted label")
    plt.ylabel("True label")

plot_conf_matrix(y_test,y_preds)
    

#### Classification report

In [None]:
print(classification_report(y_test,y_preds))

### Calculate evaluation metrics using cross validation

In [None]:
# Create a new classifier with the best parameters found by the search
clf = LogisticRegression(C=0.23357214690901212, solver="liblinear")

In [None]:
cv_acc=cross_val_score(clf,
                       X,
                       y,
                       cv=5,
                       scoring="accuracy")
cv_acc=np.mean(cv_acc)
cv_acc

In [None]:
cv_precision=cross_val_score(clf,
                       X,
                       y,
                       cv=5,
                       scoring="precision")
cv_precision=np.mean(cv_precision)
cv_precision

In [None]:
cv_recall=cross_val_score(clf,
                       X,
                       y,
                       cv=5,
                       scoring="recall")
cv_recall=np.mean(cv_recall)
cv_recall

In [None]:
cv_f1=cross_val_score(clf,
                       X,
                       y,
                       cv=5,
                       scoring="f1")
cv_f1=np.mean(cv_f1)
cv_f1

In [None]:
# Visualizing cross validated metrics
cv_metrics=pd.DataFrame({"Accuracy":cv_acc,
                         "Precision":cv_precision,
                         "Recall":cv_recall,
                         "F1":cv_f1},
                        index=[0])
cv_metrics.T.plot.bar(title="Cross validated classification metrics",
                     legend=False);
plt.xticks(rotation=0);
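
The four separate `cross_val_score` calls can also be collapsed into one call to `cross_validate`, which accepts several scorers at once. A sketch on synthetic data; `clf_demo` and `cv_summary` are illustrative names and the classifier uses default `C` rather than the tuned value:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic data standing in for X and y.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)
clf_demo = LogisticRegression(solver="liblinear")

metric_names = ["accuracy", "precision", "recall", "f1"]

# One pass over the folds, scoring every metric on each fold.
cv_results = cross_validate(clf_demo, X_demo, y_demo, cv=5, scoring=metric_names)

# Results are stored under "test_<metric>"; average them across folds.
cv_summary = {name: cv_results[f"test_{name}"].mean() for name in metric_names}
print(cv_summary)
```
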

## Feature Importance
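Feature importance asks which features contributed most to the model's predictions. For our Logistic Regression model, we read this from the coefficients of the fitted model.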

In [None]:
clf.fit(X_train,y_train);

In [None]:
clf.coef_

In [None]:
df.head()

In [None]:
# Match the coefficients to the feature columns (X excludes the target)
feature_dict=dict(zip(X.columns,list(clf.coef_[0])))
feature_dict

In [None]:
# Visualizing the feature importance
feature_df=pd.DataFrame(feature_dict,index=[0])
feature_df.T.plot.bar(title="Feature Importance",legend=False);
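
Logistic regression coefficients are on the log-odds scale, so exponentiating one gives an odds ratio: the multiplicative change in the odds of the positive class for a one-unit increase in that feature, holding the others fixed. The sketch below shows this on synthetic data with hypothetical feature names. Because the coefficients depend on each feature's units, comparing their raw sizes across features is only a rough guide.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data with hypothetical feature names.
X_arr, y_demo = make_classification(n_samples=200, n_features=4,
                                    n_informative=2, n_redundant=0,
                                    random_state=42)
X_demo = pd.DataFrame(X_arr, columns=["f0", "f1", "f2", "f3"])

clf_demo = LogisticRegression(solver="liblinear").fit(X_demo, y_demo)

# exp(coefficient) is the odds ratio for a one-unit increase in the feature.
odds_ratios = pd.Series(np.exp(clf_demo.coef_[0]), index=X_demo.columns)
print(odds_ratios.sort_values(ascending=False))
```
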

In [None]:
pd.crosstab(df["sex"],df["target"])

In [None]:
pd.crosstab(df["slope"],df["target"])

## 6. Experimentation
Next, try XGBoost and other models to improve on these results.