<a id="4"></a><h1 style='background:#6daa9f; border:3; color:white'><center> Machine Learning For Heart Failure Prediction: ANN Endgame </center></h1>

<center><img 
src="https://d1nakyqvxb9v71.cloudfront.net/wp-content/uploads/2020/01/heart-health-tips-animation-thumbnail.gif" width="900" height="900"></img></center>

<br>

<a id="4"></a><h1 style='background:#7ad16d; border:0; color:black'><center> Table of contents </center></h1>

1. [Introduction](#1)
1. [Data cleaning, exploration and preprocessing](#2)
1. [Basic model building](#3)
1. [Comparison: ANN vs rest](#4)
1. [Acknowledgements](#5)

<a id="1"></a><h1 style='background:#7ad16d; border:0; color:black'><center> Introduction </center></h1>

Cardiovascular disease (CVD) is the most common cause of morbidity and mortality among men and women globally. Heart failure is a common CVD condition. The Heart Foundation defines heart failure as "A condition where your heart isn’t pumping as well as it should be." The signs and symptoms of heart failure commonly include shortness of breath, excessive tiredness and leg swelling.

Common causes of heart failure:
    1. Coronary artery disease
    2. Myocardial infarction (heart attack)
    3. High blood pressure
    4. Atrial fibrillation
    5. Cardiomyopathy
    6. Valvular heart disease 
    7. Infections 

> **Objective**: In this notebook, I will build a series of classifier models and compare them with an ANN.

**Variables in the dataset:**

* **Age**: Age of the patient
* **Anaemia**: If the patient had the haemoglobin below the normal range
* **Creatinine_phosphokinase**: The level of the creatine phosphokinase in the blood in mcg/L
* **Diabetes**: If the patient was diabetic
* **Ejection_fraction**: Ejection fraction is a measurement of how much blood the left ventricle pumps out with each contraction
* **High_blood_pressure**: If the patient had hypertension
* **Platelets**: Platelet count of blood in kiloplatelets/mL
* **Serum_creatinine**: The level of serum creatinine in the blood in mg/dL
* **Serum_sodium**: The level of serum sodium in the blood in mEq/L
* **Sex**: The sex of the patient
* **Smoking**: If the patient smokes actively or has smoked in the past
* **Time**: Length of the patient's follow-up period in days
* **Death_event**: If the patient died during the follow-up period

<a id="2"></a><h1 style='background:#7ad16d; border:0; color:black'><center> Data cleaning, exploration and preprocessing  </center></h1>

# Loading the dataset

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import seaborn as sns
from keras.layers import Dense, BatchNormalization, Dropout, LSTM
from keras.models import Sequential
from keras.utils import to_categorical
from sklearn.metrics import precision_score, recall_score, confusion_matrix, classification_report, accuracy_score, f1_score

In [None]:
#loading data
data = pd.read_csv("../input/heart-failure-clinical-data/heart_failure_clinical_records_dataset.csv")
data.head()

In [None]:
#Prevalence of outcome event
sns.set_theme(context='poster')
plt.figure(figsize=(10,7))
plt.title('Disease status \n (Survived (0), Death (1))', fontsize=20)
cols= ["#7cd16d","#eb2009"]
sns.countplot(x= data["DEATH_EVENT"], palette= cols)
plt.show()

In [None]:
#finding missing values
data.isnull().sum()

In [None]:
#correlation between the variables in the study
# Styler.set_precision was removed in pandas 2.0; format(precision=...) replaces it
data.corr().style.background_gradient(cmap='Spectral').format(precision=2)

What the data tell us:
<br>
    1. Serum creatinine (r=0.29) is positively and serum sodium (r=-0.20) is negatively correlated with the risk of death.
    1. Interestingly, lifestyle factors such as smoking (r=-0.01) and diabetes (r=0.00) showed little or no correlation with the risk of death.
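
To read these coefficients without scanning the full matrix, one option is to pull out only the column for the outcome and rank the features by absolute correlation. The sketch below uses a small synthetic table (the `demo` frame and its values are made up for illustration); with the notebook's data, `demo` would be replaced by `data`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the clinical data; values are illustrative only.
rng = np.random.default_rng(0)
n = 200
creatinine = rng.normal(1.4, 1.0, n)
sodium = rng.normal(137, 4, n)
smoking = rng.integers(0, 2, n)
# Outcome built to rise with creatinine and fall with sodium.
risk = 0.8 * creatinine - 0.1 * (sodium - 137) + rng.normal(0, 1, n)
demo = pd.DataFrame({
    "serum_creatinine": creatinine,
    "serum_sodium": sodium,
    "smoking": smoking,
    "DEATH_EVENT": (risk > np.median(risk)).astype(int),
})

# Pearson correlation of each feature with the outcome, strongest first.
target_corr = (
    demo.corr()["DEATH_EVENT"]
    .drop("DEATH_EVENT")
    .sort_values(key=np.abs, ascending=False)
)
print(target_corr.round(2))
```

Pearson correlation only captures linear association, so a value near zero does not by itself rule out a relationship between a feature and the outcome.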

In [None]:
#Age distribution
sns.set_theme(context='poster')
plt.figure(figsize=(20,20))
plt.title('Distribution of age', color="Green",fontsize=40)
ax = sns.countplot(x=data['age'])
ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right", fontsize=20)
plt.tight_layout()
plt.show()

In [None]:
# Violin and swarm plots of the non-binary features.
feature = ["age","creatinine_phosphokinase","ejection_fraction","platelets","serum_creatinine","serum_sodium", "time"]
for i in feature:
    plt.figure(figsize=(8,8))
    sns.swarmplot(x=data["DEATH_EVENT"], y=data[i], color="black", alpha=0.5)
    sns.violinplot(x=data["DEATH_EVENT"], y=data[i], palette=cols)
    plt.show()

What the data tell us:
<br>
    1. Outlier observations are detected for the variables above.
    1. This might be due to measurement error or due to some factors unique to the study population.
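
The plots show the outliers visually. A common way to flag them numerically is the interquartile-range (IQR) rule, sketched below. The helper `iqr_outlier_mask` and the sample values are hypothetical and not part of the original analysis; the multiplier `k=1.5` is the conventional default, not a value tuned to this dataset.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Illustrative values only (not taken from the dataset).
sample = np.array([0.9, 1.0, 1.1, 1.2, 1.3, 1.4, 9.4])
mask = iqr_outlier_mask(sample)
print(sample[mask])  # only the extreme value 9.4 is flagged
```

With the notebook's data, the same helper could be applied column by column, for example `iqr_outlier_mask(data["serum_creatinine"])`. Whether a flagged value should be corrected or kept is a clinical judgment, since extreme laboratory values can be genuine.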

In [None]:
sns.set_theme(context='poster')
plt.figure(figsize=(15,10))
plt.title('Kernel density plot for age based on follow up (time)', color="Green",fontsize=30)
sns.kdeplot(x=data["time"], y=data["age"], hue =data["DEATH_EVENT"], palette=cols)
plt.tight_layout()
plt.show()

In [None]:
data.describe().T

# Data preprocessing

The major steps involved in preprocessing:
<br>
    1. Outlier detection and correction.
    1. Feature engineering for the dependent and independent variables, if necessary.
    1. Dividing the dataset into training and test sets.
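
One point on the order of these steps: if the scaler is fitted on the full dataset before splitting, statistics from the test rows leak into training. A minimal sketch of the alternative order (split first, then fit the scaler on the training rows only) is shown here on synthetic data. The `X_demo` and `y_demo` arrays are made up, and `stratify` is an optional addition that keeps the class ratio similar in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 100 rows, 3 features, roughly one-third positive labels.
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y_demo = (rng.random(100) < 0.32).astype(int)

# Split first; stratify keeps the class ratio similar in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# Fit the scaler on the training rows only, then reuse it for the test rows.
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)

print(X_tr_scaled.mean(axis=0).round(3))  # approximately zero by construction
print(X_te_scaled.mean(axis=0).round(3))  # close to zero, but not exactly
```

On a dataset this small the effect of the leak is likely modest, but fitting on the training rows alone keeps the test score an honest estimate.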

In [None]:
#Defining the feature matrix X and the target variable y
X=data.drop(["DEATH_EVENT"],axis=1)
y=data["DEATH_EVENT"]

In [None]:
#Standardising the features of the dataset
col_names = list(X.columns)
s_scaler = preprocessing.StandardScaler()
X_df= s_scaler.fit_transform(X)
X_df = pd.DataFrame(X_df, columns=col_names)   
X_df.describe().T

In [None]:
#Examining the scaled features
sns.set_theme(context='poster')
plt.figure(figsize=(20,15))
plt.title('Examining the scaled features (of columns)', color="Green",fontsize=30)
#colours =["#774571","#b398af","#f1f1f1" ,"#afcdc7", "#6daa9f"]
sns.violinplot(data = X_df,palette = 'Set2')
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()

In [None]:
#Splitting into training and test sets
X_train, X_test, y_train,y_test = train_test_split(X_df,y,test_size=0.3,random_state=42)

<a id="3"></a><h1 style='background:#7ad16d; border:0; color:black'><center> Basic model building </center></h1>

We build our model using an artificial neural network, which involves the following steps:
1. Initialising the ANN
1. Defining the added layers
1. Compiling the ANN
1. Training the ANN

After building the ANN model, we compare its results with models built using:
    1. Logistic regression
    1. Random forest
    1. XGBoost
    1. CatBoost

In [None]:
# 1. Initialising the NN
model = Sequential()

# 2. layers
model.add(Dense(units = 9, kernel_initializer = 'uniform', activation = 'relu', input_dim = 12))
model.add(Dense(units = 9, kernel_initializer = 'uniform', activation = 'relu'))
model.add(Dense(units = 7, kernel_initializer = 'uniform', activation = 'relu'))
model.add(Dense(units = 5, kernel_initializer = 'uniform', activation = 'relu'))
model.add(Dense(units = 1, kernel_initializer = 'uniform', activation = 'sigmoid'))

# 3. Compiling the ANN
model.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])

# 4. Train the ANN
history = model.fit(X_train, y_train, batch_size = 32, epochs = 500, validation_split=0.2)

In [None]:
val_accuracy = np.mean(history.history['val_accuracy'])
print("\n%s: %.2f%%" % ('val_accuracy', val_accuracy*100))
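
The figure above is the mean validation accuracy over all epochs, so it blends the early, barely trained epochs with the later ones. As a complement, the sketch below reports the final-epoch value and the value at the epoch with the lowest validation loss. It runs on a hypothetical `demo_history` dictionary with made-up numbers; with the notebook's model, `history.history` would take its place.

```python
import numpy as np

# Hypothetical stand-in for history.history; the numbers are made up.
demo_history = {
    "val_accuracy": [0.55, 0.70, 0.78, 0.81, 0.80, 0.79],
    "val_loss": [0.69, 0.60, 0.52, 0.48, 0.50, 0.55],
}

val_acc = np.array(demo_history["val_accuracy"])
val_loss = np.array(demo_history["val_loss"])

best_epoch = int(np.argmin(val_loss))  # index of the lowest validation loss
summary = {
    "mean_val_accuracy": float(val_acc.mean()),
    "final_val_accuracy": float(val_acc[-1]),
    "best_epoch": best_epoch + 1,  # reported 1-based, as Keras prints epochs
    "val_accuracy_at_best_epoch": float(val_acc[best_epoch]),
}
for name, value in summary.items():
    print(f"{name}: {value}")
```

A validation loss that rises again after its minimum, as in this toy series, is a typical sign of overfitting and suggests that fewer epochs may be enough.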

Here, we show the testing results as well as the classification report and the confusion matrix.

In [None]:
# Predicting from the test set results
y_pred = model.predict(X_test)
y_pred = (y_pred > 0.5)
np.set_printoptions()

In [None]:
# Confusion matrix for prediction results 
cmap1 = sns.diverging_palette(275,150,  s=40, l=65, n=6)
plt.subplots(figsize=(12,8))
cf_matrix = confusion_matrix(y_test, y_pred)
sns.heatmap(cf_matrix/np.sum(cf_matrix), cmap = 'magma', annot = True, annot_kws = {'size':15})

In [None]:
ac_ann = accuracy_score(y_test,y_pred)

In [None]:
#Print the classification test results
print(classification_report(y_test, y_pred))
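
To make the link between the confusion matrix and the report explicit, the following sketch derives accuracy, precision, recall and F1 for the positive class by hand. The matrix `cm_demo` is illustrative and does not contain results from this notebook; it follows the scikit-learn layout, with true classes in rows and predicted classes in columns.

```python
import numpy as np

# Illustrative confusion matrix in scikit-learn layout:
# rows are true classes, columns are predicted classes, in the order [0, 1].
cm_demo = np.array([[50, 10],
                    [12, 18]])
tn, fp, fn, tp = cm_demo.ravel()

accuracy = (tp + tn) / cm_demo.sum()
precision = tp / (tp + fp)  # share of predicted deaths that were deaths
recall = tp / (tp + fn)     # share of actual deaths that were predicted
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

For a clinical outcome such as death, recall on the positive class is often the number to watch, because a missed high-risk patient usually costs more than a false alarm.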

<a id="4"></a><h1 style='background:#7ad16d; border:0; color:black'><center> Comparison: ANN vs rest </center></h1>


**1. Logistic regression**

In [None]:
# Confusion Matrix & accuracy score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score

model = LogisticRegression()

#Fit the model
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

mylist = []
# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
# accuracy score
acc_logreg = accuracy_score(y_test, y_pred)

mylist.append(acc_logreg)
print(cm)
print(acc_logreg)

**2. Random forest classification**

In [None]:
#Finding the optimum number of n_estimators
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, accuracy_score
list1 = []
for estimators in range(10,30):
    classifier = RandomForestClassifier(n_estimators = estimators, random_state=0, criterion='entropy')
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    list1.append(accuracy_score(y_test,y_pred))
#Figure
sns.set_theme(context='poster')
plt.figure(figsize=(15,10))
plt.title('Number of estimators', color="Green",fontsize=30)
plt.plot(list(range(10,30)), list1)
plt.show()
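
The loop above scores each candidate on the test set, so the chosen `n_estimators` is tuned to the same rows later used for the final evaluation. A hedged alternative is to compare candidates with cross-validation on the training data only. The sketch below uses synthetic data from `make_classification` as a stand-in; the names `X_demo_cv`, `y_demo_cv` and the coarser candidate grid are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for X_train / y_train (12 features, as in the notebook).
X_demo_cv, y_demo_cv = make_classification(
    n_samples=200, n_features=12, n_informative=5, random_state=0
)

# Mean 5-fold cross-validated accuracy for each candidate forest size.
cv_scores = {}
for n_estimators in range(10, 30, 5):
    clf = RandomForestClassifier(
        n_estimators=n_estimators, criterion="entropy", random_state=0
    )
    scores = cross_val_score(clf, X_demo_cv, y_demo_cv, cv=5, scoring="accuracy")
    cv_scores[n_estimators] = scores.mean()

best_n = max(cv_scores, key=cv_scores.get)
print(cv_scores)
print("best n_estimators:", best_n)
```

With the notebook's data, `X_train` and `y_train` would replace the synthetic arrays, and the test set would be touched only once, for the final score.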

In [None]:
# Training the RandomForest Classifier on the Training set
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators = 15, criterion='entropy', random_state=0)
classifier.fit(X_train,y_train)

In [None]:
# Predicting the test set results
y_pred = classifier.predict(X_test)
print(y_pred)

In [None]:
#Confusion matrix and accuracy score
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
acc_randomforest = accuracy_score(y_test, y_pred)
mylist.append(acc_randomforest)
print(cm)
print(acc_randomforest)

In [None]:
# Confusion matrix for prediction results 
cmap1 = sns.diverging_palette(275,150,  s=40, l=65, n=6)
plt.subplots(figsize=(12,8))
cf_matrix = confusion_matrix(y_test, y_pred)
sns.heatmap(cf_matrix/np.sum(cf_matrix), cmap = 'magma', annot = True, annot_kws = {'size':15})

**3. XGBoost**

In [None]:
from xgboost import XGBClassifier
from sklearn.metrics import confusion_matrix, accuracy_score
list1 = []
for estimators in range(10,30,1):
    classifier = XGBClassifier(n_estimators = estimators, max_depth=12, subsample=0.7)
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    list1.append(accuracy_score(y_test,y_pred))
##Figure
sns.set_theme(context='poster')
plt.figure(figsize=(15,10))
plt.title('Number of estimators', color="Green",fontsize=30)
plt.plot(list(range(10,30,1)), list1)
plt.show()

In [None]:
from xgboost import XGBClassifier
classifier = XGBClassifier(n_estimators = 10, max_depth=12, subsample=0.7)
classifier.fit(X_train,y_train)

In [None]:
y_pred = classifier.predict(X_test)
print(y_pred)

In [None]:
# Making the confusion matrix and calculating the accuracy score
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
ac_xgboost = accuracy_score(y_test, y_pred)
mylist.append(ac_xgboost)
print(cm)
print(ac_xgboost)

In [None]:
# Confusion matrix for prediction results 
cmap1 = sns.diverging_palette(275,150,  s=40, l=65, n=6)
plt.subplots(figsize=(12,8))
cf_matrix = confusion_matrix(y_test, y_pred)
sns.heatmap(cf_matrix/np.sum(cf_matrix), cmap = 'magma', annot = True, annot_kws = {'size':15})

**4. Catboost**

In [None]:
from catboost import CatBoostClassifier
classifier = CatBoostClassifier()
classifier.fit(X_train, y_train)

In [None]:
y_pred = classifier.predict(X_test)
print(y_pred)

In [None]:
#Confusion matrix and accuracy score
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
ac_catboost = accuracy_score(y_test, y_pred)
mylist.append(ac_catboost)
print(cm)
print(ac_catboost)

In [None]:
# Confusion matrix for prediction results 
cmap1 = sns.diverging_palette(275,150,  s=40, l=65, n=6)
plt.subplots(figsize=(12,8))
cf_matrix = confusion_matrix(y_test, y_pred)
sns.heatmap(cf_matrix/np.sum(cf_matrix), cmap = 'magma', annot = True, annot_kws = {'size':15})

In [None]:
#Summary of all model classifiers

In [None]:
# Labels listed in the same order as the scores (no KNN model was trained)
models = pd.DataFrame({
    'Model': ['Logistic Regression', 'Random Forest',
              'ANN', 'XGBoost', 'CatBoost'],
    'Score': [acc_logreg,
              acc_randomforest, ac_ann, ac_xgboost, ac_catboost
              ]})
models.sort_values(by='Score', ascending=False)

In [None]:
#Figure for all the classifier models
plt.rcParams['figure.figsize']=15,8 
sns.set_style("darkgrid")
ax = sns.barplot(x=models.Model, y=models.Score, palette = "rocket", saturation =1.5)
plt.xlabel("Classifier Models", fontsize = 20 )
plt.ylabel("Accuracy", fontsize = 20)
plt.title("Accuracy of different Classifier Models", fontsize = 25)
plt.xticks(fontsize = 12, horizontalalignment = 'center', rotation = 8)
plt.yticks(fontsize = 13)
for p in ax.patches:
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy() 
    ax.annotate(f'{height:.2%}', (x + width/2, y + height*1.02), ha='center', fontsize = 'x-large')
plt.show()

<a id="5"></a><h1 style='background:#7ad16d; border:0; color:black'><center> Acknowledgements </center></h1>

The model I have developed slightly underperformed compared with the models developed by fellow Kagglers [@karnikakapoor](https://www.kaggle.com/karnikakapoor/heart-failure-prediction-ann) and [@midouazerty](https://www.kaggle.com/midouazerty/heart-disease-using-8-machine-learning-algorithms). I will continue updating and training the model, hoping to get a better score in subsequent iterations.

> Last but not least, I would like to thank fellow Kagglers [@karnikakapoor](https://www.kaggle.com/karnikakapoor/heart-failure-prediction-ann) and [@midouazerty](https://www.kaggle.com/midouazerty/heart-disease-using-8-machine-learning-algorithms) for providing the template for building an ANN model. I further compared the ANN approach with other traditional and ML-based models using the template provided by [@midouazerty](https://www.kaggle.com/midouazerty/heart-disease-using-8-machine-learning-algorithms).

Despite the model's results here, the key drivers of cardiovascular health lie indisputably in our lifestyle habits (e.g. sedentary behaviour, consumption of junk and energy-dense food, smoking). Changes in these habits are essential for reducing the risk of cardiovascular events such as heart failure.

<center><img 
src="https://cdn.dribbble.com/users/1277402/screenshots/4180449/heartwalk.gif" width="900" height="900"></img></center>

<br>