# Heart Failure Prediction

*In this notebook I used three models to predict death events from the Heart Failure Prediction dataset. The models I used:*
1. Gradient Boosting Classifier
2. Neural Network
3. Support Vector Classifier

*Gradient boosting is an ensemble learning method: trees are added one at a time, and each new tree is trained to correct the errors of the trees before it.*
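
*The idea of each tree learning from the previous trees' errors can be sketched without any library. This is a simplified illustration, not scikit-learn's implementation: it assumes squared-error loss, one-split trees (stumps), made-up toy data, and hypothetical helper names (`fit_stump`, `boost`).*

```python
# Toy gradient boosting for squared error: every stump is fit to the
# residuals (errors) left over by the stumps that came before it.

def fit_stump(xs, residuals):
    """Return (threshold, left_value, right_value) with the lowest squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:  # candidate thresholds; both sides stay non-empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lv, rv)
    return best[1:]

def predict_stump(stump, x):
    t, lv, rv = stump
    return lv if x <= t else rv

def boost(xs, ys, n_trees=20, learning_rate=0.5):
    base = sum(ys) / len(ys)  # start from a constant prediction
    preds = [base] * len(xs)
    stumps, history = [], []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]  # current errors
        stump = fit_stump(xs, residuals)                # next tree learns the errors
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x) for p, x in zip(preds, xs)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return base, stumps, history

xs = [1, 2, 3, 4, 5, 6, 7, 8]  # made-up toy data
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]
base, stumps, mse_history = boost(xs, ys)
print("training MSE after first and last tree:", mse_history[0], mse_history[-1])
```

*The training error shrinks as trees are added, which is why the number of trees is worth tuning on held-out data.*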

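*I also used Random Forests to rank feature importances and drop the least important features, and trained a Random Forests classifier for comparison.*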

*Neural networks are a flexible way to build machine learning models, and Keras makes them easy to define.*

*Support vector classifiers separate the classes with a decision boundary. For non-linear problems they use kernels to fit the data; I used the radial basis function (RBF) kernel.*
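
*As a rough illustration of what the radial basis function kernel computes, here is a minimal sketch. The helper name `rbf_kernel`, the value of `gamma`, and the sample points are assumptions chosen for the example.*

```python
import math

def rbf_kernel(a, b, gamma=0.5):
    """k(a, b) = exp(-gamma * ||a - b||^2); gamma is an arbitrary illustrative value."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * sq_dist)

# Similarity is 1 for identical points and decays towards 0 with distance.
same = rbf_kernel([1.0, 2.0], [1.0, 2.0])
near = rbf_kernel([1.0, 2.0], [1.5, 2.0])
far = rbf_kernel([1.0, 2.0], [4.0, 6.0])
print(same, near, far)
```

*Because the kernel depends on distances between samples, features with large numeric ranges would dominate it, which is one reason to scale the input before fitting an SVC.*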

*To visualize the dataset and model performance, I used plotly.express, plotly.graph_objs and plotly.subplots.*


*I'm not a data scientist or a student in a computer science department; I just enjoy training models and am open to suggestions.*🥳

# Libraries

In [None]:
import pandas as pd
import numpy as np

from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier,RandomForestClassifier
from sklearn.metrics import mean_squared_error, accuracy_score
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.preprocessing import StandardScaler

from keras.models import Sequential
from keras.layers import Dense, Dropout

import plotly.graph_objs as go
from plotly.subplots import make_subplots
import plotly.express as px

import warnings
warnings.filterwarnings("ignore")

In [None]:
data = pd.read_csv("../input/heart-failure-clinical-data/heart_failure_clinical_records_dataset.csv")
data.head()

In [None]:
X = data.drop("DEATH_EVENT", axis=1)
y = data["DEATH_EVENT"]

rfc = RandomForestClassifier(n_estimators=60, bootstrap=True)
rfc.fit(X,y)
f_i = pd.DataFrame(data={"feature_importances":rfc.feature_importances_*100}, index=X.columns)

fig = go.Figure(data=[go.Pie(labels=f_i.index, values=f_i["feature_importances"], hole=.4)])
fig.update_layout(width=700, height=450, template = 'plotly_dark', title_text="Feature Importances")
fig.show()

### Feature Importances

In [None]:
f_i.sort_values(by="feature_importances", inplace=True)
X = X.drop(f_i.index[:5], axis=1)
data_ = data.copy().drop(f_i.index[:5], axis=1)
data_.head()

In [None]:
data_.describe()

# Exploratory Data Analysis

## Gender

In [None]:
fig = px.histogram(data, x="sex", marginal="violin", hover_data=data.columns,
                   title ="Gender", 
                   template="plotly_dark",
                   opacity=0.8)
fig.update_layout(
        width=700,height=600,
        yaxis_title_text='count',
        bargap=0.05,
        )
fig.show()

In [None]:
fig = px.histogram(data, x="sex", color="DEATH_EVENT", marginal="violin", hover_data=data.columns,
                   title ="Gender vs Death Event", 
                   template="plotly_dark",
                   color_discrete_map={"0": "RebeccaPurple", "1": "MediumPurple"},
                   opacity=0.8)
fig.update_layout(
        width=700,height=600,
        yaxis_title_text='count',
        bargap=0.05)
fig.show()

## Diabetes

In [None]:
diabetes = data[data["diabetes"] == 1]
non_diabetes = data[data["diabetes"] == 0]

d_d = diabetes[diabetes["DEATH_EVENT"] == 1]
d_s = diabetes[diabetes["DEATH_EVENT"] == 0]

nd_d = non_diabetes[non_diabetes["DEATH_EVENT"] == 1]
nd_s = non_diabetes[non_diabetes["DEATH_EVENT"] == 0]

fig = make_subplots(rows=3, cols=1, specs=[[{'type':'domain'}], [{'type':'domain'}],[{'type':'domain'}]])

fig.add_trace(go.Pie(labels=["Diabetes","No Diabetes"],
                     values=[len(diabetes),len(non_diabetes)],hole=.3),1,1)
fig.add_trace(go.Pie(name="DIABETES vs HEART FAILURE",labels=["Heart Failure","Survived"], 
                     values=[len(d_d),len(d_s)],hole=.3),2,1)

fig.add_trace(go.Pie(name="NO DIABETES vs HEART FAILURE",labels=["Heart Failure","Survived"],
                     values=[len(nd_d),len(nd_s)],hole=.3),3,1)
fig.update_traces(hole=.4, hoverinfo="label+percent")
fig.update_layout(width=700, height=900, title_text="Diabetes vs Heart Failure",template='plotly_dark',
                 annotations=[dict(text='DIABETES', x=0.2, y=0.6, font_size=10, showarrow=False),
                 dict(text='DIABETES RATIO', x=0.2, y=1.04, font_size=10, showarrow=False),
                 dict(text='NO DIABETES', x=0.2, y=0.3, font_size=10, showarrow=False)])
fig.show()

## Smoking

In [None]:
smokers = data[data["smoking"] == 1]
non_smokers = data[data["smoking"] == 0]

s_d = smokers[smokers["DEATH_EVENT"] == 1]
s_s = smokers[smokers["DEATH_EVENT"] == 0]

ns_d = non_smokers[non_smokers["DEATH_EVENT"] == 1]
ns_s = non_smokers[non_smokers["DEATH_EVENT"] == 0]

fig = make_subplots(rows=3, cols=1, specs=[[{'type':'domain'}], [{'type':'domain'}],[{'type':'domain'}]])

fig.add_trace(go.Pie(labels=["Smokers","Non Smokers"],
                     values=[len(smokers),len(non_smokers)],hole=.3),1,1)
fig.add_trace(go.Pie(labels=["Heart Failure","Survived"], 
                     values=[len(s_d),len(s_s)],hole=.3),2,1)

fig.add_trace(go.Pie(labels=["Heart Failure","Survived"],
                     values=[len(ns_d),len(ns_s)],hole=.3),3,1)
fig.update_traces(hole=.4, hoverinfo="label+percent")
fig.update_layout(width=700, height=900, title_text="Smoking vs Heart Failure",template='plotly_dark',
                 annotations=[dict(text='SMOKER RATIO', x=0.2, y=1.04, font_size=10, showarrow=False),
                 dict(text='SMOKERS', x=0.2, y=0.6, font_size=10, showarrow=False),
                 dict(text='NON SMOKERS', x=0.2, y=0.3, font_size=10, showarrow=False)])
fig.show()

# Outliers

In [None]:
for i in X.columns:
    fig = px.box(X, x=i, color_discrete_sequence=['mediumspringgreen'])
    fig.update_layout(width=700,height=450, title_text=i, template = 'plotly_dark')
    fig.show()

In [None]:
shape1 = data_.shape

for column in data_.columns:
    q1 = data_[column].quantile(0.25)
    q3 = data_[column].quantile(0.75)
    iqr = q3-q1
    minimum = q1-(1.5*iqr)
    maximum = q3+(1.5*iqr)
    
    min_in = data_[data_[column] < minimum].index
    max_in = data_[data_[column] > maximum].index
    
    data_.drop(min_in, inplace=True)
    data_.drop(max_in, inplace=True)

shape2 = data_.shape

outliers = shape1[0]-shape2[0]

print("Total number of outlier rows deleted: ",outliers)
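
*The IQR rule used in the cell above can be checked by hand on a small list. This is a minimal sketch with made-up values and a hypothetical helper `iqr_fences`; `statistics.quantiles` with `method="inclusive"` interpolates linearly between sorted values.*

```python
from statistics import quantiles

def iqr_fences(values, k=1.5):
    """Return (lower, upper) fences: q1 - k*IQR and q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

values = [10, 12, 13, 14, 15, 16, 18, 60]  # made-up sample; 60 is the obvious outlier
low, high = iqr_fences(values)
kept = [v for v in values if low <= v <= high]  # values outside the fences are dropped
print(low, high, kept)
```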

In [None]:
for i in data_.columns[:-1]:
    title = i + " (without outliers)"
    fig = px.box(data_, x=i, color_discrete_sequence = ['red'])
    fig.update_layout(width=700,height=450, title_text=title, template='plotly_dark')
    fig.show()

# Feature Counts and Distributions

In [None]:
for column in data_.columns:
    fig = go.Figure()
    fig.add_trace(go.Histogram(x=data_[column],marker_color="#ccffff",opacity=0.8))
    fig.update_layout(
        width=700,height=450, 
        title_text=column,
        yaxis_title_text='count',
        bargap=0.05,
        template = 'plotly_dark')
    fig.show()
    if column != "DEATH_EVENT":
        fig = px.violin(data_, y=column, x="DEATH_EVENT",box=True, points="all",hover_data=data_.columns)
        fig.update_layout(
        width=700,height=450, 
        title_text=column,
        yaxis_title_text='distribution',
        template = 'plotly_dark')
        fig.show()

# Random Forests

*Decision-tree-based models don't need scaled input, so I only split the data into a training set and a test set.*
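
*Why scaling doesn't matter for trees: a split only asks whether a value is below a threshold, and standardization keeps the order of the values. The sketch below uses made-up data and hypothetical helpers (`gini`, `best_split_left_indices`, `standardize`) to show that the best single split sends the same samples left before and after scaling.*

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split_left_indices(xs, ys):
    """Indices sent left by the threshold with the lowest weighted Gini impurity."""
    best_score, best_left = None, None
    for t in sorted(set(xs))[:-1]:
        left = [i for i, x in enumerate(xs) if x <= t]
        right = [i for i, x in enumerate(xs) if x > t]
        score = (len(left) * gini([ys[i] for i in left])
                 + len(right) * gini([ys[i] for i in right])) / len(xs)
        if best_score is None or score < best_score:
            best_score, best_left = score, left
    return best_left

def standardize(xs):
    """Zero mean, unit variance (the same idea as StandardScaler)."""
    mean = sum(xs) / len(xs)
    std = (sum((x - mean) ** 2 for x in xs) / len(xs)) ** 0.5
    return [(x - mean) / std for x in xs]

ages = [40, 45, 50, 60, 65, 70, 80, 95]  # made-up feature values
died = [0, 0, 0, 0, 1, 0, 1, 1]          # made-up labels
raw_left = best_split_left_indices(ages, died)
scaled_left = best_split_left_indices(standardize(ages), died)
print(raw_left == scaled_left)  # the partition is unchanged by scaling
```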

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,y)

In [None]:
rfc = RandomForestClassifier(n_estimators=60, bootstrap=True)
rfc.fit(X_train, y_train)
y_pred = rfc.predict(X_test)
accuracy_rfc = accuracy_score(y_test, y_pred)*100
print("Random Forests accuracy on test set: ",accuracy_rfc,"%")

In [None]:
fig = px.imshow(confusion_matrix(y_test, y_pred),
                labels=dict(x="Predictions", y="True"),
                x=['Survived (0)','Death Event (1)'],
                y=['Survived (0)','Death Event (1)'],
               template="plotly_dark")
fig.update_layout(width=700, height=600)
fig.show()

# Gradient Boosting Classifier

### Finding optimal hyperparameter

*Before training the final model, I want to find the number of trees that maximizes accuracy, then retrain the model with n_estimators set to that number.*
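
*One detail when picking the number of trees from staged predictions: the first stage uses one tree, so a zero-based index has to be converted to a tree count. A minimal sketch with made-up accuracies and a hypothetical helper `best_n_trees`:*

```python
def best_n_trees(stage_accuracies):
    """Stage i (zero-based) uses i + 1 trees, so add 1 to the index of the best stage."""
    best_index = max(range(len(stage_accuracies)), key=stage_accuracies.__getitem__)
    return best_index + 1

stage_acc = [0.70, 0.74, 0.78, 0.78, 0.76]  # made-up accuracy after 1, 2, 3, ... trees
n = best_n_trees(stage_acc)
print("best number of trees:", n)
```

*Ties go to the earliest stage, which favors the smaller model.*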

In [None]:
gbc = GradientBoostingClassifier(max_depth=2, min_samples_split=3, n_estimators=150)
gbc.fit(X_train, y_train)

errors = [mean_squared_error(y_test, y_pred) for y_pred in gbc.staged_predict(X_test)]
acc = [accuracy_score(y_test, y_pred) for y_pred in gbc.staged_predict(X_test)]
# argmax returns a zero-based stage index; stage i uses i + 1 trees
best_n_estimators = np.argmax(acc) + 1

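# Refit with the same tree settings that were used to pick the number of trees
gbc = GradientBoostingClassifier(max_depth=2, min_samples_split=3, n_estimators=best_n_estimators)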
gbc.fit(X_train, y_train)
y_pred = gbc.predict(X_test)
accuracy_gbc = accuracy_score(y_test, y_pred)*100

print("Gradient Boosting Classifier Accuracy: ",accuracy_gbc,"%")

In [None]:
fig = px.line(x=list(range(1, len(errors) + 1)), y=errors, title='Validation error')
fig.update_layout(width=700,height=450,xaxis_title='Number of trees',yaxis_title='MSE',template="plotly_dark")
fig.show()

In [None]:
fig = px.line(x=list(range(1, len(acc) + 1)), y=acc, title='Validation Accuracy')
fig.update_layout(width=700,height=450,xaxis_title='Number of trees',yaxis_title='Accuracy',template="plotly_dark")
fig.show()

In [None]:
fig = px.imshow(confusion_matrix(y_test, y_pred),
                labels=dict(x="Predictions", y="True"),
                x=['Survived (0)','Death Event (1)'],
                y=['Survived (0)','Death Event (1)'],
               template="plotly_dark")
fig.update_layout(width=700, height=600)
fig.show()

# Neural Network

In [None]:
X_nn = data.drop("DEATH_EVENT",axis=1)
y_nn = data["DEATH_EVENT"]
X_train_nn, X_test_nn, y_train_nn, y_test_nn = train_test_split(X_nn, y_nn, shuffle=True)

scaler = StandardScaler()
X_train_nn = scaler.fit_transform(X_train_nn)
X_test_nn = scaler.transform(X_test_nn)

input_size = X_nn.shape[1]

model = Sequential()
model.add(Dense(units=input_size, input_dim=X_train_nn.shape[1], activation="relu"))
model.add(Dense(units=input_size, activation="relu"))
model.add(Dense(units=input_size, activation="relu"))
model.add(Dense(units=input_size, activation="relu"))
model.add(Dense(units=input_size, activation="relu"))
model.add(Dense(units=input_size // 2, activation="relu"))
model.add(Dense(units=1, activation="sigmoid"))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'] )
model.summary()

In [None]:
history = model.fit(X_train_nn, y_train_nn, epochs=100, verbose=True, validation_split=0.2)

In [None]:
print("Accuracy on validation set: ",history.history["val_accuracy"][-1]*100,"%")

In [None]:
y_pred_nn = model.predict(X_test_nn)
# Training accuracy per epoch, in percent
accuracy = np.array(history.history["accuracy"]) * 100
loss = history.history['loss']

# evaluate returns [loss, accuracy] for the compiled metrics
score = model.evaluate(X_test_nn, y_test_nn, steps=5)
accuracy_nn = score[1]*100
print()
print("Accuracy on test set: ",accuracy_nn,"%")

# Support Vector Classifier

In [None]:
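# Use a separate scaler so the one fitted for the neural network is not overwritten
svc_scaler = StandardScaler()
X_train_svc = svc_scaler.fit_transform(X_train)
X_test_svc = svc_scaler.transform(X_test)

svc = SVC(kernel="rbf")
svc.fit(X_train_svc, y_train)
y_pred = svc.predict(X_test_svc)
accuracy_svc = accuracy_score(y_test, y_pred)*100
print("SVC Accuracy on test set: ",accuracy_svc,"%")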

In [None]:
fig = px.histogram(y_test, template="plotly_dark", title="Counts of class samples in test set")
fig.update_layout(width=700,height=450, bargap=0.1)
fig.show()

In [None]:
fig = px.imshow(confusion_matrix(y_test, y_pred),
                labels=dict(x="Predictions", y="True"),
                x=['Survived (0)','Death Event (1)'],
                y=['Survived (0)','Death Event (1)'],
               template="plotly_dark")
fig.update_layout(width=700, height=600)
fig.show()

*The confusion matrix shows that the support vector classifier classifies class 0 samples well but performs poorly on class 1 samples, identifying only about 50% of them correctly. The classification report gives a more detailed picture.*
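
*The per-class figures discussed above can be read straight off a confusion matrix. The sketch below uses made-up counts (not the results of this notebook) and hypothetical helpers; it assumes scikit-learn's layout, where rows are true classes and columns are predicted classes.*

```python
def per_class_recall(cm):
    """Share of each true class that was predicted correctly (row-wise)."""
    return [row[i] / sum(row) for i, row in enumerate(cm)]

def per_class_precision(cm):
    """Share of each predicted class that was actually correct (column-wise)."""
    n = len(cm)
    return [cm[j][j] / sum(cm[i][j] for i in range(n)) for j in range(n)]

cm = [[48, 4],    # made-up counts: true class 0 -> predicted 0, predicted 1
      [10, 10]]   #                 true class 1 -> predicted 0, predicted 1
recall = per_class_recall(cm)
precision = per_class_precision(cm)
accuracy = (cm[0][0] + cm[1][1]) / sum(map(sum, cm))
print("recall:", recall)
print("precision:", precision)
print("accuracy:", accuracy)
```

*With counts like these, overall accuracy looks fine even though only half of the class 1 samples are found, which is why per-class recall is worth checking on an imbalanced dataset.*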

In [None]:
# print so the report's line breaks render as a table
print(classification_report(y_test, y_pred, target_names=["class 0","class 1"]))

# Comparison of Models

In [None]:
compare = pd.DataFrame(index=["Random Forests","Gradient Boosting Classifier","Neural Network","Support Vector Classifier"],
                       columns=["Accuracy"],
                      data=[accuracy_rfc,accuracy_gbc,accuracy_nn,accuracy_svc])
compare

In [None]:
fig = px.bar(compare,
             template="plotly_dark", 
             title="Comparison of models",
             color_discrete_sequence=["darkviolet"])
fig.update_layout(width=700, height=450, xaxis_title="Models", yaxis_title="Accuracy")
fig.update(layout_showlegend=False)
fig.show()