## Introduction

In this notebook, we will predict death events (`DEATH_EVENT`) among heart failure patients from the given `csv` dataset.

In [1]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Load dataset

The dataset consists of 13 columns: 12 clinical features of heart failure patients and the target, `DEATH_EVENT`. In the following section, we will load and explore the dataset.

In [2]:
df = pd.read_csv('heart_failure_clinical_records_dataset.csv')

In [4]:
df.head()

,age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking,time,DEATH_EVENT
0,75.0,0,582,0,20,1,265000.0,1.9,130,1,0,4,1
1,55.0,0,7861,0,38,0,263358.03,1.1,136,1,0,6,1
2,65.0,0,146,0,20,0,162000.0,1.3,129,1,1,7,1
3,50.0,1,111,0,20,0,210000.0,1.9,137,1,0,7,1
4,65.0,1,160,1,20,0,327000.0,2.7,116,0,0,8,1


In [5]:
df.describe()

,age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking,time,DEATH_EVENT
count,299.0,299.0,299.0,299.0,299.0,299.0,299.0,299.0,299.0,299.0,299.0,299.0,299.0
mean,60.833893,0.431438,581.839465,0.41806,38.083612,0.351171,263358.029264,1.39388,136.625418,0.648829,0.32107,130.26087,0.32107
std,11.894809,0.496107,970.287881,0.494067,11.834841,0.478136,97804.236869,1.03451,4.412477,0.478136,0.46767,77.614208,0.46767
min,40.0,0.0,23.0,0.0,14.0,0.0,25100.0,0.5,113.0,0.0,0.0,4.0,0.0
25%,51.0,0.0,116.5,0.0,30.0,0.0,212500.0,0.9,134.0,0.0,0.0,73.0,0.0
50%,60.0,0.0,250.0,0.0,38.0,0.0,262000.0,1.1,137.0,1.0,0.0,115.0,0.0
75%,70.0,1.0,582.0,1.0,45.0,1.0,303500.0,1.4,140.0,1.0,1.0,203.0,1.0
max,95.0,1.0,7861.0,1.0,80.0,1.0,850000.0,9.4,148.0,1.0,1.0,285.0,1.0


In [6]:
df.isnull().sum()

age                         0
anaemia                     0
creatinine_phosphokinase    0
diabetes                    0
ejection_fraction           0
high_blood_pressure         0
platelets                   0
serum_creatinine            0
serum_sodium                0
sex                         0
smoking                     0
time                        0
DEATH_EVENT                 0
dtype: int64

There are no null values in the dataset.

## Data preprocessing

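The dataset has no missing values, so no imputation is needed. The classes are not balanced, however: the mean of `DEATH_EVENT` is about 0.32, so roughly one third of the patients belong to the positive class. Because the features span very different ranges (for example, `platelets` versus `serum_creatinine`), we standardize them using `StandardScaler`. Note that `scaled_data` is not used in the cells below; the models are trained on the unscaled values.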

In [68]:
from sklearn.preprocessing import StandardScaler

# standardize every feature column (all columns except the last one, DEATH_EVENT)
scaler = StandardScaler()
scaled_data = scaler.fit_transform(df.values[:, :-1])
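
For reference, standardization rescales each column to zero mean and unit variance. The following minimal sketch reproduces that computation with NumPy on a small, hypothetical feature matrix (the values are made up for illustration). It uses the population standard deviation, which is what `StandardScaler` computes.

```python
import numpy as np

# Toy feature matrix (hypothetical values): two columns on very different scales.
features = np.array([[40.0, 150000.0],
                     [60.0, 250000.0],
                     [80.0, 350000.0]])

# Column-wise mean and population standard deviation (ddof=0).
mean = features.mean(axis=0)
std = features.std(axis=0)

# z-score: subtract the column mean, divide by the column standard deviation.
standardized = (features - mean) / std

print(standardized)
```

After this transformation both columns are on the same scale, regardless of their original units.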

Now, let's split the dataset into a train set and a test set.

In [77]:
# split input data and target label
X, y = df.values[::, :-1], df.values[::, -1]

In [78]:
from sklearn.model_selection import train_test_split

# splitting data into train set (X, Y) and test set (x, y)
# train_test_split returns: train inputs, test inputs, train labels, test labels
X, x, Y, y = train_test_split(X, y, test_size=0.33, random_state=42)
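
Since only about a third of the records are positive, it may be worth keeping the class ratio the same in both splits. A minimal sketch, assuming synthetic stand-in data and descriptive variable names (`X_train`, `X_test`, `y_train`, `y_test`) that are not used elsewhere in this notebook:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 300 rows, 12 features, one third positives.
rng = np.random.default_rng(42)
features = rng.normal(size=(300, 12))
labels = np.array([1] * 100 + [0] * 200)

# stratify=labels keeps the positive rate nearly identical in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.33, random_state=42, stratify=labels
)

print(y_train.mean(), y_test.mean())
```

Without `stratify`, the positive rate of a small test set can drift away from that of the full dataset, which makes accuracy harder to compare.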

## Model selection

Here, we apply various models to classify heart failure patients and report each model's accuracy on the test set.

In [79]:
from sklearn.metrics import accuracy_score
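
Accuracy is easier to interpret next to a trivial baseline. The sketch below estimates the accuracy of always predicting the majority class, using only the row count and the `DEATH_EVENT` mean from the `df.describe()` output above. The baseline on the actual test split may differ slightly.

```python
# Values read from the df.describe() output above.
n_rows = 299
positive_rate = 0.32107  # mean of DEATH_EVENT

# Recover the class counts from the mean of a 0/1 column.
n_positive = round(n_rows * positive_rate)
n_negative = n_rows - n_positive

# Accuracy of a trivial model that always predicts the majority class.
baseline_accuracy = max(n_positive, n_negative) / n_rows

print(n_positive, n_negative)
print(round(baseline_accuracy, 3))
```

A model whose accuracy is close to or below this value has learned little beyond the class frequencies.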

### Decision Tree

In [80]:
from sklearn.tree import DecisionTreeClassifier

In [81]:
tree_model = DecisionTreeClassifier().fit(X, Y)

In [82]:
pred_tree_y = tree_model.predict(x)

In [83]:
tree_accuracy = accuracy_score(y, pred_tree_y)
print(tree_accuracy)

0.6868686868686869


### Naive Bayes

In [84]:
from sklearn.naive_bayes import MultinomialNB

naive_model = MultinomialNB().fit(X, Y)
pred_naive_y = naive_model.predict(x)
naive_accuracy = accuracy_score(y, pred_naive_y)
print(naive_accuracy)

0.7171717171717171
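
`MultinomialNB` is designed for non-negative, count-like features. It runs here because the raw features are all non-negative, but it would reject standardized inputs. As a hedged alternative (not what this notebook uses), `GaussianNB` handles continuous features directly. A minimal sketch on synthetic stand-in data:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features (hypothetical data, not the dataset).
rng = np.random.default_rng(0)
features = rng.normal(loc=50.0, scale=10.0, size=(200, 4))
labels = (features[:, 0] > 50.0).astype(int)

# Standardizing centers each column at zero, so many values become negative.
scaled = StandardScaler().fit_transform(features)

# GaussianNB models each feature as a per-class Gaussian, so negatives are fine.
gaussian_model = GaussianNB().fit(scaled, labels)
gaussian_accuracy = gaussian_model.score(scaled, labels)

# MultinomialNB raises a ValueError when it sees negative feature values.
try:
    MultinomialNB().fit(scaled, labels)
    multinomial_rejected = False
except ValueError:
    multinomial_rejected = True

print(gaussian_accuracy, multinomial_rejected)
```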


### Ridge Classifier

In [85]:
from sklearn.linear_model import RidgeClassifier

ridge_model = RidgeClassifier().fit(X, Y)
pred_ridge_y = ridge_model.predict(x)
ridge_accuracy = accuracy_score(y, pred_ridge_y)
print(ridge_accuracy)

0.7676767676767676


### Nearest Neighbor

In [86]:
from sklearn.neighbors import KNeighborsClassifier

neighbor_model = KNeighborsClassifier().fit(X, Y)
pred_neighbor_y = neighbor_model.predict(x)
neighbor_accuracy = accuracy_score(y, pred_neighbor_y)
print(neighbor_accuracy)

0.5555555555555556
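
Nearest-neighbor models are distance-based, so a feature with a large numeric range can dominate the distance. That may be one reason for the low score here, since the model was trained on unscaled features. The sketch below illustrates the effect on synthetic data (one informative small-scale feature and one large-scale noise feature, both hypothetical) and combines `StandardScaler` with `KNeighborsClassifier` in a pipeline:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_samples = 400
labels = rng.integers(0, 2, size=n_samples)

# Informative feature: class means at -2 and +2, small numeric range.
informative = rng.normal(loc=labels * 4.0 - 2.0, scale=1.0)
# Pure noise feature with a very large numeric range.
noise = rng.normal(loc=0.0, scale=1e5, size=n_samples)
features = np.column_stack([informative, noise])

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.33, random_state=42
)

# KNN on raw features: distances are dominated by the noise column.
raw_knn = KNeighborsClassifier().fit(X_train, y_train)
# KNN after standardization: the scaler is fitted on the training data only.
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier()).fit(X_train, y_train)

raw_accuracy = raw_knn.score(X_test, y_test)
scaled_accuracy = scaled_knn.score(X_test, y_test)
print(raw_accuracy, scaled_accuracy)
```

Wrapping the scaler in a pipeline also avoids fitting it on test rows, which would leak information from the test set.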


### Random Forest

In [87]:
from sklearn.ensemble import RandomForestClassifier

forest_model = RandomForestClassifier().fit(X, Y)
pred_forest_y = forest_model.predict(x)
forest_accuracy = accuracy_score(y, pred_forest_y)
print(forest_accuracy)

0.7373737373737373


### Gradient Boost

In [88]:
from sklearn.ensemble import GradientBoostingClassifier

gb_model = GradientBoostingClassifier().fit(X, Y)
pred_gb_y = gb_model.predict(x)
gb_accuracy = accuracy_score(y, pred_gb_y)
print(gb_accuracy)

0.7272727272727273


### Ada Boost

In [89]:
from sklearn.ensemble import AdaBoostClassifier

ada_model = AdaBoostClassifier().fit(X, Y)
pred_ada_y = ada_model.predict(x)
ada_accuracy = accuracy_score(y, pred_ada_y)
print(ada_accuracy)

0.7373737373737373


### Bagging 

In [90]:
from sklearn.ensemble import BaggingClassifier

bag_model = BaggingClassifier().fit(X, Y)
pred_bag_y = bag_model.predict(x)
bag_accuracy = accuracy_score(y, pred_bag_y)
print(bag_accuracy)

0.7070707070707071


### SGD 

In [91]:
from sklearn.linear_model import SGDClassifier

sgd_model = SGDClassifier().fit(X, Y)
pred_sgd_y = sgd_model.predict(x)
sgd_accuracy = accuracy_score(y, pred_sgd_y)
print(sgd_accuracy)

0.42424242424242425


### Linear SVC

In [92]:
from sklearn.svm import LinearSVC

svc_model = LinearSVC().fit(X, Y)
pred_svc_model = svc_model.predict(x)
svc_accuracy = accuracy_score(y, pred_svc_model)
print(svc_accuracy)

0.42424242424242425




### Neural Nets

In [93]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation

In [94]:
nn_model = Sequential()

nn_model.add(Dense(12))  # first layer has no activation, so it is linear
nn_model.add(Dense(128, activation='relu'))
nn_model.add(Dense(64, activation='relu'))
nn_model.add(Dense(16, activation='relu'))
nn_model.add(Dense(1, activation='sigmoid'))

In [95]:
nn_model.compile(loss='binary_crossentropy',
                optimizer='adam',
                metrics=['accuracy'])

In [96]:
nn_model.fit(X, Y, batch_size=20, epochs=50)

Train on 200 samples
Epoch 1/50
...
Epoch 50/50

<tensorflow.python.keras.callbacks.History at 0x7f7c9c3baa90>

In [97]:
pred_nn_y = nn_model.predict(x)
pred_nn_y = (pred_nn_y > 0.5)
nn_accuracy = accuracy_score(y, pred_nn_y)
print(nn_accuracy)

0.42424242424242425


## Comparisons

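Let us compare the test accuracy of the models trained above. The scores range from 0.424 (`SGDClassifier`, `LinearSVC`, and the neural network, which all tie) to 0.768 (`RidgeClassifier`), with the tree-based ensembles clustered between 0.707 and 0.737.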
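
The accuracies printed in the sections above can be collected and ranked in a few lines. This is a minimal sketch: the values are copied from the outputs above and rounded to four decimals, and the dictionary name `accuracies` is introduced here for illustration.

```python
# Test accuracies copied from the outputs above (rounded to four decimals).
accuracies = {
    "Decision Tree": 0.6869,
    "Naive Bayes": 0.7172,
    "Ridge Classifier": 0.7677,
    "Nearest Neighbor": 0.5556,
    "Random Forest": 0.7374,
    "Gradient Boost": 0.7273,
    "Ada Boost": 0.7374,
    "Bagging": 0.7071,
    "SGD": 0.4242,
    "Linear SVC": 0.4242,
    "Neural Net": 0.4242,
}

# Sort from best to worst; ties keep their original order.
ranked = sorted(accuracies.items(), key=lambda item: item[1], reverse=True)

for name, accuracy in ranked:
    print(f"{name:<18}{accuracy:.4f}")
```

In the notebook itself, the same dictionary could be built from the `*_accuracy` variables instead of hard-coded numbers.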

## Conclusion

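`RidgeClassifier` achieved the best test accuracy, ahead of `RandomForestClassifier`, even though the models were trained on the unscaled features. The accuracy can likely be improved further, for example by training on the standardized features, tuning hyperparameters, or evaluating with cross-validation instead of a single train/test split.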
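
One way to get a more reliable accuracy estimate than a single split is k-fold cross-validation. The following is a minimal sketch on synthetic data generated with `make_classification` (a stand-in, not the heart failure dataset), using a scaler and `RidgeClassifier` in one pipeline:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the same number of features as the notebook's data.
features, labels = make_classification(n_samples=300, n_features=12, random_state=42)

# The scaler is refitted inside each fold, so no test fold leaks into training.
model = make_pipeline(StandardScaler(), RidgeClassifier())

# Five-fold cross-validation; each score is the accuracy on one held-out fold.
scores = cross_val_score(model, features, labels, cv=5, scoring="accuracy")

print(scores.mean(), scores.std())
```

The spread of the fold scores gives a rough idea of how much a single-split accuracy can vary on a dataset of this size.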