# Classification
In this project, you will use a dataset from Kaggle to predict the survival of patients with heart failure from serum creatinine and ejection fraction, and other factors such as age, anemia, diabetes, and so on.

Cardiovascular diseases (CVDs) are the number 1 cause of death globally, taking an estimated 17.9 million lives each year, which accounts for 31% of all deaths worldwide. Heart failure is a common event caused by CVDs, and this dataset contains 12 features that can be used to predict mortality by heart failure.

Most cardiovascular diseases can be prevented by addressing behavioral risk factors such as tobacco use, unhealthy diet and obesity, physical inactivity, and harmful alcohol use using population-wide strategies.

People with cardiovascular disease or who are at high cardiovascular risk (due to the presence of one or more risk factors such as hypertension, diabetes, hyperlipidemia, or already established disease) need early detection and management wherein a machine learning model can be of great help.

In [2]:
#1 Loading data
import pandas as pd
data=pd.read_csv('heart_failure.csv')

#2 Inspecting data
data.info()

#3 Inspecting the distribution of specific column
from collections import Counter
print('Classes and number of values in the dataset', Counter(data['death_event']))

#4 Extracting data into y variable
y=data['death_event']

#5 Extracting data into x variables
x=data[['age','anaemia','creatinine_phosphokinase','diabetes','ejection_fraction','high_blood_pressure','platelets','serum_creatinine','serum_sodium','sex','smoking','time']]

#Data Preprocessing
#6 Converting categorical features in data into one-hot encoding vectors
x = pd.get_dummies(x)

#7 Splitting data
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size=0.3, random_state=0)

#8 Initialising a Column Transformer object to scale numeric features
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer([('numeric', StandardScaler(), ['age','creatinine_phosphokinase','ejection_fraction','platelets','serum_creatinine','serum_sodium','time'])])

#9 Training the scaler instance on training data and assigning it back to X_train
X_train = ct.fit_transform(X_train)

#10 Do the same as #9 with X_test
X_test = ct.fit_transform(X_test)

#Prepare labels for classification
#11 Initialising an instance of LabelEncoder
le = LabelEncoder()

#12 Fitting training labels of Y_train
Y_train = le.fit_transform(Y_train.astype(str))

#13 Doing the same for Y_test
Y_test = le.fit_transform(Y_test.astype(str))

#14 and #15 Transforming encoded training labels into binary vectors
from tensorflow.keras.utils import to_categorical
Y_train = to_categorical(Y_train)
Y_test = to_categorical(Y_test)

#Design the Model
#16 Initialising the model instance
from tensorflow.keras.models import Sequential
model = Sequential()

#17 Creating an input layer instance and adding it into the model via Model.add()
from tensorflow.keras.layers import Dense, InputLayer
model.add(InputLayer(input_shape=(X_train.shape[1],)))

#18 Creating a hidden layer
model.add(Dense(12, activation='relu'))

#19 Creating an output layer
model.add(Dense(2, activation='softmax'))

#20 Compiling model instance
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

#Train and Evaluate the model
#21 Fitting the model instance to both training data X_train and Y_train
model.fit(X_train, Y_train, epochs=100, batch_size=16, verbose=1)

#22 Evaluating the trained model of X_test and Y_test and assign result into loss and acc variable
loss, acc = model.evaluate(X_test, Y_test, verbose=0)
print('Loss:', loss, 'Accuracy:', acc)

#Generating a classification report
#23 Getting predictions from the test data X_test and the model
y_estimate = model.predict(X_test, verbose=0)

#24 Using numpy.argmax to select indices of each true classes for y_estimate
import numpy as np
y_estimate = np.argmax(y_estimate, axis=1)

#25 Using numpy.argmax to select indices of each true classes for y_true
y_true = np.argmax(Y_test, axis=1)

#26 
from sklearn.metrics import classification_report
print(classification_report(y_true, y_estimate))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 299 entries, 0 to 298
Data columns (total 15 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Unnamed: 0                299 non-null    int64  
 1   age                       299 non-null    float64
 2   anaemia                   299 non-null    object 
 3   creatinine_phosphokinase  299 non-null    int64  
 4   diabetes                  299 non-null    object 
 5   ejection_fraction         299 non-null    int64  
 6   high_blood_pressure       299 non-null    object 
 7   platelets                 299 non-null    float64
 8   serum_creatinine          299 non-null    float64
 9   serum_sodium              299 non-null    int64  
 10  sex                       299 non-null    object 
 11  smoking                   299 non-null    object 
 12  time                      299 non-null    int64  
 13  DEATH_EVENT               299 non-null    int64  
 14  death_even

Epoch 72/100
Epoch 73/100
Epoch 74/100
Epoch 75/100
Epoch 76/100
Epoch 77/100
Epoch 78/100
Epoch 79/100
Epoch 80/100
Epoch 81/100
Epoch 82/100
Epoch 83/100
Epoch 84/100
Epoch 85/100
Epoch 86/100
Epoch 87/100
Epoch 88/100
Epoch 89/100
Epoch 90/100
Epoch 91/100
Epoch 92/100
Epoch 93/100
Epoch 94/100
Epoch 95/100
Epoch 96/100
Epoch 97/100
Epoch 98/100
Epoch 99/100
Epoch 100/100
Loss: 0.4316433072090149 Accuracy: 0.8111110925674438
              precision    recall  f1-score   support

           0       0.86      0.87      0.86        62
           1       0.70      0.68      0.69        28

    accuracy                           0.81        90
   macro avg       0.78      0.77      0.78        90
weighted avg       0.81      0.81      0.81        90

