# Predicting Heart Failure Mortality using Machine Learning

In this project, I use machine learning to predict the mortality risk of patients with heart failure.

### Project Overview

Cardiovascular diseases (CVDs) claim millions of lives annually, making them a global health concern. Heart failure is a common consequence of CVDs, and identifying high-risk patients early creates an opportunity for intervention. In this project, I use a dataset from Kaggle with 12 features, such as age, anaemia, and serum creatinine, to develop a machine learning model that predicts mortality from these factors.

### Dataset Information

- **Number of Entries:** 299
- **Features:** Age, Anaemia, Creatinine Phosphokinase, Diabetes, Ejection Fraction, High Blood Pressure, Platelets, Serum Creatinine, Serum Sodium, Sex, Smoking, Time.
- **Target Variable:** DEATH_EVENT (0: No Death, 1: Death)

### Significance

As cardiovascular diseases remain a leading cause of death globally, there's a critical need for predictive models to aid in early detection. This project aims to contribute to this goal by training a neural network model to identify individuals at high risk.

### Model Training

I trained a neural network on all 12 features, including age, ejection fraction, and serum creatinine. The model is evaluated on a held-out test set (20% of the data) using accuracy, precision, recall, and F1-score.

### Final Model Performance

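After 100 epochs of training, the model achieved an accuracy of approximately 92.47% on the training set and 73.33% on the test set. The gap of roughly 19 percentage points suggests that the model overfits the small training set. On the test set, the death class reaches a precision of 0.80 but a recall of only 0.48, which means the model misses about half of the patients who died. The full precision, recall, and F1-score figures are given in the classification report at the end.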


----------------

In [37]:
# Importing the libraries
import pandas as pd
from collections import Counter
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import InputLayer, Dense
from tensorflow.keras.utils import to_categorical
import numpy as np
from sklearn.metrics import classification_report

In [38]:
# Load the data
data = pd.read_csv('data/heart_failure_clinical_records_dataset.csv')
data

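| | age | anaemia | creatinine_phosphokinase | diabetes | ejection_fraction | high_blood_pressure | platelets | serum_creatinine | serum_sodium | sex | smoking | time | DEATH_EVENT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 75.0 | 0 | 582 | 0 | 20 | 1 | 265000.00 | 1.9 | 130 | 1 | 0 | 4 | 1 |
| 1 | 55.0 | 0 | 7861 | 0 | 38 | 0 | 263358.03 | 1.1 | 136 | 1 | 0 | 6 | 1 |
| 2 | 65.0 | 0 | 146 | 0 | 20 | 0 | 162000.00 | 1.3 | 129 | 1 | 1 | 7 | 1 |
| 3 | 50.0 | 1 | 111 | 0 | 20 | 0 | 210000.00 | 1.9 | 137 | 1 | 0 | 7 | 1 |
| 4 | 65.0 | 1 | 160 | 1 | 20 | 0 | 327000.00 | 2.7 | 116 | 0 | 0 | 8 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 294 | 62.0 | 0 | 61 | 1 | 38 | 1 | 155000.00 | 1.1 | 143 | 1 | 1 | 270 | 0 |
| 295 | 55.0 | 0 | 1820 | 0 | 38 | 0 | 270000.00 | 1.2 | 139 | 0 | 0 | 271 | 0 |
| 296 | 45.0 | 0 | 2060 | 1 | 60 | 0 | 742000.00 | 0.8 | 138 | 0 | 0 | 278 | 0 |
| 297 | 45.0 | 0 | 2413 | 0 | 38 | 0 | 140000.00 | 1.4 | 140 | 1 | 1 | 280 | 0 |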


In [39]:
# Display information about the DataFrame
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 299 entries, 0 to 298
Data columns (total 13 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   age                       299 non-null    float64
 1   anaemia                   299 non-null    int64  
 2   creatinine_phosphokinase  299 non-null    int64  
 3   diabetes                  299 non-null    int64  
 4   ejection_fraction         299 non-null    int64  
 5   high_blood_pressure       299 non-null    int64  
 6   platelets                 299 non-null    float64
 7   serum_creatinine          299 non-null    float64
 8   serum_sodium              299 non-null    int64  
 9   sex                       299 non-null    int64  
 10  smoking                   299 non-null    int64  
 11  time                      299 non-null    int64  
 12  DEATH_EVENT               299 non-null    int64  
dtypes: float64(3), int64(10)
memory usage: 30.5 KB


In [40]:
# Check the distribution of the target variable
death_event_distribution = Counter(data['DEATH_EVENT'])
print("Death Event Distribution:", death_event_distribution)

Death Event Distribution: Counter({0: 203, 1: 96})
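The counts above show that the classes are imbalanced, which matters when reading any accuracy figure. As a minimal sketch (the counts are copied from the output above, and the name `majority_baseline` is my own), this is the accuracy a trivial model would get on the full dataset by always predicting the most common class:

```python
from collections import Counter

# Class counts copied from the Counter output above.
death_event_distribution = Counter({0: 203, 1: 96})

total = sum(death_event_distribution.values())

# Accuracy of a trivial model that always predicts the most common class.
majority_class, majority_count = death_event_distribution.most_common(1)[0]
majority_baseline = majority_count / total

print(f"Share of deaths: {death_event_distribution[1] / total:.3f}")
print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")
```

About 32% of the patients died, so a model needs to beat roughly 68% accuracy on this dataset before it adds anything over guessing "no death" every time.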


In [41]:
# Extract the target variable (label)
y = data['DEATH_EVENT']

# Extract features
x = data[['age', 'anaemia', 'creatinine_phosphokinase', 'diabetes', 'ejection_fraction', 
          'high_blood_pressure', 'platelets', 'serum_creatinine', 'serum_sodium', 
          'sex', 'smoking', 'time']]

## Data Preprocessing

In [42]:
# Perform one-hot encoding for categorical features
x = pd.get_dummies(x, columns=['sex', 'smoking', 'anaemia', 'diabetes', 'high_blood_pressure'])
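To make the effect of this step concrete, here is a small sketch on a toy frame (the values are taken from the first rows of the dataset preview; the frame itself and the names `toy`, `encoded`, and `compact` are illustrative). It also shows `drop_first=True`, which I am assuming would be an acceptable alternative here because a 0/1 column already carries all the information in a single indicator:

```python
import pandas as pd

# Toy frame with values taken from the first rows of the dataset preview.
toy = pd.DataFrame({
    "age": [75.0, 55.0, 65.0, 65.0],
    "sex": [1, 1, 1, 0],
    "smoking": [0, 0, 1, 0],
})

# Same call style as the notebook: each listed column becomes one column per value.
encoded = pd.get_dummies(toy, columns=["sex", "smoking"])
print(list(encoded.columns))

# Assumption: for 0/1 columns a single indicator is enough, so drop the first level.
compact = pd.get_dummies(toy, columns=["sex", "smoking"], drop_first=True)
print(list(compact.columns))
```

Applied to the five binary columns of the full dataset, the notebook's call produces ten indicator columns, so together with the seven numeric columns the model receives 17 inputs.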

In [43]:
# Split the data into training and testing sets
X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size=0.2, random_state=42)
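The split above is random, so the class ratio in the test set can drift from the ratio in the full data. The classification report further down lists 25 deaths among 60 test patients (about 42%), compared with about 32% overall. A possible variation, not what this notebook ran, is a stratified split. The sketch below uses stand-in arrays (`x_demo`, `y_demo`) with the same class counts as the dataset to show the effect of the `stratify` argument:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in labels with the same class counts as the dataset (203 survivors, 96 deaths).
y_demo = np.array([0] * 203 + [1] * 96)
# Placeholder feature matrix; only its length matters for this illustration.
x_demo = np.arange(len(y_demo)).reshape(-1, 1)

# stratify=y_demo asks for roughly the same class ratio in both splits.
_, _, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("Train death share:", round(float(y_tr.mean()), 3))
print("Test death share:", round(float(y_te.mean()), 3))
```

With stratification both shares stay close to 0.32. Switching the notebook to this split would change the reported metrics, so the numbers below would need to be regenerated.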

In [44]:
# Standardize numeric features using StandardScaler
numeric_cols = ['age', 'creatinine_phosphokinase', 'ejection_fraction', 'platelets',
                'serum_creatinine', 'serum_sodium', 'time']
# Work on copies so the assignments below do not write into slices of x
X_train, X_test = X_train.copy(), X_test.copy()
ct = StandardScaler()
# Fit on the training set only, then reuse the same statistics for the test set
X_train[numeric_cols] = ct.fit_transform(X_train[numeric_cols])
X_test[numeric_cols] = ct.transform(X_test[numeric_cols])

# Convert features to array format
X_train = np.asarray(X_train).astype('float32')
X_test = np.asarray(X_test).astype('float32')
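The reason for calling `fit_transform` on the training set and only `transform` on the test set is that the mean and standard deviation must come from training data alone. A minimal NumPy sketch of the same arithmetic for one column (the ages come from the dataset preview, but the train/test grouping here is purely illustrative):

```python
import numpy as np

# Ages from the dataset preview; the train/test grouping here is illustrative only.
train_age = np.array([75.0, 55.0, 65.0, 50.0, 65.0])
test_age = np.array([62.0, 45.0])

# "fit": learn the mean and population standard deviation from training data only.
mean = train_age.mean()
std = train_age.std()  # ddof=0, the convention StandardScaler uses

# "transform": apply the same training statistics to both sets.
train_scaled = (train_age - mean) / std
test_scaled = (test_age - mean) / std

print(train_scaled.round(3))
print(test_scaled.round(3))
```

The scaled training column has mean 0 and standard deviation 1 by construction. The test column generally does not, which is expected.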

## Prepare Labels for Classification

In [45]:
# Encode labels using LabelEncoder
le = LabelEncoder()
Y_train = le.fit_transform(Y_train.astype(str))
Y_test = le.transform(Y_test.astype(str))

# Convert labels to categorical format
Y_train = to_categorical(Y_train, dtype = 'int64')
Y_test = to_categorical(Y_test, dtype = 'int64')
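For readers unfamiliar with the categorical format, this NumPy sketch shows what the conversion amounts to for two classes. It is an equivalent illustration rather than the Keras implementation, and the example `labels` array is made up:

```python
import numpy as np

# Integer class labels as produced by the label-encoding step (0 = survived, 1 = died).
labels = np.array([1, 1, 0, 0, 1])

# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(2, dtype="int64")[labels]
print(one_hot)

# argmax along axis 1 recovers the original integer labels.
recovered = np.argmax(one_hot, axis=1)
print(recovered)
```

The same `argmax` trick is used later in the notebook to turn both the one-hot test labels and the softmax outputs back into class indices.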

## Build, Train, & Evaluate the Model

In [46]:
# Build the neural network model
model = Sequential()
model.add(InputLayer(input_shape=(X_train.shape[1],)))
model.add(Dense(12, activation='relu'))
model.add(Dense(2, activation='softmax'))
model.compile(loss='categorical_crossentropy', 
              optimizer='adam', 
              metrics=['accuracy'])

# Train the model
model.fit(X_train, Y_train, epochs=100, batch_size=16)

Epoch 1/100
...
Epoch 78/100
...

<keras.src.callbacks.History at 0x1d2f9b69c10>
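To give a sense of how small this network is, the sketch below counts its trainable parameters and runs a forward pass in plain NumPy. The input width of 17 is my assumption (7 numeric columns plus two indicator columns for each of the 5 binary features), and the weights are random stand-ins, not the trained ones:

```python
import numpy as np

n_features = 17   # assumption: 7 numeric columns + 2 indicators for each of 5 binary columns
n_hidden = 12     # units in the hidden Dense layer
n_classes = 2     # units in the softmax output layer

# Each Dense layer has one weight per input-unit pair plus one bias per unit.
hidden_params = n_features * n_hidden + n_hidden
output_params = n_hidden * n_classes + n_classes
print("Trainable parameters:", hidden_params + output_params)

# Random stand-in weights, only to show the shapes and the softmax normalization.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(n_features, n_hidden)), np.zeros(n_hidden)
W2, b2 = rng.normal(size=(n_hidden, n_classes)), np.zeros(n_classes)

x = rng.normal(size=(4, n_features))           # a batch of 4 fake patients
h = np.maximum(x @ W1 + b1, 0.0)               # ReLU hidden layer
logits = h @ W2 + b2
logits -= logits.max(axis=1, keepdims=True)    # stabilize the exponentials
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
print(probs.sum(axis=1))                       # each row sums to 1
```

Under that assumption the model has 242 parameters, roughly one per training example (about 239 rows after the 80/20 split), so some overfitting is not surprising.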

In [47]:
# Evaluate the model on the test set
loss, acc = model.evaluate(X_test, Y_test)
print("Test Loss:", loss)
print("Test Accuracy:", acc)

Test Loss: 0.6819261908531189
Test Accuracy: 0.7333333492279053


## Generate a Classification Report

In [48]:
# Predict using the trained model
y_estimate = model.predict(X_test)
y_estimate = np.argmax(y_estimate, axis=1)
y_true = np.argmax(Y_test, axis=1)

# Generate classification report
print(classification_report(y_true, y_estimate))

              precision    recall  f1-score   support

           0       0.71      0.91      0.80        35
           1       0.80      0.48      0.60        25

    accuracy                           0.73        60
   macro avg       0.76      0.70      0.70        60
weighted avg       0.75      0.73      0.72        60
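The report can be tied back to raw counts. The confusion matrix below is not an output of the notebook. I inferred it from the supports and recalls above (32 of 35 survivors and 12 of 25 deaths predicted correctly), and the sketch checks that it reproduces the reported figures:

```python
import numpy as np

# Confusion matrix inferred from the report (rows: true class, columns: predicted class).
# This is a reconstruction, not an output of the notebook.
cm = np.array([[32, 3],
               [13, 12]])

accuracy = np.trace(cm) / cm.sum()
precision = np.diag(cm) / cm.sum(axis=0)   # correct predictions / all predictions per class
recall = np.diag(cm) / cm.sum(axis=1)      # correct predictions / all true members per class
f1 = 2 * precision * recall / (precision + recall)

print("accuracy :", round(float(accuracy), 2))
print("precision:", precision.round(2))
print("recall   :", recall.round(2))
print("f1-score :", f1.round(2))
```

Read this way, the model missed 13 of the 25 patients who died. In a screening setting those false negatives are the costly errors, so recall on class 1 is the number to improve first.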

