# Cardiovascular Diseases (CVDs) Detection with Deep Learning

## Introduction

Cardiovascular diseases (CVDs) are the leading cause of death globally, accounting for approximately 31% of all deaths worldwide. Among the many manifestations of CVDs, heart failure is a significant contributor, causing widespread mortality and reduced quality of life.

This project leverages a clinical dataset from Kaggle to develop a machine learning model that predicts patient survival after heart failure. By analyzing features such as age, serum creatinine, ejection fraction, and comorbidities (e.g., anemia, diabetes), this model aims to assist healthcare professionals in identifying high-risk patients and improving treatment strategies.

## Dataset Overview

The dataset used in this project is publicly available on Kaggle: [Heart Failure Clinical Records Dataset](https://www.kaggle.com/datasets/andrewmvd/heart-failure-clinical-data). It contains records for 299 patients, each described by 12 clinical features and one outcome label, including:
- Demographics (e.g., age, sex)
- Laboratory findings (e.g., serum creatinine, serum sodium)
- Clinical history (e.g., presence of diabetes, high blood pressure)
- Outcome (survival status)

### Acknowledgments

**Citation**:  
Davide Chicco, Giuseppe Jurman: *Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone*. BMC Medical Informatics and Decision Making 20, 16 (2020).  
[Read the publication here](https://bmcmedinformdecismak.biomedcentral.com/articles/10.1186/s12911-020-1023-5).

**License**:  
The dataset is shared under the **CC BY 4.0** license, allowing for sharing and adaptation with appropriate credit.

## Objective

The objective of this project is to:
1. Perform exploratory data analysis to understand the dataset.
2. Build and evaluate machine learning models to predict survival outcomes.
3. Provide insights into the most influential factors contributing to heart failure mortality.

## Why This Matters

With the growing prevalence of CVDs, early detection and proactive management are critical. By addressing this challenge with data-driven methods, this project contributes to the broader goal of reducing preventable deaths and enhancing patient care.



### Data Loading and Analysis

In [97]:
# Libraries Import

import pandas as pd
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from collections import Counter
from sklearn.compose import ColumnTransformer
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, InputLayer
from sklearn.metrics import classification_report
from tensorflow.keras.utils import to_categorical
import numpy as np

In [98]:
# Load the dataset into a DataFrame
data = pd.read_csv("heart_failure_clinical_records_dataset.csv")

# Print the columns and their types
data.info()  # every column is already numeric (binary variables are stored as 0/1), so OneHotEncoder is not needed

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 299 entries, 0 to 298
Data columns (total 13 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   age                       299 non-null    float64
 1   anaemia                   299 non-null    int64  
 2   creatinine_phosphokinase  299 non-null    int64  
 3   diabetes                  299 non-null    int64  
 4   ejection_fraction         299 non-null    int64  
 5   high_blood_pressure       299 non-null    int64  
 6   platelets                 299 non-null    float64
 7   serum_creatinine          299 non-null    float64
 8   serum_sodium              299 non-null    int64  
 9   sex                       299 non-null    int64  
 10  smoking                   299 non-null    int64  
 11  time                      299 non-null    int64  
 12  DEATH_EVENT               299 non-null    int64  
dtypes: float64(3), int64(10)
memory usage: 30.5 KB


In [99]:
# Check the distribution of the 'DEATH_EVENT' column
death_event_distribution = Counter(data['DEATH_EVENT'])

# Print the distribution
print(death_event_distribution)

Counter({0: 203, 1: 96})


### Data Preparation before applying models

In [100]:
# Extract the label column 'DEATH_EVENT'
y = data['DEATH_EVENT']

In [101]:
# Extract the clinical features to use later as predictors
feature_columns = [
    'age', 'anaemia', 'creatinine_phosphokinase', 'diabetes', 'ejection_fraction',
    'high_blood_pressure', 'platelets', 'serum_creatinine', 'serum_sodium',
    'sex', 'smoking', 'time'
]
x = data[feature_columns]
print(x.head())

    age  anaemia  creatinine_phosphokinase  diabetes  ejection_fraction  \
0  75.0        0                       582         0                 20   
1  55.0        0                      7861         0                 38   
2  65.0        0                       146         0                 20   
3  50.0        1                       111         0                 20   
4  65.0        1                       160         1                 20   

   high_blood_pressure  platelets  serum_creatinine  serum_sodium  sex  \
0                    1  265000.00               1.9           130    1   
1                    0  263358.03               1.1           136    1   
2                    0  162000.00               1.3           129    1   
3                    0  210000.00               1.9           137    1   
4                    0  327000.00               2.7           116    0   

   smoking  time  
0        0     4  
1        0     6  
2        1     7  
3        0     7  
4        

In [102]:
# Split the data into training and testing sets
X_train, X_test, Y_train, Y_test = train_test_split(
    x,  # Features
    y,  # Labels
    test_size=0.2,  # Percentage of data for the test set
    random_state=42  # Random state for reproducibility
)

# Display the shapes of the resulting datasets to confirm the split
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("Y_train shape:", Y_train.shape)
print("Y_test shape:", Y_test.shape)

X_train shape: (239, 12)
X_test shape: (60, 12)
Y_train shape: (239,)
Y_test shape: (60,)
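
The split above is not stratified, so the share of death events can differ between the training and test sets (the classification report later in this notebook shows 25 of 60 test patients in class 1, compared with 96 of 299 overall). A minimal sketch of a stratified split follows, assuming a synthetic label array with the same class counts as the dataset; the names `labels`, `features`, `f_train`, and so on are placeholders for illustration, not variables from this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same class counts as the dataset (203 vs. 96).
labels = np.array([0] * 203 + [1] * 96)
features = np.arange(len(labels)).reshape(-1, 1)  # placeholder feature column

# stratify=labels keeps the class proportions nearly identical in both splits.
f_train, f_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.2, random_state=42, stratify=labels
)

print("Positive share overall:", round(labels.mean(), 3))
print("Positive share in train:", round(l_train.mean(), 3))
print("Positive share in test:", round(l_test.mean(), 3))
```

Whether stratification changes the final metrics here is untested; it mainly makes the test set more representative of the full data.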


In [103]:

# Define the numeric features to scale
numeric_features = [
    'age', 'creatinine_phosphokinase', 'ejection_fraction',
    'platelets', 'serum_creatinine', 'serum_sodium', 'time'
]

# Initialize the ColumnTransformer with StandardScaler for the numeric features
ct = ColumnTransformer(
    transformers=[
        ('scaler', StandardScaler(), numeric_features)
    ],
    remainder='passthrough'  # Keep the other columns unchanged
)


In [104]:
# Fit and transform the training data
X_train = ct.fit_transform(X_train)

# Display the first few rows of the transformed training data
print(X_train[:5])

[[ 1.16420244 -0.35037003 -2.00086672 -1.43956795 -0.18870542  0.13125912
  -1.56416577  1.          0.          0.          1.          0.        ]
 [ 1.16420244 -0.50593309 -0.02267169 -0.40847646  1.12060172 -0.54581131
   0.37989712  0.          0.          1.          1.          0.        ]
 [-0.03281933 -0.50064183 -0.71073953  1.34544205  0.11344238 -0.09443102
   0.4950061   1.          1.          0.          1.          0.        ]
 [-0.75664461 -0.47101077 -0.71073953 -0.47225532 -0.69228509 -0.09443102
  -0.25959725  0.          0.          0.          1.          1.        ]
 [ 2.75098914  0.0052027  -0.02267169  0.00989189  0.44580496 -0.54581131
  -1.34673769  0.          1.          1.          1.          0.        ]]


In [105]:
# List of columns in the transformed data
passthrough_columns = [col for col in feature_columns if col not in numeric_features]  # Unscaled columns

# Full column order after transformation
transformed_columns = numeric_features + passthrough_columns

# Display the column mapping
print("Column order after transformation:", transformed_columns)


Column order after transformation: ['age', 'creatinine_phosphokinase', 'ejection_fraction', 'platelets', 'serum_creatinine', 'serum_sodium', 'time', 'anaemia', 'diabetes', 'high_blood_pressure', 'sex', 'smoking']


In [106]:
# Transform the test data
X_test = ct.transform(X_test)
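
The test data is passed through `transform` rather than `fit_transform` so that it is scaled with the mean and standard deviation learned from the training data only. The arithmetic can be sketched in plain NumPy, assuming hypothetical age values that are not taken from the dataset:

```python
import numpy as np

# Hypothetical ages, not taken from the dataset.
train_age = np.array([50.0, 60.0, 70.0, 80.0])
test_age = np.array([65.0, 90.0])

# "fit": learn the statistics from the training data only.
mean = train_age.mean()
std = train_age.std()  # population standard deviation (ddof=0), as StandardScaler uses

# "transform": apply the same training statistics to both splits.
train_scaled = (train_age - mean) / std
test_scaled = (test_age - mean) / std

print(train_scaled)  # zero mean and unit variance by construction
print(test_scaled)   # scaled relative to the training distribution
```

Refitting the scaler on the test set would leak test statistics into preprocessing and make the two splits incomparable.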

In [107]:
# LabelEncoder is not strictly needed here because the target is already binary (0/1); it is kept so the labels are guaranteed to be integer class indices before they are one-hot encoded for the Sequential network.
# Initialize the LabelEncoder instance
le = LabelEncoder()
# Fit and transform the training labels
Y_train = le.fit_transform(Y_train)
# Transform the test labels
Y_test = le.transform(Y_test)

In [108]:
# One-hot encode the training labels (class 0 -> [1, 0], class 1 -> [0, 1])
Y_train = to_categorical(Y_train)
# One-hot encode the test labels in the same way
Y_test = to_categorical(Y_test)
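
For readers unfamiliar with one-hot encoding, the effect of `to_categorical` on binary labels can be reproduced in NumPy. This is a sketch of the idea with made-up example labels, not the Keras implementation:

```python
import numpy as np

labels = np.array([0, 0, 1, 1, 0])  # made-up example labels

# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(2, dtype="float32")[labels]
print(one_hot)

# argmax along the class axis recovers the original integer labels.
recovered = np.argmax(one_hot, axis=1)
print(recovered)
```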

### Model Set-up

In [109]:
# Initialize the Sequential model
model = Sequential()

# Display the initialized model to confirm
print(model)

<Sequential name=sequential_3, built=False>


In [110]:
# Create an InputLayer instance and add it to the model
input_layer = InputLayer(shape=(X_train.shape[1],))  # Use 'shape' instead of 'input_shape'
model.add(input_layer)

# Display the model to confirm the addition of the input layer
model.summary()

In [111]:
# Create a hidden layer with 12 neurons and ReLU activation
hidden_layer = Dense(units=12, activation='relu')

# Add the hidden layer to the model
model.add(hidden_layer)

# Display the model to confirm the addition of the hidden layer
model.summary()

In [112]:
# Create an output layer with softmax activation and the number of neurons equal to the number of classes
output_layer = Dense(units=Y_train.shape[1], activation='softmax')  # Y_train.shape[1] gives the number of classes

# Add the output layer to the model
model.add(output_layer)

# Display the model to confirm the addition of the output layer
model.summary()

In [113]:
# Compile the model with the specified loss, optimizer, and metrics
model.compile(
    loss='categorical_crossentropy',
    optimizer='adam',
    metrics=['accuracy']
)
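
With only two classes, a two-unit softmax output carries the same information as a single sigmoid unit: the class 1 probability equals the sigmoid of the difference between the two logits. The NumPy sketch below checks this numerically on hypothetical logits; it is meant to explain the output layer, not to replace the architecture used here.

```python
import numpy as np

def softmax(logits):
    # Subtract the row maximum for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical logits for three patients (columns: class 0, class 1).
logits = np.array([[2.0, -1.0], [0.5, 0.5], [-3.0, 4.0]])

probs = softmax(logits)
p1_sigmoid = sigmoid(logits[:, 1] - logits[:, 0])

print(probs[:, 1])
print(p1_sigmoid)
```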


### Model Fit and Performance Analysis

In [114]:
# Fit the model to the training data
history = model.fit(
    X_train,  # Training features
    Y_train,  # Training labels
    epochs=100,  # Number of epochs
    batch_size=16,  # Batch size
    verbose=1  # Display training progress
)

# Confirm training has finished
print("Model training complete!")


Epoch 1/100
[1m15/15[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 11ms/step - accuracy: 0.5988 - loss: 0.7582
Epoch 2/100
[1m15/15[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 9ms/step - accuracy: 0.6888 - loss: 0.5982
Epoch 3/100
[1m15/15[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 9ms/step - accuracy: 0.6971 - loss: 0.5616
Epoch 4/100
[1m15/15[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 9ms/step - accuracy: 0.7108 - loss: 0.5296
Epoch 5/100
[1m15/15[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 9ms/step - accuracy: 0.7855 - loss: 0.4997
Epoch 6/100
[1m15/15[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 10ms/step - accuracy: 0.7566 - loss: 0.4950
Epoch 7/100
[1m15/15[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 9ms/step - accuracy: 0.8038 - loss: 0.4444
Epoch 8/100
[1m15/15[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 9ms/step - accuracy: 0.7864 - loss: 0.4704
Epoch 9/100
[1m15/15[0m [32m━━━━━━━━━━━━━━━

In [115]:
# Evaluate the trained model on the test data
loss, acc = model.evaluate(
    X_test,  # Test features
    Y_test,  # Test labels
    verbose=1  # Display evaluation progress
)

# Display the results
print(f"Test Loss: {loss}")
print(f"Test Accuracy: {acc}")


2/2 ━━━━━━━━━━━━━━━━━━━━ 0s 31ms/step - accuracy: 0.7826 - loss: 0.5710
Test Loss: 0.5693067908287048
Test Accuracy: 0.7833333611488342


In [116]:
# Get predictions for the test data
y_estimate = model.predict(X_test)

# Display the first few predictions
print(y_estimate[:5]) # The output is a probability distribution for each class, e.g., [0.1, 0.9] means 10% for class 0 and 90% for class 1


2/2 ━━━━━━━━━━━━━━━━━━━━ 0s 19ms/step
[[9.4264507e-01 5.7354953e-02]
 [9.9702042e-01 2.9795209e-03]
 [9.1280198e-01 8.7198094e-02]
 [2.0149213e-04 9.9979848e-01]
 [8.9440191e-01 1.0559808e-01]]


In [117]:
# Convert each predicted probability distribution to the index of the most probable class
y_estimate = np.argmax(y_estimate, axis=1)  # e.g., [0.1, 0.9] becomes class 1

# Display the first few predicted class indices
print(y_estimate[:5])


[0 0 0 1 0]
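
For two classes, taking the `argmax` is the same as predicting class 1 whenever its probability is at least 0.5 (apart from exact ties). Making the threshold explicit shows how a lower cut-off flags more patients as at risk. The sketch below reuses the five probability rows printed above; the helper `predict_with_threshold` and the 0.08 cut-off are hypothetical choices for illustration only.

```python
import numpy as np

# The five probability rows printed by the notebook above.
probabilities = np.array([
    [9.4264507e-01, 5.7354953e-02],
    [9.9702042e-01, 2.9795209e-03],
    [9.1280198e-01, 8.7198094e-02],
    [2.0149213e-04, 9.9979848e-01],
    [8.9440191e-01, 1.0559808e-01],
])

def predict_with_threshold(probs, threshold=0.5):
    """Predict class 1 when its probability reaches the threshold (hypothetical helper)."""
    return (probs[:, 1] >= threshold).astype(int)

default_pred = predict_with_threshold(probabilities)        # same result as argmax here
lenient_pred = predict_with_threshold(probabilities, 0.08)  # arbitrary lower cut-off

print(default_pred)
print(lenient_pred)
```

A lower threshold trades precision for recall on class 1; a suitable value would need to be chosen on a validation set rather than the test set.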


In [118]:
# Convert the one-hot encoded test labels back to class indices
y_true = np.argmax(Y_test, axis=1)

# Display the first few true class indices
print(y_true[:5])

[0 0 1 1 0]


In [119]:
# Generate and print the classification report
report = classification_report(y_true, y_estimate)

print("Classification Report:")
print(report)

Classification Report:
              precision    recall  f1-score   support

           0       0.74      0.97      0.84        35
           1       0.93      0.52      0.67        25

    accuracy                           0.78        60
   macro avg       0.83      0.75      0.75        60
weighted avg       0.82      0.78      0.77        60



### Summary of the Classification Report

1. **Overall Accuracy**: 
   - The model achieved an overall accuracy of **78%** on the test data, meaning that 78% of the predictions correctly classified whether or not there was a death event.

2. **Performance for Patients Without a Death Event (Class 0)**:
   - **Precision**: 0.74 – Of all predictions made for "no death event," 74% were correct.
   - **Recall**: 0.97 – The model successfully identified 97% of patients who truly did not experience a death event.
   - **F1-score**: 0.84 – A strong balance between precision and recall for predicting no death event.
   - **Support**: 35 patients in the test set did not experience a death event.

3. **Performance for Patients With a Death Event (Class 1)**:
   - **Precision**: 0.93 – Of all predictions made for "death event," 93% were correct.
   - **Recall**: 0.52 – The model correctly identified only 52% of patients who truly experienced a death event.
   - **F1-score**: 0.67 – Indicates moderate performance in predicting death events due to the lower recall.
   - **Support**: 25 patients in the test set experienced a death event.

4. **Macro Average**:
   - **Precision**: 0.83 – Average precision across both classes.
   - **Recall**: 0.75 – Average recall across both classes.
   - **F1-score**: 0.75 – The unweighted mean of the per-class F1-scores; the gap between 0.84 (class 0) and 0.67 (class 1) shows that the model performs better for "no death event" than "death event."

5. **Weighted Average**:
   - **Precision**: 0.82 – Precision weighted by the support of each class.
   - **Recall**: 0.78 – Recall weighted by the support of each class.
   - **F1-score**: 0.77 – F1-score weighted by support; it sits closer to the class 0 score because class 0 (no death event) has the higher support.

6. **Class Imbalance**:
   - The dataset is imbalanced with a **ratio of ~2:1** (203 instances of "no death event" vs. 96 instances of "death event").
   - This imbalance is reflected in the model's stronger performance for "no death event" (class 0) and its lower recall for "death event" (class 1).
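
The figures in this summary can be tied together through a confusion matrix. The notebook does not print one, so the matrix below is an inference: it is the only set of integer counts consistent with the reported supports (35 and 25) and the rounded recall values. The sketch recomputes each metric from it.

```python
import numpy as np

# Rows are true classes, columns are predicted classes.
# Inferred from the rounded report values; not printed by the notebook.
cm = np.array([[34, 1],
               [12, 13]])

true_pos = np.diag(cm)
precision = true_pos / cm.sum(axis=0)  # correct predictions / all predictions per class
recall = true_pos / cm.sum(axis=1)     # correct predictions / all true members per class
f1 = 2 * precision * recall / (precision + recall)
accuracy = true_pos.sum() / cm.sum()

print("precision:", np.round(precision, 2))
print("recall:   ", np.round(recall, 2))
print("f1-score: ", np.round(f1, 2))
print("accuracy: ", round(float(accuracy), 2))
```

Read this way, the model missed 12 of the 25 patients who died while raising only one false alarm, which is the low class 1 recall described above.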

---

### Key Observations:
- **High Recall for Class 0 (No Death Event)**: The model is highly effective at identifying patients who will not experience a death event.
- **Low Recall for Class 1 (Death Event)**: The model struggles to identify patients who will experience a death event, as evidenced by the 52% recall for class 1.
- **Class Imbalance Impact**: The class imbalance is likely skewing the model's performance, favoring predictions for the majority class (class 0).

---

### Recommendations:
1. **Address Class Imbalance**:
   - Use techniques like oversampling the minority class (class 1), undersampling the majority class (class 0), or applying class weights during model training.
2. **Focus on Recall for Class 1**:
   - Optimize the model for better recall in predicting death events, which might involve tweaking loss functions, adding more layers, or adjusting learning rates.
3. **Threshold Adjustment**:
   - Experiment with adjusting the decision threshold for classifying "death event" to improve recall for class 1, even if it reduces precision slightly.

These changes would bring the model closer to the critical goal of identifying patients at risk of a death event; in healthcare settings, missing such a patient is often costlier than raising a false alarm.
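
As a starting point for the first recommendation, class weights can be derived from the label counts with the common "balanced" heuristic, `n_samples / (n_classes * class_count)`. The sketch below assumes the full-dataset counts (203 vs. 96) for illustration; in practice the weights would be computed from the training labels only. The helper name `balanced_class_weights` is hypothetical.

```python
import numpy as np

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count) (hypothetical helper)."""
    classes, counts = np.unique(labels, return_counts=True)
    weights = len(labels) / (len(classes) * counts)
    return {int(c): float(w) for c, w in zip(classes, weights)}

# Synthetic labels with the same class counts as the full dataset.
labels = np.array([0] * 203 + [1] * 96)

class_weight = balanced_class_weights(labels)
print(class_weight)  # the minority class (1) receives the larger weight
```

A dictionary of this form is what Keras `model.fit` accepts through its `class_weight` argument; whether it improves class 1 recall on this dataset would need to be verified experimentally.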
