<a href="https://colab.research.google.com/github/mannb986/cardiovascular_dieases_classification/blob/main/cardiovascular_diseases_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Cardiovascular Diseases Classification

In this project I use a dataset from Kaggle to predict the survival of patients with heart failure from serum creatinine and ejection fraction, along with other factors such as age, anemia, and diabetes.

Cardiovascular diseases (CVDs) are the number 1 cause of death globally, taking an estimated 17.9 million lives each year, which accounts for 31% of all deaths worldwide. Heart failure is a common event caused by CVDs, and this dataset contains 12 features that can be used to predict mortality by heart failure.

Most cardiovascular diseases can be prevented by addressing behavioral risk factors such as tobacco use, unhealthy diet and obesity, physical inactivity, and harmful alcohol use using population-wide strategies.

People with cardiovascular disease, or at high cardiovascular risk (because of one or more risk factors such as hypertension, diabetes, hyperlipidemia, or already established disease), need early detection and management. This is where a machine learning model can help.

## Loading the Data

In [1]:
from google.colab import files


uploaded = files.upload()

Saving heart_failure.csv to heart_failure (1).csv


In [2]:
import pandas as pd 
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from collections import Counter
from sklearn.compose import ColumnTransformer
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, InputLayer
from sklearn.metrics import classification_report
from tensorflow.keras.utils import to_categorical
import numpy as np

data = pd.read_csv('heart_failure.csv')

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 299 entries, 0 to 298
Data columns (total 15 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Unnamed: 0                299 non-null    int64  
 1   age                       299 non-null    float64
 2   anaemia                   299 non-null    object 
 3   creatinine_phosphokinase  299 non-null    int64  
 4   diabetes                  299 non-null    object 
 5   ejection_fraction         299 non-null    int64  
 6   high_blood_pressure       299 non-null    object 
 7   platelets                 299 non-null    float64
 8   serum_creatinine          299 non-null    float64
 9   serum_sodium              299 non-null    int64  
 10  sex                       299 non-null    object 
 11  smoking                   299 non-null    object 
 12  time                      299 non-null    int64  
 13  DEATH_EVENT               299 non-null    int64  
 14  death_even

In [4]:
data.head()

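| | Unnamed: 0 | age | anaemia | creatinine_phosphokinase | diabetes | ejection_fraction | high_blood_pressure | platelets | serum_creatinine | serum_sodium | sex | smoking | time | DEATH_EVENT | death_event |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 75.0 | no | 582 | no | 20 | yes | 265000.0 | 1.9 | 130 | yes | no | 4 | 1 | yes |
| 1 | 1 | 55.0 | no | 7861 | no | 38 | no | 263358.03 | 1.1 | 136 | yes | no | 6 | 1 | yes |
| 2 | 2 | 65.0 | no | 146 | no | 20 | no | 162000.0 | 1.3 | 129 | yes | yes | 7 | 1 | yes |
| 3 | 3 | 50.0 | yes | 111 | no | 20 | no | 210000.0 | 1.9 | 137 | yes | no | 7 | 1 | yes |
| 4 | 4 | 65.0 | yes | 160 | yes | 20 | no | 327000.0 | 2.7 | 116 | no | no | 8 | 1 | yes |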


In [5]:
Counter(data['death_event'])

Counter({'no': 203, 'yes': 96})
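The class counts above are imbalanced, so it helps to know what accuracy a trivial classifier would reach. The sketch below is an illustration added here, not part of the original notebook: it computes the accuracy of always predicting the majority class, using only the counts printed above.

```python
from collections import Counter

# Class counts copied from the Counter output above.
counts = Counter({'no': 203, 'yes': 96})

total = sum(counts.values())
majority_label, majority_count = counts.most_common(1)[0]

# Accuracy of a classifier that always predicts the majority class.
baseline_accuracy = majority_count / total

print(majority_label, round(baseline_accuracy, 3))
```

Any trained model should be compared against this baseline rather than against 50%.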

In [6]:
#creating labels for the model
y = data['death_event']

In [7]:
#creating the features for the model
features = [
            'age',
            'anaemia',
            'creatinine_phosphokinase',
            'diabetes',
            'ejection_fraction',
            'high_blood_pressure',
            'platelets',
            'serum_creatinine',
            'serum_sodium',
            'sex',
            'smoking',
            'time'
            ]

x = data[features]

In [8]:
x.head()

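| | age | anaemia | creatinine_phosphokinase | diabetes | ejection_fraction | high_blood_pressure | platelets | serum_creatinine | serum_sodium | sex | smoking | time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 75.0 | no | 582 | no | 20 | yes | 265000.0 | 1.9 | 130 | yes | no | 4 |
| 1 | 55.0 | no | 7861 | no | 38 | no | 263358.03 | 1.1 | 136 | yes | no | 6 |
| 2 | 65.0 | no | 146 | no | 20 | no | 162000.0 | 1.3 | 129 | yes | yes | 7 |
| 3 | 50.0 | yes | 111 | no | 20 | no | 210000.0 | 1.9 | 137 | yes | no | 7 |
| 4 | 65.0 | yes | 160 | yes | 20 | no | 327000.0 | 2.7 | 116 | no | no | 8 |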


## Data Processing

In [9]:
#converting categorical features to one-hot encoding vectors
x = pd.get_dummies(x)
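To make the one-hot step concrete, here is a minimal plain-Python sketch of what the encoding does to a single yes/no column. The helper `one_hot` is hypothetical and written only for illustration; it is not how `pd.get_dummies` is implemented, but it produces the same kind of `anaemia_no` / `anaemia_yes` indicator pair for the first five rows shown earlier.

```python
def one_hot(values):
    """Return the sorted categories and one indicator row per input value."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


# The 'anaemia' values of the first five patients in the dataset.
anaemia = ['no', 'no', 'no', 'yes', 'yes']

categories, encoded = one_hot(anaemia)
print(categories)  # column order: one indicator column per category
for row in encoded:
    print(row)
```

Each binary feature becomes two complementary columns, which is why the 12 original features expand to 17 model inputs.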

In [10]:
x.head()

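| | age | creatinine_phosphokinase | ejection_fraction | platelets | serum_creatinine | serum_sodium | time | anaemia_no | anaemia_yes | diabetes_no | diabetes_yes | high_blood_pressure_no | high_blood_pressure_yes | sex_no | sex_yes | smoking_no | smoking_yes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 75.0 | 582 | 20 | 265000.0 | 1.9 | 130 | 4 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 |
| 1 | 55.0 | 7861 | 38 | 263358.03 | 1.1 | 136 | 6 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 |
| 2 | 65.0 | 146 | 20 | 162000.0 | 1.3 | 129 | 7 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |
| 3 | 50.0 | 111 | 20 | 210000.0 | 1.9 | 137 | 7 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 |
| 4 | 65.0 | 160 | 20 | 327000.0 | 2.7 | 116 | 8 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 |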


In [11]:
X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size=0.33, random_state=42)
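As a quick sanity check on the split, the arithmetic below estimates how many rows land in each set. It assumes the test share is rounded up to a whole number of rows, which is consistent with the support of 99 reported in the classification report further down; the variable names are illustrative.

```python
import math

n_samples = 299   # rows in the dataset, from data.info()
test_size = 0.33  # fraction passed to train_test_split

# Assumption: the test share is rounded up to a whole number of rows.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```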

In [12]:
columns = [
           'age',
           'creatinine_phosphokinase',
           'ejection_fraction',
           'platelets',
           'serum_creatinine',
           'serum_sodium',
           'time'
           ]
ct = ColumnTransformer([("numeric", StandardScaler(), columns)], remainder='passthrough')

In [13]:
X_train = ct.fit_transform(X_train)

In [14]:
X_test = ct.transform(X_test)
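The scaler is fitted on the training set only and then reused on the test set, so no test statistics leak into training. The following plain-Python sketch illustrates that idea for one column. The helpers `fit_scaler` and `apply_scaler` are hypothetical stand-ins, and the tiny train/test lists are made up from the first few `age` values; it assumes standardization with the population standard deviation.

```python
from statistics import fmean, pstdev


def fit_scaler(values):
    """Learn the mean and population standard deviation from training data."""
    return fmean(values), pstdev(values)


def apply_scaler(values, mean, std):
    """Standardize values with statistics learned elsewhere."""
    return [(value - mean) / std for value in values]


train_age = [75.0, 55.0, 65.0, 50.0]  # illustrative training rows
test_age = [65.0]                     # illustrative held-out row

mean, std = fit_scaler(train_age)                   # fit on train only
train_scaled = apply_scaler(train_age, mean, std)
test_scaled = apply_scaler(test_age, mean, std)     # reuse train statistics

print(mean, std)
print(train_scaled, test_scaled)
```

After scaling, the training column has mean 0 and unit variance, while the test column generally does not, which is expected.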

## Prepare Labels for Classification

In [15]:
le = LabelEncoder()

In [16]:
Y_train = le.fit_transform(Y_train.astype(str))
Y_test = le.transform(Y_test.astype(str))

In [17]:
# transform the integer-encoded labels into one-hot (binary) vectors
Y_train = to_categorical(Y_train)
Y_test = to_categorical(Y_test)
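The two label steps can be sketched in plain Python to show which class ends up at which index. This is an illustration under the assumption that classes are numbered in sorted order, so `'no'` maps to 0 and `'yes'` maps to 1; the sample labels are made up.

```python
labels = ['yes', 'no', 'no', 'yes']  # illustrative label values

# Step 1: integer-encode, assuming classes are numbered in sorted order.
classes = sorted(set(labels))
class_to_index = {name: index for index, name in enumerate(classes)}
encoded = [class_to_index[label] for label in labels]

# Step 2: turn each integer into a one-hot vector of length n_classes.
one_hot_labels = [
    [1.0 if position == index else 0.0 for position in range(len(classes))]
    for index in encoded
]

print(class_to_index)
print(encoded)
print(one_hot_labels)
```

Under this mapping, class 1 in the later classification report corresponds to `death_event == 'yes'`.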

## Design the Model

In [18]:
model = Sequential()

In [19]:
model.add(InputLayer(input_shape=(X_train.shape[1],)))

In [20]:
model.add(Dense(12, activation='relu'))

In [21]:
model.add(Dense(2, activation='softmax'))

In [22]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 12)                216       
_________________________________________________________________
dense_1 (Dense)              (None, 2)                 26        
Total params: 242
Trainable params: 242
Non-trainable params: 0
_________________________________________________________________
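The parameter counts in the summary can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The short sketch below reproduces the numbers, using the 17 input columns produced by the one-hot step; `dense_params` is a hypothetical helper name.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units


n_features = 17  # 7 numeric columns + 10 one-hot columns

hidden_params = dense_params(n_features, 12)  # hidden layer, 12 units
output_params = dense_params(12, 2)           # output layer, 2 classes
total_params = hidden_params + output_params

print(hidden_params, output_params, total_params)
```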


In [23]:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
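To show what the loss measures, here is a minimal plain-Python sketch of softmax followed by categorical cross-entropy for a single sample. The functions and the example logits are illustrative and are not the Keras implementation.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_crossentropy(y_true, y_pred):
    """Negative log-probability assigned to the true (one-hot) class."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)


probs = softmax([1.0, 3.0])                        # illustrative logits
loss = categorical_crossentropy([0.0, 1.0], probs)  # true class is index 1

print(probs, loss)
```

The loss shrinks toward 0 as the probability assigned to the true class approaches 1.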

## Train & Evaluate the Model

In [24]:
model.fit(X_train, Y_train, epochs=100, batch_size=16, verbose=1)

Epoch 1/100
[remaining per-epoch log lines omitted; the captured output is truncated]

<tensorflow.python.keras.callbacks.History at 0x7f3317258d90>

In [26]:
loss, acc = model.evaluate(X_test, Y_test, verbose=0)
print("Loss:", loss, "Accuracy:", acc)

Loss: 0.5450414419174194 Accuracy: 0.747474730014801


## Generating a Classification Report

In [27]:
y_estimate = model.predict(X_test, verbose=0)

In [28]:
# convert predicted probabilities and one-hot labels to class indices
y_estimate = np.argmax(y_estimate, axis=1)
y_true = np.argmax(Y_test, axis=1)
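The `np.argmax(..., axis=1)` calls pick, for each row, the index of the largest entry. A plain-Python sketch of the same operation, with made-up probability rows and a hypothetical `argmax` helper, looks like this:

```python
def argmax(row):
    """Index of the largest value in a row."""
    return max(range(len(row)), key=lambda i: row[i])


# Illustrative softmax outputs: [P(class 0), P(class 1)] per patient.
probabilities = [[0.8, 0.2], [0.3, 0.7], [0.55, 0.45]]

predicted_classes = [argmax(row) for row in probabilities]
print(predicted_classes)
```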

In [30]:
# print additional metrics such as precision, recall, and F1 score
print(classification_report(y_true, y_estimate))

              precision    recall  f1-score   support

           0       0.75      0.84      0.79        57
           1       0.74      0.62      0.68        42

    accuracy                           0.75        99
   macro avg       0.75      0.73      0.73        99
weighted avg       0.75      0.75      0.74        99



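The model reaches about 75% accuracy on the 99 test patients, with a weighted-average F1 score of 0.74. Recall for class 1 (patients who died) is lower at 0.62, so the model misses roughly 38% of the deaths in the test set.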
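The metrics in the classification report follow from four confusion-matrix counts. The notebook does not print those counts, so the values below are an assumption: they are inferred from the rounded report and the reported accuracy, and they reproduce the class 1 row. The sketch shows how precision, recall, and F1 are derived from them.

```python
# Assumed counts for class 1, inferred from the report (not printed by the notebook).
true_positives = 26   # class 1 predicted as class 1
false_negatives = 16  # class 1 predicted as class 0
false_positives = 9   # class 0 predicted as class 1
true_negatives = 48   # class 0 predicted as class 0

precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (true_positives + true_negatives) / (
    true_positives + true_negatives + false_positives + false_negatives
)

print(round(precision, 2), round(recall, 2), round(f1, 2), round(accuracy, 2))
```

F1 is the harmonic mean of precision and recall, so it is pulled toward the lower of the two, here the recall.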