## **Predicting Mortality in Heart Failure Patients**
***
In this project, a dataset from Kaggle was used to predict the survival of patients with heart failure based on clinical variables such as serum creatinine, ejection fraction, age, anaemia, and diabetes.

Cardiovascular diseases (CVDs) are the leading cause of death globally, claiming an estimated 17.9 million lives each year, or about 31% of all deaths worldwide. Heart failure is a common event caused by CVDs and is the outcome of interest in this dataset, which contains 12 features that can be used to predict mortality from heart failure.

Most cardiovascular diseases can be prevented by addressing behavioral risk factors such as tobacco use, unhealthy diet and obesity, physical inactivity, and harmful alcohol use through population-wide strategies.

People with cardiovascular disease, or at high cardiovascular risk because of hypertension, diabetes, hyperlipidemia, or pre-existing conditions, need early detection and management. In this context, machine learning models can help predict patient outcomes and potentially improve survival rates.
***
#### **1. Load dataframe and print dataset information**
- Imported libraries.
- Used `pandas.read_csv()` to load the data from `heart_failure.csv` to a pandas DataFrame object `data`.
- Used the `DataFrame.info()` method to print all the columns and their types for `data`.

In [1]:
import pandas as pd
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from collections import Counter
from sklearn.compose import ColumnTransformer
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, InputLayer
from sklearn.metrics import classification_report
from tensorflow.keras.utils import to_categorical
import numpy as np

data = pd.read_csv('heart_failure.csv')
print(data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 299 entries, 0 to 298
Data columns (total 13 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   age                       299 non-null    float64
 1   anaemia                   299 non-null    int64  
 2   creatinine_phosphokinase  299 non-null    int64  
 3   diabetes                  299 non-null    int64  
 4   ejection_fraction         299 non-null    int64  
 5   high_blood_pressure       299 non-null    int64  
 6   platelets                 299 non-null    float64
 7   serum_creatinine          299 non-null    float64
 8   serum_sodium              299 non-null    int64  
 9   sex                       299 non-null    int64  
 10  smoking                   299 non-null    int64  
 11  time                      299 non-null    int64  
 12  DEATH_EVENT               299 non-null    int64  
dtypes: float64(3), int64(10)
memory usage: 30.5 KB
None


***
#### **2. Print distribution of DEATH_EVENT**
Printed the distribution of the `DEATH_EVENT` column in `data` using `collections.Counter`. This column is the prediction target `y`.

In [2]:
y = data['DEATH_EVENT']
print('Classes and number of values in the dataset',Counter(y))

Classes and number of values in the dataset Counter({0: 203, 1: 96})
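
The counts above show an imbalanced target, which matters when reading accuracy figures later. As a rough sketch (the variable names here are illustrative and not part of the original notebook), the class shares and the accuracy of a trivial "always predict the majority class" baseline can be worked out directly from the printed counts:

```python
from collections import Counter

# Class counts as printed above for DEATH_EVENT
counts = Counter({0: 203, 1: 96})
total = sum(counts.values())

# Share of each class in the dataset
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a classifier that always predicts the most common class
majority_label, majority_count = counts.most_common(1)[0]
baseline_accuracy = majority_count / total

print(shares)
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

Any model trained on this data should be judged against that baseline of roughly 68%, not against 50%.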


***
#### **3. Data pre-processing**
- Extracted the feature columns [`age`,`anaemia`,`creatinine_phosphokinase`,`diabetes`,`ejection_fraction`,`high_blood_pressure`,`platelets`,`serum_creatinine`,`serum_sodium`,`sex`,`smoking`,`time`] from `data` by dropping `DEATH_EVENT`, and assigned the result to a variable called `x`.

- Used the `pandas.get_dummies()` function to convert any categorical features in `x` to one-hot encoded vectors and assigned the result back to `x`. Because every column in this dataset is already numeric (`int64` or `float64`), this call leaves `x` unchanged.

In [3]:
x = data.drop(columns = ['DEATH_EVENT'])
x = pd.get_dummies(x)
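
To illustrate what `pandas.get_dummies()` does to this kind of data, here is a small sketch on a toy frame. The values are made up, and the `ward` column is hypothetical (it does not exist in the dataset); it is included only to show what a string column would turn into.

```python
import pandas as pd

# Toy rows with made-up values; column names follow the dataset
toy = pd.DataFrame({
    "age": [60.0, 75.0, 50.0],
    "anaemia": [0, 1, 0],
    "ejection_fraction": [38, 20, 45],
})

# Numeric columns are not touched by get_dummies
encoded = pd.get_dummies(toy)
print(encoded.equals(toy))

# A string column (hypothetical) is expanded into one column per category
toy_with_text = toy.assign(ward=["A", "B", "A"])
expanded_columns = pd.get_dummies(toy_with_text).columns.tolist()
print(expanded_columns)
```

Since all columns in `heart_failure.csv` are numeric, the call in the cell above behaves like the first case.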

***
#### **4. Train-test split**
- Used the `sklearn.model_selection.train_test_split()` function to split the data into training features `X_train`, test features `X_test`, training labels `Y_train`, and test labels `Y_test`, holding out 20% of the rows for testing (`test_size = 0.2`) with a fixed `random_state` for reproducibility.

In [4]:
X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size = 0.2, random_state = 0)
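
The split above is purely random, so the class ratio in the test set can drift from the ratio in the full dataset. One option, which the original notebook does not use, is the `stratify` argument of `train_test_split`. The sketch below uses synthetic labels with the same class counts as the dataset (203 and 96) and a placeholder feature column, both of which are assumptions made for illustration:

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as the dataset
y_demo = np.array([0] * 203 + [1] * 96)
# Placeholder feature column; real features are not needed for this demo
x_demo = np.arange(len(y_demo)).reshape(-1, 1)

# stratify=y_demo keeps the class ratio similar in both splits
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

print(len(y_tr), len(y_te))
print(Counter(y_te.tolist()))
```

With 299 rows and `test_size=0.2`, the test set holds 60 rows either way; stratification only changes which rows end up there.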

***
#### **5. Scaling numeric features**
- Initialized a `ColumnTransformer` object that uses `StandardScaler` to scale the numeric features in the dataset: [`age`,`creatinine_phosphokinase`,`ejection_fraction`,`platelets`,`serum_creatinine`,`serum_sodium`,`time`]. Assigned the resulting object to a variable called `ct`. Because `ColumnTransformer` drops unlisted columns by default (`remainder='drop'`), the binary features (`anaemia`, `diabetes`, `high_blood_pressure`, `sex`, `smoking`) were not passed on to the model.
- Used the `ColumnTransformer.fit_transform()` function to fit the scaler instance `ct` on the training data `X_train` and assigned the result back to `X_train`.
- Used the `ColumnTransformer.transform()` function to scale the test data `X_test` using the fitted scaler `ct`, and assigned the result back to `X_test`.

In [5]:
ct = ColumnTransformer([("numeric", StandardScaler(), ['age','creatinine_phosphokinase','ejection_fraction','platelets','serum_creatinine','serum_sodium','time'])])
X_train = ct.fit_transform(X_train)
X_test = ct.transform(X_test)
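
The effect of the `remainder` setting on `ColumnTransformer` is easy to miss, so here is a minimal sketch on a toy frame with made-up values. It compares the default behavior with `remainder="passthrough"`, which is an alternative and not what the cell above does:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Toy rows with made-up values; two numeric columns and one binary flag
toy = pd.DataFrame({
    "age": [60.0, 75.0, 50.0, 65.0],
    "time": [10, 120, 200, 30],
    "smoking": [0, 1, 0, 1],
})
numeric = ["age", "time"]

# Default remainder="drop": only the listed columns survive
dropping = ColumnTransformer([("numeric", StandardScaler(), numeric)])

# remainder="passthrough": unlisted columns are appended unscaled
keeping = ColumnTransformer(
    [("numeric", StandardScaler(), numeric)], remainder="passthrough"
)

dropped_shape = dropping.fit_transform(toy).shape
kept_shape = keeping.fit_transform(toy).shape
print(dropped_shape, kept_shape)
```

Switching the notebook to `remainder="passthrough"` would change the model input from 7 to 12 columns, so the results reported below would no longer apply as written.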

***
#### **6. Prepare labels for classification**
- Initialized an instance of `LabelEncoder` and assigned it to a variable called `le`.
- Used the `LabelEncoder.fit_transform()` function to fit the encoder instance `le` to the training labels `Y_train` and convert them in the same step.
- Used the `LabelEncoder.transform()` function to encode the test labels `Y_test` with the fitted encoder `le`.
- Used the `tensorflow.keras.utils.to_categorical()` function to transform `Y_train` and `Y_test` into one-hot (binary) vectors and assigned the results back to `Y_train` and `Y_test`.

In [6]:
le = LabelEncoder()
Y_train = le.fit_transform(Y_train.astype(str))
Y_test = le.transform(Y_test.astype(str))
Y_train = to_categorical(Y_train)
Y_test= to_categorical(Y_test)
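
The label preparation can be sketched without TensorFlow. The example below uses made-up label arrays and replaces `to_categorical` with a NumPy identity-matrix lookup, which yields the same one-hot layout for integer class ids; the `_demo` and `_ids` names are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Made-up string labels standing in for Y_train / Y_test
train_labels = np.array(["0", "1", "0", "0", "1"])
test_labels = np.array(["1", "0"])

# Fit on the training labels only, then reuse the mapping for the test labels
le_demo = LabelEncoder()
train_ids = le_demo.fit_transform(train_labels)
test_ids = le_demo.transform(test_labels)

# Row i of the identity matrix is the one-hot vector for class i
num_classes = len(le_demo.classes_)
train_one_hot = np.eye(num_classes)[train_ids]
test_one_hot = np.eye(num_classes)[test_ids]
print(test_one_hot)
```

Fitting the encoder once and calling `transform()` on the test labels guarantees that both sets share the same class-to-index mapping.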

***
#### **7. Design the model**
- Initialized a `tensorflow.keras.models.Sequential` model instance called `model`.
- Created an input layer instance of `tensorflow.keras.layers.InputLayer` and added it to the model instance `model` using the `Sequential.add()` function.
- Created a hidden layer instance of `tensorflow.keras.layers.Dense` with the `relu` activation function and 12 hidden neurons, and added it to the model instance `model`.
- Created an output layer instance of `tensorflow.keras.layers.Dense` with a `softmax` activation function, so that the outputs form a probability distribution over the classes, and with one neuron per class in the dataset (2).
- Used the `Model.compile()` function to compile the model instance `model` with the `categorical_crossentropy` loss, the `adam` optimizer, and `accuracy` as the metric.

In [7]:
model = Sequential()
model.add(InputLayer(shape=(X_train.shape[1],)))
model.add(Dense(12, activation='relu'))
model.add(Dense(2, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
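
To make the architecture concrete, the following NumPy sketch reproduces the shape of the network (7 scaled inputs, 12 ReLU units, 2 softmax outputs) and counts its trainable parameters. The weights are random placeholders, not the trained Keras weights, and the `forward` helper is illustrative only.

```python
import numpy as np

# 7 scaled numeric inputs, 12 hidden units, 2 output classes
n_features, n_hidden, n_classes = 7, 12, 2

# Trainable parameters: one weight per connection plus one bias per unit
hidden_params = n_features * n_hidden + n_hidden
output_params = n_hidden * n_classes + n_classes
total_params = hidden_params + output_params

# Random placeholder weights (not the trained model)
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(n_features, n_hidden)), np.zeros(n_hidden)
W2, b2 = rng.normal(size=(n_hidden, n_classes)), np.zeros(n_classes)

def forward(x):
    """Dense + ReLU followed by Dense + softmax."""
    hidden = np.maximum(0.0, x @ W1 + b1)
    logits = hidden @ W2 + b2
    # Subtract the row maximum for numerical stability
    logits = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)

probs = forward(rng.normal(size=(4, n_features)))
print(total_params)
print(probs.sum(axis=1))
```

The model has 96 + 26 = 122 trainable parameters, which is small relative to the 239 training rows but still enough to overfit if trained for many epochs.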

***
#### **8. Train and evaluate the model**
- Using the `Model.fit()` function, fitted the model instance `model` to the training data `X_train` and training labels `Y_train`. Set the number of epochs to 100 and the batch size parameter to 16.
- Using the `Model.evaluate()` function, evaluated the trained model instance `model` on the test data `X_test` and test labels `Y_test`. Assigned the results to a variable called `loss` (the final loss value) and a variable called `acc` (the accuracy metric).
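
The `15/15` step counter in the training log follows from the split size and the batch size. A short worked calculation, assuming the 299 rows and the 80/20 split used earlier:

```python
import math

n_samples = 299    # rows in the dataset
test_size = 0.2    # fraction held out for testing
batch_size = 16
epochs = 100

# scikit-learn rounds the test share up to a whole number of rows
n_test = math.ceil(n_samples * test_size)
n_train = n_samples - n_test

# The last batch of each epoch is smaller than batch_size, hence ceil
steps_per_epoch = math.ceil(n_train / batch_size)
total_updates = steps_per_epoch * epochs

print(n_train, n_test, steps_per_epoch, total_updates)
```

So each epoch performs 15 weight updates on 239 training rows, for 1500 updates over the full run.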


In [8]:
model.fit(X_train, Y_train, epochs = 100, batch_size = 16, verbose=1)
loss, acc = model.evaluate(X_test, Y_test, verbose=0)
print("\nLoss", loss, "Accuracy:", acc)

Epoch 1/100
15/15 ━━━━━━━━━━━━━━━━━━━━ 1s 7ms/step - accuracy: 0.5235 - loss: 0.7687
Epoch 2/100
15/15 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - accuracy: 0.5626 - loss: 0.7050
Epoch 3/100
15/15 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - accuracy: 0.6437 - loss: 0.6413
Epoch 4/100
15/15 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - accuracy: 0.6569 - loss: 0.6209
Epoch 5/100
15/15 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - accuracy: 0.7406 - loss: 0.5809
Epoch 6/100
15/15 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - accuracy: 0.7786 - loss: 0.5698
Epoch 7/100
15/15 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - accuracy: 0.7695 - loss: 0.5648
Epoch 8/100
15/15 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - accuracy: 0.8004 - loss: 0.5358
Epoch 9/100
15/15 ━━━━━━━━━━━━

***
#### **9. Generating a classification report**
- Used the `Model.predict()` function to get the predicted class probabilities for the test data `X_test` with the trained model instance `model`. Assigned the result to a variable called `y_estimate`.
- Used the `numpy.argmax()` function to select the index of the most probable (predicted) class for each row in `y_estimate`. Assigned the result back to `y_estimate`.
- Used the `numpy.argmax()` function to select the index of the true class for each one-hot label in `Y_test`. Assigned the result to a variable called `y_true`.
- Printed additional metrics, such as the F1-score, using the `sklearn.metrics.classification_report()` function with the `y_true` and `y_estimate` vectors as inputs.

In [9]:
# Predicted class probabilities for each test row
y_estimate = model.predict(X_test, verbose=1)
# Convert probabilities and one-hot labels back to class indices
y_estimate = np.argmax(y_estimate, axis=1)
y_true = np.argmax(Y_test, axis=1)
print(classification_report(y_true, y_estimate))

2/2 ━━━━━━━━━━━━━━━━━━━━ 0s 87ms/step
              precision    recall  f1-score   support

           0       0.79      1.00      0.88        37
           1       1.00      0.57      0.72        23

    accuracy                           0.83        60
   macro avg       0.89      0.78      0.80        60
weighted avg       0.87      0.83      0.82        60
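
The report can be cross-checked by hand. The confusion counts below are not printed by the notebook; they are inferred from the rounded precision, recall, and support values above (13 of 23 is the only whole number of deaths that rounds to a recall of 0.57), so treat them as a reconstruction:

```python
# Inferred counts: rows are true classes, columns are predicted classes
tn, fp = 37, 0    # true class 0 (survived): all 37 predicted correctly
fn, tp = 10, 13   # true class 1 (died): 13 caught, 10 missed

# Class 1 (death) metrics
precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# Class 0 (survival) metrics
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)
f1_0 = 2 * precision_0 * recall_0 / (precision_0 + recall_0)

accuracy = (tp + tn) / (tp + tn + fp + fn)

print(round(precision_0, 2), round(recall_0, 2), round(f1_0, 2))
print(round(precision_1, 2), round(recall_1, 2), round(f1_1, 2))
print(round(accuracy, 2))
```

Read this way, the model never raises a false alarm on this test set but misses 10 of the 23 deaths, which is worth weighing against the headline accuracy of 0.83 in a clinical setting.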

