# Classification

In this project, you will use a dataset from Kaggle to predict the survival of patients with heart failure from serum creatinine and ejection fraction, and other factors such as age, anemia, diabetes, and so on.
Cardiovascular diseases (CVDs) are the number 1 cause of death globally, taking an estimated 17.9 million lives each year, which accounts for 31% of all deaths worldwide. Heart failure is a common event caused by CVDs, and this dataset contains 12 features that can be used to predict mortality by heart failure.
Most cardiovascular diseases can be prevented by addressing behavioral risk factors such as tobacco use, unhealthy diet and obesity, physical inactivity, and harmful alcohol use through population-wide strategies.
People with cardiovascular disease, or at high cardiovascular risk (due to one or more risk factors such as hypertension, diabetes, or hyperlipidemia, or to already established disease), need early detection and management, and this is where a machine learning model can help.

Using pandas.read_csv(), load the data from heart_failure.csv to a pandas DataFrame object. Assign the resulting DataFrame to a variable called data.

Use the DataFrame.info() method to print the columns of the DataFrame instance data along with their types.

In [2]:
import pandas as pd
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from collections import Counter
from sklearn.compose import ColumnTransformer
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, InputLayer
from sklearn.metrics import classification_report
from tensorflow.keras.utils import to_categorical
import numpy as np

data = pd.read_csv('heart_failure.csv')

# Call info() (with parentheses) to print column names, non-null counts, and dtypes
data.info()
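
DataFrame.info is a method, so it has to be called with parentheses; printing the bare attribute only shows the bound-method representation. A minimal sketch, assuming a tiny in-memory CSV with made-up rows (the column names follow the dataset, the values are illustrative only):

```python
import io
import pandas as pd

# Hypothetical stand-in for heart_failure.csv (values are illustrative only)
csv_text = "age,anaemia,death_event\n75.0,no,yes\n55.0,no,yes\n65.0,yes,no\n"
toy = pd.read_csv(io.StringIO(csv_text))

# info() writes its summary to a buffer (stdout by default) and returns None
buffer = io.StringIO()
result = toy.info(buf=buffer)
summary = buffer.getvalue()
print(summary)

# Without parentheses you only get a reference to the method itself
unbound_repr = repr(toy.info)
print(unbound_repr.startswith("<bound method"))
```

Because info() returns None, wrapping it in print() is unnecessary; calling it directly is enough.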

Print the distribution of the death_event column in the DataFrame instance data using collections.Counter. This is the column you will need to predict.

Extract the label column death_event from the data DataFrame and assign the result to a variable called y.

Extract the feature columns ['age','anaemia','creatinine_phosphokinase','diabetes','ejection_fraction','high_blood_pressure','platelets','serum_creatinine','serum_sodium','sex','smoking','time'] from the DataFrame instance data and assign the result to a variable called x.

In [3]:
print(Counter(data['death_event']))

y=data['death_event']

x=data[['age','anaemia','creatinine_phosphokinase','diabetes','ejection_fraction','high_blood_pressure','platelets','serum_creatinine','serum_sodium','sex','smoking','time']]

Counter({'no': 203, 'yes': 96})
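
The class counts are imbalanced, which matters when reading accuracy later. As a rough reference point, the sketch below computes the accuracy of a trivial model that always predicts the majority class, using only the counts printed above (the variable names are illustrative):

```python
from collections import Counter

# Class counts reported by Counter(data['death_event'])
counts = Counter({"no": 203, "yes": 96})

total = sum(counts.values())
majority_label, majority_count = counts.most_common(1)[0]

# Accuracy of a trivial model that always predicts the majority class
baseline_accuracy = majority_count / total
print(majority_label, round(baseline_accuracy, 3))  # no 0.679
```

Any trained model should be judged against this baseline rather than against 50%.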


Use the pandas.get_dummies() function to convert the categorical features in the DataFrame instance x to one-hot encoding vectors and assign the result back to variable x.

Use the sklearn.model_selection.train_test_split() function to split the data into training features, test features, training labels, and test labels. Set the test_size parameter to the fraction of the data to hold out for testing (for example, 0.3), and use any value for the random_state parameter. Store the results in the variables X_train, X_test, Y_train, Y_test, in this order.

In [4]:
x  = pd.get_dummies(x)

X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size = 0.3, random_state = 0)
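
To make the two steps above concrete, here is a small sketch on a made-up three-row frame (column names follow the dataset, values are illustrative). It shows how get_dummies expands text columns, and how a test_size of 0.3 translates into row counts for 299 samples, assuming the test count is rounded up as scikit-learn does for a fractional test_size:

```python
import math
import pandas as pd

# Hypothetical rows; column names follow the dataset, values are illustrative
toy = pd.DataFrame({
    "age": [75.0, 55.0, 65.0],
    "anaemia": ["no", "no", "yes"],
    "smoking": ["yes", "no", "no"],
})

# Numeric columns pass through; each text column becomes one column per category
encoded = pd.get_dummies(toy)
print(list(encoded.columns))
# ['age', 'anaemia_no', 'anaemia_yes', 'smoking_no', 'smoking_yes']

# With 299 rows and test_size=0.3, the test set size is rounded up
n_samples = 299
n_test = math.ceil(0.3 * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)  # 209 90
```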

Initialize a ColumnTransformer object by using StandardScaler to scale the numeric features in the dataset: ['age','creatinine_phosphokinase','ejection_fraction','platelets','serum_creatinine','serum_sodium','time']. Assign the resulting object to a variable called ct.


Use the ColumnTransformer.fit_transform() method to fit the scaler instance ct on the training data X_train and scale it in the same step, then assign the result back to X_train.

Use the ColumnTransformer.transform() method to scale the test data X_test with the already fitted scaler ct, and assign the result back to X_test. The scaler is fitted on the training data only, so no statistics from the test set leak into training.

In [5]:
ct = ColumnTransformer([("numeric", StandardScaler(),['age','creatinine_phosphokinase','ejection_fraction','platelets','serum_creatinine','serum_sodium','time'])])

X_train = ct.fit_transform(X_train)

X_test = ct.transform(X_test)
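
One detail worth checking: ColumnTransformer defaults to remainder='drop', so columns that are not listed (here, the one-hot columns produced by get_dummies) are left out of the transformed array. The sketch below, on a made-up three-row frame with a hypothetical sex_M column, compares the default with remainder='passthrough'. Whether to keep the extra columns is a modeling choice; the cell above uses the default.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Illustrative data: two numeric columns and one already-encoded column
train = pd.DataFrame({"age": [50.0, 60.0, 70.0],
                      "time": [10.0, 20.0, 30.0],
                      "sex_M": [1, 0, 1]})
test = pd.DataFrame({"age": [60.0], "time": [40.0], "sex_M": [0]})

numeric = ["age", "time"]
ct_drop = ColumnTransformer([("numeric", StandardScaler(), numeric)])
ct_keep = ColumnTransformer([("numeric", StandardScaler(), numeric)],
                            remainder="passthrough")

train_drop = ct_drop.fit_transform(train)  # only the scaled columns survive
train_keep = ct_keep.fit_transform(train)  # scaled columns, then untouched ones
test_keep = ct_keep.transform(test)        # uses the training mean and std

print(train_drop.shape, train_keep.shape)  # (3, 2) (3, 3)
print(test_keep)
```

The test row is scaled with the mean and standard deviation learned from the training rows, which is why its age of 60 maps to 0.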

Initialize an instance of LabelEncoder and assign it to a variable called le.


Using the LabelEncoder.fit_transform() function, fit the encoder instance le to the training labels Y_train, while at the same time converting the training labels according to the trained encoder.

Using the LabelEncoder.transform() function, encode the test labels Y_test using the trained encoder le.




In [7]:
le = LabelEncoder()
Y_train  = le.fit_transform(Y_train.astype(str))
Y_test  = le.transform(Y_test.astype(str))
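
As a small illustration of what the encoder learns, the sketch below fits a LabelEncoder on made-up labels. The variable names are illustrative; the point is that classes are sorted, so 'no' maps to 0 and 'yes' maps to 1, and the same mapping is reused for the test labels.

```python
from sklearn.preprocessing import LabelEncoder

train_labels = ["no", "yes", "no", "no"]  # illustrative labels
test_labels = ["yes", "no"]

encoder = LabelEncoder()
train_encoded = encoder.fit_transform(train_labels)  # learns the classes, then encodes
test_encoded = encoder.transform(test_labels)        # reuses the learned mapping

print(encoder.classes_.tolist())   # ['no', 'yes']
print(train_encoded.tolist())      # [0, 1, 0, 0]
print(test_encoded.tolist())       # [1, 0]
print(encoder.inverse_transform([1, 0]).tolist())  # ['yes', 'no']
```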

Using the tensorflow.keras.utils.to_categorical() function, convert the integer-encoded training labels Y_train into one-hot form (a binary class matrix with one column per class) and assign the result back to Y_train.

Do the same for the encoded test labels Y_test, assigning the result back to Y_test.

In [8]:
Y_train = to_categorical(Y_train)
Y_test = to_categorical(Y_test)
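
The one-hot layout can be reproduced with plain NumPy, which makes the shape of the result easy to see. This is a stand-in for illustration, not the Keras function itself:

```python
import numpy as np

encoded = np.array([0, 1, 0, 1])  # integer labels, e.g. no=0, yes=1
num_classes = 2

# Row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(num_classes, dtype="float32")[encoded]
print(one_hot)

# argmax along axis 1 recovers the integer labels
recovered = one_hot.argmax(axis=1)
print(recovered.tolist())  # [0, 1, 0, 1]
```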

Initialize a tensorflow.keras.models.Sequential model instance called model.

Create an input layer instance of tensorflow.keras.layers.InputLayer, sized to the number of feature columns in X_train, and add it to the model instance model using the Sequential.add() method.

Create a hidden layer instance of tensorflow.keras.layers.Dense with relu activation function and 12 hidden neurons, and add it to the model instance model.

Create an output layer instance of tensorflow.keras.layers.Dense with a softmax activation function (so that the outputs form a probability distribution over the classes) and with as many neurons as there are classes in the dataset, and add it to the model instance model.

In [9]:
model = Sequential()

model.add(InputLayer(input_shape=(X_train.shape[1],)))

model.add(Dense(12, activation='relu'))

model.add(Dense(2, activation='softmax'))
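
To see what this architecture computes, here is a NumPy sketch of the forward pass with randomly initialised weights standing in for trained ones. It assumes 7 input features (the seven scaled columns); the names W1, b1, W2, b2, and forward are illustrative and not part of Keras.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_hidden, n_classes = 7, 12, 2  # 7 assumes only the scaled columns are fed in

# Randomly initialised weights stand in for trained ones
W1, b1 = rng.normal(size=(n_features, n_hidden)), np.zeros(n_hidden)
W2, b2 = rng.normal(size=(n_hidden, n_classes)), np.zeros(n_classes)

def forward(x):
    hidden = np.maximum(0.0, x @ W1 + b1)   # Dense(12, activation='relu')
    logits = hidden @ W2 + b2               # Dense(2) before the softmax
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))  # numerically stable softmax
    return exp / exp.sum(axis=1, keepdims=True)

probs = forward(rng.normal(size=(4, n_features)))
n_params = W1.size + b1.size + W2.size + b2.size
print(probs.sum(axis=1))  # each row sums to 1
print(n_params)           # 7*12 + 12 + 12*2 + 2 = 122
```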

Using the Model.compile() function, compile the model instance model using the categorical_crossentropy loss, adam optimizer and accuracy as metrics.

Using the Model.fit() function, fit the model instance model to the training data X_train and training labels Y_train. Set the number of epochs to 100 and the batch size parameter to 16.

Using the Model.evaluate() function, evaluate the trained model instance model on the test data X_test and test labels Y_test. Assign the result to a variable called loss (representing the final loss value) and a variable called acc (representing the accuracy metrics), respectively.

In [10]:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

model.fit(X_train , Y_train, epochs = 100, batch_size = 16, verbose=1)

loss, acc = model.evaluate(X_test , Y_test, verbose=0)
print("Loss", loss, "Accuracy:", acc)

Epoch 1/100
...
Epoch 100/100
Loss 0.414075642824173 Accuracy: 0.8222222328186035
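
For intuition about the two reported numbers, the sketch below computes categorical cross-entropy and accuracy by hand on two made-up predictions. It ignores the small clipping Keras applies for numerical safety, so treat it as an approximation of the quantity rather than the library's exact computation.

```python
import numpy as np

y_true = np.array([[1.0, 0.0], [0.0, 1.0]])  # one-hot labels
y_prob = np.array([[0.9, 0.1], [0.2, 0.8]])  # softmax outputs (made up)

# Mean over samples of -log(probability assigned to the true class)
loss = float(-(y_true * np.log(y_prob)).sum(axis=1).mean())

# Fraction of samples whose most probable class matches the true class
accuracy = float((y_prob.argmax(axis=1) == y_true.argmax(axis=1)).mean())

print(round(loss, 4), accuracy)  # 0.1643 1.0
```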


Use the Model.predict() method to get the predictions for the test data X_test with the trained model instance model. Assign the result to a variable called y_estimate.

Use the numpy.argmax() function to select the index of the predicted class (the largest probability) for each row in y_estimate. Assign the result back to y_estimate.

Use the numpy.argmax() function to recover the index of the true class from each one-hot row in Y_test. Assign the result to a variable called y_true.

Print additional metrics, such as F1-score, using the sklearn.metrics.classification_report() function, passing y_true as the first argument and y_estimate as the second.

In [11]:
y_estimate = model.predict(X_test, verbose=0)
y_estimate = np.argmax(y_estimate, axis = 1)
y_true = np.argmax(Y_test, axis=1)

# Compare the true labels with the predictions (true labels first)
print(classification_report(y_true, y_estimate))
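
The report is only meaningful when the true labels are compared with the predictions; comparing a vector with itself gives perfect scores by construction. A hand-rolled sketch on made-up probabilities shows both cases (f1_for_class is a hypothetical helper written for this example, not a scikit-learn function):

```python
import numpy as np

y_true = np.array([0, 0, 0, 1, 1])
y_prob = np.array([[0.8, 0.2], [0.6, 0.4], [0.3, 0.7],
                   [0.1, 0.9], [0.7, 0.3]])  # made-up softmax outputs
y_pred = y_prob.argmax(axis=1)  # predicted class = index of the largest probability

def f1_for_class(truth, pred, cls):
    """Return (precision, recall, f1) for one class."""
    tp = int(((pred == cls) & (truth == cls)).sum())
    fp = int(((pred == cls) & (truth != cls)).sum())
    fn = int(((pred != cls) & (truth == cls)).sum())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(f1_for_class(y_true, y_pred, 1))  # (0.5, 0.5, 0.5)
print(f1_for_class(y_pred, y_pred, 1))  # (1.0, 1.0, 1.0): predictions always match themselves
```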

