# Heart Failure Prediction
Eduardo Cruz

Dataset Link: https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction

A Random Forest algorithm was used to build a prediction model that determines whether or not a patient has heart disease.

In [1]:
# Import NumPy and pandas
import numpy as np
import pandas as pd

# Import the RandomForestClassifier class
from sklearn.ensemble import RandomForestClassifier

### Reading the Data File and Assigning it to a Pandas DataFrame

In [2]:
# "read_csv" is a pandas function that reads the data file "heart.csv" into a DataFrame
df = pd.read_csv("heart.csv")

# Check that the dataset loaded correctly by displaying every 10th row:
df[0::10]

Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
0,40,M,ATA,140,289,0,Normal,172,N,0.0,Up,0
10,37,F,NAP,130,211,0,Normal,142,N,0.0,Up,0
20,43,F,TA,100,223,0,Normal,142,N,0.0,Up,0
30,53,M,NAP,145,518,0,Normal,130,N,0.0,Flat,1
40,54,F,ATA,150,230,0,Normal,130,N,0.0,Up,0
...,...,...,...,...,...,...,...,...,...,...,...,...
870,71,F,ATA,160,302,0,Normal,162,N,0.4,Up,0
880,52,M,NAP,172,199,1,Normal,162,N,0.5,Up,0
890,64,M,TA,170,227,0,LVH,155,N,0.6,Flat,0
900,58,M,ASY,114,318,0,ST,140,N,4.4,Down,1


The DataFrame contains categorical columns (Sex, ChestPainType, RestingECG, ExerciseAngina, and ST_Slope) that must be converted into binary
indicator columns before our prediction model can use them. This is achieved by one-hot encoding.
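
As a small-scale illustration of what one-hot encoding does, the sketch below applies `pd.get_dummies` to a hypothetical three-row table. The `toy` DataFrame and its values are made up for illustration and are not taken from `heart.csv`; `dtype=int` is passed only so the output is 0/1 regardless of pandas version.

```python
import pandas as pd

# Hypothetical three-row table; the values are illustrative, not from heart.csv
toy = pd.DataFrame({"Age": [40, 49, 37], "Sex": ["M", "F", "M"]})

# Replace the "Sex" column with one indicator column per category.
# dtype=int forces 0/1 output (newer pandas versions default to True/False).
encoded = pd.get_dummies(toy, columns=["Sex"], dtype=int)

print(encoded.columns.tolist())   # ['Age', 'Sex_F', 'Sex_M']
print(encoded["Sex_M"].tolist())  # [1, 0, 1]
```

Each original category becomes its own column, and exactly one of those columns is 1 in every row.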

In [3]:
# Create a list of the categorical columns in our DataFrame
categorical_cols = ['Sex','ChestPainType','RestingECG','ExerciseAngina','ST_Slope']

# One-hot encode the categorical columns so the prediction model receives numeric (0/1) features
updated_df = pd.get_dummies(df, columns=categorical_cols)

updated_df[0::10]

Unnamed: 0,Age,RestingBP,Cholesterol,FastingBS,MaxHR,Oldpeak,HeartDisease,Sex_F,Sex_M,ChestPainType_ASY,...,ChestPainType_NAP,ChestPainType_TA,RestingECG_LVH,RestingECG_Normal,RestingECG_ST,ExerciseAngina_N,ExerciseAngina_Y,ST_Slope_Down,ST_Slope_Flat,ST_Slope_Up
0,40,140,289,0,172,0.0,0,0,1,0,...,0,0,0,1,0,1,0,0,0,1
10,37,130,211,0,142,0.0,0,1,0,0,...,1,0,0,1,0,1,0,0,0,1
20,43,100,223,0,142,0.0,0,1,0,0,...,0,1,0,1,0,1,0,0,0,1
30,53,145,518,0,130,0.0,1,0,1,0,...,1,0,0,1,0,1,0,0,1,0
40,54,150,230,0,130,0.0,0,1,0,0,...,0,0,0,1,0,1,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
870,71,160,302,0,162,0.4,0,1,0,0,...,0,0,0,1,0,1,0,0,0,1
880,52,172,199,1,162,0.5,0,0,1,0,...,1,0,0,1,0,1,0,0,0,1
890,64,170,227,0,155,0.6,0,0,1,0,...,0,1,1,0,0,1,0,0,1,0
900,58,114,318,0,140,4.4,1,0,1,1,...,0,0,0,0,1,1,0,1,0,0


In [4]:
# Create a Python list of feature names
# Note: 'ChestPainType_ATA' is not included in this list, so X ends up with 19 columns
feature_cols = ['Age','Sex_F', 'Sex_M','ChestPainType_ASY','ChestPainType_NAP','ChestPainType_TA',
                'RestingBP','Cholesterol','FastingBS','RestingECG_LVH','RestingECG_Normal','RestingECG_ST',
                'MaxHR','ExerciseAngina_N','ExerciseAngina_Y','Oldpeak','ST_Slope_Down','ST_Slope_Flat','ST_Slope_Up']

# Use the above list of feature names to select the features from the DataFrame
X = updated_df[feature_cols]

# Print the first 5 rows from the DataFrame
X.head()

Unnamed: 0,Age,Sex_F,Sex_M,ChestPainType_ASY,ChestPainType_NAP,ChestPainType_TA,RestingBP,Cholesterol,FastingBS,RestingECG_LVH,RestingECG_Normal,RestingECG_ST,MaxHR,ExerciseAngina_N,ExerciseAngina_Y,Oldpeak,ST_Slope_Down,ST_Slope_Flat,ST_Slope_Up
0,40,0,1,0,0,0,140,289,0,0,1,0,172,1,0,0.0,0,0,1
1,49,1,0,0,1,0,160,180,0,0,1,0,156,1,0,1.0,0,1,0
2,37,0,1,0,0,0,130,283,0,0,0,1,98,1,0,0.0,0,0,1
3,48,1,0,1,0,0,138,214,0,0,1,0,108,0,1,1.5,0,1,0
4,54,0,1,0,1,0,150,195,0,0,1,0,122,1,0,0.0,0,0,1


In [5]:
# Select the label column ("HeartDisease", the last column) from the DataFrame
y = df['HeartDisease']

# Check the label vector by displaying every 10th value
y[::10]

0      0
10     0
20     0
30     1
40     0
      ..
870    0
880    0
890    0
900    1
910    0
Name: HeartDisease, Length: 92, dtype: int64

### Splitting Dataset into Testing and Training Sets:

In [6]:
from sklearn.model_selection import train_test_split

# Now we randomly split the original dataset into training and testing sets.
# The function "train_test_split" from the "sklearn.model_selection" module performs the random split.
# "test_size=0.35" means that 35% of the samples go to the testing set and the rest (65%) to the training set.
# "random_state=3" fixes the random seed so the split is reproducible.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.35, random_state=3)

X_train.shape

(596, 19)

In [7]:
# print the size of the training set (65%):
print(X_train.shape)
print(y_train.shape)

(596, 19)
(596,)


In [8]:
# print the size of the testing set (35%):
print(X_test.shape)
print(y_test.shape)

(322, 19)
(322,)
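
The sizes above follow from the row count: 596 + 322 = 918 samples, and 35% of 918 is 321.3, which `train_test_split` rounds up to 322 test rows. A minimal sketch to check this arithmetic, using placeholder arrays of zeros (`X_demo` and `y_demo` are assumptions standing in for the real data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows (918) and features (19) as above
X_demo = np.zeros((918, 19))
y_demo = np.zeros(918, dtype=int)

# Same split parameters as in the notebook
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.35, random_state=3)

print(X_tr.shape, X_te.shape)  # (596, 19) (322, 19)
```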


### Using Random Forest Classifier to Predict Heart Failure:

In [9]:
from sklearn.ensemble import RandomForestClassifier

my_RandomForest = RandomForestClassifier(n_estimators=19, bootstrap=True, random_state=3)

# Training ONLY on the training set:
my_RandomForest.fit(X_train, y_train)

# Testing on the testing set:
y_predict_dt = my_RandomForest.predict(X_test)

print(y_predict_dt)

[1 1 1 1 0 1 1 1 0 0 0 0 0 0 1 0 1 1 1 1 1 1 1 1 0 1 1 0 0 1 1 0 1 0 0 0 1
 0 1 1 1 1 0 0 0 1 1 0 1 1 1 1 1 0 1 1 0 1 1 1 0 1 0 1 0 0 0 1 0 1 0 0 0 1
 0 0 1 1 1 0 1 0 1 1 0 1 0 1 0 1 0 1 0 0 1 1 0 0 0 0 1 1 1 0 0 1 1 1 1 0 1
 1 0 0 0 1 0 0 1 0 1 1 0 0 0 1 0 0 0 0 1 0 0 1 0 1 1 0 0 1 1 0 0 0 1 0 0 0
 1 1 1 1 1 0 1 0 1 1 0 1 0 0 0 1 0 1 0 0 1 0 1 1 1 1 0 0 1 1 1 0 0 1 0 0 1
 1 1 0 1 1 1 1 1 1 1 0 0 0 1 0 1 0 1 0 1 1 1 0 1 1 1 0 0 1 0 0 0 0 1 0 0 1
 1 0 0 1 1 0 0 1 1 1 1 0 1 0 1 1 1 1 0 0 1 1 0 1 1 1 0 1 1 1 1 1 1 1 1 1 0
 1 1 0 1 0 0 1 0 0 0 1 1 0 0 0 1 1 0 1 1 1 0 1 1 1 1 0 0 1 0 1 1 1 0 0 1 0
 1 1 1 1 1 1 0 0 1 1 0 1 1 1 1 1 0 1 1 0 1 0 0 0 0 1]
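
For intuition about how these predictions are produced: scikit-learn's random forest combines its individual decision trees by averaging their predicted class probabilities and then picking the most probable class. The sketch below checks this on synthetic data; `make_classification` and the names `X_demo`, `y_demo`, and `forest` are assumptions for illustration and do not use the heart dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption: not the heart dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=3)

# Same hyperparameters as the model above
forest = RandomForestClassifier(n_estimators=19, bootstrap=True, random_state=3)
forest.fit(X_demo, y_demo)

# Average the class probabilities predicted by each individual tree
tree_probas = np.mean(
    [tree.predict_proba(X_demo) for tree in forest.estimators_], axis=0
)

# Compare the manual average with the forest's own probabilities
same_proba = np.allclose(tree_probas, forest.predict_proba(X_demo))

# The predicted class is the one with the highest averaged probability
manual_pred = forest.classes_[np.argmax(tree_probas, axis=1)]
same_pred = np.array_equal(manual_pred, forest.predict(X_demo))

print(same_proba, same_pred)
```

With `n_estimators=19`, each prediction is therefore the combined verdict of 19 trees, each trained on a bootstrap sample of the training set.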


### Accuracy Evaluation of Random Forest Method:

In [10]:
from sklearn.metrics import accuracy_score

score_dt = accuracy_score(y_test, y_predict_dt)

print(score_dt)

0.8757763975155279
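
Accuracy is simply the fraction of test samples whose predicted label matches the true label. A minimal sketch with made-up label vectors (`y_true_demo` and `y_pred_demo` are illustrative, not taken from the model above):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Illustrative labels: 4 of the 5 predictions match the true labels
y_true_demo = np.array([1, 0, 1, 1, 0])
y_pred_demo = np.array([1, 0, 0, 1, 0])

# Manual accuracy: share of positions where prediction equals truth
manual = np.mean(y_true_demo == y_pred_demo)
sk_score = accuracy_score(y_true_demo, y_pred_demo)

print(manual, sk_score)

# The score printed above corresponds to 282 correct predictions out of 322 test samples
print(282 / 322)
```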


Therefore, the random forest model achieved an accuracy of about 88% (0.876) on the testing set.