### Author: Andrés Felipe Sánchez Arias
Date: Jun-03-2024
Last updated: Jun-04-2024

#### Predicting Heart Disease with Random Forest Classifier

This Jupyter notebook uses a Random Forest classifier to predict whether a patient has heart disease from attributes such as age, sex, chest pain type, resting blood pressure, and cholesterol level.

The workflow starts with data preprocessing, where categorical variables are converted into dummy variables using one-hot encoding. The dataset is then split into training and testing sets, and the Random Forest classifier is fitted on the training portion.
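The two preprocessing steps described above can be sketched on a toy frame. This is a minimal illustration rather than the notebook's actual cell: the frame `toy` and its reduced set of columns are assumptions made for brevity, with values copied from the first rows of the dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame (an assumption for illustration); column names follow the heart dataset
toy = pd.DataFrame({
    "Age": [40, 49, 37, 48, 54],
    "Sex": ["M", "F", "M", "F", "M"],
    "ST_Slope": ["Up", "Flat", "Up", "Flat", "Up"],
    "HeartDisease": [0, 1, 0, 1, 0],
})

features = toy.drop("HeartDisease", axis=1)
target = toy["HeartDisease"]

# Each categorical column becomes one indicator column per observed category
encoded = pd.get_dummies(features, columns=["Sex", "ST_Slope"])
print(list(encoded.columns))

# Hold out 20% of the rows for testing; random_state fixes the shuffle
x_train, x_test, y_train, y_test = train_test_split(
    encoded, target, test_size=0.2, random_state=42
)
print(len(x_train), len(x_test))
```

Note that `get_dummies` only creates columns for categories that actually occur in the data it is given, which matters when a single new record is encoded later.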

Once trained, the model is demonstrated on a randomly generated patient record, for which it predicts the presence or absence of heart disease.
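Encoding a single record yields only the indicator columns for that record's own categories, so its columns must be aligned with the training columns before prediction. One way to do this, sketched here with hypothetical frames `train` and `new_patient`, is `DataFrame.reindex`:

```python
import pandas as pd

# Hypothetical training frame, already one-hot encoded
train = pd.get_dummies(
    pd.DataFrame({
        "Age": [40, 49, 37],
        "Sex": ["M", "F", "M"],
        "ST_Slope": ["Up", "Flat", "Down"],
    }),
    columns=["Sex", "ST_Slope"],
)

# A single new patient only produces columns for its own categories
new_patient = pd.get_dummies(
    pd.DataFrame([{"Age": 57, "Sex": "F", "ST_Slope": "Flat"}]),
    columns=["Sex", "ST_Slope"],
)

# reindex adds the missing columns (filled with 0) and
# puts all columns in the same order as the training frame
aligned = new_patient.reindex(columns=train.columns, fill_value=0)
print(aligned.to_dict(orient="records")[0])
```

Because alignment keeps only the training column names, a column whose name does not match exactly (including letter case) is dropped without an error, so the keys of the new record need to match the dataset's column names.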

In [1]:
import os
import sys

# NumPy is used to sample the random patient record
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# train_test_split divides the data into training and testing subsets
from sklearn.model_selection import train_test_split

# Make modules in the parent directory importable
sys.path.append(os.path.abspath('../'))

In [2]:
input_csv = '../csv/heart.csv'

In [3]:
data = pd.read_csv(input_csv)

In [4]:
data.head()

Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
0,40,M,ATA,140,289,0,Normal,172,N,0.0,Up,0
1,49,F,NAP,160,180,0,Normal,156,N,1.0,Flat,1
2,37,M,ATA,130,283,0,ST,98,N,0.0,Up,0
3,48,F,ASY,138,214,0,Normal,108,Y,1.5,Flat,1
4,54,M,NAP,150,195,0,Normal,122,N,0.0,Up,0


In [5]:
# Extract features by removing the target variable 'HeartDisease'
x = data.drop("HeartDisease", axis=1)

# Extract the target variable 'HeartDisease'
y = data["HeartDisease"]

# Define categories for categorical features
categories = {
    "ChestPainType": ['ATA', 'NAP', 'ASY', 'TA'],
    "Sex": ['M', 'F'],
    "RestingECG": ['Normal', 'ST', 'LVH'],
    "ExerciseAngina": ['N', 'Y'],
    "ST_Slope": ['Up', 'Flat', 'Down']
}
# Convert categorical variables into dummy/indicator variables
# (pass the column names as an explicit list rather than a dict view)
x = pd.get_dummies(x, columns=list(categories))

# Split data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.2, random_state=42)

# Initialize Random Forest classifier
rf_classifier = RandomForestClassifier()

# Train the classifier using the training data
rf_classifier.fit(x_train, y_train)

# Generate random patient data within specified ranges for each feature
random_data = {
    "Age": np.random.randint(20,80),
    "Sex": np.random.choice(categories["Sex"]),
    "ChestPainType": np.random.choice(categories["ChestPainType"]),
    "RestingBP": np.random.randint(100,200),
    "Cholesterol": np.random.randint(100,300),
    "FastingBS": np.random.choice([0,1]),
    "RestingECG": np.random.choice(categories["RestingECG"]),
    "MaxHR": np.random.randint(60,220),
    "ExerciseAngina": np.random.choice(categories["ExerciseAngina"]),
    # The key must match the dataset column name "Oldpeak" exactly
    "Oldpeak": np.random.uniform(0,5),
    "ST_Slope":  np.random.choice(categories["ST_Slope"])
}

# Create a one-row DataFrame from the random patient data and one-hot encode its categorical variables
random_df = pd.DataFrame([random_data])
random_df = pd.get_dummies(random_df, columns=list(categories))

# Handle missing features in the random DataFrame by adding dummy columns and setting their values to 0
missing_features = set(x_train.columns) - set(random_df.columns)
for feature in missing_features:
    random_df[feature] = 0

# Reorder columns to match the training data
random_df = random_df[x_train.columns]

# Predict whether the randomly generated patient has heart disease or not
random_prediction = rf_classifier.predict(random_df)

# Function to print the randomly generated patient data
def print_features(random_data):
    for feature, value in random_data.items():
        print(f"{feature} = {value}")

# Print the randomly generated patient data
print("Randomly Generated Patient Data: ")
print_features(random_data)

# Output the prediction result based on the random patient data
if random_prediction[0] == 1:
    print("Heart Disease Detected")
else:
    print("Normal")

Randomly Generated Patient Data: 
Age = 57
Sex = F
ChestPainType = TA
RestingBP = 150
Cholesterol = 266
FastingBS = 1
RestingECG = Normal
MaxHR = 135
ExerciseAngina = N
OldPeak = 2.0565801286568104
ST_Slope = Flat
Heart Disease Detected
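
The held-out test set produced by `train_test_split` can also be used to check how well the classifier generalizes. The following is a minimal sketch under stated assumptions: it uses synthetic data from `make_classification` as a stand-in for `heart.csv`, and the names `model`, `accuracy`, and `proba` are illustrative, not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption); the notebook would use x and y from heart.csv
X, y = make_classification(n_samples=500, n_features=11, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fixing random_state makes the fitted forest reproducible between runs
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Accuracy on rows the model never saw during training
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Held-out accuracy: {accuracy:.3f}")

# Class probabilities for one held-out row: [P(class 0), P(class 1)]
proba = model.predict_proba(X_test[:1])[0]
print(proba)
```

`predict_proba` returns an estimated probability per class, which may be more informative than the hard 0/1 label when the goal is to express how likely heart disease is.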
