<a id="top"></a>
# Heart Failure Prediction
# Random Forest Classification
## Contents

* <a href="#Dependencies">Dependencies</a>
* <a href="#Load">Loading the Data</a>
* <a href="#ModelPrep">Model Preparation</a>
    <!-- * <a href="#FeatSel">Feature Selection</a>
    * <a href="#Scale">Scaling</a>
    * <a href="#TTSplit">Train/Test Split</a>
    * <a href="#Tune">Hyperparameter Tuning</a>
    * <a href="#Train">Training</a>
    * <a href="#Eval">Evaluate Models</a>
    * <a href="#Best">Choose Best Model</a> -->
* <a href="#ModelExport">Model Export</a>
* <a href="#Other">Other</a>
* <a href="#Cite">Citations</a>

----
<a id="Dependencies"></a>
<a href="#top">Back to Top</a>
## Dependencies

In [20]:
# Loading the Data
import pandas as pd
# Model Preparation
from sklearn.model_selection import cross_val_predict, cross_val_score #, StratifiedKFold, GridSearchCV
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report #, precision_score, recall_score, f1_score, roc_auc_score
# import numpy as np
# import os
# Export & Explain
# import shap
import joblib
# from tqdm import tqdm
# import time

----
<a id="Load"></a>
<a href="#top">Back to Top</a>
## Loading the Data
Make sure the data is clean and preprocessed before modeling: handle missing values, encode categorical variables where necessary, and address any outliers.
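The checks described above can be sketched as follows. This is a minimal illustration on a hypothetical toy frame (`toy` and its values are made up for the example, not taken from `heart.csv`), and treating a cholesterol reading of 0 as suspicious is an assumption rather than a rule from this notebook.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
toy = pd.DataFrame({
    "Age": [40, 49, None, 48],
    "Sex": ["M", "F", "M", "F"],
    "Cholesterol": [289, 180, 283, 0],
})

# Count missing values per column
missing_counts = toy.isna().sum()

# Non-numeric columns still need to be encoded before modeling
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

# Assumption: a cholesterol value of 0 is a placeholder, not a real measurement
zero_cholesterol = int((toy["Cholesterol"] == 0).sum())

print(missing_counts)
print("Columns to encode:", categorical_cols)
print("Rows with zero cholesterol:", zero_cholesterol)
```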

In [21]:
# file path
file_path = "datasets/heart.csv" # previously cleaned

# Load the dataset
df = pd.read_csv(file_path)

# Show all columns when displaying a DataFrame
pd.set_option('display.max_columns', None)
df.head()

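| | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope | HeartDisease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40 | M | ATA | 140 | 289 | 0 | Normal | 172 | N | 0.0 | Up | 0 |
| 1 | 49 | F | NAP | 160 | 180 | 0 | Normal | 156 | N | 1.0 | Flat | 1 |
| 2 | 37 | M | ATA | 130 | 283 | 0 | ST | 98 | N | 0.0 | Up | 0 |
| 3 | 48 | F | ASY | 138 | 214 | 0 | Normal | 108 | Y | 1.5 | Flat | 1 |
| 4 | 54 | M | NAP | 150 | 195 | 0 | Normal | 122 | N | 0.0 | Up | 0 |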


In [22]:
df.dtypes

Age                 int64
Sex                object
ChestPainType      object
RestingBP           int64
Cholesterol         int64
FastingBS           int64
RestingECG         object
MaxHR               int64
ExerciseAngina     object
Oldpeak           float64
ST_Slope           object
HeartDisease        int64
dtype: object

In [None]:
df.shape

In [23]:
# Object-dtype (categorical) columns to encode as integers
object_columns = ['Sex', 'ChestPainType', 'RestingECG', 'ExerciseAngina', 'ST_Slope']

# Replace each category with an integer code (codes are assigned in order of first appearance)
for column in object_columns:
    df[column], _ = pd.factorize(df[column])

# Print the first rows of the encoded DataFrame
print(df.head())

   Age  Sex  ChestPainType  RestingBP  Cholesterol  FastingBS  RestingECG  \
0   40    0              0        140          289          0           0   
1   49    1              1        160          180          0           0   
2   37    0              0        130          283          0           1   
3   48    1              2        138          214          0           0   
4   54    0              1        150          195          0           0   

   MaxHR  ExerciseAngina  Oldpeak  ST_Slope  HeartDisease  
0    172               0      0.0         0             0  
1    156               0      1.0         1             1  
2     98               0      0.0         0             0  
3    108               1      1.5         1             1  
4    122               0      0.0         0             0  
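The loop above discards the second return value of `pd.factorize`, which holds the original category labels. If the exported model is later used on new data, the same label-to-code mapping has to be applied again. A minimal sketch of one way to keep it, using a hypothetical toy column (`train`, `new` and `mapping` are illustrative names, not part of the notebook):

```python
import pandas as pd

# Hypothetical training column
train = pd.DataFrame({"ST_Slope": ["Up", "Flat", "Up", "Down"]})

# factorize returns the integer codes and the unique labels in order of first appearance
codes, uniques = pd.factorize(train["ST_Slope"])

# Build a label -> code dictionary that can be stored alongside the model
mapping = {category: code for code, category in enumerate(uniques)}

# Apply the same encoding to new, unseen rows
new = pd.Series(["Flat", "Down", "Up"])
encoded_new = new.map(mapping)

print(mapping)
print(encoded_new.tolist())
```

Labels that never occurred in the training column map to `NaN` here, so they need explicit handling before prediction.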


----
<a id="ModelPrep"></a>
<a href="#top">Back to Top</a>
## Model Preparation

In [24]:
# Assume 'HeartDisease' is the target variable, and other columns are features
# Adjust the feature columns accordingly based on your dataset
X = df.drop('HeartDisease', axis=1)
y = df['HeartDisease']

# Initialize the Random Forest model
rf_model = RandomForestClassifier(random_state=42)

# Perform 5-fold cross-validation
cv_scores = cross_val_score(rf_model, X, y, cv=5, scoring='accuracy')

# Print the cross-validation scores
print("Cross-Validation Scores:", cv_scores)
print(f"Mean Accuracy: {cv_scores.mean():.2f} (+/- {cv_scores.std():.2f})")

Cross-Validation Scores: [0.89130435 0.83695652 0.82608696 0.81420765 0.75956284]
Mean Accuracy: 0.83 (+/- 0.04)
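The fold scores above fall steadily from 0.89 to 0.76. One possible explanation (an assumption, not verified here) is that the rows are ordered in some way, for example by data source. With an integer `cv` and a classifier, scikit-learn uses stratified folds without shuffling, so ordered rows end up in the same folds. The sketch below shows how shuffled stratified folds could be passed instead; it runs on synthetic data from `make_classification`, and `X_demo`, `y_demo` and the smaller `n_estimators` are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the heart-disease features and target
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Shuffle rows before splitting while keeping the class balance in each fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

model = RandomForestClassifier(n_estimators=50, random_state=42)
scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")

print("Cross-Validation Scores:", scores)
print(f"Mean Accuracy: {scores.mean():.2f} (+/- {scores.std():.2f})")
```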


In [25]:
# Additional evaluation metrics, computed from out-of-fold predictions
y_pred = cross_val_predict(rf_model, X, y, cv=5)
classification_rep = classification_report(y, y_pred)
print("Classification Report:\n", classification_rep)

Classification Report:
               precision    recall  f1-score   support

           0       0.81      0.79      0.80       410
           1       0.83      0.85      0.84       508

    accuracy                           0.83       918
   macro avg       0.82      0.82      0.82       918
weighted avg       0.83      0.83      0.83       918
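A confusion matrix built from the same kind of out-of-fold predictions shows the raw counts behind the precision and recall figures. The following is a sketch on synthetic data (`X_demo` and `y_demo` are stand-ins, not the notebook's `X` and `y`):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in for the heart-disease features and target
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=42)

# Each sample is predicted by a model that did not see it during training
y_pred_demo = cross_val_predict(model, X_demo, y_demo, cv=5)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_demo, y_pred_demo)
print(cm)
```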



In [26]:
# Fit the model to the entire dataset
rf_model.fit(X, y)

RandomForestClassifier(random_state=42)

----
<a id="ModelExport"></a>
<a href="#top">Back to Top</a>
## Model Export

In [27]:
# Save the trained model to a file using joblib
model_filename = "heart_disease_model.joblib"
joblib.dump(rf_model, model_filename)

['heart_disease_model.joblib']
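A saved model can be restored with `joblib.load` and used for prediction. The sketch below checks the round trip on a small synthetic model written to a temporary directory; the file name `demo_model.joblib` and the demo data are hypothetical and separate from the model exported above.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic model used only to demonstrate the save/load round trip
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=0)
model = RandomForestClassifier(n_estimators=20, random_state=42).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.joblib")
    joblib.dump(model, path)       # write the fitted model to disk
    restored = joblib.load(path)   # read it back into memory

# The restored model should reproduce the original predictions
same_predictions = bool((restored.predict(X_demo) == model.predict(X_demo)).all())
print("Predictions match:", same_predictions)
```

New rows passed to the restored model need the same columns, in the same order and with the same integer encoding, as the data used for training.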