# Heart Disease Random Forest Classifier

This notebook trains a Random Forest Classifier to predict whether a patient has heart disease, using the feature columns in `heart.csv`.

## Step 1: Get the Data Ready

This step involves loading the dataset, splitting it into features and labels, and preparing the training and testing sets.

In [None]:
import pandas as pd
import numpy as np

heart_disease = pd.read_csv('heart.csv')

heart_disease.describe().T

The target column indicates whether the patient has heart disease (target=1) or not (target=0). This is our "label" column.
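Before modelling, it can help to check for missing values and see how balanced the label is. The snippet below is a minimal sketch on a tiny hand-made table; the `age` and `chol` columns and all the values are hypothetical stand-ins, and only the `target` column name comes from this notebook.

```python
import pandas as pd

# Hypothetical toy table standing in for heart.csv (values are made up).
demo = pd.DataFrame({
    "age": [63, 37, 41, 56, 57, 44],
    "chol": [233.0, 250.0, None, 236.0, 354.0, 263.0],
    "target": [1, 1, 1, 0, 0, 1],
})

# Number of missing values in each column.
missing_per_column = demo.isna().sum()

# Share of rows in each class of the label column.
class_share = demo["target"].value_counts(normalize=True)

print(missing_per_column)
print(class_share)
```

The same two calls can be run on `heart_disease` to decide whether imputation or a stratified split is worth considering.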

### Feature Selection

Create X (all the feature columns) and y (the target column).

In [None]:
X = heart_disease.drop("target", axis=1)
y = heart_disease["target"]

X.head(), y.head(), y.value_counts()

### Split Data into Training and Testing Set

We will hold out 25% of the rows as a test set (`test_size=0.25`) and train on the remaining 75%. Fixing `random_state` makes the split reproducible.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=43)

X_train.shape, X_test.shape, y_train.shape, y_test.shape
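The split above samples rows at random, so the share of positive labels can differ slightly between the two subsets. One option, shown here as a sketch on synthetic data rather than as a change to the notebook, is to pass `stratify` to `train_test_split`. The names `X_demo`, `y_demo`, `X_tr`, `X_te`, `y_tr`, and `y_te` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the heart disease data: 200 rows, 30% positive labels.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = np.array([1] * 60 + [0] * 140)

# stratify=y_demo keeps the class proportions the same in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=43, stratify=y_demo
)

print(f"Positive rate overall: {y_demo.mean():.2f}")
print(f"Positive rate (train): {y_tr.mean():.2f}")
print(f"Positive rate (test):  {y_te.mean():.2f}")
```

With the real data, the equivalent call would pass `stratify=y` alongside the existing arguments.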

## Step 2: Choose the Model and Hyperparameters

We will choose the Random Forest model and view its hyperparameters.

In [None]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier()

clf.get_params()
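The dictionary returned by `get_params()` lists the hyperparameters that can be set. As a brief illustration, they can be passed to the constructor or changed later with `set_params()`. The values below are arbitrary examples, not recommendations, and `clf_demo` is a hypothetical name.

```python
from sklearn.ensemble import RandomForestClassifier

# Set hyperparameters when creating the estimator (arbitrary example values).
clf_demo = RandomForestClassifier(n_estimators=200, max_depth=5, random_state=42)

params = clf_demo.get_params()
print(params["n_estimators"], params["max_depth"])

# Change a hyperparameter on an existing estimator before fitting it.
clf_demo.set_params(n_estimators=300)
print(clf_demo.get_params()["n_estimators"])
```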

## Step 3: Fit the Model with the Data and Use It to Make Predictions

We will fit the model with the training data and make predictions.

In [None]:
clf.fit(X=X_train, y=y_train)
y_preds = clf.predict(X=X_test)
y_preds
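`predict` returns only the class labels. When the confidence behind each label matters, `predict_proba` returns one probability per class. The sketch below uses synthetic data from `make_classification` instead of `heart.csv`, and the `*_demo` names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary classification data standing in for the heart disease set.
X_demo, y_demo = make_classification(n_samples=200, n_features=13, random_state=42)
clf_demo = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_demo, y_demo)

# One row per sample, one column per class; each row sums to 1.
proba = clf_demo.predict_proba(X_demo[:5])
labels = clf_demo.predict(X_demo[:5])

# The predicted label is the class with the highest probability.
from_proba = clf_demo.classes_[np.argmax(proba, axis=1)]

print(proba)
print(labels)
```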

## Step 4: Evaluate the Model

We will evaluate the model's performance using accuracy scores, classification reports, and confusion matrices.

In [None]:
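from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Mean accuracy on the data the model was fitted on and on the held-out data
train_acc = clf.score(X=X_train, y=y_train)
test_acc = clf.score(X=X_test, y=y_test)

print(f"Accuracy (Training Dataset): {train_acc * 100:.2f}%")
print(f"Accuracy (Testing Dataset): {test_acc * 100:.2f}%")
print(classification_report(y_test, y_preds))

# Rows are true labels, columns are predicted labels
conf_mat = confusion_matrix(y_test, y_preds)
print(conf_mat)

# Same value as test_acc, computed directly from the predictions
accuracy_score(y_test, y_preds)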
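To see how the confusion matrix relates to the scores in the classification report, here is a small worked example on hand-written labels. The arrays are made up for illustration and are not taken from the dataset.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels for eight patients (1 = heart disease, 0 = no heart disease).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels the matrix is [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct
precision = tp / (tp + fp)                   # share of predicted positives that are correct
recall = tp / (tp + fn)                      # share of actual positives that are found

print(tn, fp, fn, tp)               # 3 1 1 3
print(accuracy, precision, recall)  # 0.75 0.75 0.75
```

The accuracy computed by hand matches `accuracy_score(y_true, y_pred)`.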

## Step 5: Hyperparameter Tuning

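We will compare forests with 100 to 190 estimators (in steps of 10), first by accuracy on the test set and then by mean 5-fold cross-validation accuracy on the full dataset. The cross-validation scores are the more reliable guide, because choosing a model by its test-set accuracy turns the test score into an optimistic estimate.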

In [None]:
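from sklearn.model_selection import cross_val_score

# Score each forest size on the held-out test set
np.random.seed(42)
for i in range(100, 200, 10):
    model = RandomForestClassifier(n_estimators=i).fit(X_train, y_train)
    print(f"Model accuracy on test set with {i} estimators: {model.score(X_test, y_test) * 100:.2f}%")

# Score each forest size with 5-fold cross-validation on the full dataset.
# cross_val_score clones the estimator and refits it on each fold, so the fit
# on X_train below does not affect the score; it only leaves `model` fitted.
np.random.seed(42)
for i in range(100, 200, 10):
    model = RandomForestClassifier(n_estimators=i).fit(X_train, y_train)
    cross_val_mean = np.mean(cross_val_score(model, X, y, cv=5))
    print(f"5-fold cross-validation score with {i} estimators: {cross_val_mean * 100:.2f}%")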
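The loops above can also be expressed with `GridSearchCV`, which runs the cross-validation for every candidate value and keeps the best estimator. This is an alternative sketch, not the method used in this notebook. It runs on synthetic data from `make_classification`, searches a shorter hypothetical grid to keep the run time small, and cross-validates on the training portion only so the test set stays untouched until the end.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data standing in for the heart disease set.
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=43
)

# Hypothetical candidate values; the notebook itself tries 100 to 190.
param_grid = {"n_estimators": [50, 100, 150]}

# 5-fold cross-validation on the training data for each candidate value.
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
search.fit(X_tr, y_tr)

print(f"Best parameters: {search.best_params_}")
print(f"Best mean CV accuracy: {search.best_score_ * 100:.2f}%")

# By default the best estimator is refitted on all of X_tr, then scored once on X_te.
print(f"Test accuracy: {search.score(X_te, y_te) * 100:.2f}%")
```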

## Step 6: Save a Model

We will save the last model fitted in Step 5 (190 estimators) with both pickle and joblib, then reload each file and check that it still scores the test set.

In [None]:
import pickle
from joblib import dump, load

# Save and reload with pickle; the with blocks close the files afterwards
with open("random_forest_model_1.pkl", "wb") as f:
    pickle.dump(model, f)

with open("random_forest_model_1.pkl", "rb") as f:
    loaded_pickle_model = pickle.load(f)
print(f"Loaded pickle model prediction score: {loaded_pickle_model.score(X_test, y_test) * 100:.2f}%")

# Save and reload with joblib
dump(model, "random_forest_model_1.joblib")
loaded_joblib_model = load("random_forest_model_1.joblib")
print(f"Loaded joblib model prediction score: {loaded_joblib_model.score(X_test, y_test) * 100:.2f}%")
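A quick way to confirm that saving and loading preserved the model is to compare predictions before and after the round trip. The sketch below does this on synthetic data and writes to a temporary directory, so it leaves no files behind. The file names and `*_demo` variables are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from joblib import dump, load
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data and a small forest standing in for the notebook's model.
X_demo, y_demo = make_classification(n_samples=200, n_features=13, random_state=42)
model_demo = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = Path(tmp_dir) / "demo_model.pkl"
    joblib_path = Path(tmp_dir) / "demo_model.joblib"

    # Round trip through pickle.
    with open(pkl_path, "wb") as f:
        pickle.dump(model_demo, f)
    with open(pkl_path, "rb") as f:
        from_pickle = pickle.load(f)

    # Round trip through joblib.
    dump(model_demo, str(joblib_path))
    from_joblib = load(str(joblib_path))

# A reloaded model should give exactly the same predictions as the original.
original_preds = model_demo.predict(X_demo)
same_pickle = np.array_equal(original_preds, from_pickle.predict(X_demo))
same_joblib = np.array_equal(original_preds, from_joblib.predict(X_demo))
print(same_pickle, same_joblib)
```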