Random Forest Classifier for Predicting Heart Disease Risk


Here is the step-by-step implementation:

Preprocess the dataset by handling missing values, encoding categorical variables, and scaling numerical features.

Split the dataset into training and testing sets.

Train a Random Forest Classifier with default hyperparameters.

Evaluate the model performance using accuracy, precision, recall, and F1 score.

Identify and visualize the top 5 most important features.

Tune the hyperparameters of the model using GridSearchCV.

Compare the performance of the tuned model with the default model.

Interpret the model by analyzing decision boundaries.

Let's start implementing this in a Jupyter Notebook:


# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.impute import SimpleImputer

# Load the dataset
url = "https://drive.google.com/uc?export=download&id=1bGoIE4Z2kG5nyh-fGZAJ7LH0ki3UfmSJ"
data = pd.read_csv(url)

# Preprocess the dataset
# Check for missing values
print(data.isnull().sum())

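# Handle missing values
# Mean imputation assumes every column (including 'target') is numeric;
# it raises an error if any column holds strings.
imputer = SimpleImputer(strategy='mean')
data_imputed = pd.DataFrame(imputer.fit_transform(data), columns=data.columns)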

# Encode categorical variables (a fresh LabelEncoder is fit for each column)
categorical_features = ['sex', 'cp', 'restecg', 'slope', 'thal']
for column in categorical_features:
    data_imputed[column] = LabelEncoder().fit_transform(data_imputed[column])

# Scale numerical features
scaler = StandardScaler()
numerical_features = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']
data_imputed[numerical_features] = scaler.fit_transform(data_imputed[numerical_features])

# Split the dataset into training and testing sets
X = data_imputed.drop('target', axis=1)
y = data_imputed['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a Random Forest Classifier
# (n_estimators=100 is the scikit-learn default; max_depth=10 is not, the default is None)
clf = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
clf.fit(X_train, y_train)

# Predict on the test set
y_pred = clf.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f'Accuracy: {accuracy:.2f}')
print(f'Precision: {precision:.2f}')
print(f'Recall: {recall:.2f}')
print(f'F1 Score: {f1:.2f}')

# Feature importance
feature_importances = clf.feature_importances_
features = X.columns
importance_df = pd.DataFrame({'Feature': features, 'Importance': feature_importances})
importance_df = importance_df.sort_values(by='Importance', ascending=False)

# Visualize feature importances
plt.figure(figsize=(10, 6))
sns.barplot(x='Importance', y='Feature', data=importance_df.head(5))
plt.title('Top 5 Feature Importances')
plt.show()

# Hyperparameter tuning using GridSearchCV
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [10, 20, 30, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

grid_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    cv=5,
    n_jobs=-1,
    scoring='accuracy',
)
grid_search.fit(X_train, y_train)

print(f'Best parameters: {grid_search.best_params_}')

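# Retrieve the tuned model
# GridSearchCV (refit=True by default) has already refit the best estimator
# on all of X_train, so no further call to fit() is needed.
best_clf = grid_search.best_estimator_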

# Evaluate the tuned model
y_pred_tuned = best_clf.predict(X_test)
accuracy_tuned = accuracy_score(y_test, y_pred_tuned)
precision_tuned = precision_score(y_test, y_pred_tuned)
recall_tuned = recall_score(y_test, y_pred_tuned)
f1_tuned = f1_score(y_test, y_pred_tuned)

print(f'Tuned Accuracy: {accuracy_tuned:.2f}')
print(f'Tuned Precision: {precision_tuned:.2f}')
print(f'Tuned Recall: {recall_tuned:.2f}')
print(f'Tuned F1 Score: {f1_tuned:.2f}')

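# Plotting decision boundaries for the two most important features
top_features = importance_df['Feature'].head(2).tolist()
X_top_features = X[top_features]

# Fit model on two features
clf_top_features = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
clf_top_features.fit(X_train[top_features], y_train)

# Predict the class at every point of a grid spanning both features
x_min, x_max = X_top_features.iloc[:, 0].min() - 0.5, X_top_features.iloc[:, 0].max() + 0.5
y_min, y_max = X_top_features.iloc[:, 1].min() - 0.5, X_top_features.iloc[:, 1].max() + 0.5
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200), np.linspace(y_min, y_max, 200))
grid = pd.DataFrame(np.c_[xx.ravel(), yy.ravel()], columns=top_features)
Z = clf_top_features.predict(grid).reshape(xx.shape)

# Plot decision boundaries: shaded regions are predictions, points are the data
plt.figure(figsize=(10, 6))
plt.contourf(xx, yy, Z, alpha=0.3, cmap='coolwarm')
sns.scatterplot(x=X_top_features.iloc[:, 0], y=X_top_features.iloc[:, 1], hue=y)
plt.title('Decision Boundaries with Top 2 Features')
plt.xlabel(top_features[0])
plt.ylabel(top_features[1])
plt.show()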
Explanation:
Preprocessing:

Handled missing values using mean imputation.
Encoded categorical variables using LabelEncoder.
Scaled numerical features using StandardScaler.
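The three preprocessing steps above can also be bundled so that imputation and scaling statistics are learned from the training rows only. The sketch below is one possible arrangement, not the notebook's method: the toy frame, its values, and the names `numeric_steps`, `categorical_steps`, and `preprocess` are hypothetical, and it swaps LabelEncoder for OneHotEncoder because scikit-learn documents LabelEncoder as intended for target labels rather than input features.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame standing in for the heart disease data
toy = pd.DataFrame({
    'age': [63.0, 37.0, np.nan, 56.0, 57.0, 44.0],
    'chol': [233.0, 250.0, 204.0, np.nan, 354.0, 263.0],
    'cp': [3.0, 2.0, 1.0, 1.0, np.nan, 0.0],
})
numeric_cols = ['age', 'chol']
categorical_cols = ['cp']

# Numeric columns: fill gaps with the mean, then standardize
numeric_steps = Pipeline([
    ('impute', SimpleImputer(strategy='mean')),
    ('scale', StandardScaler()),
])
# Categorical columns: fill gaps with the most frequent value, then one-hot encode
categorical_steps = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('encode', OneHotEncoder(handle_unknown='ignore')),
])
# sparse_threshold=0 makes the combined output a dense NumPy array
preprocess = ColumnTransformer(
    [('num', numeric_steps, numeric_cols), ('cat', categorical_steps, categorical_cols)],
    sparse_threshold=0,
)

train, test = toy.iloc[:4], toy.iloc[4:]
train_ready = preprocess.fit_transform(train)  # statistics come from training rows only
test_ready = preprocess.transform(test)        # the same statistics are reused here
print(train_ready.shape, test_ready.shape)
```

Fitting on the training rows and only transforming the test rows keeps information from the held-out data out of the means and scales the model sees.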
Train-Test Split:

Split the dataset into 70% training and 30% testing sets.
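If the classes are imbalanced, a plain random split can leave the test set with a different class mix than the training set. The following sketch shows the optional `stratify` argument of `train_test_split` on made-up labels; the arrays `X_demo` and `y_demo` are hypothetical and the notebook itself does not stratify.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 70 negatives and 30 positives
y_demo = np.array([0] * 70 + [1] * 30)
X_demo = np.arange(100).reshape(-1, 1)

# stratify=y_demo keeps the positive rate the same in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)
print(len(X_tr), len(X_te))
print(y_tr.mean(), y_te.mean())
```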
Model Training:

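Trained a Random Forest Classifier with 100 trees and a maximum depth of 10. Note that 100 trees is the scikit-learn default, while max_depth=10 is not (the default is None, which lets trees grow until the leaves are pure), so this baseline is close to, but not exactly, the default model.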
Model Evaluation:

Evaluated the model using accuracy, precision, recall, and F1 score.
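To make the four metrics concrete, here is a small worked example that computes them from the confusion-matrix counts in plain Python. The helper `binary_metrics` and the label lists are hypothetical illustrations; scikit-learn's `accuracy_score`, `precision_score`, `recall_score`, and `f1_score` compute the same quantities for binary labels.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {'accuracy': accuracy, 'precision': precision, 'recall': recall, 'f1': f1}


# Hypothetical labels: 4 patients with disease, 6 without
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics_demo = binary_metrics(y_true_demo, y_pred_demo)
print(metrics_demo)
```

Here one diseased patient is missed and one healthy patient is flagged, so precision and recall are both 3/4 while accuracy is 8/10. In a screening setting, recall is often the metric to watch, since a missed case is usually costlier than a false alarm.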
Feature Importance:

Identified and visualized the top 5 most important features.
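The importances reported by `feature_importances_` are impurity-based and computed on the training data. As a possible cross-check, permutation importance measures how much the test score drops when one feature is shuffled. The sketch below uses synthetic data in which only the first column carries signal; the data and variable names are hypothetical, and this technique is an addition, not something the notebook does.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: the label depends on column 0 only, columns 1 and 2 are noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature 5 times on the held-out set and record the accuracy drop
result = permutation_importance(model, X_te, y_te, n_repeats=5, random_state=0)
ranking = np.argsort(result.importances_mean)[::-1]  # most important first
print(ranking, result.importances_mean)
```

In the notebook, the equivalent call would be `permutation_importance(clf, X_test, y_test, ...)`, assuming `clf`, `X_test`, and `y_test` are defined as in the code above.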
Hyperparameter Tuning:

Tuned the model using GridSearchCV with a specified parameter grid.
Reported the best parameters and performance metrics.
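The comparison between the default and tuned models is easier to read side by side. The helper below, `compare_models`, is a hypothetical convenience function that builds one table row per model; the demo labels are made up and do not reflect results on the heart disease data.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score


def compare_models(y_true, predictions):
    """Return a table with one row per model and one column per metric."""
    rows = {}
    for name, y_hat in predictions.items():
        rows[name] = {
            'Accuracy': accuracy_score(y_true, y_hat),
            'Precision': precision_score(y_true, y_hat),
            'Recall': recall_score(y_true, y_hat),
            'F1': f1_score(y_true, y_hat),
        }
    return pd.DataFrame.from_dict(rows, orient='index')


# Hypothetical labels and predictions, for illustration only
y_true_demo = [1, 0, 1, 1, 0, 0]
table = compare_models(y_true_demo, {
    'default': [1, 0, 0, 1, 0, 1],
    'tuned': [1, 0, 1, 1, 0, 1],
})
print(table.round(2))
```

In the notebook, this would be called as `compare_models(y_test, {'default': y_pred, 'tuned': y_pred_tuned})`. A tuned model is not guaranteed to score higher on the test set, because the grid search optimizes cross-validated accuracy on the training data.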

Decision Boundaries:

Analyzed the decision boundaries using the two most important features.