Build a random forest classifier to predict the risk of heart disease based on a dataset of patient
information. The dataset contains 303 instances with 14 features, including age, sex, chest pain type,
resting blood pressure, serum cholesterol, and maximum heart rate achieved.


Dataset link: https://drive.google.com/file/d/1bGoIE4Z2kG5nyh-fGZAJ7LH0ki3UfmSJ/view?usp=share_link



Q1. Preprocess the dataset by handling missing values, encoding categorical variables, and scaling the
numerical features if necessary.
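
For Q1, one way to keep imputation, encoding, and scaling together is a scikit-learn `ColumnTransformer`. The sketch below runs on a tiny made-up table: the column names (`age`, `chol`, `cp`) and their values are hypothetical stand-ins, and the choice of median and most-frequent imputation is an assumption rather than a requirement of the exercise.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for the patient table; column names and values are hypothetical.
toy = pd.DataFrame({
    "age": [63, 45, np.nan, 58],
    "chol": [233.0, 250.0, 204.0, np.nan],
    "cp": [3, 2, 1, 2],  # chest pain type stored as an integer code
})

numeric_cols = ["age", "chol"]
categorical_cols = ["cp"]

# Numeric columns: fill gaps with the median, then standardize.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical columns: fill gaps with the most frequent code, then one-hot encode.
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocessor = ColumnTransformer(
    [
        ("num", numeric_steps, numeric_cols),
        ("cat", categorical_steps, categorical_cols),
    ],
    sparse_threshold=0.0,  # always return a dense array
)

features = preprocessor.fit_transform(toy)
print(features.shape)
```

On real data the preprocessor would normally be fitted on the training split only and then applied to the test split, so that test rows do not influence the imputation or scaling statistics.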

Q2. Split the dataset into a training set (70%) and a test set (30%).
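
For Q2, a 70/30 split can also be stratified on the target so that both parts keep roughly the same share of positive cases. The sketch below uses synthetic data with the same number of rows as the dataset described above; the feature values and class counts are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 303 rows, 13 feature columns, binary target.
# The class counts are chosen for illustration only.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(303, 13))
y_demo = np.array([1] * 165 + [0] * 138)

# stratify=y_demo keeps the class proportions similar in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

print("Train shape:", X_tr.shape, "Test shape:", X_te.shape)
print("Positive rate (train, test):", y_tr.mean(), y_te.mean())
```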

Q3. Train a random forest classifier on the training set using 100 trees and a maximum depth of 10 for each
tree. Use the default values for other hyperparameters.

Q4. Evaluate the performance of the model on the test set using accuracy, precision, recall, and F1 score.
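
To make the four metrics in Q4 concrete, the sketch below derives them by hand from a confusion matrix and checks the results against scikit-learn. The labels and predictions are invented for illustration and are not model output.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Invented labels and predictions (1 = disease, 0 = no disease).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_hat = [1, 1, 1, 0, 0, 0, 1, 1, 1, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy_manual = (tp + tn) / (tp + tn + fp + fn)
precision_manual = tp / (tp + fp)  # share of predicted positives that are correct
recall_manual = tp / (tp + fn)     # share of actual positives that are found
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)

print("Accuracy:", accuracy_manual, accuracy_score(y_true, y_hat))
print("Precision:", precision_manual, precision_score(y_true, y_hat))
print("Recall:", recall_manual, recall_score(y_true, y_hat))
print("F1 Score:", f1_manual, f1_score(y_true, y_hat))
```

In a screening setting, recall is often the metric to watch, because a false negative means a patient at risk is missed.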

Q5. Use the feature importance scores to identify the top 5 most important features in predicting heart
disease risk. Visualise the feature importances using a bar chart.
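
For Q5, the impurity-based scores in `feature_importances_` can be cross-checked against permutation importance measured on held-out data. This is an additional check, not something the exercise asks for. The sketch uses synthetic data, and the feature names are hypothetical placeholders.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the feature names are hypothetical placeholders.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, random_state=1
)
names = [f"feature_{i}" for i in range(X_demo.shape[1])]
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1
)

model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=1)
model.fit(X_tr, y_tr)

# Impurity-based importances come from the training data and sum to 1.
impurity = pd.Series(model.feature_importances_, index=names)

# Permutation importance: drop in test score when one column is shuffled.
perm = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=1)
permuted = pd.Series(perm.importances_mean, index=names)

ranking = pd.DataFrame({"impurity": impurity, "permutation": permuted})
print(ranking.sort_values("impurity", ascending=False).head(5).round(3))
```

If the two rankings disagree strongly, the impurity-based top 5 should be read with some caution.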

Q6. Tune the hyperparameters of the random forest classifier using grid search or random search. Try
different values of the number of trees, maximum depth, minimum samples split, and minimum samples
leaf. Use 5-fold cross-validation to evaluate the performance of each set of hyperparameters.
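
Q6 allows either grid search or random search. A minimal random-search sketch with `RandomizedSearchCV` follows, assuming synthetic data and deliberately small ranges so that it runs quickly. The ranges, the `n_iter` value, and the F1 scoring choice are illustrative assumptions.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the training split.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=13, n_informative=5, random_state=42
)

# Distributions to sample from; randint(a, b) draws integers in [a, b).
param_distributions = {
    "n_estimators": randint(20, 61),
    "max_depth": [3, 5, 10, None],
    "min_samples_split": randint(2, 11),
    "min_samples_leaf": randint(1, 5),
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,        # number of sampled settings, kept small for speed
    cv=5,            # 5-fold cross-validation, as the question requires
    scoring="f1",
    random_state=42,
)
search.fit(X_demo, y_demo)

print("Best Hyperparameters:", search.best_params_)
print("Best Mean CV F1:", search.best_score_)
```

Random search evaluates a fixed number of sampled settings, so its cost does not grow with the size of the grid the way an exhaustive search does.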

Q7. Report the best set of hyperparameters found by the search and the corresponding performance
metrics. Compare the performance of the tuned model with the default model.
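
For the comparison in Q7, it can help to place both models' test metrics side by side in one table. The sketch below assumes synthetic data; `summarize` is a hypothetical helper, and the "tuned" settings are illustrative stand-ins rather than the result of an actual search.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split


def summarize(model, X_eval, y_eval):
    """Return the four test metrics for a fitted classifier as a dict."""
    pred = model.predict(X_eval)
    return {
        "accuracy": accuracy_score(y_eval, pred),
        "precision": precision_score(y_eval, pred),
        "recall": recall_score(y_eval, pred),
        "f1": f1_score(y_eval, pred),
    }


# Synthetic stand-in data.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=13, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

baseline = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=0)
baseline.fit(X_tr, y_tr)

# Stand-in for the tuned model; these settings are illustrative, not search results.
tuned = RandomForestClassifier(
    n_estimators=200, max_depth=5, min_samples_leaf=2, random_state=0
)
tuned.fit(X_tr, y_tr)

comparison = pd.DataFrame({
    "baseline": summarize(baseline, X_te, y_te),
    "tuned": summarize(tuned, X_te, y_te),
})
comparison["difference"] = comparison["tuned"] - comparison["baseline"]
print(comparison.round(3))
```

With a test set of roughly 90 patients, differences of a point or two between the models are within the noise of a single split.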

Q8. Interpret the model by analysing the decision boundaries of the random forest classifier. Plot the
decision boundaries on a scatter plot of two of the most important features. Discuss the insights and
limitations of the model for predicting heart disease risk.

In [None]:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import GridSearchCV
import matplotlib.pyplot as plt
import numpy as np

# Load the dataset
url = "https://drive.google.com/file/d/1bGoIE4Z2kG5nyh-fGZAJ7LH0ki3UfmSJ/view?usp=sharing"
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]
data = pd.read_csv(path)

# Q1. Preprocess the dataset
# Handling missing values: report them, then impute numeric columns with the column mean
print("Missing values per column:")
print(data.isnull().sum())
data = data.fillna(data.mean(numeric_only=True))

# Encoding categorical variables: this assumes every column is already numeric, with
# categorical features such as sex and chest pain type stored as integer codes, so no
# further encoding is applied. Check data.dtypes to confirm before relying on this.

# Scaling numerical features: skipped, because random forests split on per-feature
# thresholds and are insensitive to monotonic rescaling of the inputs.

# Q2. Split the dataset into training and test sets
X = data.drop('target', axis=1)
y = data['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Q3. Train a random forest classifier
rf_classifier = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
rf_classifier.fit(X_train, y_train)

# Q4. Evaluate the performance of the model
y_pred = rf_classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)

# Q5. Identify top 5 most important features
feature_importances = rf_classifier.feature_importances_
sorted_indices = np.argsort(feature_importances)[::-1]
top_5_indices = sorted_indices[:5]
top_5_features = X.columns[top_5_indices]
print("Top 5 Most Important Features:")
for feature in top_5_features:
    print(feature)

# Visualize feature importances
plt.figure(figsize=(10, 6))
plt.bar(range(len(top_5_features)), feature_importances[top_5_indices], tick_label=top_5_features)
plt.title('Top 5 Most Important Features')
plt.xlabel('Features')
plt.ylabel('Importance Score')
plt.xticks(rotation=45)
plt.show()

# Q6. Tune hyperparameters using GridSearchCV
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [5, 10, 15],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}
grid_search = GridSearchCV(estimator=rf_classifier, param_grid=param_grid, cv=5, n_jobs=-1, verbose=2)
grid_search.fit(X_train, y_train)

# Q7. Report best hyperparameters and corresponding performance metrics
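print("Best Hyperparameters:", grid_search.best_params_)
# With no scoring argument, GridSearchCV scores a classifier by accuracy
print("Best Mean CV Accuracy:", grid_search.best_score_)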
best_rf_classifier = grid_search.best_estimator_
y_pred_tuned = best_rf_classifier.predict(X_test)
accuracy_tuned = accuracy_score(y_test, y_pred_tuned)
precision_tuned = precision_score(y_test, y_pred_tuned)
recall_tuned = recall_score(y_test, y_pred_tuned)
f1_tuned = f1_score(y_test, y_pred_tuned)
print("Tuned Model Performance:")
print("Accuracy:", accuracy_tuned)
print("Precision:", precision_tuned)
print("Recall:", recall_tuned)
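print("F1 Score:", f1_tuned)

# Compare the tuned model with the default model evaluated in Q4
print("Change vs. Default Model (tuned - default):")
print("Accuracy:", accuracy_tuned - accuracy)
print("Precision:", precision_tuned - precision)
print("Recall:", recall_tuned - recall)
print("F1 Score:", f1_tuned - f1)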

# Q8. Visualize decision boundaries
# The dataset has more than two features, so the full decision boundary cannot be drawn directly.
# For simplicity, retrain on the two most important features found in Q5 and plot that
# two-dimensional model. It approximates, but is not identical to, the full model.

# Select two most important features
feature1, feature2 = top_5_features[:2]

# Select corresponding columns from training data
X_train_subset = X_train[[feature1, feature2]]

# Train RandomForestClassifier on these two features
rf_classifier_subset = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
rf_classifier_subset.fit(X_train_subset, y_train)

# Plot decision boundaries
plt.figure(figsize=(10, 6))
x_min, x_max = X_train_subset[feature1].min() - 1, X_train_subset[feature1].max() + 1
y_min, y_max = X_train_subset[feature2].min() - 1, X_train_subset[feature2].max() + 1
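# Use a fixed number of grid points per axis so features with wide ranges stay cheap to plot
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200),
                     np.linspace(y_min, y_max, 200))
# Wrap the grid in a DataFrame so the column names match those seen during fitting
grid = pd.DataFrame(np.c_[xx.ravel(), yy.ravel()], columns=[feature1, feature2])
Z = rf_classifier_subset.predict(grid)
Z = Z.reshape(xx.shape)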
plt.contourf(xx, yy, Z, alpha=0.4)
plt.scatter(X_train_subset[feature1], X_train_subset[feature2], c=y_train, s=20, edgecolor='k')
plt.xlabel(feature1)
plt.ylabel(feature2)
plt.title('Decision Boundaries of Random Forest Classifier')
plt.show()
