### Build a Random Forest Classifier to Predict the Risk of Heart Disease

You are tasked with building a random forest classifier to predict the risk of heart disease based on a dataset containing patient information. The dataset contains 303 instances with 14 features, including age, sex, chest pain type, resting blood pressure, serum cholesterol, and maximum heart rate achieved.  
Dataset link: [Google Drive](https://drive.google.com/file/d/1bGoIE4Z2kG5nyh-fGZAJ7LH0ki3UfmSJ/view?usp=share_link)

### Q1. Preprocess the dataset by handling missing values, encoding categorical variables, and scaling the numerical features if necessary.

### Q2. Split the dataset into a training set (70%) and a test set (30%).

### Q3. Train a random forest classifier on the training set using 100 trees and a maximum depth of 10 for each tree. Use the default values for other hyperparameters.

### Q4. Evaluate the performance of the model on the test set using accuracy, precision, recall, and F1 score.

### Q5. Use the feature importance scores to identify the top 5 most important features in predicting heart disease risk. Visualise the feature importances using a bar chart.

### Q6. Tune the hyperparameters of the random forest classifier using grid search or random search. Try different values of the number of trees, maximum depth, minimum samples split, and minimum samples leaf. Use 5-fold cross-validation to evaluate the performance of each set of hyperparameters.

### Q7. Report the best set of hyperparameters found by the search and the corresponding performance metrics. Compare the performance of the tuned model with the default model.

### Q8. Interpret the model by analysing the decision boundaries of the random forest classifier. Plot the decision boundaries on a scatter plot of two of the most important features. Discuss the insights and limitations of the model for predicting heart disease risk.


## Step 1: Importing Libraries and Loading Data
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV

# Load the dataset
url = 'https://drive.google.com/uc?id=1bGoIE4Z2kG5nyh-fGZAJ7LH0ki3UfmSJ'
df = pd.read_csv(url)
```
---

## Q1: Preprocessing the Dataset
- Handling Missing Values: We use `SimpleImputer` to replace missing values in the numerical features with the column median.
- Encoding Categorical Variables: We one-hot encode categorical features with `pd.get_dummies`.
- Scaling Numerical Features: We standardise the features with `StandardScaler` so they are on the same scale. Tree-based models such as random forests do not require scaling, so this step is optional here.

```python
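# Check for missing values
print(df.isnull().sum())

# Preprocessing
# Impute missing values in the numerical columns with the median
num_cols = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']
imputer = SimpleImputer(strategy='median')
df[num_cols] = imputer.fit_transform(df[num_cols])

# Encode categorical variables (e.g., 'sex', 'cp', 'fbs', etc.)
# Note: get_dummies only encodes object/category columns by default;
# integer-coded categoricals are left as-is unless listed via `columns=`.
df = pd.get_dummies(df, drop_first=True)

# Scale the features
# Note: fitting the scaler on all rows before the split lets test-set statistics
# influence the scaling; fit on the training rows only for a stricter evaluation.
scaler = StandardScaler()
features = df.drop('target', axis=1)  # 'target' is the label column
scaled_features = scaler.fit_transform(features)

# Rebuild the feature DataFrame (using the feature column names, which does not
# assume 'target' is the last column) and separate the target variable
X = pd.DataFrame(scaled_features, columns=features.columns, index=df.index)
y = df['target']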
```
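
One way to keep the imputer and scaler from seeing rows that later end up in the test set is to wrap preprocessing and the classifier in a scikit-learn `Pipeline`. The sketch below is only an illustration under stated assumptions: it runs on a small synthetic table (`demo`, made-up values, not the real dataset), and the split into `numeric_cols` and `categorical_cols` is hypothetical and would need to match the actual CSV. It uses `OneHotEncoder` inside a `ColumnTransformer` instead of `pd.get_dummies`, so integer-coded categories are encoded explicitly.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the heart disease data (hypothetical values)
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    'age': rng.integers(30, 80, n).astype(float),
    'chol': rng.normal(240, 50, n),
    'cp': rng.integers(0, 4, n),    # integer-coded categorical
    'sex': rng.integers(0, 2, n),   # integer-coded categorical
    'target': rng.integers(0, 2, n),
})
demo.loc[::25, 'chol'] = np.nan     # inject a few missing values

# Assumed column roles; adjust to the real dataset
numeric_cols = ['age', 'chol']
categorical_cols = ['cp', 'sex']

# Numeric columns: median imputation, then scaling. Categorical: one-hot encoding.
preprocess = ColumnTransformer([
    ('num', Pipeline([('impute', SimpleImputer(strategy='median')),
                      ('scale', StandardScaler())]), numeric_cols),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols),
])

model = Pipeline([
    ('preprocess', preprocess),
    ('rf', RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)),
])

X_demo = demo.drop(columns='target')
y_demo = demo['target']
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

# All preprocessing statistics are learned from the training rows only
model.fit(X_tr, y_tr)
preds = model.predict(X_te)
print(preds[:10])
```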
---

## Q2: Splitting the Dataset
We will split the dataset into training and testing sets (70% for training and 30% for testing).

```python
# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
```
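
With only 303 rows, a purely random split can leave the training and test sets with somewhat different class proportions. A possible refinement, not used in the code above, is to pass `stratify=` to `train_test_split`. The labels below (`y_demo`) are invented purely to show the effect.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 60 positives and 40 negatives (not the real dataset)
y_demo = np.array([1] * 60 + [0] * 40)
X_demo = np.arange(100).reshape(-1, 1)

# stratify=y_demo keeps the class ratio (approximately) equal in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

print(f"Positive rate, train: {y_tr.mean():.2f}")
print(f"Positive rate, test:  {y_te.mean():.2f}")
```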
---

## Q3: Train Random Forest Classifier
Now, let's train a random forest classifier with 100 trees and a maximum depth of 10.

```python
# Train the Random Forest Classifier
rf_classifier = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
rf_classifier.fit(X_train, y_train)
```
---

## Q4: Evaluate Model Performance
We will evaluate the model using accuracy, precision, recall, and F1 score.

```python
# Make predictions on the test set
y_pred = rf_classifier.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 Score: {f1}")
```
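
The four scores above summarise the positive class only. To see where the errors fall, a confusion matrix and `classification_report` can be printed alongside them. The sketch below uses two short made-up label arrays (`y_true_demo`, `y_pred_demo`) in place of `y_test` and `y_pred`.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Made-up labels standing in for y_test and y_pred
y_true_demo = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred_demo = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# For binary labels, rows are true classes and columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true_demo, y_pred_demo)
tn, fp, fn, tp = cm.ravel()
print(cm)
print(f"False negatives (missed positives): {fn}")

# Per-class precision, recall and F1 in one table
print(classification_report(y_true_demo, y_pred_demo, digits=3))
```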
---

## Q5: Feature Importance
We will extract and visualize the feature importances to identify the top 5 most important features.

```python
# Get feature importance scores
feature_importances = rf_classifier.feature_importances_

# Create a DataFrame of features and their importance scores
feature_importance_df = pd.DataFrame({
    'Feature': X.columns,
    'Importance': feature_importances
})

# Sort features by importance
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)

# Plot the top 5 most important features
top_5_features = feature_importance_df.head(5)

plt.figure(figsize=(10, 6))
sns.barplot(x='Importance', y='Feature', data=top_5_features)
plt.title('Top 5 Important Features for Heart Disease Risk')
plt.show()
```
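
The `feature_importances_` attribute is impurity-based: it is computed from the training data and can favour features with many distinct values. As a cross-check, one option is permutation importance measured on held-out data. The sketch below is an assumption-laden illustration on synthetic data from `make_classification`, not a result for the heart disease dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data standing in for X and y
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

rf_demo = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
rf_demo.fit(X_tr, y_tr)

# Shuffle each feature in the test set and record the drop in accuracy
result = permutation_importance(rf_demo, X_te, y_te, n_repeats=10, random_state=42)

# Feature indices ordered from most to least important
ranking = np.argsort(result.importances_mean)[::-1]
for idx in ranking:
    print(f"feature_{idx}: {result.importances_mean[idx]:.3f} "
          f"+/- {result.importances_std[idx]:.3f}")
```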
---

## Q6: Hyperparameter Tuning
We will tune the hyperparameters using Grid Search with 5-fold cross-validation.

```python
# Define hyperparameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 15],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Perform GridSearchCV
grid_search = GridSearchCV(estimator=RandomForestClassifier(random_state=42), param_grid=param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Get the best parameters
best_params = grid_search.best_params_
print(f"Best Hyperparameters: {best_params}")
```
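
The grid above has 3 × 3 × 3 × 3 = 81 combinations, so 5-fold cross-validation fits 405 forests. Q6 also allows random search, which samples a fixed number of combinations instead. The sketch below shows `RandomizedSearchCV` on synthetic data; the sampling ranges and `n_iter=5` are illustrative assumptions chosen to keep the example fast, not recommended settings.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data standing in for X_train and y_train
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# randint(low, high) samples integers in [low, high)
param_distributions = {
    'n_estimators': randint(50, 201),
    'max_depth': randint(3, 16),
    'min_samples_split': randint(2, 11),
    'min_samples_leaf': randint(1, 5),
}

random_search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,            # number of sampled combinations (assumed, kept small)
    cv=5,
    scoring='accuracy',
    random_state=42,
)
random_search.fit(X_demo, y_demo)

print(f"Best Hyperparameters: {random_search.best_params_}")
print(f"Best CV accuracy: {random_search.best_score_:.3f}")
```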
---

## Q7: Report the Best Set of Hyperparameters
We will evaluate the model with the best hyperparameters and compare it with the default model.

```python
# Evaluate the best model from grid search
best_rf = grid_search.best_estimator_

# Make predictions using the best model
y_pred_best = best_rf.predict(X_test)

# Evaluate the best model
accuracy_best = accuracy_score(y_test, y_pred_best)
precision_best = precision_score(y_test, y_pred_best)
recall_best = recall_score(y_test, y_pred_best)
f1_best = f1_score(y_test, y_pred_best)

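print("Best Model Performance:")
print(f"Accuracy: {accuracy_best}")
print(f"Precision: {precision_best}")
print(f"Recall: {recall_best}")
print(f"F1 Score: {f1_best}")

# Compare with the default model evaluated in Q4
print("Default Model Performance:")
print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 Score: {f1}")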
```
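
For the comparison, it can be easier to read both models side by side in one table. The sketch below assumes a small hypothetical helper, `metric_row`, and uses made-up predictions (`pred_default_demo`, `pred_tuned_demo`) in place of `y_pred` and `y_pred_best`; the numbers it prints say nothing about the real models.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def metric_row(y_true, y_pred):
    """Return the four evaluation metrics as a dict (hypothetical helper)."""
    return {
        'accuracy': accuracy_score(y_true, y_pred),
        'precision': precision_score(y_true, y_pred),
        'recall': recall_score(y_true, y_pred),
        'f1': f1_score(y_true, y_pred),
    }

# Made-up labels and predictions for illustration only
y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0]
pred_default_demo = [1, 0, 0, 1, 0, 1, 1, 0]
pred_tuned_demo = [1, 0, 1, 1, 0, 1, 1, 0]

# One row per model, one column per metric
comparison = pd.DataFrame({
    'default': metric_row(y_true_demo, pred_default_demo),
    'tuned': metric_row(y_true_demo, pred_tuned_demo),
}).T

print(comparison.round(3))
```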
---

## Q8: Interpret the Model and Plot Decision Boundaries
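A random forest combines many trees that split on all of the features, so its decision boundary lies in a high-dimensional space and cannot be plotted directly. To get some insight, we fit a forest on only the two most important features and plot its decision regions. Keep in mind that the plot describes this reduced two-feature model, not the full classifier.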

```python
# Select the two most important features for visualization
# (.tolist() avoids label-based indexing errors on the sorted Series)
top_features = top_5_features['Feature'].tolist()[:2]
X_subset = X[top_features]

# Train a separate model on the two selected features
# so the full model from Q3 is not overwritten
rf_2d = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
rf_2d.fit(X_subset, y)

# Create a mesh grid for plotting decision boundaries
xx, yy = np.meshgrid(np.linspace(X_subset.iloc[:, 0].min(), X_subset.iloc[:, 0].max(), 100),
                     np.linspace(X_subset.iloc[:, 1].min(), X_subset.iloc[:, 1].max(), 100))

# Predict on the mesh grid (as a DataFrame so feature names match those seen in fit)
grid = pd.DataFrame(np.c_[xx.ravel(), yy.ravel()], columns=top_features)
Z = rf_2d.predict(grid).reshape(xx.shape)

# Plot decision boundary
plt.contourf(xx, yy, Z, alpha=0.8)
plt.scatter(X_subset.iloc[:, 0], X_subset.iloc[:, 1], c=y, edgecolors='k', marker='o')
plt.xlabel(top_features[0])
plt.ylabel(top_features[1])
plt.title('Decision Boundary of Random Forest Classifier')
plt.show()
```
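
One limitation of the plot is that a forest restricted to two features may behave quite differently from the full model. A rough way to gauge the gap is to compare cross-validated accuracy for both. The sketch below does this on synthetic data from `make_classification`; the dataset, the variable names and the resulting scores are illustrative assumptions, not findings about heart disease risk.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for X and y
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_redundant=0, random_state=0
)

# Rank features with a forest fitted on all columns, then keep the top two
rf_full = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
rf_full.fit(X_demo, y_demo)
top_two = np.argsort(rf_full.feature_importances_)[::-1][:2]

# 5-fold cross-validated accuracy: all features vs. the two plotted features
full_score = cross_val_score(
    RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42),
    X_demo, y_demo, cv=5, scoring='accuracy'
).mean()
two_score = cross_val_score(
    RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42),
    X_demo[:, top_two], y_demo, cv=5, scoring='accuracy'
).mean()

print(f"All features: {full_score:.3f}")
print(f"Top two features only: {two_score:.3f}")
```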
---