# Day 3 — Random Forest Hyperparameter Tuning
### Machine Learning Roadmap — Week 4
### Author — N Manish Kumar
---

Random Forests are strong baseline models, but their performance and
generalization depend heavily on hyperparameters such as:

- Number of trees
- Tree depth
- Minimum samples per split
- Number of features considered at each split

These parameters control the bias–variance trade-off.

In this notebook, we will:
- Train a default Random Forest model
- Tune key hyperparameters using cross-validation
- Compare tuned vs default performance
- Understand how tuning affects bias and variance

Dataset used: **Breast Cancer Dataset (sklearn)**

---

## 1. Dataset Loading and Train/Test Split


In [1]:
import pandas as pd
import numpy as np

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

print("Training shape:", X_train.shape)
print("Test shape:", X_test.shape)

Training shape: (455, 30)
Test shape: (114, 30)


## 2. Default Random Forest Baseline

Before tuning hyperparameters, we first train a Random Forest using default
settings provided by scikit-learn.

This baseline performance will be used to compare against the tuned model
to determine whether hyperparameter optimization provides real improvement.
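To make "default settings" concrete, the short sketch below prints the default values of the four hyperparameters discussed in this notebook. The names `tuned_keys` and `default_settings` are introduced here only for illustration, and the printed values depend on the installed scikit-learn version.

```python
from sklearn.ensemble import RandomForestClassifier

# Instantiate with no arguments to read the library defaults
defaults = RandomForestClassifier().get_params()

# The four hyperparameters this notebook focuses on (illustrative selection)
tuned_keys = ["n_estimators", "max_depth", "min_samples_split", "max_features"]
default_settings = {key: defaults[key] for key in tuned_keys}

for key, value in default_settings.items():
    print(f"{key}: {value}")
```

Knowing the defaults makes it easier to judge later how far the tuned configuration actually moves away from the baseline.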

---

In [2]:
# Default Random Forest
rf_default = RandomForestClassifier(random_state=42, n_jobs=-1)
rf_default.fit(X_train, y_train)

# Evaluate
train_acc_default = accuracy_score(y_train, rf_default.predict(X_train))
test_acc_default = accuracy_score(y_test, rf_default.predict(X_test))

print("Default RF Train Accuracy:", train_acc_default)
print("Default RF Test Accuracy:", test_acc_default)

Default RF Train Accuracy: 1.0
Default RF Test Accuracy: 0.956140350877193


### Interpretation

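The default forest reaches a training accuracy of 1.0. With no depth limit,
each tree keeps splitting until its leaves are pure, so the ensemble fits the
training data almost perfectly. Training accuracy on its own therefore says
little about generalization.

The test accuracy of about 0.956 (109 of 114 test samples correct) reflects
how well the model generalizes to unseen samples.

This baseline result will be compared with the tuned Random Forest to evaluate
whether hyperparameter optimization improves generalization.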
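A single test split gives only one estimate of generalization. As a rough complement, the sketch below cross-validates the default model on the training set, which produces a number that can be compared directly with a cross-validated score from tuning. It reloads the data so that it runs on its own; the variable name `cv_scores` is an assumption of this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Same split settings as in Section 1
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 5-fold cross-validated accuracy of the default forest, training data only
rf_default = RandomForestClassifier(random_state=42)
cv_scores = cross_val_score(rf_default, X_train, y_train, cv=5, scoring="accuracy")

print("Fold accuracies:", cv_scores.round(3))
print("Mean:", cv_scores.mean(), "Std:", cv_scores.std())
```

The spread across folds gives a sense of how much accuracy can vary from split to split before any tuning is done.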

---
## 3. Hyperparameter Grid and GridSearchCV

Random Forest performance depends on several key hyperparameters that control
model complexity and randomness.

Important parameters include:
- n_estimators: number of trees in the forest
- max_depth: maximum depth of each tree
- min_samples_split: minimum samples needed to split a node
- max_features: number of features considered at each split

We use GridSearchCV to:
- Try multiple combinations of these parameters
- Evaluate each using cross-validation
- Select the combination that gives the best average performance
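The number of combinations a grid search tries is the product of the number of values per parameter. The sketch below uses `ParameterGrid` to enumerate the candidates for the grid used in this section; `candidates` and `n_folds` are names chosen for this example.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [None, 5, 10],
    "min_samples_split": [2, 5],
    "max_features": ["sqrt", "log2"],
}

# Every combination GridSearchCV would evaluate
candidates = list(ParameterGrid(param_grid))
n_folds = 5

print("Candidates:", len(candidates))            # 2 * 3 * 2 * 2 = 24
print("Total fits:", len(candidates) * n_folds)  # 24 * 5 = 120
print("First candidate:", candidates[0])
```

Because the cost grows multiplicatively, adding one more value to any parameter list noticeably increases the total number of fits.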


In [3]:
# Hyperparameter grid
param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [None, 5, 10],
    "min_samples_split": [2, 5],
    "max_features": ["sqrt", "log2"],
}
rf = RandomForestClassifier(random_state=42, n_jobs=-1)

grid_search = GridSearchCV(
    rf,
    param_grid=param_grid,
    cv=5,
    scoring="accuracy",
    n_jobs=-1,
    verbose=1,
)

grid_search.fit(X_train, y_train)

print("Best Parameters:", grid_search.best_params_)
print("Best CV Accuracy:", grid_search.best_score_)

Fitting 5 folds for each of 24 candidates, totalling 120 fits
Best Parameters: {'max_depth': None, 'max_features': 'sqrt', 'min_samples_split': 2, 'n_estimators': 200}
Best CV Accuracy: 0.9604395604395606


### Interpretation

GridSearchCV evaluated 24 Random Forest configurations with 5-fold
cross-validation (120 fits) and selected the combination with the highest mean
validation accuracy, about 0.960.

The selected parameters match the scikit-learn defaults except for
n_estimators (200 instead of 100), which suggests the defaults are already
close to the best setting in this grid for this dataset. Averaging over five
folds also reduces the risk of choosing parameters that only work well on one
split.

---

## 4. Comparing Tuned Random Forest with Default Model

After finding the best hyperparameters using cross-validation, we now evaluate
the tuned model on the test set.

We compare:
- Training accuracy
- Test accuracy

between the default Random Forest and the tuned Random Forest to determine
whether tuning improved generalization or only fit the training data better.


In [4]:
# Best model from the grid search, refit on the full training set (refit=True is the default)
rf_tuned = grid_search.best_estimator_

# Evaluate Tuned Model
train_acc_tuned = accuracy_score(y_train, rf_tuned.predict(X_train))
test_acc_tuned = accuracy_score(y_test, rf_tuned.predict(X_test))

print("Default RF -> Train:", train_acc_default, "Test:", test_acc_default)
print("Tuned RF   -> Train:", train_acc_tuned, "Test:", test_acc_tuned)

Default RF -> Train: 1.0 Test: 0.956140350877193
Tuned RF   -> Train: 1.0 Test: 0.956140350877193


### Interpretation

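Here the default and tuned models score identically: 1.0 on the training set
and about 0.956 on the test set. Tuning did not change test performance on
this split, which is unsurprising given that the selected parameters differ
from the defaults only in n_estimators (200 instead of 100).

In general, if test accuracy improves while training accuracy does not
increase significantly, tuning has improved generalization. If training
accuracy increases but test accuracy does not, tuning may have led to
overfitting.

The best tuning outcome is similar or higher training accuracy combined with
improved test performance.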
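When comparing test accuracies on a small test set, it helps to know how coarse the metric is. The arithmetic below uses the test-set size and accuracy printed earlier in this notebook; the variable names are illustrative.

```python
n_test = 114  # test-set size printed in Section 1

# Smallest possible change in accuracy: one prediction flipping
resolution = 1 / n_test

# The reported test accuracy corresponds to a whole number of correct predictions
reported_acc = 0.956140350877193
n_correct = round(reported_acc * n_test)
n_errors = n_test - n_correct

print(f"Accuracy step per sample: {resolution:.4f}")
print(f"Correct: {n_correct}, errors: {n_errors}")
```

With steps of just under one percentage point per sample, small differences in test accuracy between two models on this split should be read with caution.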

---

## 5. Understanding the Effect of Hyperparameters

Hyperparameter tuning is most useful when we understand *why* certain settings
perform better.

Key Random Forest parameters affect the model in different ways:

- n_estimators:
  More trees reduce variance but increase training time.

- max_depth:
  Controls how complex each tree is.
  Shallow trees → higher bias, deeper trees → higher variance.

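- min_samples_split:
  Sets the minimum number of samples a node needs before it can be split.
  Higher values prevent very small, highly specific splits.

- max_features:
  Controls randomness at each split.
  Smaller values increase diversity among trees and reduce their correlation,
  at the cost of weaker individual trees.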

By examining the best parameters found by GridSearch, we can infer how the
model balanced bias and variance on this dataset.
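One way to see the depth effect described above is to vary a single parameter while holding the others fixed. The sketch below uses scikit-learn's `validation_curve` for max_depth; the depth values in `depths` are an arbitrary choice for illustration, and the data is reloaded so the example runs on its own.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, validation_curve

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Illustrative depth values, from very shallow to fairly deep
depths = [1, 2, 5, 10]
train_scores, val_scores = validation_curve(
    RandomForestClassifier(n_estimators=100, random_state=42),
    X_train, y_train,
    param_name="max_depth",
    param_range=depths,
    cv=5,
    scoring="accuracy",
)

# Each row holds the per-fold scores for one depth value
for depth, tr, va in zip(depths, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"max_depth={depth:>2}  train={tr:.3f}  validation={va:.3f}")
```

A widening gap between training and validation accuracy as depth grows points toward variance, while low scores on both at shallow depths point toward bias.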


In [5]:
# Convert grid search results to DataFrame
cv_results_df = pd.DataFrame(grid_search.cv_results_)

# Show top 5 configurations
cv_results_df[
    ["mean_test_score", "param_n_estimators", "param_max_depth",
     "param_min_samples_split", "param_max_features"]
].sort_values(by="mean_test_score", ascending=False).head()

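| | mean_test_score | param_n_estimators | param_max_depth | param_min_samples_split | param_max_features |
|---|---|---|---|---|---|
| 1 | 0.96044 | 200 | None | 2 | sqrt |
| 17 | 0.96044 | 200 | 10 | 2 | sqrt |
| 5 | 0.956044 | 200 | None | 2 | log2 |
| 6 | 0.956044 | 100 | None | 5 | log2 |
| 7 | 0.956044 | 200 | None | 5 | log2 |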


### Interpretation

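The best-performing configurations usually balance tree depth and randomness.

In the table above, the top five configurations are within about 0.004 mean
CV accuracy of each other, and the two best differ only in max_depth (None
versus 10). This suggests that a depth limit of 10 rarely constrains the trees
on this dataset, and that the remaining differences may be within
fold-to-fold noise.

If shallow depths and higher min_samples_split are selected, it suggests that
reducing overfitting was important.

If deeper trees perform better, it suggests that the dataset benefits from
more complex decision boundaries.

Understanding these trends helps guide future tuning instead of relying
entirely on brute-force search.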
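To judge whether the gaps between configurations are meaningful, the fold-to-fold spread can be inspected alongside the mean. The sketch below is a minimal example on a reduced grid (`small_grid` is an assumption made to keep the run short, not the grid used above) and reads the `std_test_score` and `rank_test_score` columns of `cv_results_`.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Reduced grid (assumption: chosen only to keep this example fast)
small_grid = {"max_depth": [None, 5], "max_features": ["sqrt", "log2"]}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=100, random_state=42),
    param_grid=small_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

results = pd.DataFrame(search.cv_results_)
summary = results[
    ["rank_test_score", "mean_test_score", "std_test_score",
     "param_max_depth", "param_max_features"]
].sort_values("rank_test_score")
print(summary)

# Count configurations whose mean lies within one std of the best mean
best = summary.iloc[0]
within_one_std = summary[
    summary["mean_test_score"] >= best["mean_test_score"] - best["std_test_score"]
]
print("Configurations within one std of the best:", len(within_one_std))
```

If most configurations fall within one standard deviation of the best, the ranking among them is weak evidence, and a simpler or cheaper setting may be a reasonable choice.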

---