---
title: "Random Forests in Practice"
author: "Data Science Lab"
date: "2025-11-21"
format:
  html:
    toc: true
    toc-title: "Contents"
    toc-depth: 2
    code-fold: true
  pdf:
    toc: true
execute:
  echo: true
  warning: false
  message: false
---

## Overview

Random forests are ensemble models that aggregate many decision trees to reduce variance and improve generalization.[^breiman2001] This document walks through training and interpreting a random forest classifier in Python, with a mix of narrative, math, and visuals.

## Mathematical Model

Each tree $T_b$ is trained on a bootstrap sample $\mathcal{D}_b$ and considers only a random subset of features at each split. The forest prediction for a classification task with $B$ trees is the majority vote:

$$
\hat{y} = \mathrm{mode}\left(\{T_b(\mathbf{x})\}_{b=1}^{B}\right)
$$

For regression, the trees are averaged:

$$
\hat{y} = \frac{1}{B}\sum_{b=1}^{B} T_b(\mathbf{x})
$$
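
The two aggregation rules above can be sketched with NumPy. This is a minimal illustration, not scikit-learn's internals: the helper names `forest_vote` and `forest_average` and the per-tree outputs are hypothetical, and class labels are assumed to be non-negative integers.

```python
import numpy as np

def forest_vote(tree_preds):
    """Majority vote; tree_preds has shape (B, n_samples) with integer labels."""
    tree_preds = np.asarray(tree_preds)
    n_classes = tree_preds.max() + 1
    # Count the votes each class receives, one column (sample) at a time.
    counts = np.apply_along_axis(np.bincount, 0, tree_preds, minlength=n_classes)
    # argmax breaks ties in favor of the smallest label.
    return counts.argmax(axis=0)

def forest_average(tree_preds):
    """Regression aggregate: mean over the B trees for each sample."""
    return np.asarray(tree_preds, dtype=float).mean(axis=0)

# Hypothetical outputs of B = 3 trees (illustrative values only).
class_preds = np.array([[0, 1, 1, 0],
                        [0, 1, 0, 0],
                        [1, 1, 0, 1]])
reg_preds = np.array([[2.0, 3.0],
                      [4.0, 3.0],
                      [6.0, 6.0]])

votes = forest_vote(class_preds)
averages = forest_average(reg_preds)
print(votes)     # [0 1 0 0]
print(averages)  # [4. 4.]
```

Note that scikit-learn's `RandomForestClassifier` combines trees by averaging their predicted class probabilities rather than counting hard votes, so its output can differ from a strict majority vote when trees are uncertain.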

Bootstrapping the data and subsampling features at each split decorrelate the trees, so the aggregated prediction has lower variance than a single tree.
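
The effect of decorrelation can be made concrete with a standard idealization. Assume, purely for illustration, that the $B$ tree predictions are identically distributed with variance $\sigma^2$ and share a pairwise correlation $\rho$ (both symbols are introduced here as assumptions). The variance of the regression average is then:

$$
\mathrm{Var}\left(\frac{1}{B}\sum_{b=1}^{B} T_b(\mathbf{x})\right) = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
$$

The second term vanishes as $B$ grows, while the first is a floor set by the correlation. For example, with $\sigma^2 = 1$ and $B = 100$, a correlation of $\rho = 0.5$ gives $0.5 + 0.005 = 0.505$, whereas $\rho = 0.1$ gives $0.1 + 0.009 = 0.109$. Adding trees alone cannot go below $\rho\,\sigma^2$; lowering $\rho$ through feature subsampling can.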

## Environment Setup

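```{python}
import importlib
import subprocess
import sys

# Map pip distribution names to importable module names; they differ for scikit-learn.
PACKAGES = {
    "numpy": "numpy",
    "pandas": "pandas",
    "seaborn": "seaborn",
    "matplotlib": "matplotlib",
    "scikit-learn": "sklearn",
}

def ensure(package, module):
    """Install `package` with pip if `module` cannot be imported."""
    try:
        importlib.import_module(module)
    except ImportError:
        subprocess.check_call([sys.executable, "-m", "pip", "install", package])

for pkg, mod in PACKAGES.items():
    ensure(pkg, mod)
```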

```{python}
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import RocCurveDisplay, ConfusionMatrixDisplay, classification_report

sns.set_theme(style="whitegrid")
```

## Data Loading and Preparation

We will use the Breast Cancer Wisconsin dataset bundled with scikit-learn, which contains 30 features computed from digitized fine needle aspirate images.[^sklearn_breast]

```{python}
dataset = load_breast_cancer(as_frame=True)
df = dataset.frame
df.head()
```

Split the data into training and testing sets (stratified to maintain label balance).

```{python}
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns="target"),
    df["target"],
    test_size=0.25,
    random_state=42,
    stratify=df["target"]
)

X_train.shape, X_test.shape
```
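
To see what stratification buys, the following sketch splits a small synthetic label vector (illustrative data, not the breast cancer set) and compares the positive-class rate in each part. The variable names are placeholders chosen for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels (illustrative only): 80 of class 0, 20 of class 1.
y = np.array([0] * 80 + [1] * 20)
X = np.arange(len(y)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# With stratify=y, both parts keep the original 20% positive rate.
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"positive rate: train={train_rate:.2f}, test={test_rate:.2f}")
```

Without `stratify`, the two rates would vary with the random seed, which matters most when the minority class is small.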

## Model Training

```{python}
rf = RandomForestClassifier(
    n_estimators=400,
    max_features="sqrt",
    min_samples_leaf=2,
    random_state=42,
    n_jobs=-1
)
rf.fit(X_train, y_train)
```

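Run 5-fold cross-validation on the training set to estimate generalization accuracy. `cross_val_score` fits a fresh clone of `rf` on each fold, so the model fitted above is left unchanged.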

```{python}
cv_scores = cross_val_score(rf, X_train, y_train, cv=5)
cv_scores.mean(), cv_scores.std()
```
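
Accuracy alone can hide differences in ranking quality, so it may help to report several metrics from the same folds. The sketch below uses `cross_validate` with an explicit `StratifiedKFold`. To stay self-contained it reloads the data and cross-validates on the full dataset with a smaller forest, which is an assumption of this example rather than the setup used above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

X, y = load_breast_cancer(return_X_y=True)

# Smaller forest than in the main text, chosen only to keep the example fast.
model = RandomForestClassifier(n_estimators=100, random_state=42)

# Shuffled, stratified folds with a fixed seed for reproducibility.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
results = cross_validate(model, X, y, cv=cv, scoring=["accuracy", "roc_auc"])

# Collapse per-fold scores into (mean, std) pairs per metric.
summary = {
    name: (results[f"test_{name}"].mean(), results[f"test_{name}"].std())
    for name in ("accuracy", "roc_auc")
}
for name, (mean, std) in summary.items():
    print(f"{name}: {mean:.3f} +/- {std:.3f}")
```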

## Diagnostics

```{python}
y_pred = rf.predict(X_test)
print(classification_report(y_test, y_pred, target_names=dataset.target_names))
```

### Receiver Operating Characteristic

```{python}
fig, ax = plt.subplots(figsize=(6, 4))
RocCurveDisplay.from_estimator(rf, X_test, y_test, ax=ax)
ax.set_title("Random Forest ROC Curve")
plt.tight_layout()
```

### Confusion Matrix

```{python}
fig, ax = plt.subplots(figsize=(4, 4))
ConfusionMatrixDisplay.from_estimator(
    rf,
    X_test,
    y_test,
    display_labels=dataset.target_names,
    ax=ax,
    cmap="Blues",
)
ax.set_title("Confusion Matrix")
plt.tight_layout()
```

## Feature Importance Visualization

```{python}
# Use the training frame's columns so names stay aligned with the fitted model.
importances = pd.Series(rf.feature_importances_, index=X_train.columns).sort_values(ascending=False)
top_features = importances.head(15)

plt.figure(figsize=(8, 5))
# Assign hue explicitly; seaborn >= 0.13 deprecates `palette` without `hue`.
sns.barplot(
    x=top_features.values,
    y=top_features.index,
    hue=top_features.index,
    palette="viridis",
    legend=False,
)
plt.title("Top 15 Feature Importances")
plt.xlabel("Gini Importance")
plt.ylabel("Feature")
plt.tight_layout()
```
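
Impurity-based importances are computed from the training data and can favor features with many distinct values, so a held-out check is a useful complement. The sketch below computes permutation importance on a test split with `sklearn.inspection.permutation_importance`. It reloads the data and fits a smaller forest so that it runs on its own; those choices, and the variable names, are assumptions of this example.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_breast_cancer(as_frame=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    data.data, data.target, test_size=0.25, random_state=42, stratify=data.target
)

# Smaller forest than in the main text, chosen only to keep the example fast.
model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Shuffle each feature in turn on held-out data and record the drop in accuracy.
result = permutation_importance(model, X_te, y_te, n_repeats=5, random_state=42)
perm = pd.Series(result.importances_mean, index=X_te.columns).sort_values(ascending=False)
print(perm.head(5))
```

When features are strongly correlated, as several are in this dataset, permuting one of them may barely change the score because the model can lean on its neighbors; low permutation importance therefore does not prove a feature is uninformative.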

## Hyperparameter Considerations

- `n_estimators`: Increasing the number of trees generally stabilizes predictions until diminishing returns set in, at the cost of longer training and inference.
- `max_depth` or `min_samples_leaf`: Control tree complexity, mitigating overfitting.
- `max_features`: Governs the degree of feature randomness; `sqrt` is typical for classification.
- `class_weight`: Useful for imbalanced datasets to penalize misclassification of minority classes.

Grid search, random search, or Bayesian optimization can explore these settings systematically; for a fixed budget, random search is often more efficient than grid search.[^bergstra2012]
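
As one way to tune the hyperparameters listed above, the following sketch runs a small random search with `RandomizedSearchCV`. The search ranges, the number of iterations, and the choice of ROC AUC as the score are illustrative assumptions, not recommendations from this document; a real search would typically use more iterations and only the training split.

```python
from scipy.stats import randint
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_breast_cancer(return_X_y=True)

# Illustrative search space (assumed ranges); lists are sampled uniformly.
param_distributions = {
    "n_estimators": randint(50, 201),
    "max_depth": [None, 4, 8, 16],
    "min_samples_leaf": randint(1, 6),
    "max_features": ["sqrt", "log2", 0.5],
    "class_weight": [None, "balanced"],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,          # kept tiny so the example runs quickly
    cv=3,
    scoring="roc_auc",
    random_state=42,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```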

## Practical Tips

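- **Feature scaling**: Not required, because tree splits depend only on the ordering of each feature's values and are therefore invariant to strictly monotonic transformations of individual features.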
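- **Missing values**: scikit-learn releases before 1.4 reject NaNs in random forests, so impute beforehand; version 1.4 and later handle NaNs natively in `RandomForestClassifier` and `RandomForestRegressor`.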
- **Interpretability**: Use SHAP values or permutation importance for richer explanations.
- **Out-of-bag (OOB) estimates**: Enable `oob_score=True` to get a built-in validation metric without a separate hold-out set.
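
The out-of-bag estimate from the last tip can be obtained as follows. This is a minimal sketch that reloads the data and fits on all of it, with a forest size picked for speed; both are assumptions of the example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# oob_score=True requires bootstrap sampling, which is the default.
model = RandomForestClassifier(
    n_estimators=200, oob_score=True, bootstrap=True, random_state=42
)
model.fit(X, y)

# Each sample is scored only by trees whose bootstrap sample excluded it.
print(f"OOB accuracy: {model.oob_score_:.3f}")
print(model.oob_decision_function_.shape)  # (n_samples, n_classes)
```

With very few trees, some samples may never be left out of a bootstrap sample and scikit-learn warns that the estimate is unreliable, so OOB scoring is best used with a forest of at least moderate size.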

## References

- Breiman, L. (2001). Random forests. *Machine Learning*, 45(1), 5–32. [https://doi.org/10.1023/A:1010933404324](https://doi.org/10.1023/A:1010933404324)
- scikit-learn Breast Cancer Dataset docs. [https://scikit-learn.org/stable/datasets/toy_dataset.html](https://scikit-learn.org/stable/datasets/toy_dataset.html)
- Bergstra, J., & Bengio, Y. (2012). Random search for hyper-parameter optimization. *Journal of Machine Learning Research*, 13, 281–305. [https://jmlr.org/papers/v13/bergstra12a.html](https://jmlr.org/papers/v13/bergstra12a.html)

[^breiman2001]: Introduced the random forest algorithm with theoretical justification and empirical benchmarks.
[^sklearn_breast]: Official description of the dataset, feature definitions, and usage considerations.
[^bergstra2012]: Demonstrated the efficiency gains of random search over grid search for hyperparameter tuning.
