## Ensemble Methods — Bagging, Random Forests, and Boosting

This notebook walks through an end-to-end supervised learning pipeline built on ensemble methods:
	•	Data Exploration: load, inspect, visualize
	•	Preprocessing: split, (optional) imputation
	•	Modeling: from-scratch Bagging (+ sklearn RandomForest, AdaBoost, GradientBoosting)
	•	Evaluation: accuracy, confusion matrix, classification report, ROC-AUC, ROC curves, feature importances
	•	Tuning: small GridSearchCV for RandomForest/GradientBoosting

Concepts:
	•	Bagging averages/votes across many high-variance learners trained on bootstrap samples.
	•	Random Forest = Bagging with decision trees, plus a random subset of features considered at each split.
	•	Boosting fits learners sequentially: AdaBoost reweights the training points so each new learner focuses on previously misclassified ones, while Gradient Boosting fits each new learner to the negative gradient of a differentiable loss.
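
To make the bootstrap idea behind bagging concrete, here is a minimal sketch using NumPy only. The sample size `n`, the seed, and the variable names are illustrative assumptions, not part of the notebook. Drawing `n` rows with replacement leaves each row out with probability (1 - 1/n)^n, which approaches 1/e, so roughly 63% of the rows end up in-bag and the rest are out-of-bag.

```python
import numpy as np

rng = np.random.default_rng(0)   # illustrative seed
n = 10_000                       # illustrative sample size

# One bootstrap sample: n row indices drawn with replacement
boot_idx = rng.integers(0, n, size=n)

# Fraction of distinct rows that made it into the sample
in_bag_fraction = np.unique(boot_idx).size / n

# Theoretical in-bag fraction: 1 - (1 - 1/n)^n, close to 1 - 1/e
expected_fraction = 1.0 - (1.0 - 1.0 / n) ** n

print(f"empirical in-bag fraction:   {in_bag_fraction:.3f}")
print(f"theoretical in-bag fraction: {expected_fraction:.3f}")
```

The rows left out of a given bootstrap sample are what an out-of-bag (OOB) estimate uses as a built-in validation set for that learner.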

## Setup

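We’ll use the Breast Cancer Wisconsin dataset (binary classification). Tree-based ensembles don’t require feature scaling, because splits depend only on the ordering of feature values. The dataset has no missing values, so the imputation step is a safeguard for reusing the pipelines on other data rather than a necessity here.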
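
The point about feature scaling can be checked with a small experiment. The sketch below is an illustration under stated assumptions: the data is synthetic and integer-valued, the per-feature scale factors are arbitrary powers of two, and names such as `X_demo` and `scales` are hypothetical. It fits the same decision tree on raw and rescaled features and compares the predictions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic integer-valued features (illustrative, not the notebook's dataset)
X_demo = rng.integers(0, 100, size=(300, 4)).astype(float)
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 100).astype(int)

# Rescale each feature by a different positive constant
scales = np.array([0.5, 8.0, 1024.0, 64.0])
X_scaled = X_demo * scales

# Same tree settings, fitted once on raw and once on rescaled features
tree_raw = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_demo, y_demo)
tree_scaled = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_scaled, y_demo)

# Positive rescaling preserves the ordering of values within each feature,
# so the tree should choose the same partitions of the training data.
same_predictions = np.array_equal(
    tree_raw.predict(X_demo), tree_scaled.predict(X_scaled)
)
print("Identical predictions after rescaling:", same_predictions)
```

Distance-based or gradient-based models would generally not pass this check, which is why scaling matters for them but not for tree ensembles.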

In [None]:
# Imports & reproducibility
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from typing import Optional, Tuple, Union, List, Dict
from IPython.display import display

# sklearn utilities
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    classification_report,
    roc_auc_score,
    roc_curve,
)
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier

# Optional viz
try:
    import seaborn as sns
except Exception:
    sns = None

np.random.seed(42)
plt.rcParams["figure.figsize"] = (6.5, 4.0)

## Data Exploration

We’ll load the dataset and examine its shape, summary statistics, and class balance. Optionally, we also plot a quick correlation heatmap.

In [None]:
# Load Breast Cancer dataset
cancer = load_breast_cancer(as_frame=True)
X: pd.DataFrame = cancer.data.copy()
y: pd.Series = pd.Series(cancer.target, name="target")  # 0=malignant, 1=benign

df = X.copy()
df["target"] = y

print("Shape:", df.shape)
display(df.head())
display(df.describe())

print("\nClass counts (0=malignant, 1=benign):")
display(y.value_counts())

# Optional correlations heatmap (may be dense)
if sns is not None:
    try:
        corr = df.corr(numeric_only=True)
        sns.heatmap(corr, cmap="coolwarm", center=0)
        plt.title("Feature Correlations (numeric)")
        plt.show()
    except Exception as e:
        print("Skipping heatmap:", e)
else:
    print("Seaborn not installed; skipping heatmap.")

## Preprocessing
	•	Stratified train/test split to preserve class proportions
	•	Trees/ensembles don’t need scaling; we’ll keep a median imputer in each pipeline for generality.
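
As a rough illustration of what "a median imputer in each pipeline" looks like, here is a minimal sketch on toy data. The arrays `X_toy` and `y_toy`, the step names, and the choice of a single decision tree as the final step are assumptions made for the example; the notebook's own pipelines may be configured differently.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Toy data with missing values (illustrative only)
X_toy = np.array([
    [1.0, 10.0],
    [2.0, np.nan],
    [np.nan, 30.0],
    [4.0, 40.0],
    [5.0, 50.0],
    [6.0, 60.0],
])
y_toy = np.array([0, 0, 0, 1, 1, 1])

pipe = Pipeline([
    # Learns per-column medians on the training data and fills NaNs with them
    ("imputer", SimpleImputer(strategy="median")),
    ("clf", DecisionTreeClassifier(max_depth=2, random_state=0)),
])
pipe.fit(X_toy, y_toy)

# The medians are computed from the non-missing training values only
learned_medians = pipe.named_steps["imputer"].statistics_

# At prediction time the same medians are reused for missing entries
toy_pred = pipe.predict(np.array([[np.nan, 55.0]]))

print("Per-column medians learned on training data:", learned_medians)
print("Prediction for a row with a missing value:", toy_pred)
```

Keeping the imputer inside the pipeline means its medians are learned from the training fold only, which avoids leaking test-set statistics during cross-validation.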

In [None]:
# Train/Test split (stratified)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
print("Train shape:", X_train.shape, " Test shape:", X_test.shape)
print("Train class balance:\n", y_train.value_counts(normalize=True))

## Modeling

We’ll train four models:
	1.	From-scratch Bagging classifier (simple, pedagogical) using decision trees as base learners
	2.	RandomForestClassifier (bagging + a random feature subset at each split)
	3.	AdaBoostClassifier (boosting with exponential loss)
	4.	GradientBoostingClassifier (boosting with differentiable loss)

We’ll also do a small GridSearchCV for RandomForest and GradientBoosting.
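
To show what "focusing on previously misclassified points" means numerically, here is a worked sketch of a single round of the textbook binary AdaBoost weight update. The five-point setup and the misclassification pattern are hypothetical, and library implementations (including sklearn's) may scale these quantities differently.

```python
import numpy as np

# Five training points with uniform initial weights
weights = np.full(5, 0.2)

# Hypothetical outcome of one weak learner: only the last point is misclassified
misclassified = np.array([False, False, False, False, True])

# Weighted error of the weak learner and its vote weight (alpha)
err = weights[misclassified].sum()
alpha = 0.5 * np.log((1.0 - err) / err)

# Multiply weights by exp(+alpha) for mistakes and exp(-alpha) for correct
# predictions, then renormalize so the weights sum to one
signs = np.where(misclassified, 1.0, -1.0)
new_weights = weights * np.exp(alpha * signs)
new_weights /= new_weights.sum()

print("weighted error:", err)
print("alpha:", round(float(alpha), 4))
print("updated weights:", np.round(new_weights, 3))
```

After the update the single misclassified point carries half of the total weight, so the next weak learner is pushed to classify it correctly.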

In [None]:
class SimpleBaggingClassifier:
    """
    A minimal Bagging classifier using a base estimator (default: DecisionTreeClassifier).
    - Fits `n_estimators` learners on bootstrap samples (rows) and optional feature subsamples (columns).
    - Aggregates by probability averaging (then argmax).
    - Computes an OOB accuracy if bootstrap=True.
    NOTE: This is for teaching; for production, use sklearn.ensemble.BaggingClassifier.
    """
    def __init__(
        self,
        base_estimator: Optional[DecisionTreeClassifier] = None,
        n_estimators: int = 50,
        max_samples: float = 0.8,     # fraction of rows per bootstrap sample
        max_features: float = 1.0,    # fraction of columns per estimator
        bootstrap: bool = True,
        random_state: Optional[int] = 42,
    ):
        if base_estimator is None:
            base_estimator = DecisionTreeClassifier(max_depth=3, random_state=0)
        self.base_estimator = base_estimator
        self.n_estimators = int(n_estimators)
        self.max_samples = float(max_samples)
        self.max_features = float(max_features)
        self.bootstrap = bool(bootstrap)
        self.random_state = random_state

        self.estimators_: List[DecisionTreeClassifier] = []
        self.feature_indices_: List[np.ndarray] = []
        self.classes_: Optional[np.ndarray] = None
        self.oob_score_: Optional[float] = None

    def fit(self, X: Union[pd.DataFrame, np.ndarray], y: Union[pd.Series, np.ndarray]):
        X = np.asarray(X)
        y = np.asarray(y)
        n, d = X.shape
        rng = np.random.RandomState(self.random_state)

        self.classes_ = np.unique(y)
        self.estimators_.clear()
        self.feature_indices_.clear()

        # Track in-bag indices for OOB score
        inbag_masks = []

        n_rows = max(1, int(round(self.max_samples * n)))
        n_cols = max(1, int(round(self.max_features * d)))

        for i in range(self.n_estimators):
            # sample rows
            if self.bootstrap:
                row_idx = rng.randint(0, n, size=n_rows)  # with replacement
            else:
                row_idx = rng.choice(np.arange(n), size=n_rows, replace=False)

            # sample columns
            feat_idx = rng.choice(np.arange(d), size=n_cols, replace=False)

            est = self._fresh_estimator(i)
            est.fit(X[row_idx][:, feat_idx], y[row_idx])

            self.estimators_.append(est)
            self.feature_indices_.append(feat_idx)

            mask = np.zeros(n, dtype=bool)
            mask[row_idx] = True
            inbag_masks.append(mask)

        # Compute OOB accuracy over the samples that were out-of-bag for at least one estimator
        if self.bootstrap:
            votes_sum = np.zeros((n, len(self.classes_)), dtype=float)
            votes_cnt = np.zeros(n, dtype=int)

            class_to_pos = {c: j for j, c in enumerate(self.classes_)}
            for est, feat_idx, inbag in zip(self.estimators_, self.feature_indices_, inbag_masks):
                oob_idx = np.where(~inbag)[0]
                if oob_idx.size == 0:
                    continue
                proba = self._predict_proba_with_est(est, X[oob_idx][:, feat_idx], class_to_pos)
                votes_sum[oob_idx] += proba
                votes_cnt[oob_idx] += 1

            usable = votes_cnt > 0
            if usable.any():
                y_oob_pred = self.classes_[np.argmax(votes_sum[usable] / votes_cnt[usable, None], axis=1)]
                self.oob_score_ = accuracy_score(y[usable], y_oob_pred)
            else:
                self.oob_score_ = None
        else:
            self.oob_score_ = None

        return self

    def predict(self, X: Union[pd.DataFrame, np.ndarray]) -> np.ndarray:
        proba = self.predict_proba(X)
        return self.classes_[np.argmax(proba, axis=1)]

    def predict_proba(self, X: Union[pd.DataFrame, np.ndarray]) -> np.ndarray:
        X = np.asarray(X)
        n = X.shape[0]
        proba_sum = np.zeros((n, len(self.classes_)), dtype=float)
        class_to_pos = {c: j for j, c in enumerate(self.classes_)}

        for est, feat_idx in zip(self.estimators_, self.feature_indices_):
            proba_sum += self._predict_proba_with_est(est, X[:, feat_idx], class_to_pos)

        # Average over the estimators that were actually fitted
        return proba_sum / float(len(self.estimators_))

    # ---- helpers ----
    def _fresh_estimator(self, i: int) -> DecisionTreeClassifier:
        # Clone the base estimator so that every hyperparameter is carried over,
        # then give each copy its own integer seed for diversity.
        from sklearn.base import clone  # local import: used only by this helper

        est = clone(self.base_estimator)
        base_rs = getattr(self.base_estimator, "random_state", None)
        if isinstance(base_rs, (int, np.integer)):
            est.set_params(random_state=int(base_rs) + i + 1)
        return est

    def _predict_proba_with_est(
        self,
        est: DecisionTreeClassifier,
        X_sub: np.ndarray,
        class_to_pos: Dict[Union[int, float], int],
    ) -> np.ndarray:
        # Align estimator's class order to global classes_
        est_proba = est.predict_proba(X_sub)
        aligned = np.zeros((X_sub.shape[0], len(self.classes_)), dtype=float)
        for j, c in enumerate(est.classes_):
            aligned[:, class_to_pos[c]] = est_proba[:, j]
        return aligned
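
The docstring above points to `sklearn.ensemble.BaggingClassifier` for production use. The following is a minimal sketch of a roughly comparable configuration, not a drop-in equivalent of the class above. The variable names (`X_bc`, `bag`, and so on) and the repeated data loading are assumptions made so the example runs on its own.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_bc, y_bc = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_bc, y_bc, test_size=0.25, stratify=y_bc, random_state=42
)

# The base tree is passed positionally because the keyword for this
# argument has been renamed across scikit-learn versions.
bag = BaggingClassifier(
    DecisionTreeClassifier(max_depth=3),
    n_estimators=50,
    max_samples=0.8,     # fraction of rows drawn for each estimator
    bootstrap=True,      # draw rows with replacement
    oob_score=True,      # estimate accuracy on out-of-bag rows
    random_state=42,
)
bag.fit(X_tr, y_tr)

test_accuracy = bag.score(X_te, y_te)
print(f"OOB accuracy:  {bag.oob_score_:.3f}")
print(f"Test accuracy: {test_accuracy:.3f}")
```

Comparing the OOB accuracy with the held-out test accuracy is a quick sanity check: the two should be in the same range if the OOB estimate is behaving as intended.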