# Chapter 7: Ensemble Learning and Random Forests

## 1. Chapter Overview
**Goal:** This chapter explores **Ensemble Learning**: the idea that aggregating the predictions of a group of predictors (such as classifiers or regressors) often gives better results than the best individual predictor. We will build the powerful **Random Forest** algorithm and explore boosting techniques such as **AdaBoost** and **Gradient Boosting**.

**Key Concepts:**
* **Voting Classifiers:** Hard Voting vs. Soft Voting.
* **Bagging and Pasting:** Training the same algorithm on different random subsets of data.
* **Out-of-Bag (OOB) Evaluation:** A clever way to validate Bagging models without a separate validation set.
* **Random Forests:** Combining Bagging with feature randomization.
* **Extra-Trees:** Extremely Randomized Trees for faster training.
* **Boosting:** Training predictors sequentially to correct the mistakes of previous ones (AdaBoost, Gradient Boosting).
* **Stacking:** Using a "meta-learner" to learn how to combine predictions.

**Practical Skills:**
* Implementing `VotingClassifier` in Scikit-Learn.
* Using `BaggingClassifier` to reduce variance.
* Visualizing Feature Importance with Random Forests.
* Implementing Early Stopping with Gradient Boosting to prevent overfitting.

In [None]:
# Setup
import sys
import sklearn
import numpy as np
import os
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt

np.random.seed(42)
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

## 2. Theoretical Explanation (In-Depth)

### 1. The Wisdom of the Crowd (Voting Classifiers)
Suppose you ask a complex question to thousands of random people, then aggregate their answers. In many cases, you will find that this aggregated answer is better than an expert's answer. This is the foundation of Ensemble Learning.

**Why does this work? (Law of Large Numbers)**
Imagine a slightly biased coin that has a 51% chance of coming up heads. If you toss it 1,000 times, you will get roughly 510 heads and 490 tails, and the probability of obtaining a majority of heads is roughly 73%. If you toss it 10,000 times, that probability climbs to over 97%, because the ratio of heads gets ever closer to 51% as the number of tosses grows.
Similarly, if you have 1,000 classifiers that are each individually weak (only slightly better than random guessing, say 51% accuracy), combining them into an ensemble can produce a strong classifier with high accuracy, provided that:
1.  The models are sufficiently independent.
2.  They make uncorrelated errors (they don't all fail on the same difficult instances).
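The coin-toss probabilities can be checked directly. The sketch below uses only the standard library; `prob_majority_heads` is a helper name introduced here for illustration, and "majority" is taken to mean strictly more heads than tails.

```python
import math

def prob_majority_heads(n_tosses, p_heads=0.51):
    """Exact probability that heads strictly outnumber tails.

    The binomial probabilities are summed in log space (via lgamma) so
    that the very large binomial coefficients do not overflow.
    """
    log_p = math.log(p_heads)
    log_q = math.log(1.0 - p_heads)
    total = 0.0
    for k in range(n_tosses // 2 + 1, n_tosses + 1):
        log_pmf = (math.lgamma(n_tosses + 1)
                   - math.lgamma(k + 1)
                   - math.lgamma(n_tosses - k + 1)
                   + k * log_p + (n_tosses - k) * log_q)
        total += math.exp(log_pmf)
    return total

for n in (1_000, 10_000):
    print(n, "tosses:", round(prob_majority_heads(n), 3))
```

The same arithmetic describes a hard-voting ensemble only under the idealized assumption that every classifier votes independently, which is rarely true when all models are trained on the same data.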

**Types of Voting:**
* **Hard Voting:** Each classifier votes for a class (e.g., "Class A"). The ensemble picks the class with the most votes (Majority Rule).
* **Soft Voting:** If all classifiers can estimate class probabilities (i.e., they have a `predict_proba()` method), the ensemble averages the probabilities for each class and picks the class with the highest average probability. This usually performs better than hard voting because it gives more weight to highly confident votes.
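To make the difference concrete, here is a minimal sketch with made-up probabilities for a single instance (the numbers are hypothetical, chosen so that the two voting schemes disagree).

```python
import numpy as np

# Hypothetical predicted probabilities for ONE instance.
# Rows: three classifiers; columns: P(class 0), P(class 1).
probas = np.array([
    [0.55, 0.45],   # classifier A leans slightly towards class 0
    [0.55, 0.45],   # classifier B leans slightly towards class 0
    [0.10, 0.90],   # classifier C is very confident in class 1
])

# Hard voting: each classifier casts one vote for its most likely class.
votes = probas.argmax(axis=1)
hard_prediction = np.bincount(votes, minlength=probas.shape[1]).argmax()

# Soft voting: average the probabilities, then pick the largest average.
mean_probas = probas.mean(axis=0)
soft_prediction = mean_probas.argmax()

print("hard vote:", hard_prediction)
print("soft vote:", soft_prediction, "with mean probabilities", mean_probas)
```

Two lukewarm votes win under hard voting, while the single confident vote tips the average under soft voting.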

### 2. Bagging and Pasting
Another approach is to use the *same* training algorithm for every predictor but train them on different random subsets of the training set.

* **Bagging (Bootstrap Aggregating):** Sampling is performed *with replacement*, so the same training instance can be sampled several times for the same predictor. Compared with pasting, this introduces a bit more diversity in the subsets each predictor is trained on, which gives bagging a slightly higher bias; but the predictors also end up less correlated, so the ensemble's variance is reduced (less overfitting).
* **Pasting:** Sampling is performed *without replacement*.

**Out-of-Bag (OOB) Evaluation:**
In Bagging, some instances may be sampled several times for a given predictor, while others may not be sampled at all. When each predictor draws as many instances as the training set contains (with replacement), about 63% of the training instances are sampled on average; the remaining 37% or so are called "Out-of-Bag" (OOB) instances, and they differ from one predictor to the next. Since a predictor never sees its OOB instances during training, it can be evaluated on them. Aggregating these OOB evaluations across predictors gives an estimate of the ensemble's performance without the need for a separate validation set.
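The 37% figure can be reproduced with a quick simulation. This is a sketch under the assumption that each predictor draws exactly as many instances as the training set contains; the training-set size `m` is an arbitrary value picked for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
m = 10_000  # hypothetical training-set size

# Bagging: draw m indices WITH replacement (one bootstrap sample).
bootstrap_idx = rng.integers(0, m, size=m)

# Instances whose index was never drawn are out-of-bag for this predictor.
in_bag = np.zeros(m, dtype=bool)
in_bag[bootstrap_idx] = True
oob_fraction = 1.0 - in_bag.mean()

# Closed form: an instance is missed by all m draws with probability (1 - 1/m)^m,
# which tends to exp(-1) as m grows.
expected_fraction = (1.0 - 1.0 / m) ** m

print(f"simulated OOB fraction: {oob_fraction:.3f}")
print(f"expected OOB fraction:  {expected_fraction:.3f}")
```

If each predictor draws only `n` instances with `n` smaller than `m`, the expected out-of-bag fraction becomes `(1 - 1/m) ** n`, which is larger.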

### 3. Random Forests
A Random Forest is essentially an ensemble of Decision Trees, generally trained via the bagging method (or sometimes pasting), typically with `max_samples` set to the size of the training set.

**Extra Randomness:**
Random Forests add an extra layer of randomness. Instead of searching for the very best feature when splitting a node (as a standard Decision Tree does), each tree searches for the best feature among a *random subset of features*.
* This results in greater tree diversity.
* It trades a higher bias for a lower variance, generally yielding a better overall model.
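One way to see the two sources of randomness is to compare a Random Forest with a bagging ensemble whose trees consider a random subset of features at each split. The sketch below assumes the Moons dataset and 100 trees, both chosen here for illustration; the two models are similar in spirit but not guaranteed to give identical predictions.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Randomness 1 (bagging): each tree sees a bootstrap sample of the data.
# Randomness 2 (max_features): each split considers a random subset of features.
bag_of_trees = BaggingClassifier(
    DecisionTreeClassifier(max_features="sqrt"),
    n_estimators=100, random_state=42)

# The dedicated class bundles both sources of randomness.
forest = RandomForestClassifier(
    n_estimators=100, max_features="sqrt", random_state=42)

scores = {}
for name, model in (("bagging of trees", bag_of_trees), ("random forest", forest)):
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))
    print(name, scores[name])
```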

**Feature Importance:**
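A great quality of Random Forests is that they make it easy to measure the relative importance of each feature. Scikit-Learn measures a feature's importance by looking at how much the tree nodes that use that feature reduce impurity on average (across all trees in the forest). More precisely, it is a weighted average, where each node's weight is the number of training samples associated with it, and the scores are scaled so that they sum to 1.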

### 4. Boosting
Boosting (originally called Hypothesis Boosting) refers to any Ensemble method that can combine several weak learners into a strong learner. The general idea is to train predictors sequentially, each trying to correct its predecessor.

**a. AdaBoost (Adaptive Boosting):**
To correct the predecessor, AdaBoost pays more attention to the training instances that the predecessor underfitted. 
1.  Train a base classifier (e.g., Decision Stump).
2.  Identify the misclassified instances.
3.  Increase the relative weight of those misclassified instances.
4.  Train a second classifier using the updated weights.
5.  Repeat.

The final prediction is a weighted vote in which more accurate classifiers carry more weight.
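The steps above can be sketched numerically for a single boosting round. The formulas used here (weighted error rate, predictor weight `alpha`, exponential re-weighting) are the commonly presented discrete AdaBoost update and are an assumption on top of the text; the labels and predictions are made up.

```python
import numpy as np

learning_rate = 1.0   # values below 1 shrink each predictor's weight
y_true = np.array([1, 1, 0, 0, 1, 0, 1, 0, 1, 0])
y_pred = np.array([1, 1, 0, 0, 1, 0, 1, 0, 0, 1])   # the last two are wrong

m = len(y_true)
weights = np.full(m, 1.0 / m)          # start with uniform instance weights
misclassified = y_pred != y_true

# Weighted error rate of the current predictor.
error_rate = weights[misclassified].sum() / weights.sum()

# Predictor weight: large for an accurate predictor, 0 at chance level.
alpha = learning_rate * np.log((1.0 - error_rate) / error_rate)

# Boost the weights of the misclassified instances, then renormalize.
weights[misclassified] *= np.exp(alpha)
weights /= weights.sum()

print("error rate:", error_rate)
print("predictor weight:", alpha)
print("updated instance weights:", weights)
```

With a learning rate of 1, the misclassified instances end up carrying half of the total weight, so the next classifier cannot ignore them.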

**b. Gradient Boosting:**
Like AdaBoost, Gradient Boosting works by sequentially adding predictors to an ensemble, each one correcting its predecessor. However, instead of tweaking instance weights, this method tries to fit the new predictor to the **residual errors** made by the previous predictor.
* Step 1: Train `Tree_1` on $(X, y)$.
* Step 2: Calculate the residual errors $y_{\text{residual1}} = y - \text{Tree}_1(X)$.
* Step 3: Train `Tree_2` on $(X, y_{\text{residual1}})$.
* Step 4: Calculate the residual errors $y_{\text{residual2}} = y_{\text{residual1}} - \text{Tree}_2(X)$.
* Final Prediction: $y_{\text{pred}} = \text{Tree}_1(X) + \text{Tree}_2(X) + \dots$
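The steps above can be sketched by hand with three small regression trees. This is a minimal illustration, not how `GradientBoostingRegressor` is implemented: the noisy quadratic data, the tree depth, and the absence of a learning rate are all assumptions made here.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.random((100, 1)) - 0.5
y = 3 * X[:, 0] ** 2 + 0.05 * rng.standard_normal(100)

trees = []
residual = y.copy()
for _ in range(3):
    tree = DecisionTreeRegressor(max_depth=2, random_state=42)
    tree.fit(X, residual)                    # fit what is still unexplained
    residual = residual - tree.predict(X)    # update the residual errors
    trees.append(tree)

def ensemble_predict(X_new, n_trees=len(trees)):
    """Sum the predictions of the first n_trees trees."""
    return sum(tree.predict(X_new) for tree in trees[:n_trees])

# Training error after 1, 2 and 3 trees.
mses = [np.mean((y - ensemble_predict(X, n)) ** 2)
        for n in range(1, len(trees) + 1)]
for n, mse in enumerate(mses, start=1):
    print(f"{n} tree(s): training MSE = {mse:.5f}")
```

Each added tree lowers the training error, which is also why too many trees can eventually fit noise.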

**XGBoost:** A scalable and highly optimized implementation of Gradient Boosting that is very popular in competitions.

### 5. Stacking (Stacked Generalization)
Instead of using trivial functions (like hard voting) to aggregate the predictions of all predictors in an ensemble, why not train a model to perform this aggregation?
* **Base Learners:** The initial models (e.g., SVM, Tree, KNN).
* **Meta Learner (Blender):** The final model that takes the predictions of the base learners as inputs and outputs the final prediction.
* **Hold-out Set:** Stacking typically requires splitting the training set into two. The first part is used to train the base learners. The second part is used to generate predictions from the base learners, which then become the "training data" for the meta-learner.
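The hold-out procedure can be sketched by hand as follows. The choice of base learners, the 50/50 split, and the Logistic Regression blender are assumptions made for illustration, and `base_predictions` is a helper name introduced here.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=1000, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Part 1 trains the base learners; part 2 (the hold-out set) trains the blender.
X_base, X_hold, y_base, y_hold = train_test_split(
    X_train, y_train, test_size=0.5, random_state=42)

base_learners = [
    DecisionTreeClassifier(max_depth=4, random_state=42),
    KNeighborsClassifier(),
    RandomForestClassifier(n_estimators=100, random_state=42),
]
for learner in base_learners:
    learner.fit(X_base, y_base)

def base_predictions(X_new):
    """One column per base learner: its estimated P(class 1)."""
    return np.column_stack(
        [learner.predict_proba(X_new)[:, 1] for learner in base_learners])

# The blender only sees predictions for instances the base learners never saw.
blender = LogisticRegression()
blender.fit(base_predictions(X_hold), y_hold)

stacked_accuracy = accuracy_score(
    y_test, blender.predict(base_predictions(X_test)))
print("stacked accuracy:", stacked_accuracy)
```

Scikit-Learn also ships a `StackingClassifier`, which automates a similar idea using cross-validated predictions rather than a single hold-out set.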

## 3. Code Reproduction

### 3.1 Voting Classifiers
We will compare individual classifiers (Logistic Regression, Random Forest, SVC) against a Voting Classifier on the Moons dataset.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

log_clf = LogisticRegression(solver="lbfgs", random_state=42)
rnd_clf = RandomForestClassifier(n_estimators=100, random_state=42)
svm_clf = SVC(gamma="scale", probability=True, random_state=42) # probability=True needed for soft voting

voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='soft' # Try 'hard' for majority rule, 'soft' for probability averaging
)

from sklearn.metrics import accuracy_score

for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

### 3.2 Bagging and Out-of-Bag Evaluation
We use `BaggingClassifier` with 500 Decision Trees. We enable `oob_score=True` to evaluate performance without a separate validation set.

In [None]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    max_samples=100, bootstrap=True, n_jobs=-1, oob_score=True, random_state=42)

bag_clf.fit(X_train, y_train)

print("OOB Score (Estimated Validation Accuracy):", bag_clf.oob_score_)

y_pred = bag_clf.predict(X_test)
print("Test Set Accuracy:", accuracy_score(y_test, y_pred))

### 3.3 Random Forests and Feature Importance
We train a Random Forest on the Iris dataset to see which features matter most.

In [None]:
from sklearn.datasets import load_iris
iris = load_iris()
rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=42)
rnd_clf.fit(iris["data"], iris["target"])

for name, score in zip(iris["feature_names"], rnd_clf.feature_importances_):
    print(name, score)

### 3.4 AdaBoost
We use a Decision Tree with `max_depth=1` (a Decision Stump) as the weak learner. AdaBoost will sequentially add stumps that focus on the errors of the previous ones.

In [None]:
from sklearn.ensemble import AdaBoostClassifier

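ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=200,
    # The `algorithm` argument is omitted: "SAMME.R" is deprecated and no
    # longer accepted by recent Scikit-Learn releases, so the default is used.
    learning_rate=0.5, random_state=42)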
ada_clf.fit(X_train, y_train)

print("AdaBoost Accuracy:", accuracy_score(y_test, ada_clf.predict(X_test)))

### 3.5 Gradient Boosting with Early Stopping
Here we implement a simple form of Early Stopping by hand. We train a Gradient Boosting Regressor with 120 trees, measure the validation error at each stage (after 1 tree, after 2 trees, etc.) with `staged_predict()`, and then retrain a model with the number of trees that gave the lowest validation error.

In [None]:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# Generate noisy quadratic data
X = np.random.rand(100, 1) - 0.5
y = 3*X[:, 0]**2 + 0.05 * np.random.randn(100)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=49)

gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=120, random_state=42)
gbrt.fit(X_train, y_train)

# Find the optimal number of trees
errors = [mean_squared_error(y_val, y_pred)
          for y_pred in gbrt.staged_predict(X_val)]
bst_n_estimators = np.argmin(errors) + 1

gbrt_best = GradientBoostingRegressor(max_depth=2, n_estimators=bst_n_estimators, random_state=42)
gbrt_best.fit(X_train, y_train)

print(f"Best number of trees: {bst_n_estimators}")
print(f"Minimum MSE: {np.min(errors)}")

# Plotting the error curve
plt.figure(figsize=(10, 4))
plt.subplot(121)
plt.plot(errors, "b.-")
plt.plot([bst_n_estimators-1, bst_n_estimators-1], [0, np.min(errors)], "k--")
plt.plot(bst_n_estimators-1, np.min(errors), "ko")
plt.axis([0, 120, 0, 0.01])
plt.xlabel("Number of trees")
plt.ylabel("Validation Error")
plt.title("Validation Error vs Number of Trees")

plt.subplot(122)
def plot_predictions(regressors, X, y, axes, label=None, style="r-", data_style="b.", data_label=None):
    x1 = np.linspace(axes[0], axes[1], 500)
    y_pred = sum(regressor.predict(x1.reshape(-1, 1)) for regressor in regressors)
    plt.plot(X[:, 0], y, data_style, label=data_label)
    plt.plot(x1, y_pred, style, linewidth=2, label=label)
    plt.axis(axes)

plot_predictions([gbrt_best], X, y, axes=[-0.5, 0.5, -0.1, 0.8], label="Ensemble prediction")
plt.title(f"Best Model ({bst_n_estimators} trees)")
plt.show()

## 4. Step-by-Step Explanation

### 1. Voting Classifier Analysis
**Input:** We have three diverse classifiers: Logistic Regression (linear), Random Forest (ensemble of trees), and SVM (nonlinear kernel).
**Process:**
* `VotingClassifier` trains all three models on the training data.
* When predicting, if `voting='soft'`, it asks each model for the probability of class 0 and class 1.
* Example: LR says 60% Class 1, RF says 80% Class 1, SVM says 40% Class 1. Average = (60+80+40)/3 = 60%. The ensemble predicts Class 1.
**Output:** In a typical run, the Voting Classifier achieves slightly higher accuracy (e.g., 91.2%) than the individual models (e.g., 86%, 89%, 89%); exact figures vary with the library version and random seed. This is consistent with the "Wisdom of the Crowd" effect.

### 2. Bagging with OOB
**Input:** 500 Decision Trees. `max_samples=100` means each tree is trained on a small random subset of 100 instances.
**Process:** 
* The `BaggingClassifier` builds 500 trees in parallel (`n_jobs=-1`).
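* Because `bootstrap=True`, each tree draws its 100 instances with replacement, so every tree leaves a different part of the training set unseen. With only 100 draws from the 375 training instances, that unseen (OOB) part is roughly 77% of the training set per tree; the often-quoted figure of about 37% applies only when each tree draws as many instances as the training set contains.
* `oob_score_` is computed by predicting each training instance using only the trees that never saw it, then measuring the accuracy of those aggregated predictions.
**Output:** The OOB score (e.g., 0.904) is close to the actual Test Set accuracy (e.g., 0.912). This suggests that the OOB score is a reasonable estimate of generalization performance, without requiring a separate validation set.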
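The per-instance OOB predictions behind the score can be inspected through the `oob_decision_function_` attribute. The sketch below rebuilds a smaller version of the bagging model (100 trees instead of 500, an assumption made to keep it quick) and recomputes the accuracy from those predictions.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=100,
    max_samples=100, bootstrap=True, oob_score=True, random_state=42)
bag_clf.fit(X_train, y_train)

# One row per training instance: class probabilities averaged over the
# trees for which that instance was out-of-bag.
oob_proba = bag_clf.oob_decision_function_
oob_pred = bag_clf.classes_[oob_proba.argmax(axis=1)]
oob_accuracy = np.mean(oob_pred == y_train)

print(oob_proba[:3])
print("accuracy from OOB predictions:", oob_accuracy)
print("oob_score_:", bag_clf.oob_score_)
```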

### 3. Feature Importance
**Process:** For every split in every tree, the Random Forest records how much that split reduces the Gini impurity, weighted by the number of training samples reaching the node. The more a feature such as "Petal Length" reduces impurity across the forest, the higher its score; the scores are normalized to sum to 1.
**Output:** Petal Length and Petal Width are typically the most important features (score > 0.4 each), while Sepal Length and Sepal Width are much less relevant. These scores offer a quick way to guide **Feature Selection**.

### 4. Gradient Boosting & Early Stopping
**Concept:** With Gradient Boosting, adding too many trees can cause overfitting: the ensemble starts fitting the noise in the training data.
**Process:**
* We train 120 trees.
* `staged_predict` allows us to measure the validation error after 1 tree, 2 trees, ..., 120 trees.
* We plot this error. We see it goes down initially (learning), reaches a minimum, and then starts going up (overfitting).
**Output:** We select the number of trees corresponding to the minimum error (e.g., 55 trees) to get the optimal model that generalizes best.
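The approach above trains all 120 trees before choosing how many to keep. A variant that actually stops training early can be sketched with `warm_start=True`, which makes `fit()` keep the existing trees and add new ones. The `patience` value of 5 rounds is an assumption chosen for illustration, as is the regenerated dataset.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.random((100, 1)) - 0.5
y = 3 * X[:, 0] ** 2 + 0.05 * rng.standard_normal(100)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=49)

# warm_start=True keeps the trees already trained when fit() is called again,
# so each loop iteration only adds one new tree.
gbrt = GradientBoostingRegressor(max_depth=2, warm_start=True, random_state=42)

patience = 5                       # rounds without improvement to tolerate
min_val_error = float("inf")
best_n_estimators = 0
rounds_without_improvement = 0

for n_estimators in range(1, 121):
    gbrt.n_estimators = n_estimators
    gbrt.fit(X_train, y_train)
    val_error = mean_squared_error(y_val, gbrt.predict(X_val))
    if val_error < min_val_error:
        min_val_error = val_error
        best_n_estimators = n_estimators
        rounds_without_improvement = 0
    else:
        rounds_without_improvement += 1
        if rounds_without_improvement == patience:
            break                  # stop training early

print("trees trained:", gbrt.n_estimators)
print("best number of trees:", best_n_estimators)
```

Note that the fitted model still contains the extra trees added during the patience window. `GradientBoostingRegressor` also offers built-in early stopping through its `n_iter_no_change` and `validation_fraction` parameters.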

## 5. Chapter Summary

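* **Ensemble Methods** combine the predictions of multiple predictors, which often gives better results than the best individual predictor; even weak learners can form a strong learner when combined.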
* **Bagging:** Reduces variance (overfitting) by training parallel models on random subsets. Example: Random Forests.
* **Random Forests:** Powerful, versatile, and provide Feature Importance out-of-the-box. Uses two levels of randomness (data sampling + feature sampling).
* **Boosting:** Reduces bias (underfitting) by training sequential models that correct previous mistakes. Examples: AdaBoost, Gradient Boosting.
* **Stacking:** Uses a meta-model to learn how to best combine the predictions of base models, often outperforming simple voting.
* **Trade-off:** Ensembles are usually more accurate but slower to train and deploy than individual models.