<a href="https://colab.research.google.com/github/JordanDCunha/Hands-On-Machine-Learning-with-Scikit-Learn-and-PyTorch/blob/main/Chapter6.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Ensemble Learning and Random Forests


Suppose you pose a complex question to thousands of random people,
then aggregate their answers.


In many cases, this aggregated answer
is better than an expert’s answer.


This phenomenon is known as **the wisdom of the crowd**.


A similar idea applies in machine learning.


If you aggregate the predictions of multiple predictors
(such as classifiers or regressors),
you will often obtain better predictions
than from the best individual predictor.


A group of predictors is called an **ensemble**.


This approach is known as **ensemble learning**,
and algorithms that implement it
are called **ensemble methods**.


### An Example: Decision Tree Ensembles


One simple example of an ensemble method
is to train several decision tree classifiers.


Each tree is trained on a **different random subset**
of the training set.


To make a prediction,
each tree casts a vote for a class.


The class that receives the **most votes**
becomes the ensemble’s final prediction.


An ensemble of decision trees built this way
is called a **random forest**.


Despite their conceptual simplicity,
random forests are among
the most powerful machine learning algorithms available today.


### When to Use Ensemble Methods


As discussed in Chapter 2,
ensemble methods are often used
near the end of a machine learning project.


At that stage, you may already have
several reasonably good predictors.


Combining them into an ensemble
can produce an even better model.


In fact, many winning solutions
in machine learning competitions
rely heavily on ensemble methods.


A famous example is the Netflix Prize competition,
where ensembles played a central role.


### Pros and Cons of Ensemble Learning


Ensemble methods are powerful,
but they do come with some downsides.


They generally require **more computational resources**
than using a single model.


This includes higher costs
for both training and inference.


Ensembles can also be more complex
to deploy and manage in production.


Additionally, their predictions
are usually harder to interpret.


# Voting Classifiers


Suppose you have trained a few classifiers, each one achieving about 80% accuracy.


For example, you might have:
- A Logistic Regression classifier
- A Support Vector Machine (SVM)
- A Random Forest
- A k-Nearest Neighbors classifier


A simple way to build a stronger classifier is to **aggregate their predictions**.


The class that receives the **most votes** becomes the final prediction.


This approach is called **hard voting**, and the resulting model is known as a
**hard voting classifier**.


Surprisingly, a voting classifier often achieves **higher accuracy**
than the best individual classifier in the ensemble.


Even if each individual classifier is only a **weak learner**
(performing just slightly better than random guessing),
the ensemble can become a **strong learner**.


This works best when:
- There are many classifiers
- The classifiers are diverse
- They make uncorrelated errors


This idea is closely related to the **law of large numbers**.


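Suppose you toss a slightly biased coin that comes up heads 51% of the time.
The more times you toss it, the closer the observed ratio of heads gets to 51%,
and the more likely it is that heads end up in the majority.
In the same way, aggregating many weak but diverse classifiers
makes it increasingly likely that the majority vote lands on the correct class.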
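
To put rough numbers on this argument, the sketch below computes the probability that a strict majority of voters is correct. It assumes every voter is right with the same probability and that all votes are fully independent, which real classifiers trained on the same data never quite are. The function name `majority_correct_probability` is made up for this illustration.

```python
from math import exp, lgamma, log

def majority_correct_probability(n_voters, p_correct):
    """P(strict majority of n independent voters is correct).

    Assumes independent voters that are each correct with probability p_correct.
    The binomial terms are computed in log space to avoid overflow for large n.
    """
    def log_pmf(k):
        log_comb = lgamma(n_voters + 1) - lgamma(k + 1) - lgamma(n_voters - k + 1)
        return log_comb + k * log(p_correct) + (n_voters - k) * log(1 - p_correct)

    first_majority = n_voters // 2 + 1  # smallest count that forms a strict majority
    return sum(exp(log_pmf(k)) for k in range(first_majority, n_voters + 1))

# Odd voter counts avoid ties
for n in (1, 101, 1001, 10001):
    print(n, round(majority_correct_probability(n, 0.51), 3))
```

The probability climbs steadily as voters are added, but only because of the independence assumption. Correlated errors weaken the effect considerably.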


### Diversity Matters


If all classifiers make similar errors,
majority voting will not help much.


Ways to increase diversity include:
- Using very different algorithms
- Training models with different hyperparameters
- Training models on different subsets of the data


### VotingClassifier in Scikit-Learn


Scikit-Learn provides the `VotingClassifier`,
which makes implementing voting ensembles straightforward.


In [93]:
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC


First, generate the moons dataset and split it into training and test sets.


In [94]:
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42
)


Next, create a voting classifier using three diverse models:
- Logistic Regression
- Random Forest
- Support Vector Classifier


In [95]:
voting_clf = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(random_state=42)),
        ("rf", RandomForestClassifier(random_state=42)),
        ("svc", SVC(random_state=42)),
    ]
)


Fit the voting classifier to the training data.


In [96]:
voting_clf.fit(X_train, y_train)


When a `VotingClassifier` is trained, Scikit-Learn:
- Clones each estimator
- Fits the cloned models


The fitted models are available through the `named_estimators_` attribute.


Let’s evaluate each individual classifier on the test set.


In [97]:
for name, clf in voting_clf.named_estimators_.items():
    print(name, "=", clf.score(X_test, y_test))


lr = 0.864
rf = 0.896
svc = 0.896


By default, the voting classifier uses **hard voting**.


For a single test instance, the predicted class is the one
chosen by the majority of classifiers.


In [98]:
voting_clf.predict(X_test[:1])


array([1])

In [99]:
[clf.predict(X_test[:1]) for clf in voting_clf.estimators_]


[array([1]), array([1]), array([0])]

Now evaluate the overall accuracy of the voting classifier.


In [100]:
voting_clf.score(X_test, y_test)


0.912

The voting classifier outperforms all individual classifiers.


### Soft Voting


If all classifiers can estimate class probabilities,
you can use **soft voting**.


Soft voting predicts the class with the highest
average predicted probability.


Soft voting often performs better than hard voting
because it gives more weight to confident predictions.
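
The difference between the two voting schemes is easiest to see on a single instance. The probabilities below are invented for illustration (they do not come from the classifiers trained above): two classifiers lean weakly toward class 0, while a third is very confident in class 1.

```python
from collections import Counter

# Hypothetical predicted probabilities [P(class 0), P(class 1)] for ONE instance
probas = {
    "clf_a": [0.55, 0.45],
    "clf_b": [0.55, 0.45],
    "clf_c": [0.05, 0.95],
}

def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=values.__getitem__)

# Hard voting: each classifier votes for its most likely class
hard_votes = [argmax(p) for p in probas.values()]
hard_prediction = Counter(hard_votes).most_common(1)[0][0]

# Soft voting: average the probabilities per class, then take the argmax
n_classifiers = len(probas)
mean_proba = [sum(p[k] for p in probas.values()) / n_classifiers for k in range(2)]
soft_prediction = argmax(mean_proba)

print("hard voting predicts class", hard_prediction)
print("soft voting predicts class", soft_prediction)
```

Hard voting follows the two lukewarm votes, while soft voting lets the one confident classifier tip the average.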


The `SVC` class does not support probability estimates by default,
so we must enable them explicitly.


In [102]:
voting_clf.voting = "soft"
voting_clf.named_estimators["svc"].probability = True


Refit the voting classifier with soft voting enabled.


In [103]:
voting_clf.fit(X_train, y_train)


Evaluate the soft voting classifier.


In [104]:
voting_clf.score(X_test, y_test)


0.92

Soft voting achieves even higher accuracy here: 92.0%, versus 91.2% with hard voting.


**Tip:**
Soft voting works best when predicted probabilities are well-calibrated.
If needed, use `sklearn.calibration.CalibratedClassifierCV`.


# Bagging and Pasting


Another way to build a diverse ensemble is to use the **same learning algorithm**
but train each predictor on a **different random subset** of the training data.


There are two main approaches:
- **Bagging** (Bootstrap Aggregating)
- **Pasting**


When sampling is performed **with replacement**, the method is called **bagging**.


When sampling is performed **without replacement**, the method is called **pasting**.


Both approaches allow training instances to appear in multiple predictors,
but **only bagging** allows the same instance to appear multiple times
within a single predictor’s training set.
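
As a small illustration of the two sampling schemes, the sketch below draws one training subset each way using only the standard library. The seed, the 10-instance "training set" and the subset size are arbitrary choices for the example.

```python
import random

rng = random.Random(42)              # arbitrary seed, for reproducibility only
training_indices = list(range(10))   # stand-in for the indices of 10 training instances
subset_size = 6

# Bagging: sample WITH replacement, so an index may repeat within one subset
bagging_subset = rng.choices(training_indices, k=subset_size)

# Pasting: sample WITHOUT replacement, so indices are unique within one subset
pasting_subset = rng.sample(training_indices, k=subset_size)

print("bagging:", sorted(bagging_subset))
print("pasting:", sorted(pasting_subset))
```

Repeating either draw for each predictor means the same instance can still show up in several predictors' subsets.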


Once all predictors are trained, the ensemble makes predictions by
**aggregating** the individual predictions.


- For **classification**, aggregation is usually the **mode** (majority vote)
- For **regression**, aggregation is usually the **average**


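Each individual predictor typically has **higher bias** than a model trained
on the full dataset, but aggregation reduces both **bias and variance**.


The net result is usually an ensemble with similar bias but **lower variance**
than a single predictor trained on the full training set.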


This works especially well for **high-variance, low-bias models**
such as decision trees.


### Why Bagging Reduces Variance


Averaging predictions from independent models reduces variance.


In practice, predictors are not fully independent,
but bagging still reduces correlation enough to significantly
lower variance compared to a single model.
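
One standard way to make this precise (the notation here is an assumption, not taken from the text above): suppose each of the $n$ predictors has prediction variance $\sigma^2$, and any two predictors have correlation $\rho$. The variance of their average is then

$$
\operatorname{Var}\!\left(\frac{1}{n}\sum_{j=1}^{n}\hat{y}_j\right)
= \rho\,\sigma^2 + \frac{1-\rho}{n}\,\sigma^2
$$

With fully independent predictors ($\rho = 0$) the variance shrinks as $\sigma^2 / n$. With correlated predictors, adding more of them only removes the second term, so lowering $\rho$ through random sampling is what keeps the first term small.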


### Bagging vs. Pasting


- **Bagging** introduces more diversity but slightly higher bias
- **Pasting** avoids redundant samples and is slightly more efficient


Bagging is generally preferred, especially for noisy datasets
or models prone to overfitting.


Both methods scale very well because predictors can be trained
and evaluated **in parallel**.


### Bagging and Pasting in Scikit-Learn


Scikit-Learn provides the `BaggingClassifier`
(and `BaggingRegressor` for regression).



The following code trains an ensemble of 500 decision trees.
Each tree is trained on 100 instances sampled **with replacement**
(i.e., bagging).


In [105]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=500,
    max_samples=100,
    bootstrap=True,
    n_jobs=-1,
    random_state=42
)


Fit the bagging classifier to the training data.


In [106]:
bag_clf.fit(X_train, y_train)


To use **pasting** instead of bagging, set `bootstrap=False`.


A `BaggingClassifier` automatically performs **soft voting**
if the base estimator supports probability estimates,
which decision trees do.


### Out-of-Bag (OOB) Evaluation


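With bagging, some training instances may be sampled several times
for a given predictor, while others may not be sampled at all.


By default, each predictor's bootstrap sample is as large as the training set,
so on average only about **63%** of the training instances are sampled for each predictor.


The remaining **37%** are called **out-of-bag (OOB)** instances.
They are not the same 37% for every predictor.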


OOB instances can be used as a **validation set**,
eliminating the need for a separate one.
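
The figure of roughly 37% can be checked with a line of arithmetic. Assuming each predictor draws `m` instances with replacement from a training set of size `m`, a given instance is missed by every draw with probability `(1 - 1/m) ** m`, which approaches `1/e` as `m` grows. The helper name below is hypothetical.

```python
from math import exp

def expected_oob_fraction(m):
    """Probability that a given instance is never picked in m draws with replacement."""
    return (1 - 1 / m) ** m

# 375 is the size of the moons training set used in this chapter (75% of 500)
for m in (10, 100, 375, 10_000):
    print(m, round(expected_oob_fraction(m), 4))

print("limit 1/e:", round(exp(-1), 4))
```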


Enable OOB evaluation by setting `oob_score=True`.


In [107]:
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=500,
    oob_score=True,
    n_jobs=-1,
    random_state=42
)


Train the model with OOB evaluation enabled.


In [108]:
bag_clf.fit(X_train, y_train)


The OOB accuracy estimate is stored in `oob_score_`.


In [109]:
bag_clf.oob_score_


0.896

Let’s compare this estimate with the actual test-set accuracy.


In [110]:
from sklearn.metrics import accuracy_score

y_pred = bag_clf.predict(X_test)
accuracy_score(y_test, y_pred)


0.92

The OOB score is usually slightly pessimistic,
but it provides a very good estimate of test performance.


The OOB decision function is also available and returns
class probabilities for each training instance.


In [111]:
bag_clf.oob_decision_function_[:3]


array([[0.32352941, 0.67647059],
       [0.3375    , 0.6625    ],
       [1.        , 0.        ]])

### Random Patches and Random Subspaces


`BaggingClassifier` also supports **feature sampling**.


Feature sampling is controlled by:
- `max_features`
- `bootstrap_features`


Sampling both instances and features is called the
**random patches** method.


Sampling only features (keeping all instances)
is called the **random subspaces** method.


Feature sampling increases predictor diversity,
trading a bit more bias for lower variance.
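
A minimal sketch of both configurations follows. The synthetic 10-feature dataset, the sampling fractions and the variable names are assumptions picked for illustration. The moons dataset has only two features, so it would not show much here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 10 features (hypothetical, for illustration only)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)

# Random subspaces: every tree sees all instances but only a sample of the features
subspaces_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=50,
    bootstrap=False, max_samples=1.0,            # keep all training instances
    bootstrap_features=True, max_features=0.5,   # sample the features
    random_state=42,
)

# Random patches: sample both instances and features
patches_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=50,
    bootstrap=True, max_samples=0.7,
    bootstrap_features=True, max_features=0.5,
    random_state=42,
)

subspaces_clf.fit(X_demo, y_demo)
patches_clf.fit(X_demo, y_demo)

# Feature indices drawn for the first tree of each ensemble
print(subspaces_clf.estimators_features_[0])
print(patches_clf.estimators_features_[0])
```

The fitted `estimators_features_` attribute lists which feature indices each tree was given, which is a quick way to confirm that feature sampling is active.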

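### Feature Importance


Random forests also make it easy to measure
the relative importance of each feature.


The importances are normalized so that their sum equals 1.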


You can access feature importances using the
`feature_importances_` attribute.


The following example trains a random forest on the iris dataset
and displays feature importances.


In [112]:
from sklearn.datasets import load_iris


In [113]:
iris = load_iris(as_frame=True)


In [114]:
rnd_clf = RandomForestClassifier(
    n_estimators=500,
    random_state=42
)


Train the random forest.


In [115]:
rnd_clf.fit(iris.data, iris.target)


Display each feature’s importance.


In [116]:
for score, name in zip(rnd_clf.feature_importances_, iris.data.columns):
    print(round(score, 2), name)


0.11 sepal length (cm)
0.02 sepal width (cm)
0.44 petal length (cm)
0.42 petal width (cm)


The most important features are petal length and petal width,
while sepal features are much less important.


Similarly, when trained on image datasets such as MNIST,
random forests can reveal which pixels are most important
for classification.


Random forests are especially useful for:
- Feature selection
- Quick model benchmarking
- Strong baseline performance


# Boosting


Boosting (originally called hypothesis boosting) refers to ensemble methods
that combine several weak learners into a strong learner.


Most boosting methods train predictors **sequentially**,
each one trying to correct the errors of its predecessor.


The two most popular boosting algorithms are:
- **AdaBoost** (Adaptive Boosting)
- **Gradient Boosting**


Let’s start with AdaBoost.


## AdaBoost


AdaBoost focuses more and more on the **hard-to-classify** training instances.


Each new predictor pays more attention to instances that previous predictors
misclassified.


The algorithm works as follows:
1. Train a base classifier.
2. Increase the weights of misclassified instances.
3. Train a new classifier using the updated weights.
4. Repeat.


This sequential learning process has some similarities with gradient descent:
instead of adjusting one predictor's parameters to reduce a cost function,
AdaBoost adds new predictors to the ensemble, gradually improving it.


### Important Limitation


Training a boosting ensemble cannot be parallelized across predictors,
because each predictor can only be trained after the previous one
has been trained and evaluated.


As a result, boosting does not scale as well as bagging or pasting.


### AdaBoost Mathematics (Conceptual)


- Each training instance starts with equal weight.
- Misclassified instances get higher weights.
- More accurate predictors get higher voting weights.


Predictions are made by a **weighted vote** of all predictors.
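
The three bullet points above can be written out as equations. The notation is an assumption made for this sketch: $w^{(i)}$ is the weight of instance $i$ (initially $1/m$ for $m$ instances), $\hat{y}_j^{(i)}$ is the $j$-th predictor's prediction for that instance, and $\eta$ is the learning rate. This is the two-class form.

The weighted error rate of the $j$-th predictor is

$$
r_j = \frac{\displaystyle\sum_{i:\ \hat{y}_j^{(i)} \neq y^{(i)}} w^{(i)}}{\displaystyle\sum_{i=1}^{m} w^{(i)}}
$$

Its voting weight grows as its error rate falls:

$$
\alpha_j = \eta \,\log\frac{1 - r_j}{r_j}
$$

Misclassified instances are then boosted, and all weights are renormalized to sum to 1:

$$
w^{(i)} \leftarrow
\begin{cases}
w^{(i)} & \text{if } \hat{y}_j^{(i)} = y^{(i)} \\
w^{(i)} \exp(\alpha_j) & \text{if } \hat{y}_j^{(i)} \neq y^{(i)}
\end{cases}
$$

The final prediction is the class with the largest total voting weight:

$$
\hat{y}(\mathbf{x}) = \underset{k}{\operatorname{argmax}} \sum_{j:\ \hat{y}_j(\mathbf{x}) = k} \alpha_j
$$

A predictor that is no better than random guessing ($r_j = 0.5$) gets $\alpha_j = 0$, so it has no say in the vote.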


Scikit-Learn uses a multiclass version of AdaBoost called **SAMME**.


When there are only two classes, SAMME is equivalent to standard AdaBoost.


In [117]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier


By default, AdaBoost uses **decision stumps**
(decision trees with `max_depth=1`).


In [118]:
ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),
    n_estimators=30,
    learning_rate=0.5,
    random_state=42,
    algorithm="SAMME"  # recent Scikit-Learn releases deprecate this argument; drop it if it triggers a warning
)


Train the AdaBoost classifier.


In [119]:
ada_clf.fit(X_train, y_train)




If the model overfits, try:
- Fewer estimators
- Stronger regularization of the base estimator
- A smaller learning rate


## Gradient Boosting


Gradient boosting also builds predictors sequentially,
but instead of reweighting instances,
it fits each new predictor to the **residual errors**
of the previous predictor.


We’ll start with a regression example using decision trees.


In [120]:
import numpy as np
from sklearn.tree import DecisionTreeRegressor


Generate a noisy quadratic dataset.


In [121]:
m = 100
rng = np.random.default_rng(seed=42)
X = rng.random((m, 1)) - 0.5
noise = 0.05 * rng.standard_normal(m)
y = 3 * X[:, 0] ** 2 + noise


Train the first regression tree.


In [122]:
tree_reg1 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg1.fit(X, y)


Train a second tree on the residual errors of the first.


In [123]:
y2 = y - tree_reg1.predict(X)
tree_reg2 = DecisionTreeRegressor(max_depth=2, random_state=43)
tree_reg2.fit(X, y2)


Train a third tree on the residuals of the second.


In [124]:
y3 = y2 - tree_reg2.predict(X)
tree_reg3 = DecisionTreeRegressor(max_depth=2, random_state=44)
tree_reg3.fit(X, y3)


Predictions are made by summing the predictions of all trees.


In [125]:
X_new = np.array([[-0.4], [0.0], [0.5]])
sum(tree.predict(X_new) for tree in (tree_reg1, tree_reg2, tree_reg3))


array([0.57356534, 0.0405142 , 0.66914249])

### Gradient Boosting with Scikit-Learn


In [126]:
from sklearn.ensemble import GradientBoostingRegressor


The following code trains a similar three-tree ensemble
with a single class, `GradientBoostingRegressor`.


In [127]:
gbrt = GradientBoostingRegressor(
    max_depth=2,
    n_estimators=3,
    learning_rate=1.0,
    random_state=42
)
gbrt.fit(X, y)


The learning rate controls how much each tree contributes.


A smaller learning rate requires more trees
but often improves generalization.
This technique is called **shrinkage**.
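
To see what the learning rate does, here is a hand-rolled version of the residual-fitting loop with shrinkage. It is a simplified sketch, not Scikit-Learn's implementation: it starts from a prediction of zero rather than an initial estimate, and the helper names are hypothetical. The data is regenerated the same way as earlier in this section.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Same noisy quadratic dataset as above
rng = np.random.default_rng(seed=42)
X = rng.random((100, 1)) - 0.5
y = 3 * X[:, 0] ** 2 + 0.05 * rng.standard_normal(100)

def fit_boosted_trees(X, y, n_trees, learning_rate):
    """Fit each tree to the residuals left by the shrunken sum of earlier trees."""
    trees, residuals = [], y.copy()
    for _ in range(n_trees):
        tree = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X, residuals)
        residuals = residuals - learning_rate * tree.predict(X)
        trees.append(tree)
    return trees

def predict_boosted(trees, X, learning_rate):
    """Ensemble prediction: the shrunken sum of all tree predictions."""
    return learning_rate * sum(tree.predict(X) for tree in trees)

training_mse = {}
for n_trees in (3, 30):
    trees = fit_boosted_trees(X, y, n_trees, learning_rate=0.1)
    errors = y - predict_boosted(trees, X, learning_rate=0.1)
    training_mse[n_trees] = float(np.mean(errors ** 2))

print(training_mse)
```

With a learning rate of 0.1, each tree removes only a tenth of what it could explain, so three trees leave most of the signal unfitted and many more are needed to reach a low training error.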


### Early Stopping


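Early stopping automatically stops training
when adding more trees no longer improves performance.


With `n_iter_no_change=10`, `GradientBoostingRegressor` holds out part of the
training data as a validation set (10% by default, set by `validation_fraction`)
and stops adding trees once the last 10 trees have not improved the validation score.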


In [128]:
gbrt_best = GradientBoostingRegressor(
    max_depth=2,
    learning_rate=0.05,
    n_estimators=500,
    n_iter_no_change=10,
    random_state=42
)
gbrt_best.fit(X, y)


The actual number of trees used is often much smaller.


In [129]:
gbrt_best.n_estimators_


53
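
The same idea can be carried out by hand, which makes it clearer what is being measured. The sketch below is an alternative to `n_iter_no_change`, not what Scikit-Learn does internally: it trains a fixed number of trees, then uses `staged_predict()` to score the ensemble on a held-out set after each additional tree. The split and the names `gbrt_full`, `val_errors` and `best_n_estimators` are choices made for this example.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Same noisy quadratic dataset as above
rng = np.random.default_rng(seed=42)
X = rng.random((100, 1)) - 0.5
y = 3 * X[:, 0] ** 2 + 0.05 * rng.standard_normal(100)

# Hold out a validation set to measure generalization
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=42)

gbrt_full = GradientBoostingRegressor(
    max_depth=2, learning_rate=0.05, n_estimators=200, random_state=42
)
gbrt_full.fit(X_tr, y_tr)

# staged_predict yields the ensemble's predictions after 1, 2, ..., 200 trees
val_errors = [
    mean_squared_error(y_val, y_pred)
    for y_pred in gbrt_full.staged_predict(X_val)
]
best_n_estimators = int(np.argmin(val_errors)) + 1

print("best number of trees:", best_n_estimators)
```

Unlike `n_iter_no_change`, this variant pays for training all 200 trees before picking the best count, so it is mainly useful for plotting the validation curve.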

## Histogram-Based Gradient Boosting (HGB)


Histogram-based gradient boosting speeds up training
by binning continuous features into discrete values.


This reduces computational complexity and memory usage,
making it ideal for large datasets.
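
To illustrate the binning step, the sketch below replaces a continuous feature with small integer bin indices using quantile-based edges. This shows the general idea only, not Scikit-Learn's actual binning code. The limit of 255 bins is chosen here so that indices fit in one byte, and the name `max_bins` is used for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=42)
feature = rng.normal(size=10_000)      # one continuous feature (synthetic)

max_bins = 255
# Interior bin edges at evenly spaced quantiles: 254 edges give at most 255 bins
quantiles = np.linspace(0, 1, max_bins + 1)[1:-1]
bin_edges = np.unique(np.quantile(feature, quantiles))

# Replace each value with the index of the bin it falls into
binned = np.searchsorted(bin_edges, feature, side="right").astype(np.uint8)

print(len(np.unique(feature)), "distinct values ->", len(np.unique(binned)), "bins")
```

Once features are binned, a tree only has to consider a few hundred candidate thresholds per feature instead of one per distinct value, which is where the speedup comes from.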


Key differences from standard GBRT:
- Faster training
- Built-in handling of missing values
- Native support for categorical features


In [130]:
# extra code – loads and splits the housing dataset, as presented in Chapter 2

from pathlib import Path
import tarfile
import urllib.request

import pandas as pd
from sklearn.model_selection import train_test_split

def load_housing_data():
    tarball_path = Path("datasets/housing.tgz")
    if not tarball_path.is_file():
        Path("datasets").mkdir(parents=True, exist_ok=True)
        url = "https://github.com/ageron/data/raw/main/housing.tgz"
        urllib.request.urlretrieve(url, tarball_path)
        with tarfile.open(tarball_path) as housing_tarball:
            housing_tarball.extractall(path="datasets")
    return pd.read_csv(Path("datasets/housing/housing.csv"))

housing = load_housing_data()

train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)
housing_labels = train_set["median_house_value"]
housing = train_set.drop("median_house_value", axis=1)

In [131]:
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.preprocessing import OrdinalEncoder


Example pipeline for the California housing dataset.


In [132]:
hgb_reg = make_pipeline(
    make_column_transformer(
        (OrdinalEncoder(), ["ocean_proximity"]),
        remainder="passthrough",
        force_int_remainder_cols=False
    ),
    HistGradientBoostingRegressor(
        categorical_features=[0],
        random_state=42
    )
)


Train the model.


In [133]:
hgb_reg.fit(housing, housing_labels)


Histogram-based gradient boosting is an excellent choice for:
- Large datasets
- Categorical features
- Missing values


Popular optimized gradient boosting libraries include:
- XGBoost
- LightGBM
- CatBoost


# Stacking (Stacked Generalization)


The last ensemble method we will discuss is **stacking**
(short for *stacked generalization*).


Instead of using a simple rule (such as majority voting)
to aggregate predictions, stacking **trains a model**
to perform this aggregation.


The model that learns how to combine predictions
is called a **blender** or **meta-learner**.


Each base predictor makes its own prediction,
and the blender uses these predictions as input features
to produce the final prediction.


### Training a Stacking Ensemble


To train the blender, we must first create a **blending dataset**.


This is done by generating **out-of-sample predictions**
for each base model using cross-validation.


These predictions become the **input features** for the blender,
while the original targets remain unchanged.


Once the blender is trained, the base predictors are
retrained on the **full training set**.


Scikit-Learn handles all of this automatically via
`StackingClassifier` and `StackingRegressor`.
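
For intuition, the procedure can also be written out by hand. The sketch below is a simplified illustration under several assumptions: it uses two base models, keeps only the positive-class probability as the blender's input, and uses a logistic regression blender. It is not a reproduction of what `StackingClassifier` does internally.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

base_models = [
    LogisticRegression(random_state=42),
    RandomForestClassifier(random_state=42),
]

# 1. Out-of-sample predictions: each row is predicted by a model that never saw it
blend_features = np.column_stack([
    cross_val_predict(model, X_train, y_train, cv=5, method="predict_proba")[:, 1]
    for model in base_models
])

# 2. Train the blender on those predictions, keeping the original targets
blender = LogisticRegression().fit(blend_features, y_train)

# 3. Retrain every base model on the full training set
for model in base_models:
    model.fit(X_train, y_train)

# Prediction: base models first, then the blender on their outputs
test_features = np.column_stack(
    [model.predict_proba(X_test)[:, 1] for model in base_models]
)
stack_accuracy = blender.score(test_features, y_test)
print("stacked accuracy:", stack_accuracy)
```

Using cross-validated predictions in step 1 matters: if the blender were trained on predictions the base models made for their own training data, it would learn to trust them more than it should.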


### Stacking Classifier Example (Moons Dataset)


In [134]:
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC


We build a stacking classifier using three diverse base models
and a random forest as the final estimator.


In [135]:
stacking_clf = StackingClassifier(
    estimators=[
        ('lr', LogisticRegression(random_state=42)),
        ('rf', RandomForestClassifier(random_state=42)),
        ('svc', SVC(probability=True, random_state=42))
    ],
    final_estimator=RandomForestClassifier(random_state=43),
    cv=5
)


Fit the stacking classifier on the training data.


In [136]:
stacking_clf.fit(X_train, y_train)


Evaluate the stacking classifier on the test set.


In [137]:
stacking_clf.score(X_test, y_test)


0.928

Here the stacking classifier reaches 92.8% accuracy,
slightly better than soft voting (92.0%), at the cost of extra complexity.


### How Predictions Are Generated


For each base estimator:
- `predict_proba()` is used if available
- otherwise `decision_function()`
- otherwise `predict()`


If no `final_estimator` is provided:
- `StackingClassifier` defaults to `LogisticRegression`
- `StackingRegressor` defaults to `RidgeCV`


### Multilayer Stacking


It is possible to stack **multiple layers of blenders**,
where one blender’s output feeds into another.


This can slightly improve performance,
but significantly increases training time and system complexity.


### When to Use Stacking


Stacking works best when:
- Models are **diverse**
- Dataset is **complex or high-dimensional**
- Maximum predictive performance is required


It is especially popular in **Kaggle competitions**
and high-stakes prediction systems.


### Summary of Ensemble Methods


- Voting: simple, fast, strong baseline
- Bagging: reduces variance, great for trees
- Random forests: bagging + feature randomness
- Boosting: sequential error correction
- Stacking: learned aggregation for maximum accuracy


Ensemble methods are powerful, flexible, and easy to use,
but can overfit if not carefully regularized.


Next up: **unsupervised learning**, starting with
**dimensionality reduction**.
