# Chapter 7: Ensemble Learning and Random Forests

*Ensemble* - A group of predictors  
*Ensemble Learning* - Technique of aggregating the predictions of a group of predictors  
*Ensemble method* - An Ensemble Learning algorithm

## 7.1 Voting Classifiers

Suppose you have a few classifiers, each with 80% accuracy. A simple way to make a better classifier is to aggregate the predictions of each classifier and predict the class that gets the most votes.

*Hard voting* classifier - Majority-vote classifier

> Note: Ensemble methods work best when the predictors are as independent from one another as possible. One way to get diverse classifiers is to train them using very different algorithms. This increases the chance that they make different types of errors, which improves the ensemble's accuracy.
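
To see why voting can help, here is a minimal sketch for binary classification. It assumes the classifiers make fully independent errors, an idealization that rarely holds in practice, so real gains are usually smaller. The helper name `majority_vote_accuracy` is hypothetical and used only for illustration.

```python
from math import comb


def majority_vote_accuracy(n_classifiers, p_correct):
    """Probability that a strict majority of independent classifiers is right."""
    majority = n_classifiers // 2 + 1
    return sum(
        comb(n_classifiers, k)
        * p_correct ** k
        * (1 - p_correct) ** (n_classifiers - k)
        for k in range(majority, n_classifiers + 1)
    )


# Use an odd number of voters so that ties cannot occur
for n in (1, 3, 11):
    print(n, round(majority_vote_accuracy(n, 0.80), 3))
```

Under this independence assumption, three classifiers at 80% accuracy already give a majority vote that is right about 89.6% of the time.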

Let's take a look at the moons dataset.

In [1]:
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

In [2]:
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [3]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

In [14]:
log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC()

voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='hard')
voting_clf.fit(X_train, y_train)

VotingClassifier(estimators=[('lr', LogisticRegression()),
                             ('rf', RandomForestClassifier()), ('svc', SVC())])

Now look at each classifier's accuracy on the test set.

In [4]:
from sklearn.metrics import accuracy_score

In [16]:
for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.864
RandomForestClassifier 0.896
SVC 0.896
VotingClassifier 0.904


*Soft voting* - Predict the class with the highest class **probability**, averaged over all the individual classifiers.  
It gives more weight to highly confident votes => often higher performance than hard voting.

> Note: Just replace `voting='hard'` with `voting='soft'` and ensure all classifiers can estimate class probabilities (i.e. have `predict_proba()`). `SVC` does not by default; set its `probability` hyperparameter to `True`.
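
A soft-voting version of the ensemble might look like the sketch below. It rebuilds the moons split so that it runs on its own; the `random_state` values and the name `soft_voting_clf` are choices made here for reproducibility, not part of the original notes.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

soft_voting_clf = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(random_state=42)),
        ("rf", RandomForestClassifier(random_state=42)),
        # probability=True makes SVC expose predict_proba()
        ("svc", SVC(probability=True, random_state=42)),
    ],
    voting="soft")
soft_voting_clf.fit(X_train, y_train)

soft_accuracy = accuracy_score(y_test, soft_voting_clf.predict(X_test))
print(soft_accuracy)
```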

## 7.2 Bagging and Pasting

Another way to get a diverse set of predictors is to train the same algorithm on different random subsets of the training set.

*Bagging (bootstrap aggregating)* - When sampling is performed **with** replacement  
*Pasting* - When sampling is performed **without** replacement
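
The difference between the two sampling schemes can be sketched with NumPy alone. The toy index array and the sample sizes are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n_instances = 10
indices = np.arange(n_instances)

# Bagging: sample with replacement, so an instance may be drawn several times
bagging_sample = rng.choice(indices, size=n_instances, replace=True)

# Pasting: sample without replacement, so every drawn instance is unique
pasting_sample = rng.choice(indices, size=6, replace=False)

print("bagging:", np.sort(bagging_sample))
print("pasting:", np.sort(pasting_sample))
```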

The aggregation function is typically the *statistical mode* (the most frequent prediction) for classification, or the average for regression.
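
As a small sketch of the two aggregation functions, the arrays below hold made-up predictions (not results from any trained model): one row per predictor, one column per instance.

```python
import numpy as np

# Hypothetical class predictions from 5 predictors for 4 instances
class_predictions = np.array([
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 0],
])
# Classification: the most frequent class per instance (statistical mode)
mode_per_instance = np.array(
    [np.bincount(column).argmax() for column in class_predictions.T])

# Hypothetical regression predictions from 3 predictors for 2 instances
value_predictions = np.array([
    [2.0, 3.5],
    [2.4, 3.1],
    [1.6, 3.3],
])
# Regression: the average prediction per instance
mean_per_instance = value_predictions.mean(axis=0)

print(mode_per_instance)   # [0 1 1 0]
print(mean_per_instance)
```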

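> Note: Each individual predictor has a higher bias than if it were trained on the original training set, but aggregation reduces both bias and variance. The net result is generally an ensemble with a similar bias but a lower variance than a single predictor trained on the original training set.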

Predictors for bagging and pasting can all be trained in parallel, so these methods scale very well, which is one reason they are so popular.

### 7.2.1 Bagging and Pasting in Scikit-Learn

Scikit-Learn has `BaggingClassifier` and `BaggingRegressor`.

> Note: `BaggingClassifier` automatically performs soft voting instead of hard voting if the base classifier can estimate class probabilities.

In [5]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

In [5]:
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    max_samples=100, bootstrap=True, n_jobs=-1)
bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)
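
For pasting instead of bagging, the only change is `bootstrap=False`. The sketch below is self-contained; the smaller `n_estimators`, the `random_state` values, and the name `paste_clf` are assumptions made here to keep it quick and reproducible.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# bootstrap=False: each tree gets 100 distinct instances (no repeats)
paste_clf = BaggingClassifier(
    DecisionTreeClassifier(random_state=42), n_estimators=100,
    max_samples=100, bootstrap=False, random_state=42)
paste_clf.fit(X_train, y_train)

paste_accuracy = accuracy_score(y_test, paste_clf.predict(X_test))
print(paste_accuracy)
```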

### 7.2.2 Out-of-Bag Evaluation

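With bootstrapping (`bootstrap=True`, sampling with replacement) and the default `max_samples=1.0`, each predictor sees only about 63% of the training instances on average. The remaining 37% or so are that predictor's *out-of-bag (oob)* instances; they are not the same 37% for every predictor.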
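
The 63% figure can be checked numerically. When drawing m instances with replacement from a set of size m, the chance that a given instance is picked at least once is 1 - (1 - 1/m)^m, which approaches 1 - 1/e (about 0.632) as m grows. A minimal sketch, using the 375 training instances produced by the moons split above:

```python
import numpy as np

m = 375  # training-set size after the default 75/25 split of 500 instances

# Probability that a given instance appears at least once in m draws
analytic_fraction = 1 - (1 - 1 / m) ** m

# Empirical check with one bootstrap sample
rng = np.random.default_rng(42)
bootstrap_indices = rng.integers(0, m, size=m)   # m draws with replacement
sampled_fraction = len(np.unique(bootstrap_indices)) / m

print(analytic_fraction, 1 - np.exp(-1), sampled_fraction)
```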

Set `oob_score=True` when creating a `BaggingClassifier` to request an automatic oob evaluation after training.

In [10]:
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    bootstrap=True, n_jobs=-1, oob_score=True)
bag_clf.fit(X_train, y_train)
bag_clf.oob_score_

0.904

In [6]:
from sklearn.metrics import accuracy_score

In [12]:
y_pred = bag_clf.predict(X_test)
accuracy_score(y_test, y_pred)

0.904

In [15]:
bag_clf.oob_decision_function_[:10] # Shorten output by taking 10 rows

array([[0.4137931 , 0.5862069 ],
       [0.37755102, 0.62244898],
       [1.        , 0.        ],
       [0.        , 1.        ],
       [0.00512821, 0.99487179],
       [0.1185567 , 0.8814433 ],
       [0.31188119, 0.68811881],
       [0.00540541, 0.99459459],
       [0.98324022, 0.01675978],
       [0.98543689, 0.01456311]])

## 7.3 Random Patches and Random Subspaces

`BaggingClassifier` can also sample the features (controlled by `max_features` and `bootstrap_features`).

*Random Patches* method - Sampling both training instances and features.  
*Random Subspaces* method - Keeping all training instances (`bootstrap=False`, `max_samples=1.0`) and sampling the features (`bootstrap_features=True` and/or `max_features<1.0`).
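
Both methods can be sketched with `BaggingClassifier`. The moons dataset has only two features, so this example assumes a synthetic 20-feature dataset from `make_classification`; the sampling ratios are arbitrary illustrative values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic dataset with 20 features (an assumption for illustration)
X, y = make_classification(n_samples=500, n_features=20, random_state=42)

# Random Patches: sample both the training instances and the features
patches_clf = BaggingClassifier(
    DecisionTreeClassifier(random_state=42), n_estimators=50,
    max_samples=0.7, bootstrap=True,
    max_features=0.5, bootstrap_features=True, random_state=42)
patches_clf.fit(X, y)

# Random Subspaces: keep all training instances, sample only the features
subspaces_clf = BaggingClassifier(
    DecisionTreeClassifier(random_state=42), n_estimators=50,
    max_samples=1.0, bootstrap=False,
    max_features=0.5, bootstrap_features=False, random_state=42)
subspaces_clf.fit(X, y)

# Each tree is trained on 10 of the 20 features
print(len(subspaces_clf.estimators_features_[0]))
```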

## 7.4 Random Forests

A Random Forest is an ensemble of Decision Trees, generally trained with the bagging method. Instead of building a `BaggingClassifier` and passing it a `DecisionTreeClassifier`, use `RandomForestClassifier` (or `RandomForestRegressor` for regression), which is more convenient and optimized for Decision Trees.

In [7]:
from sklearn.ensemble import RandomForestClassifier

In [17]:
rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)
rnd_clf.fit(X_train, y_train)
y_pred_rf = rnd_clf.predict(X_test)

Generally, most of the Decision Tree and Bagging hyperparameters are available, with some exceptions.

Random Forests introduce extra randomness when growing trees: instead of searching for the very best feature when splitting a node, the algorithm searches for the best feature among a random subset of features.

In [18]:
# Using these hyperparameters
# Roughly equivalent to Random Forest Classifier
# (max_features="sqrt" makes each split consider a random subset of features)

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(max_features="sqrt", max_leaf_nodes=16),
    n_estimators=500, max_samples=1.0, bootstrap=True, n_jobs=-1)

### 7.4.1 Extra-Trees

*Extremely Randomized Trees (Extra-Trees)* ensemble - Using random thresholds for each feature rather than searching for the best possible thresholds.

Use `ExtraTreesClassifier` (or `ExtraTreesRegressor`) to create Extra-Trees; they have the same hyperparameters as their Random Forest counterparts.

> Note: Training Extra-Trees is much faster than training regular Random Forests because finding the best possible threshold for each feature at every node is one of the most time-consuming tasks of growing a tree.

> Note: It is hard to tell in advance whether Extra-Trees will perform better (e.g. higher accuracy) than Random Forests. The best way is to try both and compare them using cross-validation.
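
One way to run that comparison is sketched below with `cross_val_score` on the moons data. The hyperparameter values and the `cv_means` dictionary are illustrative choices, and the resulting scores say nothing general about which method is better.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)

models = (
    RandomForestClassifier(n_estimators=100, max_leaf_nodes=16, random_state=42),
    ExtraTreesClassifier(n_estimators=100, max_leaf_nodes=16, random_state=42),
)

cv_means = {}
for model in models:
    # 5-fold cross-validated accuracy for each ensemble
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    cv_means[model.__class__.__name__] = scores.mean()

print(cv_means)
```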

### 7.4.2 Feature Importance

Scikit-Learn measures a feature's importance by looking at how much the tree nodes that use that feature reduce impurity (**recall: Gini impurity!**) on average (across all trees in the forest). It can be accessed using `feature_importances_`.

In [8]:
from sklearn.datasets import load_iris

In [20]:
iris = load_iris()
rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
rnd_clf.fit(iris["data"], iris["target"])
for name, score in zip(iris["feature_names"], rnd_clf.feature_importances_):
    print(name, score)

sepal length (cm) 0.10149417945091706
sepal width (cm) 0.02588642676804873
petal length (cm) 0.4421831688641625
petal width (cm) 0.4304362249168718


Random Forests are very handy to get a quick understanding of what features actually matter, in particular if you need to perform feature selection.
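
As a sketch of feature selection driven by these importances, Scikit-Learn's `SelectFromModel` can keep only the features whose importance exceeds a threshold. The `threshold="mean"` setting and the variable names are choices made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

iris = load_iris()

# Keep the features whose importance is above the mean importance
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=500, random_state=42),
    threshold="mean")
X_selected = selector.fit_transform(iris["data"], iris["target"])

selected_names = [
    name
    for name, keep in zip(iris["feature_names"], selector.get_support())
    if keep
]
print(selected_names, X_selected.shape)
```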

## 7.5 Boosting

*Boosting (hypothesis boosting)* - Any Ensemble method that can combine several weak learners into a strong learner. Most boosting methods train predictors sequentially, each one trying to correct its predecessor.

The two most popular boosting methods:

- *AdaBoost* (Adaptive Boosting)
- *Gradient Boosting*

### 7.5.1 AdaBoost

AdaBoost makes each new predictor pay a bit more attention to the training instances that its predecessor underfitted, so new predictors focus more and more on the hard cases.

For example:

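1. Train a base classifier and use it to make predictions on the training set.
2. Increase the relative weight of the misclassified training instances.
3. Train a second classifier using the updated weights and again make predictions on the training set.
4. Repeat: update the instance weights, train the next classifier, and so on.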
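
The steps above can be sketched by hand with Decision Stumps. This is a simplified illustration of one common formulation (predictor weight `alpha = learning_rate * log((1 - r) / r)`, where `r` is the weighted error rate), not Scikit-Learn's exact implementation; the number of rounds and all variable names are assumptions.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
m = len(X)
learning_rate = 1.0
sample_weights = np.full(m, 1 / m)   # start with equal instance weights
stumps, alphas = [], []

for _ in range(50):
    # Steps 1 and 3: train a stump using the current instance weights
    stump = DecisionTreeClassifier(max_depth=1, random_state=42)
    stump.fit(X, y, sample_weight=sample_weights)
    wrong = stump.predict(X) != y

    # Weighted error rate and the resulting predictor weight
    r = np.clip(sample_weights[wrong].sum() / sample_weights.sum(),
                1e-10, 1 - 1e-10)
    alpha = learning_rate * np.log((1 - r) / r)

    # Step 2: boost the weights of misclassified instances, then normalize
    sample_weights[wrong] *= np.exp(alpha)
    sample_weights /= sample_weights.sum()

    stumps.append(stump)
    alphas.append(alpha)


def ensemble_predict(X_new):
    """Predict the class that receives the largest total predictor weight."""
    votes = np.zeros((len(X_new), 2))
    rows = np.arange(len(X_new))
    for stump, alpha in zip(stumps, alphas):
        votes[rows, stump.predict(X_new)] += alpha
    return votes.argmax(axis=1)


train_accuracy = (ensemble_predict(X) == y).mean()
print(train_accuracy)
```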

> Note: SVMs are generally not good base predictors for AdaBoost; they are slow and tend to be unstable with it.

This sequential learning technique is similar to Gradient Descent, but instead of tweaking a single predictor's parameters to minimize a cost function, AdaBoost adds predictors to the ensemble, gradually making it better.

> Note: It cannot be parallelized (predictor 2 can only be trained after predictor 1 has been trained and evaluated), and as a result it does not scale as well as bagging or pasting.

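Scikit-Learn uses a multiclass version of AdaBoost called *SAMME (Stagewise Additive Modeling using a Multiclass Exponential loss function)*. When there are only two classes, SAMME is equivalent to AdaBoost.

If the predictors can estimate class probabilities (i.e. have `predict_proba()`), Scikit-Learn can use `SAMME.R` ("R" = Real), a variant that relies on class probabilities rather than predictions and generally performs better. Note that `SAMME.R` was deprecated in Scikit-Learn 1.4, so this option applies to older releases.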

A *Decision Stump* is a Decision Tree with `max_depth=1`.

In [9]:
from sklearn.ensemble import AdaBoostClassifier

In [10]:
ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=200,
    algorithm="SAMME.R", learning_rate=0.5)
ada_clf.fit(X_train, y_train)

AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=1),
                   learning_rate=0.5, n_estimators=200)

> Note: If your AdaBoost ensemble is overfitting the training set, reduce the number of estimators or regularize the base estimator more strongly.

### 7.5.2 Gradient Boosting

Similar to AdaBoost, *Gradient Boosting* sequentially adds predictors to an ensemble, each one correcting its predecessor.

But instead of tweaking the instance weights at every iteration, it fits the new predictor to the *residual errors* made by the previous predictor.

Gradient Boosting with Decision Trees as base predictors, applied to regression, is called *Gradient Tree Boosting* or *Gradient Boosted Regression Trees (GBRT)*.

Now an example using a noisy quadratic training set.

In [1]:
# NOTE: COPIED FROM ACCOMPANYING JUPYTER NOTEBOOK

import numpy as np

np.random.seed(42)
X = np.random.rand(100, 1) - 0.5
y = 3*X[:, 0]**2 + 0.05 * np.random.randn(100)

In [2]:
from sklearn.tree import DecisionTreeRegressor

In [9]:
tree_reg1 = DecisionTreeRegressor(max_depth=2)
tree_reg1.fit(X, y)

# Now train a second DecisionTreeRegressor on residual errors from first
y2 = y - tree_reg1.predict(X)
tree_reg2 = DecisionTreeRegressor(max_depth=2)
tree_reg2.fit(X, y2)

# Now train a third regressor on residual errors from second
y3 = y2 - tree_reg2.predict(X)
tree_reg3 = DecisionTreeRegressor(max_depth=2)
tree_reg3.fit(X, y3)

# NOTE: COPIED "X_new" FROM ACCOMPANYING NOTEBOOK
X_new = np.array([[0.8]])

# Make predictions by adding up predictions of all trees
y_pred = sum(tree.predict(X_new) for tree in (tree_reg1, tree_reg2, tree_reg3))
y_pred

array([0.75026781])

> Note: See the book's graphs to see how the ensemble's prediction is the sum of the individual trees' predictions, each tree after the first being fit to its predecessor's residual errors.

Another way to train GBRT ensembles is to use Scikit-Learn's `GradientBoostingRegressor`. It has hyperparameters to control the growth of the Decision Trees (e.g. `max_depth`, `min_samples_leaf`) and the ensemble training (e.g. `n_estimators`).

In [4]:
from sklearn.ensemble import GradientBoostingRegressor

In [5]:
gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=3, learning_rate=1.0)
gbrt.fit(X, y)

GradientBoostingRegressor(learning_rate=1.0, max_depth=2, n_estimators=3)

`learning_rate` scales the contribution of each tree. If you set it to a low value such as 0.1, you need more trees in the ensemble to fit the training set, but the predictions usually generalize better.  
=> This regularization technique is called *shrinkage*.

If there are not enough trees => underfitting.  
If there are too many trees => overfitting.

Use early stopping to find the optimal number of trees. A simple way to implement it is `staged_predict()`, which returns an iterator over the predictions made by the ensemble at each stage of training (with 1 tree, with 2 trees, etc.).

The following example trains a GBRT ensemble with 120 trees, measures the validation error at each stage to find the optimal number of trees, and then trains another GBRT ensemble with that number of trees.

In [6]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

In [16]:
X_train, X_val, y_train, y_val = train_test_split(X, y)

gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=120)
gbrt.fit(X_train, y_train)

errors = [mean_squared_error(y_val, y_pred) 
            for y_pred in gbrt.staged_predict(X_val)]
bst_n_estimators = np.argmin(errors) + 1

gbrt_best = GradientBoostingRegressor(max_depth=2, n_estimators=bst_n_estimators)
gbrt_best.fit(X_train, y_train)

GradientBoostingRegressor(max_depth=2, n_estimators=48)

In [17]:
bst_n_estimators

48

`warm_start=True` makes Scikit-Learn keep existing trees when `fit()` is called, allowing incremental training.

The following example stops training when the validation error does not improve for 5 iterations in a row.

In [22]:
gbrt = GradientBoostingRegressor(max_depth=2, warm_start=True)

min_val_error = float("inf")
error_going_up = 0
for n_estimators in range(1, 120):
    gbrt.n_estimators = n_estimators
    gbrt.fit(X_train, y_train)
    y_pred = gbrt.predict(X_val)
    val_error = mean_squared_error(y_val, y_pred)
    if val_error < min_val_error:
        min_val_error = val_error
        error_going_up = 0
    else:
        error_going_up += 1
        if error_going_up == 5:
            break   # early stopping

gbrt.n_estimators

53
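
As an alternative to the manual loop, `GradientBoostingRegressor` has its own early-stopping hyperparameters, `n_iter_no_change` and `validation_fraction`. The sketch below regenerates the noisy quadratic data so that it runs on its own; note that it holds out its own validation split from the data passed to `fit()` rather than using `X_val`, so the number of trees it settles on need not match the value above.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

np.random.seed(42)
X = np.random.rand(100, 1) - 0.5
y = 3 * X[:, 0] ** 2 + 0.05 * np.random.randn(100)

# Stop adding trees once the internal validation score has not improved
# for 5 consecutive iterations; 20% of the data is held out for that check
gbrt_es = GradientBoostingRegressor(
    max_depth=2, n_estimators=500,
    validation_fraction=0.2, n_iter_no_change=5, random_state=42)
gbrt_es.fit(X, y)

# Number of trees actually trained
print(gbrt_es.n_estimators_)
```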

`GradientBoostingRegressor` also supports `subsample` which specifies the fraction of training instances to be used for training each tree (eg. `subsample=0.25` -> each tree is trained on 25% of the training instances, selected randomly).  
=> This is called **Stochastic Gradient Boosting**.
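
A minimal sketch of Stochastic Gradient Boosting on the same noisy quadratic data, regenerated here so the example is self-contained. The values `n_estimators=100` and `random_state=42` are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

np.random.seed(42)
X = np.random.rand(100, 1) - 0.5
y = 3 * X[:, 0] ** 2 + 0.05 * np.random.randn(100)

# Each tree is fit on a random 25% of the training instances
gbrt_sgb = GradientBoostingRegressor(
    max_depth=2, n_estimators=100, subsample=0.25, random_state=42)
gbrt_sgb.fit(X, y)

y_pred_sgb = gbrt_sgb.predict(X)
print(len(gbrt_sgb.estimators_), y_pred_sgb.shape)
```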

> Note: Gradient Boosting can be used with other cost functions, controlled by `loss` hyperparameter.

XGBoost (Extreme Gradient Boosting) is a popular Python library that is optimized for Gradient Boosting. XGBoost's API is similar to Scikit-Learn's.

    import xgboost

    xgb_reg = xgboost.XGBRegressor()
    xgb_reg.fit(X_train, y_train)
    y_pred = xgb_reg.predict(X_val)

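    # XGBoost can take care of early stopping
    # (newer XGBoost releases expect early_stopping_rounds in the
    # XGBRegressor constructor instead of in fit())

    xgb_reg.fit(X_train, y_train, eval_set=[(X_val, y_val)], early_stopping_rounds=2)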
    y_pred = xgb_reg.predict(X_val)

## 7.6 Stacking

*Stacking (stacked generalization)* - Instead of using a trivial function (e.g. hard voting) to aggregate the predictions of all predictors in an ensemble, train a model to perform this aggregation.

*Blender (meta learner)* - The final predictor, which takes the other predictors' predictions as inputs and predicts the target value. It is commonly trained using a hold-out set.

For example,

1. Split training set into subset 1 and subset 2
2. Use subset 1 to train the predictors
3. Use predictors to make predictions on subset 2
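4. These predictions are "clean" because the predictors never saw subset 2 during training (it was held out)
5. Create a new training set using these predicted values as input features, keeping the original target values
6. Train the blender on this new training set; it learns to predict the target from the first-layer predictions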
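
The steps above can be sketched by hand on the moons data. The choice of first-layer predictors, the 50/50 split, the use of class probabilities as blender inputs, and the helper name `first_layer_features` are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Step 1: split the training set into subset 1 and subset 2 (hold-out)
X_sub1, X_sub2, y_sub1, y_sub2 = train_test_split(
    X_train, y_train, test_size=0.5, random_state=42)

# Step 2: train the first-layer predictors on subset 1
predictors = [
    LogisticRegression(random_state=42),
    RandomForestClassifier(random_state=42),
    SVC(probability=True, random_state=42),
]
for predictor in predictors:
    predictor.fit(X_sub1, y_sub1)


def first_layer_features(X_input):
    """Stack each predictor's class-1 probability into one feature column."""
    return np.column_stack(
        [predictor.predict_proba(X_input)[:, 1] for predictor in predictors])


# Steps 3-5: predictions on the held-out subset 2 become the new features
X_blend = first_layer_features(X_sub2)

# Step 6: train the blender on the new training set
blender = LogisticRegression(random_state=42)
blender.fit(X_blend, y_sub2)

# To predict, run new instances through the first layer, then the blender
y_pred_stack = blender.predict(first_layer_features(X_test))
stacking_accuracy = accuracy_score(y_test, y_pred_stack)
print(stacking_accuracy)
```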

> Note: Refer to book for detailed chart on stacking using a Blender.

Older versions of Scikit-Learn did not support stacking directly; since version 0.22, it provides `StackingClassifier` and `StackingRegressor`. You can also use an open source implementation such as `DESlib` or write your own.