# Ensemble Learning & Random Forests

Suppose you pose a complex question to thousands of random people, then aggregate their answers. In many cases, you will find that this aggregated answer is better than an expert's answer. This is called the *wisdom of the crowd*. Similarly, if you aggregate the predictions of a group of predictors (such as classifiers or regressors), you will often get better predictions than with the best individual predictor. A group of predictors is called an *ensemble*; thus, this technique is called *ensemble learning*, & an ensemble learning algorithm is called an *ensemble method*.

As an example of an ensemble method, you can train a group of decision tree classifiers, each on a different random subset of the training set. To make predictions, you obtain the predictions of all the individual trees, then predict the class that gets the most votes. Such an ensemble of decision trees is called a *random forest*, & despite its simplicity, it is one of the most powerful machine learning algorithms available today.

As discussed before, you will often use ensemble methods near the end of a project, once you have already built a few good predictors, to combine them into an even better predictor. In fact, the winning solutions in machine learning competitions often involve several ensemble methods.

In this lesson, we will discuss the most popular ensemble methods, including *bagging*, *boosting*, & *stacking*. We will also explore random forests.

---

# Voting Classifiers

Suppose you have trained a few classifiers, each achieving about 80% accuracy. You may have a logistic regression classifier, an SVM classifier, a random forest classifier, a K-nearest neighbors classifier, & perhaps a few more.

<img src = "Images/Diverse Classifiers.png" width = "450" style = "margin:auto"/>

A very simple way to create an even better classifier is to aggregate the predictions of each classifier & predict the class that gets the most votes. This majority-vote classifier is called a *hard-voting* classifier.

<img src = "Images/Hard Voting Classifier.png" width = "450" style = "margin:auto"/>

Somewhat surprisingly, this voting classifier often achieves a higher accuracy than the best classifier in the ensemble. In fact, even if each classifier is a *weak learner* (meaning it does only slightly better than random guessing), the ensemble can still be a *strong learner* (achieving high accuracy), provided there are a sufficient number of weak learners & they are sufficiently diverse.

How is this possible? The following analogy can help shed some light on this mystery. Suppose you have a slightly biased coin that has a 51% chance of coming up heads & a 49% chance of coming up tails. If you toss it 1,000 times, you will generally get more or less 510 heads & 490 tails, & hence a majority of heads. If you do the math, you will find that the probability of obtaining a majority of heads after 1,000 tosses is close to 75%. The more you toss the coin, the higher the probability (e.g., with 10,000 tosses, the probability climbs over 97%). This is due to the *law of large numbers*: as you keep tossing the coin, the ratio of heads gets closer & closer to the probability of heads (51%). The below figure shows 10 series of biased coin tosses. You can see that as the number of tosses increases, the ratio of heads approaches 51%. Eventually all 10 series end up so close to 51% that they are consistently above 50%.
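The probabilities quoted in the coin analogy can be checked with the binomial distribution. The sketch below is an illustration, not part of the original lesson: `prob_majority_heads` is a hypothetical helper that adds up the probability of every outcome with strictly more heads than tails. It works in log space (via `math.lgamma`) so that the very large binomial coefficients do not overflow.

```python
import math


def prob_majority_heads(n_tosses, p_heads=0.51):
    """Probability of strictly more heads than tails in n_tosses tosses."""
    log_p = math.log(p_heads)
    log_q = math.log(1 - p_heads)
    log_n_factorial = math.lgamma(n_tosses + 1)
    total = 0.0
    for k in range(n_tosses // 2 + 1, n_tosses + 1):
        # log of C(n, k) * p^k * (1 - p)^(n - k)
        log_pmf = (log_n_factorial
                   - math.lgamma(k + 1)
                   - math.lgamma(n_tosses - k + 1)
                   + k * log_p
                   + (n_tosses - k) * log_q)
        total += math.exp(log_pmf)
    return total


for n in (1_000, 10_000):
    print(n, "tosses:", round(prob_majority_heads(n), 3))
```

The two printed values should be roughly in line with the "close to 75%" & "over 97%" figures mentioned in the analogy.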

<img src = "Images/Law of Large Numbers.png" width = "500" style = "margin:auto"/>

Similarly, suppose you build an ensemble containing 1,000 classifiers that are individually correct only 51% of the time (barely better than random guessing). If you predict the majority voted class, you can hope for up to 75% accuracy! However, this is only true if all classifiers are perfectly independent, making uncorrelated errors, which is clearly not the case because they are trained on the same data. They are likely to make the same errors, so there will be many majority votes for the wrong class, reducing the ensemble's accuracy.

The following code creates & trains a voting classifier in scikit-learn, composed of three diverse classifiers (the training set is the moons dataset).

In [14]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_moons

X, y = make_moons(n_samples = 500, noise = 0.30, random_state = 42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state =42)

log_classifier = LogisticRegression()
forest_classifier = RandomForestClassifier()
svm_classifier = SVC()
voting_classifier = VotingClassifier(estimators = [("lr", log_classifier),
                                                   ("rf", forest_classifier),
                                                   ("svc", svm_classifier)],
                                    voting = "hard")
voting_classifier.fit(X_train, y_train)

VotingClassifier(estimators=[('lr', LogisticRegression()),
                             ('rf', RandomForestClassifier()), ('svc', SVC())])

Let's look at each classifier's accuracy on the test set.

In [15]:
from sklearn.metrics import accuracy_score

for classifier in (log_classifier, forest_classifier, svm_classifier, voting_classifier):
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    print(classifier.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.864
RandomForestClassifier 0.88
SVC 0.896
VotingClassifier 0.912


There you have it! The voting classifier slightly outperforms all the individual classifiers.

If all classifiers are able to estimate class probabilities (i.e., they all have a `predict_proba()` method), then you can tell scikit-learn to predict the class with the highest class probability, averaged over all the individual classifiers. This is called *soft voting*. It often achieves higher performance than hard voting because it gives more weight to highly confident votes. All you need to do is replace `voting = "hard"` with `voting = "soft"` & ensure that all classifiers can estimate class probabilities. This is not the case for the `SVC` class by default, so you need to set its `probability` hyperparameter to `True` (this makes the `SVC` class use cross-validation to estimate class probabilities, which slows down training, & adds a `predict_proba()` method). If you modify the preceding code to use soft voting, the voting classifier may achieve even higher accuracy, although the exact figures vary from run to run because the random forest is not seeded.
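To see why soft voting gives more weight to confident votes, here is a minimal sketch in plain Python. The three probability vectors are made-up numbers for a single instance, not the output of any classifier in this lesson: two classifiers lean slightly towards class 1, while the third is very confident in class 0.

```python
# Hypothetical class probabilities [P(class 0), P(class 1)] from three
# classifiers for one instance; the numbers are invented for illustration.
probabilities = [
    [0.45, 0.55],  # classifier A leans slightly towards class 1
    [0.45, 0.55],  # classifier B leans slightly towards class 1
    [0.90, 0.10],  # classifier C is very confident in class 0
]
n_classes = len(probabilities[0])

# Hard voting: each classifier casts one vote for its most probable class.
votes = [max(range(n_classes), key=lambda k: p[k]) for p in probabilities]
hard_prediction = max(set(votes), key=votes.count)

# Soft voting: average the probabilities per class, then pick the largest.
mean_probabilities = [sum(p[k] for p in probabilities) / len(probabilities)
                      for k in range(n_classes)]
soft_prediction = max(range(n_classes), key=lambda k: mean_probabilities[k])

print("hard voting predicts class", hard_prediction)
print("soft voting predicts class", soft_prediction)
```

Hard voting follows the two lukewarm votes, whereas soft voting lets the single confident classifier tip the average.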

In [16]:
log_classifier = LogisticRegression()
forest_classifier = RandomForestClassifier()
svm_classifier = SVC(probability = True)

voting_classifier = VotingClassifier(estimators = [("lr", log_classifier),
                                                   ("rf", forest_classifier), 
                                                   ("svc", svm_classifier)],
                                    voting = "soft")
voting_classifier.fit(X_train, y_train)

VotingClassifier(estimators=[('lr', LogisticRegression()),
                             ('rf', RandomForestClassifier()),
                             ('svc', SVC(probability=True))],
                 voting='soft')

In [17]:
for classifier in (log_classifier, forest_classifier, svm_classifier, voting_classifier):
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    print(classifier.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.864
RandomForestClassifier 0.896
SVC 0.896
VotingClassifier 0.912


---

# Bagging & Pasting

One way to get a diverse set of classifiers is to use very different training algorithms, as just discussed. Another approach is to use the same training algorithm for every predictor & train them on different random subsets of the training set. When sampling is performed *with* replacement, this method is called *bagging* (short for *bootstrap aggregating*). When sampling is performed *without* replacement, it is called *pasting*.

In other words, both bagging & pasting allow training instances to be sampled several times across multiple predictors, but only bagging allows training instances to be sampled several times for the same predictor. This sampling & training process is represented below.

<img src = "Images/Bagging.png" width = "500" style = "margin:auto"/>
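The difference between the two sampling schemes can be sketched with the standard library alone. This is an illustration on a toy index list, not how scikit-learn implements it: `random.Random.choices` draws with replacement (bagging), while `random.Random.sample` draws without replacement (pasting).

```python
import random

rng = random.Random(42)
training_indices = list(range(10))  # a toy training set of 10 instances
subset_size = 6

# Bagging: sample WITH replacement, so an index may appear more than once.
bagging_subset = rng.choices(training_indices, k=subset_size)

# Pasting: sample WITHOUT replacement, so every index appears at most once.
pasting_subset = rng.sample(training_indices, k=subset_size)

print("bagging:", sorted(bagging_subset))
print("pasting:", sorted(pasting_subset))
```

Repeating either draw for each predictor gives every predictor its own random subset; only the bagging subsets can contain duplicates.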

Once all predictors are trained, the ensemble can make a prediction for a new instance by simply aggregating the predictions of all predictors. The aggregation function is typically the *statistical mode* (i.e., the most frequent prediction, just like a hard-voting classifier) for classification, or the average for regression. Each individual predictor has a higher bias than if it were trained on the original training set, but aggregation reduces both bias & variance. Generally, the net result is that the ensemble has a similar bias but a lower variance than a single predictor trained on the original training set.

As you can see, predictors can all be trained in parallel, via different CPU cores or even different servers. Similarly, predictions can be made in parallel. This is one of the reasons bagging & pasting are such popular methods: they scale very well.

## Bagging & Pasting in Scikit-Learn

Scikit-learn offers a simple API for both bagging & pasting with the `BaggingClassifier` class (or `BaggingRegressor` for regression). The following code trains an ensemble of 500 decision tree classifiers: each is trained on 100 training instances randomly sampled from the training set with replacement (this is an example of bagging, but if you want to use pasting, just set `bootstrap = False`). The `n_jobs` parameter tells scikit-learn the number of CPU cores to use for training & predictions.

In [18]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_classifier = BaggingClassifier(DecisionTreeClassifier(), n_estimators = 500,
                                   max_samples = 100, bootstrap = True, n_jobs = 7)
bag_classifier.fit(X_train, y_train)
y_pred = bag_classifier.predict(X_test)

The below figure compares the decision boundary of a single decision tree with a decision boundary of a bagging ensemble of 500 trees (from the above code), both trained on the moons dataset. As you can see, the ensemble's predictions will likely generalise much better than the single decision tree's predictions: the ensemble has a comparable bias but a smaller variance (it makes roughly the same number of errors on the training set, but the decision boundary is less irregular).

<img src = "Images/Effect of Bagging.png" width = "600" style = "margin:auto"/>

Bootstrapping introduces a bit more diversity in the subsets that each predictor is trained on, so bagging ends up with a slightly higher bias than pasting; but the extra diversity also means that the predictors end up being less correlated, so the ensemble's variance is reduced. Overall, bagging often results in better models, which explains why it is generally preferred. However, if you have spare time & CPU power, you can use cross-validation to evaluate both bagging & pasting & select the one that works best.

## Out-of-Bag Evaluation

With bagging, some instances may be sampled several times for any given predictor, while others may not be sampled at all. By default, a `BaggingClassifier` samples *m* training instances with replacement (`bootstrap = True`), where *m* is the size of the training set. This means that only about 63% of the training instances are sampled on average for each predictor. The remaining 37% of the training instances that are not sampled are called *out-of-bag* (oob) instances. Note that they are not the same 37% for all predictors.
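The 63% figure follows from a short calculation: the probability that a given instance is never picked in *m* draws with replacement is $(1 - 1/m)^m$, which approaches $e^{-1} \approx 0.37$ as *m* grows. The sketch below checks this both analytically & empirically; the training set size `m = 10_000` is an arbitrary choice for the illustration.

```python
import math
import random

m = 10_000  # hypothetical training set size
rng = random.Random(42)

# Theoretical fraction of distinct instances in a bootstrap sample of size m.
expected_sampled = 1 - (1 - 1 / m) ** m

# Empirical check: draw one bootstrap sample and count the distinct indices.
bootstrap_sample = rng.choices(range(m), k=m)
observed_sampled = len(set(bootstrap_sample)) / m

print(f"theoretical: {expected_sampled:.3f} (limit 1 - 1/e = {1 - math.exp(-1):.3f})")
print(f"observed:    {observed_sampled:.3f}")
```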

Since a predictor never sees the oob instances during training, it can be evaluated on these instances, without the need for a separate validation set. You can evaluate the ensemble itself by averaging out the oob evaluations of each predictor.

In scikit-learn, you can set `oob_score = True` when creating a `BaggingClassifier` to request an automatic oob evaluation after training. The following code demonstrates this. The resulting evaluation score is available through the `oob_score_` variable.

In [19]:
bag_classifier = BaggingClassifier(DecisionTreeClassifier(), n_estimators = 500,
                                   bootstrap = True, n_jobs = 7, oob_score = True)
bag_classifier.fit(X_train, y_train)
bag_classifier.oob_score_

0.8986666666666666

According to this oob evaluation, the `BaggingClassifier` is likely to achieve about 90% accuracy on the test set. Let's verify this.

In [20]:
from sklearn.metrics import accuracy_score

y_pred = bag_classifier.predict(X_test)
accuracy_score(y_test, y_pred)

0.92

Yup, close enough!

The oob decision function for each training instance is also available through the `oob_decision_function_` variable. In this case (since the base estimator has a `predict_proba()` method), the decision function returns the class probabilities for each training instance. For example, the oob evaluation estimates that the first training instance has a 64.1% probability of belonging to the positive class (& 35.9% of belonging to the negative class):

In [21]:
bag_classifier.oob_decision_function_

array([[0.35869565, 0.64130435],
       [0.33497537, 0.66502463],
       [1.        , 0.        ],
       [0.        , 1.        ],
       [0.        , 1.        ],
       [0.08152174, 0.91847826],
       [0.35227273, 0.64772727],
       [0.01587302, 0.98412698],
       [0.98369565, 0.01630435],
       [0.97765363, 0.02234637],
       [0.76966292, 0.23033708],
       [0.        , 1.        ],
       [0.80232558, 0.19767442],
       [0.85786802, 0.14213198],
       [0.96446701, 0.03553299],
       [0.05154639, 0.94845361],
       [0.        , 1.        ],
       [0.99411765, 0.00588235],
       [0.95811518, 0.04188482],
       [0.98958333, 0.01041667],
       [0.        , 1.        ],
       [0.355     , 0.645     ],
       [0.87096774, 0.12903226],
       [1.        , 0.        ],
       [0.9754902 , 0.0245098 ],
       [0.        , 1.        ],
       [1.        , 0.        ],
       [1.        , 0.        ],
       [0.        , 1.        ],
       [0.66857143, 0.33142857],
       [0.

---

# Random Patches & Random Subspaces

The `BaggingClassifier` class supports sampling the features as well. Sampling is controlled by two hyperparameters: `max_features` & `bootstrap_features`. They work the same way as `max_samples` & `bootstrap`, but for feature sampling instead of instance sampling. Thus, each predictor will be trained on a random subset of the input features.

This technique is particularly useful when you are dealing with high-dimensional inputs (such as images). Sampling both training instances & features is called the *random patches method*. Keeping all training instances (by setting `bootstrap = False` & `max_samples = 1.0`) but sampling features (by setting `bootstrap_features` to `True` &/or `max_features` to a value smaller than `1.0`) is called the *random subspaces method*.

Sampling features results in even more predictor diversity, trading a bit more bias for a lower variance.
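As a rough sketch of how the two methods map onto `BaggingClassifier` hyperparameters, the code below configures one ensemble of each kind. The synthetic 20-feature dataset from `make_classification`, the variable names, & the specific sampling ratios (`0.7` & `0.5`) are assumptions chosen for the illustration; the moons dataset used elsewhere in this lesson has only two features, which is too few to make feature sampling interesting.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic 20-feature dataset, used only for this illustration.
X_demo, y_demo = make_classification(n_samples=300, n_features=20,
                                     n_informative=5, random_state=42)

# Random patches: sample both training instances and features.
patches_classifier = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=50,
    max_samples=0.7, bootstrap=True,
    max_features=0.5, bootstrap_features=True, random_state=42)

# Random subspaces: keep every training instance, sample only the features.
subspaces_classifier = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=50,
    max_samples=1.0, bootstrap=False,
    max_features=0.5, bootstrap_features=False, random_state=42)

for classifier in (patches_classifier, subspaces_classifier):
    classifier.fit(X_demo, y_demo)
    # estimators_features_ lists the feature indices each predictor was given.
    print(len(classifier.estimators_features_[0]), "features for the first predictor")
```

With `max_features = 0.5` & 20 input features, each predictor is given 10 feature indices (possibly with repeats when `bootstrap_features = True`).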

---

# Random Forests

As we have discussed, a random forest is an ensemble of decision trees, generally trained via the bagging method (or sometimes pasting), typically with `max_samples` set to the size of the training set. Instead of building a `BaggingClassifier` & passing it a `DecisionTreeClassifier`, you can use the `RandomForestClassifier` class, which is more convenient & optimised for decision trees (similarly, there is a `RandomForestRegressor` class for regression tasks). The following code trains a random forest classifier with 500 trees (each limited to a maximum of 16 leaf nodes), using seven CPU cores (setting `n_jobs = -1` would use all available cores):

In [22]:
from sklearn.ensemble import RandomForestClassifier

forest_classifier = RandomForestClassifier(n_estimators = 500, max_leaf_nodes = 16, n_jobs = 7)
forest_classifier.fit(X_train, y_train)
y_pred = forest_classifier.predict(X_test)

With a few exceptions, a `RandomForestClassifier` has all the hyperparameters of a `DecisionTreeClassifier` (to control how trees are grown), plus all the hyperparameters of a `BaggingClassifier` to control the ensemble itself.

The random forest algorithm introduces extra randomness when growing trees; instead of searching for the very best feature when splitting a node, it searches for the best feature among a random subset of features. The algorithm results in a greater tree diversity, which (again) trades higher bias for a lower variance, generally yielding an overall better model. The following `BaggingClassifier` is roughly equivalent to the previous `RandomForestClassifier`.

In [23]:
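# max_features = "sqrt" makes each split consider only a random subset of the features,
# which is the extra randomness a random forest adds on top of bagging
bag_classifier = BaggingClassifier(DecisionTreeClassifier(max_features = "sqrt", max_leaf_nodes = 16),
                                   n_estimators = 500, max_samples = 1.0, bootstrap = True, n_jobs = 7)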

## Extra-Trees

When you are growing a tree in a random forest, at each node, only a random subset of the features is considered for splitting. It is possible to make trees even more random by also using random thresholds for each feature rather than searching for the best possible thresholds (like regular decision trees do). 

A forest of such extremely random trees is called an *extremely randomised trees ensemble* (or *extra-trees* for short). Once again, this technique trades more bias for a lower variance. It also makes extra-trees much faster to train than regular random forests, because finding the best possible threshold for each feature at every node is one of the most time-consuming tasks of growing a tree.
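The difference between the two split strategies can be sketched on a single feature. This is a simplified illustration with made-up data, not scikit-learn's implementation: the helpers `gini` & `split_impurity` are hypothetical, & only one feature is considered. A regular tree scans every candidate threshold & keeps the best one, whereas an extra-trees style split simply draws a threshold at random within the feature's range.

```python
import random


def gini(labels):
    """Gini impurity of a list of binary (0/1) labels."""
    if not labels:
        return 0.0
    positive_ratio = sum(labels) / len(labels)
    return 2 * positive_ratio * (1 - positive_ratio)


def split_impurity(values, labels, threshold):
    """Weighted Gini impurity of the two children produced by a threshold."""
    left = [label for value, label in zip(values, labels) if value <= threshold]
    right = [label for value, label in zip(values, labels) if value > threshold]
    total = len(labels)
    return len(left) / total * gini(left) + len(right) / total * gini(right)


# Toy single-feature data, made up for illustration.
values = [0.1, 0.4, 0.5, 0.9, 1.3, 1.7]
labels = [0, 0, 0, 1, 1, 1]

# Regular tree: try every midpoint between consecutive values, keep the best.
ordered = sorted(values)
candidates = [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]
best_threshold = min(candidates, key=lambda t: split_impurity(values, labels, t))

# Extra-trees style: draw a single random threshold within the feature's range.
rng = random.Random(42)
random_threshold = rng.uniform(min(values), max(values))

print("best threshold:  ", best_threshold, split_impurity(values, labels, best_threshold))
print("random threshold:", random_threshold, split_impurity(values, labels, random_threshold))
```

The exhaustive search costs one impurity evaluation per candidate, while the random draw costs a single evaluation, which is where the training speed-up comes from; the price is that the random split is usually less pure.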

You can create an extra-trees classifier using scikit-learn's `ExtraTreesClassifier` class. Its API is identical to the `RandomForestClassifier` class. Similarly, the `ExtraTreesRegressor` class has the same API as the `RandomForestRegressor` class.

## Feature Importance

Yet another great quality of random forests is that they make it easy to measure the relative importance of each feature. Scikit-learn measures a feature's importance by looking at how much the tree nodes that use that feature reduce impurity on average (across all trees in the forest). More precisely, it is a weighted average, where each node's weight is equal to the number of training samples that are associated with it.

Scikit-learn computes this score automatically for each feature after training, then it scales the results so that the sum of all importances is equal to 1. You can access the result using the `feature_importances_` variable. For example, the following code trains a `RandomForestClassifier` on the iris dataset & outputs each feature's importance. It seems that the most important features are the petal width (46%) & petal length (42%), while sepal length & width are rather unimportant in comparison (9% & 2%, respectively).

In [24]:
from sklearn.datasets import load_iris

iris = load_iris()
forest_classifier = RandomForestClassifier(n_estimators = 500, n_jobs = 7)
forest_classifier.fit(iris["data"], iris["target"])

for name, score in zip(iris["feature_names"], forest_classifier.feature_importances_):
    print(name, score)

sepal length (cm) 0.09206149132219045
sepal width (cm) 0.023084001517177224
petal length (cm) 0.42419275364100606
petal width (cm) 0.46066175351962635


Similarly, if you train a random forest classifier on the MNIST dataset & plot each pixel's importance, you get the image represented below.

<img src = "Images/MNIST Pixel Importance.png" width = "550" style = "margin:auto"/>

Random forests are very handy to get a quick understanding of what features actually matter, in particular if you need to perform feature selection.
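The weighted-average computation described at the start of this section can be sketched for a single tree as follows. The node statistics are invented for the illustration (they are not taken from a trained model), & the variable names are hypothetical; a forest would then average such scores over all of its trees.

```python
from collections import defaultdict

# Hypothetical split nodes from one tree: (feature name, number of training
# samples reaching the node, impurity reduction achieved by the split).
split_nodes = [
    ("petal length", 150, 0.30),
    ("petal width", 100, 0.15),
    ("sepal length", 50, 0.06),
]
n_training_samples = 150

# Weight each node's impurity reduction by the share of samples that reach it.
raw_importance = defaultdict(float)
for feature, n_node_samples, impurity_reduction in split_nodes:
    raw_importance[feature] += n_node_samples / n_training_samples * impurity_reduction

# Scale the scores so that the importances sum to 1.
total = sum(raw_importance.values())
importances = {feature: score / total for feature, score in raw_importance.items()}

for feature, score in importances.items():
    print(f"{feature}: {score:.3f}")
```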

---

# Boosting

*Boosting* (originally called *hypothesis boosting*) refers to any ensemble method that can combine several weak learners into a strong learner. The general idea of most boosting methods is to train predictors sequentially, each trying to correct its predecessor. There are many boosting methods available, but by far the most popular are *AdaBoost* (short for *adaptive boosting*) & *gradient boosting*. 

## AdaBoost

One way for a new predictor to correct its predecessor is to pay a bit more attention to the training instances that the predecessor underfitted. This results in new predictors focusing more & more on the hard cases. This is the technique used by AdaBoost.

For example, when training an AdaBoost classifier, the algorithm first trains a base classifier (such as a decision tree) & uses it to make predictions on the training set. The algorithm then increases the relative weight of misclassified training instances. Then it trains a second classifier, using the updated weights, & again makes predictions on the training set, updates the instance weights, & so on.

<img src = "Images/AdaBoost.png" width = "500" style = "margin:auto"/>

The below figure shows the decision boundaries of five consecutive predictors on the moons dataset (in this example, each predictor is a highly regularised SVM classifier with an RBF kernel). The first classifier gets many instances wrong, so their weights get boosted. The second classifier therefore does a better job on these instances, & so on. The plot on the right represents the same sequence of predictors, except that the learning rate is halved (i.e., the misclassified instance weights are boosted half as much at every iteration). As you can see, this sequential learning technique has some similarities with gradient descent, except that instead of tweaking a single predictor's parameters to minimise a cost function, AdaBoost adds predictors to the ensemble, gradually making it better.

<img src = "Images/Decision Boundaries Consecutive Predictors.png" width = "550" style = "margin:auto"/>

Once all predictors are trained, the ensemble makes predictions very much like bagging or pasting, except that predictors have different weights depending on their overall accuracy on the weighted training set.

Let's take a closer look at the AdaBoost algorithm. Each instance weight $w^{(i)}$ is initially set to 1/*m*. A first predictor is trained, & its weighted error rate $r_1$ is computed on the training set.

$$r_j = \frac{\displaystyle\sum_{\substack{i = 1 \\ \hat{y}_j^{(i)} \neq y^{(i)}}}^{m} w^{(i)}}{\displaystyle\sum_{i = 1}^{m} w^{(i)}} \quad \text{where } \hat{y}_j^{(i)} \text{ is the } j^{\text{th}} \text{ predictor's prediction for the } i^{\text{th}} \text{ instance.}$$

The predictor's weight $\alpha_j$ is then computed using the below equation, where $\eta$ is the learning rate hyperparameter (defaults to 1). The more accurate the predictor is, the higher its weight will be. If it is just guessing randomly, then its weight will be close to zero. However, if it is most often wrong (i.e., less accurate than random guessing), then its weight will be negative.

$$\alpha_j = \eta \log \frac{1 - r_j}{r_j}$$

Next, the AdaBoost algorithm updates the instance weights using the below equation, which boosts the weights of the misclassified instances.

$$w^{(i)} \leftarrow \begin{cases}
w^{(i)} & \text{if } \hat{y}_j^{(i)} = y^{(i)} \\
w^{(i)} e^{\alpha_j} & \text{if } \hat{y}_j^{(i)} \neq y^{(i)}
\end{cases} \quad \text{for } i = 1, 2, \dots, m$$

Then all the instance weights are normalised (i.e., divided by $\sum^{m}_{i = 1} w^{(i)}$).
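One round of these updates can be worked through by hand. The sketch below uses five made-up training instances with a single misclassification; it follows the three steps described above (weighted error rate, predictor weight, instance weight update & normalisation) in plain Python & is not scikit-learn's implementation.

```python
import math

# Toy setup: five training instances with made-up labels and predictions.
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 1, 1, 1]   # only the last instance is misclassified
m = len(y_true)
weights = [1 / m] * m      # every instance weight starts at 1/m
learning_rate = 1.0        # eta

# Weighted error rate r_j: share of the total weight that was misclassified.
misclassified = [pred != true for pred, true in zip(y_pred, y_true)]
error_rate = sum(w for w, wrong in zip(weights, misclassified) if wrong) / sum(weights)

# Predictor weight alpha_j.
alpha = learning_rate * math.log((1 - error_rate) / error_rate)

# Boost the weights of the misclassified instances, then normalise.
weights = [w * math.exp(alpha) if wrong else w
           for w, wrong in zip(weights, misclassified)]
total = sum(weights)
weights = [w / total for w in weights]

print("error rate:", error_rate)
print("alpha:", alpha)
print("normalised weights:", weights)
```

With an error rate of 0.2, the predictor weight is $\log 4$, so the misclassified instance's weight is multiplied by 4 before normalisation; afterwards it carries about half of the total weight, which is why the next predictor concentrates on it.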

Finally, a new predictor is trained using the updated weights, & the whole process is repeated (the new predictor's weight is computed, the instance weights are updated, then another predictor is trained, & so on). The algorithm stops when the desired number of predictors is reached, or when a perfect predictor is found.

To make predictions, AdaBoost simply computes the predictions of all the predictors & weighs them using the predictor weights $\alpha_j$. The predicted class is the one that receives the majority of the weighted votes.

$$\hat{y}(x) = \underset{k}{\operatorname{argmax}} \sum_{\substack{j = 1 \\ \hat{y}_j(x) = k}}^{N} \alpha_j \quad \text{where } N \text{ is the number of predictors.}$$
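The weighted vote in the equation above can be sketched in a few lines. The predictions & predictor weights below are hypothetical values for a single instance, chosen so that the weighted vote disagrees with a plain majority vote.

```python
from collections import defaultdict

# Hypothetical predictions of three trained predictors for one instance,
# together with their predictor weights alpha_j.
predictions = [0, 1, 1]
alphas = [1.5, 0.4, 0.6]

# Add up the predictor weights behind each predicted class.
weighted_votes = defaultdict(float)
for predicted_class, alpha in zip(predictions, alphas):
    weighted_votes[predicted_class] += alpha

# The ensemble predicts the class with the largest total weight.
ensemble_prediction = max(weighted_votes, key=weighted_votes.get)
print(dict(weighted_votes), "->", ensemble_prediction)
```

Two of the three predictors vote for class 1, but the single accurate predictor behind class 0 outweighs them.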

Scikit-learn uses a multiclass version of AdaBoost called *SAMME* (which stands for *stagewise additive modeling using a multiclass exponential loss function*). When there are just two classes, SAMME is equivalent to AdaBoost. If the predictors can estimate class probabilities (i.e., if they have a `predict_proba()` method), scikit-learn can use a variant of SAMME called *SAMME.R* (the *R* stands for "Real"), which relies on class probabilities rather than predictions & generally performs better.

The following code trains an AdaBoost classifier based on 200 *decision stumps* using scikit-learn's `AdaBoostClassifier` class (as you might expect, there is also an `AdaBoostRegressor` class). A decision stump is a decision tree with `max_depth = 1` -- in other words, a tree composed of a single decision node plus two leaf nodes. This is the default base estimator for the `AdaBoostClassifier` class.

In [25]:
from sklearn.ensemble import AdaBoostClassifier

ada_classifier = AdaBoostClassifier(DecisionTreeClassifier(max_depth = 1), n_estimators = 200,
                                    algorithm = "SAMME.R", learning_rate = 0.5)
ada_classifier.fit(X_train, y_train)

AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=1),
                   learning_rate=0.5, n_estimators=200)

## Gradient Boosting

Another very popular boosting algorithm is *gradient boosting*. Just like AdaBoost, gradient boosting works by sequentially adding predictors to an ensemble, each one correcting its predecessor. However, instead of tweaking the instance weights at every iteration like AdaBoost does, this method tries to fit the new predictor to the *residual errors* made by the previous predictor.

Let's go through a simple regression example, using decision trees as the base predictors (of course, gradient boosting also works great with classification tasks). This is called *gradient tree boosting* or *gradient boosted regression trees* (GBRT). First, let's fit a `DecisionTreeRegressor` to the training set (for example, a noisy quadratic training set):

In [26]:
from sklearn.tree import DecisionTreeRegressor
import numpy as np

np.random.seed(42)
X = np.random.rand(100, 1) - 0.5
y = 3 * X[:, 0]**2 + 0.05 * np.random.randn(100)

tree_regression = DecisionTreeRegressor(max_depth = 2)
tree_regression.fit(X, y)

DecisionTreeRegressor(max_depth=2)

Next, we'll train a second `DecisionTreeRegressor` on the residual errors made by the first predictor.

In [28]:
y2 = y - tree_regression.predict(X)
tree_regression2 = DecisionTreeRegressor(max_depth = 2)
tree_regression2.fit(X, y2)

DecisionTreeRegressor(max_depth=2)

Then we train a third regressor on the residual errors made by the second predictor.

In [29]:
y3 = y2 - tree_regression2.predict(X)
tree_regression3 = DecisionTreeRegressor(max_depth = 2)
tree_regression3.fit(X, y3)

DecisionTreeRegressor(max_depth=2)

Now we have an ensemble containing three trees. It can make predictions on a new instance simply by adding up the predictions of all the trees.

In [31]:
X_new = np.array([[0.8]])
y_pred = sum(tree.predict(X_new) for tree in (tree_regression, tree_regression2, tree_regression3))
y_pred

array([0.75026781])

<img src = "Images/Gradient Boost.png" width = "550" style = "margin:auto"/>

The above figure represents the predictions of the three trees in the left column, & the ensemble's predictions in the right column. In the first row, the ensemble has just one tree, so its predictions are exactly the same as the first tree's predictions. In the second row, a new tree is trained on the residual errors of the first tree. On the right, you can see that the ensemble's predictions are equal to the sum of the predictions of the first two trees. Similarly, in the third row, another tree is trained on the residual errors of the second tree. You can see that the ensemble's predictions gradually get better as trees are added to the ensemble.

A simpler way to train GBRT ensembles is to use scikit-learn's `GradientBoostingRegressor` class. Much like the `RandomForestRegressor` class, it has hyperparameters to control the growth of decision trees (e.g., `max_depth`, `min_samples_leaf`), as well as hyperparameters to control the ensemble training, such as the number of trees (`n_estimators`). The following code creates the same ensemble as the previous one.

In [34]:
from sklearn.ensemble import GradientBoostingRegressor

gbrt = GradientBoostingRegressor(max_depth = 2, n_estimators = 3, learning_rate = 1.0)
gbrt.fit(X, y)

GradientBoostingRegressor(learning_rate=1.0, max_depth=2, n_estimators=3)

The `learning_rate` hyperparameter scales the contribution of each tree. If you set it to a low value, such as 0.1, you will need more trees in the ensemble to fit the training set, but the predictions will usually generalise better. This is a regularisation technique called *shrinkage*. The below figure shows two GBRT ensembles trained with a low learning rate: the one on the left does not have enough trees to fit the training set, while the one on the right has too many trees & overfits the training set.

<img src = "Images/GBRT Ensemble.png" width = "550" style = "margin:auto"/>
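To make the effect of the learning rate concrete, here is a deliberately tiny sketch of gradient boosting for squared error. It makes two simplifying assumptions that differ from `GradientBoostingRegressor`: the base learner is a trivial model that always predicts the mean of the current residuals (instead of a decision tree), & the ensemble starts from a prediction of 0. The function name `boost_with_mean_learner` is hypothetical.

```python
def boost_with_mean_learner(targets, learning_rate, n_stages):
    """Boost a constant 'mean of residuals' learner for n_stages rounds."""
    prediction = 0.0  # the ensemble's (constant) prediction so far
    for _ in range(n_stages):
        residuals = [target - prediction for target in targets]
        stage_output = sum(residuals) / len(residuals)  # fit the residuals
        prediction += learning_rate * stage_output      # shrink the contribution
    return prediction


targets = [2.0, 4.0, 6.0]  # made-up targets whose mean is 4.0

for learning_rate, n_stages in [(1.0, 1), (0.1, 3), (0.1, 100)]:
    result = boost_with_mean_learner(targets, learning_rate, n_stages)
    print(f"learning_rate={learning_rate}, stages={n_stages}: {result:.4f}")
```

With a learning rate of 1.0 a single stage already reaches the mean of the targets, while a learning rate of 0.1 needs many more stages to get there. The sketch only shows this trade-off between learning rate & ensemble size; it does not demonstrate the generalisation benefit of shrinkage.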

In order to find the optimal number of trees, you can use early stopping. A simple way to implement this is to use the `staged_predict()` method: it returns an iterator over the predictions made by the ensemble at each stage of training (with one tree, two trees, etc.). The following code trains a GBRT ensemble with 120 trees, then measures the validation error at each stage of training to find the optimal number of trees, & finally trains another GBRT ensemble using the optimal number of trees.

In [36]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X_train, X_val, y_train, y_val = train_test_split(X, y)

gbrt = GradientBoostingRegressor(max_depth = 2, n_estimators = 120)
gbrt.fit(X_train, y_train)

errors = [mean_squared_error(y_val, y_pred)
          for y_pred in gbrt.staged_predict(X_val)]
best_n_estimators = np.argmin(errors) + 1

gbrt_best = GradientBoostingRegressor(max_depth = 2, n_estimators = best_n_estimators)
gbrt_best.fit(X_train, y_train)

GradientBoostingRegressor(max_depth=2, n_estimators=110)

The validation errors are represented on the left of the below figure, & the best model's predictions are represented on the right.

<img src = "Images/GBRT Early Stopping.png" width = "600" style = "margin:auto"/>

It is also possible to implement early stopping by actually stopping training early (instead of training a large number of trees first & then looking back to find the optimal number). You can do so by setting `warm_start = True`, which makes scikit-learn keep existing trees when the `fit()` method is called, allowing incremental training. The following code stops training when the validation error does not improve for five iterations in a row.

In [37]:
gbrt = GradientBoostingRegressor(max_depth = 2, warm_start = True)

min_val_error = float("inf")
error_going_up = 0

for n_estimators in range(1, 120):
    gbrt.n_estimators = n_estimators
    gbrt.fit(X_train, y_train)
    y_pred = gbrt.predict(X_val)
    val_error = mean_squared_error(y_val, y_pred)
    if val_error < min_val_error:
        min_val_error = val_error
        error_going_up = 0
    else:
        error_going_up += 1
        if error_going_up == 5:
            break
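
As an alternative to the manual loop, `GradientBoostingRegressor` also has built-in early stopping through its `n_iter_no_change` & `validation_fraction` hyperparameters. The sketch below is an illustration rather than part of the original lesson: it regenerates a noisy quadratic dataset similar to the one used earlier (the names `X_demo`, `y_demo`, & `gbrt_early` & the specific settings are assumptions), then lets scikit-learn hold out part of the training data & stop once the validation score has not improved for five iterations.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Noisy quadratic data, similar to the dataset used earlier in this lesson.
rng = np.random.default_rng(42)
X_demo = rng.random((200, 1)) - 0.5
y_demo = 3 * X_demo[:, 0] ** 2 + 0.05 * rng.standard_normal(200)

gbrt_early = GradientBoostingRegressor(
    max_depth=2,
    n_estimators=500,         # upper bound on the number of trees
    validation_fraction=0.2,  # share of the training data held out internally
    n_iter_no_change=5,       # stop after 5 iterations without improvement
    random_state=42)
gbrt_early.fit(X_demo, y_demo)

# n_estimators_ reports how many trees were actually trained.
print(gbrt_early.n_estimators_)
```

The trade-off is that the validation data is carved out of the training set you pass to `fit()`, rather than being a separate set you control.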

The `GradientBoostingRegressor` class also supports a `subsample` hyperparameter, which specifies the fraction of training instances to be used for training each tree. For example, if `subsample = 0.25`, then each tree is trained on 25% of the training instances, selected randomly. As you can probably guess by now, this technique trades a higher bias for a lower variance. It also speeds up training considerably. This is called *stochastic gradient boosting*.

It is worth noting that an optimised implementation of gradient boosting is available in the popular Python library `xgboost`, which stands for Extreme Gradient Boosting. This package was initially developed by Tianqi Chen as part of the Distributed (Deep) Machine Learning Community (DMLC), & it aims to be extremely fast, scalable, & portable. In fact, XGBoost is often an important component of the winning entries in ML competitions. XGBoost's API is quite similar to scikit-learn's:

In [41]:
import xgboost

xgb_regression = xgboost.XGBRegressor()
xgb_regression.fit(X_train, y_train)
y_pred = xgb_regression.predict(X_val)

XGBoost also offers several nice features, such as automatically taking care of early stopping:

In [43]:
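# Note: newer XGBoost releases expect `early_stopping_rounds` in the
# XGBRegressor(...) constructor rather than in fit().
xgb_regression.fit(X_train, y_train,
                   eval_set = [(X_val, y_val)],
                   early_stopping_rounds = 2)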
y_pred = xgb_regression.predict(X_val)

[0]	validation_0-rmse:0.20391
[1]	validation_0-rmse:0.15822
[2]	validation_0-rmse:0.12129
[3]	validation_0-rmse:0.09943
[4]	validation_0-rmse:0.08471
[5]	validation_0-rmse:0.07288
[6]	validation_0-rmse:0.06541
[7]	validation_0-rmse:0.06082
[8]	validation_0-rmse:0.05825
[9]	validation_0-rmse:0.05654
[10]	validation_0-rmse:0.05598
[11]	validation_0-rmse:0.05572
[12]	validation_0-rmse:0.05564
[13]	validation_0-rmse:0.05587


Definitely check it out!

---

# Stacking

The last ensemble method we will discuss in this lesson is called *stacking* (short for *stacked generalisation*). It is based on a simple idea: instead of using trivial functions (such as hard voting) to aggregate the predictions of all predictors in an ensemble, why don't we train a model to perform this aggregation? The figure below shows such an ensemble performing a regression task on a new instance. Each of the bottom three predictors predicts a different value (3.1, 2.7, & 2.9), & then the final predictor (called the *blender*, or *meta learner*) takes these predictions as inputs & makes the final prediction (3.0).

<img src = "Images/Blending Predictor.png" width = "550" style = "margin:auto"/>

To train the blender, a common approach is to use a hold-out set. Let's see how it works. First, the training set is split into two subsets. The first subset is used to train the predictors in the first layer.

<img src = "Images/Blending Predictor HoldOut.png" width = "550" style = "margin:auto"/>

Next, the first layer's predictors are used to make predictions on the second (held-out) set. This ensures that the predictions are "clean", since the predictors never saw these instances during training. For each instance in the hold-out set, there are three predicted values. We can create a new training set using these predicted values as input features (which makes this new training set three-dimensional), & keeping the target values. The blender is trained on this new training set, so it learns to predict the target value, given the first layer's predictions.
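
The hold-out procedure described above can be sketched as follows. This is only one possible implementation: the synthetic data, the choice of first-layer predictors, the linear blender, & the helper `first_layer_predictions` are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (illustrative assumption)
rng = np.random.default_rng(0)
X_demo = rng.random((600, 2))
y_demo = np.sin(3 * X_demo[:, 0]) + X_demo[:, 1] ** 2 + 0.1 * rng.standard_normal(600)

# Split the training set: subset 1 trains the first layer, the hold-out set trains the blender
X_sub1, X_holdout, y_sub1, y_holdout = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=0
)

first_layer = [
    DecisionTreeRegressor(max_depth=4, random_state=0),
    KNeighborsRegressor(n_neighbors=5),
    RandomForestRegressor(n_estimators=50, random_state=0),
]
for predictor in first_layer:
    predictor.fit(X_sub1, y_sub1)

def first_layer_predictions(X_new):
    """Hypothetical helper: one column of predictions per first-layer predictor."""
    return np.column_stack([p.predict(X_new) for p in first_layer])

# "Clean" predictions on instances the first layer never saw become the blender's features
X_blend = first_layer_predictions(X_holdout)
blender = LinearRegression()
blender.fit(X_blend, y_holdout)

# Predicting for a new instance: first layer, then blender
X_new = np.array([[0.2, 0.7]])
y_new_pred = blender.predict(first_layer_predictions(X_new))
print(X_blend.shape, y_new_pred)
```

The blender's training set has one row per hold-out instance & three feature columns, one per first-layer predictor.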

<img src = "Images/Training the Blender.png" width = "550" style = "margin:auto"/>

It is actually possible to train several different blenders this way (e.g., one using linear regression, another using random forest regression), to get a whole layer of blenders. The trick is to split the training set into three subsets: the first one is used to train the first layer, the second one is used to create the training set used to train the second layer (using predictions made by the predictors of the first layer), & the third one is used to create the training set to train the third layer (using predictions made by the predictors of the second layer). Once this is done, we can make a prediction for a new instance by going through each layer sequentially.
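
A rough sketch of this three-subset scheme, assuming synthetic data & an arbitrary choice of models per layer (the helper `layer_outputs`, the function `predict_stack`, & every model choice here are illustrative assumptions):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (illustrative assumption); rows are already in random order
rng = np.random.default_rng(1)
X_demo = rng.random((900, 2))
y_demo = np.sin(3 * X_demo[:, 0]) + X_demo[:, 1] ** 2 + 0.1 * rng.standard_normal(900)

# Three disjoint subsets, one per layer
X1, X2, X3 = X_demo[:300], X_demo[300:600], X_demo[600:]
y1, y2, y3 = y_demo[:300], y_demo[300:600], y_demo[600:]

def layer_outputs(layer, X_in):
    """Hypothetical helper: stack each model's predictions as one feature column."""
    return np.column_stack([model.predict(X_in) for model in layer])

# Layer 1 is trained on subset 1
layer1 = [
    DecisionTreeRegressor(max_depth=4, random_state=1),
    RandomForestRegressor(n_estimators=30, random_state=1),
]
for model in layer1:
    model.fit(X1, y1)

# Layer 2 (several blenders) is trained on layer 1's predictions for subset 2
Z2 = layer_outputs(layer1, X2)
layer2 = [LinearRegression(), RandomForestRegressor(n_estimators=30, random_state=1)]
for model in layer2:
    model.fit(Z2, y2)

# Layer 3 (final blender) is trained on layer 2's predictions for subset 3
Z3 = layer_outputs(layer2, layer_outputs(layer1, X3))
final_blender = LinearRegression()
final_blender.fit(Z3, y3)

def predict_stack(X_new):
    """Pass new instances through each layer sequentially."""
    out = X_new
    for layer in (layer1, layer2):
        out = layer_outputs(layer, out)
    return final_blender.predict(out)

print(predict_stack(X_demo[:3]))
```

Each layer only ever trains on predictions for instances that the layers below it did not see during training, which is the point of using three subsets.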

<img src = "Images/Predictions in Multilayer Stacking Ensemble.png" width = "500" style = "margin:auto"/>

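Older versions of scikit-learn did not support stacking directly, but it isn't too hard to roll out your own implementation. Since version 0.22, scikit-learn also provides the `StackingClassifier` & `StackingRegressor` classes. Alternatively, you can use an open source implementation such as `DESlib`.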
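
For reference, here is a hedged sketch using the `StackingRegressor` class that ships with scikit-learn 0.22 & later. Unlike the single hold-out set described in this lesson, it trains the blender on cross-validated (out-of-fold) predictions. The data & the choice of estimators are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (illustrative assumption)
rng = np.random.default_rng(2)
X_demo = rng.random((400, 2))
y_demo = np.sin(3 * X_demo[:, 0]) + X_demo[:, 1] ** 2 + 0.1 * rng.standard_normal(400)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=2)

stacking_reg = StackingRegressor(
    estimators=[
        ("tree", DecisionTreeRegressor(max_depth=4, random_state=2)),
        ("knn", KNeighborsRegressor(n_neighbors=5)),
        ("forest", RandomForestRegressor(n_estimators=30, random_state=2)),
    ],
    final_estimator=LinearRegression(),  # the blender
    cv=5,  # the blender is trained on out-of-fold predictions from 5-fold cross-validation
)
stacking_reg.fit(X_tr, y_tr)

# R^2 score on data not used for training
r2 = stacking_reg.score(X_te, y_te)
print(r2)
```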