# Ensemble Learning and Random Forests

Suppose you pose a complex question to thousands of random people, then
aggregate their answers. In many cases you will find that this aggregated answer
is better than an expert’s answer. This is called the wisdom of the crowd.
Similarly, if you aggregate the predictions of a group of predictors (such as
classifiers or regressors), you will often get better predictions than with the best
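individual predictor. A group of predictors is called an ensemble, so this
technique is called Ensemble Learning, and an algorithm that implements it is
called an Ensemble method.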

As an example of an Ensemble method, you can train a group of Decision Tree
classifiers, each on a different random subset of the training set. To make
predictions, you obtain the predictions of all the individual trees, then predict the
class that gets the most votes (see the last exercise in Chapter 6). Such an
ensemble of Decision Trees is called a Random Forest, and despite its simplicity,
this is one of the most powerful Machine Learning algorithms available today.

As discussed in Chapter 2, you will often use Ensemble methods near the end of
a project, once you have already built a few good predictors, to combine them
into an even better predictor. In fact, the winning solutions in Machine Learning
competitions often involve several Ensemble methods (most famously in the
Netflix Prize competition).

# Voting Classifiers

Suppose you have trained a few classifiers, each one achieving about 80%
accuracy. You may have a Logistic Regression classifier, an SVM classifier, a
Random Forest classifier, a K-Nearest Neighbors classifier, and perhaps a few
more (see the figure below).

![image.png](attachment:image.png)

A very simple way to create an even better classifier is to aggregate the
predictions of each classifier and predict the class that gets the most votes. This
majority-vote classifier is called a hard voting classifier (see the figure below).

![image.png](attachment:image.png)

Somewhat surprisingly, this voting classifier often achieves a higher accuracy
than the best classifier in the ensemble. In fact, even if each classifier is a weak
learner (meaning it does only slightly better than random guessing), the
ensemble can still be a strong learner (achieving high accuracy), provided there
are a sufficient number of weak learners and they are sufficiently diverse.

How is this possible? The following analogy can help shed some light on this
mystery. Suppose you have a slightly biased coin that has a 51% chance of
coming up heads and 49% chance of coming up tails. If you toss it 1,000 times,
you will generally get more or less 510 heads and 490 tails, and hence a majority
of heads. If you do the math, you will find that the probability of obtaining a
strict majority of heads after 1,000 tosses is about 73% (close to 75% if an exact
500-500 tie also counts). The more you toss the coin,
the higher the probability (e.g., with 10,000 tosses, the probability climbs over
97%). This is due to the law of large numbers: as you keep tossing the coin, the
ratio of heads gets closer and closer to the probability of heads (51%). Figure 7-3
shows 10 series of biased coin tosses. You can see that as the number of tosses
increases, the ratio of heads approaches 51%. Eventually all 10 series end up so
close to 51% that they are consistently above 50%.

![image.png](attachment:image.png)
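The coin-toss probabilities can be checked directly. The following is a minimal sketch that sums the exact binomial probabilities in log space (to avoid overflow with large binomial coefficients); the helper name `prob_majority_heads` is introduced here for illustration only.

```python
import math

def prob_majority_heads(n_tosses, p_heads=0.51):
    """Exact probability of strictly more heads than tails in n_tosses tosses."""
    log_p, log_q = math.log(p_heads), math.log(1.0 - p_heads)
    total = 0.0
    for k in range(n_tosses // 2 + 1, n_tosses + 1):
        # log of C(n, k) * p^k * (1 - p)^(n - k), computed in log space
        log_term = (math.lgamma(n_tosses + 1) - math.lgamma(k + 1)
                    - math.lgamma(n_tosses - k + 1)
                    + k * log_p + (n_tosses - k) * log_q)
        total += math.exp(log_term)
    return total

p_1000 = prob_majority_heads(1_000)
p_10000 = prob_majority_heads(10_000)
print(f"1,000 tosses: {p_1000:.3f}")
print(f"10,000 tosses: {p_10000:.3f}")
```

The same reasoning applies to an ensemble of independent classifiers that are each right 51% of the time, although real classifiers trained on the same data are rarely independent.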

The following code creates and trains a voting classifier in Scikit-Learn,
composed of three diverse classifiers (the training set is the moons dataset,
introduced in Chapter 5):

In [1]:
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Moons dataset (assumed settings: the same as in the bagging example below)
X, y = make_moons(n_samples=300, noise=0.15)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, stratify=y)

log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC()

voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='hard')

# Fit each classifier, then the ensemble, and compare test accuracy
for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))


![image.png](attachment:image.png)

There you have it! The voting classifier slightly outperforms all the individual
classifiers.

If all classifiers are able to estimate class probabilities (i.e., they all have a
predict_proba() method), then you can tell Scikit-Learn to predict the class
with the highest class probability, averaged over all the individual classifiers.
This is called soft voting. It often achieves higher performance than hard voting
because it gives more weight to highly confident votes. All you need to do is
replace voting="hard" with voting="soft" and ensure that all classifiers can
estimate class probabilities. This is not the case for the SVC class by default, so
you need to set its probability hyperparameter to True (this makes SVC use
cross-validation to estimate class probabilities, which slows down training, and
adds a predict_proba() method).
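To see why soft voting can disagree with hard voting, here is a small sketch with made-up numbers: the three probabilities are hypothetical outputs for a single instance, not results from the classifiers above.

```python
from collections import Counter

# Hypothetical class-1 probabilities from three classifiers for one instance.
# Two classifiers lean weakly toward class 0; one is very confident in class 1.
class1_probas = [0.45, 0.40, 0.95]

# Hard voting: each classifier casts one vote for its most likely class.
hard_votes = [int(p >= 0.5) for p in class1_probas]
hard_prediction = Counter(hard_votes).most_common(1)[0][0]

# Soft voting: average the class probabilities, then pick the most likely class.
mean_class1_proba = sum(class1_probas) / len(class1_probas)
soft_prediction = int(mean_class1_proba >= 0.5)

print("hard:", hard_prediction, "soft:", soft_prediction)
```

Here the single confident vote outweighs the two hesitant ones under soft voting, while hard voting simply counts heads.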

# Bagging and Pasting

One way to get a diverse set of classifiers is to use very different training
algorithms, as just discussed. Another approach is to use the same training
algorithm for every predictor and train them on different random subsets of the
training set. When sampling is performed with replacement, this method is called
bagging (short for bootstrap aggregating). When sampling is performed
without replacement, it is called pasting.

In other words, both bagging and pasting allow training instances to be sampled
several times across multiple predictors, but only bagging allows training
instances to be sampled several times for the same predictor. This sampling and
training process is represented in the figure below.

![image.png](attachment:image.png)

Once all predictors are trained, the ensemble can make a prediction for a new
instance by simply aggregating the predictions of all predictors. The aggregation
function is typically the statistical mode (i.e., the most frequent prediction, just
like a hard voting classifier) for classification, or the average for regression.

Each individual predictor has a higher bias than if it were trained on the original
training set, but aggregation reduces both bias and variance. Generally, the net
result is that the ensemble has a similar bias but a lower variance than a single
predictor trained on the original training set.
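The difference between the two sampling schemes, and the two aggregation functions, can be sketched with the standard library alone. The index list and the prediction lists below are hypothetical placeholders rather than real model outputs.

```python
import random
from statistics import mean, mode

rng = random.Random(42)  # fixed seed so the sketch is reproducible
training_indices = list(range(10))
subset_size = 6

# Bagging: sample WITH replacement, so an instance can repeat within one subset.
bagging_subset = rng.choices(training_indices, k=subset_size)

# Pasting: sample WITHOUT replacement, so every instance in a subset is distinct.
pasting_subset = rng.sample(training_indices, k=subset_size)

# Aggregation: most frequent class for classification, average for regression.
class_predictions = [1, 0, 1, 1, 0]   # hypothetical votes from five classifiers
value_predictions = [3.1, 2.7, 2.9]   # hypothetical outputs from three regressors
ensemble_class = mode(class_predictions)
ensemble_value = mean(value_predictions)

print("bagging subset:", bagging_subset)
print("pasting subset:", pasting_subset)
print("ensemble class:", ensemble_class, "ensemble value:", ensemble_value)
```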


As you can see in the figure above, predictors can all be trained in parallel, via
different CPU cores or even different servers. Similarly, predictions can be made
in parallel. This is one of the reasons bagging and pasting are such popular
methods: they scale very well.

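# Bagging and Pasting in Scikit-Learn

Scikit-Learn offers a simple API for both bagging and pasting with the
BaggingClassifier class (or BaggingRegressor for regression). The following code
trains an ensemble of 500 Decision Tree classifiers, each trained on 220 training
instances randomly sampled from the training set with replacement (this is
bagging; for pasting, set bootstrap=False). The n_jobs parameter sets the number
of CPU cores to use (-1 means all available cores). A single Decision Tree is
trained on the same data for comparison: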

In [2]:
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=300, noise=0.15)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, stratify=y)

tree_clf = DecisionTreeClassifier()
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    max_samples=220, bootstrap=True, n_jobs=-1, oob_score=True)

bag_clf.fit(X_train, y_train)
tree_clf.fit(X_train, y_train)
# Compare test accuracy: bagging ensemble vs. a single Decision Tree
bag_clf.score(X_test, y_test), tree_clf.score(X_test, y_test)

(0.9733333333333334, 0.96)

The BaggingClassifier automatically performs soft voting instead of hard voting if the base
classifier can estimate class probabilities (i.e., if it has a predict_proba() method), which is
the case with Decision Tree classifiers.

Bootstrapping introduces a bit more diversity in the subsets that each predictor is
trained on, so bagging ends up with a slightly higher bias than pasting; but the
extra diversity also means that the predictors end up being less correlated, so the
ensemble’s variance is reduced. Overall, bagging often results in better models,
which explains why it is generally preferred. However, if you have spare time
and CPU power, you can use cross-validation to evaluate both bagging and
pasting and select the one that works best.

# Out-of-Bag Evaluation

With bagging, some instances may be sampled several times for any given
predictor, while others may not be sampled at all. By default a
BaggingClassifier samples m training instances with replacement
(bootstrap=True), where m is the size of the training set. This means that only
about 63% of the training instances are sampled on average for each predictor.
The remaining 37% of the training instances that are not sampled are called out-of-bag (oob) instances.
Note that they are not the same 37% for all predictors.
Since a predictor never sees the oob instances during training, it can be evaluated
on these instances, without the need for a separate validation set. You can
evaluate the ensemble itself by averaging out the oob evaluations of each
predictor.
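The 63% figure follows from the fact that an instance is missed by all m draws with probability (1 - 1/m)^m, which tends to 1/e (about 37%) as m grows. The sketch below checks this numerically; the function name is introduced here for illustration.

```python
import math
import random

def expected_sampled_fraction(m):
    """Expected fraction of distinct instances in a bootstrap sample of size m."""
    return 1.0 - (1.0 - 1.0 / m) ** m

m = 1_000
rng = random.Random(0)
bootstrap_sample = [rng.randrange(m) for _ in range(m)]  # m draws with replacement
observed_fraction = len(set(bootstrap_sample)) / m

print(f"expected: {expected_sampled_fraction(m):.3f}")
print(f"limit 1 - 1/e: {1.0 - math.exp(-1.0):.3f}")
print(f"observed in one bootstrap sample: {observed_fraction:.3f}")
```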

In Scikit-Learn, you can set oob_score=True when creating a
BaggingClassifier to request an automatic oob evaluation after training. The
following code demonstrates this. The resulting evaluation score is available
through the oob_score_ variable:

In [3]:
from sklearn.metrics import accuracy_score


y_pred = bag_clf.predict(X_test)
accuracy_score(y_test, y_pred), bag_clf.oob_score_

(0.9733333333333334, 0.9688888888888889)

The oob decision function for each training instance is also available through the
oob_decision_function_ variable. In this case (since the base estimator has a
predict_proba() method), the decision function returns the class probabilities
for each training instance.

# Random Patches and Random Subspaces

The BaggingClassifier class supports sampling the features as well. Sampling
is controlled by two hyperparameters: max_features and
bootstrap_features. They work the same way as max_samples and
bootstrap, but for feature sampling instead of instance sampling. Thus, each
predictor will be trained on a random subset of the input features.

This technique is particularly useful when you are dealing with high-dimensional
inputs (such as images). Sampling both training instances and features is called
the Random Patches method. Keeping all training instances (by setting
bootstrap=False and max_samples=1.0) but sampling features (by setting
bootstrap_features to True and/or max_features to a value smaller than
1.0) is called the Random Subspaces method.

Sampling features results in even more predictor diversity, trading a bit more
bias for a lower variance.
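As a rough illustration of the Random Subspaces settings described above, the following sketch trains a small BaggingClassifier on a synthetic dataset. The dataset, the number of estimators, and the 0.5 feature fraction are arbitrary choices made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic dataset with 20 features (an assumption for illustration only).
X, y = make_classification(n_samples=200, n_features=20, random_state=42)

# Random Subspaces: every tree sees all instances but only a random draw of features.
subspaces_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=20,
    bootstrap=False, max_samples=1.0,            # keep all training instances
    bootstrap_features=True, max_features=0.5,   # draw 10 of 20 features, with replacement
    random_state=42)
subspaces_clf.fit(X, y)

# estimators_features_ lists the feature indices drawn for each tree.
n_features_per_tree = [len(f) for f in subspaces_clf.estimators_features_]
print(n_features_per_tree)
```

Setting bootstrap=True (or max_samples below 1.0) in the same call would sample instances as well, which corresponds to Random Patches.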

# Random Forests

As we have discussed, a Random Forest is an ensemble of Decision Trees,
generally trained via the bagging method (or sometimes pasting), typically with
max_samples set to the size of the training set. Instead of building a
BaggingClassifier and passing it a DecisionTreeClassifier, you can
use the RandomForestClassifier class, which is more convenient and
optimized for Decision Trees (similarly, there is a RandomForestRegressor
class for regression tasks). The following code uses all available CPU cores to
train a Random Forest classifier with 500 trees (each limited to a maximum of 16
leaf nodes):

In [4]:
from sklearn.ensemble import RandomForestClassifier


rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)
rnd_clf.fit(X_train, y_train)

rnd_clf.score(X_test, y_test)

0.9733333333333334

With a few exceptions, a RandomForestClassifier has all the hyperparameters
of a DecisionTreeClassifier (to control how trees are grown), plus all the
hyperparameters of a BaggingClassifier to control the ensemble itself.

The Random Forest algorithm introduces extra randomness when growing trees;
instead of searching for the very best feature when splitting a node (see
Chapter 6), it searches for the best feature among a random subset of features.
The algorithm results in greater tree diversity, which (again) trades a higher bias
for a lower variance, generally yielding an overall better model. The following
BaggingClassifier is roughly equivalent to the previous
RandomForestClassifier:

In [5]:
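# max_features="sqrt" makes each split consider a random subset of the features,
# which is what a Random Forest does (splitter="random" would instead pick random
# thresholds, as in Extra-Trees)
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(max_features="sqrt", max_leaf_nodes=16),
    n_estimators=500, max_samples=1.0, bootstrap=True, n_jobs=-1)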

# Extra-Trees

When you are growing a tree in a Random Forest, at each node only a random
subset of the features is considered for splitting (as discussed earlier). It is
possible to make trees even more random by also using random thresholds for
each feature rather than searching for the best possible thresholds (like regular
Decision Trees do).

A forest of such extremely random trees is called an Extremely Randomized
Trees ensemble (or Extra-Trees for short). Once again, this technique trades
more bias for a lower variance. It also makes Extra-Trees much faster to train
than regular Random Forests, because finding the best possible threshold for
each feature at every node is one of the most time-consuming tasks of growing a
tree.

You can create an Extra-Trees classifier using Scikit-Learn’s
ExtraTreesClassifier class. Its API is identical to the
RandomForestClassifier class. Similarly, the ExtraTreesRegressor class
has the same API as the RandomForestRegressor class.

### TIP

It is hard to tell in advance whether a RandomForestClassifier will perform better or worse
than an ExtraTreesClassifier. Generally, the only way to know is to try both and compare
them using cross-validation (tuning the hyperparameters using grid search).
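A comparison along the lines of this tip might look like the following sketch. It uses default hyperparameters and a freshly generated moons dataset (both assumptions for brevity), so the scores only illustrate the procedure and are not a tuned benchmark.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_moons(n_samples=300, noise=0.15, random_state=42)

models = {
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=42),
    "ExtraTrees": ExtraTreesClassifier(n_estimators=100, random_state=42),
}
# Mean accuracy over 5 cross-validation folds for each model.
cv_scores = {name: cross_val_score(model, X, y, cv=5).mean()
             for name, model in models.items()}
for name, score in cv_scores.items():
    print(f"{name}: {score:.3f}")
```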

# Feature Importance

Yet another great quality of Random Forests is that they make it easy to measure
the relative importance of each feature. Scikit-Learn measures a feature’s
importance by looking at how much the tree nodes that use that feature reduce
impurity on average (across all trees in the forest). More precisely, it is a
weighted average, where each node’s weight is equal to the number of training
samples that are associated with it.

Scikit-Learn computes this score automatically for each feature after training,
then it scales the results so that the sum of all importances is equal to 1. You can
access the result using the feature_importances_ variable. For example, the
following code trains a RandomForestClassifier on the iris dataset and outputs each feature’s importance. 

In [6]:
from sklearn.datasets import load_iris
iris = load_iris()
rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
rnd_clf.fit(iris["data"], iris["target"])
for name, score in zip(iris["feature_names"], rnd_clf.feature_importances_):
    print(name, score)

sepal length (cm) 0.10342825446248312
sepal width (cm) 0.029766441727716732
petal length (cm) 0.442252961069407
petal width (cm) 0.42455234274039316


Random Forests are very handy to get a quick understanding of what features
actually matter, in particular if you need to perform feature selection.

# Boosting

Boosting (originally called hypothesis boosting) refers to any Ensemble method
that can combine several weak learners into a strong learner. The general idea of
most boosting methods is to train predictors sequentially, each trying to correct
its predecessor. There are many boosting methods available, but by far the most
popular are AdaBoost (short for Adaptive Boosting) and Gradient Boosting.
Let’s start with AdaBoost.

# AdaBoost

One way for a new predictor to correct its predecessor is to pay a bit more
attention to the training instances that the predecessor underfitted. This results in
new predictors focusing more and more on the hard cases. This is the technique
used by AdaBoost.

For example, when training an AdaBoost classifier, the algorithm first trains a
base classifier (such as a Decision Tree) and uses it to make predictions on the
training set. The algorithm then increases the relative weight of misclassified
training instances. Then it trains a second classifier, using the updated weights,
and again makes predictions on the training set, updates the instance weights,
and so on (see Figure 7-7).

![image.png](attachment:image.png)

Figure 7-8 shows the decision boundaries of five consecutive predictors on the
moons dataset (in this example, each predictor is a highly regularized SVM
classifier with an RBF kernel). The first classifier gets many instances wrong,
so their weights get boosted. The second classifier therefore does a better job on
these instances, and so on. The plot on the right represents the same sequence of
predictors, except that the learning rate is halved (i.e., the misclassified instance
weights are boosted half as much at every iteration). As you can see, this
sequential learning technique has some similarities with Gradient Descent,
except that instead of tweaking a single predictor’s parameters to minimize a
cost function, AdaBoost adds predictors to the ensemble, gradually making it
better.

![image.png](attachment:image.png)

Once all predictors are trained, the ensemble makes predictions very much like
bagging or pasting, except that predictors have different weights depending on
their overall accuracy on the weighted training set.

### WARNING

There is one important drawback to this sequential learning technique: it cannot be parallelized
(or only partially), since each predictor can only be trained after the previous predictor has
been trained and evaluated. As a result, it does not scale as well as bagging or pasting.

Let’s take a closer look at the AdaBoost algorithm.

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

To make predictions, AdaBoost simply computes the predictions of all the
predictors and weighs them using the predictor weights α_j. The predicted class
is the one that receives the majority of weighted votes (see Equation 7-4).

![image.png](attachment:image.png)
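The steps above can be sketched in a few lines of plain Python. This assumes the standard AdaBoost formulation (weighted error rate, predictor weight α = η log((1 - r) / r), exponential boost of misclassified instances, then normalization); the function names and toy labels are introduced here for illustration, and the sketch assumes an error rate strictly between 0 and 1.

```python
import math

def adaboost_round(weights, y_true, y_pred, learning_rate=1.0):
    """One AdaBoost round: return (predictor weight alpha, updated instance weights)."""
    # Weighted error rate of the current predictor (assumed to be in (0, 1)).
    wrong = [yt != yp for yt, yp in zip(y_true, y_pred)]
    error_rate = sum(w for w, bad in zip(weights, wrong) if bad) / sum(weights)
    # Predictor weight: large for accurate predictors, 0 for random guessing.
    alpha = learning_rate * math.log((1.0 - error_rate) / error_rate)
    # Boost the weights of misclassified instances, then normalize.
    boosted = [w * math.exp(alpha) if bad else w for w, bad in zip(weights, wrong)]
    total = sum(boosted)
    return alpha, [w / total for w in boosted]

def weighted_vote(predictions, alphas):
    """Predict the class with the largest sum of predictor weights."""
    scores = {}
    for pred, alpha in zip(predictions, alphas):
        scores[pred] = scores.get(pred, 0.0) + alpha
    return max(scores, key=scores.get)

y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 0, 1]   # hypothetical predictor: one mistake (index 1)
weights = [1 / 5] * 5      # instance weights start out uniform
alpha, new_weights = adaboost_round(weights, y_true, y_pred)
print("alpha:", alpha)
print("updated weights:", new_weights)
```

After this round the single misclassified instance carries half of the total weight, so the next predictor is pushed to get it right.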

Scikit-Learn uses a multiclass version of AdaBoost called SAMME (which
stands for Stagewise Additive Modeling using a Multiclass Exponential loss
function). When there are just two classes, SAMME is equivalent to AdaBoost.

If the predictors can estimate class probabilities (i.e., if they have a
predict_proba() method), Scikit-Learn can use a variant of SAMME called
SAMME.R (the R stands for “Real”), which relies on class probabilities rather
than predictions and generally performs better.

The following code trains an AdaBoost classifier based on 200 shallow Decision
Trees (max_depth=3) using Scikit-Learn’s AdaBoostClassifier class (as you might
expect, there is also an AdaBoostRegressor class). By default, the
AdaBoostClassifier class uses a Decision Stump as its base estimator, that is, a
Decision Tree with max_depth=1: a tree composed of a single decision node plus
two leaf nodes. Here the base estimator is passed explicitly to allow slightly
deeper trees:

In [33]:
from sklearn.ensemble import AdaBoostClassifier

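# Note: algorithm="SAMME.R" is deprecated in recent Scikit-Learn releases
# (since 1.4); omit the argument there.
ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=3),
    n_estimators=200,
    algorithm="SAMME.R", learning_rate=0.3)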

ada_clf.fit(X_train, y_train)
ada_clf.score(X_test, y_test)

0.9733333333333334

### TIP

If your AdaBoost ensemble is overfitting the training set, you can try reducing the number of
estimators or more strongly regularizing the base estimator.

# Gradient Boosting

Another very popular boosting algorithm is Gradient Boosting. Just like
AdaBoost, Gradient Boosting works by sequentially adding predictors to an
ensemble, each one correcting its predecessor. However, instead of tweaking the
instance weights at every iteration like AdaBoost does, this method tries to fit
the new predictor to the residual errors made by the previous predictor.

Let’s go through a simple regression example, using Decision Trees as the base
predictors (of course, Gradient Boosting also works great with classification
tasks). This is called Gradient Tree Boosting, or Gradient Boosted Regression
Trees (GBRT). First, let’s fit a DecisionTreeRegressor to the training set (here,
the California housing dataset):

In [90]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import fetch_california_housing

data = fetch_california_housing()

X = data.data
y = data.target

tree_reg1 = DecisionTreeRegressor(max_depth=2)
tree_reg1.fit(X, y)


DecisionTreeRegressor(max_depth=2)

Next, we’ll train a second DecisionTreeRegressor on the residual errors made by the first predictor:

In [91]:
y2 = y - tree_reg1.predict(X)
tree_reg2 = DecisionTreeRegressor(max_depth=2)
tree_reg2.fit(X, y2)

DecisionTreeRegressor(max_depth=2)

Then we train a third regressor on the residual errors made by the second predictor:

In [92]:
y3 = y2 - tree_reg2.predict(X)
tree_reg3 = DecisionTreeRegressor(max_depth=2)
tree_reg3.fit(X, y3)

DecisionTreeRegressor(max_depth=2)

Now we have an ensemble containing three trees. It can make predictions on a
new instance simply by adding up the predictions of all the trees:

In [98]:
y_pred = sum(tree.predict([X[88]]) for tree in (tree_reg1, tree_reg2, tree_reg3))

Figure 7-9 represents the predictions of these three trees in the left column, and
the ensemble’s predictions in the right column. In the first row, the ensemble has
just one tree, so its predictions are exactly the same as the first tree’s predictions.
In the second row, a new tree is trained on the residual errors of the first tree.
On the right you can see that the ensemble’s predictions are equal to the sum of the
predictions of the first two trees. Similarly, in the third row another tree is
trained on the residual errors of the second tree. You can see that the ensemble’s
predictions gradually get better as trees are added to the ensemble.

![image.png](attachment:image.png)

A simpler way to train GBRT ensembles is to use Scikit-Learn’s
GradientBoostingRegressor class. Much like the RandomForestRegressor
class, it has hyperparameters to control the growth of Decision Trees (e.g.,
max_depth, min_samples_leaf), as well as hyperparameters to control the
ensemble training, such as the number of trees (n_estimators). The following
code splits the data into a training set and a test set, then trains a larger
ensemble than the previous one (32 trees of depth 8, with a learning rate of 0.5):

In [128]:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

gbrt = GradientBoostingRegressor(max_depth=8, n_estimators=32, learning_rate=0.5)
gbrt.fit(X_train, y_train)

GradientBoostingRegressor(learning_rate=0.5, max_depth=8, n_estimators=32)

In [131]:
gbrt.score(X_test, y_test), gbrt.score(X_train, y_train)

(0.8004157968673229, 0.9629776479958986)

The learning_rate hyperparameter scales the contribution of each tree. If you
set it to a low value, such as 0.1, you will need more trees in the ensemble to fit
the training set, but the predictions will usually generalize better. This is a
regularization technique called shrinkage. Figure 7-10 shows two GBRT
ensembles trained with a low learning rate: the one on the left does not have
enough trees to fit the training set, while the one on the right has too many trees
and overfits the training set.

![image.png](attachment:image.png)
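The effect of shrinkage on the number of trees can be reproduced with a tiny from-scratch sketch. It uses depth-1 regression trees on a one-dimensional, noiseless quadratic target; all function names and the toy data are assumptions made for illustration, and this is not how Scikit-Learn implements GBRT internally.

```python
def fit_stump(xs, residuals):
    """Fit a depth-1 regression tree on 1D inputs: (threshold, left_mean, right_mean)."""
    best = None
    for threshold in sorted(set(xs))[1:]:  # skip the minimum so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x < threshold]
        right = [r for x, r in zip(xs, residuals) if x >= threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]

def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x < threshold else right_mean

def boost(xs, ys, n_trees, learning_rate):
    """Fit each new stump to the current residuals, shrinking its contribution."""
    stumps, predictions = [], [0.0] * len(xs)
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * predict_stump(stump, x)
                       for p, x in zip(predictions, xs)]
    return stumps, predictions

def mse(ys, predictions):
    return sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys)

xs = [i / 10 - 1.0 for i in range(21)]   # 21 points in [-1, 1]
ys = [3 * x ** 2 for x in xs]            # noiseless quadratic target (illustrative)

_, preds_fast = boost(xs, ys, n_trees=10, learning_rate=1.0)
_, preds_slow = boost(xs, ys, n_trees=10, learning_rate=0.1)
_, preds_slow_more = boost(xs, ys, n_trees=100, learning_rate=0.1)
for label, preds in [("lr=1.0, 10 trees", preds_fast),
                     ("lr=0.1, 10 trees", preds_slow),
                     ("lr=0.1, 100 trees", preds_slow_more)]:
    print(label, "training MSE:", round(mse(ys, preds), 4))
```

This only looks at training error; whether the slower, larger ensemble generalizes better has to be checked on validation data, as the next paragraphs do with early stopping.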

In order to find the optimal number of trees, you can use early stopping (see
Chapter 4). A simple way to implement this is to use the staged_predict()
method: it returns an iterator over the predictions made by the ensemble at each
stage of training (with one tree, two trees, etc.). The following code trains a
GBRT ensemble with 120 trees, then measures the validation error at each stage
of training to find the optimal number of trees, and finally trains another GBRT
ensemble using the optimal number of trees:

In [133]:
from sklearn.metrics import mean_squared_error
import numpy as np

X_train, X_val, y_train, y_val = train_test_split(X, y)

gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=120)
gbrt.fit(X_train, y_train)

errors = [mean_squared_error(y_val, y_pred) for y_pred in gbrt.staged_predict(X_val)]
# argmin is 0-based, so add 1 to get the number of trees
bst_n_estimators = int(np.argmin(errors)) + 1

gbrt_best = GradientBoostingRegressor(max_depth=2, n_estimators=bst_n_estimators)
gbrt_best.fit(X_train, y_train)


GradientBoostingRegressor(max_depth=2, n_estimators=120)

The validation errors are represented on the left of Figure 7-11, and the best
model’s predictions are represented on the right.

![image.png](attachment:image.png)

It is also possible to implement early stopping by actually stopping training early
(instead of training a large number of trees first and then looking back to find the
optimal number). You can do so by setting warm_start=True, which makes
Scikit-Learn keep existing trees when the fit() method is called, allowing
incremental training. The following code stops training when the validation error
does not improve for five iterations in a row:

In [134]:
gbrt = GradientBoostingRegressor(max_depth=2, warm_start=True)
min_val_error = float("inf")
error_going_up = 0

for n_estimators in range(1, 120):
    gbrt.n_estimators = n_estimators
    gbrt.fit(X_train, y_train)
    y_pred = gbrt.predict(X_val)
    val_error = mean_squared_error(y_val, y_pred)
    if val_error < min_val_error:
        min_val_error = val_error
        error_going_up = 0
    else:
        error_going_up += 1
        if error_going_up == 5:
            break # early stopping

The GradientBoostingRegressor class also supports a subsample
hyperparameter, which specifies the fraction of training instances to be used for
training each tree. For example, if subsample=0.25, then each tree is trained on
25% of the training instances, selected randomly. As you can probably guess by
now, this technique trades a higher bias for a lower variance. It also speeds up
training considerably. This is called Stochastic Gradient Boosting.


### NOTE

It is possible to use Gradient Boosting with other cost functions. This is controlled by the loss
hyperparameter (see Scikit-Learn’s documentation for more details).

It is worth noting that an optimized implementation of Gradient Boosting is
available in the popular Python library XGBoost, which stands for Extreme
Gradient Boosting. This package was initially developed by Tianqi Chen as part
of the Distributed (Deep) Machine Learning Community (DMLC), and it aims to
be extremely fast, scalable, and portable. In fact, XGBoost is often an important
component of the winning entries in ML competitions. XGBoost’s API is quite
similar to Scikit-Learn’s:

In [137]:
import xgboost

xgb_reg = xgboost.XGBRegressor()
xgb_reg.fit(X_train, y_train)

y_pred = xgb_reg.predict(X_val)

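XGBoost also offers several nice features, such as automatically taking care of early stopping:

In [140]:
# Stop when the validation RMSE has not improved for 2 consecutive rounds.
# Note: recent XGBoost releases (deprecated since 1.6) expect early_stopping_rounds
# to be passed to the XGBRegressor constructor instead of fit().
xgb_reg.fit(X_train, y_train, eval_set=[(X_val, y_val)], early_stopping_rounds=2)
xgb_reg.score(X_val, y_val)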

[0]	validation_0-rmse:1.44398
[1]	validation_0-rmse:1.11300
[2]	validation_0-rmse:0.89249
[3]	validation_0-rmse:0.75693
[4]	validation_0-rmse:0.66966
[5]	validation_0-rmse:0.61604
[6]	validation_0-rmse:0.58576
[7]	validation_0-rmse:0.56395
[8]	validation_0-rmse:0.53962
[9]	validation_0-rmse:0.52924
[10]	validation_0-rmse:0.52310
[11]	validation_0-rmse:0.51945
[12]	validation_0-rmse:0.51543
[13]	validation_0-rmse:0.51172
[14]	validation_0-rmse:0.50709
[15]	validation_0-rmse:0.50064
[16]	validation_0-rmse:0.49854
[17]	validation_0-rmse:0.49719
[18]	validation_0-rmse:0.49495
[19]	validation_0-rmse:0.49341
[20]	validation_0-rmse:0.49138
[21]	validation_0-rmse:0.48961
[22]	validation_0-rmse:0.48906
[23]	validation_0-rmse:0.48906
[24]	validation_0-rmse:0.48660
[25]	validation_0-rmse:0.48542
[26]	validation_0-rmse:0.48557
[27]	validation_0-rmse:0.48584


0.8234520905731895

# Stacking

The last Ensemble method we will discuss in this chapter is called stacking
(short for stacked generalization). It is based on a simple idea: instead of using
trivial functions (such as hard voting) to aggregate the predictions of all
predictors in an ensemble, why don’t we train a model to perform this aggregation?

Figure 7-12 shows such an ensemble performing a regression task
on a new instance. Each of the bottom three predictors predicts a different value
(3.1, 2.7, and 2.9), and then the final predictor (called a blender, or a meta
learner) takes these predictions as inputs and makes the final prediction (3.0).

![image.png](attachment:image.png)

To train the blender, a common approach is to use a hold-out set. Let’s see how
it works. First, the training set is split into two subsets. The first subset is used to
train the predictors in the first layer (see Figure 7-13).

![image.png](attachment:image.png)

Next, the first layer’s predictors are used to make predictions on the second
(held-out) set (see Figure 7-14). This ensures that the predictions are “clean,”
since the predictors never saw these instances during training. For each instance
in the hold-out set, there are three predicted values. We can create a new training
set using these predicted values as input features (which makes this new training
set three-dimensional), and keeping the target values. The blender is trained on this new
training set, so it learns to predict the target value, given the first layer’s
predictions.

![image.png](attachment:image.png)

It is actually possible to train several different blenders this way (e.g., one using
Linear Regression, another using Random Forest Regression), to get a whole
layer of blenders. The trick is to split the training set into three subsets: the first
one is used to train the first layer, the second one is used to create the training set
used to train the second layer (using predictions made by the predictors of the
first layer), and the third one is used to create the training set to train the third
layer (using predictions made by the predictors of the second layer). Once this is
done, we can make a prediction for a new instance by going through each layer
sequentially, as shown in Figure 7-15.

![image.png](attachment:image.png)

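Older versions of Scikit-Learn did not support stacking directly, but it is not too
hard to roll out your own implementation (see the following exercises). Since
version 0.22, Scikit-Learn provides the StackingClassifier and StackingRegressor
classes, which train the blender on cross-validated predictions rather than on a
single hold-out set. Alternatively, you can use an open source implementation
such as brew.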
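The hold-out procedure described in this section can be sketched by hand as follows. The synthetic dataset, the choice of three first-layer regressors, and the Linear Regression blender are all assumptions made for illustration, and the helper `first_layer_predictions` is a hypothetical name.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (an assumption for illustration only).
X, y = make_regression(n_samples=400, n_features=5, noise=10.0, random_state=42)
X_layer1, X_holdout, y_layer1, y_holdout = train_test_split(
    X, y, test_size=0.5, random_state=42)

# First layer: three predictors trained on the first subset only.
first_layer = [
    Ridge(),
    DecisionTreeRegressor(max_depth=4, random_state=42),
    RandomForestRegressor(n_estimators=50, random_state=42),
]
for predictor in first_layer:
    predictor.fit(X_layer1, y_layer1)

def first_layer_predictions(X_new):
    """Stack each predictor's output as one input feature for the blender."""
    return np.column_stack([p.predict(X_new) for p in first_layer])

# Blender: trained on "clean" predictions for the held-out subset.
blender = LinearRegression()
blender.fit(first_layer_predictions(X_holdout), y_holdout)

# New instances pass through the first layer, then the blender.
final_prediction = blender.predict(first_layer_predictions(X[:3]))
print(final_prediction)
```

The blender's learned coefficients (blender.coef_) indicate how much weight it gives to each first-layer predictor.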