# Ensemble Learning

"In this chapter, we will discuss the most popular Ensemble methods, including *bagging*, *boosting*, *stacking*, and a few others. We will also explore Random Forests."

#### Voting Classifiers

If you have a few classifiers trained on the same data (perhaps a Logistic Regression classifier, an SVM classifier, a Random Forest classifier, etc.) and each achieves ~80% accuracy on the test set, you can aggregate them into one *voting classifier* that often gets above 80% accuracy.

The simplest way to do this is to take the mode of the predictions, that is, whichever class receives the most votes across the classifiers. "This majority-vote classifier is called a **hard voting** classifier."
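
As a minimal sketch of the majority vote described above, the following plain-Python helper takes one list of predicted labels per classifier and returns the most common label for each instance. The function name `hard_vote` and the three prediction lists are hypothetical, made up for illustration; they are not outputs of the classifiers trained in these notes.

```python
from collections import Counter

def hard_vote(predictions_per_classifier):
    """Return the majority class for each instance.

    predictions_per_classifier: one list of predicted labels per classifier,
    all of the same length (one label per instance).
    """
    ensemble_predictions = []
    # zip(*...) regroups the labels so each tuple holds every classifier's vote for one instance
    for votes in zip(*predictions_per_classifier):
        # most_common(1) returns [(label, count)] for the most frequent label
        label, _count = Counter(votes).most_common(1)[0]
        ensemble_predictions.append(label)
    return ensemble_predictions

# Hypothetical predictions from three classifiers on four instances
log_preds = [0, 1, 1, 0]
rnd_preds = [0, 1, 0, 0]
svm_preds = [1, 1, 0, 1]

majority = hard_vote([log_preds, rnd_preds, svm_preds])
print(majority)
```

With an odd number of binary classifiers there is never a tie; in other settings this sketch resolves ties in favor of the label that appears first among the votes.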

#### Important Note about Voting Classifiers

"Ensemble methods work best when the predictors are as independent from one another as possible. One way to get diverse classifiers is to train them using very different algorithms. This increases the chance that they will make very different types of errors, improving the ensemble's accuracy."

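To make the role of independence concrete, here is a small calculation under an idealized assumption: every classifier is correct with the same probability and their errors are fully independent. In that case the number of correct votes follows a binomial distribution, and the chance that a strict majority is right can be computed exactly. The function name `majority_vote_accuracy` and the 51% figure are illustrative choices, not values from the notes.

```python
import math

def majority_vote_accuracy(n_classifiers, p_correct):
    """Probability that a strict majority of independent classifiers is correct.

    Assumes 0 < p_correct < 1 and that errors are independent across classifiers.
    """
    log_p = math.log(p_correct)
    log_q = math.log(1.0 - p_correct)
    total = 0.0
    # Sum the binomial probabilities of k correct votes for every k above half
    for k in range(n_classifiers // 2 + 1, n_classifiers + 1):
        # log of the binomial coefficient, computed in log space to avoid overflow
        log_binom = (math.lgamma(n_classifiers + 1)
                     - math.lgamma(k + 1)
                     - math.lgamma(n_classifiers - k + 1))
        total += math.exp(log_binom + k * log_p + (n_classifiers - k) * log_q)
    return total

# Each classifier is only slightly better than random guessing
for n in (1, 11, 101, 1001):
    print(n, round(majority_vote_accuracy(n, 0.51), 3))
```

Real classifiers trained on the same data make correlated errors, so the gain in practice is smaller than this idealized figure, which is why diverse algorithms help.
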
"The following code creates and trains a voting classifier in Scikit-Learn, composed of three diverse classifiers:"

In [3]:
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

x, y = make_moons(n_samples=1000, noise=0.4)
x_train, x_test, y_train, y_test = train_test_split(x, y)
y_train[0]

0

In [11]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC()

voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='hard')
voting_clf.fit(x_train,y_train)



VotingClassifier(estimators=[('lr',
                              LogisticRegression(C=1.0, class_weight=None,
                                                 dual=False, fit_intercept=True,
                                                 intercept_scaling=1,
                                                 l1_ratio=None, max_iter=100,
                                                 multi_class='warn',
                                                 n_jobs=None, penalty='l2',
                                                 random_state=None,
                                                 solver='warn', tol=0.0001,
                                                 verbose=0, warm_start=False)),
                             ('rf',
                              RandomForestClassifier(bootstrap=True,
                                                     class_weight=None,
                                                     criterion='gini',...
                                        

In [12]:
from sklearn.metrics import accuracy_score
for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.86
RandomForestClassifier 0.86
SVC 0.876
VotingClassifier 0.88




"There you have it! The voting classifier slightly outperforms all the individual classifiers."

"If all classifiers are able to estimate class probabilities (i.e., they have a `predict_proba()` method), then you can tell Scikit-Learn to predict the class with the highest class probability, averaged over all the individual classifiers. This is called **soft voting**. It often achieves higher performance than hard voting because it gives more weight to highly confident votes. All you need to do is replace `voting='hard'` with `voting='soft'` and ensure that all classifiers can estimate class probabilities. This is not the case of the `SVC` class by default, so you need to set its `probability` hyperparameter to `True` (this will make the SVC class use cross-validation to estimate class probabilities, slowing down training, and it will add a `predict_proba()` method). If you modify the preceding code to use soft voting, you will find that the voting classifier achieves over 91% accuracy!"

In [16]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC(probability=True)

voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='soft')
voting_clf.fit(x_train,y_train)



VotingClassifier(estimators=[('lr',
                              LogisticRegression(C=1.0, class_weight=None,
                                                 dual=False, fit_intercept=True,
                                                 intercept_scaling=1,
                                                 l1_ratio=None, max_iter=100,
                                                 multi_class='warn',
                                                 n_jobs=None, penalty='l2',
                                                 random_state=None,
                                                 solver='warn', tol=0.0001,
                                                 verbose=0, warm_start=False)),
                             ('rf',
                              RandomForestClassifier(bootstrap=True,
                                                     class_weight=None,
                                                     criterion='gini',...
                                        

In [18]:
from sklearn.metrics import accuracy_score
for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.86
RandomForestClassifier 0.852
SVC 0.876
VotingClassifier 0.868
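
To see how averaging probabilities can overrule a simple majority, here is a minimal sketch in plain Python. The probabilities are made-up values for a single instance, chosen for illustration; they are not outputs of the classifiers above.

```python
# Hypothetical class probabilities [P(class 0), P(class 1)] from three classifiers
# for one instance. Two classifiers lean weakly toward class 1, one is confident in class 0.
probas = [
    [0.45, 0.55],
    [0.45, 0.55],
    [0.90, 0.10],
]
n_classes = 2

# Hard voting: each classifier votes for its most probable class, then take the mode
hard_votes = [max(range(n_classes), key=lambda c: p[c]) for p in probas]
hard_prediction = max(set(hard_votes), key=hard_votes.count)

# Soft voting: average the probabilities per class, then pick the highest average
mean_proba = [sum(p[c] for p in probas) / len(probas) for c in range(n_classes)]
soft_prediction = max(range(n_classes), key=lambda c: mean_proba[c])

print("hard:", hard_prediction, "soft:", soft_prediction, "mean proba:", mean_proba)
```

The confident vote carries more weight under soft voting, so the two schemes can disagree on the same instance. Whether that helps on a given dataset depends on how well calibrated the probabilities are.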




#### Bagging and Pasting

While we can use a wide variety of estimators and various machine learning algorithms to get a diverse group of estimators, we can also use similar estimators, but train them on different data.

**Bagging** (short for *bootstrap aggregating*) means that sampling is performed with replacement, so the same training instance can be drawn several times for the same predictor. **Pasting**, by contrast, samples without replacement.
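
The difference between the two sampling schemes can be sketched with the standard library alone. The ten indices below stand in for training instances, and the sample size of six is an arbitrary choice for illustration.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable
training_indices = list(range(10))  # stand-in for 10 training instances
sample_size = 6

# Bagging: draw with replacement, so an index may appear more than once
bagging_sample = rng.choices(training_indices, k=sample_size)

# Pasting: draw without replacement, so every index is distinct
pasting_sample = rng.sample(training_indices, k=sample_size)

print("bagging:", sorted(bagging_sample))
print("pasting:", sorted(pasting_sample))
```

In an ensemble, this draw is repeated once per predictor, so each predictor is trained on its own subset.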

Using bagging/pasting, "once all predictors are trained, the ensemble can make a prediction for a new instance by simply aggregating the predictions of all predictors. The aggregation function is typically the *statistical mode*... for classification, or the average for regression. Each individual predictor has a higher bias than if it were trained on the original training set, but the aggregation reduces both bias and variance. Generally, the net result is that the ensemble has a similar bias but a lower variance than a single predictor trained on the original training set."

Luckily, predictors can all be trained in parallel, and predictions can be made in parallel, via different CPU cores or even different servers. Therefore, bagging and pasting are very popular methods.

#### Bagging and Pasting in Scikit-Learn

"Scikit-Learn offers a simple API for both bagging and pasting with the `BaggingClassifier` class (or `BaggingRegressor` for regression). The following code trains an ensemble of 500 Decision Tree classifiers, each trained on 100 training instances randomly sampled from the training set with replacement (this is an example of bagging, but if you want to use pasting instead, just set `bootstrap=False`)."

In [19]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    max_samples=100, bootstrap=True, n_jobs=-1)
bag_clf.fit(x_train, y_train)
y_pred = bag_clf.predict(x_test)

In [20]:
accuracy_score(y_test, y_pred)

0.884

#### Out-of-Bag Evaluation

With bagging, each estimator sees only ~63% of the training instances on average and misses the remaining ~37% (assuming each bootstrap sample is as large as the training set). Which instances are missed changes from estimator to estimator, so the ensemble as a whole is likely to see all the data. The ~37% that a given estimator never saw are its out-of-bag instances, and they can be used to evaluate it "without the need for a separate validation set or cross-validation. You can evaluate the ensemble itself by averaging out the **out-of-bag** (oob) evaluations of each predictor."

In [22]:
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    bootstrap=True, n_jobs=-1, oob_score=True)
bag_clf.fit(x_train, y_train)
bag_clf.oob_score_

0.848

The out-of-bag evaluation is a pretty good indicator of how well the whole ensemble will do on the test set.
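
The ~63% figure comes from a short calculation: when drawing m instances with replacement from a training set of size m, a given instance is missed by every draw with probability (1 - 1/m)^m, which approaches 1/e (about 37%) as m grows. The sketch below evaluates this for a few sizes; 750 is included on the assumption that the default `train_test_split` kept 75% of the 1,000 moons samples for training.

```python
import math

def expected_fraction_seen(m):
    """Expected share of distinct training instances in a bootstrap sample
    of size m drawn with replacement from m instances."""
    return 1.0 - (1.0 - 1.0 / m) ** m

for m in (10, 100, 750, 10_000):
    print(m, round(expected_fraction_seen(m), 4))

# Limit as m grows: 1 - 1/e
limit = 1.0 - math.exp(-1.0)
print("limit:", round(limit, 4))
```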

#### Random Patches and Random Subspaces

Not only can you sample instances randomly to get more diverse estimators, you can also randomly sample features! The `BaggingClassifier` class controls this with "two hyperparameters: `max_features` and `bootstrap_features`. They work the same way as `max_samples` and `bootstrap`, but for feature sampling instead of instance sampling. Thus, each predictor will be trained on a random subset of the input features."

"Sampling both training instances and features is called the *Random Patches* method. Keeping all training instances (i.e., `bootstrap=False` and `max_samples=1.0`) but sampling features (i.e., `bootstrap_features=True` and/or `max_features` smaller than 1.0) is called the *Random Subspaces* method."

"Sampling features results in even more predictor diversity, trading a bit more bias for a lower variance."

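As a hedged sketch of how the two methods map onto `BaggingClassifier` hyperparameters, the code below uses the iris dataset (four features) so that feature sampling is visible. The specific values (50 trees, 70% of instances, half of the features) are arbitrary choices for illustration, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Random Patches: sample both training instances and features
patches_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=50,
    max_samples=0.7, bootstrap=True,
    max_features=0.5, bootstrap_features=False,
    random_state=42)
patches_clf.fit(X, y)

# Random Subspaces: keep every training instance, sample only the features
subspaces_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=50,
    max_samples=1.0, bootstrap=False,
    max_features=0.5, bootstrap_features=False,
    random_state=42)
subspaces_clf.fit(X, y)

# estimators_features_ lists, for each tree, the column indices it was trained on
print(subspaces_clf.estimators_features_[:3])
```
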
#### Random Forests

"As we have discussed, a **Random Forest** is an ensemble of Decision Trees, generally trained via the bagging method (or sometimes pasting), typically with `max_samples` set to the size of the training set. Instead of building a `BaggingClassifier` and passing it a `DecisionTreeClassifier`, you can instead use the `RandomForestClassifier` class, which is more convenient and optimized for Decision Trees."

In [4]:
from sklearn.ensemble import RandomForestClassifier

rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)
rnd_clf.fit(x_train, y_train)

y_pred_rf = rnd_clf.predict(x_test)

"The Random Forest algorithm introduces extra randomness when growing trees; instead of searching for the very best feature when splitting a node, it searches for the best feature among a random subset of features. This results in a greater tree diversity, which (once again) trades a higher bias for a lower variance, generally yielding an overall better model."

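To connect this back to bagging, the sketch below builds a `BaggingClassifier` whose trees consider a random subset of features at each split, which is roughly the behavior described above. Treat it as an approximation rather than an exact equivalent of `RandomForestClassifier`; the dataset is regenerated with a fixed seed and 100 trees are used to keep the example quick, both of which are assumptions made here.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=1000, noise=0.4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

bag_clf = BaggingClassifier(
    # max_features="sqrt" makes each split search a random subset of the features
    DecisionTreeClassifier(max_features="sqrt", max_leaf_nodes=16),
    n_estimators=100,
    max_samples=1.0,   # each bootstrap sample is as large as the training set
    bootstrap=True,
    random_state=42)
bag_clf.fit(X_train, y_train)

bag_accuracy = accuracy_score(y_test, bag_clf.predict(X_test))
print("bagging of randomized trees:", bag_accuracy)
```
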
#### Extra-Trees

So far we have discussed the two forms of randomness that give Random Forests more bias and less variance: instance sampling and feature sampling. The "Extremely Randomized Trees" (Extra-Trees) ensemble adds a third. Instead of searching for the optimal threshold of each feature at each node, it uses a random threshold for each feature.

"Once again, this trades more bias for a lower variance. It also makes Extra-Trees much faster to train than regular Random Forests since finding the best possible threshold for each feature at every node is one of the most time-consuming tasks of growing a tree."

To use Extra-Trees in Scikit-Learn, use the `ExtraTreesClassifier` class.

Note: "It is hard to tell in advance whether a `RandomForestClassifier` will perform better or worse than an `ExtraTreesClassifier`. Generally, the only way to know is to try both and compare them using cross-validation (and tuning the hyperparameters using grid search)."
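
A minimal sketch of such a comparison, using 5-fold cross-validation on a freshly generated moons dataset. The fixed seeds, the 100 trees, and `max_leaf_nodes=16` are assumptions made for illustration, and no hyperparameter tuning is done here.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_moons(n_samples=1000, noise=0.4, random_state=42)

candidates = (
    RandomForestClassifier(n_estimators=100, max_leaf_nodes=16, random_state=42),
    ExtraTreesClassifier(n_estimators=100, max_leaf_nodes=16, random_state=42),
)

scores = {}
for clf in candidates:
    # Mean accuracy over 5 cross-validation folds
    cv_scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
    scores[clf.__class__.__name__] = cv_scores.mean()
    print(clf.__class__.__name__, round(cv_scores.mean(), 3))
```

Which of the two comes out ahead depends on the data, so the printed scores should be read as one data point rather than a general ranking.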

#### Feature Importance

Another helpful aspect of Random Forests is that they make it very easy to measure feature importance. "Scikit-Learn measures a feature's importance by looking at how much the tree nodes that use that feature reduce impurity on average (across all trees in the forest)."

In [5]:
from sklearn.datasets import load_iris

iris = load_iris()
rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
rnd_clf.fit(iris['data'], iris['target'])

for name, score in zip(iris['feature_names'], rnd_clf.feature_importances_):
    print(name, score)

sepal length (cm) 0.10499802484536648
sepal width (cm) 0.024466016172046443
petal length (cm) 0.4369996001078748
petal width (cm) 0.43353635887471215


"Random Forests are very handy to get a quick understanding of what features actually matter, in particular if you need to perform feature selection."

#### Boosting

"**Boosting** (originally called *hypothesis boosting*) refers to any Ensemble method that can combine several weak learners into a strong learner. The general idea of most boosting methods is to train predictors sequentially, each trying to correct its predecessor."

#### AdaBoost

"One way for a new predictor to correct its predecessor is to pay a bit more attention to the training instances that the predecessor underfitted."

"For example, to build an AdaBoost classifier, a first base classifier (such as a Decision Tree) is trained and used to make predictions on the training set. The relative weight of misclassified training instances is then increased. A second classifier is trained using the updated weights and again it makes predictions on the training set, weights are updated, and so on."

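The weight update can be sketched in a few lines of plain Python. This follows the classic discrete AdaBoost rule (predictor weight alpha = learning rate × log((1 − r) / r), where r is the weighted error rate), which is an assumption about the variant being described; Scikit-Learn's own implementation differs in its details. The function name `adaboost_reweight` and the five-instance example are hypothetical.

```python
import math

def adaboost_reweight(weights, is_misclassified, learning_rate=1.0):
    """One round of the classic (discrete) AdaBoost instance-weight update.

    Returns the predictor's weight alpha and the renormalized instance weights.
    Assumes the weighted error rate is strictly between 0 and 1.
    """
    total = sum(weights)
    # Weighted error rate of the current predictor
    error_rate = sum(w for w, wrong in zip(weights, is_misclassified) if wrong) / total
    # Predictor weight: large when the predictor is accurate, zero at 50% error
    alpha = learning_rate * math.log((1.0 - error_rate) / error_rate)
    # Boost only the misclassified instances, then renormalize so weights sum to 1
    boosted = [w * math.exp(alpha) if wrong else w
               for w, wrong in zip(weights, is_misclassified)]
    norm = sum(boosted)
    return alpha, [w / norm for w in boosted]

# Five instances with equal initial weights; the third one was misclassified
weights = [0.2] * 5
is_misclassified = [False, False, True, False, False]

alpha, new_weights = adaboost_reweight(weights, is_misclassified)
print("alpha:", round(alpha, 3))
print("new weights:", [round(w, 3) for w in new_weights])
```

After the update, the misclassified instance holds half of the total weight, so the next predictor is pushed to get it right.
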

**Note:** "There is one important drawback to this sequential learning technique: it cannot be parallelized (or only partially), since each predictor can only be trained after the previous predictor has been trained and evaluated. As a result, it does not scale as well as bagging or pasting."

"Scikit-Learn actually uses a multiclass version of AdaBoost called *SAMME*... When there are just two classes, SAMME is equivalent to AdaBoost. Moreover, if the predictors can estimate class probabilities, Scikit-Learn can use a variant of SAMME called *SAMME.R*, which relies on class probabilities rather than predictions and generally performs better."

In [7]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=200,
    algorithm='SAMME.R', learning_rate=0.5)
ada_clf.fit(x_train, y_train)

AdaBoostClassifier(algorithm='SAMME.R',
                   base_estimator=DecisionTreeClassifier(class_weight=None,
                                                         criterion='gini',
                                                         max_depth=1,
                                                         max_features=None,
                                                         max_leaf_nodes=None,
                                                         min_impurity_decrease=0.0,
                                                         min_impurity_split=None,
                                                         min_samples_leaf=1,
                                                         min_samples_split=2,
                                                         min_weight_fraction_leaf=0.0,
                                                         presort=False,
                                                         random_state=None,
                             

If your AdaBoost ensemble is overfitting the training set, you can try reducing the number of estimators or regularizing the base estimator more strongly.

#### Gradient Boosting

"Just like AdaBoost, Gradient Boosting works by sequentially adding predictors to an ensemble, each one correcting its predecessor. However, instead of tweaking the instance weights at every iteration like AdaBoost does, this method tries to fit the new predictor on the **residual errors** made by the previous predictor."

![gradientBoost](gradientBoost.jpg)

"Figure 7-9 represents the predictions of [three] trees in the left column, and the ensemble's predictions in the right column. In the first row, the ensemble has just one tree, so its predictions are exactly the same as the first tree's predictions. In the second row, a new tree is trained on the residual errors of the first tree. On the right, you can see that the ensemble's predictions are equal to the sum of the predictions of the first two trees. Similarly, in the third row another tree is trained on the residual errors of the second tree. You can see that the ensemble's predictions gradually get better as trees are added to the ensemble."

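The residual-fitting loop described above can be written out by hand with `DecisionTreeRegressor`. This is a sketch on synthetic data: the noisy quadratic dataset, the seed, and `max_depth=2` are assumptions made for illustration, and unlike `GradientBoostingRegressor` this version has no initial constant predictor and no learning rate.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data: a noisy quadratic (made up for this sketch)
rng = np.random.default_rng(42)
X = rng.uniform(-0.5, 0.5, size=(100, 1))
y = 3 * X[:, 0] ** 2 + 0.05 * rng.standard_normal(100)

trees = []
training_mse = []
residuals = y.copy()
for _ in range(3):
    # Each new tree is trained on what the ensemble so far still gets wrong
    tree = DecisionTreeRegressor(max_depth=2, random_state=42)
    tree.fit(X, residuals)
    trees.append(tree)
    residuals = residuals - tree.predict(X)
    training_mse.append(float(np.mean(residuals ** 2)))

def ensemble_predict(X_new):
    """The ensemble's prediction is the sum of the individual trees' predictions."""
    return sum(tree.predict(X_new) for tree in trees)

print("training MSE after each tree:", [round(m, 5) for m in training_mse])
```
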
In [8]:
from sklearn.ensemble import GradientBoostingRegressor

gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=3, learning_rate=1.0)
gbrt.fit(x,y)

GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
                          learning_rate=1.0, loss='ls', max_depth=2,
                          max_features=None, max_leaf_nodes=None,
                          min_impurity_decrease=0.0, min_impurity_split=None,
                          min_samples_leaf=1, min_samples_split=2,
                          min_weight_fraction_leaf=0.0, n_estimators=3,
                          n_iter_no_change=None, presort='auto',
                          random_state=None, subsample=1.0, tol=0.0001,
                          validation_fraction=0.1, verbose=0, warm_start=False)

The regularization technique known as **shrinkage** scales down each tree's contribution by setting the learning rate to a low value (such as 0.1). The ensemble then needs more estimators to fit the training set, but it will usually generalize better.
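
One hedged way to check this on your own data is to fit two ensembles, one without shrinkage and few trees, one with a low learning rate and many trees, and compare them on held-out data. The synthetic dataset, the seeds, and the two configurations below are assumptions made for illustration; which configuration wins will depend on the data.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data: a noisy quadratic (made up for this sketch)
rng = np.random.default_rng(42)
X = rng.uniform(-0.5, 0.5, size=(300, 1))
y = 3 * X[:, 0] ** 2 + 0.05 * rng.standard_normal(300)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

configs = {
    "no shrinkage, 3 trees": dict(learning_rate=1.0, n_estimators=3),
    "shrinkage, 200 trees": dict(learning_rate=0.1, n_estimators=200),
}

val_mse = {}
for name, params in configs.items():
    gbrt = GradientBoostingRegressor(max_depth=2, random_state=42, **params)
    gbrt.fit(X_train, y_train)
    # Mean squared error on the held-out validation split
    val_mse[name] = mean_squared_error(y_val, gbrt.predict(X_val))
    print(name, round(val_mse[name], 5))
```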