# Chapter 7
## Ensemble Learning and Random Forests
_Enemble learning_ is essentially multiple machine learning models aggregating to produce a consensus answer. If you were to do this with decision trees, this would end up being a random forests.

Here, we will discuss popular methods such as bagging, boosting, stacking, and a few others.

## Voting Classifiers
Suppose you have a group of classifers that each acchieve about 80% accuracy. One simple way to get an even better classifer is to aggregate all of the results. This majority-vote classifer is called _hard voting_. This works most effectivly when the predictors are as independent as possible.

Here's a breif example of a voting classifer using three diverse classifers.

In [1]:
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

# TODO: Split into train and test sets
X, y = make_moons(n_samples=10000, noise=0.4)
X_train, X_test, y_train, y_test = train_test_split(X, y)



from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC


log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC()

voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='hard'
)
voting_clf.fit(X_train, y_train)

VotingClassifier(estimators=[('lr', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)), ('rf', RandomF...,
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))],
         flatten_transform=None, n_jobs=1, voting='hard', weights=None)

An a quick look at all of the accuracys...

In [2]:
from sklearn.metrics import accuracy_score


for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.82
RandomForestClassifier 0.8356
SVC 0.86
VotingClassifier 0.8532


  if diff:


## Bagging and Pasting
One way to get a set of diverse classifiers is to do the method above and use very different algorithems.. or you can use the same algorithm on different random subsets of the traning set. When done with replacement, it is called _bagging_(bootstrap aggregating), when without replacement it's _pasting_.

Once all trained, you simplely aggregate the predictions. This is called _statistical mode_ for classification(hard voting), and average for regression.

The result is that compared to a single predictor, the bias is similar but the variance is lower. The predictors can be trained in parallel.

### Bagging and Pasting in Scikit-Learn
Here's 500 Decision Tree Classifiers with bagging!

In [3]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    max_samples=100, bootstrap=True, n_jobs=-1
)
bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)

Bagging produces more variance than pasting, but overall less variance.

## Out-of-Bag Evaluation
With bagging, some instanes may not be sampled at all. Statistically speaking, about 63% of the training instances are sampled, leaving 37% of instances are not sampled, called _out-of-bag_(oob) instances.

Scikit let's you use this like sooo

In [4]:
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    bootstrap=True, n_jobs=-1, oob_score=True
)
bag_clf.fit(X_train, y_train)
bag_clf.oob_score_

0.8418666666666667

According to this, we should get about an 85 percent accuracy... Let's test this!

In [5]:
y_pred = bag_clf.predict(X_test)
accuracy_score(y_test, y_pred)

0.8448

## Random Patches and Random Subspaces
This does feature sampling instead of feature sampling. 'Tis usefull when working on high-dimensional data(images for example). Sampling both training and features is called _Random Patches method_. Keeping all of the instances buyt sampling features is called _Random Subspaces_ method.

## Random Forests
'Tis an ensemble of Decision Trees! Here it is as follows!

In [6]:
from sklearn.ensemble import RandomForestClassifier

rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)
rnd_clf.fit(X_train, y_train)

y_pred_rf = rnd_clf.predict(X_test)

## Feature Importance
Decision trees have this intrisic property that all of the important features are towards the root of the tree, with the least important being towards the leaves. Scikit has feature importance built in to it to view it!

In [7]:
from sklearn.datasets import load_iris

iris = load_iris()
rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
rnd_clf.fit(iris["data"], iris["target"])
for name, score in zip(iris["feature_names"], rnd_clf.feature_importances_):
    print(name, score)

sepal length (cm) 0.09664357853359043
sepal width (cm) 0.02130162777026598
petal length (cm) 0.45080323010544354
petal width (cm) 0.4312515635907


## Boosting
Originally called _hypothesis boosting_ refers to any Ensemble method combine several weak learners into a strong learner. Generally, the boosting methods train sequentially, each trying to correct it's predicessor. While there are many boosting methods, the most popular are _AdaBoost_ (Short for _Adaptive Boosting_) and _Gradient Boosting_. First, AdaBoosting.

### AdaBoost
One way for a new predictor to correct it's predicessor is to pay more attention to the instances that were underfitted. By giving them more weight, the instance gets to be recognized by later classifiers.

**IMPORTANT NOTE:** The one major drawback of this sequential learning technique is that it _cannot_ be parallelized (or only partially). As a result, its doesn't scale as well as bagging or pasting.

Refer to the book for the equation. Scikit acutally uses a multiclass version of AdaBoost called _SAMME_(_Stagewise Additive Modeling using a Multiclass Exponential loss function_). It the predictors use probibilites, then you can use _SAMME.R_ (_R_ for "Real") for generally better performance.

Here's an AdaBoost classifier based one 200 _Decision Stumps_ using the `AdaBoosterClassifier`. A Decision Stump is a Decision Tree with max_depth=1. This is the default estimator for the `AdaBoostClassifier`.

In [8]:
from sklearn.ensemble import AdaBoostClassifier

ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=200,
    algorithm="SAMME.R", learning_rate=0.5
)
ada_clf.fit(X_train, y_train)

AdaBoostClassifier(algorithm='SAMME.R',
          base_estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=1,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best'),
          learning_rate=0.5, n_estimators=200, random_state=None)

**NOTE:** If AdaBoost is overfitting, try reducing the number of estimators or more strongly regularize the base estimator.

### Gradient Boosting
Just like AdaBoost, Gradient Boosting works by sequentially adding predictors to an ensemble, each correcting it's predicessor. However, instead of tweaking weights, this method tries to fit the new predictor to the _residual errors_ made by the previous predictor.

Were going to go through simple regression using Decision Trees as the base predictors. This is called _Gradient Tree Boosting_ or _Gradient Boosted Regression Trees_(GBRT). First, we fit a `DecisionTreeRegressor` on the training set.

In [9]:
from sklearn.tree import DecisionTreeRegressor

tree_reg1 = DecisionTreeRegressor(max_depth=2)
tree_reg1.fit(X, y)

DecisionTreeRegressor(criterion='mse', max_depth=2, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')

Now train a second `DecisionTreeRegressor` on the residual errors made by the first predictor

In [10]:
y2 = y - tree_reg1.predict(X)
tree_reg2 = DecisionTreeRegressor(max_depth=2)
tree_reg2.fit(X, y2)

DecisionTreeRegressor(criterion='mse', max_depth=2, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')

Then train a third regressor...

In [11]:
y3 = y2 - tree_reg2.predict(X)
tree_reg3 = DecisionTreeRegressor(max_depth=2)
tree_reg3.fit(X, y3)

DecisionTreeRegressor(criterion='mse', max_depth=2, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')

Now we can have an ensemble containing three trees. It can make predictions one a new instance by adding up the predicitions of all three trees:

In [14]:
X_new = X - 1
y_pred = sum(tree.predict(X_new) for tree in (tree_reg1, tree_reg2, tree_reg3))

A simpler way to train GBRT ensembles is to use Scikit's `GradientBoostingRegressor` as follows.

In [15]:
from sklearn.ensemble import GradientBoostingRegressor

gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=3, learning_rate=1.0)
gbrt.fit(X, y)

GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
             learning_rate=1.0, loss='ls', max_depth=2, max_features=None,
             max_leaf_nodes=None, min_impurity_decrease=0.0,
             min_impurity_split=None, min_samples_leaf=1,
             min_samples_split=2, min_weight_fraction_leaf=0.0,
             n_estimators=3, presort='auto', random_state=None,
             subsample=1.0, verbose=0, warm_start=False)

The learning_rate hyperparameter can be use to make better generalizations, called _shrinkage_. To find the optimal number of trees you can use early stopping, or use `staged_predict()`:

In [16]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X_train, X_val, y_train, y_val = train_test_split(X, y)

gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=120)
gbrt.fit(X_train, y_train)

errors = [mean_squared_error(y_val, y_pred)
          for y_pred in gbrt.staged_predict(X_val)]
bst_n_estimators = np.argmin(errors)

gbrt_best = GradientBoostingRegressor(max_depth=2, n_estimators=bst_n_estimators)
gbrt_best.fit(X_train, y_train)

GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
             learning_rate=0.1, loss='ls', max_depth=2, max_features=None,
             max_leaf_nodes=None, min_impurity_decrease=0.0,
             min_impurity_split=None, min_samples_leaf=1,
             min_samples_split=2, min_weight_fraction_leaf=0.0,
             n_estimators=119, presort='auto', random_state=None,
             subsample=1.0, verbose=0, warm_start=False)

You can also implement early stopping. Here, we stop if there's no improvements 5 times in a row.

In [17]:
gbrt = GradientBoostingRegressor(max_depth=2, warm_start=True)

min_val_error = float("inf")
error_going_up = 0
for n_estimators in range(1, 120):
    gbrt.n_estimators = n_estimators
    gbrt.fit(X_train, y_train)
    y_pred = gbrt.predict(X_val)
    val_error = mean_squared_error(y_val, y_pred)
    if val_error < min_val_error:
        min_val_error = val_error
        error_going_up = 0
    else:
        error_going_up += 1
        if error_going_up == 5:
            break

You can also use random subsamples to trade higher bias for lower variance. This is called _Stochastic Gradient Boosting_.

## Stacking
Short for _stacked generalization_, it's idea is fairly simple. Insead of using trivial functions(such as hard voting) to aggregate the predictions of the predictors in an ensemble why don't we train a model to perform this aggregation!!! The final predictor(called a _blender_ or _meta learner_) takes the predictions as inputs and makes a final prediction.

To train a blender, a common approach is to use a hold-out set. First you split a training set into two subsets. The first subset is used to train the predictors on the first layer. Next the first layer predictors made predictions on the second set. We can create a new training set useing thes input values as features. The blender is trained on this so it learns to predict the target value givent the first layer's predictions.

Scikit doesn't support this, but it's easy to implement this, or use an open source implementation such as brew.