# Chapter 7: Ensemble Learning and Random Forests

<i>Ensemble Learning</i> is the technique of aggregating the decisions made by many Machine Learning models in order to get a final result. An ensemble learning algorithm is called an <i>Ensemble method</i>. An ensemble of Decision Trees is called a <i>Random Forest</i>. Models that win Machine Learning competitions often combine several Ensemble methods, e.g. the winner of the [Netflix Prize competition](http://netflixprize.com/).

## Voting Classifiers

A <i>hard voting</i> classifier is a model that trains multiple classifiers and then predicts the class that gets the most votes. Even if each model is a <i>weak learner</i> (only slightly better than random guessing), the ensemble can still be a <i>strong learner</i>, provided there are enough models and they are sufficiently diverse.

This works because, even if each model is only slightly better than random guessing, the more models' decisions you consider, the more likely it is that the majority will select the correct class. However, this is only true if the models are different enough that they do not make the same errors when classifying data.
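The majority-vote argument above can be checked with a short calculation. The sketch below assumes the models are fully independent, which real classifiers trained on the same data are not, and that each one is correct 51% of the time. The function name `majority_vote_accuracy` is hypothetical and exists only for this illustration.

```python
import math

def majority_vote_accuracy(n_models, p_correct):
    """Probability that a strict majority of n independent models is correct."""
    log_p, log_q = math.log(p_correct), math.log(1.0 - p_correct)
    total = 0.0
    # Sum the binomial probabilities of every outcome in which more than half
    # of the models are correct. Working in log-space avoids overflow and
    # underflow when n_models is large.
    for k in range(n_models // 2 + 1, n_models + 1):
        log_pmf = (math.lgamma(n_models + 1) - math.lgamma(k + 1)
                   - math.lgamma(n_models - k + 1)
                   + k * log_p + (n_models - k) * log_q)
        total += math.exp(log_pmf)
    return total

# Odd ensemble sizes are used so that a tied vote cannot occur.
for n in (1, 101, 1001, 10001):
    print(n, round(majority_vote_accuracy(n, 0.51), 3))
```

The printed accuracy grows with the number of models, but only under the independence assumption. Correlated errors would make the real gain smaller.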

In [14]:
# Example of a VotingClassifier

from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
import warnings

warnings.filterwarnings("ignore")

X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC()

voting_clf = VotingClassifier(
  estimators=[('lr',log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
  voting='hard')

for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
  clf.fit(X_train, y_train)
  print('{}:'.format(clf.__class__.__name__), clf.score(X_test, y_test))

LogisticRegression: 0.864
RandomForestClassifier: 0.896
SVC: 0.888
VotingClassifier: 0.904


## Bagging and Pasting

One way to use an Ensemble method is to train different types of classifiers, as shown above. Another is to train the same type of model on random subsets of the training set. Random sampling with replacement is called [bagging](http://statistics.berkeley.edu/sites/default/files/tech-reports/421.pdf). Sampling without replacement is called [pasting](https://link.springer.com/article/10.1023/A:1007563306331).

Bagging and pasting classifiers aggregate their predictors' outputs using the statistical mode, just like hard voting classifiers. Regressors typically use the average. Each individual predictor has a higher bias than it would if it were trained on the whole training set, but aggregation reduces both bias and variance. Generally, the ensemble has a bias similar to that of a single model trained on the whole training set but a lower variance, so ensemble models are less likely to overfit the training data.
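To make the sampling and aggregation steps concrete, here is a minimal hand-rolled sketch. It is not how Scikit-Learn implements `BaggingClassifier`. The helper names `fit_subsampled_trees` and `predict_by_mode`, the ensemble size of 51, and the subset size of 100 are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def fit_subsampled_trees(X, y, n_estimators=51, max_samples=100,
                         replace=True, seed=0):
    """Train one tree per random subset (replace=True: bagging, False: pasting)."""
    rng = np.random.default_rng(seed)
    trees = []
    for i in range(n_estimators):
        # Draw the indices of the training instances this tree will see.
        idx = rng.choice(len(X), size=max_samples, replace=replace)
        trees.append(DecisionTreeClassifier(random_state=i).fit(X[idx], y[idx]))
    return trees

def predict_by_mode(trees, X):
    """Aggregate by the statistical mode (assumes non-negative integer labels)."""
    votes = np.stack([tree.predict(X) for tree in trees])  # (n_trees, n_samples)
    return np.array([np.bincount(column).argmax() for column in votes.T])

X_demo, y_demo = make_moons(n_samples=500, noise=0.3, random_state=42)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, random_state=42)

bagged = fit_subsampled_trees(Xd_train, yd_train, replace=True)
pasted = fit_subsampled_trees(Xd_train, yd_train, replace=False)
bagging_acc = np.mean(predict_by_mode(bagged, Xd_test) == yd_test)
pasting_acc = np.mean(predict_by_mode(pasted, Xd_test) == yd_test)
print("bagging:", bagging_acc, "pasting:", pasting_acc)
```

The only difference between the two ensembles is the `replace` flag passed to the sampler. For a regressor, the mode would be swapped for a mean over the predictors' outputs.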

### Bagging and Pasting in Scikit-Learn

In [16]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Setting bootstrap=False will have the classifier use pasting
# instead of bagging.
bag_clf = BaggingClassifier(
  DecisionTreeClassifier(), n_estimators=500,
  max_samples=100, bootstrap=True, n_jobs=-1)

bag_clf.fit(X_train, y_train)
bag_clf.score(X_test, y_test)

0.904

Bootstrapping introduces more diversity into the subsets that each predictor is trained on, so bagging ends up with a slightly higher bias than pasting, but a lower variance because the predictors are less correlated. Generally, bagging results in better models than pasting.

### Out-of-Bag Evaluation

With bagging, each model is trained on a random sample of the training set, so some instances are never seen by a given model. The instances not included in the sample used to train a particular model are called that model's <i>out-of-bag</i> (oob) instances. You can evaluate an ensemble by averaging how well each model does on its oob instances.
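A quick simulation shows how many instances end up out-of-bag for a single model. This sketch assumes the bootstrap sample is the same size as the training set, which is the `BaggingClassifier` default of `max_samples=1.0`. The function `oob_fraction` and the set size of 10,000 are hypothetical choices for illustration.

```python
import numpy as np

def oob_fraction(n_instances, seed=0):
    """Fraction of instances left out of one bootstrap sample of the same size."""
    rng = np.random.default_rng(seed)
    # Sampling with replacement: some indices repeat, others never appear.
    sample = rng.integers(0, n_instances, size=n_instances)
    in_bag = np.unique(sample)
    return 1.0 - len(in_bag) / n_instances

m = 10_000
simulated = oob_fraction(m)
# Chance that a given instance is missed by all m independent draws.
expected = (1 - 1 / m) ** m
print(simulated, expected)
```

As the training set grows, the expected fraction approaches 1/e, so roughly a third of the instances serve as a validation set for each model at no extra cost.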

In [17]:
# An example of including the oob_score of a BaggingClassifier in evaluation.

bag_clf = BaggingClassifier(
  DecisionTreeClassifier(), n_estimators=500,
  bootstrap=True, n_jobs=-1, oob_score=True)
bag_clf.fit(X_train, y_train)
bag_clf.oob_score_

0.8933333333333333

In [19]:
# The score on the test set should be close to the oob_score_.

bag_clf.score(X_test, y_test)

0.904

In [24]:
# The oob_decision_function_ attribute contains, for each training instance,
# the average class probabilities predicted when it was an oob instance.

bag_clf.oob_decision_function_[:10]

array([[0.39393939, 0.60606061],
       [0.34183673, 0.65816327],
       [1.        , 0.        ],
       [0.        , 1.        ],
       [0.        , 1.        ],
       [0.06741573, 0.93258427],
       [0.35057471, 0.64942529],
       [0.01546392, 0.98453608],
       [1.        , 0.        ],
       [0.97340426, 0.02659574]])

## Random Patches and Random Subspaces

The `BaggingClassifier` also supports sampling the features, using the `max_features` and `bootstrap_features` hyperparameters. This is helpful for training sets with a large number of features.

Sampling both the training instances and features is called the <i>Random Patches</i> method. Keeping all training instances but sampling features is called the <i>Random Subspaces</i> method.
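The two methods differ only in how the sampling hyperparameters are set. The sketch below uses the iris dataset and arbitrary sampling ratios (0.7 for instances, 0.5 for features). These values are assumptions for illustration, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X_iris, y_iris = load_iris(return_X_y=True)

# Random Patches: sample both the training instances and the features.
patches_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=50,
    max_samples=0.7, bootstrap=True,
    max_features=0.5, bootstrap_features=True, random_state=42)

# Random Subspaces: keep every training instance, sample only the features.
subspaces_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=50,
    max_samples=1.0, bootstrap=False,
    max_features=0.5, bootstrap_features=False, random_state=42)

for name, clf in (("patches", patches_clf), ("subspaces", subspaces_clf)):
    clf.fit(X_iris, y_iris)
    # estimators_features_ lists the feature indices each predictor was given.
    print(name, clf.estimators_features_[:3])
```

With four features and `max_features=0.5`, each predictor is trained on two feature columns. When `bootstrap_features=True`, the same column may be drawn twice.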

## Random Forest

A <i>Random Forest</i> is an ensemble of Decision Trees, generally trained via the bagging method, typically with `max_samples` set to the training set size.

In [25]:
# An example of training Scikit-Learn's RandomForestClassifier class.

from sklearn.ensemble import RandomForestClassifier

rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)
rnd_clf.fit(X_train, y_train)
rnd_clf.score(X_test, y_test)

0.92

With a few exceptions, `RandomForestClassifier` has all of the hyperparameters of both a `DecisionTreeClassifier` and `BaggingClassifier`.

In [0]:
# The following BaggingClassifier is roughly equivalent to the previous
# RandomForestClassifier: each tree searches for the best split among a
# random subset of the features (max_features='sqrt').

bag_clf = BaggingClassifier(
  DecisionTreeClassifier(max_features='sqrt', max_leaf_nodes=16),
  n_estimators=500, max_samples=1.0, bootstrap=True, n_jobs=-1)

### Extra-Trees

When you train Random Forests, you can add extra randomness by having the Decision Trees use random thresholds for each feature rather than searching for the optimal threshold. These forests are called [Extremely Randomized Trees](https://orbi.uliege.be/bitstream/2268/9357/1/geurts-mlj-advance.pdf) (or <i>Extra-Trees</i> for short).

Extra-Trees are much faster to train than regular Random Forests, and they trade a higher bias for a lower variance. It is difficult to tell in advance whether a `RandomForestClassifier` or an `ExtraTreesClassifier` will perform better on a given Machine Learning problem, so you often have to try both to see which is the better model.
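One simple way to try both is to compare them with cross-validation on the same data. The sketch below reuses the moons dataset. The hyperparameter values and the 5-fold setting are assumptions for illustration, and the result on this toy dataset says nothing about other problems.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X_cmp, y_cmp = make_moons(n_samples=500, noise=0.3, random_state=42)

# Both classes share the same API, so they can be swapped freely.
candidates = {
    "RandomForestClassifier": RandomForestClassifier(
        n_estimators=100, max_leaf_nodes=16, random_state=42),
    "ExtraTreesClassifier": ExtraTreesClassifier(
        n_estimators=100, max_leaf_nodes=16, random_state=42),
}

cv_scores = {}
for name, model in candidates.items():
    # Mean accuracy over 5 cross-validation folds.
    cv_scores[name] = cross_val_score(model, X_cmp, y_cmp, cv=5).mean()
    print(name, cv_scores[name])
```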

### Feature Importance

Random Forest classifiers can also be used to determine <i>feature importance</i>, which Scikit-Learn measures as a weighted average of how much the tree nodes that use a feature reduce Gini impurity, across all trees in the forest. Each node's weight is the number of training instances associated with it.

Scikit-Learn's `RandomForestClassifier` computes the feature importances automatically during training, then scales the result so that the sum of the importances equals 1.

In [28]:
# An example of using RandomForestClassifier to determine and compare
# feature importances of a training set.

from sklearn.datasets import load_iris

iris = load_iris()
rnd_clf = RandomForestClassifier(n_estimators=100, n_jobs=-1)
rnd_clf.fit(iris.data, iris.target)
for name, score in zip(iris['feature_names'], rnd_clf.feature_importances_):
  print(name, score)

sepal length (cm) 0.09232365491637388
sepal width (cm) 0.024334044306895022
petal length (cm) 0.4486165186699227
petal width (cm) 0.4347257821068083
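The normalization can be checked directly. The sketch below also compares the forest's importances with the average of the per-tree importances. That this average matches is an assumption about how current Scikit-Learn versions aggregate the values, not something stated above. The names `forest`, `per_tree`, and `averaged` are hypothetical.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris_data = load_iris()
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(iris_data.data, iris_data.target)

# Each fitted tree exposes its own importances through feature_importances_.
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])
averaged = per_tree.mean(axis=0)
averaged = averaged / averaged.sum()  # rescale so the importances sum to 1

print("sum of importances:", forest.feature_importances_.sum())
print("matches per-tree average:",
      np.allclose(averaged, forest.feature_importances_))
```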
