# Chapter 7: Ensemble Learning and Random Forests

*Ensemble* - A group of predictors  
*Ensemble Learning* - Technique of aggregating the predictions of a group of predictors  
*Ensemble method* - An Ensemble Learning algorithm

## 7.1 Voting Classifiers

Suppose you have a few classifiers, each with 80% accuracy. A simple way to make a better classifier is to aggregate the predictions of each classifier and predict the class that gets the most votes.

*Hard voting* classifier - Majority-vote classifier

> Note: Ensemble methods work best when the predictors are as independent from one another as possible. One way to get diverse classifiers is to train them using very different algorithms, increasing chances of making different errors and thus improving the ensemble's accuracy.
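
To see why aggregation can help, here is a minimal sketch that computes the accuracy of a majority vote under the (unrealistic) assumption that the classifiers make fully independent errors and are each correct with the same probability. The helper name `majority_vote_accuracy` is made up for this illustration.

```python
from math import comb

def majority_vote_accuracy(n_classifiers, p_correct):
    """Probability that a strict majority of independent classifiers is correct.

    Assumes a binary task, an odd number of classifiers (so no ties),
    and fully independent errors, which classifiers trained on the
    same data do not satisfy in practice.
    """
    k_min = n_classifiers // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_classifiers, k) * p_correct**k * (1 - p_correct)**(n_classifiers - k)
        for k in range(k_min, n_classifiers + 1)
    )

for n in (1, 3, 11, 51):
    print(n, round(majority_vote_accuracy(n, 0.80), 4))
```

With three such classifiers the vote is correct with probability 0.8³ + 3 × 0.8² × 0.2 = 0.896. Real classifiers are correlated, so the actual gain is usually smaller.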

Let's take a look at the moons dataset.

In [1]:
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

In [2]:
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [3]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

In [14]:
log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC()

voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='hard')
voting_clf.fit(X_train, y_train)

VotingClassifier(estimators=[('lr', LogisticRegression()),
                             ('rf', RandomForestClassifier()), ('svc', SVC())])

Now look at each classifier's accuracy on the test set.

In [15]:
from sklearn.metrics import accuracy_score

In [16]:
for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.864
RandomForestClassifier 0.896
SVC 0.896
VotingClassifier 0.904


*Soft voting* - Predict the class with the highest class **probability**, averaged over all the individual classifiers.  
It gives more weight to highly confident votes, so it often achieves higher performance than hard voting.
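
The difference between the two voting schemes can be sketched on a single instance. The probabilities below are made-up numbers chosen so that the two schemes disagree. They are not output from the classifiers above.

```python
import numpy as np

# Hypothetical predicted probabilities for one instance from three classifiers.
# Columns are [P(class 0), P(class 1)].
probas = np.array([
    [0.55, 0.45],  # classifier A: weakly prefers class 0
    [0.55, 0.45],  # classifier B: weakly prefers class 0
    [0.05, 0.95],  # classifier C: strongly prefers class 1
])

# Hard voting: each classifier casts one vote for its most likely class.
votes = probas.argmax(axis=1)
hard_prediction = np.bincount(votes, minlength=2).argmax()

# Soft voting: average the probabilities, then pick the most likely class.
mean_proba = probas.mean(axis=0)
soft_prediction = mean_proba.argmax()

print("hard:", hard_prediction, "soft:", soft_prediction)
```

Hard voting picks class 0 (two votes to one), while soft voting picks class 1 because the one confident classifier outweighs the two hesitant ones.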

> Note: Just replace `voting='hard'` with `voting='soft'` and ensure all classifiers can estimate class probabilities (i.e. they have a `predict_proba()` method). `SVC` does not by default, so create it with `probability=True`.
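
A self-contained sketch of a soft-voting ensemble on the same moons data follows. The `random_state` values and the variable names `soft_voting_clf` and `soft_accuracy` are choices made for this example, so the score will not necessarily match the numbers printed above.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

soft_voting_clf = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(random_state=42)),
        ("rf", RandomForestClassifier(random_state=42)),
        # SVC only exposes predict_proba() when probability=True
        ("svc", SVC(probability=True, random_state=42)),
    ],
    voting="soft",
)
soft_voting_clf.fit(X_train, y_train)
soft_accuracy = accuracy_score(y_test, soft_voting_clf.predict(X_test))
print("soft voting accuracy:", soft_accuracy)
```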

## 7.2 Bagging and Pasting

Another way is to train the same algorithm on different random subsets of the training set.  

*Bagging (bootstrap aggregating)* - When sampling is performed **with** replacement  
*Pasting* - When sampling is performed **without** replacement

The aggregation function is typically the *statistical mode* (the most frequent prediction) for classification, or the average for regression.
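
The two sampling schemes and the two aggregation functions can be sketched with NumPy. The toy sizes and the prediction arrays below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n_instances = 10  # toy training-set size (assumption for the example)
indices = np.arange(n_instances)

# Bagging: draw with replacement, so an index can appear more than once.
bagging_sample = rng.choice(indices, size=n_instances, replace=True)

# Pasting: draw without replacement, so every drawn index is unique.
pasting_sample = rng.choice(indices, size=6, replace=False)

print("bagging:", np.sort(bagging_sample))
print("pasting:", np.sort(pasting_sample))

# Aggregation over made-up predictions from five predictors for one instance.
class_predictions = np.array([1, 0, 1, 1, 0])
regression_predictions = np.array([2.0, 2.5, 1.5, 2.0, 2.5])

majority_class = np.bincount(class_predictions).argmax()  # statistical mode
mean_prediction = regression_predictions.mean()           # average
print(majority_class, mean_prediction)
```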

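> Note: Each individual predictor has a higher bias than if it were trained on the original training set, but aggregation reduces both bias and variance. The net result is generally an ensemble with a similar bias but a lower variance than a single predictor trained on the original training set.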

Predictors for bagging/pasting can be trained in parallel, so these methods scale very well, which is one reason they are so popular.

### 7.2.1 Bagging and Pasting in Scikit-Learn

Scikit-Learn has `BaggingClassifier` and `BaggingRegressor`.

> Note: `BaggingClassifier` automatically performs soft voting instead of hard voting if the base classifier can estimate class probabilities.

In [4]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

In [5]:
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    max_samples=100, bootstrap=True, n_jobs=-1)
bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)
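
For pasting instead of bagging, set `bootstrap=False`. The sketch below is self-contained. The smaller `n_estimators`, the `random_state` and the names `paste_clf` and `paste_accuracy` are choices made for this example.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# bootstrap=False switches from bagging to pasting: each tree sees
# 100 distinct training instances (no instance is repeated within a sample).
paste_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=200,
    max_samples=100, bootstrap=False, random_state=42)
paste_clf.fit(X_train, y_train)
paste_accuracy = accuracy_score(y_test, paste_clf.predict(X_test))
print("pasting accuracy:", paste_accuracy)
```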

### 7.2.2 Out-of-Bag Evaluation

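With `bootstrap=True` (sampling with replacement) and the default `max_samples=1.0`, each predictor in a `BaggingClassifier` sees only about 63% of the training instances on average. The remaining 37%, which differ from one predictor to the next, are called *out-of-bag (oob)* instances.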
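
The 63% figure comes from the fact that, when drawing m instances with replacement from a set of size m, a given instance is never drawn with probability (1 - 1/m)^m, which approaches 1/e ≈ 0.37 as m grows. A quick numerical check follows (the helper name `expected_sampled_fraction` is made up for this sketch):

```python
import math
import numpy as np

def expected_sampled_fraction(m):
    """Expected fraction of distinct instances in a bootstrap sample of size m."""
    # Each instance is missed by one draw with probability (1 - 1/m),
    # so it is missed by all m draws with probability (1 - 1/m) ** m.
    return 1 - (1 - 1 / m) ** m

m = 375  # size of X_train above (75% of 500 instances)
rng = np.random.default_rng(42)
sample = rng.integers(0, m, size=m)  # one bootstrap sample of indices
observed_fraction = np.unique(sample).size / m

print("formula:  ", expected_sampled_fraction(m))
print("limit:    ", 1 - math.exp(-1))
print("one draw: ", observed_fraction)
```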

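Since a predictor never sees its oob instances during training, it can be evaluated on them without a separate validation set. Set `oob_score=True` when creating a `BaggingClassifier` to request an automatic oob evaluation after training.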

In [10]:
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    bootstrap=True, n_jobs=-1, oob_score=True)
bag_clf.fit(X_train, y_train)
bag_clf.oob_score_

0.904

In [11]:
from sklearn.metrics import accuracy_score

In [12]:
y_pred = bag_clf.predict(X_test)
accuracy_score(y_test, y_pred)

0.904

In [15]:
bag_clf.oob_decision_function_[:10] # Shorten output by taking 10 rows

array([[0.4137931 , 0.5862069 ],
       [0.37755102, 0.62244898],
       [1.        , 0.        ],
       [0.        , 1.        ],
       [0.00512821, 0.99487179],
       [0.1185567 , 0.8814433 ],
       [0.31188119, 0.68811881],
       [0.00540541, 0.99459459],
       [0.98324022, 0.01675978],
       [0.98543689, 0.01456311]])

## 7.3 Random Patches and Random Subspaces

`BaggingClassifier` can also sample the features, controlled by `max_features` and `bootstrap_features`.

*Random Patches* method - Sampling both training instances and features.  
*Random Subspaces* method - Keeping all training instances (`bootstrap=False`, `max_samples=1.0`) and sampling the features (`bootstrap_features=True`, `max_features<1.0`).
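
Both methods can be sketched with `BaggingClassifier`. The synthetic dataset, the sampling ratios and the variable names below are assumptions made for this example. The moons dataset has only two features, so a wider dataset is used here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 10 features so that feature sampling is meaningful.
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           random_state=42)

# Random Subspaces: every tree sees all instances but only half the features.
subspaces_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=50,
    bootstrap=False, max_samples=1.0,
    bootstrap_features=True, max_features=0.5, random_state=42)
subspaces_clf.fit(X, y)

# Random Patches: sample both the instances and the features.
patches_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=50,
    bootstrap=True, max_samples=0.5,
    bootstrap_features=True, max_features=0.5, random_state=42)
patches_clf.fit(X, y)

# estimators_features_ holds the feature indices drawn for each tree.
print(subspaces_clf.estimators_features_[0])
print(patches_clf.estimators_features_[0])
```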

## 7.4 Random Forests

A Random Forest is an ensemble of Decision Trees, generally trained with the bagging method. Prefer `RandomForestClassifier` (or `RandomForestRegressor`) over wrapping a `DecisionTreeClassifier` in a `BaggingClassifier`: it is more convenient and is optimized for Decision Trees.

In [16]:
from sklearn.ensemble import RandomForestClassifier

In [17]:
rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)
rnd_clf.fit(X_train, y_train)
y_pred_rf = rnd_clf.predict(X_test)

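With a few exceptions, a `RandomForestClassifier` has all the hyperparameters of a `DecisionTreeClassifier` (to control how the trees are grown) plus those of a `BaggingClassifier` (to control the ensemble itself).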

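Random Forests introduce extra randomness when growing trees: instead of searching for the very best feature when splitting a node, the algorithm searches for the best feature among a random subset of features. This greater tree diversity trades a higher bias for a lower variance.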

In [18]:
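# A BaggingClassifier configured like this is roughly equivalent
# to the RandomForestClassifier above.
# max_features="sqrt" makes each split consider a random subset of features;
# splitter="random" would instead pick random thresholds, as Extra-Trees do.

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(max_features="sqrt", max_leaf_nodes=16),
    n_estimators=500, max_samples=1.0, bootstrap=True, n_jobs=-1)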
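
One way to check the "roughly equivalent" claim is to train both ensembles and measure how often their predictions agree. This sketch is self-contained and uses a bag of trees that each consider a random feature subset at every split (`max_features="sqrt"`). The smaller `n_estimators`, the `random_state` values and the `agreement` variable are choices made for this example.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# A bag of trees that each consider a random subset of features per split.
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(max_features="sqrt", max_leaf_nodes=16),
    n_estimators=200, max_samples=1.0, bootstrap=True, random_state=42)
rnd_clf = RandomForestClassifier(
    n_estimators=200, max_leaf_nodes=16, random_state=42)

y_pred_bag = bag_clf.fit(X_train, y_train).predict(X_test)
y_pred_rf = rnd_clf.fit(X_train, y_train).predict(X_test)

# Fraction of test instances on which the two ensembles agree.
agreement = np.mean(y_pred_bag == y_pred_rf)
print("agreement:", agreement)
```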

### 7.4.1 Extra-Trees

*Extremely Randomized Trees (Extra-Trees)* ensemble - Using random thresholds for each feature rather than searching for the best possible thresholds.

Use `ExtraTreesClassifier` (or `ExtraTreesRegressor`) to build Extra-Trees. They take the same hyperparameters as their Random Forest counterparts, although `bootstrap` defaults to `False`.

> Note: Extra-Trees train much faster than regular Random Forests, because finding the best possible threshold for each feature at every node is one of the most time-consuming parts of growing a tree.

> Note: It is hard to tell in advance whether Extra-Trees will perform better (e.g. higher accuracy) than Random Forests. The best way is to try both and compare them using cross-validation.
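
Such a comparison could look like the following sketch. The dataset, the 5-fold setting and the names `models` and `cv_means` are assumptions made for this example. On another dataset the ranking may well be different.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)

models = {
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=42),
    "ExtraTrees": ExtraTreesClassifier(n_estimators=100, random_state=42),
}

cv_means = {}
for name, model in models.items():
    # 5-fold cross-validated accuracy for each ensemble
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    cv_means[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```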

### 7.4.2 Feature Importance

Scikit-Learn measures a feature's importance by looking at how much the tree nodes that use that feature reduce impurity (**recall: Gini impurity!**) on average (across all trees in the forest). It can be accessed using `feature_importances_`.

In [19]:
from sklearn.datasets import load_iris

In [20]:
iris = load_iris()
rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
rnd_clf.fit(iris["data"], iris["target"])
for name, score in zip(iris["feature_names"], rnd_clf.feature_importances_):
    print(name, score)

sepal length (cm) 0.10149417945091706
sepal width (cm) 0.02588642676804873
petal length (cm) 0.4421831688641625
petal width (cm) 0.4304362249168718


Random Forests are very handy to get a quick understanding of what features actually matter, in particular if you need to perform feature selection.
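
One possible way to turn these importances into feature selection is Scikit-Learn's `SelectFromModel`, sketched below. The `threshold="mean"` rule (keep features whose importance is at least the mean importance) and the variable names are choices made for this example, not a recommendation from the text above.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

iris = load_iris()
X, y = iris["data"], iris["target"]

# Keep the features whose importance is at least the mean importance.
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=200, random_state=42),
    threshold="mean")
selector.fit(X, y)

selected = [name for name, keep
            in zip(iris["feature_names"], selector.get_support()) if keep]
X_reduced = selector.transform(X)
print(selected, X_reduced.shape)
```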

## 7.5 Boosting

### 7.5.1 AdaBoost

### 7.5.2 Gradient Boosting

## 7.6 Stacking