In [1]:
import numpy as np
import pandas as pd

# Ensemble Learning

If you aggregate the predictions of a group of predictors (classifiers or regressors), you will often get better predictions than with the best individual predictor. A group of predictors is called an ensemble, the technique is called Ensemble Learning, and an Ensemble Learning algorithm is called an Ensemble method. 

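For example, you can train a group of Decision Tree classifiers, each on a different random subset of the training set, obtain the predictions of all the individual trees, and then predict the class that gets the most votes. Such an ensemble of Decision Trees is called a Random Forest. 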

###### Random Forest - One of the most powerful Machine Learning algorithms.

Ensemble methods are typically used near the end of a project, once a few good predictors have already been built, to combine them into an even better predictor. Winning solutions in ML competitions often involve several Ensemble methods. 

Bagging, boosting, stacking and others are discussed in this chapter.

# Voting Classifiers

For instance, we might have already trained a few classifiers. To create an even better classifier, we can aggregate the predictions of each and predict the class that gets the most votes. This majority-vote classifier is called a hard voting classifier. 

This voting classifier often achieves a higher accuracy than the best classifier in the ensemble. Even if each classifier is a weak learner (only slightly better than random guessing), the ensemble can still be a strong learner, provided there are enough weak learners and they are sufficiently diverse.

This is due to the law of large numbers: with 1,000 classifiers that are each correct only 51% of the time, a majority vote can reach roughly 75% accuracy, provided the classifiers are perfectly independent and make uncorrelated errors. In practice, since the classifiers are trained on the same dataset, they are likely to make the same types of errors, so there will be many majority votes for the wrong class, which reduces the ensemble's accuracy. 

Ensemble methods therefore work best when the predictors are as independent from one another as possible. One way to get diverse classifiers is to train them using very different algorithms, which increases the chance that they make very different types of errors. 
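
The 51% example can be checked with a short binomial calculation. This is only a sketch under the idealized assumption that the classifiers are fully independent; the function name `majority_vote_accuracy` is made up for illustration. It sums the probability that a strict majority of the classifiers is correct, which comes out at about 0.73 for 1,000 classifiers, in line with the rough figure quoted above, and keeps climbing as more classifiers are added.

```python
import math

def majority_vote_accuracy(n_classifiers: int, p_correct: float) -> float:
    """Probability that a strict majority of independent classifiers is correct."""
    log_p, log_q = math.log(p_correct), math.log(1.0 - p_correct)
    total = 0.0
    for k in range(n_classifiers // 2 + 1, n_classifiers + 1):
        # Log of the binomial pmf, computed with lgamma to avoid overflow
        log_pmf = (math.lgamma(n_classifiers + 1) - math.lgamma(k + 1)
                   - math.lgamma(n_classifiers - k + 1)
                   + k * log_p + (n_classifiers - k) * log_q)
        total += math.exp(log_pmf)
    return total

acc_1000 = majority_vote_accuracy(1000, 0.51)
acc_10000 = majority_vote_accuracy(10000, 0.51)
print(f"1,000 classifiers:  {acc_1000:.3f}")
print(f"10,000 classifiers: {acc_10000:.3f}")
```

Correlated errors break the independence assumption, so real ensembles trained on the same data gain less than this calculation suggests.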

In [2]:
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_moons

# Generate X, y from the moons dataset
X, y = make_moons(n_samples=500, noise=0.30, random_state=24)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=24)

In [3]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

log_clf = LogisticRegression(solver="lbfgs", random_state=24)
rnd_clf = RandomForestClassifier(n_estimators=100, random_state=24)
svm_clf = SVC(gamma="scale", random_state=24)

voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='hard')

voting_clf.fit(X_train, y_train)

VotingClassifier(estimators=[('lr', LogisticRegression(random_state=24)),
                             ('rf', RandomForestClassifier(random_state=24)),
                             ('svc', SVC(random_state=24))])

In [4]:
# Each classifier's accuracy score

from sklearn.metrics import accuracy_score

for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))
    
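# The voting classifier usually matches or beats the individual classifiers;
# on this particular split the random forest scores slightly higher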

LogisticRegression 0.848
RandomForestClassifier 0.904
SVC 0.896
VotingClassifier 0.896


Soft voting - if all classifiers can estimate class probabilities (i.e., they have a predict_proba() method), Scikit-Learn can predict the class with the highest class probability, averaged over all the individual classifiers. 

It often achieves higher performance than hard voting because it gives more weight to highly confident votes.  
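
A small worked example of how the two voting schemes can disagree. The probabilities below are invented for illustration, not taken from the classifiers in this notebook: two classifiers weakly prefer class 0 and one strongly prefers class 1.

```python
# Hypothetical predict_proba outputs of three classifiers for ONE instance:
# each row is [P(class 0), P(class 1)].
probas = [
    [0.55, 0.45],  # classifier A: weakly prefers class 0
    [0.60, 0.40],  # classifier B: weakly prefers class 0
    [0.05, 0.95],  # classifier C: strongly prefers class 1
]

n_classes = len(probas[0])

# Hard voting: each classifier votes for its most probable class,
# and the class with the most votes wins.
votes = [max(range(n_classes), key=lambda c: p[c]) for p in probas]
hard_prediction = max(range(n_classes), key=votes.count)

# Soft voting: average the probabilities per class, then take the argmax.
mean_proba = [sum(p[c] for p in probas) / len(probas) for c in range(n_classes)]
soft_prediction = max(range(n_classes), key=lambda c: mean_proba[c])

print("hard vote:", hard_prediction)
print("soft vote:", soft_prediction, mean_proba)
```

Hard voting picks class 0 (two votes to one), while soft voting picks class 1 because the single confident vote dominates the averaged probabilities.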

In [5]:
# Soft voting needs predict_proba(): for SVC, set probability=True
# (this uses cross-validation internally and slows down training)
svm_clf_proba = SVC(gamma='scale', probability=True, random_state=24)

voting_clf_soft = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf_proba)],
    voting='soft')

for clf in (log_clf, rnd_clf, svm_clf_proba, voting_clf_soft):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.848
RandomForestClassifier 0.904
SVC 0.896
VotingClassifier 0.896


# Bagging and Pasting

One way to get diverse classifiers is to use different training algorithms. Another is to use the same algorithm but train predictors on different subsets of the training set. 

Both bagging and pasting allow training instances to be sampled several times across multiple predictors, but only bagging allows the instances to be sampled several times by the same predictor. 


Bagging (short for bootstrap aggregating) - sampling with replacement (bootstrapping)<br>
Pasting - sampling without replacement (the name comes from Breiman's paper "Pasting Small Votes for Classification in Large Databases and On-Line")
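
The difference between the two sampling schemes can be sketched with the standard library alone. The ten indices and the sample size of six are hypothetical stand-ins for a training set and `max_samples`.

```python
import random

random.seed(24)
training_indices = list(range(10))  # stand-in for 10 training instances
sample_size = 6                     # hypothetical max_samples

# Bagging: sample WITH replacement, so one predictor may see an instance twice.
bagging_sample = random.choices(training_indices, k=sample_size)

# Pasting: sample WITHOUT replacement, so each instance appears at most once
# in a given predictor's sample (it may still reappear for other predictors).
pasting_sample = random.sample(training_indices, k=sample_size)

print("bagging:", sorted(bagging_sample))
print("pasting:", sorted(pasting_sample))
```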


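Once all predictors are trained, the ensemble predicts for a new instance by aggregating their predictions: the statistical mode (the most frequent prediction, as in hard voting) for classification, or the average for regression. Each individual predictor has a higher bias than if it were trained on the full training set, but aggregation reduces both bias and variance. Generally, the ensemble ends up with a similar bias but a lower variance than a single predictor trained on the original training set.

Predictors can all be trained in parallel, and predictions can be made in parallel as well. This is one reason bagging and pasting are so popular: they scale very well. 

Bootstrapping introduces a bit more diversity into the subsets each predictor is trained on, so bagging ends up with a slightly higher bias than pasting, but the predictors are also less correlated, so the ensemble's variance is lower. Overall, bagging tends to result in better models. Cross-validation can also be performed to compare the two. 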

In [6]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# BaggingClassifier automatically performs soft voting when the base
# classifier has a predict_proba() method; set bootstrap=False for pasting.
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    max_samples=100, bootstrap=True, n_jobs=-1)

bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)

In [7]:
# Out-of-bag (OOB) evaluation
# With bootstrapping, each predictor never sees about 37% of the training instances on average
# (a different 37% for each predictor), so these instances can serve as a validation set.

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    bootstrap=True, n_jobs=-1, oob_score=True)  # oob_score=True requests an automatic OOB evaluation

bag_clf.fit(X_train, y_train)
bag_clf.oob_score_  # OOB estimate of the accuracy to expect on the test set

0.8853333333333333

In [8]:
y_pred = bag_clf.predict(X_test)
accuracy_score(y_test, y_pred) # Close enough

0.912
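
Where the 37% figure comes from: when m instances are drawn with replacement from a training set of size m, a given instance is missed with probability (1 - 1/m)^m, which approaches exp(-1) ≈ 0.37 as m grows. The sketch below checks this analytically and with a small simulation. The value m = 375 is an assumption (75% of the 500 moons samples, the default train/test split).

```python
import math
import random

m = 375  # hypothetical training-set size

# Probability that a given instance is never drawn in m draws with replacement
p_never_drawn = (1 - 1 / m) ** m
print(f"analytic:  {p_never_drawn:.4f}  (limit exp(-1) = {math.exp(-1):.4f})")

# Empirical check: draw bootstrap samples and count the unseen instances
random.seed(24)
n_trials = 200
oob_fractions = []
for _ in range(n_trials):
    drawn = set(random.choices(range(m), k=m))
    oob_fractions.append(1 - len(drawn) / m)
mean_oob = sum(oob_fractions) / n_trials
print(f"simulated: {mean_oob:.4f}")
```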

## Random Patches and Random Subspaces

BaggingClassifier also supports sampling the features, in addition to the instances. Feature sampling is controlled by the max_features and bootstrap_features hyperparameters, which work like max_samples and bootstrap but apply to features. Each predictor is then trained on a random subset of the input features. This is useful with high-dimensional inputs (such as images). 

Random Patches - sampling both training instances and features <br>
Random Subspaces - keeping all training instances and sampling only the features. This is done by setting bootstrap=False and max_samples=1.0 while sampling features (bootstrap_features=True and/or max_features smaller than 1.0)

Result - sampling features creates even more predictor diversity, trading a bit more bias for a lower variance. 

# Random Forest

An ensemble of Decision Trees, generally trained via bagging (or sometimes pasting), typically with max_samples set to the size of the training set. 

The Random Forest algorithm introduces extra randomness when growing trees: instead of searching for the very best feature when splitting a node, it searches for the best feature among a random subset of features. This results in greater tree diversity, trading a higher bias for a lower variance, which generally yields a better overall model. 

Another useful quality of Random Forests is that they make it easy to measure the relative importance of each feature. Scikit-Learn measures a feature's importance by looking at how much the tree nodes that use that feature reduce impurity on average (across all trees in the forest). More precisely, it is a weighted average, where each node's weight is equal to the number of training samples associated with it. Scikit-Learn computes this score automatically after training, then scales the results so that the importances sum to 1. They are accessible via the feature_importances_ attribute. 
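
To make the weighted impurity reduction concrete, here is a hand calculation for a single node, using Gini impurity and the weighted impurity decrease as it is commonly formulated. The class counts are invented for illustration. A feature's importance is then, roughly, the sum of these decreases over all nodes that split on it, averaged across trees and normalized.

```python
def gini(class_counts):
    """Gini impurity of a node given its per-class sample counts."""
    total = sum(class_counts)
    return 1.0 - sum((count / total) ** 2 for count in class_counts)

# Hypothetical node: 100 samples, split on some feature into two children.
parent = [50, 50]
left, right = [36, 4], [14, 46]
n_total = 100  # samples in the tree's whole training set (hypothetical)

n_parent, n_left, n_right = sum(parent), sum(left), sum(right)

# Weighted impurity decrease credited to the feature used at this node:
# the node's share of the samples times the drop in impurity from the split.
impurity_decrease = (n_parent / n_total) * (
    gini(parent)
    - (n_left / n_parent) * gini(left)
    - (n_right / n_parent) * gini(right)
)
print(f"impurity decrease: {impurity_decrease:.4f}")
```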

In [9]:
from sklearn.ensemble import RandomForestClassifier

rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)
rnd_clf.fit(X_train, y_train)

y_pred_rf = rnd_clf.predict(X_test)

# Roughly equivalent BaggingClassifier: each tree considers a random subset
# of the features at every split
bag_rnd_clf = BaggingClassifier(
    DecisionTreeClassifier(max_features='sqrt', max_leaf_nodes=16),
    n_estimators=500, max_samples=1.0, bootstrap=True, n_jobs=-1)

In [10]:
# Feature Importance

from sklearn.datasets import load_iris

iris = load_iris()
rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
rnd_clf.fit(iris['data'], iris['target'])

for name, score in zip(iris['feature_names'], rnd_clf.feature_importances_):
    print(name, score)
    
# Very handy to know what features matter, if feature selection is needed

sepal length (cm) 0.09455583982195351
sepal width (cm) 0.02334699878827842
petal length (cm) 0.45526036950717214
petal width (cm) 0.426836791882596


## Boosting (hypothesis boosting)

An Ensemble method that combines several weak learners into a strong learner. The general idea of most boosting methods is to train predictors sequentially, each trying to correct its predecessor. Many boosting methods exist. 

The most popular are AdaBoost (Adaptive Boosting) and Gradient Boosting. 

### AdaBoost 

One way for a new predictor to correct its predecessor is to pay more attention to the training instances that the predecessor underfit. This results in new predictors focusing more and more on the hard cases. 

To build an AdaBoost classifier, a base classifier (such as a Decision Tree) is first trained and used to make predictions on the training set. The relative weight of the misclassified training instances is then increased. A second classifier is trained using the updated weights, it again makes predictions on the training set, the weights are updated, and so on. 

This sequential learning technique has some similarities with Gradient Descent, except that instead of tweaking a single predictor's parameters to minimize a cost function, AdaBoost adds predictors to the ensemble, gradually making it better. 

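Once all predictors are trained, the ensemble makes predictions much like bagging or pasting, except that each predictor has a different weight depending on its overall accuracy on the weighted training set. One important drawback of this sequential technique is that training cannot be parallelized (or only partially), since each predictor can only be trained after the previous one has been trained and evaluated. As a result, it does not scale as well as bagging or pasting.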
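
One round of the weight update can be worked through by hand. This sketch follows the classic AdaBoost formulation (weighted error rate, predictor weight, exponential boost of misclassified instances, normalization); Scikit-Learn's implementation differs in some details. The five instances and the single misclassification are invented for illustration.

```python
import math

learning_rate = 1.0
weights = [0.2, 0.2, 0.2, 0.2, 0.2]                 # initial instance weights, 1/m each
misclassified = [False, False, True, False, False]  # hypothetical first-round errors

# Weighted error rate of the predictor
error_rate = (sum(w for w, wrong in zip(weights, misclassified) if wrong)
              / sum(weights))

# Predictor weight: the more accurate the predictor, the larger its weight
alpha = learning_rate * math.log((1 - error_rate) / error_rate)

# Boost the weights of the misclassified instances, then normalize
boosted = [w * math.exp(alpha) if wrong else w
           for w, wrong in zip(weights, misclassified)]
total = sum(boosted)
new_weights = [w / total for w in boosted]

print(f"error rate = {error_rate:.2f}, alpha = {alpha:.3f}")
print([round(w, 3) for w in new_weights])
```

After this round the single misclassified instance carries half of the total weight, so the next predictor is pushed to get it right.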

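Scikit-Learn uses a multiclass version of AdaBoost called SAMME (with just two classes, SAMME is equivalent to AdaBoost). If the predictors can estimate class probabilities (i.e., they have a predict_proba() method), Scikit-Learn can use a variant called SAMME.R (the R stands for "Real"), which relies on class probabilities rather than predictions and generally performs better. 

If AdaBoost is overfitting the training set, try reducing the number of estimators or regularizing the base estimator more strongly. 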

In [11]:
from sklearn.ensemble import AdaBoostClassifier

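# Note: recent scikit-learn releases deprecate algorithm='SAMME.R';
# if it is rejected, use algorithm='SAMME' instead
ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=200,
    algorithm='SAMME.R', learning_rate=0.5)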

ada_clf.fit(X_train, y_train)

AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=1),
                   learning_rate=0.5, n_estimators=200)

### Gradient Boosting

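Like AdaBoost, Gradient Boosting works by sequentially adding predictors to an ensemble, each one correcting its predecessor. However, instead of tweaking the instance weights at every iteration, it tries to fit the new predictor to the residual errors made by the previous one. 

To train a Gradient Tree Boosting ensemble (Gradient Boosted Regression Trees, GBRT) by hand: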

In [12]:
# Noisy quadratic training set (taken from the official notebook)

np.random.seed(42)
X = np.random.rand(100, 1) - 0.5
y = 3*X[:, 0]**2 + 0.05 * np.random.randn(100)

In [13]:
# Gradient boosting by hand: fit each new tree to the residual errors of the previous one

from sklearn.tree import DecisionTreeRegressor

tree_reg1 = DecisionTreeRegressor(max_depth=2)
tree_reg1.fit(X, y)

y2 = y - tree_reg1.predict(X)
tree_reg2 = DecisionTreeRegressor(max_depth=2)
tree_reg2.fit(X, y2)

y3 = y2 - tree_reg2.predict(X)
tree_reg3 = DecisionTreeRegressor(max_depth=2)
tree_reg3.fit(X, y3)

X_new = np.array([[0.8]])

# The ensemble of 3 trees makes predictions by adding up the predictions of all 3 trees
y_pred = sum(tree.predict(X_new) for tree in (tree_reg1, tree_reg2, tree_reg3))

In [14]:
# Simpler way: GradientBoostingRegressor builds the same kind of ensemble
from sklearn.ensemble import GradientBoostingRegressor

gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=3, learning_rate=1.0)
gbrt.fit(X, y)

GradientBoostingRegressor(learning_rate=1.0, max_depth=2, n_estimators=3)

The learning_rate hyperparameter scales the contribution of each tree. A low value (such as 0.1) means more trees are needed to fit the training set, but the predictions will usually generalize better. This regularization technique is called shrinkage. 
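
The effect of shrinkage on the training fit can be seen with a tiny pure-Python booster. This is a simplified sketch, not Scikit-Learn's implementation: it uses depth-1 stumps, squared error, and starts from a prediction of zero rather than the mean. All names (`fit_stump`, `boost`, `xs_train`, `ys_train`) are made up for illustration. It only shows that a low learning rate needs more trees to fit the training set; the generalization benefit would have to be checked on a validation set.

```python
import random

random.seed(42)
# Noisy quadratic data, similar in spirit to the dataset used in these notes
xs_train = [random.random() - 0.5 for _ in range(60)]
ys_train = [3 * x ** 2 + random.gauss(0, 0.05) for x in xs_train]

def fit_stump(xs, targets):
    """Fit a depth-1 regression tree: the threshold minimizing squared error."""
    best = None
    for threshold in sorted(set(xs))[1:]:
        left = [t for x, t in zip(xs, targets) if x < threshold]
        right = [t for x, t in zip(xs, targets) if x >= threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x < threshold else right_mean

def boost(xs, ys, n_trees, learning_rate):
    """Boost stumps on the residuals and return the final training MSE."""
    predictions = [0.0] * len(xs)
    for _ in range(n_trees):
        residuals = [t - p for t, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        # Shrinkage: only a fraction of each stump's output is added
        predictions = [p + learning_rate * stump(x) for p, x in zip(predictions, xs)]
    return sum((t - p) ** 2 for t, p in zip(ys, predictions)) / len(ys)

for lr in (1.0, 0.1):
    print(f"learning_rate={lr}: train MSE after 3 trees = "
          f"{boost(xs_train, ys_train, 3, lr):.4f}, "
          f"after 30 trees = {boost(xs_train, ys_train, 30, lr):.4f}")
```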

To find the optimal number of trees, you can use early stopping. A simple way to implement it is the staged_predict() method, which returns an iterator over the predictions made by the ensemble at each stage of training (with one tree, two trees, and so on). 

In [15]:
# This trains a large number of trees before looking back to find the optimal number

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X_train, X_val, y_train, y_val = train_test_split(X, y)

gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=120)
gbrt.fit(X_train, y_train)

errors = [mean_squared_error(y_val, y_pred)
          for y_pred in gbrt.staged_predict(X_val)]
best_n_estimators = int(np.argmin(errors)) + 1  # index 0 corresponds to an ensemble of 1 tree

gbrt_best = GradientBoostingRegressor(max_depth=2, n_estimators=best_n_estimators)
gbrt_best.fit(X_train, y_train)

GradientBoostingRegressor(max_depth=2, n_estimators=84)

In [16]:
# Actually stopping training early - stops when validation error does not improve for five iterations in a row

gbrt = GradientBoostingRegressor(max_depth=2, warm_start=True)

min_val_error = float('inf')
error_going_up = 0

for n_estimators in range(1, 120):
    gbrt.n_estimators = n_estimators
    gbrt.fit(X_train, y_train)
    y_pred = gbrt.predict(X_val)
    val_error = mean_squared_error(y_val, y_pred)
    if val_error < min_val_error:
        min_val_error = val_error
        error_going_up = 0
    else:
        error_going_up += 1
        if error_going_up == 5:
            break


Stochastic Gradient Boosting - setting the subsample hyperparameter so that each tree is trained on a randomly selected fraction of the training instances (for example, subsample=0.25 uses 25% of the instances per tree). This trades a higher bias for a lower variance and also speeds up training. 

XGBoost (Extreme Gradient Boosting) - a Python library with an optimized implementation of Gradient Boosting. It is extremely fast, scalable, and portable, and is often an important component of the winning entries in ML competitions. 

In [17]:
import xgboost

xgb_reg = xgboost.XGBRegressor()
xgb_reg.fit(X_train, y_train)
y_pred = xgb_reg.predict(X_val)

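# XGBoost can handle early stopping itself: training stops once the validation error stops improving.
# Note: recent XGBoost releases expect early_stopping_rounds in the XGBRegressor constructor rather than in fit().
xgb_reg.fit(X_train, y_train,
            eval_set=[(X_val, y_val)], early_stopping_rounds=2)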
y_pred = xgb_reg.predict(X_val)

ModuleNotFoundError: No module named 'xgboost'

# Stacking (stacked generalization)

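Instead of using trivial functions (like hard voting) to aggregate the predictions of all predictors in an ensemble, why not train a model to perform the aggregation? Each predictor predicts a value, and these predictions are passed as inputs to a final predictor, called the blender (or meta learner), which produces the final prediction. 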

A common approach is to use a hold-out set. We split the training set into two subsets and use the first to train the predictors in the first layer. These predictors then make predictions on the second (held-out) subset, which they never saw during training. A new training set can thus be created, using these predictions as input features and keeping the target values. The blender is trained on this new training set, learning to predict the target value given the first layer's predictions. We can also train several layers of blenders by splitting the training set into more subsets, one per layer. 
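
The hold-out procedure can be sketched end to end in plain Python. Everything here is an illustrative assumption: the synthetic data, the two deliberately simple first-layer predictors (a constant mean and a 1-nearest-neighbor regressor), and a linear blender fitted by least squares.

```python
import random

random.seed(24)
# Hypothetical 1-D regression data: y = 3x^2 + noise
xs = [random.uniform(-0.5, 0.5) for _ in range(200)]
data = [(x, 3 * x * x + random.gauss(0, 0.05)) for x in xs]
layer1_set, hold_out = data[:100], data[100:]

# Layer 1: two simple predictors, trained on the first subset only
mean_y = sum(y for _, y in layer1_set) / len(layer1_set)

def predict_mean(x):
    return mean_y

def predict_nearest(x):
    # 1-nearest-neighbor regressor over the first subset
    return min(layer1_set, key=lambda point: abs(point[0] - x))[1]

# Blender training set: first-layer predictions on the hold-out set, plus targets
features = [(predict_mean(x), predict_nearest(x)) for x, _ in hold_out]
targets = [y for _, y in hold_out]

# Blender: linear model w1*f1 + w2*f2, fitted by least squares (2x2 normal equations)
a11 = sum(f1 * f1 for f1, _ in features)
a12 = sum(f1 * f2 for f1, f2 in features)
a22 = sum(f2 * f2 for _, f2 in features)
b1 = sum(f1 * t for (f1, _), t in zip(features, targets))
b2 = sum(f2 * t for (_, f2), t in zip(features, targets))
det = a11 * a22 - a12 * a12
w1 = (b1 * a22 - a12 * b2) / det
w2 = (a11 * b2 - a12 * b1) / det

def predict_stacked(x):
    return w1 * predict_mean(x) + w2 * predict_nearest(x)

def mse(predict, dataset):
    return sum((predict(x) - y) ** 2 for x, y in dataset) / len(dataset)

for name, fn in [("mean", predict_mean), ("nearest", predict_nearest),
                 ("stacked", predict_stacked)]:
    print(f"{name:8s} hold-out MSE = {mse(fn, hold_out):.5f}")
```

The blender cannot do worse than either first-layer predictor on the hold-out set, since it was fitted on that set. An honest comparison needs a separate test set that neither layer has seen.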

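When these notes were written, Scikit-Learn did not support stacking directly. Since version 0.22 it provides StackingClassifier and StackingRegressor in sklearn.ensemble. Stacking is also not hard to implement yourself, and open-source implementations such as brew are available. 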