# Ensemble Learning and Random Forests
### Voting Classifiers

Suppose you have a collection of diverse classifiers trained on the same data. Aggregating their predictions often gives better accuracy than the best individual classifier in the ensemble. Even a large number of weak classifiers (each barely better than random guessing) can give accurate predictions when combined, provided there are enough of them and their errors are sufficiently independent.
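To get a feel for why many weak classifiers can add up to a strong one, the sketch below computes the probability that a strict majority of voters is right. It assumes every classifier is correct with the same probability and that their errors are fully independent, which real classifiers trained on the same data never quite satisfy, so treat the numbers as an upper bound. The function name `majority_vote_accuracy` is made up for this illustration.

```python
import math


def majority_vote_accuracy(n_classifiers, p_correct):
    """Probability that a strict majority of independent voters is correct."""
    n, p = n_classifiers, p_correct
    total = 0.0
    # Sum the binomial probabilities of k correct votes for every k > n / 2.
    # Working in log space avoids overflow in the binomial coefficients.
    for k in range(n // 2 + 1, n + 1):
        log_pmf = (
            math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p)
        )
        total += math.exp(log_pmf)
    return total


# Each voter is right only 51% of the time (an assumed value).
for n in (1, 100, 1000, 10000):
    print(n, round(majority_vote_accuracy(n, 0.51), 3))
```

The accuracy of the majority vote climbs steadily with the number of voters, but only because of the independence assumption; correlated errors reduce the gain.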

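Hard voting - each classifier casts one vote and the class with the most votes wins, i.e. if there are five models and three predict the same class for a given instance, that class is the ensemble's prediction.

Soft voting - the predicted class probabilities are averaged across the classifiers and the class with the highest average probability wins. This requires every classifier to be able to estimate class probabilities.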
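The two voting rules can disagree. The sketch below applies both by hand to a single instance, using made-up probabilities from three hypothetical classifiers; it is only meant to illustrate the arithmetic, not how `VotingClassifier` is implemented internally.

```python
import numpy as np

# Hypothetical predicted probabilities for ONE instance from three classifiers.
# Columns are classes 0 and 1; the numbers are invented for illustration.
probas = np.array([
    [0.55, 0.45],   # classifier A leans towards class 0, weakly
    [0.60, 0.40],   # classifier B leans towards class 0, weakly
    [0.05, 0.95],   # classifier C is very confident in class 1
])

# Hard voting: each classifier casts one vote for its most likely class.
votes = probas.argmax(axis=1)
hard_winner = np.bincount(votes, minlength=probas.shape[1]).argmax()

# Soft voting: average the probabilities, then pick the most likely class.
mean_proba = probas.mean(axis=0)
soft_winner = mean_proba.argmax()

print("hard:", hard_winner, "soft:", soft_winner, "mean proba:", mean_proba)
```

Here hard voting picks class 0 (two votes to one), while soft voting picks class 1 because the one confident classifier outweighs the two hesitant ones.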

In [2]:
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [3]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

log_clf = LogisticRegression(solver="lbfgs", random_state=42)
rnd_clf = RandomForestClassifier(n_estimators=100, random_state=42)
svm_clf = SVC(gamma="scale", random_state=42)

voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='hard')
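# Hard voting = whichever class gets the majority of votes wins.
# Soft voting = the class with the highest average predicted probability wins
# (every estimator needs predict_proba, so SVC would need probability=True).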
voting_clf.fit(X_train, y_train)

VotingClassifier(estimators=[('lr', LogisticRegression(random_state=42)),
                             ('rf', RandomForestClassifier(random_state=42)),
                             ('svc', SVC(random_state=42))])

In [4]:
from sklearn.metrics import accuracy_score

for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.864
RandomForestClassifier 0.896
SVC 0.896
VotingClassifier 0.912


### Bagging and Pasting

Bagging and pasting are similar methods: both train the same algorithm on many random subsets of the same training set. They scale well because the predictors can be trained in parallel.

Bagging (short for bootstrap aggregating) - sampling with replacement.

Pasting - sampling without replacement.
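The only difference between the two is how the subset for each predictor is drawn. A minimal sketch with NumPy, using an arbitrary toy size of 10 instances and 6 samples per predictor (both values are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
n_instances, max_samples = 10, 6

# Bagging: draw indices WITH replacement, so duplicates are possible.
bagging_idx = rng.choice(n_instances, size=max_samples, replace=True)

# Pasting: draw indices WITHOUT replacement, so every index is unique.
pasting_idx = rng.choice(n_instances, size=max_samples, replace=False)

print("bagging:", np.sort(bagging_idx))
print("pasting:", np.sort(pasting_idx))
```

In Scikit-Learn's `BaggingClassifier`, setting `bootstrap=False` switches from bagging to pasting.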

In [5]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500, max_samples=100, bootstrap=True, n_jobs=-1)
bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)

### Out-of-Bag (OOB) Evaluation

With bagging, because instances are sampled with replacement, some training instances are sampled several times for a given predictor while others are not sampled at all. The latter are that predictor's out-of-bag (OOB) instances.

Since a predictor never sees its OOB instances during training, they can be used to evaluate it without the need for a separate validation set.
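How many instances end up out-of-bag? Assuming each predictor draws as many samples as there are training instances (m draws from m instances), the chance that a given instance is never drawn is (1 - 1/m)^m, which approaches 1/e, or roughly 37%, as m grows. The sketch below checks this with a small simulation; `oob_fraction` is a hypothetical helper written for this note.

```python
import random


def oob_fraction(m, seed=0):
    """Simulate one bootstrap sample of size m and return the OOB fraction."""
    rng = random.Random(seed)
    # Set of distinct indices that were drawn at least once.
    sampled = {rng.randrange(m) for _ in range(m)}
    return 1 - len(sampled) / m


m = 10_000
expected = (1 - 1 / m) ** m  # probability a given instance is never drawn
print("expected:", round(expected, 3), "simulated:", round(oob_fraction(m), 3))
```

So each predictor typically has a sizeable held-out set for free, which is what `oob_score=True` takes advantage of.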

In [6]:
bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500, bootstrap=True, n_jobs=-1, oob_score=True)
bag_clf.fit(X_train, y_train)
bag_clf.oob_score_

0.896

In [7]:
from sklearn.metrics import accuracy_score
y_pred = bag_clf.predict(X_test)
accuracy_score(y_test, y_pred) # Should be pretty close to the oob score.

0.912

### Random Forests

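A random forest is essentially an ensemble of decision trees trained via bagging, with one extra source of randomness: each split considers only a random subset of the features.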

In [8]:
from sklearn.ensemble import RandomForestClassifier

rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)
rnd_clf.fit(X_train, y_train)

y_pred_rf = rnd_clf.predict(X_test)

In [9]:
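# This code creates a model roughly equivalent to the RandomForestClassifier above.
# max_features="sqrt" makes each split consider a random subset of the features,
# which matches the default behaviour of RandomForestClassifier.
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(max_features="sqrt", max_leaf_nodes=16),
    n_estimators=500, max_samples=1.0, bootstrap=True, n_jobs=-1)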

In [10]:
# Scikit-Learn's random forests automatically compute the importance of each feature in the training set.
from sklearn.datasets import load_iris
iris = load_iris()
rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
rnd_clf.fit(iris["data"], iris["target"])
for name, score in zip(iris["feature_names"], rnd_clf.feature_importances_):
    print(name, score)

sepal length (cm) 0.09300684177935435
sepal width (cm) 0.025287631615271887
petal length (cm) 0.4349688801089876
petal width (cm) 0.44673664649638634


### Boosting

Unlike the parallel methods above, boosting refers to a variety of sequential techniques where each predictor builds on the results of the previous one.

AdaBoost works by increasing the relative weight of the training instances that the previous predictor got wrong (underfitted), so that the next predictor focuses more on those hard cases.
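One round of that reweighting can be sketched as follows. This uses the common textbook formulation (weighted error rate, then a predictor weight, then a multiplicative boost for misclassified instances); it is an illustration under that assumption and not Scikit-Learn's exact implementation. The helper name `adaboost_reweight` is hypothetical.

```python
import numpy as np


def adaboost_reweight(weights, misclassified, learning_rate=1.0):
    """Return updated instance weights and the predictor weight for one round.

    Assumes the weighted error rate is strictly between 0 and 1.
    """
    weights = np.asarray(weights, dtype=float)
    misclassified = np.asarray(misclassified, dtype=bool)
    # Weighted error rate of the current predictor.
    r = weights[misclassified].sum() / weights.sum()
    # Predictor weight: the more accurate the predictor, the larger it is.
    alpha = learning_rate * np.log((1 - r) / r)
    # Boost the weights of the misclassified instances, then renormalize.
    new_weights = np.where(misclassified, weights * np.exp(alpha), weights)
    return new_weights / new_weights.sum(), alpha


# Five instances with equal weights; the predictor got only the third one wrong.
weights = np.full(5, 1 / 5)
misclassified = [False, False, True, False, False]
new_weights, alpha = adaboost_reweight(weights, misclassified)
print("alpha:", alpha)
print("new weights:", new_weights)
```

After the update the single misclassified instance carries half of the total weight, so the next predictor is strongly pushed to get it right.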

Gradient boosting works by fitting the _residual errors_ of the previous predictor.

In [11]:
from sklearn.ensemble import AdaBoostClassifier

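# Note: algorithm="SAMME.R" is deprecated in recent scikit-learn releases;
# drop the argument if your version rejects it.
ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=200,
    algorithm="SAMME.R", learning_rate=0.5)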
ada_clf.fit(X_train, y_train)

AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=1),
                   learning_rate=0.5, n_estimators=200)

In [12]:
# Gradient boosting example:
import numpy as np
X = np.random.rand(100, 1) - 0.5
y = 3*X[:, 0]**2 + 0.05 * np.random.randn(100)

from sklearn.tree import DecisionTreeRegressor

tree_reg1 = DecisionTreeRegressor(max_depth=2)
tree_reg1.fit(X, y)

DecisionTreeRegressor(max_depth=2)

In [13]:
y2 = y - tree_reg1.predict(X)
tree_reg2 = DecisionTreeRegressor(max_depth=2)
tree_reg2.fit(X, y2)

DecisionTreeRegressor(max_depth=2)

In [14]:
y3 = y2 - tree_reg2.predict(X)  # residuals left over by the second tree
tree_reg3 = DecisionTreeRegressor(max_depth=2)
tree_reg3.fit(X, y3)

DecisionTreeRegressor(max_depth=2)

In [15]:
X_new = np.array([[0.8]])
y_pred_gb = sum(tree.predict(X_new) for tree in (tree_reg1, tree_reg2, tree_reg3))

In [16]:
# Of course, in practice there is a Scikit-Learn implementation we can use.
from sklearn.ensemble import GradientBoostingRegressor

gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=3, learning_rate=1.0)
gbrt.fit(X, y)

GradientBoostingRegressor(learning_rate=1.0, max_depth=2, n_estimators=3)

In [17]:
# To find the optimal number of trees, we can check the validation MSE at each stage of training.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X_train, X_val, y_train, y_val = train_test_split(X, y)

gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=120)
gbrt.fit(X_train, y_train)

errors = [mean_squared_error(y_val, y_pred)
          for y_pred in gbrt.staged_predict(X_val)]
bst_n_estimators = np.argmin(errors) + 1

gbrt_best = GradientBoostingRegressor(max_depth=2, n_estimators=bst_n_estimators)
gbrt_best.fit(X_train, y_train)

GradientBoostingRegressor(max_depth=2, n_estimators=35)

In [18]:
# With warm_start=True, fit() keeps the existing trees and only adds new ones, so we can
# train incrementally and actually stop once the validation error has not improved for 5 iterations in a row.
gbrt = GradientBoostingRegressor(max_depth=2, warm_start=True, random_state=42)

min_val_error = float("inf")
error_going_up = 0
for n_estimators in range(1, 120):
    gbrt.n_estimators = n_estimators
    gbrt.fit(X_train, y_train)
    y_pred = gbrt.predict(X_val)
    val_error = mean_squared_error(y_val, y_pred)
    if val_error < min_val_error:
        min_val_error = val_error
        error_going_up = 0
    else:
        error_going_up += 1
        if error_going_up == 5:
            break  # early stopping

In [19]:
gbrt.n_estimators

40
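The manual loop above can also be replaced by the early stopping built into `GradientBoostingRegressor`: when `n_iter_no_change` is set, the estimator holds out `validation_fraction` of the training data and stops once the validation score stops improving. The sketch below regenerates a similar quadratic dataset (the variable names `X_demo`, `y_demo` and `gbrt_es` and the specific parameter values are assumptions for illustration), so it runs on its own.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Noisy quadratic data, similar in spirit to the dataset used above.
rng = np.random.default_rng(42)
X_demo = rng.random((200, 1)) - 0.5
y_demo = 3 * X_demo[:, 0] ** 2 + 0.05 * rng.standard_normal(200)

gbrt_es = GradientBoostingRegressor(
    max_depth=2,
    n_estimators=500,         # upper bound on the number of trees
    n_iter_no_change=5,       # stop after 5 iterations without improvement
    validation_fraction=0.2,  # share of the training data held out internally
    random_state=42,
)
gbrt_es.fit(X_demo, y_demo)

# n_estimators_ reports how many trees were actually fitted.
print(gbrt_es.n_estimators_)
```

One trade-off is that the internal validation split is taken out of the training data, whereas the manual loop lets you supply your own validation set.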

In [20]:
# There is an optimized implementation of gradient boosting called XGBoost that is quite popular.
import xgboost

xgb_reg = xgboost.XGBRegressor()
xgb_reg.fit(X_train, y_train)
y_pred = xgb_reg.predict(X_val)


In [21]:
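# XGBoost supports early stopping out of the box: training halts once the validation
# metric has not improved for early_stopping_rounds rounds.
# (Newer XGBoost versions expect early_stopping_rounds in the constructor rather than in fit().)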
xgb_reg.fit(X_train, y_train, eval_set=[(X_val, y_val)], early_stopping_rounds=2)
y_pred = xgb_reg.predict(X_val)

[0]	validation_0-rmse:0.19792
[1]	validation_0-rmse:0.14902
[2]	validation_0-rmse:0.11354
[3]	validation_0-rmse:0.09599
[4]	validation_0-rmse:0.08507
[5]	validation_0-rmse:0.07984
[6]	validation_0-rmse:0.07805
[7]	validation_0-rmse:0.07747
[8]	validation_0-rmse:0.07805
