## Ensemble Learning

### Content
- [Ensemble Learning](#ensemble)

<a id='ensemble'></a> 
### Ensemble Learning and Random Forests
> Ensemble learning combines the predictions of multiple predictors into a single prediction. For classification, the final class can be determined by a majority vote (hard voting classifier); for regression, the predicted values are typically averaged.

There are different ensemble methods:
- bagging
- boosting
- stacking

#### Hard Voting
- each classifier predicts a class, and the class with the most votes (majority vote) is the ensemble's prediction
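
To make the majority vote concrete, here is a minimal sketch with NumPy. The prediction matrix is made up for illustration and does not come from the classifiers trained in this notebook.

```python
import numpy as np

# Hypothetical class predictions (0 or 1) of three classifiers for five instances
predictions = np.array([
    [0, 1, 1, 0, 1],   # classifier A
    [0, 1, 0, 0, 1],   # classifier B
    [1, 1, 1, 0, 0],   # classifier C
])
n_classes = 2

# For every instance (column), count the votes per class and keep the most frequent class
hard_vote = np.array([
    np.bincount(votes, minlength=n_classes).argmax()
    for votes in predictions.T
])
print(hard_vote)
```

With an odd number of binary classifiers there are no ties; `argmax` would otherwise resolve a tie in favour of the lowest class index.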


In [24]:
# imports
from sklearn.datasets import make_moons, load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, VotingClassifier, BaggingClassifier, AdaBoostClassifier, GradientBoostingRegressor
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, mean_squared_error
import numpy as np
import matplotlib.pyplot as plt

In [4]:
# create the dataset to showcase code
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
# split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [6]:
# create classifiers
log_clf = LogisticRegression(solver='lbfgs', random_state=42)
rnd_clf = RandomForestClassifier(n_estimators=100, random_state=42)
svc_clf = SVC(gamma='scale', random_state=24)

# voting can be 'hard' (majority vote) or 'soft' (average of the predicted class probabilities)
voting_clf = VotingClassifier(estimators=[('lr', log_clf), 
                                          ('rf', rnd_clf), 
                                          ('svc', svc_clf)],
                              voting='hard')

In [7]:
# fit the hard voting classifier
voting_clf.fit(X_train, y_train)

VotingClassifier(estimators=[('lr',
                              LogisticRegression(C=1.0, class_weight=None,
                                                 dual=False, fit_intercept=True,
                                                 intercept_scaling=1,
                                                 l1_ratio=None, max_iter=100,
                                                 multi_class='auto',
                                                 n_jobs=None, penalty='l2',
                                                 random_state=42,
                                                 solver='lbfgs', tol=0.0001,
                                                 verbose=0, warm_start=False)),
                             ('rf',
                              RandomForestClassifier(bootstrap=True,
                                                     ccp_alpha=0.0,
                                                     class_weight=None,
                                               

In [12]:
# accuracy of each individual classifier and of the voting ensemble
for clf in (log_clf, rnd_clf, svc_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.864
RandomForestClassifier 0.896
SVC 0.896
VotingClassifier 0.912


On this test set, the ensemble performs slightly better than each individual classifier.

#### Soft Voting
Soft voting averages the class probabilities predicted by the individual classifiers and picks the class with the highest average probability. To use soft voting, all estimators have to be able to estimate class probabilities (`predict_proba()`).
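
The difference between hard and soft voting shows up when one classifier is very confident and the others are not. The probabilities below are invented for illustration; this is only a sketch of the aggregation rule, not of how `VotingClassifier` is implemented internally.

```python
import numpy as np

# Hypothetical predict_proba outputs of three classifiers for ONE instance
# columns: P(class 0), P(class 1)
probas = np.array([
    [0.55, 0.45],   # classifier A: weakly prefers class 0
    [0.60, 0.40],   # classifier B: weakly prefers class 0
    [0.10, 0.90],   # classifier C: strongly prefers class 1
])

# Hard voting: each classifier casts one vote for its most probable class
hard_vote = np.bincount(probas.argmax(axis=1), minlength=2).argmax()

# Soft voting: average the probabilities, then pick the most probable class
soft_vote = probas.mean(axis=0).argmax()

print("hard:", hard_vote, "soft:", soft_vote)
```

Soft voting gives more weight to confident votes, which is why the two rules can disagree on the same instance.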

In [16]:
# create classifiers
# SVC needs probability=True so that it provides predict_proba() for soft voting
log_clf = LogisticRegression(solver='lbfgs', random_state=42)
rnd_clf = RandomForestClassifier(n_estimators=100, random_state=42)
svc_clf = SVC(gamma='scale', probability=True ,random_state=24)

# voting can be 'hard' (majority vote) or 'soft' (average of the predicted class probabilities)
voting_clf = VotingClassifier(estimators=[('lr', log_clf), 
                                          ('rf', rnd_clf), 
                                          ('svc', svc_clf)],
                              voting='soft')

In [17]:
# fit the soft voting classifier
voting_clf.fit(X_train, y_train)

VotingClassifier(estimators=[('lr',
                              LogisticRegression(C=1.0, class_weight=None,
                                                 dual=False, fit_intercept=True,
                                                 intercept_scaling=1,
                                                 l1_ratio=None, max_iter=100,
                                                 multi_class='auto',
                                                 n_jobs=None, penalty='l2',
                                                 random_state=42,
                                                 solver='lbfgs', tol=0.0001,
                                                 verbose=0, warm_start=False)),
                             ('rf',
                              RandomForestClassifier(bootstrap=True,
                                                     ccp_alpha=0.0,
                                                     class_weight=None,
                                               

In [18]:
# accuracy of each individual classifier and of the voting ensemble
for clf in (log_clf, rnd_clf, svc_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.864
RandomForestClassifier 0.896
SVC 0.896
VotingClassifier 0.92


### Bagging and Pasting
> Use the same algorithm for every predictor and train them on different random subsets of the training set.
- Bagging: sampling with replacement (bootstrap=True)
- Pasting: sampling without replacement (bootstrap=False)
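- to predict on unseen data, the ensemble aggregates the individual predictions: the statistical mode (most frequent prediction) for classification and the mean for regression
- both bagging and pasting can be used with sklearn for classification (`BaggingClassifier`) and regression (`BaggingRegressor`)
- bagging often results in better models than pasting, because bootstrapping adds more diversity to the training subsets (slightly higher bias, but lower variance)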
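
The difference between the two sampling schemes can be sketched with NumPy alone. The sizes below are arbitrary illustration values.

```python
import numpy as np

rng = np.random.default_rng(42)
n_instances, max_samples = 10, 6      # illustrative sizes
indices = np.arange(n_instances)

# Bagging: sampling WITH replacement, the same instance may appear several times
bagging_sample = rng.choice(indices, size=max_samples, replace=True)

# Pasting: sampling WITHOUT replacement, every instance appears at most once
pasting_sample = rng.choice(indices, size=max_samples, replace=False)

print("bagging:", np.sort(bagging_sample))
print("pasting:", np.sort(pasting_sample))
```

Each predictor of the ensemble would be trained on its own sample drawn this way.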

In [21]:
# note: scikit-learn 1.2+ renames `base_estimator` to `estimator`
bag_clf = BaggingClassifier(base_estimator=DecisionTreeClassifier(random_state=42),
                           n_estimators=500,
                           max_samples=100,
                           bootstrap=True,
                           random_state=42,
                           n_jobs=-1)
bag_clf.fit(X_train, y_train)

BaggingClassifier(base_estimator=DecisionTreeClassifier(ccp_alpha=0.0,
                                                        class_weight=None,
                                                        criterion='gini',
                                                        max_depth=None,
                                                        max_features=None,
                                                        max_leaf_nodes=None,
                                                        min_impurity_decrease=0.0,
                                                        min_impurity_split=None,
                                                        min_samples_leaf=1,
                                                        min_samples_split=2,
                                                        min_weight_fraction_leaf=0.0,
                                                        presort='deprecated',
                                                        random_state=42,
  

In [22]:
y_pred = bag_clf.predict(X_test)

In [23]:
print(accuracy_score(y_test, y_pred))

0.904


In [25]:
# Result of a normal DecisionTree
tree_clf = DecisionTreeClassifier(random_state=42)
tree_clf.fit(X_train, y_train)
tree_preds = tree_clf.predict(X_test)
print(accuracy_score(y_test, tree_preds))

0.856


### Out-of-Bag evaluation
> With bagging, some instances may be sampled several times for any given predictor, while others may not be sampled at all. These out-of-bag (OOB) instances can be used to evaluate the predictor without the need for a separate validation set.
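
How many instances end up out-of-bag can be checked with a quick simulation. When drawing m instances with replacement from a set of size m, roughly 1/e (about 37%) of the instances are never drawn. The set size below is an arbitrary choice for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 10_000                                  # illustrative training set size

# One bootstrap sample: m draws with replacement from m instances
bootstrap_idx = rng.integers(0, m, size=m)

# Mark every instance that was drawn at least once
in_bag = np.zeros(m, dtype=bool)
in_bag[bootstrap_idx] = True

oob_fraction = 1 - in_bag.mean()
print(f"out-of-bag fraction: {oob_fraction:.3f} (1/e = {np.exp(-1):.3f})")
```

Since the OOB instances differ from predictor to predictor, every training instance is out-of-bag for some of the predictors, which is what makes the OOB score possible.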

In [27]:
# set oob_score=True in BaggingClassifier
bag_clf = BaggingClassifier(base_estimator=DecisionTreeClassifier(random_state=42),
                           n_estimators=500,
                           bootstrap=True,
                           oob_score=True,
                           random_state=42)

bag_clf.fit(X_train, y_train)
bag_clf.oob_score_

0.8986666666666666

In [28]:
# compare the OOB estimate with the accuracy on the test set
y_pred = bag_clf.predict(X_test)
print(accuracy_score(y_test, y_pred))

0.912


In [31]:
# OOB class probabilities for the first five training instances
bag_clf.oob_decision_function_[:5]

array([[0.32352941, 0.67647059],
       [0.35625   , 0.64375   ],
       [1.        , 0.        ],
       [0.        , 1.        ],
       [0.        , 1.        ]])

#### Random Patches and Random Subspaces
> BaggingClassifier supports sampling features, allowing each predictor to be trained on a random subset of the input features.
- Random Patches: sampling both training instances and features
- Random Subspaces: keeping all training instances, but sampling features
- useful when dealing with high-dimensional inputs (e.g. images)

> Sampling features results in even more predictor diversity, trading a bit more bias for a lower variance.
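
A minimal sketch of both variants with `BaggingClassifier`. The synthetic dataset, the sampling ratios, and the variable names are assumptions chosen for illustration. The base estimator is passed positionally so the snippet does not depend on whether the keyword is called `base_estimator` or `estimator` in the installed scikit-learn version.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with more features than the moons dataset (illustrative only)
X_f, y_f = make_classification(n_samples=500, n_features=20, n_informative=5,
                               random_state=42)
X_f_train, X_f_test, y_f_train, y_f_test = train_test_split(X_f, y_f, random_state=42)

# Random Subspaces: keep all instances, sample half of the features per predictor
subspaces_clf = BaggingClassifier(DecisionTreeClassifier(random_state=42),
                                  n_estimators=100,
                                  bootstrap=False, max_samples=1.0,
                                  bootstrap_features=True, max_features=0.5,
                                  random_state=42)

# Random Patches: sample both instances and features
patches_clf = BaggingClassifier(DecisionTreeClassifier(random_state=42),
                                n_estimators=100,
                                bootstrap=True, max_samples=0.7,
                                bootstrap_features=True, max_features=0.5,
                                random_state=42)

feature_scores = {}
for name, clf in (("subspaces", subspaces_clf), ("patches", patches_clf)):
    clf.fit(X_f_train, y_f_train)
    feature_scores[name] = clf.score(X_f_test, y_f_test)
    print(name, feature_scores[name])
```

After fitting, `estimators_features_` holds the feature indices each predictor was trained on.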

### Random Forests and Extra Trees
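- A Random Forest is an ensemble of Decision Trees that introduces extra randomness when growing trees. Rather than searching for the best feature among all features when splitting a node, it searches for the best feature among a random subset of features. This results in more tree diversity, trading a higher bias for a lower variance, which generally yields a better model overall.
- Extra-Trees (Extremely Randomized Trees) introduce even more randomness, again trading more bias for a lower variance.
    - they use random thresholds for each feature instead of searching for the best possible thresholds, which also makes them faster to train

> It is hard to tell in advance which of the two performs better; the only way to know is to try both and compare them.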
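
One way to compare the two is cross-validation on the same data. The sketch below reuses the moons dataset settings from the beginning of this notebook; the number of trees and folds are arbitrary choices.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X_m, y_m = make_moons(n_samples=500, noise=0.30, random_state=42)

cv_scores = {}
for model in (RandomForestClassifier(n_estimators=100, random_state=42),
              ExtraTreesClassifier(n_estimators=100, random_state=42)):
    # mean accuracy over 5 cross-validation folds
    cv_scores[type(model).__name__] = cross_val_score(model, X_m, y_m, cv=5).mean()

for name, score in cv_scores.items():
    print(name, round(score, 3))
```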

### Feature importance
> Random Forests make it easy to measure the relative importance of each feature

In [34]:
# check the feature importance of the iris dataset
iris = load_iris()
rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
rnd_clf.fit(iris['data'], iris['target'])
for name, score in zip(iris['feature_names'], rnd_clf.feature_importances_):
    print(name, score)

sepal length (cm) 0.10018256738063595
sepal width (cm) 0.02339091270538694
petal length (cm) 0.44045983501268
petal width (cm) 0.43596668490129703


Random Forests are therefore handy for getting a quick understanding of which features actually matter, in particular if you need to perform feature selection.
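
As a sketch of how the importances could drive feature selection, scikit-learn's `SelectFromModel` keeps the features whose importance reaches a threshold. Using the median as threshold is an assumption made for this example, not a recommendation.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

iris = load_iris()

# Keep the features whose importance is at least the median importance
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=500, random_state=42),
    threshold="median",
)
X_selected = selector.fit_transform(iris["data"], iris["target"])

selected_names = [name for name, keep
                  in zip(iris["feature_names"], selector.get_support()) if keep]
print(selected_names, X_selected.shape)
```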

### Boosting

#### AdaBoost

In [6]:
# SAMME.R relies on class probabilities (predict_proba) rather than class predictions
ada_clf = AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=1),
                             n_estimators=200,
                             algorithm='SAMME.R',
                             learning_rate=0.5,
                             random_state=42)

In [7]:
ada_clf.fit(X_train, y_train )

AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=1),
                   learning_rate=0.5, n_estimators=200, random_state=42)

In [8]:
y_pred = ada_clf.predict(X_test)
print(accuracy_score(y_test, y_pred))

0.896


#### Gradient Boosting
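> Like AdaBoost, Gradient Boosting adds predictors to the ensemble sequentially. Instead of tweaking the instance weights at every iteration, it fits the new predictor to the residual errors made by the previous predictor.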
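
The idea can be sketched by hand with three regression trees, each fitted to the residuals of the trees before it. The noisy quadratic data and all variable names are assumptions made for this illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Noisy quadratic toy data (illustrative)
rng = np.random.default_rng(42)
X_gb = rng.random((100, 1)) - 0.5
y_gb = 3 * X_gb[:, 0] ** 2 + 0.05 * rng.standard_normal(100)

# First tree is fitted to the targets
tree1 = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X_gb, y_gb)

# Second tree is fitted to the residual errors of the first tree
residual1 = y_gb - tree1.predict(X_gb)
tree2 = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X_gb, residual1)

# Third tree is fitted to the residual errors that remain after the second tree
residual2 = residual1 - tree2.predict(X_gb)
tree3 = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X_gb, residual2)

# The ensemble prediction is the sum of the predictions of all trees
y_pred_1 = tree1.predict(X_gb)
y_pred_3 = sum(tree.predict(X_gb) for tree in (tree1, tree2, tree3))

mse_1 = np.mean((y_gb - y_pred_1) ** 2)
mse_3 = np.mean((y_gb - y_pred_3) ** 2)
print("training MSE with 1 tree:", mse_1, "with 3 trees:", mse_3)
```

This sketch omits the learning rate; `GradientBoostingRegressor` additionally scales the contribution of each tree.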

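Regularization technique: shrinkage
- set a low value for the learning rate (`learning_rate`), so that each tree contributes less. More trees are then needed to fit the training set, but the predictions will usually generalize better.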

##### Find optimal number of trees
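- early stopping with the `staged_predict()` method, which yields the predictions of the ensemble at each stage of training (with one tree, two trees, ...)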

##### Stochastic Gradient Boosting
> using Gradient Boosting with the hyperparameter subsample to train each tree on a randomly selected subsample of the training data.

- trades higher bias for lower variance 
- speeds up training considerably
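
A minimal sketch combining shrinkage and subsampling. The data, `learning_rate=0.1`, and `subsample=0.25` are illustrative assumptions, not tuned values.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Noisy quadratic toy data (illustrative)
rng = np.random.default_rng(42)
X_q = rng.random((200, 1)) - 0.5
y_q = 3 * X_q[:, 0] ** 2 + 0.05 * rng.standard_normal(200)
X_q_train, X_q_val, y_q_train, y_q_val = train_test_split(X_q, y_q, random_state=42)

# learning_rate < 1 shrinks the contribution of each tree (shrinkage),
# subsample < 1 trains each tree on a random fraction of the training instances
sgb = GradientBoostingRegressor(max_depth=2,
                                n_estimators=200,
                                learning_rate=0.1,
                                subsample=0.25,
                                random_state=42)
sgb.fit(X_q_train, y_q_train)

sgb_mse = mean_squared_error(y_q_val, sgb.predict(X_q_val))
print("validation MSE:", sgb_mse)
```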

In [15]:
# create some data
np.random.seed(42)
X = np.random.rand(100, 1) - 0.5
y = 3*X[:, 0]**2 + 0.05 * np.random.randn(100)

In [26]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=49)

gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=120, random_state=42)
gbrt.fit(X_train, y_train)

errors = [mean_squared_error(y_test, y_pred) for y_pred in gbrt.staged_predict(X_test)]
bst_n_estimators = np.argmin(errors) + 1

gbrt_best = GradientBoostingRegressor(max_depth=2, n_estimators=bst_n_estimators, random_state=42)
gbrt_best.fit(X_train, y_train)

GradientBoostingRegressor(max_depth=2, n_estimators=56, random_state=42)

In [30]:
min_error = np.min(errors)

In [31]:
# the gbrt_best is trained with 56 trees
bst_n_estimators

56

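> It is also possible to implement early stopping by actually stopping training early, instead of training a large number of trees first and then looking back for the optimal number.
- we have to set `warm_start=True`, which makes sklearn keep the existing trees when `fit()` is called again, allowing incremental training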

In [36]:
gbrt = GradientBoostingRegressor(max_depth=2, warm_start=True, random_state=42)

# implement early stopping, when the validation error does not improve for five iterations in a row

# base variables
# set first error to inf so that we can continue with every error, regardless of size
min_val_error = float('inf')
error_going_up = 0

# add one tree per iteration and track the validation error
for n_estimators in range(1, 120):
    gbrt.n_estimators = n_estimators
    gbrt.fit(X_train, y_train)
    y_pred = gbrt.predict(X_test)
    val_error = mean_squared_error(y_test, y_pred)
    if val_error < min_val_error:
        min_val_error = val_error
        error_going_up = 0
    else:
        error_going_up += 1
        if error_going_up == 5:
            break

In [37]:
print(gbrt.n_estimators)

61


In [38]:
print('Minimum validation MSE:', min_val_error)

Minimum validation MSE: 0.002712853325235463
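
As an alternative to the manual loop above, `GradientBoostingRegressor` has built-in early stopping via `n_iter_no_change`. Note that it holds out a fraction of the training data (`validation_fraction`) internally rather than using a separate test set, so the resulting number of trees can differ from the manual approach. The data and parameter values are assumptions for this sketch.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Noisy quadratic toy data (illustrative)
rng = np.random.default_rng(42)
X_es = rng.random((200, 1)) - 0.5
y_es = 3 * X_es[:, 0] ** 2 + 0.05 * rng.standard_normal(200)

# Stop when the score on the internal validation split has not improved
# for 5 consecutive iterations; n_estimators is only the upper limit
gbrt_es = GradientBoostingRegressor(max_depth=2,
                                    n_estimators=120,
                                    validation_fraction=0.2,
                                    n_iter_no_change=5,
                                    random_state=42)
gbrt_es.fit(X_es, y_es)

# number of trees that were actually trained
print(gbrt_es.n_estimators_)
```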


#### XGBoost
> XGBoost is an optimized implementation of Gradient Boosting. It aims to be extremely fast, scalable, and portable

In [41]:
import xgboost

In [47]:
xgb_reg = xgboost.XGBRegressor(random_state=42)
xgb_reg.fit(X_train, y_train)
y_pred = xgb_reg.predict(X_test)
val_error = mean_squared_error(y_test, y_pred)
print("Validation MSE:", val_error)

Validation MSE: 0.00400040950714611


XGBoost can also take care of early stopping when it is given a validation set:

In [50]:
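# note: XGBoost 2.0+ expects early_stopping_rounds in the XGBRegressor constructor instead of fit()
xgb_reg.fit(X_train, y_train, eval_set=[(X_test, y_test)], early_stopping_rounds=2)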
y_pred = xgb_reg.predict(X_test)
val_error = mean_squared_error(y_test, y_pred)
print("Validation MSE:", val_error)

[0]	validation_0-rmse:0.22834
Will train until validation_0-rmse hasn't improved in 2 rounds.
[1]	validation_0-rmse:0.16224
[2]	validation_0-rmse:0.11843
[3]	validation_0-rmse:0.08760
[4]	validation_0-rmse:0.06848
[5]	validation_0-rmse:0.05709
[6]	validation_0-rmse:0.05297
[7]	validation_0-rmse:0.05129
[8]	validation_0-rmse:0.05155
[9]	validation_0-rmse:0.05211
Stopping. Best iteration:
[7]	validation_0-rmse:0.05129

Validation MSE: 0.0026308690413069744


### Stacking
> Instead of using trivial functions (such as hard voting) to aggregate the predictions of all predictors in an ensemble, stacking trains a model to perform the aggregation.
> The final predictor is called a blender or meta learner.

- scikit-learn supports stacking through `StackingClassifier` and `StackingRegressor` (since version 0.22). It is also not too hard to implement by hand, or one can use DESlib.

#### Implementation
1. split training set into two subsets 
2. use the first to train the predictors in the first layer
3. these predictors are used to make predictions on the second set
4. create a new training set with the predicted values as input features 
5. the blender is trained on this set, so it learns to predict the target value, based on the first layer's predictions
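
The five steps above can be sketched by hand as follows. The choice of first-layer classifiers, the logistic regression blender, the 50/50 split, and all variable names are assumptions made for this illustration.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X_s, y_s = make_moons(n_samples=1000, noise=0.30, random_state=42)
X_s_train, X_s_test, y_s_train, y_s_test = train_test_split(X_s, y_s, random_state=42)

# 1. split the training set into two subsets
X_layer1, X_blend, y_layer1, y_blend = train_test_split(
    X_s_train, y_s_train, test_size=0.5, random_state=42)

# 2. train the first-layer predictors on the first subset
first_layer = [LogisticRegression(random_state=42),
               RandomForestClassifier(n_estimators=100, random_state=42),
               SVC(gamma='scale', random_state=42)]
for clf in first_layer:
    clf.fit(X_layer1, y_layer1)

def first_layer_predictions(X):
    """Stack the predictions of all first-layer predictors as columns."""
    return np.column_stack([clf.predict(X) for clf in first_layer])

# 3. + 4. predict on the second subset and use the predictions as input features
blend_features = first_layer_predictions(X_blend)

# 5. train the blender on the new training set
blender = LogisticRegression(random_state=42)
blender.fit(blend_features, y_blend)

# To predict, run new instances through the first layer, then through the blender
stack_pred = blender.predict(first_layer_predictions(X_s_test))
stack_acc = accuracy_score(y_s_test, stack_pred)
print("stacking accuracy:", stack_acc)
```

Training the blender on held-out predictions matters: the first-layer predictors never saw the second subset, so their predictions on it are not overly optimistic.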