## Ensemble Learning
> Ensemble learning means using multiple predictors together to make a prediction.
> Ensembling combines a collection of models to get a more performant model, or to reduce overfitting by lowering the model's variance.

- the models used in ensemble learning are often called <b>weak learners</b>

There are different Ensemble methods:
- [Averaging](#avg)
- [Weighted Averaging](#wavg)
- [Hard or Max Voting](#hard)
- [Soft Voting](#soft)
- [Bagging and Pasting](#bag)
- [Random Patches and Random Subspaces](#random)
- [Random Forest and Extra Trees](#rf)
- [Boosting](#boosting)
- [Stacking](#stacking)


### Advantages and disadvantages of Ensemble Learning
| Advantages | Disadvantages |
|------------|---------------|
| higher predictive power | less interpretable, which makes it harder to present insights to the business |
| useful when the data contains both linear and non-linear relationships | hard to get right: a poor choice of models can lower predictive power |
| bias/variance can be reduced as well as under/overfitting | expensive both in time and space |
| less noisy and more stable |  |

In [78]:
# imports
from sklearn.datasets import make_moons, load_iris, load_diabetes, load_boston
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier, VotingClassifier, BaggingClassifier, AdaBoostClassifier, GradientBoostingRegressor, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.linear_model import LogisticRegression, RidgeCV, Lasso, ElasticNet, Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, mean_squared_error, mean_absolute_error

from sklearn.svm import LinearSVR, SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import StackingRegressor

from xgboost import XGBClassifier, XGBRegressor



import numpy as np
import matplotlib.pyplot as plt

In [1]:
# check version, stacking is supported from 0.22
import sklearn
print(sklearn.__version__)

0.23.1


<a id='avg'></a>
### Averaging
Calculate the mean of the predictions from each predictor.

In [50]:
# load data (note: load_boston was removed in scikit-learn 1.2; this notebook runs on 0.23.1)
X, y = load_boston(return_X_y=True)
# split data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# define models
knn = KNeighborsRegressor()
lasso = Lasso()
ridge = Ridge()
rf = RandomForestRegressor()

# fit models
knn.fit(X_train, y_train)
lasso.fit(X_train, y_train)
ridge.fit(X_train, y_train)
rf.fit(X_train, y_train)

# predict
knn_pred = knn.predict(X_test)
lasso_pred = lasso.predict(X_test)
ridge_pred = ridge.predict(X_test)
rf_pred = rf.predict(X_test)

# Averaging
avg_pred = np.mean([knn_pred, lasso_pred, ridge_pred, rf_pred], axis=0)

# Evaluation
print('knn Mean Absolute Error:', mean_absolute_error(y_test, knn_pred))
print('lasso Mean Absolute Error:', mean_absolute_error(y_test, lasso_pred))
print('ridge Mean Absolute Error:', mean_absolute_error(y_test, ridge_pred))
print('rf Mean Absolute Error:', mean_absolute_error(y_test, rf_pred))
print('AVERAGING Mean Absolute Error:', mean_absolute_error(y_test, avg_pred))

knn Mean Absolute Error: 3.6215748031496067
lasso Mean Absolute Error: 3.424610243097111
ridge Mean Absolute Error: 3.0503751260061707
rf Mean Absolute Error: 2.1682125984251974
AVERAGING Mean Absolute Error: 2.5746191369884746


> Averaging did not beat the best single model (the random forest) in this case, although it did beat the other three. A prediction based on four models tends to be more stable than a single model's prediction, but a better test score is not guaranteed.

<a id='wavg'></a>
### Weighted Averaging
We saw that simple averaging doesn't always result in lower error rates. Therefore we can assign a weight to each prediction, giving the most weight to the models that perform best.

In [58]:
weighted_pred = knn_pred*0.1 + lasso_pred*0.2 + ridge_pred*0.3 + rf_pred*0.4

print('knn Mean Absolute Error:', mean_absolute_error(y_test, knn_pred))
print('lasso Mean Absolute Error:', mean_absolute_error(y_test, lasso_pred))
print('ridge Mean Absolute Error:', mean_absolute_error(y_test, ridge_pred))
print('rf Mean Absolute Error:', mean_absolute_error(y_test, rf_pred))
print('AVERAGING Mean Absolute Error:', mean_absolute_error(y_test, avg_pred))
print('Weighted AVERAGING Mean Absolute Error:', mean_absolute_error(y_test, weighted_pred))

knn Mean Absolute Error: 3.6215748031496067
lasso Mean Absolute Error: 3.424610243097111
ridge Mean Absolute Error: 3.0503751260061707
rf Mean Absolute Error: 2.1682125984251974
AVERAGING Mean Absolute Error: 2.5746191369884746
Weighted AVERAGING Mean Absolute Error: 2.395877374561435


> using the weighted averaging results in lower errors than simple averaging but in this case it still performed slightly worse than the best single predictor
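The weights above were picked by hand. One possible alternative, sketched here on made-up toy numbers (not results from this notebook), is to derive the weights from each model's error on a separate validation set, so that a lower error means a higher weight. The inverse-error rule and all array values below are assumptions for illustration.

```python
import numpy as np

# toy validation targets and predictions from three hypothetical models
y_val = np.array([10.0, 12.0, 14.0, 16.0])
preds = np.array([
    [11.0, 13.0, 15.0, 17.0],   # model A: off by 1 everywhere
    [12.0, 14.0, 16.0, 18.0],   # model B: off by 2 everywhere
    [14.0, 16.0, 18.0, 20.0],   # model C: off by 4 everywhere
])

# mean absolute error of each model on the validation set
mae = np.abs(preds - y_val).mean(axis=1)

# inverse-error weights, normalised so that they sum to 1
weights = (1.0 / mae) / (1.0 / mae).sum()

# weighted average of the three prediction vectors
toy_weighted_pred = np.average(preds, axis=0, weights=weights)

print('weights:', weights)
print('weighted MAE:', np.abs(toy_weighted_pred - y_val).mean())
print('simple average MAE:', np.abs(preds.mean(axis=0) - y_val).mean())
```

Computing the weights on the test set itself would leak information, which is why the sketch assumes a held-out validation set.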

<a id='hard'></a>
### Hard Voting or Max Voting
- is very much the same as averaging, but for classification tasks
- the class with the majority of votes is used as the prediction
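Before using scikit-learn, the majority vote itself can be sketched in a few lines of NumPy. The vote matrix is a toy example with three hypothetical classifiers and four samples.

```python
import numpy as np

# toy class predictions: one row per classifier, one column per sample
votes = np.array([
    [0, 1, 1, 0],   # classifier 1
    [0, 1, 0, 0],   # classifier 2
    [1, 1, 0, 1],   # classifier 3
])

# count the votes per class for every sample (result has shape n_classes x n_samples)
n_classes = 2
counts = np.apply_along_axis(np.bincount, 0, votes, minlength=n_classes)

# the class with the most votes wins
toy_hard_pred = counts.argmax(axis=0)
print(toy_hard_pred)  # [0 1 0 0]
```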

In [69]:
# create the dataset to showcase code
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
# split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [47]:
# create classifiers
log_clf = LogisticRegression(solver='lbfgs', random_state=42)
rnd_clf = RandomForestClassifier(n_estimators=100, random_state=42)
svc_clf = SVC(gamma='scale', random_state=24)

# with voting we can set hard (majority vote) or soft (averaged class probabilities)
voting_clf = VotingClassifier(estimators=[('lr', log_clf), 
                                          ('rf', rnd_clf), 
                                          ('svc', svc_clf)],
                              voting='hard')

# fit the hard voting classifier
voting_clf.fit(X_train, y_train)

# get the accuracy of each individual classifier and of the voting ensemble
for clf in (log_clf, rnd_clf, svc_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.864
RandomForestClassifier 0.896
SVC 0.896
VotingClassifier 0.912


> As expected the max voting classifier performed better than each algorithm by itself.

<a id='soft'></a>
### Soft Voting
Soft voting averages the class probabilities predicted by each estimator and picks the class with the highest average. To use soft voting, all estimators have to be able to estimate class probabilities (`predict_proba()`).
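A small NumPy sketch, using made-up probabilities for a single sample, shows how soft and hard voting can disagree when one classifier is very confident.

```python
import numpy as np

# toy class probabilities [P(class 0), P(class 1)] for one sample
toy_proba = np.array([
    [0.45, 0.55],   # classifier 1: weakly prefers class 1
    [0.40, 0.60],   # classifier 2: weakly prefers class 1
    [0.95, 0.05],   # classifier 3: strongly prefers class 0
])

# hard voting: each classifier casts one vote for its most likely class
hard_vote = np.bincount(toy_proba.argmax(axis=1)).argmax()

# soft voting: average the probabilities, then pick the most likely class
soft_vote = toy_proba.mean(axis=0).argmax()

print('hard vote:', hard_vote)  # 1
print('soft vote:', soft_vote)  # 0
```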

In [48]:
# create classifiers
# need to change SVC, since we want to predict the probabilities for soft voting
log_clf = LogisticRegression(solver='lbfgs', random_state=42)
rnd_clf = RandomForestClassifier(n_estimators=100, random_state=42)
svc_clf = SVC(gamma='scale', probability=True, random_state=24)

# with voting we can set hard (majority vote) or soft (averaged class probabilities)
voting_clf = VotingClassifier(estimators=[('lr', log_clf), 
                                          ('rf', rnd_clf), 
                                          ('svc', svc_clf)],
                              voting='soft')

# fit the soft voting classifier
voting_clf.fit(X_train, y_train)

# get the accuracy of each individual classifier and of the voting ensemble
for clf in (log_clf, rnd_clf, svc_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.864
RandomForestClassifier 0.896
SVC 0.896
VotingClassifier 0.92


> slightly better results than hard voting, because there is more information in probabilities than just the majority vote (soft voting takes the certainty of a classifier into account)

<a id='bag'></a>
### Bagging and Pasting
Use the same algorithm for every predictor and train them on different random subsets of the training set.
- Parallel method: no dependency between base learners
- Bagging: sampling with replacement (bootstrap=True)
- Pasting: sampling without replacement (bootstrap=False)
- uses the statistical mode (most frequent prediction) for classification and the mean for regression to make a prediction on unseen data
- both Bagging and Pasting can be used with sklearn for regression and classification
- every model is trained on a different subset of the data and all the results are combined, so the final model overfits less and its variance is reduced
- effective for models which have high variance like classification and regression trees

Bagging out of the box: `RandomForest` and `ExtraTreesRegressor`

Turn any algorithm into a Bagging ensemble: `BaggingClassifier` or `BaggingRegressor`
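The difference between the two sampling schemes can be seen directly on the row indices. This is only a NumPy sketch of the sampling step; the sizes are arbitrary illustration values.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples, max_samples = 1000, 100

# Bagging: draw row indices with replacement, so duplicates are possible
bag_idx = rng.choice(n_samples, size=max_samples, replace=True)

# Pasting: draw row indices without replacement, so every index is unique
paste_idx = rng.choice(n_samples, size=max_samples, replace=False)

print('unique rows in bagging subset:', len(np.unique(bag_idx)))
print('unique rows in pasting subset:', len(np.unique(paste_idx)))
```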

In [63]:
# define bagging classifier (bootstrap=True)
bag_clf = BaggingClassifier(base_estimator=DecisionTreeClassifier(random_state=42),
                            n_estimators=500,
                            max_samples=100,
                            bootstrap=True,
                            random_state=42,
                            n_jobs=-1)

# define pasting classifier (bootstrap=False)
past_clf = BaggingClassifier(base_estimator=DecisionTreeClassifier(random_state=42),
                            n_estimators=500,
                            max_samples=100,
                            bootstrap=False,
                            random_state=42,
                            n_jobs=-1)

# normal classifier
tree_clf = DecisionTreeClassifier(random_state=42)

# fit
bag_clf.fit(X_train, y_train)
past_clf.fit(X_train, y_train)
tree_clf.fit(X_train, y_train)

# predict
bag_pred = bag_clf.predict(X_test)
past_pred = past_clf.predict(X_test)
tree_preds = tree_clf.predict(X_test)

# Result
print('BaggingClassifier :', accuracy_score(y_test, bag_pred))
print('PastingClassifier :', accuracy_score(y_test, past_pred))
print('DecisionTree :', accuracy_score(y_test, tree_preds))

BaggingClassifier : 0.904
PastingClassifier : 0.92
DecisionTree : 0.856


> As expected, Bagging and Pasting both performed better than the single decision tree. Bagging is usually preferred: sampling with replacement adds extra diversity to the training subsets, which reduces the ensemble's variance at the cost of a little more bias (which may be why Bagging scores slightly below Pasting here).

### Out-of-Bag evaluation
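With Bagging, some instances may be sampled several times for any given predictor, while others may not be sampled at all. The instances a predictor never saw are its out-of-bag (OOB) instances. Evaluating each predictor on its OOB instances gives a performance estimate without the need for a separate validation set.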
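How many instances end up out-of-bag can be checked with a short NumPy sketch. It assumes a bootstrap sample as large as the training set; in that case roughly 37% of the instances are typically left out.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 1000

# bootstrap sample of the same size as the training set
boot_idx = rng.choice(n_samples, size=n_samples, replace=True)

# an instance is out-of-bag if it was never drawn
oob_mask = np.ones(n_samples, dtype=bool)
oob_mask[boot_idx] = False

oob_fraction = oob_mask.mean()
print('out-of-bag fraction:', oob_fraction)
```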

In [27]:
# set oob_score=True in BaggingClassifier
bag_clf = BaggingClassifier(base_estimator=DecisionTreeClassifier(random_state=42),
                            n_estimators=500,
                            bootstrap=True,
                            oob_score=True,
                            random_state=42)

bag_clf.fit(X_train, y_train)
bag_clf.oob_score_

0.8986666666666666

In [28]:
# compare the OOB accuracy estimate with the accuracy on the test set
y_pred = bag_clf.predict(X_test)
print(accuracy_score(y_test, y_pred))

0.912


In [31]:
# returns the class probabilities for each instance
bag_clf.oob_decision_function_[:5]

array([[0.32352941, 0.67647059],
       [0.35625   , 0.64375   ],
       [1.        , 0.        ],
       [0.        , 1.        ],
       [0.        , 1.        ]])

<a id='random'></a>
### Random Patches and Random Subspaces
Until now we only sampled rows (training instances). <b>Random Patches</b> samples both training instances and features, which is particularly useful for high-dimensional data. <b>Random Subspaces</b> keeps all training instances but samples the input features.

> sampling features results in even more predictor diversity, trading a bit more bias for a lower variance.
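A minimal sketch of both variants with `BaggingClassifier`, where feature sampling is controlled by `max_features` and `bootstrap_features`. The synthetic dataset and all hyperparameter values are assumptions for illustration. The base estimator is passed positionally because its keyword name differs between scikit-learn versions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# synthetic data with 10 features, only for illustration
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=42)

# Random Subspaces: keep every training instance, sample the features
subspaces_clf = BaggingClassifier(DecisionTreeClassifier(random_state=42),
                                  n_estimators=20,
                                  bootstrap=False, max_samples=1.0,
                                  bootstrap_features=True, max_features=0.5,
                                  random_state=42)

# Random Patches: sample both training instances and features
patches_clf = BaggingClassifier(DecisionTreeClassifier(random_state=42),
                                n_estimators=20,
                                bootstrap=True, max_samples=0.5,
                                bootstrap_features=True, max_features=0.5,
                                random_state=42)

subspaces_clf.fit(X_demo, y_demo)
patches_clf.fit(X_demo, y_demo)

# feature indices seen by the first predictor of each ensemble
print(subspaces_clf.estimators_features_[0])
print(patches_clf.estimators_features_[0])
```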

<a id='rf'></a>
### Random Forests and Extra Trees
`RandomForest` is an ensemble of Decision Trees that introduces extra randomness when growing trees. Rather than searching for the best feature among all features when splitting a node, it searches for the best feature among a random subset of features. This results in more tree diversity, trading higher bias for lower variance, which generally yields a better model.

`ExtraTrees` (Extremely Randomized Trees) introduce even more randomness, again trading more bias for lower variance.
- uses random thresholds for each feature instead of searching for the best possible thresholds

> Hard to tell which algorithm performs better 
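Since it is hard to tell in advance, one practical option is to try both and compare them with cross-validation. The sketch below does this on a moons dataset like the one used earlier; the number of trees and folds are arbitrary choices.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X_demo, y_demo = make_moons(n_samples=500, noise=0.30, random_state=42)

# mean 5-fold cross-validated accuracy for each model
cv_results = {}
for clf in (RandomForestClassifier(n_estimators=100, random_state=42),
            ExtraTreesClassifier(n_estimators=100, random_state=42)):
    cv_scores = cross_val_score(clf, X_demo, y_demo, cv=5)
    cv_results[clf.__class__.__name__] = cv_scores.mean()
    print(clf.__class__.__name__, cv_scores.mean())
```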

#### Feature importance
> Random Forests make it easy to measure the relative importance of each feature

In [34]:
# check the feature importance of the iris dataset
iris = load_iris()
rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
rnd_clf.fit(iris['data'], iris['target'])
for name, score in zip(iris['feature_names'], rnd_clf.feature_importances_):
    print(name, score)

sepal length (cm) 0.10018256738063595
sepal width (cm) 0.02339091270538694
petal length (cm) 0.44045983501268
petal width (cm) 0.43596668490129703


So they are very handy for getting a quick understanding of which features actually matter, in particular if you need to perform feature selection.
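One way to turn these importances into an actual feature selection step is scikit-learn's `SelectFromModel`. The sketch below keeps the features whose importance is above the mean importance; that threshold is an assumption, not a recommendation.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

iris = load_iris()

# keep only the features whose importance is above the mean importance
selector = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=42),
                           threshold='mean')
X_selected = selector.fit_transform(iris['data'], iris['target'])

kept = [name for name, keep in zip(iris['feature_names'], selector.get_support()) if keep]
print(kept)
```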

<a id='boosting'></a>
### Boosting
The general idea of most boosting methods is to train predictors <b>sequentially</b>, each trying to correct its predecessor. The goal is to convert a weak learner to a strong learner. There are two main boosting methods: AdaBoost and Gradient Boosting.

- iterative method that adjusts the weight of each observation based on the previous classification
- Goal: decrease the bias error (can overfit)

#### AdaBoost
- pays more attention to the training instances that the predecessor underfitted
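"Paying more attention" means increasing the weights of the misclassified instances. The sketch below shows a single weight update on toy labels, assuming the update rule described in Hands-On Machine Learning (weighted error rate `r`, predictor weight `alpha`); it is not scikit-learn's internal implementation.

```python
import numpy as np

toy_true = np.array([1, 0, 1, 1, 0])
toy_pred = np.array([1, 0, 0, 1, 0])   # the third instance is misclassified

# start with equal instance weights
inst_weights = np.full(len(toy_true), 1 / len(toy_true))
learning_rate = 1.0

wrong = toy_pred != toy_true
r = inst_weights[wrong].sum() / inst_weights.sum()   # weighted error rate
alpha = learning_rate * np.log((1 - r) / r)          # predictor weight

# boost the weights of the misclassified instances, then normalise
inst_weights[wrong] *= np.exp(alpha)
inst_weights /= inst_weights.sum()
print(inst_weights)
```

After the update the misclassified instance carries half of the total weight, so the next predictor is pushed to get it right.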

#### Gradient Boosting
- works like AdaBoost, but instead of tweaking the instance weights it fits each new predictor to the residual errors made by the previous predictor
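Fitting to residuals can be sketched by hand with three small regression trees. The noisy quadratic data and the tree depth are assumptions for illustration; `GradientBoostingRegressor` additionally applies a learning rate and supports other loss functions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# noisy quadratic toy data
rng = np.random.default_rng(42)
X_demo = rng.uniform(-0.5, 0.5, size=(100, 1))
y_demo = 3 * X_demo[:, 0] ** 2 + 0.05 * rng.normal(size=100)

trees, train_mse = [], []
residual = y_demo.copy()
for _ in range(3):
    # each tree is trained on what the previous trees still get wrong
    tree = DecisionTreeRegressor(max_depth=2, random_state=42)
    tree.fit(X_demo, residual)
    residual = residual - tree.predict(X_demo)
    trees.append(tree)
    train_mse.append(np.mean(residual ** 2))

# the ensemble prediction is the sum of the predictions of all trees
boosted_pred = sum(tree.predict(X_demo) for tree in trees)
print(train_mse)
```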

Regularization technique: shrinkage
- set a low value for the learning rate (`learning_rate`); more trees are then needed to fit the training set, but the predictions will usually generalize better.

##### Find optimal number of trees
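- Early stopping with the `staged_predict()` method: measure the validation error after each stage and keep the number of trees with the lowest error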
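A minimal sketch of this approach: train once with a generous number of trees, score every stage on a validation set, then retrain with the best number of trees. The synthetic dataset and the limit of 120 trees are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=20.0, random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(X_demo, y_demo, random_state=42)

gbrt_demo = GradientBoostingRegressor(max_depth=2, n_estimators=120, random_state=42)
gbrt_demo.fit(X_tr, y_tr)

# staged_predict yields the ensemble's predictions after 1, 2, ..., 120 trees
errors = [mean_squared_error(y_val, y_stage)
          for y_stage in gbrt_demo.staged_predict(X_val)]
best_n = int(np.argmin(errors)) + 1

# retrain with the number of trees that gave the lowest validation error
gbrt_best = GradientBoostingRegressor(max_depth=2, n_estimators=best_n, random_state=42)
gbrt_best.fit(X_tr, y_tr)
print('best number of trees:', best_n)
```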

##### Stochastic Gradient Boosting
> using Gradient Boosting with the hyperparameter `subsample` to train each tree on a randomly selected subsample of the training data.

- trades higher bias for lower variance 
- speeds up training considerably
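In scikit-learn this only requires setting `subsample` below 1.0. The sketch uses synthetic data and an arbitrary value of 0.25, so each tree is trained on a random 25% of the training instances.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=20.0, random_state=42)

# each tree sees a random 25% of the training instances
sgb = GradientBoostingRegressor(subsample=0.25, n_estimators=50, random_state=42)
sgb.fit(X_demo, y_demo)

# with subsample < 1.0 the model also tracks the out-of-bag loss improvement per stage
print(sgb.oob_improvement_[:3])
```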


In [71]:
# define boosting algos
ada_clf = AdaBoostClassifier()
gb_clf = GradientBoostingClassifier()
xgb_clf = XGBClassifier()

classifiers = [ada_clf, gb_clf, xgb_clf]
for clf in classifiers:
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)
    print(clf.__class__.__name__, ': Accuracy', accuracy_score(y_test, pred))

AdaBoostClassifier : Accuracy 0.904
GradientBoostingClassifier : Accuracy 0.888
XGBClassifier : Accuracy 0.872


> it is quite surprising to see that extreme gradient boosting performed worse than the others (all three models use their default hyperparameters here)

#### Early Stopping
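It is also possible to implement early stopping by actually stopping training early, instead of training a large number of trees first and then looking back for the optimal number.
- we have to set `warm_start=True`, which keeps the existing trees when `fit()` is called again, so each call only adds new trees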

In [73]:
# load data 
X, y = load_boston(return_X_y=True)
# split data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [74]:
gbrt = GradientBoostingRegressor(max_depth=2, warm_start=True, random_state=42)

# implement early stopping, when the validation error does not improve for five iterations in a row

# base variables
# start at inf so that the first validation error always counts as an improvement
min_val_error = float('inf')
error_going_up = 0

# add one tree per iteration and check the validation error
for n_estimators in range(1, 120):
    gbrt.n_estimators = n_estimators
    gbrt.fit(X_train, y_train)
    y_pred = gbrt.predict(X_test)
    val_error = mean_squared_error(y_test, y_pred)
    if val_error < min_val_error:
        min_val_error = val_error
        error_going_up = 0
    else:
        error_going_up += 1
        if error_going_up == 5:
            break

In [75]:
# number of trees when training stopped (five more than the best iteration)
print(gbrt.n_estimators)

79


In [76]:
print('Minimum validation MSE:', min_val_error)

Minimum validation MSE: 12.979089325435462


#### XGBoost
> XGBoost is an optimized implementation of Gradient Boosting. It aims to be extremely fast, scalable, and portable

In [79]:
xgb_reg = XGBRegressor(random_state=42)
xgb_reg.fit(X_train, y_train)
y_pred = xgb_reg.predict(X_test)
val_error = mean_squared_error(y_test, y_pred)
print("Validation MSE:", val_error)

Validation MSE: 10.061070591701837


It automatically takes care of early stopping:

In [80]:
xgb_reg.fit(X_train, y_train, eval_set=[(X_test, y_test)], early_stopping_rounds=2)
y_pred = xgb_reg.predict(X_test)
val_error = mean_squared_error(y_test, y_pred)
print("Validation MSE:", val_error)

[0]	validation_0-rmse:16.17967
Will train until validation_0-rmse hasn't improved in 2 rounds.
[1]	validation_0-rmse:11.82363
[2]	validation_0-rmse:9.00198
[3]	validation_0-rmse:7.09986
[4]	validation_0-rmse:5.75756
[5]	validation_0-rmse:4.89611
[6]	validation_0-rmse:4.25794
[7]	validation_0-rmse:3.95111
[8]	validation_0-rmse:3.78119
[9]	validation_0-rmse:3.64725
[10]	validation_0-rmse:3.50068
[11]	validation_0-rmse:3.42377
[12]	validation_0-rmse:3.37779
[13]	validation_0-rmse:3.35327
[14]	validation_0-rmse:3.32273
[15]	validation_0-rmse:3.27602
[16]	validation_0-rmse:3.24714
[17]	validation_0-rmse:3.22577
[18]	validation_0-rmse:3.22214
[19]	validation_0-rmse:3.22225
[20]	validation_0-rmse:3.20872
[21]	validation_0-rmse:3.21014
[22]	validation_0-rmse:3.18628
[23]	validation_0-rmse:3.18931
[24]	validation_0-rmse:3.19223
Stopping. Best iteration:
[22]	validation_0-rmse:3.18628

Validation MSE: 10.152398456601889


<a id='stacking'></a>
### Stacking (stacked generalization)
Instead of using trivial functions (such as hard voting) to aggregate the predictions of all predictors in an ensemble, stacking trains a model to perform the aggregation. This final predictor is called the blender or meta learner. The base learners make predictions, which are then used as inputs to the meta learner to make the final prediction.

- stacking is supported by scikit-learn since version 0.22
- before that, one had to use mlxtend or vecstack
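To make the mechanics concrete, stacking can be sketched by hand: the out-of-fold predictions of the base learners become the training features of the blender. The synthetic dataset and the choice of models are assumptions for illustration; `StackingRegressor` packages the same idea.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_predict, train_test_split

X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

base_learners = [Ridge(), RandomForestRegressor(n_estimators=10, random_state=42)]

# out-of-fold predictions: every training instance is predicted by a model that never saw it
meta_train = np.column_stack([cross_val_predict(model, X_tr, y_tr, cv=5)
                              for model in base_learners])

# the blender learns how to combine the base learners' predictions
blender = LinearRegression().fit(meta_train, y_tr)

# at prediction time the base learners are refit on the full training set
meta_test = np.column_stack([model.fit(X_tr, y_tr).predict(X_te)
                             for model in base_learners])
stacked_pred = blender.predict(meta_test)
print('R2 score: {:.2f}'.format(blender.score(meta_test, y_te)))
```

Using out-of-fold predictions keeps the blender from being trained on predictions the base learners made for instances they had already seen.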

In [82]:
X, y = load_boston(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [84]:
ridge = RidgeCV()
lasso = Lasso()
elnet = ElasticNet()
rf = RandomForestRegressor(n_estimators=10, random_state=42)
gb = GradientBoostingRegressor(random_state=42)
xgb = XGBRegressor()

models = [ridge, lasso, elnet, rf, gb, xgb]

for model in models:
    model.fit(X_train, y_train)
    print(model.__class__.__name__, 'R2 score: {:.2f}'.format(model.score(X_test, y_test)))

RidgeCV R2 score: 0.68
Lasso R2 score: 0.65
ElasticNet R2 score: 0.66
RandomForestRegressor R2 score: 0.84
GradientBoostingRegressor R2 score: 0.87
XGBRegressor R2 score: 0.86


In [86]:
# build stacking model
estimators = [
    ('ridge', RidgeCV()),
    ('lasso', Lasso()),
    ('rf', RandomForestRegressor(n_estimators=10, random_state=42))]

reg = StackingRegressor(estimators=estimators, final_estimator=GradientBoostingRegressor(random_state=42))

reg.fit(X_train, y_train)
print('R2 score: {:.2f}'.format(reg.score(X_test, y_test)))

R2 score: 0.83


#### Multiple stacking layers
- assign a `StackingRegressor` or `StackingClassifier` as the `final_estimator`

In [87]:
final_layer = StackingRegressor(estimators = 
                                [('rf', RandomForestRegressor(n_estimators=10, random_state=42)),
                                 ('gb', GradientBoostingRegressor(random_state=42))], 
                                final_estimator=RidgeCV())

multi_layer_reg = StackingRegressor(estimators = 
                                [('lr', RidgeCV()),
                                 ('xgb', XGBRegressor()),
                                 ('svr', SVR(C=1, gamma=1e-6, kernel='rbf'))],
                                final_estimator=final_layer)
                                    
multi_layer_reg.fit(X_train, y_train)
print('R2 score: {:.2f}'.format(multi_layer_reg.score(X_test, y_test)))                      

R2 score: 0.88


## Resources
- Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, second edition (Aurélien Géron)
- https://scikit-learn.org/stable/modules/ensemble.html#stacking
- https://towardsdatascience.com/stacking-made-easy-with-sklearn-e27a0793c92b
- https://towardsdatascience.com/ensemble-learning-techniques-6346db0c6ef8
- https://towardsdatascience.com/advanced-ensemble-learning-techniques-bf755e38cbfb