# Boosting
Boosting (originally called hypothesis boosting) refers to any ensemble method that can combine several weak learners into a strong learner.

The general idea of most boosting methods is to train predictors sequentially, each trying to correct its predecessor.

There are many boosting methods, but AdaBoost and Gradient Boosting are the most popular.

## AdaBoost

One way for a new predictor to correct its predecessor is to pay a bit more attention to the training instances that the predecessor underfitted. This results in new predictors focusing more and more on the hard cases. This is the technique used by AdaBoost.

<img src='img_7.png'>

For example, to build an AdaBoost classifier, a first base classifier (such as a Decision Tree) is trained and used to make predictions on the training set. The relative weight of misclassified training instances is then increased. A second classifier is trained using the updated weights and again it makes predictions on the training set, weights are updated, and so on.
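
The reweighting loop described above can be sketched by hand. This is a minimal illustration, not Scikit-Learn's implementation: it assumes the standard AdaBoost formulas (weighted error rate `r`, predictor weight `alpha = eta * log((1 - r) / r)`, misclassified weights multiplied by `exp(alpha)` and then normalized), and the dataset parameters, `n_rounds`, and variable names are hypothetical choices.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy dataset (two classes labeled 0 and 1)
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)

n_rounds = 5          # assumed number of boosting rounds
learning_rate = 1.0   # eta
weights = np.full(len(X), 1 / len(X))  # start with uniform instance weights
stumps, alphas = [], []

for _ in range(n_rounds):
    # Train a weak learner (a depth-1 tree) on the weighted training set
    stump = DecisionTreeClassifier(max_depth=1, random_state=42)
    stump.fit(X, y, sample_weight=weights)
    wrong = stump.predict(X) != y

    # Weighted error rate (the weights sum to 1); clipped to avoid log(0)
    r = np.clip(weights[wrong].sum(), 1e-10, 1 - 1e-10)
    # Predictor weight: the more accurate the predictor, the larger alpha
    alpha = learning_rate * np.log((1 - r) / r)

    # Boost the weights of the misclassified instances, then renormalize
    weights[wrong] *= np.exp(alpha)
    weights /= weights.sum()

    stumps.append(stump)
    alphas.append(alpha)

print("predictor weights:", np.round(alphas, 3))
```

After each round the misclassified instances carry a larger share of the total weight, so the next stump is pushed toward the cases its predecessor got wrong.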

<img src='img_8.png'>

Figure 7-8 shows the decision boundaries of five consecutive predictors on the moons dataset. The first classifier gets many instances wrong, so their weights get boosted. The second classifier therefore does a better job on these instances, and so on.

The plot on the right represents the same sequence of predictors, except that the learning rate is halved (i.e., the misclassified instance weights are boosted half as much at every iteration).

As you can see, this sequential learning technique has some similarities with Gradient Descent, except that instead of tweaking a single predictor’s parameters to minimize a cost function, AdaBoost adds predictors to the ensemble, gradually making it better.


Once all predictors are trained, the ensemble makes predictions very much like bagging or pasting, except that predictors have different weights depending on their overall accuracy on the weighted training set.
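
As a rough illustration of the weighted vote, the snippet below combines the outputs of three predictors. The predictions and the predictor weights (`alphas`) are made-up numbers chosen for the example, not values from a trained model.

```python
import numpy as np

# Hypothetical outputs of three trained predictors on four instances (classes 0/1)
predictions = np.array([
    [0, 1, 1, 0],   # predictor 1
    [1, 1, 0, 0],   # predictor 2
    [1, 0, 1, 0],   # predictor 3
])
# Hypothetical predictor weights: predictor 1 was the most accurate
alphas = np.array([1.5, 0.4, 0.6])

classes = np.array([0, 1])
# votes[k, i] = total weight of the predictors that voted class k for instance i
votes = np.array([(alphas[:, None] * (predictions == k)).sum(axis=0) for k in classes])
y_pred = classes[votes.argmax(axis=0)]
print(y_pred)  # [0 1 1 0]
```

For the first instance, two of the three predictors vote for class 1, but the ensemble still predicts class 0 because predictor 1 outweighs the other two combined (1.5 versus 1.0).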

##### Note:
There is one important drawback to this sequential learning technique: it cannot be parallelized (or only partially), since each predictor can only be trained after the previous predictor has been trained and evaluated. As a result, it does not scale as well as bagging or pasting.

<img src='img_9.png'>

<img src='img_10.png'>

###  $\eta$ is the learning rate hyperparameter

<img src='img_11.png'>

<img src='img_12.png'>


Scikit-Learn actually uses a multiclass version of AdaBoost called SAMME (Stagewise Additive Modeling using a Multiclass Exponential loss function):

* When there are just two classes, SAMME is equivalent to AdaBoost.
* If the predictors can estimate class probabilities, Scikit-Learn can use a variant of SAMME called SAMME.R (the R stands for “Real”), which relies on class probabilities rather than predictions and generally performs better.

<img src='img_20.png' width=700 height=1000>

In [7]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

In [8]:
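data = load_iris()
# Hold out 30% of the iris data for evaluation
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, train_size=0.7)
# 200 decision stumps (depth-1 trees) as weak learners.
# algorithm="SAMME.R" is deprecated since scikit-learn 1.4; on recent versions, omit this argument.
ada_clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=200, algorithm="SAMME.R", learning_rate=0.5)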

In [9]:
ada_clf.fit(X_train, y_train)
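
Once an AdaBoost ensemble is fitted, it can be evaluated and inspected. The sketch below is a self-contained variant of the cells above, with a few assumptions: `random_state=42` is added for reproducibility, `n_estimators=50` is an arbitrary smaller value, and the `algorithm` argument is left at its default so the snippet does not depend on a particular scikit-learn version.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, train_size=0.7, random_state=42)

# Decision stumps as weak learners; `algorithm` is deliberately not set
clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                         n_estimators=50, learning_rate=0.5, random_state=42)
clf.fit(X_train, y_train)

# Accuracy on the held-out instances
test_accuracy = clf.score(X_test, y_test)
print("test accuracy:", test_accuracy)

# Boosting may stop before n_estimators if a predictor fits the data perfectly
print("predictors trained:", len(clf.estimators_))
# Weight given to each predictor in the final vote
print("first predictor weights:", clf.estimator_weights_[:3])
```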

##### Note:
If your AdaBoost ensemble is overfitting the training set, you can try reducing the number of estimators or more strongly regularizing the base estimator.

## Gradient Boosting


Gradient Boosting works by sequentially adding predictors to an ensemble, each one correcting its predecessor.

Instead of tweaking the instance weights at every iteration like AdaBoost does, this method tries to fit the new predictor to the residual errors made by the previous predictor.

Let’s go through a simple regression example using Decision Trees as the base predictors (like AdaBoost, Gradient Boosting works with both classification and regression tasks). This is called Gradient Tree Boosting, or Gradient Boosted Regression Trees (GBRT).



In [10]:
from sklearn.tree import DecisionTreeRegressor
tree_reg1 = DecisionTreeRegressor(max_depth=2)
tree_reg1.fit(X_train, y_train)

Now train a second DecisionTreeRegressor on the residual errors made by the first
predictor:


In [12]:
y2 = y_train - tree_reg1.predict(X_train)
tree_reg2 = DecisionTreeRegressor(max_depth=2)
tree_reg2.fit(X_train, y2)

Then we train a third regressor on the residual errors made by the second predictor:

In [13]:
# Residuals left after the first two trees: subtract from y2, not from y_train
y3 = y2 - tree_reg2.predict(X_train)
tree_reg3 = DecisionTreeRegressor(max_depth=2)
tree_reg3.fit(X_train, y3)

In [15]:
# The ensemble prediction is the sum of the predictions of all three trees
y_pred = sum(tree.predict(X_test) for tree in (tree_reg1, tree_reg2, tree_reg3))
y_pred

<img src='img_13.png'>

Figure 7-9 represents the predictions of these three trees in the left column, and the
ensemble’s predictions in the right column. In the first row, the ensemble has just one
tree, so its predictions are exactly the same as the first tree’s predictions. In the second
row, a new tree is trained on the residual errors of the first tree. On the right you can
see that the ensemble’s predictions are equal to the sum of the predictions of the first
two trees.

You can see that the ensemble’s predictions gradually get better as trees are added to the ensemble.
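
The residual-fitting procedure can also be written as a loop. The sketch below is an illustration under assumptions: it uses a synthetic noisy quadratic dataset (not the iris data from the cells above), and all variable names are hypothetical. It records the training error after each tree is added.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

# Assumed toy regression data: y = 3x^2 plus Gaussian noise
rng = np.random.default_rng(42)
X = rng.uniform(-0.5, 0.5, size=(100, 1))
y = 3 * X[:, 0] ** 2 + 0.05 * rng.normal(size=100)

trees = []
residual = y.copy()               # the first tree is fitted on the targets themselves
ensemble_pred = np.zeros_like(y)
train_mse = []

for _ in range(3):
    tree = DecisionTreeRegressor(max_depth=2, random_state=42)
    tree.fit(X, residual)             # fit the new tree on what is still unexplained
    trees.append(tree)
    ensemble_pred += tree.predict(X)  # ensemble prediction = sum of all trees so far
    residual = y - ensemble_pred      # residual errors for the next tree
    train_mse.append(mean_squared_error(y, ensemble_pred))

print(train_mse)
```

Because each tree is fitted to the residuals of the ensemble built so far, the training error cannot increase as trees are added. Whether the extra trees help on unseen data is a separate question.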

In [18]:
from sklearn.ensemble import GradientBoostingRegressor

In [21]:
gb_reg = GradientBoostingRegressor(max_depth=2, n_estimators=3, learning_rate=1.0)

In [22]:
gb_reg.fit(X_train, y_train)

The learning_rate hyperparameter scales the contribution of each tree. If you set it to a low value, such as 0.1, you will need more trees in the ensemble to fit the training set, but the predictions will usually generalize better. This is a regularization technique called shrinkage.

Shrinkage: a regularization technique that scales down the contribution of each tree through a low learning rate.
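
A small experiment can make the trade-off concrete. The sketch below assumes a synthetic noisy quadratic dataset and a hypothetical helper `train_mse`. It compares the training error of ensembles with a high and a low learning rate.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# Assumed toy regression data: y = 3x^2 plus Gaussian noise
rng = np.random.default_rng(42)
X = rng.uniform(-0.5, 0.5, size=(200, 1))
y = 3 * X[:, 0] ** 2 + 0.05 * rng.normal(size=200)

def train_mse(learning_rate, n_estimators):
    """Hypothetical helper: fit a GBRT and return its training MSE."""
    model = GradientBoostingRegressor(max_depth=2, n_estimators=n_estimators,
                                      learning_rate=learning_rate, random_state=42)
    model.fit(X, y)
    return mean_squared_error(y, model.predict(X))

mse_fast_few = train_mse(learning_rate=1.0, n_estimators=3)
mse_slow_few = train_mse(learning_rate=0.1, n_estimators=3)
mse_slow_many = train_mse(learning_rate=0.1, n_estimators=200)

print("lr=1.0,   3 trees:", mse_fast_few)
print("lr=0.1,   3 trees:", mse_slow_few)   # too few trees for such small steps
print("lr=0.1, 200 trees:", mse_slow_many)
```

With only three trees, the low learning rate leaves most of the signal unfitted; it needs many more trees to catch up. The snippet only measures training error, so it shows the cost of shrinkage, not its benefit for generalization.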

<img src='img_14.png'>

Figure 7-10 shows two GBRT ensembles trained with a low learning rate: the one on the left does not have enough trees to fit the training set, while the one on the right has too many trees and overfits the training set.

In order to find the optimal number of trees, you can use early stopping.

A simple way to implement this is to use the staged_predict() method, which returns an iterator over the predictions made by the ensemble at each stage of training (with one tree, two trees, etc.).



In [27]:
import numpy as np
from sklearn.metrics import mean_squared_error

In [29]:
gb_reg = GradientBoostingRegressor(max_depth=2, n_estimators=120)

In [30]:
gb_reg.fit(X_train, y_train)

In [35]:
errors = [mean_squared_error(y_test, y_pred) for y_pred in gb_reg.staged_predict(X_test)]


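In [32]:
best_gb_estimator = np.argmin(errors)  # 0-based index of the stage with the lowest validation error

In [36]:
best_gb_estimator

27

In [39]:
# Stage i of staged_predict() uses i + 1 trees, so the best tree count is the index plus one
gb_reg_best = GradientBoostingRegressor(max_depth=2, n_estimators=best_gb_estimator + 1)
gb_reg_best.fit(X_train, y_train)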

<img src='img_15.png'>

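It is also possible to implement early stopping by actually stopping training early (instead of training a large number of trees first and then looking back to find the optimal number).

You can do so by setting warm_start=True, which makes Scikit-Learn keep the existing trees when fit() is called again, so training continues incrementally.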

In [49]:
gbrt = GradientBoostingRegressor(max_depth=2, warm_start=True)

In [50]:
min_val_error = float("inf")

In [51]:
error_going_up = 0

In [52]:
for n_estimators in range(1, 120):
    print(">>> n_estimators: ", n_estimators)
    # With warm_start=True, fit() keeps the existing trees and only trains the new one
    gbrt.n_estimators = n_estimators
    gbrt.fit(X_train, y_train)
    y_pred = gbrt.predict(X_test)
    val_error = mean_squared_error(y_test, y_pred)
    if val_error < min_val_error:
        min_val_error = val_error
        error_going_up = 0
    else:
        error_going_up += 1
        # Stop once the validation error has failed to improve 5 iterations in a row
        if error_going_up == 5:
            break

>>> n_estimators:  1
>>> n_estimators:  2
>>> n_estimators:  3
>>> n_estimators:  4
>>> n_estimators:  5
>>> n_estimators:  6
>>> n_estimators:  7
>>> n_estimators:  8
>>> n_estimators:  9
>>> n_estimators:  10
>>> n_estimators:  11
>>> n_estimators:  12
>>> n_estimators:  13
>>> n_estimators:  14
>>> n_estimators:  15
>>> n_estimators:  16
>>> n_estimators:  17
>>> n_estimators:  18
>>> n_estimators:  19
>>> n_estimators:  20
>>> n_estimators:  21
>>> n_estimators:  22
>>> n_estimators:  23
>>> n_estimators:  24
>>> n_estimators:  25
>>> n_estimators:  26
>>> n_estimators:  27
>>> n_estimators:  28
>>> n_estimators:  29
>>> n_estimators:  30
>>> n_estimators:  31
>>> n_estimators:  32
>>> n_estimators:  33
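
As an alternative to the manual loop, GradientBoostingRegressor has built-in early stopping through its n_iter_no_change and validation_fraction hyperparameters. The sketch below is not part of the original notebook: it assumes a synthetic dataset, and the specific values (500, 0.2, 5) are arbitrary.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Assumed toy regression data: y = 3x^2 plus Gaussian noise
rng = np.random.default_rng(42)
X = rng.uniform(-0.5, 0.5, size=(300, 1))
y = 3 * X[:, 0] ** 2 + 0.05 * rng.normal(size=300)

gbrt = GradientBoostingRegressor(
    max_depth=2,
    n_estimators=500,          # upper bound on the number of trees
    validation_fraction=0.2,   # share of the training data held out internally
    n_iter_no_change=5,        # stop after 5 iterations without improvement
    random_state=42,
)
gbrt.fit(X, y)

# n_estimators_ reports how many trees were actually trained
print("trees actually trained:", gbrt.n_estimators_)
```

Here the validation set is carved out of the training data by the estimator itself, so no separate loop or warm_start is needed.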


The GradientBoostingRegressor class also supports a subsample hyperparameter, which specifies the fraction of training instances to be used for training each tree.

For example, if subsample=0.25, then each tree is trained on 25% of the training instances, selected randomly.

As you can probably guess by now, this trades a higher bias for a lower variance. It also speeds up training considerably. This technique is called Stochastic Gradient Boosting.
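
A minimal sketch of Stochastic Gradient Boosting follows, assuming a synthetic dataset and arbitrary hyperparameter values. When subsample is below 1.0, the fitted model also exposes oob_improvement_, the per-iteration improvement in loss measured on the instances left out of each tree's sample.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Assumed toy regression data: y = 3x^2 plus Gaussian noise
rng = np.random.default_rng(42)
X = rng.uniform(-0.5, 0.5, size=(400, 1))
y = 3 * X[:, 0] ** 2 + 0.05 * rng.normal(size=400)

# Each tree is trained on a random 25% of the training instances
sgb = GradientBoostingRegressor(max_depth=2, n_estimators=100,
                                subsample=0.25, random_state=42)
sgb.fit(X, y)

# One out-of-bag improvement value per boosting iteration
print(sgb.oob_improvement_.shape)
print("R^2 on the training set:", sgb.score(X, y))
```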

##### Note:
It is possible to use Gradient Boosting with other cost functions. This is controlled by the loss hyperparameter (see Scikit-Learn’s documentation for more details).
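
As one example of a different cost function, the sketch below uses loss="quantile" with alpha=0.9, so the model estimates the 90th percentile of the target rather than its mean. The dataset and hyperparameter values are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Assumed toy regression data: y = 3x^2 plus Gaussian noise
rng = np.random.default_rng(42)
X = rng.uniform(-0.5, 0.5, size=(300, 1))
y = 3 * X[:, 0] ** 2 + 0.05 * rng.normal(size=300)

# Quantile loss: alpha selects which quantile of the target to predict
upper = GradientBoostingRegressor(loss="quantile", alpha=0.9, max_depth=2,
                                  n_estimators=100, random_state=42)
upper.fit(X, y)

# Most training targets should fall at or below a 90th-percentile prediction
coverage = np.mean(y <= upper.predict(X))
print(f"fraction of training targets at or below the prediction: {coverage:.2f}")
```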