# Boosting and AdaBoost

## Learning objectives
- implement
    - your first boosted model - AdaBoost
- understand
    - the ideas behind boosting
    - how to apply boosting to decision trees and implement AdaBoost

## Intro - What is boosting?

We begin by describing an algorithm called AdaBoost (Adaptive Boosting).
AdaBoost is a classification algorithm that fits a sequence of weak classifiers to repeatedly modified versions of the data, which increasingly prioritise misclassified examples. The example labels are coded as $Y \in \{-1, 1\}$.

Weak classifiers are those whose predictions are only slightly better than random guessing.
In this case, we use classification trees with a depth of 1.
We call such limited trees "stumps".
AdaBoost converts many "weak learners" into a single "strong learner" by combining these stumps.
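
To make the idea of a stump concrete, here is a minimal pure-Python sketch of one. The names `fit_stump` and `predict_stump` are hypothetical, and the exhaustive search over thresholds is just one simple way to fit a depth-1 tree, not necessarily how a library such as scikit-learn does it. The `weights` argument anticipates the per-example weights introduced later.

```python
def fit_stump(X, y, weights):
    """Search every feature/threshold/polarity for the lowest weighted error."""
    best = None
    for j in range(len(X[0])):
        for threshold in sorted({row[j] for row in X}):
            for polarity in (1, -1):
                # Predict `polarity` on one side of the threshold, `-polarity` on the other
                preds = [polarity if row[j] <= threshold else -polarity for row in X]
                err = sum(w for w, p, t in zip(weights, preds, y) if p != t)
                if best is None or err < best[0]:
                    best = (err, j, threshold, polarity)
    return best[1:]  # (feature index, threshold, polarity)


def predict_stump(stump, X):
    """Classify each row as -1 or +1 using a single threshold on a single feature."""
    j, threshold, polarity = stump
    return [polarity if row[j] <= threshold else -polarity for row in X]


# Toy 1-D dataset that a single split can separate
X = [[1.0], [2.0], [3.0], [4.0]]
y = [-1, -1, 1, 1]
stump = fit_stump(X, y, [0.25] * 4)
print(stump, predict_stump(stump, X))
```

On this toy data one split is enough; on harder data a single stump will make mistakes, which is what boosting exploits.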

AdaBoost combines the predictions of all of the classifiers to make a prediction by evaluating:
## $H(x) = \text{sign}\left(\sum_{l=1}^{L} \alpha_l H_l(x)\right)$

This is simply the sign of a weighted combination of predictions.
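
As a small illustration of this weighted vote, the sketch below combines three hypothetical classifiers for a single example. The function name `combined_prediction` and the choice to break an exact tie in favour of +1 are assumptions made for the example.

```python
def combined_prediction(alphas, votes):
    """Return the sign of the alpha-weighted sum of the votes (each vote is -1 or +1)."""
    score = sum(alpha * vote for alpha, vote in zip(alphas, votes))
    return 1 if score >= 0 else -1  # assumption: an exact tie counts as +1


# One confident model votes -1, two less confident models vote +1
alphas = [0.9, 0.4, 0.3]
votes = [-1, 1, 1]
print(combined_prediction(alphas, votes))
```

Here the single model with the largest weight outvotes the other two, because the vote is weighted rather than a simple majority.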

### But where do the weights for each model come from?

The models are applied sequentially, and because of their limited capacity, each of them is likely to make some mistakes.
For the case where each model is a stump, which splits on a single threshold of a single feature, mistakes will be made on any dataset whose classes cannot be separated by one such split.

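The error of model $l$ is the weighted fraction of examples it misclassifies, where here $I(\cdot)$ is 1 when its argument is true and 0 otherwise:
## $err_l = \frac{\sum_{i=1}^m w_i I(y_i \neq H_l(x_i))}{\sum_{i=1}^m w_i}$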
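
The following sketch evaluates this weighted error for a hand-made set of predictions. The function name `weighted_error` and the toy numbers are illustrative assumptions.

```python
def weighted_error(weights, y_true, y_pred):
    """Weighted fraction of misclassified examples."""
    misclassified = sum(w for w, y, h in zip(weights, y_true, y_pred) if y != h)
    return misclassified / sum(weights)


y_true = [1, 1, -1, -1]
y_pred = [1, -1, -1, -1]  # only the second example is wrong

uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [1, 7, 1, 1]  # weights need not sum to 1; the formula divides by their total

print(weighted_error(uniform, y_true, y_pred))
print(weighted_error(skewed, y_true, y_pred))
```

The same single mistake costs far more when the misclassified example carries most of the weight.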

The weight of the prediction of each model is then computed based on the error rate given by:
## $\alpha_l = \frac{1}{2} \log\left(\frac{1-err_l}{err_l}\right)$
# show graph of this error curve
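
Until a plot is added, the shape of the curve can be read from a few sampled values. This is a minimal sketch using only the standard library; the name `model_weight` is an assumption.

```python
import math


def model_weight(err):
    """alpha = 0.5 * log((1 - err) / err), defined for 0 < err < 1."""
    return 0.5 * math.log((1 - err) / err)


for err in (0.05, 0.25, 0.5, 0.75, 0.95):
    print(f"err={err:.2f}  alpha={model_weight(err):+.3f}")
```

The weight is large and positive for accurate models, zero at an error of 0.5, and negative for models that do worse than chance, so their votes are effectively flipped.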

Each model other than the first is trained on a dataset bootstrapped from the original.
**But the sampling of examples is weighted.**
The weight of each example is increased if it was incorrectly classified by the previous model, and decreased if it was classified correctly.
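
Weighted bootstrapping can be sketched with the standard library's `random.choices`, which samples with replacement in proportion to the given weights. The weights and the helper name `weighted_bootstrap` below are made up for illustration.

```python
import random


def weighted_bootstrap(n_examples, weights, rng):
    """Sample example indices with replacement, in proportion to their weights."""
    return rng.choices(range(n_examples), weights=weights, k=n_examples)


rng = random.Random(0)  # seeded so that the sketch is repeatable
weights = [0.05, 0.85, 0.05, 0.05]  # example 1 was misclassified, so it is heavily weighted

# Draw many 4-example bootstraps and count how often each example appears
counts = [0, 0, 0, 0]
for _ in range(250):
    for idx in weighted_bootstrap(4, weights, rng):
        counts[idx] += 1
print(counts)
```

The heavily weighted example dominates the resampled datasets, so the next model sees it many times.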

For the first model in the sequence, the importance of classifying each example correctly is equal.
That is, we weight the error contribution of each example in the dataset by the same amount, $w_i = \frac{1}{m}$.
To sample the bootstrapped dataset for the next model in the sequence to be trained on, we set the weight of each example to:
## $w_i \leftarrow w_i \cdot e^{\alpha_l \cdot I(y_i \neq H_l(x_i))}$
In this update, unlike in the error calculation, $I(true) = +1$ and $I(false) = -1$, so the exponent is $+\alpha_l$ for a misclassified example and $-\alpha_l$ for a correctly classified one.
This shifts the weight of correctly classified examples down and that of incorrectly classified examples up.
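
A worked example of one update step is sketched below. The function name `update_weights` is hypothetical, and renormalising the weights so that they sum to 1 is an added assumption (it lets them be used directly as sampling probabilities).

```python
import math


def update_weights(weights, y_true, y_pred, alpha):
    """Scale weights by e^{+alpha} if misclassified, e^{-alpha} if correct, then renormalise."""
    updated = [
        w * math.exp(alpha if y != h else -alpha)
        for w, y, h in zip(weights, y_true, y_pred)
    ]
    total = sum(updated)
    return [w / total for w in updated]  # assumption: keep the weights summing to 1


y_true = [1, 1, -1, -1]
y_pred = [1, -1, -1, -1]  # only the second example is misclassified
weights = [0.25] * 4

err = 0.25  # weighted error of these predictions under uniform weights
alpha = 0.5 * math.log((1 - err) / err)
new_weights = update_weights(weights, y_true, y_pred, alpha)
print([round(w, 3) for w in new_weights])
```

After the update, the single misclassified example holds half of the total weight, and the three correct examples share the other half.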

### What does this weight calculation do?
Most examples may be correctly classified by our very simple weak classifier stumps.
It is the edge cases that need extra attention.
So, sequentially, the importance of examples that the previous model misclassified is increased, and the importance of those it classified correctly is decreased.
Models later in the sequence hence focus on harder-to-classify examples.
As the sequence progresses, the importance of easy-to-classify examples diminishes and tends to zero.
This effectively removes them from the dataset, leaving fewer examples for the later models to classify.
Fewer examples can be separated with a simpler decision boundary.

The weighting of each model's prediction serves to increase the influence of models with a low weighted error: $\alpha_l$ grows as $err_l$ shrinks, and is zero for a model that does no better than random guessing ($err_l = 0.5$).

## Algorithm outline

![](images/boosting.jpg)

In [1]:
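import numpy as np
import sklearn.tree

class BoostedModel:
    def __init__(self, model_class, layers=10, **model_kwargs):
        self.model_class = model_class
        self.layers = layers
        self.model_kwargs = model_kwargs  # passed to each weak model, e.g. max_depth=1 for stumps
        self.models = []
        self.model_weights = []

    def calc_error(self, Y_hat, Y, weights):
        """Compute the weighted error rate of a classifier"""
        return np.sum(weights * (Y_hat != Y)) / np.sum(weights)

    def calc_model_weight(self, error):
        """Compute alpha, the weight of a model's vote"""
        error = np.clip(error, 1e-10, 1 - 1e-10)  # avoid log(0) and division by zero
        return 0.5 * np.log((1 - error) / error)

    def resample(self, X, Y, weights):
        """Bootstrap a dataset, sampling examples in proportion to their weights"""
        idxs = np.random.choice(len(X), size=len(X), replace=True, p=weights)
        return X[idxs], Y[idxs]

    def fit(self, X, Y):
        """Fit the sequence of models. Labels Y must be coded as -1 or +1"""
        X, Y = np.asarray(X), np.asarray(Y)
        self.models, self.model_weights = [], []
        weights = np.full(len(X), 1 / len(X))  # every example starts equally important
        for layer_idx in range(self.layers):
            X_sample, Y_sample = self.resample(X, Y, weights)
            model = self.model_class(**self.model_kwargs)
            model.fit(X_sample, Y_sample)
            Y_hat = model.predict(X)
            alpha = self.calc_model_weight(self.calc_error(Y_hat, Y, weights))
            # exponent is +alpha for misclassified examples and -alpha for correct ones
            weights = weights * np.exp(alpha * np.where(Y_hat != Y, 1, -1))
            weights = weights / np.sum(weights)  # normalise so they can be used as probabilities
            self.models.append(model)
            self.model_weights.append(alpha)

    def predict(self, X):
        """Return the sign of the alpha-weighted sum of the models' predictions"""
        prediction = np.zeros(len(X))
        for model, alpha in zip(self.models, self.model_weights):
            prediction += alpha * model.predict(X)
        return np.sign(prediction)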


## Let's visualise our predictions

First, visualise the predictions of each classifier, then visualise their successive combination.
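
Before plotting, it can help to compute what the successive combination looks like numerically. The sketch below is a minimal, self-contained illustration: the model weights and per-model predictions are hard-coded, hypothetical values rather than the output of a trained model, and the name `staged_predictions` is an assumption.

```python
def staged_predictions(alphas, model_predictions):
    """Yield the ensemble's predictions after adding each model in turn."""
    running = [0.0] * len(model_predictions[0])
    for alpha, preds in zip(alphas, model_predictions):
        running = [r + alpha * p for r, p in zip(running, preds)]
        yield [1 if r >= 0 else -1 for r in running]


y_true = [1, 1, -1, -1, 1]
alphas = [0.7, 0.6, 0.5]
model_predictions = [
    [1, 1, -1, -1, -1],  # wrong on the last example
    [-1, 1, -1, -1, 1],  # wrong on the first example
    [1, 1, 1, -1, 1],    # wrong on the third example
]

accuracies = []
for stage in staged_predictions(alphas, model_predictions):
    correct = sum(p == y for p, y in zip(stage, y_true))
    accuracies.append(correct / len(y_true))
print(accuracies)
```

Each model here makes one mistake, but on a different example, so the weighted combination of all three classifies every example correctly. The per-stage predictions are what you would colour in a plot of the decision regions.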

## Challenges
- perform adaptive boosting with a model that is not a decision tree
- adapt the above code to work for a regression model