# Boosting and AdaBoost

## Learning objectives
- implement
    - your first boosted model - AdaBoost
- understand
    - the ideas behind boosting
    - how to apply boosting to decision trees and implement AdaBoost

## Intro - What is boosting?

We begin by describing an algorithm called AdaBoost (Adaptive Boosting).
AdaBoost is a classification algorithm that fits a sequence of weak classifiers to repeatedly modified versions of the data, which increasingly prioritise misclassified examples, and then combines their predictions. The example labels are coded as $Y \in \{-1, 1\}$.

Weak classifiers are those whose predictions are only slightly better than random guessing.
In this case, we use classification trees with a depth of 1.
We call such limited trees "stumps".
AdaBoost converts many "weak learners" into a single "strong learner" by combining these stumps.
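
To make "stump" concrete, here is a minimal NumPy sketch of a depth-1 classifier. It tries every threshold on every feature and keeps the single split with the lowest weighted error. The helper names `fit_stump` and `predict_stump` and the toy data are made up for illustration; the `AdaBoost` class in this notebook relies on scikit-learn's `DecisionTreeClassifier(max_depth=1)` instead.

```python
import numpy as np

def fit_stump(X, y, weights):
    """Return ((feature, threshold, polarity), error) for the best single split.

    The stump predicts `polarity` where X[:, feature] > threshold
    and `-polarity` everywhere else.
    """
    best = (0, 0.0, 1)
    best_error = np.inf
    for feature in range(X.shape[1]):
        for threshold in np.unique(X[:, feature]):
            for polarity in (1, -1):
                predictions = np.where(X[:, feature] > threshold, polarity, -polarity)
                error = np.sum(weights[predictions != y])  # weighted share of mistakes
                if error < best_error:
                    best_error = error
                    best = (feature, threshold, polarity)
    return best, best_error

def predict_stump(stump, X):
    feature, threshold, polarity = stump
    return np.where(X[:, feature] > threshold, polarity, -polarity)

# Toy 1-D data (hypothetical): no single threshold separates the classes
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([-1, -1, 1, -1, 1])
weights = np.full(len(X), 1 / len(X))

stump, stump_error = fit_stump(X, y, weights)
stump_predictions = predict_stump(stump, X)
print(stump, stump_error)
```

On this toy data even the best stump gets one of the five examples wrong, which is exactly the kind of limited model that boosting is designed to combine.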

AdaBoost combines the predictions of all of the classifiers to make a prediction by evaluating:
## $H(x) = \text{sign}\left(\sum_{l=1}^{L} \alpha_l H_l(x)\right)$

This is simply the sign of a weighted combination of predictions.
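
As a small worked example of that formula, the sketch below combines three stumps on four examples. The predictions and the model weights are invented numbers, chosen so that the weighted vote and a plain majority vote disagree.

```python
import numpy as np

# Hypothetical predictions H_l(x): one row per stump, one column per example
stump_predictions = np.array([
    [ 1,  1, -1, -1],
    [-1,  1, -1,  1],
    [-1, -1,  1,  1],
])
# Hypothetical model weights alpha_l (the first stump is trusted the most)
alphas = np.array([1.0, 0.3, 0.4])

weighted_sum = alphas @ stump_predictions        # sum over l of alpha_l * H_l(x)
ensemble_prediction = np.sign(weighted_sum)      # H(x)

# For comparison: an unweighted majority vote over the same stumps
majority_vote = np.sign(stump_predictions.sum(axis=0))

print(ensemble_prediction)
print(majority_vote)
```

On the first and last examples the two weaker stumps outvote the first one in the majority vote, but not in the weighted vote, because their combined weight (0.7) is smaller than the first stump's weight (1.0).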

### But where do the weights for each model come from?

The models are trained sequentially, and because of their limited capacity, each of them is likely to make some mistakes.
For the case where each model is a stump, mistakes will be made on any dataset that cannot be separated by a single threshold on a single feature.

The error of a model is calculated as:
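## $err_l$ = weighted proportion of examples classified incorrectly
## $err_l = \frac{\sum_{i=1}^m w_i I(y_i \neq H_l(x_i))}{\sum_{i=1}^m w_i}$

Here $w_i$ is the weight of example $i$, $m$ is the number of examples, and $I(\cdot)$ is 1 when its argument is true and 0 otherwise.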
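
The weighted error can be sketched in a few lines of NumPy. The function name `weighted_error` and the numbers are illustrative assumptions; the point is that the same single mistake costs more when the misclassified example carries more weight.

```python
import numpy as np

def weighted_error(predictions, labels, weights):
    """Share of the total example weight that sits on misclassified examples."""
    incorrect = (predictions != labels).astype(float)   # I(y_i != H_l(x_i))
    return np.sum(weights * incorrect) / np.sum(weights)

labels      = np.array([ 1, -1,  1, -1,  1])
predictions = np.array([ 1, -1, -1, -1,  1])   # only the third example is wrong

uniform = np.full(5, 1 / 5)                     # every example matters equally
skewed  = np.array([0.1, 0.1, 0.6, 0.1, 0.1])   # the misclassified example matters most

uniform_error = weighted_error(predictions, labels, uniform)   # roughly 0.2
skewed_error  = weighted_error(predictions, labels, skewed)    # roughly 0.6
print(uniform_error, skewed_error)
```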

The weight of the prediction of each model is then computed based on the error rate given by:
## $\alpha_l = \frac{1}{2} \log\left(\frac{1 - err_l}{err_l}\right)$
# show graph of this error curve

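- Large negative model weight = your model sucks (error close to 1), so its predictions get flipped in the final vote.
- Large positive model weight = your model rocks (error close to 0).
- Zero model weight = your model is as good as random guessing (error of 0.5).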
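
To get a feel for the shape of this curve without a plot, the sketch below evaluates the model weight at a handful of error rates. The function name `model_weight` and the chosen error values are assumptions for illustration.

```python
import numpy as np

def model_weight(error):
    """alpha = 0.5 * log((1 - err) / err)"""
    return 0.5 * np.log((1 - error) / error)

errors = np.array([0.01, 0.25, 0.5, 0.75, 0.99])
alphas = model_weight(errors)

for err, alpha in zip(errors, alphas):
    print(f"err = {err:.2f} -> alpha = {alpha:+.3f}")
```

The values are antisymmetric around an error of 0.5: a model that is wrong 75% of the time gets the same magnitude of weight as one that is wrong 25% of the time, but with the opposite sign. The weight also grows without bound as the error approaches 0 or 1, so a practical implementation needs to guard against an error of exactly 0 or 1.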

Each model other than the first is trained on a dataset bootstrapped from the original.
**But the sampling of examples will be weighted.**
The weight of each example is increased if it was incorrectly classified by the previous model, and decreased if it was classified correctly.
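
A rough sketch of what weighted sampling does, using made-up weights for four examples: the example with the largest weight shows up in the bootstrapped dataset far more often than the others. The seed is there only to make the sketch repeatable.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable

# Hypothetical weights: example 0 was misclassified, so it has been upweighted
example_weights = np.array([0.7, 0.1, 0.1, 0.1])

# Draw a bootstrap sample of indices, picking index i with probability example_weights[i]
sampled_idxs = rng.choice(len(example_weights), size=1000, replace=True, p=example_weights)

# How often each example ended up in the sample
counts = np.bincount(sampled_idxs, minlength=len(example_weights))
print(counts)
```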

For the first model in the sequence, the importance of classifying each example correctly is equal.
That is, we weight the error contribution for each example in the dataset by the same equal amount, $w_i= \frac{1}{m}$.
To sample the bootstrapped dataset for the next model in the sequence, we update the weight of each example to:
## $w_i \leftarrow w_i \cdot e^{-\alpha_l \cdot y_i H_l(x_i)}$
The weights are then normalised so that they sum to 1 and can be used as sampling probabilities.
Let's consider what this means for a variety of cases:
- positive model weight and correct classification: weight of example pushed down
- negative model weight and correct classification: weight of example pushed up
- positive model weight and incorrect classification: weight of example pushed up
- negative model weight and incorrect classification: weight of example pushed down
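
The four cases above can be checked numerically. This is a small sketch with an arbitrary starting weight and arbitrary model weights of plus or minus 0.5; `update_weight` is a name made up for the example, and the normalisation step is left out so that each case can be read on its own.

```python
import numpy as np

def update_weight(w, alpha, y, h):
    """w <- w * exp(-alpha * y * h), where y is the label and h the prediction."""
    return w * np.exp(-alpha * y * h)

w = 0.25  # starting weight of a single example (arbitrary)
cases = {
    "positive alpha, correct":   update_weight(w, alpha=0.5,  y=1, h=1),
    "negative alpha, correct":   update_weight(w, alpha=-0.5, y=1, h=1),
    "positive alpha, incorrect": update_weight(w, alpha=0.5,  y=1, h=-1),
    "negative alpha, incorrect": update_weight(w, alpha=-0.5, y=1, h=-1),
}

for name, new_w in cases.items():
    direction = "down" if new_w < w else "up"
    print(f"{name}: {w:.3f} -> {new_w:.3f} ({direction})")
```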


### What does this weight calculation do?
Most examples may be correctly classified by our very simple weak classifier stumps.
It is the edge cases that need extra attention.
So, sequentially, the importance of examples that the previous model could not classify correctly is increased, and the importance of those it classified correctly is decreased.
Models later in the sequence hence focus on harder-to-classify examples.
As the sequence progresses, the importance of easy-to-classify examples diminishes and tends to zero.
This effectively removes them from the dataset, leaving fewer examples for the later models to classify.
Fewer examples can be separated with a simpler decision boundary.
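
The sketch below illustrates this decay under a deliberately simplified assumption: every round uses the same model weight of 0.5, three examples are always classified correctly, and one is always misclassified. In real AdaBoost each new stump is refit, so an example is rarely misclassified in every round, but the direction of the effect is the same.

```python
import numpy as np

alpha = 0.5                          # assumed constant model weight, for illustration only
weights = np.full(4, 0.25)           # four examples with equal starting weights
margins = np.array([1, 1, 1, -1])    # y_i * H_l(x_i): the last example is always misclassified

history = [weights.copy()]
for _ in range(10):
    weights = weights * np.exp(-alpha * margins)   # push weights up or down
    weights /= weights.sum()                       # renormalise to sum to 1
    history.append(weights.copy())
history = np.array(history)

print(history[:, 0])   # weight of an easy example over the rounds
print(history[:, 3])   # weight of the hard example over the rounds
```

After ten rounds almost all of the weight sits on the hard example, so a weighted sample would consist almost entirely of copies of it.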

Weighting each model's prediction increases the influence of models that accurately classify the examples in the bootstrapped dataset they were trained on.

# algorithm outline

![](images/boosting.jpg)

```python
import numpy as np
import sklearn.tree
from utils import get_classification_data


class AdaBoost:
    def __init__(self, n_layers=10):
        self.n_layers = n_layers
        self.models = []
        self.model_weights = []  # one alpha per trained model

    def calc_model_error(self, predictions, labels, weights):
        """Weighted error rate: share of example weight that is misclassified"""
        incorrect = (predictions != labels).astype(float)
        return np.sum(weights * incorrect) / np.sum(weights)

    def encode_labels(self, labels):
        """Map {0, 1} labels to {-1, +1} without modifying the caller's array"""
        return np.where(labels == 0, -1, 1)

    def calc_model_weight(self, error, eps=1e-10):
        """alpha = 0.5 * log((1 - err) / err), with err clipped away from 0 and 1"""
        error = np.clip(error, eps, 1 - eps)
        return 0.5 * np.log((1 - error) / error)

    def sample(self, X, Y, weights):
        """Bootstrap sample in which example i is drawn with probability weights[i]"""
        idxs = np.random.choice(len(X), len(X), p=weights, replace=True)
        return X[idxs], Y[idxs]

    def update_weights(self, weights, predictions, labels, model_weight):
        """w_i <- w_i * exp(-alpha * y_i * H(x_i)), renormalised to sum to 1"""
        weights = weights * np.exp(-model_weight * predictions * labels)
        return weights / np.sum(weights)

    def fit(self, X, Y):
        labels = self.encode_labels(Y)
        example_weights = np.full(len(X), 1 / len(X))  # w_i = 1/m for the first model
        features_sample, labels_sample = X, labels     # first model sees the full dataset
        for layer_idx in range(self.n_layers):
            model = sklearn.tree.DecisionTreeClassifier(max_depth=1)
            model.fit(features_sample, labels_sample)

            # evaluate on the original dataset so predictions line up with labels
            predictions = model.predict(X)
            model_error = self.calc_model_error(predictions, labels, example_weights)
            model_weight = self.calc_model_weight(model_error)
            self.models.append(model)
            self.model_weights.append(model_weight)

            # reweight the examples and draw the training set for the next model
            example_weights = self.update_weights(example_weights, predictions, labels, model_weight)
            features_sample, labels_sample = self.sample(X, labels, example_weights)
            print(f'trained model {layer_idx}: error={model_error:.3f}, weight={model_weight:.3f}')

    def predict(self, X):
        """Sign of the weighted sum of the stump predictions"""
        prediction = np.zeros(len(X))
        for model, model_weight in zip(self.models, self.model_weights):
            prediction += model_weight * model.predict(X)
        return np.sign(prediction)


X, Y = get_classification_data()
adaBoost = AdaBoost()
adaBoost.fit(X, Y)
```

## Let's visualise our predictions

Firstly visualise the predictions for each classifier, then visualise their successive combination.

## How can we boost other models?

## Challenges
- perform adaptive boosting with a model that is not a decision tree
- adapt the above code to work for a regression model