# Boosting and AdaBoost



## Introduction



>The boosting ensemble method combines a sequence of weak classifiers that are fit on successively modified versions of a dataset. This method increasingly prioritises the examples misclassified by the previous model.



## Bagging vs Boosting



Not all ensemble methods are designed to regularise the overall model. In boosting, ensembling is instead used to build a model with a greater capacity than any of its individual members.

Similar to bagging, boosting creates an ensemble of weak learners (models that perform only slightly better than random guessing) to form a single strong learner.
Bagging simply combines the predictions of different models that were fit independently (on bootstrapped samples of the same dataset), so they can be trained in parallel.
Conversely, boosting combines the predictions of models that were each fit depending on the performance of the previous model, so they must be trained in sequence.

<p align=center><img width=900 src=images/bagging_vs_boosting.jpg></p>
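The difference in how the two methods draw their training samples can be sketched as follows. This is a minimal illustration rather than a full training loop: the `misclassified` mask and the factor of 4 used to upweight mistakes are hypothetical values chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 10
indices = np.arange(m)

# Bagging: every bootstrap sample is drawn with the same uniform probabilities,
# so no sample depends on how any other model performed
bagging_sample = rng.choice(indices, size=m, replace=True)

# Boosting: the sampling probabilities depend on the previous model's mistakes
# (hypothetical mask: the previous model got the last two examples wrong)
misclassified = np.array([False] * 8 + [True] * 2)
weights = np.where(misclassified, 4.0, 1.0)  # hypothetical upweighting factor
weights /= weights.sum()  # normalise so that the weights form a distribution
boosting_sample = rng.choice(indices, size=m, replace=True, p=weights)

print('bagging sample: ', bagging_sample)
print('boosting sample:', boosting_sample)
```

In bagging, the sampling probabilities never change, so the models can be trained in parallel. In boosting, the probabilities for one model are only known once the previous model has been evaluated, which forces sequential training.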



## AdaBoost



Boosting algorithms vary in how they adjust the weights of the examples that are sampled in each successive bootstrapped dataset and in how they weight the contribution of each hypothesis to the final prediction.
AdaBoost, short for **adaptive boosting**, is the most popular boosting algorithm for classification problems.

In AdaBoost, the example labels are coded as $Y=-1$ and $Y=+1$ for examples in the negative and positive classes, respectively.

Furthermore, each model is a very weak classification tree with a depth of 1. Such limited trees are called 'stumps'.
AdaBoost converts many 'weak learners' into a single 'strong learner' by combining these stumps: the final prediction is made by evaluating **the sign** of the following term, which combines the predictions of all of the classifiers:

<p align=center><img width=900 src=images/adaboost_hypothesis.jpg></p>

This is simply the sign of a weighted combination of predictions.

If the sign is positive, the example is classified as a member of the positive class; otherwise, it is classified as a member of the negative class. Intuitively, the predictions of the models in the boosting sequence push or pull the hypothesis across the decision boundary, which lies at zero, with the prediction from each model scaled by that model's weight, $\alpha$.
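As a small numeric illustration of this weighted vote, the sketch below combines three stumps. The predictions and the $\alpha$ values are made up for the example.

```python
import numpy as np

# Hypothetical predictions of three stumps (rows) for four examples (columns)
stump_predictions = np.array([
    [+1, +1, -1, -1],
    [+1, -1, -1, +1],
    [-1, +1, -1, +1],
])
model_weights = np.array([0.9, 0.4, 0.3])  # hypothetical alpha values

# Weighted combination of the predictions, one value per example
weighted_sum = model_weights @ stump_predictions
# The final prediction is the sign of that combination
final_prediction = np.sign(weighted_sum)

print('weighted sum:    ', weighted_sum)
print('final prediction:', final_prediction)
```

For the fourth example, two of the three stumps vote for the positive class, but the stump with the largest weight votes negative and outweighs them.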



### The origin of the weights for each model



The models are applied sequentially, and due to their limited capacity, each of them is likely to make mistakes.

The error of a model is calculated as follows:

<p align=center><img width=900 src=images/err_l.jpg></p>
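A minimal sketch of a weighted error rate, assuming the error is the sum of the weights of the misclassified examples divided by the sum of all weights (this matches `calc_model_error` in the implementation later in this notebook). The labels, predictions and weights are hypothetical.

```python
import numpy as np

def weighted_error(predictions, labels, example_weights):
    """Fraction of the total example weight that sits on misclassified examples."""
    incorrect = (predictions != labels).astype(float)
    return np.sum(example_weights * incorrect) / np.sum(example_weights)

labels = np.array([+1, +1, -1, -1, +1])
predictions = np.array([+1, -1, -1, -1, +1])  # only the second example is wrong

uniform_weights = np.full(5, 1 / 5)
focused_weights = np.array([0.1, 0.6, 0.1, 0.1, 0.1])  # second example emphasised

error_uniform = weighted_error(predictions, labels, uniform_weights)
error_focused = weighted_error(predictions, labels, focused_weights)
print(f'error with uniform weights: {error_uniform:.2f}')
print(f'error with focused weights: {error_focused:.2f}')
```

The same single mistake costs more when the misclassified example carries a larger weight.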

Subsequently, the weight of the prediction of each model is computed based on the error rate given by

<p align=center><img width=900 src=images/boosting_model_weight.PNG></p>

A large negative model weight indicates a poorly performing model, i.e. one whose weighted error is above 0.5, so its predictions count against the class that it predicts.
A large positive model weight indicates a high-performing model.
A model weight of zero indicates a model that is equivalent to one that makes random guesses, so its predictions are ignored.
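The sketch below evaluates the model weight for a few error rates, assuming the commonly used AdaBoost form $\alpha = \frac{1}{2}\ln\left(\frac{1 - \text{err}}{\text{err}}\right)$. The implementation later in this notebook uses the same expression with a small `delta` added for numerical stability.

```python
import math

def model_weight(error):
    """Model weight for a weighted error rate strictly between 0 and 1."""
    return 0.5 * math.log((1 - error) / error)

for error in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(f'error = {error:.1f} -> model weight = {model_weight(error):+.3f}')
```

The weight is positive for errors below 0.5, zero at exactly 0.5 and negative above 0.5. The expression is undefined for an error of exactly 0 or 1, which is why a stabilising term is useful in practice.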

The weights of each example increase if they were incorrectly classified by the previous model and decrease if they were classified correctly.

For the first model in the sequence, the importance of classifying each example correctly is equal. That is, we weight the error contribution for each example in the dataset by the same amount, $w_i= \frac{1}{m}$.
Before sampling the bootstrapped dataset on which the next model in the sequence is trained, we set the weight of each example to the following:

<p align=center><img width=900 src=images/boosting_example_weight.jpg></p>

Next, we consider what this means for a variety of cases.
- Positive model weight and correct classification: weight of the example pushed down.
- Negative model weight and correct classification: weight of the example pushed up.
- Positive model weight and incorrect classification: weight of the example pushed up.
- Negative model weight and incorrect classification: weight of the example pushed down.
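The four cases can be checked numerically. The sketch assumes that each example weight is scaled by the factor $e^{-\alpha y h(x)}$ before normalisation, where $h(x)$ is the model's prediction; this is the factor computed in `update_weights` in the implementation later in this notebook. The value $|\alpha| = 1$ is arbitrary.

```python
import math

def weight_factor(model_weight, prediction, label):
    """Factor applied to an example weight before normalisation."""
    return math.exp(-model_weight * prediction * label)

# The label is +1 in every case; a prediction of -1 is therefore incorrect
cases = {
    'positive model weight, correct':   weight_factor(+1.0, +1, +1),
    'negative model weight, correct':   weight_factor(-1.0, +1, +1),
    'positive model weight, incorrect': weight_factor(+1.0, -1, +1),
    'negative model weight, incorrect': weight_factor(-1.0, -1, +1),
}

for name, factor in cases.items():
    direction = 'down' if factor < 1 else 'up'
    print(f'{name}: factor {factor:.3f}, weight pushed {direction}')
```

A factor below 1 shrinks the example weight and a factor above 1 grows it, matching the four cases listed above.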

**Note**: Even though a model at a given position in the boosting sequence may not be trained on every example in the dataset (because not every example is necessarily drawn into its training sample), the weights for **every** example are updated based on whether the model classifies that example correctly.
If this were not the case, the weights of examples that were classified correctly by previous models (and are therefore unlikely to be sampled as training data for any subsequent model) would never be updated again.
Updating every weight prevents us from focusing ever more narrowly on misclassified examples and losing sight of the big picture, i.e. achieving a high performance for all examples.



### The role of this weight calculation
Most examples can be correctly classified by very simple, weak classifier stumps. It is the edge cases that require extra attention.
Therefore, at each step in the sequence, the importance of examples that were misclassified by the previous model is increased, and the importance of those that were classified correctly is decreased.
Thus, models later in the sequence focus on examples that are difficult to classify. As the sequence progresses, the importance of easy-to-classify examples diminishes, tending towards zero.
This effectively removes them from the dataset, leaving fewer examples for the later models to classify. These few remaining examples can be separated with a relatively simple decision boundary.

Weighting each model's prediction increases the influence of the models that achieve a low weighted error and reduces the influence of those that do not.



## AdaBoost Algorithm Outline

- Initialise each example weight as $\frac{1}{m}$.
- For each model in the boosting sequence,
    - create a bootstrapped dataset by sampling from the original dataset, with probabilities given by the example weights.
    - fit the model on this bootstrapped dataset.
    - compute the proportion of incorrect predictions, weighted by the corresponding example weights.
    - use this error to compute the model weight.
    - increase the example weight of incorrectly classified examples, and decrease the example weight of correctly classified examples.

In [None]:
# Run this cell to download the helper modules needed to run the next cells
!wget "https://aicore-files.s3.amazonaws.com/Data-Science/data_utils/get_colors.py" "https://aicore-files.s3.amazonaws.com/Data-Science/data_utils/utils.py"

In [None]:
import sklearn.tree
from utils import get_classification_data, calc_accuracy, visualise_predictions, show_data
import numpy as np
import matplotlib.pyplot as plt
import json

def encode_labels(labels):
    """Return a copy of the labels with class 0 recoded as -1 (class 1 stays +1)"""
    return np.where(labels == 0, -1, labels)

class AdaBoost:
    def __init__(self, n_layers=20):
        self.n_layers = n_layers
        self.models = [] # init empty list of models

    def sample(self, X, Y, weights):
        idxs = np.random.choice(range(len(X)), size=len(X), replace=True, p=weights)
        X = X[idxs]
        Y = Y[idxs]
        return X, Y

    def calc_model_error(self, predictions, labels, example_weights):
        """Compute the classifier error rate"""
        diff = predictions != labels
        diff = diff.astype(float)
        diff *= example_weights
        diff /= np.sum(example_weights)
        return np.sum(diff)

    def calc_model_weight(self, error, delta=0.01):
        """Compute the model weight from its error rate; delta avoids division by zero and log(0)"""
        z = (1 - error) / (error + delta) + delta
        return 0.5 * np.log(z)

    def update_weights(self, predictions, labels, model_weight, example_weights):
        """Scale each example weight by exp(-alpha * y * h(x)), then renormalise"""
        # multiplying by the previous weights (the standard AdaBoost update) lets the emphasis accumulate across models
        weights = example_weights * np.exp(- model_weight * predictions * labels)
        weights /= np.sum(weights)
        return weights

    def fit(self, X, Y):
        example_weights = np.full(len(X), 1/len(X)) # initially, every example is equally important
        for layer_idx in range(self.n_layers):
            model = sklearn.tree.DecisionTreeClassifier(max_depth=1) # a stump
            bootstrapped_X, bootstrapped_Y = self.sample(X, Y, example_weights)
            model.fit(bootstrapped_X, bootstrapped_Y)
            predictions = model.predict(X) # make predictions for all examples
            model_error = self.calc_model_error(predictions, Y, example_weights)
            model_weight = self.calc_model_weight(model_error)
            model.weight = model_weight
            self.models.append(model)
            example_weights = self.update_weights(predictions, Y, model_weight, example_weights)
            # print(f'trained model {layer_idx}')
            # print()

    def predict(self, X):
        prediction = np.zeros(len(X))
        for model in self.models:
            prediction += model.weight * model.predict(X)
        prediction = np.sign(prediction) # comment out this line to visualise the predictions in a more interpretable way
        return prediction

    def __repr__(self):
        # represent the ensemble by the list of its model weights
        return json.dumps([m.weight for m in self.models])

X, Y = get_classification_data(sd=1)
Y = encode_labels(Y)
adaBoost = AdaBoost()
adaBoost.fit(X, Y)
predictions = adaBoost.predict(X)
print(f'accuracy: {calc_accuracy(predictions, Y)}')
visualise_predictions(adaBoost.predict, X, Y)
show_data(X, Y)
print(adaBoost)

In [None]:
fig = plt.figure()
fig.add_subplot(211)
X, Y = get_classification_data(variant='circles')
Y = encode_labels(Y) # AdaBoost expects labels coded as -1 and +1

for i in range(1, 21): # start at 1, since an ensemble with no models cannot make meaningful predictions
    adaBoost = AdaBoost(n_layers=i)
    adaBoost.fit(X, Y)
    predictions = adaBoost.predict(X)
    print(f'model {i}')
    print(f'accuracy: {calc_accuracy(predictions, Y)}')
    print(f'weights: {[ round(m.weight, 2) for m in adaBoost.models]}')
    visualise_predictions(adaBoost.predict, X, Y)
    # show_data(X, Y)
    print()

## Sklearn Implementation

In [None]:
import sklearn.ensemble

adaBoost = sklearn.ensemble.AdaBoostClassifier()
adaBoost.fit(X, Y)
predictions = adaBoost.predict(X)
print(f'accuracy: {calc_accuracy(predictions, Y)}')
visualise_predictions(adaBoost.predict, X, Y)
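
For a self-contained run that does not depend on the helper functions above, the following sketch fits `AdaBoostClassifier` on synthetic data from `sklearn.datasets.make_circles`. The dataset parameters and `n_estimators=50` are choices made for this example. By default, each member of the ensemble is a decision tree of depth 1, i.e. a stump.

```python
from sklearn.datasets import make_circles
from sklearn.ensemble import AdaBoostClassifier

# Synthetic two-class dataset: one noisy circle inside another
X_demo, y_demo = make_circles(n_samples=200, noise=0.1, factor=0.5, random_state=0)

model = AdaBoostClassifier(n_estimators=50, random_state=0)
model.fit(X_demo, y_demo)

train_accuracy = model.score(X_demo, y_demo)  # mean accuracy on the training data
first_stump_depth = model.estimators_[0].get_depth()

print(f'training accuracy: {train_accuracy:.2f}')
print(f'number of fitted stumps: {len(model.estimators_)}')
print(f'depth of the first stump: {first_stump_depth}')
```

scikit-learn handles the label encoding internally, so the labels do not need to be coded as -1 and +1 here.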


## Conclusion

At this point, you should have a good understanding of

- how to implement a boosted model, AdaBoost.
- boosting and how to apply it to decision trees.