# Adaptive Boosting (Adaboost)

Adaboost is an ensemble learning method that combine multiple weak classifiers **(Stump)** to build one strong classifier. It is to iteratively train a sequence of stumps, each of which **focus on misclassified instance** classified by previous stump

**Step 1: Assign a sample weight for each sample <br>**
sample weight = $\frac{1}{N}$ <br>

**Step 2: Do a boostrap aggregating (bagging)**

**Step 3: Calculate the Gini Impurity for each variable <br>**
Gini Impurity for a Leaf = 1 - [\(probability of 'Yes'\)$^2$ + \(probability of 'No'\)$^2$] : $Gini=1 - \sum\limits_{i-1}^{n}(p_i)^2$ <br>
The variable with **lowest** Gini Impurity will be used to split the data

**Step 4: Build the stump** <br>
Use the variable with **lowest** Gini Impurity to build a stump

**Step 5: Use the stump to predict the train data** <br>
calcuate the total error which is the **sum of weights of incorrectly** classified samples <br>
total error = $\sum weights$

**Step 6: Calculate the Amount of say of each stump** <br>
It is the importance / weight of a stump <br>
***this weight is the importance of stump which is different from the sample weight assigned initially <br>
_Amount of say_ = $\frac{1}{2} \log(\frac{1 - \text{total error}}{\text{total error}})$ <br>
_The higher, the more important_

**Step 7: Update the sample weights** <br>
$\text{Incorrect sample} = \text{Sample weight} * e^{\text{amount of say}}$ <br>
$\text{Correct sample} = \text{Sample weight} * e^{-\text{amount of say}}$

Remember to normalize the sample weights since the sum up is not equal to 1 <br>
$\text{normalized sample weight} = \frac{\text{updated sample weight}}{\sum\text{updated sample weight}}$

**Step 7: Repeat 2 - 7 N times until enough stump** <br>

In [275]:
from copy import deepcopy
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics
import numpy as np
bc = load_breast_cancer()
train_size = 400
train_x, train_y = bc.data[:train_size], bc.target[:train_size]
test_x, test_y = bc.data[train_size:], bc.target[train_size:]
np.random.seed(123456)

In [276]:
ensemble_size = 100
base_classifier = DecisionTreeClassifier(max_depth = 1)

indices = [x for x in range(train_size)]
base_learners = []

data_weights = np.zeros(train_size) + 1/train_size
learner_weights = np.zeros(ensemble_size)
learners_errors = np.zeros(ensemble_size)

In [277]:
for i in range(ensemble_size):
    weak_learner = deepcopy(base_classifier)
    data_indices = np.random.choice(400, 400, p=data_weights)
    sample_x, sample_y = train_x[data_indices], train_y[data_indices]
    
    weak_learner.fit(sample_x, sample_y)
    predictions = weak_learner.predict(train_x)
    errors = predictions != train_y
    corrects = predictions == train_y
    
    base_learners.append(weak_learner)
    
    # calculate the leaner's amount of say
    weighted_error = data_weights * errors
    learner_error = np.sum(weighted_error)
    learner_weight = np.log((1 - leaner_error) / learner_error) / 2
    
    #save amount of say and learner error into the learner list
    learners_errors[i] = learner_error
    learner_weights[i] = learner_weight
    
    data_weights[errors] = np.exp(data_weights[errors] * learner_weight)
    data_weights[corrects] = np.exp(-data_weights[corrects] * learner_weight)
    data_weights = data_weights / sum(data_weights)


In [278]:
ensemble_predictions = []
for learner, weight in zip(base_learners, learner_weights):
    prediction = learner.predict(test_x)
    ensemble_predictions.append(prediction * weight)

In [279]:
ensemble_predictions = np.mean(ensemble_predictions, axis=0) >= 0.5
ensemble_acc = metrics.accuracy_score(test_y, ensemble_predictions)

In [280]:
ensemble_acc

0.9112426035502958

### sklearn

In [281]:
import numpy as np
from sklearn.datasets import load_digits
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn import metrics

In [282]:
digits = load_digits()
train_size = 1500
train_x = digits.data[:train_size]
train_y = digits.target[:train_size]
test_x = digits.data[train_size:]
test_y = digits.target[train_size:]
np.random.seed(123456)

In [283]:
ensemble_size = 1000
ensemble = AdaBoostClassifier(DecisionTreeClassifier(max_depth = 1), 
                              algorithm = 'SAMME', 
                              n_estimators = ensemble_size)

In [284]:
ensemble.fit(train_x, train_y)

In [285]:
ensemble_predictions = ensemble.predict(test_x)
ensemble_acc = metrics.accuracy_score(test_y, ensemble_predictions)

In [290]:
print(f'AdaBoost: {ensemble_acc:.2f}', )

AdaBoost: 0.82
