## Adaboost

Adaboost is short for adaptive boosting. It learns a sequence of weak classifiers by repeatedly changing the weights of the training samples, and then combines those classifiers linearly into a strong classifier.

Generally speaking, there are two problems that all boosting methods have to face. One is how to change the weight or probability distribution of training samples during the training process, and the other is how to combine multiple weak classifiers into a strong classifier. 

Adaboost's answer to both problems is simple. For the first, it increases the weights of the samples that the weak classifier misclassified in the previous round and reduces the weights of the correctly classified samples, so the next weak classifier concentrates on the hard samples. For the second, it combines the weak classifiers linearly, giving a larger weight to weak classifiers with a low classification error rate and a smaller weight to those with a high error rate.

Given a training data set for binary classification $T=\{ (x_{1},y_{1}), (x_{2},y_{2}), \cdots , (x_{N},y_{N}) \}$, each sample consists of an input instance and its label: instance $x_{i} \in \chi \subseteq R^{n}$, label $y_{i} \in \mathcal{Y} = \{-1,1\}$. The algorithm proceeds as follows:

(1) Initialize the weight distribution of the training samples. Every sample has the same weight at the beginning of training, that is, the sample weights are uniformly distributed: $D_{1} = (w_{11}, \cdots, w_{1i}, \cdots, w_{1N})$, $w_{1i}=\frac{1}{N}$, $i=1,2, \cdots,N$

(2) For $m=1,2, \cdots, M$

Train a weak classifier on the training data weighted by the current distribution $D_{m}$ (uniform only in the first round): $G_{m}(x): \chi \rightarrow \{-1, 1\}$

Calculate the weighted classification error rate of $G_{m}(x)$ on the training data: $e_{m}=P\left(G_{m}\left(x_{i}\right) \neq y_{i}\right)=\sum_{i=1}^{N} w_{m i} I\left(G_{m}\left(x_{i}\right) \neq y_{i}\right)$

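Calculate the weight of the weak classifier: $\alpha _{m}=\frac{1}{2} \log \frac{1-e_{m}}{e_{m}}$, where $\log$ is the natural logarithm. $\alpha_{m}$ is positive when $e_{m} < \frac{1}{2}$ and grows as $e_{m}$ shrinks, so more accurate weak classifiers get a larger say in the final vote.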

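Update the weight distribution over the training samples: $D_{m+1}=\left(w_{m+1,1}, \ldots, w_{m+1, i}, \ldots, w_{m+1, N}\right)$, $w_{m+1, i}=\frac{w_{m i}}{Z_{m}} \exp \left(-\alpha_{m} y_{i} G_{m}\left(x_{i}\right)\right)$, where $Z_{m}=\sum_{i=1}^{N} w_{m i} \exp \left(-\alpha_{m} y_{i} G_{m}\left(x_{i}\right)\right)$ is the normalization factor that makes $D_{m+1}$ a probability distribution. Since $y_{i} G_{m}\left(x_{i}\right)$ is $1$ for a correct prediction and $-1$ for a wrong one, the weights of misclassified samples are multiplied by $e^{\alpha_{m}}$ and the weights of correctly classified samples by $e^{-\alpha_{m}}$ before normalization.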
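
To make step (2) concrete, here is a minimal sketch of a single boosting round on toy data. The labels `y` and the weak classifier outputs `g` are made-up illustrative values, not taken from the digits data used later.

```python
import numpy as np

# Toy data for illustration only (assumed values)
y = np.array([1, 1, -1, -1, 1])      # true labels
g = np.array([1, -1, -1, -1, 1])     # weak classifier outputs; sample 1 is wrong
w = np.full(len(y), 1 / len(y))      # D_1: uniform weights

e = np.sum(w[g != y])                # weighted error rate e_m
alpha = 0.5 * np.log((1 - e) / e)    # classifier weight alpha_m
w_new = w * np.exp(-alpha * y * g)   # unnormalized weight update
w_new /= w_new.sum()                 # divide by the normalization factor Z_m

print("e_m:", e)
print("alpha_m:", alpha)
print("D_2:", w_new)
```

With one of five equally weighted samples misclassified, $e_{m}=0.2$ and $\alpha_{m}=\frac{1}{2}\log 4$. After normalization the misclassified sample carries weight $0.5$ and each of the other four carries $0.125$, so the next weak classifier is pushed toward the sample that was wrong.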

(3) Construct the linear combination of the weak classifiers: $f(x)=\sum^{M}_{m=1} \alpha_{m}G_{m}(x)$

The final model of Adaboost classifier can be expressed as:
$$
G(x) = \operatorname{sign}(f(x)) = \operatorname{sign}\left(\sum^{M}_{m=1} \alpha_{m}G_{m}(x)\right)
$$
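
The weighted vote can be sketched in a few lines. The outputs in `G` and the weights in `alpha` below are hypothetical numbers chosen for illustration, not results of any training run.

```python
import numpy as np

# Hypothetical outputs of M = 3 weak classifiers on 4 samples
# (one row per classifier, one column per sample)
G = np.array([[ 1,  1, -1, -1],
              [ 1, -1, -1,  1],
              [-1,  1, -1,  1]])
# Hypothetical classifier weights alpha_1, alpha_2, alpha_3
alpha = np.array([0.9, 0.4, 0.3])

f = alpha @ G            # f(x) = sum over m of alpha_m * G_m(x), per sample
G_final = np.sign(f)     # final prediction G(x) = sign(f(x))

print("f(x):", f)
print("G(x):", G_final)
```

For the last sample, two of the three weak classifiers vote $+1$, yet the final prediction is $-1$ because the first classifier's weight outweighs the other two combined. This is how the combination differs from a plain majority vote.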

The idea behind Adaboost is simple, yet the algorithm works well in practice. Adaboost usually uses a decision stump as the weak classifier: a one-level decision tree that compares a single feature with a threshold, which makes it cheap to train and easy to combine.
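
As a minimal sketch of what a decision stump computes, the helper below predicts $-1$ for samples whose chosen feature lies below the threshold and $+1$ otherwise, with an optional sign flip. The function name `stump_predict`, its parameters, and the array `X_demo` are hypothetical and only serve this illustration.

```python
import numpy as np

def stump_predict(X, feature_index, threshold, polarity=1):
    """Hypothetical helper: threshold one feature and return labels in {-1, 1}."""
    pred = np.ones(X.shape[0])
    # samples below the threshold get -1, all others keep +1
    pred[X[:, feature_index] < threshold] = -1
    # polarity = -1 swaps the two sides of the split
    return polarity * pred

# Made-up data: 3 samples, 2 features
X_demo = np.array([[0.5, 3.0],
                   [1.5, 1.0],
                   [2.5, 2.0]])

below_is_negative = stump_predict(X_demo, feature_index=0, threshold=1.5)
below_is_positive = stump_predict(X_demo, feature_index=0, threshold=1.5, polarity=-1)
print(below_is_negative)
print(below_is_positive)
```

Only the first sample has feature 0 below the threshold 1.5, so the first call labels it $-1$ and the other two $+1$. The second call returns the flipped labels.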

In [1]:
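class DecisionStump():
    def __init__(self):
        # polarity of the split: with 1, samples below the threshold
        # are labeled -1; with -1, the prediction is flipped
        self.label = 1
        # index of the feature used for the split
        self.feature_index = None
        # threshold value on that feature
        self.threshold = None
        # weight of this stump in the final vote
        self.alpha = None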

In [4]:
import numpy as np


class Adaboost:
    def __init__(self, n_estimators=5):
        # number of weak classifiers
        self.n_estimators = n_estimators

    def fit(self, X, y):
        m, n = X.shape
        # step(1): initialize the uniform distribution of weights
        w = np.full(m, (1/m))
        # initialize the classifier list
        self.estimators = []
        # step(2)
        for _ in range(self.n_estimators):
            # 2.a: train a weak classifier
            estimator = DecisionStump()
            min_error = float('inf')
            # traverse the features and select the best split
            # based on the smallest weighted classification error rate
            for i in range(n):
                # candidate thresholds: the distinct values of feature i
                unique_values = np.unique(X[:, i])
                # try every feature value as threshold
                for threshold in unique_values:
                    p = 1
                    # initialize all predicted values to be 1
                    pred = np.ones(np.shape(y))
                    # set the predictions below the threshold to be -1
                    pred[X[:, i] < threshold] = -1
                    # 2.b: calculate the weighted misclassification rate
                    error = np.sum(w[y != pred])

                    # If the classification error rate is greater than 0.5, 
                    # the positive and negative prediction flip is performed
                    if error > 0.5:
                        error = 1 - error
                        p = -1

                    # save the parameter once the smallest classification error rate is found
                    if error < min_error:
                        estimator.label = p
                        estimator.threshold = threshold
                        estimator.feature_index = i
                        min_error = error

            # 2.c: calculate the weight of the base classifier
            # (the small constant avoids division by zero when min_error is 0)
            estimator.alpha = 0.5 * np.log((1.0 - min_error) / (min_error + 1e-9))
            # recompute the predictions of the selected stump exactly as in the
            # search above: -1 below the threshold, 1 otherwise, flipped if label is -1
            # (this keeps samples equal to the threshold consistent with min_error)
            preds = np.ones(np.shape(y))
            preds[X[:, estimator.feature_index] < estimator.threshold] = -1
            preds *= estimator.label
            # 2.d: update the sample weights and renormalize
            w *= np.exp(-estimator.alpha * y * preds)
            w /= np.sum(w)

            # save the weak classifier
            self.estimators.append(estimator)

    def predict(self, X):
        m = len(X)
        y_pred = np.zeros(m)
        # accumulate the weighted predictions of the weak classifiers
        for estimator in self.estimators:
            predictions = np.ones(m)
            predictions[X[:, estimator.feature_index] < estimator.threshold] = -1
            predictions *= estimator.label
            # step (3): weight each weak classifier's prediction by its alpha
            y_pred += estimator.alpha * predictions

        # return the sign of the weighted sum
        return np.sign(y_pred)

In [7]:
import numpy as np
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

data = datasets.load_digits()
X = data.data
y = data.target
digit1 = 1
digit2 = 8
idx = np.append(np.where(y==digit1)[0], np.where(y==digit2)[0])
y = data.target[idx]
# Change labels to {-1, 1}
y[y == digit1] = -1
y[y == digit2] = 1
X = data.data[idx]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.7)

clf = Adaboost(n_estimators=5)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy of AdaBoost by numpy:", accuracy)

Accuracy of AdaBoost by numpy: 0.884


In [8]:
from sklearn.ensemble import AdaBoostClassifier
clf_ = AdaBoostClassifier(n_estimators=5, random_state=0)
clf_.fit(X_train, y_train)
y_pred_ = clf_.predict(X_test)
accuracy = accuracy_score(y_test, y_pred_)
print("Accuracy of AdaBoost by sklearn:", accuracy)

Accuracy of AdaBoost by sklearn: 0.924
