<center><img src=https://pictures.topspeed.com/IMG/crop/201909/the-bugatti-chiron-h-21_800x0w.jpg alt="a different kind of boost"></center>
<center>img source: topspeed.com</center>

# <center><b>Adaboost⚙️</b></center>

**What you can expect from this notebook:** Since I did a notebook on bagging (random forests) [here](https://www.kaggle.com/code/vincentbrunner/ml-from-scratch-random-forests), I thought one about boosting would fit quite well at this point. So this notebook covers the theory behind Adaboost as well as its implementation and testing.

<div class="alert alert-block alert-info">👉If you're just interested in the complete, commented implementation of an Adaboost classifier using just numpy and the copy module (plus sklearn's decision tree as the weak learner), feel free to click on show hidden code: </div>

In [None]:
#  Discrete Adaboost implementation including the regulating learning rate parameter and a weighted majority vote
class AdaBoostClassifier():
    def __init__(self, base_estimator=None, n_estimators=50, learning_rate=1):
        #  use a decision stump unless a base estimator is passed
        if base_estimator is None:
            self.base_estimator = DecisionTreeClassifier(max_depth=1, max_leaf_nodes=2)
        else:
            self.base_estimator = base_estimator
            
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate
        
        self.estimators = None
        self.estimator_weights = None
        
        #  to track performance, not necessary for the algorithm
        self.total_errors = None
        self.training_error = None
        self.validation_error = None
        
    def fit(self, X, y, X_val=None, y_val=None):
        #  resetting lists 
        self.estimators = []
        self.estimator_weights = []
        self.total_errors = []
        self.training_error = []
        self.validation_error = []
        lr = self.learning_rate
        
        #  0) initialise equal weights
        sample_weights = np.full(len(X), 1/len(X))
        
        for est_i in range(self.n_estimators):
            #  1) fit weak learner
            estimator = copy.deepcopy(self.base_estimator)
            estimator.fit(X, y, sample_weight=sample_weights)
            
            #  2) calculate total error
            prediction = estimator.predict(X)
            total_error = np.where(prediction != y, sample_weights, 0).sum() 
            
            #  3) determine weight / amount of say in final prediction
            amount_of_say = lr * 0.5 * np.log((1 - total_error)/(total_error + 1e-10))

            #  3.5) save estimator and its weight before going into the next iteration
            self.estimators.append(estimator)
            self.estimator_weights.append(amount_of_say)
            
            #  4) update weights
            sample_weights = np.where(prediction != y, sample_weights * np.exp(amount_of_say), sample_weights * np.exp(-1 * amount_of_say))
            
            #  5) renormalize weights
            sample_weights = sample_weights / sample_weights.sum()
            
            #  5.5) keep track of total- and training-error over iterations for documentation purposes
            self.total_errors.append(total_error)
            self.training_error.append(np.where(self.predict(X) != y, 1, 0).sum()/len(X))
            if X_val is not None:
                self.validation_error.append(np.where(self.predict(X_val) != y_val, 1, 0).sum()/len(X_val))
    
    def predict(self, X, verbose=False):
        predictions = np.stack([estimator.predict(X) for estimator in self.estimators], axis=1) 
        weighted_majority_vote = lambda x: np.unique(x)[np.argmax([np.where(x==categ, self.estimator_weights, 0).sum() for categ in np.unique(x)])]
        return np.apply_along_axis(weighted_majority_vote, axis=1, arr=predictions)

****

# <b><span style="color:#ebd1a4">|</span> Table of Contents 📄</b>
This notebook goes through all the principles necessary to understand and code an Adaboost classifier. The resulting model will be tested on an actual dataset (no pseudo datasets this time^^).

<p style="font-family: Arial; font-size: 16px; font-weight: bold; letter-spacing: 2px; line-height:1.3"><a href="#1." style="color:#940000">1. Adaboost intuition</a></p>

<p style="font-family: Arial; font-size: 16px; font-weight: bold; letter-spacing: 2px; line-height:1.3"><a href="#2." style="color:#940000">2. Background knowledge:</a></p>

<p style="text-indent:10px; font-family: Arial; font-size: 14px; letter-spacing: 2px; line-height:1.3"><a href="#2.1." style="color:#940000">2.1. Boosting</a></p>

<p style="text-indent:10px; font-family: Arial; font-size: 14px; letter-spacing: 2px; line-height:1.3"><a href="#2.2." style="color:#940000">2.2. Exponential loss</a></p>

<p style="font-family: Arial; font-size: 16px; font-weight: bold; letter-spacing: 2px; line-height:1.3"><a href="#3." style="color:#940000">3. The algorithm step for step</a></p>

<p style="font-family: Arial; font-size: 16px; font-weight: bold; letter-spacing: 2px; line-height:1.3"><a href="#4." style="color:#940000">4. Implementation</a></p>

<p style="font-family: Arial; font-size: 16px; font-weight: bold; letter-spacing: 2px; line-height:1.3"><a href="#5." style="color:#940000">5. Fitting and evaluation</a></p>

<p style="font-family: Arial; font-size: 16px; font-weight: bold; letter-spacing: 2px; line-height:1.3"><a href="#6." style="color:#940000">6. Adaboost characteristics based on example</a></p>

<br>

##### <div class="alert alert-block alert-info">⚠️ <strong>Important:</strong> since making my own visualisations would be too time-consuming for a first notebook, I mainly embedded images from <strong>Google image search</strong>. If you find your image here and <strong>want it to be removed</strong>, please leave a comment or <strong>contact me</strong>. </div>

<p id="1."></p>

****

# <b>1 <span style="color:#ebd1a4">|</span> Adaboost intuition</b>

The main idea behind the Adaboost algorithm is to **combine multiple weak estimators** to create a **better estimator** than the individual estimators, reducing the bias (and variance compared to models with similarly low bias).

<center><img src=https://www.researchgate.net/profile/Zhuo-Wang-36/publication/288699540/figure/fig9/AS:668373486686246@1536364065786/Illustration-of-AdaBoost-algorithm-for-creating-a-strong-classifier-based-on-multiple.png></center>
<center> image source: researchgate.net </center>
<br>

The idea is relatively straightforward:
* train the estimators **one after the other**:
    * identify the mistakes the estimator made as well as its overall performance
    * fit the next estimator in a way that it **puts weight on correctly classifying samples the previous estimators didn't** and repeat
* for a prediction, pass the data through every single model and **aggregate all the results weighted by the performance** of the origin model the results came from

This way the weak **estimators "support each other"** and create a **final model with lower bias**, while the variance, in the best case, is kept low too.

<p id="2.1."></p>

****

# <b>2.1 <span style="color:#ebd1a4">|</span> Boosting: the ensembling method Adaboost is based on</b>

**Main goal: reducing bias**<br>
**Combining estimators (usually) by summation + based on weights -> weighted aggregation**<br>

Steps:
* Fit a weak learner on a weighted dataset (***weak learner: estimator that makes predictions slightly better than random chance -> low variance, high bias***)
* After each iteration **boost** the sample weights of incorrectly classified samples
* Repeat till max number of estimators is reached or some other terminal condition is met
* The output consists of the prediction of each estimator usually weighted by the previously determined accuracy/performance of the estimator and added up

Bias reducing aspect of Boosting:
* Since the estimators aren't trained under the same conditions but to supplement the already trained ones, they **"add" to the capability of the model to make good predictions**
* Combined with the fact that the final prediction takes the performance of the individual estimators into account, this reduces the bias in comparison to the weak learners

Variance reducing aspect of Boosting:
* This is rather situation-dependent
* But in a way, the summation of the weighted predictions of the estimators can be seen as a sample statistic
* So adding more estimators does have a variance-reducing aspect, even though the complexity every single estimator adds usually leads to an overall increase in variance
* In comparison to the base estimators the variance still grows, but **in comparison to other models with similarly low bias, boosting often achieves a lower variance**

***Boosting algorithms take weak learners with relatively high bias and combine them into a strong learner***

<p id="2.2."></p>

****

# <b>2.2 <span style="color:#ebd1a4">|</span> The exponential loss: the loss function Adaboost minimizes</b>

The exponential loss is one of the most common classification losses, but for the sake of completeness it's going to be explained quickly:
* the ground truth $\large y$ is a set of length n where $\large y_i\in\{1, -1\}$ -> **binary classification problem**
* an estimator $\large h(x)$ is fit on a set of training data $\large \{(x_1, y_1), ..., (x_n, y_n)\}$
* the exponential loss is given by:<br>

    $\Large e^{-y_ih(x_i)}$
    
**interpretation:**<br>
**the goal is for the output of h(x) to be clearly positive (or negative) if the ground truth is 1 (or -1)**

Let's first take a look at the inner term $\large y_ih(x_i)$:
* **if the target $y_i = -1$, the term increases as $h(x_i)$ decreases and vice versa**
* **if the target $y_i = 1$, the term increases as $h(x_i)$ increases and vice versa**

The outer term $e^{-x}$ looks like the following:<br>
<center><img src="https://qph.fs.quoracdn.net/main-qimg-a1eb1cddee1b74e3457e36543bbf8971"></center>
<center>img source: quora.com</center>

* when the $y_ih(x_i)$ term gets smaller, the loss gets exponentially higher
* -> **in order to decrease the loss, $y_ih(x_i)$ has to be as high as possible**

**Conclusion:** The exponential loss is reduced by the output of h(x) being as clearly positive/negative as possible, depending on the sign of the ground truth
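
To make this conclusion a bit more tangible, here is a minimal numpy sketch of the loss. The helper name `exponential_loss` and the example outputs in `h` are made up for illustration; they are not part of the implementation further down.

```python
import numpy as np

def exponential_loss(y, h):
    """Per-sample exponential loss exp(-y * h) for labels y in {-1, 1}."""
    return np.exp(-np.asarray(y) * np.asarray(h))

y = np.array([1, 1, -1, -1])            # ground truth
h = np.array([2.0, -2.0, -0.5, 0.5])    # hypothetical raw outputs of h(x)

losses = exponential_loss(y, h)
for y_i, h_i, loss in zip(y, h, losses):
    # y_i * h_i is the inner term: positive means "on the right side"
    print(f"y = {y_i:2d}, h(x) = {h_i:4.1f}, y*h(x) = {y_i * h_i:4.1f}, loss = {loss:.3f}")
```

Samples with a clearly positive inner term get a loss close to 0, while a confident prediction with the wrong sign is punished the hardest.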

<p id="3."></p>

****

# <b>3 <span style="color:#ebd1a4">|</span> The algorithm step for step</b>

### derivation for binary classification:
#### **on an abstract level:**

As already mentioned in the boosting section, the output of this algorithm looks like the following:<br>

$\large C(x) = \sum_{m=1}^M\alpha_mh(x)_m\>\>\>\>\>\>$where $\alpha_m$ is the weight of the estimator $h(x)_m$ and $M$ is the total number of ensembled estimators.

At every iteration m an estimator $h_m$ is added to the ensemble weighted by a factor $\alpha_m$:<br>

$\large C(x)_m = C(x)_{m-1} + \alpha_mh(x)_m$

to minimize the sum of the exponential loss for all data points:<br>

$\Large \sum_{i=1}^ne^{-y_i(C(x_i)_{m-1} + \alpha_mh(x_i)_m)}$

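since the exponent is a sum, every term can be split into 2 factors: $e^{-y_iC(x_i)_{m-1}}$, which is already fixed by the previous iterations, and $e^{-y_i\alpha_mh(x_i)_m}$, which depends on the new estimator. So h has to be trained to minimize:

$\Large \sum_{i=1}^ne^{-y_i\alpha_mh(x_i)_m}$

and with the fixed factor taking the role of the sample weight $w_i$:<br>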

$\Large \sum_{i=1}^nw_ie^{-y_i\alpha_mh(x_i)_m}$

the term $y_ih(x_i)_m$ is positive for correctly classified data points and negative for incorrectly classified ones. Assuming $h(x)\in\{-1, 1\}$, the whole term can only take on 2 values: -1 (for incorrectly classified data points) and 1 (for correctly classified data points). Therefore it can be removed from the exponent by splitting the sum based on the value $y_ih(x_i)_m$ takes on:<br>

$\Large \sum_{y_i\neq h(x_i)_m}w_ie^{\alpha_m} + \sum_{y_i=h(x_i)_m}w_ie^{-\alpha_m}$

**finding $\alpha$:**
* The idea is to find an alpha that minimizes the loss above (call it $L$) -> by finding a minimum of the loss -> a point where its derivative is equal to 0. Since $L$ is convex in $\alpha_m$, this point is the minimum.

This derivative is given by:<br>

$\Large \frac{\partial L}{\partial \alpha_m} = \sum_{y_i\neq h(x_i)_m}w_ie^{\alpha_m} -\sum_{y_i=h(x_i)_m}w_ie^{-\alpha_m}$

setting it equal to 0 and solving for $\alpha_m$ the following term is obtained:<br>

$\Large \alpha_m = \frac{1}{2}ln(\frac{\sum_{y_i= h(x_i)_m}w_i}{\sum_{y_i\neq h(x_i)_m}w_i})$ 

when normalizing w after each iteration, its total sum is equal to 1, so it can be written as:<br>

$\Large \alpha_m = \frac{1}{2}ln(\frac{1 - \sum_{y_i\neq h(x_i)_m}w_i}{\sum_{y_i\neq h(x_i)_m}w_i}) = \frac{1}{2}ln(\frac{1 - e}{e})$ where $e = \sum_{y_i\neq h(x_i)_m}w_i$ is the sum of all **sample weights of incorrectly classified samples**, which in this context is often referred to as the ***total error***
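
As a quick sanity check of this result, the closed-form $\alpha_m$ can be compared with a brute-force search over the split loss. This is only a sketch: the weights `w` and the mask `correct` are made-up toy values, and the grid search is just for verification, not something the algorithm does.

```python
import numpy as np

w = np.array([0.1, 0.2, 0.3, 0.4])               # normalized sample weights (toy values)
correct = np.array([True, True, False, True])    # hypothetical: which samples h_m got right

e = w[~correct].sum()                            # total error
alpha_closed = 0.5 * np.log((1 - e) / e)         # formula derived above

def split_loss(alpha):
    """Weighted exponential loss, split into incorrectly and correctly classified samples."""
    return w[~correct].sum() * np.exp(alpha) + w[correct].sum() * np.exp(-alpha)

alphas = np.linspace(-2, 2, 4001)                # grid with step size 0.001
alpha_grid = alphas[np.argmin(split_loss(alphas))]

print(f"total error = {e:.2f}, closed form = {alpha_closed:.4f}, grid search = {alpha_grid:.4f}")
```

Both values should agree up to the resolution of the grid.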

<br>

#### **Quick recap/summary:**
This means that after having fitted an estimator h(x) to the weighted training data we can add it to the ensemble multiplied by its factor:<br>

$\Large \alpha_m = \frac{1}{2}ln(\frac{1 - e}{e})$

by doing this, the exponential error gets minimized.<br>
The only step left is to **update the sample weights**, boosting the ones of incorrectly classified samples and renormalizing afterwards.

#### **updating the sample weights:**
The weights are **updated by the exponential loss** calculated over the corresponding sample:

$\Large w_i\leftarrow w_ie^{-y_i\alpha_mh(x_i)_m}$

This way **the weights of incorrectly classified samples are boosted** and the rest is kept low.<br>
This doesn't result in weights that sum up to 1, but that assumption was made to come up with the formula for determining the optimal $\alpha_m$. So the new sample weights have to be **renormalized**:

$\Large w_i = \frac{w_i}{\sum_{i=1}^nw_i}$
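
A small worked example of the update and the renormalization, assuming five equally weighted samples of which exactly one (index 1) is misclassified. The arrays `y` and `h` are hypothetical toy values.

```python
import numpy as np

y = np.array([1, 1, -1, -1, 1])      # ground truth
h = np.array([1, -1, -1, -1, 1])     # hypothetical predictions, sample 1 is wrong
w = np.full(len(y), 1 / len(y))      # equal starting weights

e = w[y != h].sum()                  # total error
alpha = 0.5 * np.log((1 - e) / e)    # amount of say of this estimator

w_new = w * np.exp(-y * alpha * h)   # boost wrong samples, shrink correct ones
w_new = w_new / w_new.sum()          # renormalize so the weights sum up to 1

print("alpha:", alpha)
print("new weights:", w_new)
```

With the optimal $\alpha_m$ from above, the misclassified samples always end up holding half of the total weight after renormalizing, which is why the next estimator is forced to focus on them.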

<br>

### the resulting algorithm:
**initialize sample weights** $\large w_i = 1/n, i=1,2,...,n$
* **for $\large m = 1, ..., M$:**

    1. **fit estimator $\large h(x)_m$ to training set utilising the weights W**
    2. **compute $\large e = \sum_{y_i \neq h(x_i)_m}w_i$**
    3. **compute $\large \alpha_m = \frac{1}{2}ln(\frac{1 - e}{e})$**
    4. **set $\large w_i\leftarrow w_ie^{-y_i\alpha_mh(x_i)_m}$**
    5. **renormalize $\large w_i = \frac{w_i}{\sum_{i=1}^nw_i}$**
    
$\large C(x) = \sum_{m=1}^M\alpha_mh(x)_m$

<br>

### in words:
**initialize sample weights** $\large w_i = 1/n, i=1,2,...,n$
* **repeat till max number of estimators is reached:**

    1. **fit estimator $\large h(x)_m$ to training set utilising the weights W**
    2. & 3. **determine the best factor for $\large h(x)_m$ when adding to ensemble**
    4. **update weights based on the error: more emphasis on incorrectly classified samples**
    5. **renormalize the weights to sum up to 1**
    
**the final estimator consists of all the weak learners multiplied by their weights and summed up**
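
The five steps can be sketched end to end on a tiny 1-D toy problem. This is not the implementation of this notebook: the weak learner here is a hand-rolled threshold stump (`fit_stump`, a name made up for this example), the data is made up, and the final class is assumed to be the sign of $C(x)$.

```python
import numpy as np

def fit_stump(x, y, w):
    """Weighted 1-D stump: predicts `sign` for x <= thr and `-sign` otherwise."""
    best = None
    for thr in np.unique(x):
        for sign in (1, -1):
            pred = np.where(x <= thr, sign, -sign)
            err = w[pred != y].sum()
            if best is None or err < best[0]:
                best = (err, thr, sign)
    return best[1], best[2]

def stump_predict(x, thr, sign):
    return np.where(x <= thr, sign, -sign)

def adaboost_fit(x, y, M):
    w = np.full(len(x), 1 / len(x))                   # initialise equal weights
    ensemble = []
    for _ in range(M):
        thr, sign = fit_stump(x, y, w)                # 1. fit weak learner on weighted data
        pred = stump_predict(x, thr, sign)
        e = w[pred != y].sum()                        # 2. total error
        alpha = 0.5 * np.log((1 - e) / (e + 1e-10))   # 3. amount of say
        w = w * np.exp(-y * alpha * pred)             # 4. update weights
        w = w / w.sum()                               # 5. renormalize
        ensemble.append((alpha, thr, sign))
    return ensemble

def adaboost_predict(x, ensemble):
    # C(x) = sum of alpha_m * h_m(x); the predicted class is assumed to be its sign
    score = sum(alpha * stump_predict(x, thr, sign) for alpha, thr, sign in ensemble)
    return np.sign(score)

x = np.arange(10, dtype=float)
y = np.array([1, 1, 1, -1, -1, -1, 1, 1, 1, -1])     # no single threshold separates this

for M in (1, 3):
    error = np.mean(adaboost_predict(x, adaboost_fit(x, y, M)) != y)
    print(f"{M} stump(s): training error = {error:.1f}")
```

A single stump can't get below 3 mistakes on this data, while three boosted stumps classify every training sample correctly.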

<br>

### generalise for problems where y isn't -1 and 1:
Note: the indicator function $\mathbb{I}$ outputs 1 if the condition is met and 0 otherwise.

**initialize sample weights** $\large w_i = 1/n, i=1,2,...,n$
* **for $\large m = 1, ..., M$:**

    1. **fit estimator $\large h(x)_m$ to training set utilising the weights W**
    2. **compute $\large e = \sum_{i=1}^nw_i\mathbb{I}(y_i \neq h(x_i)_m)$**
    3. **compute $\large \alpha_m = \frac{1}{2}ln(\frac{1 - e}{e})$**
    4. **set $\large w_i\leftarrow w_i \cdot \left\{ \begin{array}{ c l }e^{-\alpha_m} & \quad \textrm{if } y_i = h(x_i)_m \\ e^{\alpha_m} & \quad \textrm{if } y_i \neq h(x_i)_m \end{array}\right.$**
    5. **renormalize $\large w_i = \frac{w_i}{\sum_{i=1}^nw_i}$**
    
$\large C(x) = \underset{k}{\operatorname{argmax}}\sum_{m=1}^M\alpha_m\mathbb{I}(h(x)_m=k)$ -> weighted majority vote
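
The weighted majority vote can be written almost literally as in the formula. A minimal sketch, assuming the predictions of all estimators are already collected in an array of shape (number of estimators, number of samples). The function name and the toy values are made up for this example; the implementation below uses a slightly different, row-wise formulation.

```python
import numpy as np

def weighted_majority_vote(predictions, alphas):
    """Return the class with the highest summed amount of say for every sample."""
    predictions = np.asarray(predictions)
    alphas = np.asarray(alphas, dtype=float)
    classes = np.unique(predictions)
    # scores[k, i] = sum over m of alpha_m * I(h_m(x_i) == classes[k])
    scores = np.array([(alphas[:, None] * (predictions == k)).sum(axis=0) for k in classes])
    return classes[np.argmax(scores, axis=0)]

# rows: 3 hypothetical estimators, columns: 3 samples
preds = np.array([[0, 1, 2],
                  [1, 1, 2],
                  [1, 0, 0]])
alphas = np.array([0.9, 0.3, 0.4])

print(weighted_majority_vote(preds, alphas))
```

For the first sample two of the three estimators vote for class 1, but the single vote for class 0 comes with more amount of say (0.9 vs. 0.3 + 0.4), so class 0 wins.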

<br>

##### <div class="alert alert-block alert-info">⚠️ <strong>Important:</strong> This is just the base Adaboost variant (discrete Adaboost). Today newer variants like the SAMME algorithm are used more frequently, especially for multiclass classification. If there is interest in this, I will <strong>update</strong> this notebook with the corresponding background and implementations.</div>


<p id="4."></p>

****

# <b>4 <span style="color:#ebd1a4">|</span> Python implementation</b>

**Note:** in theory **every type of weak learner** can be used for Adaboost, but since decision tree stumps are the simplest and most commonly used, this implementation has a decision tree stump as the default base estimator.

In [None]:
#  used for implementing the algorithm
import numpy as np # linear algebra 
import copy # deep copies of objects -> estimators

#  estimator to ensemble with Adaboost:
from sklearn.tree import DecisionTreeClassifier

#  used for data handling and visualisation
import pandas as pd # loading and transforming data
import matplotlib.pyplot as plt # visualisations
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay # creating & visualising confusion matrices
from sklearn.model_selection import train_test_split # splitting data in train/test set

In [None]:
titanic_data = pd.read_csv("../input/titanic/train.csv", usecols=["Survived", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]).dropna()
titanic_data["Sex"] = titanic_data["Sex"].astype("category").cat.codes
titanic_data["Embarked"] = titanic_data["Embarked"].astype("category").cat.codes

features = titanic_data.loc[:, titanic_data.columns!="Survived"].to_numpy() # select everything but the target
labels = titanic_data.loc[:, "Survived"].to_numpy() # select the target

X_train, X_val, y_train, y_val = train_test_split(features, labels, test_size=0.2)

titanic_data.head()

In [None]:
#  Discrete Adaboost implementation including the regulating learning rate parameter and a weighted majority vote
class AdaBoostClassifier():
    def __init__(self, base_estimator=None, n_estimators=50, learning_rate=1):
        #  use a decision stump unless a base estimator is passed
        if base_estimator is None:
            self.base_estimator = DecisionTreeClassifier(max_depth=1, max_leaf_nodes=2)
        else:
            self.base_estimator = base_estimator
            
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate
        
        self.estimators = None
        self.estimator_weights = None
        
        #  to track performance, not necessary for the algorithm
        self.total_errors = None
        self.training_error = None
        self.validation_error = None
        
    def fit(self, X, y, X_val=None, y_val=None):
        #  resetting lists 
        self.estimators = []
        self.estimator_weights = []
        self.total_errors = []
        self.training_error = []
        self.validation_error = []
        lr = self.learning_rate
        
        #  0) initialise equal weights
        sample_weights = np.full(len(X), 1/len(X))
        
        for est_i in range(self.n_estimators):
            #  1) fit weak learner
            estimator = copy.deepcopy(self.base_estimator)
            estimator.fit(X, y, sample_weight=sample_weights)
            
            #  2) calculate total error
            prediction = estimator.predict(X)
            total_error = np.where(prediction != y, sample_weights, 0).sum() 
            
            #  3) determine weight / amount of say in final prediction
            amount_of_say = lr * 0.5 * np.log((1 - total_error)/(total_error + 1e-10))

            #  3.5) save estimator and its weight before going into the next iteration
            self.estimators.append(estimator)
            self.estimator_weights.append(amount_of_say)
            
            #  4) update weights
            sample_weights = np.where(prediction != y, sample_weights * np.exp(amount_of_say), sample_weights * np.exp(-1 * amount_of_say))
            
            #  5) renormalize weights
            sample_weights = sample_weights / sample_weights.sum()
            
            #  5.5) keep track of total- and training-error over iterations for documentation purposes
            self.total_errors.append(total_error)
            self.training_error.append(np.where(self.predict(X) != y, 1, 0).sum()/len(X))
            if X_val is not None:
                self.validation_error.append(np.where(self.predict(X_val) != y_val, 1, 0).sum()/len(X_val))
    
    def predict(self, X, verbose=False):
        """
        * every estimator makes its predictions in the shape (len(X)) -> [a, b, ..., len(X)]
        * stack the predictions of the estimators to have them row-wise (each row corresponds to a sample) -> [[a1, a2], [b1, b2], ..., len(X)]
        * at each row apply the weighted majority vote previously discussed
        """
        predictions = np.stack([estimator.predict(X) for estimator in self.estimators], axis=1) 
        weighted_majority_vote = lambda x: np.unique(x)[np.argmax([np.where(x==categ, self.estimator_weights, 0).sum() for categ in np.unique(x)])]
        return np.apply_along_axis(weighted_majority_vote, axis=1, arr=predictions)
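
The `learning_rate` parameter isn't part of the derivation above: in this implementation it simply scales the amount of say, which in turn makes the weight update less aggressive. A minimal sketch of that effect for a single update step. The helper `misclassified_mass_after_update` is made up for this example, and the total error of 0.2 is a toy value.

```python
import numpy as np

def misclassified_mass_after_update(e, lr):
    """Share of the total weight that sits on the misclassified samples after one update."""
    alpha = lr * 0.5 * np.log((1 - e) / e)   # scaled amount of say, as in fit()
    wrong = e * np.exp(alpha)                # summed weight of misclassified samples, boosted
    right = (1 - e) * np.exp(-alpha)         # summed weight of correct samples, shrunk
    return wrong / (wrong + right)           # renormalize

for lr in (1.0, 0.5, 0.1):
    share = misclassified_mass_after_update(0.2, lr)
    print(f"learning rate = {lr}: misclassified samples hold {share:.3f} of the weight")
```

With a learning rate of 1 the misclassified samples hold half of the weight afterwards; smaller learning rates keep the new weights closer to the old ones, so each estimator changes the problem for the next one less drastically.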

<p id="5."></p>

# <b>5 <span style="color:#ebd1a4">|</span> Fitting and evaluation</b>

In [None]:
#  let's use a slightly more complex decision tree as base estimator:
base = DecisionTreeClassifier(max_depth=2, max_leaf_nodes=4)

#  fit Adaboost classifier with 500 estimators
adaboost = AdaBoostClassifier(base_estimator=base, n_estimators=500, learning_rate=1)
adaboost.fit(X_train, y_train, X_val, y_val)

#  make predictions:
predictions = adaboost.predict(X_val)

#  confusion matrix
cm = confusion_matrix(y_val, predictions, labels=[0, 1])
cm_displ = ConfusionMatrixDisplay(cm)
cm_displ.plot()
plt.show()

#  calculate accuracy:
accuracy = np.mean(predictions==y_val)

#  calculate recall:
recall = cm[1, 1]/cm[1, :].sum() # of all actual positives, how many were classified correctly

#  calculate precision:
precision = cm[1, 1]/cm[:, 1].sum() # of all predicted positives, how many were true positives

#  not that necessary for this problem, but for the sake of completeness:
f1 = 2 * ((recall * precision)/(recall + precision)) 

print(f"accuracy = {accuracy},\nrecall = {recall},\nprecision = {precision},\nf1-score = {f1}")
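
To see how these metrics relate to the cells of the confusion matrix, here is a small worked example with made-up counts. It assumes the layout of sklearn's `confusion_matrix` (rows = actual class, columns = predicted class); the numbers are not results of the model above.

```python
import numpy as np

# hypothetical confusion matrix: rows = actual class, columns = predicted class
cm = np.array([[50, 10],
               [ 5, 35]])
tn, fp, fn, tp = cm.ravel()

recall = tp / (tp + fn)        # same as cm[1, 1] / cm[1, :].sum()
precision = tp / (tp + fp)     # same as cm[1, 1] / cm[:, 1].sum()
f1 = 2 * (recall * precision) / (recall + precision)

# the f1-score can also be written directly in terms of the counts
f1_from_counts = 2 * tp / (2 * tp + fp + fn)

print(f"recall = {recall:.3f}, precision = {precision:.3f}, f1-score = {f1:.3f}")
```

Both ways of writing the f1-score give the same value, since it is the harmonic mean of recall and precision.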

<p id="6."></p>

# <b>6 <span style="color:#ebd1a4">|</span> Adaboost characteristics</b>

In [None]:
fig = plt.figure(figsize=(14, 8))

plt.plot(range(len(adaboost.training_error)), adaboost.training_error, color="red", label="training error")
plt.plot(range(len(adaboost.validation_error)), adaboost.validation_error, color="green", label="validation error")
plt.plot(range(len(adaboost.estimator_weights)), adaboost.estimator_weights, color="black", label="amount of say")
plt.plot(range(len(adaboost.total_errors)), adaboost.total_errors, color="blue", label="total error")

plt.xlabel("iteration")
plt.ylabel("error, amount of say")

plt.title("adaboost classifier performance summary")
plt.legend()
plt.show()

**Important things to take away:**
* Adaboost is likely to overfit; this can be avoided with many regularisation techniques (e.g. a lower learning rate or fewer estimators) but has to be kept in mind
* Adaboost performs very differently on a problem depending on the base estimator used
* Adaboost puts the most weight on estimators from early iterations, since later estimators "fix" just small mistakes and don't perform well on their own

**That's all for this notebook, have a great day and happy learning!👋**

Papers this notebook is based on:<br>
[1](https://www.sciencedirect.com/science/article/pii/S002200009791504X?via%3Dihub) <br>
[2](https://hastie.su.domains/Papers/samme.pdf) <br>
[3](www.inf.fu-berlin.de/inst/ag-ki/adaboost4.pdf) <br>