# 1. AdaBoost Algorithm 
We now turn to **boosting** and introduce one realization of the boosting idea, the **AdaBoost** algorithm. It remains one of the most powerful ensemble methods, and like the random forest it is considered a good off-the-shelf, plug-and-play model. 

The idea behind **boosting** is very different from **bagging** and **random forest**. Recall that in those methods we wanted **low bias, high variance** base models. In boosting, we actually want **high bias** models, which in boosting nomenclature are called **weak learners**. Weak learners have classification accuracies not much better than chance, typically around 50-60%. We will show later that even after ensembling, the variance stays low and the test error stays low: the ensemble tends not to overfit, even when we keep adding more trees. 

<br>
## 1.1 Boosting 
The hypothesis behind boosting is that if we **combine many weak learners**, as a whole they will be a strong learner. As in our stacking example earlier, what we are going to do is weight each learner so that each base model has a different amount of influence on the output.

#### $$F(x) = \sum_{m=1}^M \alpha_mf_m(x)$$

<br>
### 1.1.1 Weak Learners
So, how do we create or find weak learners? A typical way to do that with boosting is to use decision trees with a max depth of 1. These are also called **decision stumps**. A decision stump is just one split, in one dimension. In other words, each tree only splits the space in half. It is remarkable that a combination of such simple models can yield a good classifier. 

Another way is just to use a linear classifier like logistic regression. An added bonus of using such simple models is that they train extremely fast. This means that you can quickly train an ensemble of thousands of base models. 
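
To make the idea of a decision stump concrete, here is a minimal sketch written from scratch with NumPy. The names `fit_stump` and `predict_stump` and the toy data are assumptions made for illustration. The sketch tries every feature and every observed value as a threshold and keeps the single split with the lowest error.

```python
import numpy as np

def fit_stump(X, y):
    """Find the single (feature, threshold, polarity) split with lowest error.

    X: (N, D) features, y: (N,) labels in {-1, +1}.
    """
    best, best_err = None, np.inf
    for j in range(X.shape[1]):                 # one dimension at a time
        for thr in np.unique(X[:, j]):          # candidate cut points
            for polarity in (1, -1):            # which side predicts +1
                pred = np.where(X[:, j] <= thr, -polarity, polarity)
                err = np.mean(pred != y)
                if err < best_err:
                    best, best_err = (j, thr, polarity), err
    return best, best_err

def predict_stump(stump, X):
    j, thr, polarity = stump
    return np.where(X[:, j] <= thr, -polarity, polarity)

# Toy data (hypothetical): the label depends only on the first feature.
X = np.array([[1.0, 5.0], [2.0, 1.0], [3.0, 4.0], [4.0, 2.0]])
y = np.array([-1, -1, 1, 1])

stump, err = fit_stump(X, y)
print(stump, err, predict_stump(stump, X))
```

This exhaustive search is only practical for small data. A library decision tree limited to depth 1 plays the same role more efficiently.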

<br>
### 1.1.2 Details
In AdaBoost, which we will get to shortly, there are a couple of details to note. First and foremost, AdaBoost uses **{-1, +1}** as targets, rather than **{0, 1}**. The reason will become clear when we see the algorithm. So, for example, if we were doing logistic regression we would rescale the output by multiplying by 2 and subtracting 1. 

#### $$ModifiedLogisticOutput(x) = 2*LogisticOutput(x) - 1$$
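
As a small sketch of this rescaling (the arrays below are made-up values, not outputs of a real model), the same map $2x - 1$ converts {0, 1} targets to {-1, +1}, and sends a logistic output in (0, 1) to a score in (-1, 1) whose sign gives the hard prediction.

```python
import numpy as np

# Targets stored as {0, 1} are mapped to {-1, +1}.
t01 = np.array([0, 1, 1, 0])
t_pm = 2 * t01 - 1

# The same map sends a logistic output in (0, 1) to a score in (-1, 1).
# An output of 0.5 lands exactly on the decision boundary 0.
p = np.array([0.1, 0.8, 0.55, 0.3])    # hypothetical logistic outputs
score = 2 * p - 1

# Taking the sign of the score gives a hard {-1, +1} prediction.
pred = np.where(score >= 0, 1, -1)
print(t_pm, score, pred)
```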

The final output of AdaBoost is:

#### $$F_M(x) = sign(\sum_{m=1}^M \alpha_mf_m(x))$$

Where $f_m$ represents the individual base learners and $F_M$ is the ensemble model, with $M$ being the number of base learners. Since the targets are -1 and +1, the decision boundary is 0, so we just take the sign to determine the final prediction. We typically use $\alpha$ as the symbol for each classifier's weight, since $w$ will be used for something else: the weight that tells us how important each sample is. 
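
Here is a short sketch of that final output, assuming three already-trained weak learners. The predictions in `f` and the weights in `alpha` are invented for illustration.

```python
import numpy as np

# Hypothetical predictions of M = 3 weak learners on 4 samples
# (one row per learner, entries in {-1, +1}).
f = np.array([
    [ 1,  1, -1, -1],
    [ 1, -1, -1,  1],
    [-1,  1,  1,  1],
])
alpha = np.array([0.9, 0.4, 0.3])    # hypothetical learner weights

# F_M(x) = sign(sum_m alpha_m * f_m(x))
weighted_sum = alpha @ f
F = np.sign(weighted_sum)
print(weighted_sum, F)
```

On the last sample, two of the three learners vote +1, yet the ensemble predicts -1 because the first learner carries more weight than the other two combined. The ensemble is a weighted vote, not a simple majority.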

<br>
## 1.2 AdaBoost
Let's now talk about the **AdaBoost** algorithm itself. The idea is that we add base models one at a time, which is called **additive modeling**. We train each base model on all of the training data, so there is no resampling or bootstrapping here. The difference between this and the other algorithms we have studied is that we weight how important each sample is, using $w_i$ for $i = 1...N$, and we modify $w_i$ in each round. Initially the weights are all equal. If we get the prediction for the pair $x_i, y_i$ wrong, we increase $w_i$ for the next round; otherwise we decrease it. In this way, the next base model knows which samples are more important.

You can imagine that this may require us to modify the cost function. For example, for logistic regression, you would need to multiply each individual cross entropy by the weight for that sample.

#### $$J_{weighted} = -\sum_{i=1}^N w_i\Big[t_i\log y_i +(1-t_i)\log(1-y_i) \Big]$$
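
A minimal NumPy sketch of this weighted cost follows. It is written with the conventional leading minus sign so that it is a quantity to minimize. The function name `weighted_cross_entropy`, the clipping constant, and the sample values are assumptions for illustration; `t` holds {0, 1} targets and `y` holds predicted probabilities.

```python
import numpy as np

def weighted_cross_entropy(t, y, w, eps=1e-12):
    """Binary cross-entropy where sample i contributes with weight w[i]."""
    y = np.clip(y, eps, 1 - eps)                     # avoid log(0)
    per_sample = -(t * np.log(y) + (1 - t) * np.log(1 - y))
    return np.sum(w * per_sample)

t = np.array([1, 0, 1])
y = np.array([0.9, 0.2, 0.4])          # the third sample is fit poorly

uniform = np.full(3, 1 / 3)
focused = np.array([0.1, 0.1, 0.8])    # most weight on the third sample

cost_uniform = weighted_cross_entropy(t, y, uniform)
cost_focused = weighted_cross_entropy(t, y, focused)
print(cost_uniform, cost_focused)
```

Shifting weight onto the poorly fit sample raises the cost, which is how the weights steer the base model toward the samples that matter most in the current round.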

For the decision tree, luckily the scikit-learn API already allows us to pass sample weights to the `fit` function!
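
As a brief sketch, assuming scikit-learn is installed, `DecisionTreeClassifier.fit` accepts a `sample_weight` argument. The toy data and the weight values below are invented for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy 1-D data (hypothetical) that no single split can classify perfectly.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([-1, 1, -1, 1])

# Put most of the weight on the two middle samples.
w = np.array([0.05, 0.45, 0.45, 0.05])

stump = DecisionTreeClassifier(max_depth=1)
stump.fit(X, y, sample_weight=w)

pred = stump.predict(X)
weighted_error = np.sum(w * (pred != y)) / np.sum(w)
print(pred, weighted_error)
```

With these weights the stump places its split between the two heavy samples and gets both of them right, at the cost of the two light ones.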

Once we have trained the base model (on all data $X,Y$ with weights $w_i$, where $i=1..N$), we calculate its error weighted by $w_i$. We then compute $\alpha_m$, which represents how important this base model is to the final model as a function of the error. Note that if we had less error (our model was more accurate) then $\alpha_m$ should be bigger. Then we store $\alpha_m$ and we store the base model $f_m$. Once the loop is done, and we hit the specified number of base models, we exit the loop and we are done training. 

<br>
### 1.2.1 AdaBoost Pseudocode 
---
**Explained/with Equations**
* We want to start by giving $w_i$ a uniform distribution, so each sample has equal importance. 
* Then in a loop, we create a base model $f_m$ and train it on all the data with the weights in $w$
* We then calculate the error for this iteration, $\epsilon_m$, which is also weighted by the sample weights
#### $$\epsilon_m = \frac{\sum_{i=1}^N w_i I(y_i \neq f_m(x_i))}{\sum_{i=1}^N w_i}$$
* We calculate $\alpha_m$, which is half the log of the ratio of the weighted correct rate to the weighted error rate. 
#### $$\alpha_m = \frac{1}{2}log \Big[\frac{1-\epsilon_m}{\epsilon_m} \Big]$$
* Essentially this means that if the model is more correct, it gets a higher weight $\alpha$
* Note that the sum over $w$ in the denominator of the error rate $\epsilon_m$ is not necessary if we normalize $w$, as we do in the second-to-last step. It is left there for completeness, and as a reminder that the error is a number between 0 and 1. 
* Next, we update the $w_i$'s. We can see that if we are correct, the $y_i$ and $f_m(x_i)$ are the same sign, hence $w_i$ decreases. If we are wrong, then $y_i$ and $f_m(x_i)$ are of opposite signs, so $w_i$ increases. This is why we require the targets to be either -1 or +1.
#### $$w_i = w_i* exp\Big[-\alpha_my_if_m(x_i)\Big], i =1,...,N$$
* We then normalize $w$ because we treat it like a probability distribution 
#### $$w_i = \frac{w_i}{\sum_{i=1}^N w_i}$$
* The last step is to save $\alpha_m$ and $f_m(x)$
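
To see the numbers move, here is a sketch of a single round of the steps above on five samples. The labels and the base model's predictions are made up; only the formulas come from the list.

```python
import numpy as np

y    = np.array([ 1,  1, -1, -1,  1])   # true labels
pred = np.array([ 1,  1, -1, -1, -1])   # hypothetical f_m(x_i); only the last is wrong
w = np.full(5, 1 / 5)                   # uniform initial weights

# Weighted error: one of five equally weighted samples is wrong, so 0.2.
eps = np.sum(w * (pred != y)) / np.sum(w)

# Model weight: 0.5 * log(0.8 / 0.2) = 0.5 * log(4).
alpha = 0.5 * np.log((1 - eps) / eps)

# Correct samples shrink by exp(-alpha), the wrong one grows by exp(+alpha).
w = w * np.exp(-alpha * y * pred)

# Renormalize so the weights sum to 1 again.
w = w / np.sum(w)
print(eps, alpha, w)
```

After normalization the four correctly classified samples each hold weight 0.125 and the misclassified sample holds 0.5, so the next base model is pushed hard toward fixing that one mistake.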

**Pseudocode**
```
Initialize w[i] = 1/N for i=1..N
for m=1..M:
  Fit f_m(x) with sample weights w
  e_m = sum(w * (y != f_m(x))) / sum(w)
  a_m = 0.5 * log((1 - e_m) / e_m)
  w = w * exp(-a_m * y * f_m(x))
  w = w / sum(w)
  save a_m, f_m(x)
```

---
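The training loop described by the equations above can be sketched end to end in NumPy. This is an illustrative sketch rather than a reference implementation: the helper names (`fit_stump`, `predict_stump`, `adaboost_fit`, `adaboost_predict`), the brute-force weighted stump, the clipping of the error, and the toy data are all assumptions.

```python
import numpy as np

def fit_stump(X, y, w):
    """Weighted decision stump: the split with the lowest weighted error."""
    best, best_err = None, np.inf
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for pol in (1, -1):
                pred = np.where(X[:, j] <= thr, -pol, pol)
                err = np.sum(w * (pred != y))
                if err < best_err:
                    best, best_err = (j, thr, pol), err
    return best

def predict_stump(stump, X):
    j, thr, pol = stump
    return np.where(X[:, j] <= thr, -pol, pol)

def adaboost_fit(X, y, M):
    """Train M stumps on labels in {-1, +1}; return the stumps and their alphas."""
    N = len(y)
    w = np.full(N, 1.0 / N)                      # uniform initial weights
    models, alphas = [], []
    for _ in range(M):
        stump = fit_stump(X, y, w)
        pred = predict_stump(stump, X)
        eps = np.sum(w * (pred != y)) / np.sum(w)
        eps = np.clip(eps, 1e-10, 1 - 1e-10)     # guard against log(0) and division by 0
        alpha = 0.5 * np.log((1 - eps) / eps)
        w = w * np.exp(-alpha * y * pred)        # up-weight mistakes, down-weight hits
        w = w / np.sum(w)                        # renormalize
        models.append(stump)
        alphas.append(alpha)
    return models, np.array(alphas)

def adaboost_predict(models, alphas, X):
    votes = sum(a * predict_stump(m, X) for m, a in zip(models, alphas))
    return np.sign(votes)

# Toy data (hypothetical): +1 inside an interval, -1 outside it.
# No single stump can separate this, since it needs two cut points.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([-1, -1, 1, 1, -1, -1])

models, alphas = adaboost_fit(X, y, M=3)
first_stump_errors = int(np.sum(predict_stump(models[0], X) != y))
ensemble_pred = adaboost_predict(models, alphas, X)
print(first_stump_errors, ensemble_pred)
```

On this toy set the first stump alone misclassifies two of the six points, while the weighted vote of three stumps classifies all six correctly.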

Notice that the AdaBoost algorithm as presented is specific to binary classification, requiring the two classes to be **{-1, +1}**. There are extensions in the literature that modify AdaBoost for multiclass classification and regression. Our goal here is just to get the main idea down. If you do want to use AdaBoost for multiclass classification or regression, scikit-learn comes packaged with both already. As with random forest, the authors recommend using trees as the ideal base model, but linear classifiers are common as well.  
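
As a hedged sketch of the packaged version, assuming scikit-learn is installed, `AdaBoostClassifier` can be fit directly on a three-class problem such as the bundled iris data. The choice of dataset, `n_estimators=50`, and the train/test split are illustrative only, and the score will vary with the library version.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Three-class problem, so labels are 0, 1, 2 rather than {-1, +1}.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# With no base model specified, the default weak learner is a depth-1 tree.
clf = AdaBoostClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)

test_accuracy = clf.score(X_test, y_test)
print(test_accuracy)
```

For regression, `sklearn.ensemble.AdaBoostRegressor` is used in the same fit and predict style.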