Boosting

1) Boosting Basics
* What is Boosting?
    * Like Random Forest/Bagging: combines a set of "weak" learners to form a strong learner (Ensemble Method)
        * "weak" means the error rate is only slightly better than random guessing
    * Difference in Boosting: sequentially applies the weak classification algorithm to repeatedly modified versions of the data $\rightarrow$ a sequence of weak classifiers
        * uses **information** from **previous tree** to grow new tree

2) Boosting Algorithms
* **AdaBoost** - each tree focuses on correcting the errors of its predecessor. After each tree, observations are re-weighted based on those errors
![adaboost](adaboost.png)
    * Discrete AdaBoost - AdaBoost algorithm for classification
        * Discrete AdaBoost Algorithm:
            1. Initialize the observation weights $w_i=\frac{1}{N}, i=1,2,\dots,N$
            2. For $m=1$ to $M$:
                1. Fit a classifier $G_m(x)$ to the training data using weights $w_i$
                2. Compute error: $err_m=\frac{\sum_{i=1}^N w_i I(y_i \neq G_m(x_i))}{\sum_{i=1}^N w_i}$
                3. Compute the classifier weight: $\alpha_m=\log\left(\frac{1-err_m}{err_m}\right)$
                4. Set $w_i \leftarrow w_i \cdot e^{(\alpha_m \cdot I(y_i \neq G_m(x_i)))}, i=1,2,\dots,N$
            3. Output $G(x)=\text{sign}\left[\sum_{m=1}^M \alpha_m G_m(x)\right]$
        * AdaBoost Intuitive Steps:
            1. The first weak tree, $G_1(x)$, is fit on the training data
            2. Subsequent weak classifier trees use the same classification algorithm, but with modified weights
                * weights of misclassified observations are scaled by $e^{\alpha_m}$; otherwise $w_i$ remains the same
                * at each step $m$, observations previously misclassified by $G_{m-1}(x)$ have their weights increased
                * each successive classifier forced to **concentrate** on training observations **previously missed**
            3. Final strong classifier, $G(x)$, determined by weighted majority votes
                * $\alpha_1,\dots,\alpha_M$ as weight of votes (gives higher influence to more accurate classifiers)
                * the **smaller** the error of the weak classifier, the **greater** the weight
    * AdaBoost Algorithm Details:
        * Uses **Forward Stagewise Additive Modeling** - adds new basis functions without adjusting previous parameters and coefficients
        * AdaBoost uses the **Exponential Loss** function $L(y,f(x))=e^{-yf(x)}$ with $y \in \{-1,+1\}$, chosen mainly for its computational advantage
        * It can be shown that the additive expansion in AdaBoost estimates one-half the log-odds of $P(Y=1|x)$, which justifies taking its sign as the classification rule for the final classifier
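
The Discrete AdaBoost steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: it assumes labels coded as $-1/+1$, uses one-split decision stumps as the weak learner, and the helper names (`fit_stump`, `stump_predict`, `adaboost_fit`, `adaboost_predict`) and the toy data are made up for this example.

```python
import numpy as np

def stump_predict(X, j, t, s):
    """Decision stump: predict s where feature j exceeds t, otherwise -s."""
    return np.where(X[:, j] > t, s, -s)

def fit_stump(X, y, w):
    """Exhaustively search for the stump with the lowest weighted error."""
    best = None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for s in (1, -1):
                miss = stump_predict(X, j, t, s) != y
                err = np.sum(w * miss) / np.sum(w)      # step 2.2
                if best is None or err < best[0]:
                    best = (err, j, t, s)
    return best

def adaboost_fit(X, y, M=10):
    """Discrete AdaBoost; y must be coded as -1/+1."""
    N = len(y)
    w = np.full(N, 1.0 / N)                             # step 1
    ensemble = []
    for _ in range(M):                                  # step 2
        err, j, t, s = fit_stump(X, y, w)               # steps 2.1 and 2.2
        err = np.clip(err, 1e-10, 1 - 1e-10)            # guard against log(0)
        alpha = np.log((1 - err) / err)                 # step 2.3
        miss = stump_predict(X, j, t, s) != y
        w = w * np.exp(alpha * miss)                    # step 2.4
        ensemble.append((alpha, j, t, s))
    return ensemble

def adaboost_predict(ensemble, X):
    """Weighted majority vote: sign of the alpha-weighted sum (step 3)."""
    score = sum(a * stump_predict(X, j, t, s) for a, j, t, s in ensemble)
    return np.sign(score)

# Toy 1-D data (hypothetical): positives sit in the middle of the range,
# so no single stump can separate the classes.
X_toy = (np.arange(10) / 10).reshape(-1, 1)
y_toy = np.where((X_toy[:, 0] > 0.25) & (X_toy[:, 0] < 0.65), 1, -1)

ensemble = adaboost_fit(X_toy, y_toy, M=3)
train_acc = np.mean(adaboost_predict(ensemble, X_toy) == y_toy)
print("training accuracy:", train_acc)
```

The exhaustive stump search is only practical for tiny data sets; it is used here so that every step of the algorithm stays visible.
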
* **Gradient Boosted** - instead of fitting to re-weighted training observations, each new tree is fit to the residuals left by the previous trees
![gradient_boost](gradient_boost.png)
* Gradient Boosted Regression Trees Steps:
    1. Set $\hat{f}(x)=0$ and $\gamma_i=y_i$ for all $i$ in the training set
    2. For $b=1,2,\dots,B$, repeat:
        1. Fit a tree $\hat{f}^b$ with $d$ splits ($d+1$ terminal nodes) to the training data $(X,\gamma)$
        2. Update $\hat{f}$ by adding in a shrunken version of the new tree: $\hat{f}(x)\leftarrow \hat{f}(x)+\lambda\hat{f}^b(x)$
        3. Update the residuals: $\gamma_i \leftarrow \gamma_i - \lambda \hat{f}^b(x_i)$
    3. Output the boosted model: $\hat{f}(x) = \sum_{b=1}^B \lambda\hat{f}^b(x)$
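
The regression boosting steps above can be sketched as follows. This is a rough NumPy illustration under stated assumptions: a single feature, trees with $d=1$ split (stumps), and hypothetical helper names (`fit_regression_stump`, `boost`, `boost_predict`) and toy data chosen only for this example.

```python
import numpy as np

def fit_regression_stump(x, resid):
    """Fit a tree with d = 1 split to the residuals by minimizing the RSS."""
    best = None
    for t in np.unique(x)[:-1]:                  # keeps both sides non-empty
        left, right = resid[x <= t], resid[x > t]
        rss = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if best is None or rss < best[0]:
            best = (rss, t, left.mean(), right.mean())
    _, t, c_left, c_right = best
    return t, c_left, c_right

def stump_predict(stump, x):
    t, c_left, c_right = stump
    return np.where(x <= t, c_left, c_right)

def boost(x, y, B=200, lam=0.1):
    """Boosting for regression: repeatedly fit stumps to the current residuals."""
    resid = y.astype(float).copy()               # step 1: f = 0, residuals = y
    trees = []
    for _ in range(B):                           # step 2
        stump = fit_regression_stump(x, resid)   # 2.1: fit to (X, residuals)
        trees.append(stump)                      # 2.2: add the new tree to the model
        resid -= lam * stump_predict(stump, x)   # 2.3: update the residuals
    return trees

def boost_predict(trees, x, lam=0.1):
    """Step 3: the boosted model is the sum of the shrunken trees."""
    return lam * sum(stump_predict(s, x) for s in trees)

# Toy data (hypothetical): a noiseless sine curve.
x_demo = np.linspace(0, 6, 60)
y_demo = np.sin(x_demo)

trees = boost(x_demo, y_demo, B=200, lam=0.1)
mse = np.mean((y_demo - boost_predict(trees, x_demo, lam=0.1)) ** 2)
print("training MSE after 200 trees:", mse)
```

The same `lam` must be passed to `boost` and `boost_predict`, since the shrinkage factor is part of the model.
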
* Bias/Variance For Number of Trees Used
![num_trees_boosted](num_trees_boosted.png)
    * A single boosted tree - high bias, low variance
    * A lot of boosted trees - low bias, high variance
    * Unlike random forest, where adding more trees reduces variance, boosting can overfit if the number of trees is too large
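
One way to look at this trade-off is to track held-out error as trees are added. The sketch below assumes scikit-learn is available and uses its `GradientBoostingRegressor.staged_predict`, which yields predictions after each additional tree; the synthetic data and all parameter values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data (hypothetical), split into train and test sets.
X, y = make_regression(n_samples=500, n_features=10, noise=20.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gbr = GradientBoostingRegressor(
    n_estimators=500, learning_rate=0.1, max_depth=3, random_state=0
)
gbr.fit(X_train, y_train)

# Error after 1, 2, ..., 500 trees on both sets.
train_mse = [mean_squared_error(y_train, p) for p in gbr.staged_predict(X_train)]
test_mse = [mean_squared_error(y_test, p) for p in gbr.staged_predict(X_test)]

best_n_trees = int(np.argmin(test_mse)) + 1
print("number of trees with the lowest test MSE:", best_n_trees)
```

Training error keeps falling as trees are added, so the number of trees is better chosen from the held-out curve (or by cross-validation) than from the training curve.
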

3) Hyperparameter Tuning For Boosting - Tree Structure, Shrinkage, and Stochastic Gradient Boosting
* **Tree Structure** - adjusting hyperparameters for depth of tree and minimum samples per leaf
![max_depth_boosting](max_depth_boosting.png)
    * **max_depth** - controls degree of interactions
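        * example: capturing the interaction between latitude and longitude needs at least two splits
        * rarely needs to be larger than 4 to 6
    * **min_samples_per_leaf** - may not want terminal nodes with too few samples (scikit-learn spells this `min_samples_leaf`)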
* **Shrinkage** - adjusting hyperparameters for number of trees and learning rate
![n_estimators_and_learning_rate_boosting](n_estimators_and_learning_rate_boosting.png)
    * **n_estimators** - number of trees grown
    * **learning_rate** - lower learning rate requires **higher** n_estimators
        * as the learning rate goes down, the number of trees needed goes up
* **Stochastic Gradient Boosting** - adjusting the maximum number of features per split and limiting the training set to a random subset
![stochastic_gradient_boosting](stochastic_gradient_boosting.png)
    * **max_features** - like random forest, choose a random subsample of features (great when you have too many features)
    * **sub_sample** - random subset of the training set used to grow each tree (scikit-learn spells this `subsample`)
        * both randomly sampling the features and randomly subsetting the training set can lead to **improved accuracy** and **reduced run-time**
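
As a rough sketch of how these knobs might be tuned together, the example below runs a small grid search. It assumes scikit-learn, where the parameters are spelled `max_depth`, `min_samples_leaf`, `n_estimators`, `learning_rate`, `max_features` and `subsample`; the data set and the grid values are arbitrary and only meant to show the mechanics.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data (hypothetical).
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

# Fixed settings: number of trees, leaf size, and features tried per split.
base = GradientBoostingRegressor(
    n_estimators=200,
    min_samples_leaf=5,
    max_features="sqrt",
    random_state=0,
)

# Searched settings: tree structure, shrinkage, and row subsampling.
param_grid = {
    "max_depth": [2, 4],
    "learning_rate": [0.1, 0.01],
    "subsample": [1.0, 0.5],
}

search = GridSearchCV(base, param_grid, cv=3, scoring="neg_mean_squared_error")
search.fit(X, y)
print(search.best_params_)
```

Because a lower `learning_rate` needs more trees, a fuller search would usually vary `n_estimators` alongside it rather than holding it fixed as done here.
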

4) Feature Importance For Boosting
* For bagged/RF regression trees, record the total amount that the *RSS is decreased by splits over a given predictor, averaged over all $B$ trees*; a large value indicates an important predictor
* Similarly, for bagged/RF classification trees, add up the *total amount that the Gini index is decreased by splits* over a given predictor, averaged over all $B$ trees
* The same split-improvement measure, averaged over the trees, is used for boosted trees
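
For a boosted model, a minimal sketch of reading off these importances might look like the following. It assumes scikit-learn, whose tree ensembles expose impurity-based importances through the `feature_importances_` attribute (normalized to sum to 1); the data is synthetic, with only the first two of five features driving the response.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data (hypothetical): x0 and x1 matter, x2..x4 are pure noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.1, size=500)

model = GradientBoostingRegressor(n_estimators=100, max_depth=2, random_state=0)
model.fit(X, y)

# Impurity decrease attributed to each feature, aggregated over all trees.
importances = model.feature_importances_
for idx in np.argsort(importances)[::-1]:
    print(f"x{idx}: {importances[idx]:.3f}")
```
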

5) **Partial Dependence Plots** - these plots show the dependence between the target function and a chosen feature (or small set of features), marginalizing over the values of all other features
* e.g. how the log-odds of spam changes with a single predictor
![partial_dependence_plot](partial_dependence_plot.png)
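
To make "marginalizing over the other features" concrete, the sketch below computes a one-feature partial dependence curve by hand: the feature of interest is fixed at each grid value for every row, and the model's predictions are averaged. It assumes scikit-learn for the boosted model; the helper `partial_dependence_1d` and the synthetic data are hypothetical and exist only for this illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data (hypothetical): the response rises linearly in x0.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(400, 3))
y = 3 * X[:, 0] + np.sin(3 * X[:, 1]) + rng.normal(scale=0.1, size=400)

model = GradientBoostingRegressor(random_state=0).fit(X, y)

def partial_dependence_1d(model, X, feature, grid):
    """Average prediction with `feature` fixed at each grid value."""
    pd_values = []
    for v in grid:
        X_mod = X.copy()
        X_mod[:, feature] = v          # overwrite the feature for every row
        pd_values.append(model.predict(X_mod).mean())  # average over the rest
    return np.array(pd_values)

grid = np.linspace(-1, 1, 11)
pd_x0 = partial_dependence_1d(model, X, feature=0, grid=grid)
for v, p in zip(grid, pd_x0):
    print(f"x0 = {v:+.1f} -> average prediction {p:.2f}")
```

scikit-learn also ships `sklearn.inspection.partial_dependence` and `PartialDependenceDisplay` for this purpose; the manual loop is shown only because it makes the averaging step explicit.
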