# Forests for the trees

1. Bagging
2. Random Forests
3. Boosting

The lecture draws from Chapter 8 of James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). "An Introduction to Statistical Learning: with Applications in R."

---

# 1. Bagging

At the end of the last lecture, we outlined two major limitations of regression and classification trees.

* **Lower predictive accuracy:** At least compared to linear models.
* **Non-robust:** A small change in the data can cause a massive change in the structure of the estimated tree.

More specifically, this means that decision tree approaches suffer from high variance. In this lecture we will go over methods for dealing with this high variance **for the goal of prediction**.

One very popular method to increase the robustness of predictions from tree methods (and really all methods) is something known as _bootstrap aggregation_, often called _bagging_.

As the name implies, bagging relies on a variation of the bootstrap. Remember, the traditional bootstrap uses sampling with replacement to generate resampled versions of your data set as a way of approximating repeated replications of the experiment. We originally talked about using the bootstrap as a way of building confidence intervals on your parameter estimates.

Bagging works off of this same logic (and same resampling method), but instead of estimating confidence of a parameter, it uses averaging to _reduce the variance of a prediction_. Let's see how this works.

Given a set of $B$ bootstrapped data sets, with predictors $X^*_1, ..., X^*_B$ and response variables $Y^*_1, ..., Y^*_B$ (generated using sampling with replacement), you can learn a new set of models $\hat{f}^1(x), ..., \hat{f}^B(x)$, one per bootstrapped data set. The averaged prediction

$$\hat{f}_{avg}(x) = \frac{1}{B}\sum_{b=1}^B \hat{f}^b(x)$$

is a more stable, less variable estimate than the individual $\hat{f}(x)$ that you would get from the original, non-bootstrapped data set.

That is the simplicity and elegance of bagging. Leveraging the properties of the bootstrap, you can reduce the variance of your predictions.

Now bagging works for any form of $\hat{f}(x)$, but it is particularly useful for decision tree methods. In this case, the bootstrap generates $B$ trees. Remember from the last lecture, one of the problems with tree methods is that they are highly sensitive to the data set that is used to generate them. So each of the bootstrapped trees will be different from the others. But in this case, we are turning this bug into a feature: _the average of a set of predictions from individual trees with high variance is a prediction that has low variance_.
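To make the averaging concrete, here is a minimal from-scratch sketch using only the Python standard library. It is not the book's implementation: the helper names (`fit_stump`, `fit_bagged`, `simulate`), the one-split "stump" used in place of a full tree, and the simulated step-function data are all assumptions made for illustration. It fits a single stump and a bag of stumps to many independent training sets and compares how much their predictions at one point vary.

```python
import random
from statistics import pvariance


def mean(values):
    return sum(values) / len(values)


def fit_stump(xs, ys):
    """Fit a one-split regression tree (a "stump") by minimising the RSS."""
    best = None
    for t in sorted(set(xs))[1:]:  # every threshold that leaves both sides non-empty
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        lm, rm = mean(left), mean(right)
        rss = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or rss < best[0]:
            best = (rss, t, lm, rm)
    if best is None:  # all x values identical: fall back to the overall mean
        overall = mean(ys)
        return lambda x: overall
    _, t, lm, rm = best
    return lambda x: lm if x < t else rm


def fit_bagged(xs, ys, n_models, rng):
    """Fit one stump per bootstrap sample and average their predictions."""
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sampling with replacement
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: sum(m(x) for m in models) / len(models)


def simulate(rng, n=30):
    """Hypothetical data: a noisy step function of a single predictor."""
    xs = [rng.random() for _ in range(n)]
    ys = [(1.0 if x > 0.5 else 0.0) + rng.gauss(0, 0.5) for x in xs]
    return xs, ys


rng = random.Random(0)
single_preds, bagged_preds = [], []
for _ in range(30):  # 30 independent training sets
    xs, ys = simulate(rng)
    single_preds.append(fit_stump(xs, ys)(0.5))
    bagged_preds.append(fit_bagged(xs, ys, n_models=25, rng=rng)(0.5))

print("variance of single-stump prediction:", pvariance(single_preds))
print("variance of bagged prediction:      ", pvariance(bagged_preds))
```

In practice you would reach for a library implementation with full trees rather than stumps; the point here is only the structure: resample, fit, average.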

The method of averaging works well in the context of regression. However, it is not immediately clear how to aggregate across resampled trees for classification trees. In this case, each tree in the bag produces a specific class prediction. So the aggregated prediction, $\hat{f}_{avg}$, is determined by _majority vote_: for a given value of $X$, return the class most commonly predicted across the $B$ trees.
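A majority vote is a one-liner in most languages. The sketch below assumes a hypothetical list of class labels, one per bagged tree, for a single observation; the function name `majority_vote` and the labels are illustrative.

```python
from collections import Counter


def majority_vote(class_predictions):
    """Return the most common class among the per-tree predictions."""
    return Counter(class_predictions).most_common(1)[0][0]


# Hypothetical predictions from B = 7 bagged classification trees
tree_votes = ["Yes", "No", "Yes", "Yes", "No", "Yes", "No"]
print(majority_vote(tree_votes))  # "Yes" wins 4 votes to 3
```

Using an odd $B$ for two-class problems avoids ties; with this sketch a tie would go to whichever class was seen first.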



### Out-of-bag (OOB) Error

Because bagging is focused on prediction, instead of just confidence on a parameter estimate, it can take advantage of data that is normally just discarded in order to evaluate the accuracy of each resampled model fit $\hat{f}^b(x)$.

During resampling with replacement, a certain amount of the data is left out. For example, consider the data vector $X$.

$$ X = [13, 45, 32, 16, 25, 33] $$

The first resampled version of $X$ (i.e., $X^*_1$) might look like this:

$$ X^*_1 = [33, 33, 25, 16, 16, 25] $$

In this case, the values 13, 45, and 32 are no longer included in $X^*_1$. In the bootstrap we would just ignore (or discard) those values. But in bagging, you can use them to form an independent test set for evaluating the model you fit to $X^*_1$ (which we called $\hat{f}^1(x)$ above).
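As a quick check on the example vectors, this small sketch lists every value of $X$ that never appears in $X^*_1$. The variable names are illustrative. With real data you would track row indices rather than values, since values can repeat.

```python
X = [13, 45, 32, 16, 25, 33]
X_star_1 = [33, 33, 25, 16, 16, 25]

in_bag = set(X_star_1)
out_of_bag = [x for x in X if x not in in_bag]
print(out_of_bag)  # [13, 45, 32]
```

Note that three of the six values are out of bag here, including 32.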

**This means that you can evaluate the hold-out test accuracy of all models $\hat{f}^1(x), ..., \hat{f}^B(x)$ as you generate them!** We call this the _out-of-bag (OOB)_ error.

Now, on average, each bootstrap sample (drawn with replacement) leaves out about 1/3 of the observations. This means that, on average, the OOB error for each model is calculated on a test set about 1/3 of the total sample size.
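Where does the 1/3 come from? The chance that a given observation is missed by all $n$ draws is $(1 - 1/n)^n$, which approaches $1/e \approx 0.368$ as $n$ grows. The sketch below checks this by simulation; the function name `oob_fraction` and the choices of $n$ and number of repetitions are arbitrary.

```python
import math
import random


def oob_fraction(n, rng):
    """Draw one bootstrap sample of size n; return the share of rows never drawn."""
    drawn = {rng.randrange(n) for _ in range(n)}
    return 1 - len(drawn) / n


rng = random.Random(0)
n = 1000
fractions = [oob_fraction(n, rng) for _ in range(200)]
simulated = sum(fractions) / len(fractions)

print("simulated OOB share:", simulated)
print("exact (1 - 1/n)^n:  ", (1 - 1 / n) ** n)
print("limit 1/e:          ", math.exp(-1))
```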

The elegance of OOB, compared to using a traditional hold-out set approach, is that it gives you an estimate of test error without having to set any data aside: every observation is still available for fitting. The figure below compares the error of models when evaluated using OOB versus a traditional hold-out test set.

![OOB](imgs/L21_OOB.png)

Notice that the OOB error is consistently lower than the test error. (You can ignore the Random Forest results for now; we are getting to that soon.)



### Variable Importance

Bagging typically results in improved prediction accuracy when compared against just a single tree. But this gain in accuracy comes at the cost of a loss of interpretability. This is due to the fact that each bagged model $\hat{f}^b(x)$ is different from the others (because each individual tree is highly variable), so there is no longer a single tree to look at.

One way to get around this is to estimate the importance of each feature variable on each pull from the bag. For regression this is the total amount that the RSS decreases due to splits on a given predictor. For classification this is the total decrease in the Gini index due to splits on that predictor. On each sample from the bag you get an estimate of variable importance for each feature (i.e., how much does splitting on it help explain $Y$), and you take the average of these measures across all $B$ models you generate during bagging.
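Here is a simplified sketch of that bookkeeping for regression, using only the standard library. To keep it short, each bagged "tree" is a single split, so each model credits its RSS decrease to exactly one predictor; a full tree would sum the decreases over all of its splits. The function names and the simulated data (predictor 0 strong, predictor 1 weak, predictor 2 pure noise) are assumptions for illustration.

```python
import random


def rss(ys):
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)


def best_split(rows, ys):
    """Return (feature index, RSS decrease) for the best single split."""
    total = rss(ys)
    best_feature, best_gain = None, 0.0
    for j in range(len(rows[0])):
        for t in sorted({r[j] for r in rows})[1:]:  # both sides stay non-empty
            left = [y for r, y in zip(rows, ys) if r[j] < t]
            right = [y for r, y in zip(rows, ys) if r[j] >= t]
            gain = total - rss(left) - rss(right)
            if gain > best_gain:
                best_feature, best_gain = j, gain
    return best_feature, best_gain


def bagged_importance(rows, ys, n_models, rng):
    """Average each predictor's RSS decrease over n_models bootstrap fits."""
    n, p = len(rows), len(rows[0])
    totals = [0.0] * p
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        feature, gain = best_split([rows[i] for i in idx], [ys[i] for i in idx])
        if feature is not None:
            totals[feature] += gain
    return [t / n_models for t in totals]


rng = random.Random(0)
rows = [[rng.random() for _ in range(3)] for _ in range(40)]
ys = [3.0 * (r[0] > 0.5) + 0.5 * r[1] + rng.gauss(0, 0.3) for r in rows]

importance = bagged_importance(rows, ys, n_models=50, rng=rng)
for j, value in enumerate(importance):
    print(f"predictor {j}: mean RSS decrease = {value:.2f}")
```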

An example used in the book comes from an analysis of the Heart data set. You can see which variables are more important in the classification than others.

![Variable Importance](imgs/L21_VariableImportance.png)




---
# 2. Random Forests

One problem you sometimes run into when bagging is that, when just a few predictor variables are very important for predicting $Y$, there is a lot of redundancy (or correlation) across the trees $\hat{f}^1(x), ..., \hat{f}^B(x)$. This is because the same strong variables get picked for the top splits in most trees and dominate the predictions. Averaging highly correlated trees reduces variance much less than averaging uncorrelated ones, and the weaker predictors rarely get a chance to explain meaningful variance in $Y$.

One way to get around this is to restrict, at each split of each bagged tree, which predictors can be used: only _a random sample of $m$ predictors_ is eligible. So on the first split only $m$ predictors are allowed to even be considered. Then on the next split a fresh sample of $m$ predictors is drawn, and so forth. Typically we set $m \approx \sqrt{p}$, where $p$ is the total number of predictors. This means that at each split the algorithm isn't even allowed to consider the majority of the possible predictors.

This method is known as _random forest_.

The art of random forest is that, by protecting the algorithm from being dominated by a few variables with high importance, it _decorrelates_ the estimated trees. If there is one strong predictor variable, on average $\frac{(p-m)}{p}$ of the splits will not even be allowed to consider it. This substantially reduces the variance of $\hat{f}_{avg}(x)$.
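The $\frac{(p-m)}{p}$ figure is easy to verify by simulating the per-split sampling. In this sketch the value $p = 13$, the index of the "strong" predictor, and the number of simulated splits are all illustrative assumptions.

```python
import math
import random

p = 13                    # illustrative total number of predictors
m = round(math.sqrt(p))   # candidates drawn at each split (4 here)
strong = 0                # index of a hypothetical dominant predictor

rng = random.Random(0)
n_splits = 20000
# At each split, draw m candidate predictors without replacement and
# count how often the strong predictor is not among them.
excluded = sum(strong not in rng.sample(range(p), m) for _ in range(n_splits))
share = excluded / n_splits

print("simulated share of splits that cannot use it:", share)
print("(p - m) / p:", (p - m) / p)
```

With these values roughly 9 out of every 13 splits have to find some other predictor to split on, which is what forces the trees apart.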

If we compare the Test and OOB error for bagging versus random forests, by replotting the figure we showed above, it becomes clear how much of an improvement this decorrelation is for prediction accuracy.

![OOB](imgs/L21_OOB.png)

All in all, selecting a small $m$ helps when you have a large number of correlated predictors. This ensures that the trees are as decorrelated as possible.

---
# 3. Boosting

For our final method of improving prediction accuracy of decision trees, we turn to another general purpose method that just happens to work very well on trees.

Remember that in bagging and random forest, you are creating multiple copies of your original data set (via sampling with replacement), fitting a separate tree to each copy, and combining all of those predictions in order to find the best single prediction. This means that each model $\hat{f}^1(x), ..., \hat{f}^B(x)$ is fit independently of the others.

Now imagine a scenario where, instead of each model being independent, each model learns from the previous models. This is the logic of the method known as _boosting_. In boosting, each model is learned _sequentially_.

The secret sauce is in what you use to train your models. Normally, in bagging and random forest, each $\hat{f}^b(x)$ is trained on $Y$. In boosting, each $\hat{f}^b(x)$ is trained on the residuals left by the previous models.

<br>
The algorithm works like this:

1. Start with $\hat{f}(x)=0$ and $r_i = y_i$.

2. For $b=1, ..., B$ repeat the following:


* (a) Fit a model $\hat{f}^b$ with the predictors $X$ and the current residuals $r$ as the response variable.

* (b) Update $\hat{f}(x)$ by adding a shrunken version of the new model:

$$ \hat{f}(x) \leftarrow \hat{f}(x) + \lambda \hat{f}^b(x)$$

* (c) Update the residuals

$$ r_i \leftarrow r_i - \lambda \hat{f}^b(x_i)$$

3. Output the boosted model $\hat{f}(x)$:

$$\hat{f}(x) = \sum_{b=1}^B \lambda \hat{f}^b(x)$$ 
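The three steps can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not a reference implementation: each $\hat{f}^b$ is a one-split stump (that is, $d = 1$), the helper names `fit_stump` and `boost` are hypothetical, and the noisy sine data is simulated. Note that the residual update subtracts the shrunken fit, so whatever the current model already explains is removed before the next stump is trained.

```python
import math
import random


def mean(values):
    return sum(values) / len(values)


def fit_stump(xs, rs):
    """Fit a one-split regression tree to the responses rs by minimising RSS."""
    best = None
    for t in sorted(set(xs))[1:]:  # thresholds that leave both sides non-empty
        left = [r for x, r in zip(xs, rs) if x < t]
        right = [r for x, r in zip(xs, rs) if x >= t]
        lm, rm = mean(left), mean(right)
        rss = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or rss < best[0]:
            best = (rss, t, lm, rm)
    if best is None:  # all x values identical
        overall = mean(rs)
        return lambda x: overall
    _, t, lm, rm = best
    return lambda x: lm if x < t else rm


def boost(xs, ys, n_trees, shrinkage):
    residuals = list(ys)              # step 1: f(x) = 0, so r_i = y_i
    stumps = []
    for _ in range(n_trees):          # step 2
        stump = fit_stump(xs, residuals)       # (a) fit to current residuals
        stumps.append(stump)                   # (b) f <- f + lambda * f_b
        residuals = [r - shrinkage * stump(x)  # (c) r_i <- r_i - lambda * f_b(x_i)
                     for x, r in zip(xs, residuals)]

    def model(x):                     # step 3: sum of shrunken stumps
        return sum(shrinkage * s(x) for s in stumps)

    return model, residuals


rng = random.Random(0)
xs = [rng.random() for _ in range(50)]
ys = [math.sin(2 * math.pi * x) + rng.gauss(0, 0.2) for x in xs]

model, residuals = boost(xs, ys, n_trees=200, shrinkage=0.1)
print("training RSS before boosting:", sum(y ** 2 for y in ys))
print("training RSS after boosting: ", sum(r ** 2 for r in residuals))
```

Only training error is reported here; choosing $B$, $\lambda$, and $d$ should be done with held-out data or cross-validation.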

<br>

This model learns slowly because each new model is fit only to the residuals left over by the previous steps. This means that how well the earlier models did determines what is left for the later models to learn. The shrinkage parameter $\lambda$ slows this down further: only a fraction $\lambda$ of each new fit is added to $\hat{f}(x)$, so each step removes only a small part of the residuals. In the case of decision trees, this allows more trees, with different shapes, to attack the residuals. This puts a lot of energy into "eating" the residual variance.

In the context of decision trees, boosting has 3 parameters that you need to tune: the number of trees ($B$), the shrinkage parameter ($\lambda$), and the number of allowed splits in each tree ($d$). This last parameter $d$ controls the model complexity. 

But the advantage is that, if you have enough data and compute to tune each of these three parameters, boosting often gives you the strongest predictive performance among tree models, in many cases outperforming even random forest. However, this does come at a cost of time: the trees must be fit one after another, and the tuning itself takes many runs.