# Bootstrapping, Bagging, Random Forests, and ExtraTrees

## Introduction to Ensemble Methods
We can list out the different types of models we've built thus far:
- Linear Regression
- Generalized Linear Models
    - Logistic Regression
    - Multinomial Logistic Regression
    - Poisson Regression
- $k$-Nearest Neighbors
- Naive Bayes Classification

If we want to use any of these models, we follow the same type of process.
1. Based on our problem, we identify which model to use. (Is our problem classification or regression? Do we want an interpretable model?)
2. Fit the model using the training data.
3. Use the fit model to generate predictions.
4. Evaluate our model's performance and, if necessary, return to step 2 and make changes.

So far, we've always had exactly one model. Today, however, we're going to talk about **ensemble methods**. Mentally, you should think of an ensemble as building multiple models and then aggregating their results in some way.

### Why would we build an "ensemble model?"
Consider $H$ the space of all possible hypotheses. Our goal is to estimate $f$, the true function. We can come up with different hypotheses $h_1$, $h_2$, and so on to get as close to $f$ as possible. 
- Think about $f$ as the true process that dictates Iowa liquor sales.
- Think about $h_1$ as the model you built to predict $f$.

![](./assets/images/ensemble-benefits.png)

- The **statistical** benefit to ensemble methods: If we build only one model, its predictions are almost certainly going to be wrong. Predictions from one model might overestimate liquor sales; predictions from another model might underestimate liquor sales. By "averaging" predictions from multiple models, we'll see that we can often cancel our errors out and get closer to the true function $f$.
- The **computational** benefit to ensemble methods: It might be impossible to develop one model that globally optimizes our objective function. (Remember that CART reaches locally optimal solutions and that generalized linear models iterate toward a solution that isn't guaranteed to be globally optimal.) In these cases, it may be impossible for one CART or an individual GLM to arrive at the true function $f$. However, starting our "local searches" at different points and aggregating our predictions might get us closer to $f$ than any single search would.
- The **representational** benefit to ensemble methods: Even if we had all the data and all the computing power in the world, it might be impossible for one model to exactly equal $f$. For example, a linear regression model can never model a relationship where a one-unit change in $X$ effects some *different* change in $Y$ based on the value of $X$. All models have some shortcomings. (See [the no free lunch theorems](https://en.wikipedia.org/wiki/No_free_lunch_in_search_and_optimization).) While individual models have shortcomings, by creating multiple models and aggregating their predictions, we can actually create predictions that represent something that one model cannot ever represent.
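
To make the statistical benefit concrete, here is a minimal simulation sketch. It assumes each model's error is independent, zero-mean noise around the truth, which real models rarely satisfy exactly; the numbers (`true_value`, `n_models`, the noise scale) are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

true_value = 100.0   # hypothetical "true" sales figure
n_models = 50        # hypothetical number of models in the ensemble
n_trials = 2000      # repeat the experiment to estimate typical error

# Each model's prediction = truth + its own independent, zero-mean error.
predictions = true_value + rng.normal(0.0, 10.0, size=(n_trials, n_models))

# Typical absolute error of a single model vs. the ensemble average.
single_model_error = np.abs(predictions[:, 0] - true_value).mean()
ensemble_error = np.abs(predictions.mean(axis=1) - true_value).mean()

print(f"single model MAE: {single_model_error:.2f}")
print(f"ensemble MAE:     {ensemble_error:.2f}")
```

Under these assumptions the averaged prediction lands much closer to the truth than a single model does, because overestimates and underestimates tend to cancel.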

## Bootstrapping

Let's get started actually making ensemble predictions. However, in order to do that, we'll need to introduce the idea of bootstrapping, or random sampling **with replacement.**

---



In [1]:
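import numpy as np

def bootstrap(data, stat=np.mean, size=1000):
    """Return `size` bootstrap replicates of `stat` computed on `data`."""
    stat_list = []
    for _ in range(size):
        # Resample len(data) observations with replacement, then compute the statistic.
        resample = np.random.choice(data, size=len(data), replace=True)
        stat_list.append(stat(resample))
    return stat_list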

In [2]:
lst = [1,2,3,4,5,6,7,8,9,10]

In [3]:
bootstrap(lst, size = 5)

[5.5999999999999996,
 6.5999999999999996,
 5.0999999999999996,
 5.2999999999999998,
 5.4000000000000004]

## Bagged Decision Trees

As we have seen, decision trees are very powerful machine learning models. However, decision trees have some limitations. In particular, trees that are grown very deep tend to learn highly irregular patterns (that is, they overfit their training sets).

![](https://qph.ec.quoracdn.net/main-qimg-69838e874a74b9537c2540c65dbea725)

Bagging (bootstrap aggregating) helps to mitigate this problem by exposing different trees to different sub-samples of the whole training set.

The process for creating bagged decision trees is as follows:
1. From the original data of size $n$, bootstrap $k$ samples each of size $n$ (with replacement!).
2. Build a decision tree on each bootstrapped sample.
3. Make predictions by passing a test observation through all $k$ trees and developing one aggregate prediction for that observation.
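
The three steps above can be sketched by hand as follows. This is an illustrative sketch, not scikit-learn's implementation: the data is a synthetic stand-in from `make_classification`, and the names `X_demo`, `y_demo`, `trees`, and `bagged_pred` are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)

# Synthetic stand-in data (an assumption; any X, y would do).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
n = len(X_demo)
k = 25  # number of bootstrapped samples, and therefore trees

# Steps 1 and 2: bootstrap k samples of size n and fit one tree per sample.
trees = []
for _ in range(k):
    idx = rng.integers(0, n, size=n)  # row indices drawn with replacement
    tree = DecisionTreeClassifier(random_state=0).fit(X_demo[idx], y_demo[idx])
    trees.append(tree)

# Step 3: pass observations through all k trees, then take a plurality vote.
all_votes = np.array([tree.predict(X_demo[:5]) for tree in trees])  # shape (k, 5)
bagged_pred = np.array([np.bincount(col).argmax() for col in all_votes.T])
print(bagged_pred)
```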

![www.cse.buffalo.edu/~jing/sdm10ensemble.htm](./assets/images/Ensemble.png)

### What do you mean by "aggregate prediction?"
As with all of our modeling techniques, we want to make sure that we can come up with one final prediction for our observation. (Building 1,000 trees and coming up with 1,000 predictions for one observation probably wouldn't be very helpful.)

Suppose we want to predict whether or not a Reddit post is going to go viral, where `1` indicates viral and `0` indicates non-viral. We build 100 decision trees. Given a new Reddit post labeled `X_test`, we pass these features into all 100 decision trees.
- 70 of the trees predict that the post in `X_test` will go viral.
- 30 of the trees predict that the post in `X_test` will not go viral.

<details><summary> What might you expect `.predict(X_test)` and `.predict_proba(X_test)` to do?
</summary>
```
- .predict(X_test) should output a 1, predicting that the post will go viral.
- .predict_proba(X_test) should output 0.7, indicating the probability of the post going viral is 70%.
```
</details>

- **Discrete:** In ensemble methods, we will most commonly predict a discrete $Y$ by "plurality vote," where the most common class is the predicted value for a given observation.
- **Continuous:** In ensemble methods, we will most commonly predict a continuous $Y$ by averaging the predicted values into one final prediction.
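
Both aggregation rules fit in a few lines of NumPy. The sketch below reuses the 70/30 Reddit example for the discrete case; the five regression predictions are hypothetical numbers chosen for illustration.

```python
import numpy as np

# Votes from 100 classification trees: 70 say viral (1), 30 say not viral (0).
votes = np.array([1] * 70 + [0] * 30)

plurality_class = np.bincount(votes).argmax()  # most common class
share_viral = votes.mean()                     # fraction of trees voting 1

# Hypothetical predictions from five regression trees.
tree_predictions = np.array([10.0, 12.0, 11.0, 13.0, 9.0])
averaged_prediction = tree_predictions.mean()

print(plurality_class, share_viral, averaged_prediction)
```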

---

## Random Forests

With bagged decision trees, we generate many different trees on pretty similar data. These trees are **strongly correlated** with one another, so averaging them reduces variance less than we might hope. To see why, look at the variance of the average of two random variables $T_1$ and $T_2$:

$$
\begin{eqnarray*}
Var\left(\frac{T_1+T_2}{2}\right) &=& \frac{1}{4}\left[Var(T_1) + Var(T_2) + 2Cov(T_1,T_2)\right]
\end{eqnarray*}
$$

If $T_1$ and $T_2$ are highly correlated, then the variance will be about as high as we'd see with individual decision trees. By "decorrelating" our trees from one another, we can drastically reduce the variance of our model.
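
As a worked case, assume (a simplification, not something the trees guarantee) that both trees have the same variance $\sigma^2$ and correlation $\rho$, so that $Cov(T_1,T_2) = \rho\sigma^2$. Then:

$$
\begin{eqnarray*}
Var\left(\frac{T_1+T_2}{2}\right) &=& \frac{1}{4}\left[\sigma^2 + \sigma^2 + 2\rho\sigma^2\right] \\
&=& \frac{(1+\rho)\,\sigma^2}{2}
\end{eqnarray*}
$$

With $\rho = 1$ the average has variance $\sigma^2$, no better than a single tree. With $\rho = 0$ it drops to $\sigma^2/2$.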

That's the difference between bagged decision trees and random forests! We're going to do the same thing as before, but we're going to decorrelate our trees. This will reduce our variance (at the expense of a small increase in bias) and thus should greatly improve the overall performance of the final model.

### So how do we "decorrelate" our trees?
Random forests differ from bagged decision trees in only one way: they use a modified tree learning algorithm that selects, at each split in the learning process, a **random subset of the features**. This process is sometimes called the *random subspace method*.

The reason for doing this is that trees built on ordinary bootstrap samples are correlated: if one or a few features are very strong predictors for the response variable (target output), these features will be used in many/all of the bagged decision trees, causing them to become correlated. By selecting a random subset of the features at each split, we counter this correlation between base trees, strengthening the overall model.

For a problem with $p$ features, it is typical to use:

- $\sqrt{p}$ (rounded down) features in each split for a classification problem.
- $p/3$ (rounded down) with a minimum node size of 5 as the default for a regression problem.
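
The two rules of thumb above can be written as a small helper. The function name `features_per_split` is hypothetical, and the floor of one feature is an added safeguard for very small $p$ rather than part of the rule itself.

```python
import math

def features_per_split(p, task):
    """Rule-of-thumb number of features to consider at each split (hypothetical helper)."""
    if task == "classification":
        return max(1, math.isqrt(p))  # floor(sqrt(p)), at least 1
    if task == "regression":
        return max(1, p // 3)         # floor(p / 3), at least 1
    raise ValueError("task must be 'classification' or 'regression'")

print(features_per_split(21, "classification"))  # floor(sqrt(21)) = 4
print(features_per_split(21, "regression"))      # 21 // 3 = 7
```

In scikit-learn, the number of features considered at each split is controlled by the `max_features` parameter of the forest estimators.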

This is only a guideline, but Hastie and Tibshirani (authors of *An Introduction to Statistical Learning* and *The Elements of Statistical Learning*) suggest it as a good default in the absence of some rationale to do something different.

Random forests, a step beyond bagged decision trees, are **very widely used** classifiers and regressors. They are relatively simple to use because they require very few parameters to set and they perform pretty well.
- It is quite common for interviewers to ask how a random forest is constructed or how it is superior to a single decision tree.

--- 

## Extremely Randomized Trees (ExtraTrees)
Adding one more step of randomization (and thus decorrelation) yields extremely randomized trees, or _ExtraTrees_. Like random forests, these use the random subspace method (sampling of features), and they can also be trained with bagging (sampling of observations), although scikit-learn's implementation defaults to `bootstrap=False` and fits each tree on the full training set. An additional layer of randomness is then introduced: instead of computing the locally optimal split (based on, e.g., information gain or the Gini impurity) for each feature under consideration, a random value is selected for the split. This value is selected from the feature's empirical range.
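
The random split itself is simple to sketch. The example below draws one threshold uniformly from a feature's empirical range; the feature values are hypothetical, and this only illustrates the idea rather than reproducing scikit-learn's internal code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical values of one feature for the observations at a node.
feature_values = np.array([2.1, 3.5, 0.7, 4.9, 1.8, 3.3])

# ExtraTrees-style split: draw the threshold uniformly from the empirical range
# instead of searching for the best cut point.
low, high = feature_values.min(), feature_values.max()
threshold = rng.uniform(low, high)

goes_left = feature_values <= threshold
print(f"threshold={threshold:.3f}, left={goes_left.sum()}, right={(~goes_left).sum()}")
```

Because the threshold is not optimized, each individual tree fits the data less tightly, which is where the extra decorrelation comes from.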

This further reduces the variance but increases bias. If you're considering using ExtraTrees, you might treat the choice of ensemble type as a hyperparameter you can tune: build an ExtraTrees model, a Random Forest model, and a Bagged Decision Trees model, then compare their performance!

That's exactly what we'll do below.

## Guided Practice: Random Forest and ExtraTrees in Scikit Learn

Scikit Learn implements both random forest and extra trees methods as part of the `ensemble` module.

Have a look at the [documentation](http://scikit-learn.org/stable/modules/ensemble.html#forest).

**Check:** What parameters did you notice? Any questions on those?

Let's load the [car dataset](https://archive.ics.uci.edu/ml/machine-learning-databases/car/).

We would like to compare the performance of the following 4 algorithms:

- Decision Trees
- Bagging + Decision Trees
- Random Forest
- Extra Trees

Note that in order for our results to be consistent, we have to expose the models to exactly the same cross-validation scheme. Let's start by initializing that. Then, we'll initialize the models.

*You can also create a function to speed up your work...*

In [4]:
import pandas as pd
df = pd.read_csv('./assets/datasets/car.csv')
df.head()

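| Unnamed: 0 | buying | maint | doors | persons | lug_boot | safety | acceptability |
|---|---|---|---|---|---|---|---|
| 0 | vhigh | vhigh | 2 | 2 | small | low | unacc |
| 1 | vhigh | vhigh | 2 | 2 | small | med | unacc |
| 2 | vhigh | vhigh | 2 | 2 | small | high | unacc |
| 3 | vhigh | vhigh | 2 | 2 | med | low | unacc |
| 4 | vhigh | vhigh | 2 | 2 | med | med | unacc |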


In [5]:
from sklearn.preprocessing import LabelEncoder
y = LabelEncoder().fit_transform(df['acceptability'])
X = pd.get_dummies(df.drop('acceptability', axis=1))

In [6]:
y

array([2, 2, 2, ..., 2, 1, 3])

In [7]:
pd.Series(y).value_counts()

2    1210
0     384
1      69
3      65
dtype: int64

In [8]:
pd.Series(df['acceptability']).value_counts()

unacc    1210
acc       384
good       69
vgood      65
Name: acceptability, dtype: int64

In [9]:
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, BaggingClassifier

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=41)

In [10]:
dt = DecisionTreeClassifier()
s = cross_val_score(dt, X, y, cv=cv, n_jobs=-1)
print("{} Score:\t{:0.3} ± {:0.3}".format("Decision Tree", s.mean().round(3), s.std().round(3)))

Decision Tree Score:	0.969 ± 0.005


In [11]:
dt = BaggingClassifier()
s = cross_val_score(dt, X, y, cv=cv, n_jobs=-1)
print("{} Score:\t{:0.3} ± {:0.3}".format("Bagging", s.mean().round(3), s.std().round(3)))

Bagging Score:	0.97 ± 0.003


In [12]:
dt = RandomForestClassifier()
s = cross_val_score(dt, X, y, cv=cv, n_jobs=-1)
print("{} Score:\t{:0.3} ± {:0.3}".format("Random Forest", s.mean().round(3), s.std().round(3)))

Random Forest Score:	0.939 ± 0.003


In [13]:
dt = ExtraTreesClassifier()
s = cross_val_score(dt, X, y, cv=cv, n_jobs=-1)
print("{} Score:\t{:0.3} ± {:0.3}".format("Extra Trees", s.mean().round(3), s.std().round(3)))

Extra Trees Score:	0.952 ± 0.009
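
The four cells above repeat the same fit-and-score pattern, so it can be folded into a helper function. The sketch below is one possible version: `score_model` is a hypothetical name, and the synthetic data (`X_demo`, `y_demo`) is only there so the example runs on its own. Swap in the car data's `X`, `y`, and `cv` to use it in this lesson.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (BaggingClassifier, ExtraTreesClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

def score_model(model, name, X, y, cv):
    """Cross-validate `model` and print mean ± std accuracy (hypothetical helper)."""
    s = cross_val_score(model, X, y, cv=cv)
    print("{} Score:\t{:0.3f} ± {:0.3f}".format(name, s.mean(), s.std()))
    return s

# Synthetic stand-in data so the sketch is self-contained.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
cv_demo = StratifiedKFold(n_splits=3, shuffle=True, random_state=41)

models = {
    "Decision Tree": DecisionTreeClassifier(),
    "Bagging": BaggingClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Extra Trees": ExtraTreesClassifier(),
}
scores = {name: score_model(m, name, X_demo, y_demo, cv_demo)
          for name, m in models.items()}
```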


Let's explore the effect of balancing our classes on the score.

In [14]:
dt = DecisionTreeClassifier(class_weight = 'balanced')
s = cross_val_score(dt, X, y, cv=cv, n_jobs=-1)
print("{} Score:\t{:0.3} ± {:0.3}".format("Decision Tree with Balanced Classes", s.mean().round(3), s.std().round(3)))

Decision Tree with Balanced Classes Score:	0.966 ± 0.011


In [15]:
dt = BaggingClassifier(class_weight = 'balanced')
s = cross_val_score(dt, X, y, cv=cv, n_jobs=-1)
print("{} Score:\t{:0.3} ± {:0.3}".format("Bagging with Balanced Classes", s.mean().round(3), s.std().round(3)))

TypeError: __init__() got an unexpected keyword argument 'class_weight'

`BaggingClassifier` has no `class_weight` argument, so this call fails. :(
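
One possible workaround is to set the class weights on the base tree and hand that tree to `BaggingClassifier`. The sketch below uses imbalanced synthetic data as a stand-in (an assumption, so the example runs on its own). It passes the tree positionally because the keyword name for the base estimator has changed between scikit-learn versions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Imbalanced synthetic stand-in data (an assumption; use the car data's X and y instead).
X_demo, y_demo = make_classification(n_samples=400, n_features=10,
                                     weights=[0.8, 0.2], random_state=0)
cv_demo = StratifiedKFold(n_splits=3, shuffle=True, random_state=41)

# Class weights live on the base tree; each bagged tree inherits them.
balanced_tree = DecisionTreeClassifier(class_weight='balanced')
bag = BaggingClassifier(balanced_tree, n_estimators=10)

s = cross_val_score(bag, X_demo, y_demo, cv=cv_demo)
print("{} Score:\t{:0.3f} ± {:0.3f}".format("Bagging with Balanced Classes",
                                           s.mean(), s.std()))
```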

In [16]:
dt = RandomForestClassifier(class_weight = 'balanced')
s = cross_val_score(dt, X, y, cv=cv, n_jobs=-1)
print("{} Score:\t{:0.3} ± {:0.3}".format("Random Forest with Balanced Classes", s.mean().round(3), s.std().round(3)))

Random Forest with Balanced Classes Score:	0.953 ± 0.006


In [17]:
dt = ExtraTreesClassifier(class_weight = 'balanced')
s = cross_val_score(dt, X, y, cv=cv, n_jobs=-1)
print("{} Score:\t{:0.3} ± {:0.3}".format("Extra Trees with Balanced Classes", s.mean().round(3), s.std().round(3)))

Extra Trees with Balanced Classes Score:	0.952 ± 0.004


In [18]:
dt

ExtraTreesClassifier(bootstrap=False, class_weight='balanced',
           criterion='gini', max_depth=None, max_features='auto',
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
           verbose=0, warm_start=False)

## Conclusion

We can improve the performance of a single model by generating multiple models and aggregating their predictions. 
- If our $Y$ is continuous, we average the predictions.
- If our $Y$ is discrete, we use a "plurality vote" to decide the predicted value.

The three types of ensemble models we discussed today:
1. **Bagged Decision Trees** are where we take the original data of size $n$, bootstrap $k$ samples each of size $n$ (with replacement!), build a decision tree on each sample, then make predictions by passing a test observation through all $k$ trees and developing one aggregate prediction for that observation.
2. **Random Forests** are where we bag decision trees, but when it comes to building each individual decision tree, we only consider a random subset of features at each split.
3. **ExtraTrees** are where we build random forests, but when it comes to building each individual decision tree, we also pick a random split point for each candidate feature at each node instead of searching for the best one.

Some of these methods will perform better in some cases, some better in other cases. For example, decision trees are more nimble and easier to communicate, but have a tendency to overfit. On the other hand, ensemble methods perform better in more complex scenarios, but may become very complicated and harder to explain. (This gets back to our discussion about prediction versus inference - only you and your stakeholders will recognize what balance to strike between these two!)

### ADDITIONAL RESOURCES

- [Random Forest on Wikipedia](https://en.wikipedia.org/wiki/Random_forest)
- [Quora question on Random Forest](https://www.quora.com/How-does-randomization-in-a-random-forest-work?redirected_qid=212859)
- [Scikit Learn Ensemble Methods](http://scikit-learn.org/stable/modules/ensemble.html)
- [Scikit Learn Random Forest Classifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)
- [Ensemble Methods Paper](http://web.engr.oregonstate.edu/~tgd/publications/mcs-ensembles.pdf)