# 06 - Random Forest
---

### **Introduction**
The random forest algorithm is an **ensemble algorithm**, meaning it works by combining and aggregating the predictions from many weaker models. Random forests combine many decision trees (hence the name) and usually perform significantly better than a single decision tree. The issue with individual decision trees is that they are very prone to overfitting. By combining many simple trees, we tend to get a model which generalises much better to unseen data.

### **Bootstrap Aggregation (Bagging)**
In machine learning, **bootstrapping** refers to resampling (i.e. sampling with replacement) from a dataset. **Aggregation** is the process of combining predictions from multiple models to form an ensemble model. **Bagging** combines both these techniques. That is, we train many models, each using a bootstrapped sample from the training data, and then aggregate the predictions from all the models to form a final prediction. As discussed in the introduction, random forests are an example of bagging, but the same strategy can be applied to other base models too. Note that gradient boosting is *not* an example of bagging: it is a boosting method, in which trees are fit sequentially, each one correcting the errors of the trees before it.
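
To make the two halves of bagging concrete, here is a minimal sketch using NumPy. The toy dataset, the seed and the choice of the sample mean as the "model" are all assumptions made for illustration; a real bagged ensemble would fit a proper model to each bootstrapped sample.

```python
import numpy as np

rng = np.random.default_rng(0)      # fixed seed so the sketch is reproducible
data = np.arange(10, dtype=float)   # toy dataset: the values 0.0 ... 9.0
n = len(data)

# Bootstrapping: draw n row indices from n rows, with replacement.
bootstrap_idx = rng.integers(0, n, size=n)
bootstrap_sample = data[bootstrap_idx]

# Bagging: fit one "model" per bootstrapped sample, then aggregate.
# Here each model is simply the sample mean (an assumption made for brevity).
n_models = 200
model_estimates = [
    data[rng.integers(0, n, size=n)].mean() for _ in range(n_models)
]
bagged_estimate = float(np.mean(model_estimates))

print("one bootstrapped sample:", bootstrap_sample)
print("bagged estimate of the mean:", bagged_estimate)
```

Because the sample is drawn with replacement, some values typically appear more than once in `bootstrap_sample` while others are missing entirely.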

### **Implementation / Algorithm**
The random forest algorithm works as follows:
1. Create a bootstrapped sample of data from the training data by sampling it with replacement
2. Train a decision tree on the bootstrapped sample where, at each node, you only consider a subset of features rather than all features (this is the 'random' part)
3. Repeat steps 1-2 using a different bootstrapped sample each time until a desired number of trees have been fit
4. Compute the final prediction by combining the predictions from all the trees:
    - For a continuous target, take the mean of all the predictions
    - For a discrete target, we generally take the majority vote (i.e. the mode), although we could instead sample a class in proportion to the votes

By only considering a random subset of features at each node, we prevent the same features from dominating every tree. This helps to decorrelate the predictions from each tree and hence reduces the variance in the bias-variance trade-off.
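
The steps above can be sketched in a few lines by reusing scikit-learn's `DecisionTreeClassifier` as the base learner. This is an illustrative sketch rather than a reference implementation: the helper names `fit_forest` and `predict_forest`, the synthetic dataset and the settings (25 trees, `max_features="sqrt"`) are assumptions, and the voting code assumes class labels are non-negative integers.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier


def fit_forest(X, y, n_trees=25, max_features="sqrt", seed=0):
    """Steps 1-3: fit n_trees trees, each on its own bootstrapped sample."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    trees = []
    for _ in range(n_trees):
        # Step 1: bootstrapped sample (n row indices drawn with replacement).
        idx = rng.integers(0, n, size=n)
        # Step 2: max_features limits the features considered at each split.
        tree = DecisionTreeClassifier(
            max_features=max_features,
            random_state=int(rng.integers(0, 2**31 - 1)),
        )
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees


def predict_forest(trees, X):
    """Step 4 for a discrete target: majority vote across the trees."""
    votes = np.stack([tree.predict(X) for tree in trees]).astype(int)
    # votes has shape (n_trees, n_samples); take the mode of each column.
    return np.array([np.bincount(column).argmax() for column in votes.T])


# Synthetic data, purely for demonstration.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)
trees = fit_forest(X[:200], y[:200])
preds = predict_forest(trees, X[200:])
accuracy = float((preds == y[200:]).mean())
print("held-out accuracy:", accuracy)
```

For a continuous target, the same structure would apply with a regression tree as the base learner and the vote replaced by the mean of the per-tree predictions.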

### **Out-of-bag (OOB) Error**
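For bootstrapped sampling we are choosing $n$ values from a possible $n$ with replacement. The probability that a given observation is *not* selected in any of the $n$ draws is $(1 - \frac{1}{n})^n$, which for large $n$ approaches $e^{-1} \approx 0.37$. In other words, each bootstrapped sample leaves out roughly 37% of the observations.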
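
A quick numerical check of this limit, using only the standard library. The helper names `prob_not_selected` and `simulated_oob_fraction` are made up for this sketch, and the simulation is a single random draw, so its result will vary with the seed.

```python
import math
import random


def prob_not_selected(n):
    """Exact probability that one given row is missed by n draws with replacement."""
    return (1 - 1 / n) ** n


def simulated_oob_fraction(n, seed=0):
    """Fraction of rows left out of a single bootstrapped sample of size n."""
    rng = random.Random(seed)
    drawn = {rng.randrange(n) for _ in range(n)}  # distinct rows that were drawn
    return 1 - len(drawn) / n


for n in (10, 100, 1000, 100_000):
    print(n, prob_not_selected(n))

print("limit e^-1:", math.exp(-1))
print("simulated, n=10000:", simulated_oob_fraction(10_000))
```

The exact probability increases towards $e^{-1}$ as $n$ grows, and the simulated fraction should land close to the same value.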

So for a given observation $i$, some trees will have been trained on it and others will not. By aggregating the predictions from only those trees which were not trained on observation $i$, we can predict its target and compare the prediction to the true observed value. Aggregating these errors over all observations gives the **Out-of-Bag (OOB) Error**. So random forests (rather elegantly) have a form of cross-validation built in. You don't need to split into training, validation and test datasets because you get validation for free; you only need training and test.
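
As an illustration, scikit-learn's `RandomForestClassifier` exposes this estimate through its `oob_score` option, which stores the OOB accuracy in the fitted `oob_score_` attribute. The sketch below compares the resulting OOB error with the error on a held-out test set; the synthetic dataset and the hyperparameters are assumptions chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data, purely for demonstration.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# oob_score=True asks the forest to score each training row using only
# the trees whose bootstrapped sample did not contain that row.
forest = RandomForestClassifier(
    n_estimators=200, bootstrap=True, oob_score=True, random_state=0
)
forest.fit(X_train, y_train)

oob_error = 1 - forest.oob_score_              # oob_score_ is an accuracy
test_error = 1 - forest.score(X_test, y_test)  # error on data no tree saw

print("OOB error: ", oob_error)
print("test error:", test_error)
```

The two numbers will not match exactly, but the OOB error is computed without touching the test set, which is what lets it stand in for a validation set.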

### **Limitations of Tree-based Algorithms**
Note that for a continuous target, the predicted value is the mean of all the values in a given leaf. Since all these values come from the observed training data, it is impossible to predict a value outside of the observed range of values for the target. Hence tree-based algorithms such as decision trees, random forests, gradient boosting etc. cannot be used for extrapolation.
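
A small experiment makes this limitation visible. The sketch below fits scikit-learn's `RandomForestRegressor` to the line $y = 2x$ on $x \in [0, 10]$ and then asks for predictions well outside that range; the data and settings are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Training data: y = 2x for x in [0, 10], so the target ranges over [0, 20].
X_train = np.linspace(0, 10, 200).reshape(-1, 1)
y_train = 2 * X_train.ravel()

forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# One point inside the training range and two far outside it.
X_new = np.array([[5.0], [20.0], [100.0]])
preds = forest.predict(X_new)

for x, pred in zip(X_new.ravel(), preds):
    print(f"x = {x:6.1f}  true y = {2 * x:6.1f}  predicted y = {pred:.2f}")
```

Inside the training range the prediction tracks the line closely, but for $x = 20$ and $x = 100$ every tree routes the point to its right-most leaf, so the prediction never exceeds the largest target seen in training.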
