# 1. Random Forest Algorithm
We now turn to the **random forest** algorithm, which builds on concepts we have covered earlier. Recall that if we have $B$ **independent** and **identically distributed** random variables, each with variance $\sigma^2$, then their sample mean has variance:

#### $$var(\bar{\theta}_B) = \frac{1}{B}\sigma^2$$

However, recall that this is not the case when the $B$ random variables are only **identically distributed**, and **not independent**. In this case each pair of random variables may be correlated, with a pairwise correlation we will call $\rho$, and the variance of the **sample mean** is given by this equation:

#### $$var(\bar{\theta}_B) = \frac{1- \rho}{B}\sigma^2  + \rho \sigma^2$$
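To make the formula above concrete, here is a minimal sketch that compares it with a small simulation. The helper names `ensemble_variance` and `simulate_mean_variance` are made up for illustration, and the correlated draws are built from a shared Gaussian component, which is just one convenient way (assumed here) to obtain a pairwise correlation $\rho \geq 0$.

```python
import math
import random
import statistics


def ensemble_variance(rho, sigma2, B):
    """Variance of the mean of B identically distributed variables
    with pairwise correlation rho and individual variance sigma2."""
    return (1 - rho) / B * sigma2 + rho * sigma2


def simulate_mean_variance(rho, sigma2, B, n_trials=20000, seed=0):
    """Estimate var(mean) empirically from correlated normal draws.

    Each variable is sigma * (sqrt(rho) * shared + sqrt(1 - rho) * own_noise),
    which has variance sigma2 and pairwise correlation rho (assumes rho >= 0).
    """
    rng = random.Random(seed)
    sigma = math.sqrt(sigma2)
    means = []
    for _ in range(n_trials):
        shared = rng.gauss(0, 1)  # component common to all B variables
        draws = [
            sigma * (math.sqrt(rho) * shared + math.sqrt(1 - rho) * rng.gauss(0, 1))
            for _ in range(B)
        ]
        means.append(sum(draws) / B)
    return statistics.pvariance(means)


theory = ensemble_variance(rho=0.3, sigma2=1.0, B=10)
empirical = simulate_mean_variance(rho=0.3, sigma2=1.0, B=10)
print(f"theory={theory:.3f}, empirical={empirical:.3f}")
```

Setting `rho=0` recovers $\sigma^2 / B$, and as `B` grows the result approaches $\rho\sigma^2$, so the correlation sets a floor on how much averaging can help.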

The main goal of random forest is to **reduce this correlation**, because the $\rho \sigma^2$ term does not shrink no matter how large $B$ becomes. In other words, it tries to build a set of trees that are decorrelated from each other.

Recall that the idea behind *bagging* was to average the results from high variance/low bias models. Trees are well suited to this because they can be grown arbitrarily deep and capture complex interactions. Grown deep enough, they can often achieve 100% accuracy on the training set, and hence have close to 0 bias. We want this because averaging does not reduce bias; it only reduces variance. The price is that each individual tree has a high variance. But by the previous equation for the variance of an ensemble, we can achieve a much lower combined variance by averaging many trees that are only weakly correlated with each other (small $\rho$).

A good question to ask at this point is: "is there anything more deliberate we can do to make sure each tree is decorrelated from the others, rather than just assuming that trees grown to maximum depth on different bootstrap samples will be very different?" We will see how to do that soon!

---
<br>
## 1.1 Random Forest - Bias
We know that we can achieve low bias easily with trees because the more nodes we add, the more closely a tree fits (and eventually overfits) the training data. So, let's suppose that each tree has zero bias. Since each tree has the same expected value, the expected value of an average of trees is the same as that of a single tree, and thus the bias remains the same too. This can be seen in the equation below, where $\bar{f}$ denotes the expected value shared by every estimate $\hat{f}$ of $f$:

#### $$\bar{f}(x) = E\Big[\hat{f}(x)\Big]$$

(This can be seen in the previous section **1.5.1 Mean Derivation**).

And we can see that $bias^2$ is simply the squared difference between the ground truth function $f$ (which doesn't change) and the expected value of the estimate, $\bar{f}$.

#### $$bias^2 = \Big[f(x) - \bar{f}(x)\Big]^2$$
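As a short sketch of why averaging leaves the bias untouched, assume the ensemble prediction is the plain average of $B$ trees, written here as $\hat{f}_b$ for $b = 1, \dots, B$ (notation introduced only for this illustration). By linearity of expectation:

#### $$E\Big[\frac{1}{B}\sum_{b=1}^{B}\hat{f}_b(x)\Big] = \frac{1}{B}\sum_{b=1}^{B}E\Big[\hat{f}_b(x)\Big] = \bar{f}(x)$$

The ensemble therefore has the same $\bar{f}(x)$, and hence the same $bias^2$, as any single tree.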

We will see later, with **boosting**, a different approach that combines trees with high bias.

---
<br>
## 1.2 Random Forest - Decorrelation
So, how does Random Forest try to decorrelate its trees? In the same way that we can select which samples to train on, we can also **randomly select which features to train on**! So, if you think of the data matrix $X$, one way to get different trees is to sample different rows, which we have done already. Another way is to sample different columns, which is equivalent to training a tree on only a subset of features.
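The row and column sampling described above can be sketched with NumPy as follows. The function name `sample_rows_and_columns` and the toy matrix are hypothetical, and the sketch follows the per-tree description in the text; many implementations instead draw a fresh feature subset at every split.

```python
import numpy as np


def sample_rows_and_columns(X, d, rng):
    """Return a bootstrap sample of rows restricted to d random columns."""
    N, D = X.shape
    row_idx = rng.integers(0, N, size=N)            # rows: sampled with replacement
    col_idx = rng.choice(D, size=d, replace=False)  # features: sampled without replacement
    return X[np.ix_(row_idx, col_idx)], row_idx, col_idx


rng = np.random.default_rng(0)
X = np.arange(6 * 9).reshape(6, 9)  # toy (N x D) data matrix with N=6, D=9
X_sub, rows, cols = sample_rows_and_columns(X, d=3, rng=rng)
print(X_sub.shape)  # (6, 3)
```

Each tree would then be trained on its own `X_sub` (together with the targets at `rows`), and at prediction time it must be given the same columns `cols`.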

We usually choose a dimensionality $d \ll D$, assuming that $X$ is an **(N x D)** matrix. The inventors of random forest recommend the following settings for $d$:

#### $$\text{Classification: } d = \lfloor \sqrt{D} \rfloor$$
#### $$\text{Regression: } d = \lfloor D / 3 \rfloor$$

Note that these recommendations are paired with a minimum node size: for classification it can be set as low as 1, and for regression as low as 5. As always, these are only starting points; by using a method like cross validation you can see what works best for your specific dataset.
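As a small illustration, the two rules of thumb for $d$ can be written as a helper. The name `default_num_features` and the guard that keeps $d \geq 1$ for very small $D$ are assumptions added here, not part of the original recommendation.

```python
import math


def default_num_features(D, task):
    """Rule-of-thumb number of features d for a dataset with D features."""
    if task == "classification":
        d = math.isqrt(D)  # floor(sqrt(D))
    elif task == "regression":
        d = D // 3         # floor(D / 3)
    else:
        raise ValueError(f"unknown task: {task!r}")
    return max(1, d)       # assumption: always keep at least one feature


print(default_num_features(100, "classification"))  # 10
print(default_num_features(100, "regression"))      # 33
```

In practice $d$ is treated as a hyperparameter, so values around these defaults would typically be compared with cross validation.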

---
<br>
## 1.3 Random Forest - Algorithm