# 1. Bootstrap Estimation
We previously looked at the bias-variance tradeoff, and if you were thinking critically, you may have wondered: "Is it possible to lower bias and variance simultaneously?"

In this section, we take our first look at **model averaging**. The key tool we need is **bootstrapping**, a form of **resampling**. The fascinating result is that even though we are using the same data, we can get a better result. This should seem odd at first: if we build a model from a set of samples, how can averaging many models trained on different resamples of those same samples be any different? It is the same set of samples, after all.

However, model averaging does work, even though the averaged models see the same data that a single model would. Before we talk about bootstrapping for models, we are going to look at bootstrapping for simple parameter estimates like the mean.

## 1.1 Bootstrap Estimation - Mean
So, how does bootstrap estimation work? We are given a set of $N$ data points:
#### $$X = x_1,x_2,...,x_N$$
We then draw a sample of size $N$, with replacement, from this data set, and we repeat this $B$ times. For each of the $B$ resampled datasets, we calculate the parameter of interest (the mean, the variance, or any other statistic). Once the loop is done we will have $B$ different estimates of the parameter, and we can compute their mean and their variance.

Why do we care about the mean and variance? The mean gives us a point estimate: the expected value of the parameter estimate. The variance tells us how precise that estimate is. A **large variance** means not that accurate, and a **small variance** means more accurate. In pseudocode, the algorithm could look like this:

```
X = x1, x2,...xN
for b = 1..B:
    Xb = sample_with_replacement(X)         # size of Xb is N
    sample_mean[b] = sum(Xb)/N
Calculate mean and variance of {sample_mean[1],...,sample_mean[B]}
```
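The pseudocode above can be turned into a short runnable sketch with NumPy. The function name `bootstrap_mean`, the seeds, and the synthetic normally distributed `data` are illustrative assumptions, not part of the original algorithm description.

```python
import numpy as np

def bootstrap_mean(X, B=1000, seed=0):
    """Return B bootstrap estimates of the mean of X (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    N = len(X)
    sample_means = np.empty(B)
    for b in range(B):
        # Resample N points from X with replacement
        Xb = rng.choice(X, size=N, replace=True)
        sample_means[b] = Xb.sum() / N
    return sample_means

# Synthetic data, assumed only for illustration
data = np.random.default_rng(42).normal(loc=5.0, scale=2.0, size=200)

estimates = bootstrap_mean(data, B=2000)
print("mean of bootstrap estimates:    ", estimates.mean())
print("variance of bootstrap estimates:", estimates.var(ddof=1))
```

The mean of the bootstrap estimates should land very close to `data.mean()`, and their variance should be roughly the sample variance of `data` divided by $N$.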

<br>
## 1.2 Sampling with Replacement
In case you have not come across sampling with replacement, let's quickly touch on it now. Suppose we have a dataset with the points 1,2,3,4,5.

#### $$X = 1,2,3,4,5$$

Suppose we then draw a sample and get 5. Sampling with replacement means that if we draw again, we can get 5 again. In fact, we could draw a sample with all 5s!

#### $$sample = 5,5,5,5,5$$

This is because we put each point back into the dataset after drawing it. This is the opposite of sampling without replacement, where a point that has been drawn cannot be drawn again. If we were to sample without replacement and drew as many points as the dataset contains, we would just get the dataset itself (possibly in a different order), and every resample would give the same estimate. Hence, sampling with replacement is essential to this process.
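To see the difference concretely, here is a minimal sketch using NumPy's `Generator.choice`. The seed and variable names are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([1, 2, 3, 4, 5])

# With replacement: values may repeat, and some may be missing
with_repl = rng.choice(X, size=len(X), replace=True)

# Without replacement: a full-size draw is just a shuffled copy of X
without_repl = rng.choice(X, size=len(X), replace=False)

print("with replacement:   ", with_repl)
print("without replacement:", without_repl)
print("same elements as X: ", sorted(without_repl.tolist()) == X.tolist())
```

Every statistic that ignores ordering (mean, variance, and so on) is identical for `without_repl` and `X`, which is why a full-size draw without replacement gives no spread to measure.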

<br>
## 1.3 Why does bootstrapping work?
As you can see, bootstrapping is a very simple algorithm: you are just computing the parameter estimate multiple times from the same dataset. So, why does it work? Let's look at the results first and then we can derive them. Remember, we are interested in the mean and variance. The expected value of the bootstrap estimate is equal to the parameter itself (assuming each individual estimate $\hat{\theta}$ is unbiased):

#### $$E(\bar{\theta}_B) = \theta$$

The variance is a bit more complicated. Let's suppose the correlation coefficient between two different estimates, $\hat{\theta}_i$ and $\hat{\theta}_j$ with $i \neq j$, is $\rho$, and the variance of each $\hat{\theta}$ is $\sigma^2$:

#### $$\rho = corr(\hat{\theta}_i, \hat{\theta}_j), \quad var(\hat{\theta}) = \sigma^2$$

Then, the variance of the bootstrap estimate is:

#### $$var(\bar{\theta}_B) = \frac{1 - \rho}{B}\sigma^2 + \rho \sigma^2$$

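Notice that if the bootstrap estimates were completely uncorrelated with each other ($\rho = 0$), the variance would be the original variance divided by $B$. This means that every additional bootstrap sample would reduce the variance of our estimate. That is remarkable! Unfortunately, there will usually be some correlation, since every resample comes from the same dataset. When $\rho > 0$, the first term still shrinks as $B$ grows, but the second term, $\rho \sigma^2$, does not, so the variance can never drop below $\rho \sigma^2$.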
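The variance formula can be checked numerically. The sketch below does not run a bootstrap. Instead it assumes a simple stand-in model in which every estimate is built from one shared Gaussian component plus a private one, so that each estimate has variance $\sigma^2$ and every pair has correlation $\rho$. The function names and the values of `rho`, `sigma2`, and `trials` are hypothetical.

```python
import numpy as np

def averaged_variance(rho, sigma2, B):
    """Variance of the average of B estimates with pairwise correlation rho."""
    return (1.0 - rho) / B * sigma2 + rho * sigma2

def simulate_averaged_variance(rho, sigma2, B, trials=50_000, seed=0):
    """Monte Carlo estimate of the same quantity under an assumed
    equicorrelated Gaussian model (illustration only)."""
    rng = np.random.default_rng(seed)
    shared = rng.standard_normal((trials, 1))    # component common to all B estimates
    private = rng.standard_normal((trials, B))   # component unique to each estimate
    theta_hat = np.sqrt(sigma2) * (np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * private)
    return theta_hat.mean(axis=1).var()

rho, sigma2 = 0.3, 4.0
for B in (1, 10, 100):
    print(B, averaged_variance(rho, sigma2, B), simulate_averaged_variance(rho, sigma2, B))
```

With these values the formula gives a variance that falls from $\sigma^2 = 4$ at $B = 1$ toward the floor $\rho \sigma^2 = 1.2$, and the simulated values should track it closely.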

## 1.4 Confidence Interval
One application of bootstrap estimation is that we can also estimate a confidence interval for our estimate. We assume a Gaussian approximation, and let's say we want a 95% confidence interval. That means we want the lower and upper bounds on $\theta$ that cover 95% of the area under the probability distribution. This is approximately equal to the sample mean of the bootstrap estimates, plus or minus 1.96 times the standard deviation of the bootstrap estimates:

#### $$95\% CI \approx \bar{\theta}_B \;\pm\; 1.96 std(\hat{\theta}_B)$$
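A minimal sketch of this Gaussian-approximation interval for the mean follows. The helper name `bootstrap_ci_mean`, the seeds, and the exponentially distributed `data` are assumptions made for illustration.

```python
import numpy as np

def bootstrap_ci_mean(X, B=2000, z=1.96, seed=0):
    """Approximate 95% CI for the mean via the bootstrap and a Gaussian
    approximation (hypothetical helper; z=1.96 corresponds to 95%)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Each row of idx indexes one resample of size N, drawn with replacement
    idx = rng.integers(0, len(X), size=(B, len(X)))
    theta_hat = X[idx].mean(axis=1)      # B bootstrap estimates of the mean
    center = theta_hat.mean()            # sample mean of the bootstrap estimates
    spread = theta_hat.std(ddof=1)       # their standard deviation
    return center - z * spread, center + z * spread

# Synthetic data, assumed only for illustration
data = np.random.default_rng(1).exponential(scale=3.0, size=150)

lower, upper = bootstrap_ci_mean(data)
print(f"approx. 95% CI for the mean: [{lower:.3f}, {upper:.3f}]")
```

Drawing all index arrays at once with `rng.integers` is just a vectorized form of the resampling loop. A plain `for` loop over $B$ resamples would give an equivalent result.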

## 1.5 Derivation of Mean and Variance 
Now that we know the main results of bootstrap estimation, how do we show that they are true? Let's start with the mean, that is, the expected value of the bootstrap estimate, $E(\bar{\theta}_B)$. We will define the following:
> * $\bar{\theta}_B$ = sample mean of resampled sample means
> * $\hat{\theta}_i$ = sample mean of bootstrap sample $i$
> * $\theta$ = original parameter we're trying to estimate

#### $$E(\bar{\theta}_B) = E \Big[ \frac{1}{B} \sum_{i=1}^B\hat{\theta}_i\Big] = \frac{1}{B} \sum_{i=1}^B E(\hat{\theta}_i) = \frac{1}{B} \cdot B \cdot E(\hat{\theta}) = \theta$$

We can see that the expected value of the bootstrap estimate of the parameter is equal to the parameter, which is what we want. Note that the last step relies on each $\hat{\theta}_i$ having expected value $\theta$.

Next, let's look at the variance. We can start with some definitions. Let's suppose that the expected value of $\hat{\theta}$ is equal to $\mu$. This is not necessarily equal to the mean of the original data $X$; it is the expected value of whatever statistic we are estimating.

#### $$E(\hat{\theta}) = \mu$$

Let's also define the variance of $\hat{\theta}$ to be $\sigma^2$:

#### $$var(\hat{\theta}) = E \Big[(\hat{\theta} - \mu)^2\Big] = \sigma^2$$