# 1. Bias-Variance Trade-Off
As we get started with ensemble methods, we will begin by looking at the **bias-variance trade-off**. There are three key terms in particular that we are going to look at:
> 1. **Bias**
2. **Variance**
3. **Irreducible Error**

We can start by looking into irreducible error...

---
<br></br>
# 1.1 Irreducible Error
When we say the term irreducible error, what exactly do we mean? Well, this term comes from the fact that data generating processes are **noisy**. By definition, noise is *random* (aka not deterministic). This in turn means that we cannot predict the exact values that we are going to get, but only its statistics, like the **mean** and **variance**. This can be best illustrated with an equation. Say we know that the *true function* generating our response data, $Y$, is $f(X)$. Well, the full equation for $Y$ would look like:

#### $$Y = f(X) + \epsilon$$

Where $\epsilon$ is the **irreducible error** term, which can be thought of as **noise**. Visually this can be seen below:

<img src="images/one-input-variable.png">

<img src="images/two-input-variables.png">

The black line in the 2d plot, and the blue surface in the 3d plot represent our **true target function**, $f(x)$. This is the function that is actually used to generate all of the red points. However, we can see that the red points don't perfectly map to $f(x)$. This is due to the irreducible error, or the inherent noise in the system that we cannot get rid of. 

From a machine learning perspective, let's say that we are in charge of the data-generating process, and our exact function is:

#### $$f(x) = 2x + 1$$

We are performing linear regression in this case. Now if a machine learning researcher was working with the data and we gave him this exact function, his work would already be done! Because he has the **true function** there is nothing more he can do to make things more accurate. However, again we cannot forget about the irreducible error, which makes our equation look like:

#### $$y = 2x + 1 + \epsilon$$

We know that the noise is random, and we generally assume that it is **Gaussian distributed** with 0 mean: 
#### $$\epsilon \sim N(0, \sigma^2) $$

In other words, we are starting with a fixed pattern that is based on some underlying function, and then we add noise to it (this noise is the irreducible error): 

<img src="images/i-error-1.png">

<img src="images/i-error-2.png">

So, even when *we* are responsible for the data generating process, we cannot predict what the noise will be! So even though we know the *exact* function that the data came from, our own predictions won't be perfect, because there is noise. This is irreducible error. 

#### $$\hat{f}(x) = 2x + 1 $$

Will not achieve 0 error on:

#### $$y = 2x + 1 + \epsilon$$
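
A quick simulation can make this concrete. The sketch below is illustrative only: the noise level `SIGMA`, the input range, and the helper name `mse_of_true_function` are assumptions that do not come from the text. It predicts with the true function $2x + 1$ itself and measures the mean squared error against noisy observations.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

SIGMA = 0.5  # assumed noise standard deviation (not given in the text)


def f(x):
    """True function from the text: f(x) = 2x + 1."""
    return 2 * x + 1


def mse_of_true_function(n_samples, sigma):
    """MSE of predicting with f itself on noisy observations y = f(x) + eps."""
    total = 0.0
    for _ in range(n_samples):
        x = random.uniform(0, 10)          # assumed input range
        y = f(x) + random.gauss(0, sigma)  # noisy observation
        total += (y - f(x)) ** 2           # error of the "perfect" predictor
    return total / n_samples


mse = mse_of_true_function(100_000, SIGMA)
print(f"MSE of the true function: {mse:.4f}")
print(f"Noise variance sigma^2:   {SIGMA ** 2:.4f}")
```

Even with the exact function in hand, the measured error should land close to $\sigma^2$ rather than 0, which is the irreducible error.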

---
<br></br>
# 1.2 Bias
Now let's talk about the **bias**. This is a slightly weird term since we also use the term bias when talking about the intercept in linear transformations. That is not the bias that we are talking about here. The bias we are discussing refers to the systematic error in your model. In other words, it is **how far off your prediction is, on average, from the target**. 

#### $$bias = E\Big[f(x) - \hat{f}(x)\Big]$$

We can think of it as the error that is introduced by approximating a real-life problem, which may be extremely complicated, by a much simpler model. Below, we can see an example of this in regression:

<img src="images/large-bias.png">

The fitted function clearly does not represent the pattern in the data, so bias tells us how far off we are from the target. In this case we tried to approximate a convex function (complex) with a linear function (simple), and a bias error was introduced. Below we can see a picture with small bias; however, you may get the feeling that we are overfitting (since it fits the data perfectly, but it does not follow the main trend): 

<img src="images/small-bias.png">

And we can also have the same issue arise in classification:

<img src="images/large-bias-classification.png">

Again, it is a very simple model (not complex) and it does not fit the pattern well. We can also see a small bias below for classification. Again it is a complex decision boundary, which allows us to fit the data better, but we may suspect that it is overfitting:

<img src="images/small-bias-classification.png">
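
The regression case can be sketched numerically. The example below is a minimal illustration under stated assumptions: the convex function $x^2$, the noise level, the input range, and the helper names `fit_line` and `average_prediction` are all hypothetical choices, not taken from the text. It fits a straight line to many independent training sets and compares the average prediction at a point with the true value there.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible


def f(x):
    """Assumed convex true function, chosen only for illustration."""
    return x ** 2


def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx


def average_prediction(x0, n_sets=2000, n_points=30, sigma=0.3):
    """Average the fitted line's prediction at x0 over many training sets."""
    total = 0.0
    for _ in range(n_sets):
        xs = [random.uniform(-1, 1) for _ in range(n_points)]
        ys = [f(x) + random.gauss(0, sigma) for x in xs]
        a, b = fit_line(xs, ys)
        total += a * x0 + b
    return total / n_sets


x0 = 0.0
f_bar = average_prediction(x0)
bias = f(x0) - f_bar  # ground truth minus average estimate
print(f"f(x0) = {f(x0):.3f}, average prediction = {f_bar:.3f}, bias = {bias:.3f}")
```

No matter how many training sets are averaged, the line stays systematically off at `x0`, because a straight line cannot follow the curve. That persistent gap is the bias.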

---
<br></br>
# 1.3 Variance
Now that we have looked at bias, we can now look at variance. We already know about variance in terms of statistics:
> How much a random variable deviates from its mean in squared units.

However, in the context of the **bias-variance** trade off, we mean something much more specific. Variance in this context tells us:
> The statistical variance of our predictor over all possible training sets that are drawn from this particular data generating process. 

**Variance** refers to the amount by which $\hat{f}$ would change if we estimated it using a different training data set. Since the training data are used to fit the statistical learning method, different training data sets will result in a different $\hat{f}$. Ideally, the estimate for $f$ should not vary too much between training sets. However, if a method has high variance then small changes in the training data can result in large changes in $\hat{f}$. In general, more flexible statistical methods have higher variance.

For example, suppose you have a model that overfits, in the sense that it can get perfect accuracy on any data set that it trains on (remember that these data sets are all drawn from the same process that we are trying to model). If each model is perfect for its own training set, then each model will probably be very different from the others! In other words, these different models trained on different data sets from the same process will vary, and we are **measuring that variance**. 

Let's try and word that differently just one more time. Variance has nothing to do with accuracy! Variance just measures how **inconsistent** a predictor is over different training sets. Remember that our actual goal is not to achieve the lowest possible error rate. Our actual goal is to find the true $f(x)$. Being close to the training points is just a proxy solution. 
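
This inconsistency can be measured directly in a simulation. The sketch below is a minimal illustration under assumptions of my own choosing: a $\sin(x)$ true function, a noise level of 0.5, and the hypothetical helpers `knn_predict` and `prediction_variance`. It trains a 1-D K-Nearest Neighbor regressor on many training sets drawn from the same process and reports how much its prediction at one fixed point varies.

```python
import math
import random
import statistics

random.seed(2)  # fixed seed so the sketch is reproducible


def f(x):
    """Assumed true function, chosen only for illustration."""
    return math.sin(x)


def knn_predict(xs, ys, x0, k):
    """Average the targets of the k training points nearest to x0."""
    nearest = sorted(zip(xs, ys), key=lambda p: abs(p[0] - x0))[:k]
    return sum(y for _, y in nearest) / k


def prediction_variance(k, x0=1.0, n_sets=500, n_points=50, sigma=0.5):
    """Variance of the k-NN prediction at x0 across independent training sets."""
    preds = []
    for _ in range(n_sets):
        xs = [random.uniform(0, 2 * math.pi) for _ in range(n_points)]
        ys = [f(x) + random.gauss(0, sigma) for x in xs]
        preds.append(knn_predict(xs, ys, x0, k))
    return statistics.pvariance(preds)


var_k1 = prediction_variance(k=1)    # very flexible: memorizes each training set
var_k25 = prediction_variance(k=25)  # smoother: averages over many neighbors
print(f"variance with k=1:  {var_k1:.4f}")
print(f"variance with k=25: {var_k25:.4f}")
```

The `k=1` model fits every training set perfectly, yet its predictions change the most from one training set to the next, so it should show the larger variance.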

---
<br></br>
# 1.4 Model Complexity 
Variance is often used as a proxy for model complexity. Complexity is a malleable term and can mean different things for different classifiers. For example, a decision tree that is very deep can be considered very complex, whereas a shallow decision tree is not complex. For K-Nearest Neighbor, K=1 would be complex, K=50 would not. 

For linear models you can really see how this concept becomes malleable. For linear models you may initially think that each model has the same complexity because they are all linear. One line is no more complex than any other line. However, in terms of variance there is a difference. Recall that we often use regularization to prevent overfitting. Regularization encourages the weights of a model to be small or 0, which decreases the variance, and hence decreases the complexity as well. 

Another thing about linear models is that you may assume that they are not complex because they are linear, while nonlinear models are more "expressive". For example, you may look at something like decision trees and conclude that because decision trees can find nonlinear decision boundaries, they are more complex. But linear doesn't necessarily mean not complex. A high-dimensional linear model can be more complex than a nonlinear model with just a few inputs. So it's not a universal measure; it means different things depending on the context. 
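
The effect of regularization on variance can be checked with a small experiment. The sketch below is a simplified illustration, not a general ridge implementation: it assumes a one-input linear model with no intercept, a true slope of 2, unit noise, and the hypothetical helpers `ridge_slope` and `slope_variance`. For that model the ridge estimate has the closed form $w = \sum x_i y_i / (\sum x_i^2 + \lambda)$.

```python
import random
import statistics

random.seed(3)  # fixed seed so the sketch is reproducible


def ridge_slope(xs, ys, lam):
    """Closed-form ridge estimate for the no-intercept model y = w * x."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def slope_variance(lam, true_w=2.0, n_sets=2000, n_points=10, sigma=1.0):
    """Variance of the fitted slope across independent training sets."""
    slopes = []
    for _ in range(n_sets):
        xs = [random.uniform(-1, 1) for _ in range(n_points)]
        ys = [true_w * x + random.gauss(0, sigma) for x in xs]
        slopes.append(ridge_slope(xs, ys, lam))
    return statistics.pvariance(slopes)


# Larger lambda means stronger regularization (weights pulled toward 0).
variances = {lam: slope_variance(lam) for lam in (0.0, 1.0, 10.0)}
for lam, var in variances.items():
    print(f"lambda = {lam:5.1f} -> slope variance = {var:.4f}")
```

All three models are "just a line", yet the fitted slope should become more stable across training sets as the penalty grows, which is the sense in which regularization lowers complexity.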


---
<br></br>
# 2. Bias-Variance Trade-Off
In machine learning, we are always striving to minimize our error. We just saw that the best we can do is get our error so small that it is just equal to the irreducible error; this happens when we know the true underlying function that generates the data. In this situation, since the only error is the irreducible error, everything else (meaning the reducible error) is 0. 

We will go over the derivation soon, but the overall error is a combination of:
> 1. Bias
2. Variance
3. Irreducible Error

As a data scientist, the goal is clear: we want to make the bias and variance as small as possible! There is a problem however, which is known as the **bias-variance trade-off**. This tells us that we are trying to find a balance between bias and variance. So while we would love low bias and variance, what we find is that when we try and lower the bias, the variance increases, and vice-versa.

This has been seen before in the context of overfitting. When we overfit our training data our bias goes down, but our variance goes up. When we underfit the training data our bias increases but our variance goes down. The visualization below clearly demonstrates this. 

<br></br>
<img src="images/bias-variance-tradeoff.png">

Because the reducible error is the **sum** of the (squared) bias and the variance, it has a minimum somewhere in the middle, which also coincides approximately with the best generalization error. Another way of visualizing this can be seen below: 
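
The shape of that curve can be reproduced with a small sweep over model complexity. The sketch below is illustrative only and rests on assumed choices: a $\sin(x)$ true function, a noise level of 0.5, a K-Nearest Neighbor regressor whose complexity is controlled by `k`, and the hypothetical helper `bias2_and_variance`. It estimates the squared bias and the variance at one point for several values of `k`.

```python
import math
import random
import statistics

random.seed(4)  # fixed seed so the sketch is reproducible


def f(x):
    """Assumed true function, chosen only for illustration."""
    return math.sin(x)


def knn_predict(xs, ys, x0, k):
    """Average the targets of the k training points nearest to x0."""
    nearest = sorted(zip(xs, ys), key=lambda p: abs(p[0] - x0))[:k]
    return sum(y for _, y in nearest) / k


def bias2_and_variance(k, x0=1.0, n_sets=400, n_points=50, sigma=0.5):
    """Estimate squared bias and variance of k-NN at x0 over many training sets."""
    preds = []
    for _ in range(n_sets):
        xs = [random.uniform(0, 2 * math.pi) for _ in range(n_points)]
        ys = [f(x) + random.gauss(0, sigma) for x in xs]
        preds.append(knn_predict(xs, ys, x0, k))
    f_bar = statistics.mean(preds)  # average estimate at x0
    return (f(x0) - f_bar) ** 2, statistics.pvariance(preds)


# Small k = complex model, large k = simple model (k = 50 uses every point).
results = {}
for k in (1, 5, 15, 30, 50):
    bias2, var = bias2_and_variance(k)
    results[k] = (bias2, var)
    print(f"k = {k:2d}: bias^2 = {bias2:.4f}, variance = {var:.4f}, sum = {bias2 + var:.4f}")

best_k = min(results, key=lambda k: sum(results[k]))
print(f"smallest bias^2 + variance at k = {best_k}")
```

The squared bias should grow as `k` increases while the variance shrinks, so the smallest sum is expected at an intermediate `k` rather than at either extreme.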

<br></br>
<img src="images/bullseye.png">

Now, an important question to ask is: we talk about bias-variance being a tradeoff, but is it really a tradeoff? Is it possible to achieve lower bias, and lower variance at the same time? The trade-off occurs in the context of looking at the same model while we alter the complexity of that model. But what if we could somehow combine models, such that the overall result achieves better accuracy on the training set, and better generalization? 

---
<br></br>
# 3. Bias-Variance Decomposition 
We are now going to go through the math that shows the expected error is:
#### $$E[error] = bias^2 + variance + irreducible \; error$$
Note that people usually use the mean-squared error function for both regression and classification scenarios when talking about the bias-variance decomposition. This is despite the fact that we don't use **MSE** when optimizing a classification model. You can try and derive this using another kind of error, but **MSE** is usually sufficient for getting the idea across. 

### 3.1 Basic Definitions 
So we will start with some basic definitions. $y$, which is the data that we observe is equal to the ground truth function $f(x)$ plus some noise that is centered at 0, $\epsilon$:
#### $$y = f(x) + \epsilon$$
We will call $\hat{f}$ our estimate of $f(x)$:
#### $$\hat{f}(x) = estimate \;of\;f(x)$$
And we will say that the expected error is the mean squared error between $y$ and $\hat{f}$:
#### $$err = E\Big[(y - \hat{f}(x))^2\Big]$$

---
### 3.2 Decompose $y$ into ground truth function and noise
#### $$err = E\Big[(f(x) + \epsilon - \hat{f}(x))^2\Big]$$

---
### 3.3 Introduce new variable, $\bar{f}$
This new variable is equal to the mean of $\hat{f}$:
#### $$\bar{f}(x) = E\Big[\hat{f}(x)\Big]$$
We can then add and subtract this to the inside term, so that it mathematically remains the same:
#### $$err = E\Big[(f(x) + \epsilon - \hat{f}(x) + \bar{f}(x) - \bar{f}(x))^2\Big]$$

---
### 3.4 Combine $f$ and $\bar{f}$, $\hat{f}$ and $\bar{f}$
We will now combine $f$ and $\bar{f}$, as well as $\hat{f}$ and $\bar{f}$, while leaving $\epsilon$ by itself:
#### $$err = E\Big[\Big( (f(x)  - \bar{f}(x)) -  (\hat{f}(x) - \bar{f}(x)) + \epsilon \Big)^2\Big]$$

---
### 3.4 Multiply out terms
We now multiply out all of the terms, but be sure to keep $f - \bar{f}$ together, as well as $\hat{f} - \bar{f}$ together:

#### $$err = E\Big[(f(x) - \bar{f}(x))^2\Big] + E\Big[(f(x) - \bar{f}(x))(\epsilon - (\hat{f}(x) - \bar{f}(x)))\Big] + E\Big[(\hat{f}(x) - \bar{f}(x))^2\Big] - E\Big[(\hat{f}(x) - \bar{f}(x))(f(x) - \bar{f}(x) + \epsilon) \Big] + E\Big[ \epsilon^2 \Big] +  E\Big[\epsilon (f(x) - \bar{f}(x) - (\hat{f}(x) - \bar{f}(x)))\Big] $$

---
### 3.5 Useful Identities 
Next, we can use some properties that follow from how we define these variables, to simplify this equation. First, the mean of $\epsilon$ is 0, since the average value of the noise is 0:
#### $$E\Big[\epsilon\Big] = 0$$
#### $$E\Big[\epsilon^2\Big] = \sigma_\epsilon^2 + (E\Big[\epsilon\Big])^2 = \sigma_\epsilon^2$$

We also know that the mean of $\hat{f}$, $\bar{f}$, is just the expected value of $\hat{f}$:
#### $$\bar{f}(x) = E\Big[\hat{f}(x)\Big]$$

And hence we also know that the expected value of $\hat{f}$ minus $\bar{f}$ is also 0:
#### $$E\Big[\hat{f}(x) - \bar{f}(x)\Big] = E\Big[\hat{f}(x)\Big] - E\Big[\hat{f}(x)\Big] = 0$$
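
To see exactly which terms these identities remove, it can help to check the cross terms one at a time. Writing $A = f(x) - \bar{f}(x)$ and $B = \hat{f}(x) - \bar{f}(x)$ as shorthand (these names are introduced here only for illustration), the multiplied-out expression groups into three squared terms and three cross terms. The sketch below assumes, as is standard but not stated above, that the noise $\epsilon$ is independent of $\hat{f}$, and it uses the fact that $A$ is not random:
#### $$err = E\Big[A^2\Big] + E\Big[B^2\Big] + E\Big[\epsilon^2\Big] - 2E\Big[AB\Big] + 2E\Big[A\epsilon\Big] - 2E\Big[B\epsilon\Big]$$
#### $$E\Big[AB\Big] = A \cdot E\Big[B\Big] = 0$$
#### $$E\Big[A\epsilon\Big] = A \cdot E\Big[\epsilon\Big] = 0$$
#### $$E\Big[B\epsilon\Big] = E\Big[B\Big] \cdot E\Big[\epsilon\Big] = 0$$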

---
### 3.6 Decompose again
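So, after plugging in the identities discussed above, every cross term drops out: each one contains a factor of either $\epsilon$ or $\hat{f}(x) - \bar{f}(x)$, both of which have an expected value of 0 (this also uses the standard assumption that the noise $\epsilon$ is independent of $\hat{f}$). We are left with 3 terms; one for the **bias**, one for the **variance**, and one for the **irreducible error**: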
#### $$err = E\Big[(f(x) - \bar{f}(x))^2 \Big] + E\Big[(\hat{f}(x) - \bar{f}(x))^2\Big] + E\Big[\epsilon^2\Big]$$

Because neither $f(x)$ nor $\bar{f}(x)$ is random (one is the ground truth, the other is already an expected value), the expectation of the first term is just the term itself, so it can be reduced:

#### $$err= \Big[f(x) - \bar{f}(x)\Big]^2 + E\Big[(\hat{f}(x) - \bar{f}(x))^2\Big] + E\Big[\epsilon^2\Big]$$

Which has brought us to our final bias-variance decomposition:
#### $$err = bias^2 + variance + \sigma_\epsilon^2$$

---
### 3.7 Key points
We have seen that the **bias** is the **ground truth** minus the **average estimator**, the **variance** is the **variance of the estimator**, and the **irreducible error** is the **variance of the data noise**. 

Also, just as a note, the expectation $E$ is used when referring to random variables: the expected value of a random variable is just its mean. 

---
### 3.8 High Level Summary
To summarize what we just did: 
> 1. We have shown that the expected error that we get for our model, meaning the mean squared error between the observed targets and our predictions, is the sum of the bias squared, variance, and irreducible error.
2. Remember, this is **not** the error between the true $f(x)$ and our model $\hat{f}(x)$, since we never actually see the true $f(x)$! We only see $y$ which is $f(x)$ plus some noise. 
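
The decomposition can also be checked numerically. The sketch below is a rough sanity check under assumed settings, not a proof: it uses a hypothetical true function $x^2$, a noise level `SIGMA` of 0.3, a straight-line model, and the helper names `fit_line` and `decomposition_at`, none of which come from the text. For each simulated training set it fits the line, predicts at a fixed point, and compares the prediction with a fresh noisy observation $y$ at that point.

```python
import random
import statistics

random.seed(5)  # fixed seed so the sketch is reproducible

SIGMA = 0.3  # assumed noise standard deviation


def f(x):
    """Assumed true function, chosen only for illustration."""
    return x ** 2


def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx


def decomposition_at(x0, n_sets=5000, n_points=20):
    """Return (empirical MSE, bias^2, variance) of the line fit at x0."""
    preds, sq_errors = [], []
    for _ in range(n_sets):
        xs = [random.uniform(-1, 1) for _ in range(n_points)]
        ys = [f(x) + random.gauss(0, SIGMA) for x in xs]
        a, b = fit_line(xs, ys)
        pred = a * x0 + b
        y0 = f(x0) + random.gauss(0, SIGMA)  # fresh observation, independent of training
        preds.append(pred)
        sq_errors.append((y0 - pred) ** 2)
    f_bar = statistics.mean(preds)
    bias2 = (f(x0) - f_bar) ** 2
    variance = statistics.pvariance(preds)
    return statistics.mean(sq_errors), bias2, variance


err, bias2, variance = decomposition_at(x0=0.5)
total = bias2 + variance + SIGMA ** 2
print(f"empirical E[(y - f_hat)^2]:        {err:.4f}")
print(f"bias^2 + variance + sigma_eps^2:   {total:.4f}")
```

The two printed numbers should agree up to simulation noise. Note that the error is measured against the noisy $y$, not against $f(x)$, which is why $\sigma_\epsilon^2$ appears in the sum.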