### Bias / Variance Tradeoff and Test Error
 
Consider the situation where we are trying to model a relationship $y = f(x) + \epsilon$, where $\epsilon$ is noise with mean 0 and variance $\sigma^2$. We learn a model $\hat{f}(x)$ by minimizing MSE, that is, by minimizing 
$$E[(y - \hat{f}(x))^2]$$ over all $x$ (not just the training set).

Mathematically, MSE can be decomposed as follows: 

$$E[(y - \hat{f}(x))^2] = {\rm Bias}(\hat{f}(x))^2 + {\rm Variance}[\hat{f}] + \sigma^2$$

where $\sigma^2$ is called *irreducible error* because no choice of model can remove it. So when we are comparing various models (candidate $\hat{f}(x)$), the only thing we can do is trade bias against variance, and we must accept that tradeoff.
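
The decomposition can be checked numerically. The sketch below is only an illustration under stated assumptions: the "true" function `f` (a sine curve), the noise level `sigma`, and the evaluation point `x0` are all hypothetical choices, and observations are assumed to be generated as $f(x)$ plus noise with variance $\sigma^2$. It repeatedly draws a training set, fits a straight line, and compares the simulated MSE at `x0` with bias² + variance + σ².

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Hypothetical "true" function, chosen only for illustration.
    return np.sin(x)

sigma = 0.3      # assumed noise standard deviation
x0 = 1.0         # the single point at which the model is evaluated
n_train = 30     # size of each simulated training set
n_sims = 5000    # number of simulated training sets

preds = np.empty(n_sims)
sq_errors = np.empty(n_sims)
for i in range(n_sims):
    # Draw a fresh training set and fit a straight line to it.
    x = rng.uniform(0, 3, n_train)
    y = f(x) + rng.normal(0, sigma, n_train)
    slope, intercept = np.polyfit(x, y, 1)
    preds[i] = slope * x0 + intercept
    # Draw a fresh noisy observation at x0 and record the squared error.
    y0 = f(x0) + rng.normal(0, sigma)
    sq_errors[i] = (y0 - preds[i]) ** 2

bias_sq = (preds.mean() - f(x0)) ** 2   # squared bias at x0
variance = preds.var()                  # variance of the prediction at x0
mse = sq_errors.mean()                  # simulated expected squared error

print(f"bias^2 + variance + sigma^2 = {bias_sq + variance + sigma**2:.4f}")
print(f"simulated MSE               = {mse:.4f}")
```

The two printed numbers should agree up to simulation noise; they will not match exactly because both sides are estimated from a finite number of simulations.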

#### Bias

Error caused by bias is calculated as the difference between the expected prediction of our model and the correct value we are trying to predict:

$${\rm Bias}(\hat{f}(x)) = E[\hat{f}(x)] - E[f(x)]$$

This quantity is the extent to which the mean of our predictions deviates from the mean of the observed values.   Remember this is measured over the "whole population" and not just the training set.  

Bias is the error caused by our choice of model -- for example we used a linear model, but $f$ is not exactly linear.

#### Variance

Error caused by variance is taken as the variability of a model prediction for a given point:

$${\rm Variance}[\hat{f}] = E[(\hat{f}(x) - E[\hat{f}(x)])^2]$$

Notice that variance is defined in terms of the model $\hat{f}$ alone and does not involve the "real" function $f$.  Variance measures the extent to which $\hat{f}$ moves around its mean as the training set changes.   Intuitively, using a more flexible model reduces bias but introduces more variance, whereas using more training examples generally reduces variance.


As model complexity **increases**:
- Bias **decreases**. (The model can more accurately model complex structure in data.  If the model predicts every point perfectly, bias is 0.)
- Variance **increases**. (The model identifies more complex structures, making it more sensitive to small changes in the training data.)
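
These two trends can be checked with a small simulation. The following is a minimal sketch, not a definitive experiment: the true function (a sine curve), the noise level, and the fixed training inputs are all assumptions made for the demo. It fits polynomials of degree 1, 3, and 8 to many noisy copies of the same inputs and reports the average squared bias and average variance over a grid of points.

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):
    # Hypothetical true function, an assumption made only for this demo.
    return np.sin(x)

sigma = 0.3                       # assumed noise standard deviation
x_train = np.linspace(0, 3, 30)   # fixed inputs; only the noise is redrawn
x_grid = np.linspace(0, 3, 50)    # points where bias and variance are measured
n_sims = 500                      # number of simulated training sets

results = {}
for degree in (1, 3, 8):
    preds = np.empty((n_sims, x_grid.size))
    for i in range(n_sims):
        y = f(x_train) + rng.normal(0, sigma, x_train.size)
        coefs = np.polyfit(x_train, y, degree)
        preds[i] = np.polyval(coefs, x_grid)
    # Squared bias: how far the average prediction is from the truth.
    bias_sq = np.mean((preds.mean(axis=0) - f(x_grid)) ** 2)
    # Variance: how much predictions move around their own average.
    variance = np.mean(preds.var(axis=0))
    results[degree] = (bias_sq, variance)
    print(f"degree {degree}: bias^2 = {bias_sq:.4f}, variance = {variance:.4f}")
```

Under these assumptions the straight line shows the largest squared bias, and the variance grows with the polynomial degree.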
 
**Examples**

- Predicting the mean value has high bias but very low variance (the only variability comes from the sample mean itself)
- A model that reproduces the training data exactly has close to 0 bias but high variance, since it assumes no regularity in the training data

<br/> <br/>
<img src="bias_v_variance.jpeg" style="height: auto; width: 70%">


####  Implications for Training and Test Error

- A more complex model can reduce the training error to 0, but then we expect the test error to rise, unless the test data looks *exactly like* the training data
  * And if it did, then it wouldn't be an interesting learning problem.
- On the other hand, a simple model may work relatively better on the test set because it is less "committed" to the specific values in the training set
- We are looking for the minimizing point in this curve, but generally we can't get many test samples
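
The shape of the training and test error curves can be sketched with synthetic data. Everything here is an assumption for illustration: the true function, the noise level, and the sample sizes are made up, and in a simulation we can draw as many test points as we like, which is usually not possible in practice. The sketch fits polynomials of increasing degree to a small training set and records both errors.

```python
import numpy as np

rng = np.random.default_rng(2)

def f(x):
    # Hypothetical true function used only for this illustration.
    return np.sin(x)

sigma = 0.3  # assumed noise standard deviation

# A small training set and a large, independent test set.
x_train = rng.uniform(0, 3, 15)
y_train = f(x_train) + rng.normal(0, sigma, x_train.size)
x_test = rng.uniform(0, 3, 1000)
y_test = f(x_test) + rng.normal(0, sigma, x_test.size)

degrees = list(range(1, 9))
train_mse, test_mse = [], []
for degree in degrees:
    coefs = np.polyfit(x_train, y_train, degree)
    train_mse.append(np.mean((np.polyval(coefs, x_train) - y_train) ** 2))
    test_mse.append(np.mean((np.polyval(coefs, x_test) - y_test) ** 2))
    print(f"degree {degree}: train MSE = {train_mse[-1]:.4f}, "
          f"test MSE = {test_mse[-1]:.4f}")

# The degree with the lowest test error is the "minimizing point" of the curve.
best_degree = degrees[int(np.argmin(test_mse))]
print("degree with lowest test MSE:", best_degree)
```

Training error can only go down as the degree grows, because each polynomial contains the lower-degree ones as special cases. Test error carries no such guarantee.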

<br/><br/>

<img src="trainingVsTestError.png" style="height: auto; width: 40%">

<span style="color:blue;font-size:120%">Exercise</span>

Where does each of these learning methods fall in terms of the bias/variance tradeoff?

1. Predict to the mean: learn the function $y = \bar{y}$, where $\bar{y}$ is the mean of $y$ in the training set
2. 1-Nearest neighbor
3. 20-Nearest neighbors
4. Linear regression with one X variable
5. Linear regression with all X variables
6. Linear regression (biggest exponent 1)
7. Polynomial regression (with exponent terms up to 8)

---


<a id="exploring-the-bias-variance-tradeoff"></a>
### Exploring the Bias-Variance Trade-Off

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Allow plots to appear in the notebook.
%matplotlib inline

<a id="brain-and-body-weight-mammal-dataset"></a>
### Brain and Body Weight Mammal Data Set

This is a [data set](http://people.sc.fsu.edu/~jburkardt/datasets/regression/x01.txt) of the average weight of the body (in kg) and the brain (in g) for 62 mammal species. We'll use this data set to investigate bias vs. variance. Let's read it into pandas and take a quick look:

In [None]:
mammals = pd.read_table('mammals.txt', sep='\t', names=['brain','body'], header=0)
mammals.head()

In [None]:
mammals.describe()

In [None]:
# Interesting way to get a scatter plot ... get a linear model plus scatter plot,
#  and suppress the linear model :-)

sns.lmplot(x='body', y='brain', data=mammals, ci=None, fit_reg=False);

There appears to be a relationship between brain and body weight for mammals.

Now suppose we discover a new species with a body weight of 100 kg.  Use a linear model to predict its brain weight:

In [None]:
sns.lmplot(x='body', y='brain', data=mammals, ci=None);
plt.xlim(-10, 200);
plt.ylim(-10, 250);

In [None]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(mammals['body'].values.reshape(-1,1), mammals['brain'])
print(lr.coef_)
# predict() expects a 2-D array of shape (n_samples, n_features).
print(lr.predict([[100]]))

But how good is the prediction?  We don't really know because there is no point in our data set with the x value 100 -- and we have no measurement of how well the model does for data points outside the training set.
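
One common way to get such a measurement is to hold out part of the data before fitting. The sketch below is an illustration only: it uses a synthetic stand-in for the mammals data (the numbers are made up, not the real measurements), and the 70/30 split is an arbitrary assumption. It fits the linear model on the training part and reports the error on both parts.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)

# Synthetic stand-in for the mammals data (NOT the real measurements).
body = rng.uniform(1, 200, 60)
brain = 20 + 1.2 * body + rng.normal(0, 20, body.size)

X = body.reshape(-1, 1)   # scikit-learn expects a 2-D feature array
X_train, X_test, y_train, y_test = train_test_split(
    X, brain, test_size=0.3, random_state=0
)

model = LinearRegression().fit(X_train, y_train)

# Error on the data the model saw versus data it did not see.
train_mse = mean_squared_error(y_train, model.predict(X_train))
test_mse = mean_squared_error(y_test, model.predict(X_test))
print(f"train MSE = {train_mse:.1f}, test MSE = {test_mse:.1f}")

# Prediction for a new (hypothetical) species with body weight 100.
new_prediction = model.predict([[100.0]])
print("prediction at body = 100:", new_prediction[0])
```

The test MSE is the number that says something about new species, because those rows played no part in fitting the line.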

## Reasoning About Training and Test Components

Let's simulate a sampling scenario by assigning each of the observations to **either universe 1 or universe 2**:

In [None]:
# Set a random seed for reproducibility. A seeded random number generator always produces
#  the same sequence of "random numbers", which really helps with testing, but be sure to
#  remove the fixed seed if you're running a simulation that needs fresh randomness.

np.random.seed(12345)

# Randomly assign every observation to either universe 1 or universe 2.
mammals['universe'] = np.random.randint(1, 3, len(mammals))
mammals.head()

We can now tell Seaborn to create two plots in which the left plot only uses the data from **universe 1** and the right plot only uses the data from **universe 2**:

In [None]:
# col='universe' subsets the data by universe and creates two separate plots.
sns.lmplot(x='body', y='brain', data=mammals, ci=None, col='universe');
plt.xlim(-10, 200);
plt.ylim(-10, 250);

The lines look pretty similar between the two populations, despite the fact that they used completely separate samples of data. In both cases, for a body weight of 100 kg we would predict a brain weight of about 45 g.

It's easier to see the degree of similarity by placing them on the same plot:

In [None]:
# hue='universe' subsets the data by universe and creates a single plot.
sns.lmplot(x='body', y='brain', data=mammals, ci=None, hue='universe');
plt.xlim(-10, 200);
plt.ylim(-10, 250);

So, what was the point of this exercise? This was a visual demonstration of a high-bias, low-variance model.

- It's **high bias** because it doesn't fit the data particularly well.
- It's **low variance** because the model doesn't change much depending on which observations happen to be available in that universe.

<a id="lets-try-something-completely-different"></a>
### Let's Try Something Completely Different

What would a **low bias, high variance** model look like? Let's try polynomial regression with an eighth-order polynomial.

In [None]:
sns.lmplot(x='body', y='brain', data=mammals, ci=None, col='universe', order=8);
plt.xlim(-10, 200);
plt.ylim(-10, 400);

- It's **low bias** because the models match the data effectively in both of the training sets.
- It's **high variance** because the models are widely different, depending on which observations happen to be available in that universe. (For a body weight of 100 kg, the brain weight prediction would be 40 g in one universe and 0 g in the other!)
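
The visual comparison between universes can also be put into numbers. This is a rough sketch under stated assumptions: it uses synthetic stand-in data (a made-up curved relationship plus noise, not the real mammals measurements), and the helper `prediction_gap` is a hypothetical name introduced here. It repeats the two-universe split many times and measures how far apart the two predictions at a body weight of 100 are, for a straight line and for an eighth-order polynomial.

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic stand-in for the mammals data (NOT the real measurements):
# a curved relationship plus noise, in made-up units.
n = 60
body = rng.uniform(0, 200, n)
brain = 10 * np.sqrt(body) + rng.normal(0, 15, n)

def prediction_gap(degree, x0=100.0):
    """Split the data into two random universes, fit a polynomial of the
    given degree in each, and return the absolute gap between the two
    predictions at x0."""
    universe = rng.integers(1, 3, n)   # each value is 1 or 2
    preds = []
    for u in (1, 2):
        mask = universe == u
        coefs = np.polyfit(body[mask], brain[mask], degree)
        preds.append(np.polyval(coefs, x0))
    return abs(preds[0] - preds[1])

# Average the gap over many random splits for each model.
n_splits = 200
avg_gap = {
    degree: np.mean([prediction_gap(degree) for _ in range(n_splits)])
    for degree in (1, 8)
}
for degree, gap in avg_gap.items():
    print(f"degree {degree}: average gap between universes = {gap:.2f}")
```

A larger average gap means the fitted model depends more on which observations happened to land in each universe, which is exactly what high variance means.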


In [None]:
sns.lmplot(x='body', y='brain', data=mammals, ci=None, hue='universe', order=8);
plt.xlim(-10, 200);
plt.ylim(-10, 250);

<span style="color:blue;font-size:120%">Exercise:</span>

* Now relate this result back to training versus test MSE.
* Of the line and the polynomial:
    * Which has the smaller training error?
    * Which has the smaller test error?

<a id="balancing-bias-and-variance"></a>
## Balancing Bias and Variance
Can we find a middle ground?

Perhaps we can create a model that has **less bias than the linear model** and **less variance than the eighth order polynomial**?

Let's try a second order polynomial instead:

In [None]:
sns.lmplot(x='body', y='brain', data=mammals, ci=None, col='universe', order=2);
plt.xlim(-10, 200);
plt.ylim(-10, 250);

This seems better. In both the left and right plots, **it fits the data well, but not too well**.

This is the essence of the **bias-variance trade-off**: You are seeking a model that appropriately balances bias and variance and thus will generalize to new data (known as "out-of-sample" data).
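
One common way to search for that balance, not used elsewhere in this notebook, is k-fold cross-validation: hold out each fold in turn, fit on the rest, and average the held-out error. The sketch below is a minimal hand-rolled version under stated assumptions (synthetic data, 5 folds, polynomial degrees 1 through 8); it is not a replacement for a library implementation.

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic stand-in data (an assumption for this demo).
x = rng.uniform(0, 3, 60)
y = np.sin(x) + rng.normal(0, 0.3, x.size)

# Shuffle the row indices and cut them into k roughly equal folds.
k = 5
folds = np.array_split(rng.permutation(x.size), k)

cv_mse = {}
for degree in range(1, 9):
    fold_errors = []
    for held_out in folds:
        # Train on every row that is not in the held-out fold.
        train = np.setdiff1d(np.arange(x.size), held_out)
        coefs = np.polyfit(x[train], y[train], degree)
        resid = np.polyval(coefs, x[held_out]) - y[held_out]
        fold_errors.append(np.mean(resid ** 2))
    cv_mse[degree] = np.mean(fold_errors)
    print(f"degree {degree}: cross-validated MSE = {cv_mse[degree]:.4f}")

# Pick the degree with the lowest average held-out error.
best_degree = min(cv_mse, key=cv_mse.get)
print("selected degree:", best_degree)
```

Because every error is measured on rows the model did not see, the selected degree reflects out-of-sample performance rather than how closely the training data can be matched.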

We want a model that best balances bias and variance. It should match our training data reasonably well (moderate bias) without changing too much when the training sample changes (moderate variance).

- Training error as a function of complexity.
- Question: Why do we even care about variance if we know we can generate a more accurate model with higher complexity?

### Can we obtain a zero-bias, zero-variance model?

Only if 
1. You choose the right functional form for the data
1. You have adequate training data
1. The data has no irreducible error
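
As a small illustration of these three conditions holding at once, the sketch below uses a made-up, noise-free linear function (an assumption for the demo) and fits a straight line to two independent samples. With the right functional form, enough data, and no noise, both fits recover the same coefficients.

```python
import numpy as np

rng = np.random.default_rng(6)

def f(x):
    # Hypothetical true function: exactly linear, with no noise added later.
    return 2.0 * x + 1.0

fits = []
for _ in range(2):
    x = rng.uniform(0, 10, 20)   # an independent training sample
    y = f(x)                     # no noise, so irreducible error is zero
    fits.append(np.polyfit(x, y, 1))

for i, (slope, intercept) in enumerate(fits, start=1):
    print(f"sample {i}: slope = {slope:.6f}, intercept = {intercept:.6f}")
```

Both samples give a slope of 2 and an intercept of 1 up to floating-point rounding, so the model has neither bias nor variance. Real data almost never satisfies all three conditions.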

Remember MSE can be broken down into
1. Bias term
1. Variance term
1. Irreducible error
