### Training Versus Testing 

**Error Measures**
- User-specified pointwise error $e(h(x), f(x))$
- In-sample $E_{in}(h) = \frac{1}{N}\sum_{n=1}^Ne(h(x_n), f(x_n))$
- Out-of-sample $E_{out}(h)=E_x[e(h(x),f(x))]$
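The two definitions above can be sketched numerically. This is a minimal illustration, not part of the original notes: the target `f`, the hypothesis `h`, the squared-error measure, and the uniform input distribution on $[-1, 1]$ are all assumptions chosen for the example. $E_{in}$ is an average over the $N$ training inputs, while $E_{out}$ is an expectation over $P(x)$, approximated here by Monte Carlo sampling.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Hypothetical target function, chosen only for illustration.
    return np.sin(np.pi * x)

def h(x):
    # Hypothetical fixed hypothesis: a line through the origin.
    return 0.8 * x

def e(h_x, f_x):
    # User-specified pointwise error; squared error is assumed here.
    return (h_x - f_x) ** 2

# In-sample error: average pointwise error over the N training inputs.
N = 20
x_train = rng.uniform(-1.0, 1.0, size=N)
E_in = np.mean(e(h(x_train), f(x_train)))

# Out-of-sample error: expectation over P(x), approximated by averaging
# over a large fresh sample drawn from the same (assumed uniform) P(x).
x_fresh = rng.uniform(-1.0, 1.0, size=200_000)
E_out_estimate = np.mean(e(h(x_fresh), f(x_fresh)))

print(f"E_in  = {E_in:.4f}")
print(f"E_out ~ {E_out_estimate:.4f}")
```

With only 20 points, `E_in` fluctuates from sample to sample, whereas the Monte Carlo estimate of $E_{out}$ is stable because it averages over many fresh inputs.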

**Noisy Targets.**
The relationship that we are trying to learn may not be a deterministic function of x.
![title](img/noisy_target.png)

- $(x_1, y_1), \dots, (x_N, y_N)$ are generated by the joint distribution
$$P(x,y) = P(x)P(y|x)$$
- $E_{out}(h)$ is now $E_{x,y}[e(h(x),y)]$

### Bias Variance Tradeoff
This is also known as the approximation-generalization tradeoff.

Small $E_{out}$: good approximation of $f$ out of sample.

More complex $H$ -> better chance of approximating $f$.

Less complex $H$ -> better chance of generalizing out of sample.

#### Bias Variance analysis
Decomposing $E_{out}$ into
1. How well $H$ can approximate $f$
2. How well we can zoom in on a good $h \in H$

Out-of-sample error for a specific dataset $D$:

$E_{out}(g^D) = E_x[(g^D(x)-f(x))^2]$

Now we take the expectation of the out-of-sample error over all datasets $D$ of a given size $N$:

\begin{align}
E_D[E_{out}(g^D)] & = E_D[ E_x[(g^D(x)-f(x))^2]]\\
& = E_x[ E_D[(g^D(x)-f(x))^2]]
\end{align}

Now we focus on $E_D[(g^D(x)-f(x))^2]$. **This quantity tells us how far the hypothesis learned from a (particular) dataset is, on average, from the ultimate target.**



To evaluate this, we define the 'average' hypothesis $\bar g(x)$:

$\bar g(x) = E_D[g^D(x)]$

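\begin{align}
E_D[(g^D(x)-f(x))^2] & = E_D[(g^D(x)-\bar g(x) + \bar g(x)-f(x))^2] \\
& = E_D[(g^D(x)-\bar g(x))^2 + (\bar g(x)-f(x))^2 + 2(g^D(x)-\bar g(x))(\bar g(x)-f(x))] \\
& = E_D[(g^D(x)-\bar g(x))^2] + (\bar g(x)-f(x))^2
\end{align}

The cross term vanishes because $\bar g(x)-f(x)$ does not depend on $D$ and $E_D[g^D(x)-\bar g(x)] = 0$ by the definition of $\bar g$.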

**Variance.** $E_D[(g^D(x)-\bar g(x))^2]$ measures how far the hypothesis learned from a particular dataset departs from the average hypothesis $\bar g$, which is roughly the best hypothesis you can get from your hypothesis set.

**Bias.** $(\bar g(x)-f(x))^2$ measures how far that average (roughly best) hypothesis is from the target function.


Therefore, 
\begin{align}
E_D[E_{out}(g^D)] & = E_x[ E_D[(g^D(x)-f(x))^2]]\\
&= E_x[\text{bias}(x)+\text{var}(x)] \\
&= \text{bias}+\text{var}
\end{align}
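The decomposition can be checked with a small simulation. The setup below is an assumption made for illustration (it is not stated in these notes): the target is $f(x) = \sin(\pi x)$ on $[-1, 1]$, each dataset has only $N = 2$ noiseless points, and two hypothesis sets are compared, constants and lines. The helper names `fit_constant`, `fit_line`, and `bias_variance` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Assumed target function for this sketch.
    return np.sin(np.pi * x)

def fit_constant(x, y):
    # H0: h(x) = b. The least-squares constant is the mean of y.
    b = np.mean(y)
    return lambda t: np.full_like(t, b)

def fit_line(x, y):
    # H1: h(x) = a*x + b, fitted by least squares.
    a, b = np.polyfit(x, y, 1)
    return lambda t: a * t + b

def bias_variance(fit, n_points=2, n_datasets=2000):
    x_grid = np.linspace(-1.0, 1.0, 201)  # stands in for E_x
    preds = np.empty((n_datasets, x_grid.size))
    for i in range(n_datasets):
        x = rng.uniform(-1.0, 1.0, size=n_points)
        preds[i] = fit(x, f(x))(x_grid)   # g^D evaluated on the grid
    g_bar = preds.mean(axis=0)            # average hypothesis
    bias = np.mean((g_bar - f(x_grid)) ** 2)
    var = np.mean(preds.var(axis=0))
    expected_e_out = np.mean((preds - f(x_grid)) ** 2)
    return bias, var, expected_e_out

results = {
    "constant": bias_variance(fit_constant),
    "line": bias_variance(fit_line),
}
for name, (bias, var, total) in results.items():
    print(f"{name:8s} bias={bias:.3f} var={var:.3f} "
          f"bias+var={bias + var:.3f} E_D[E_out]={total:.3f}")
```

In this setup the line has the smaller bias but a much larger variance, so with two data points the constant model typically ends up with the lower expected out-of-sample error.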

#### Rule of Thumb!

Always match the model complexity to the data resources, not to the target complexity!

### Learning curve

![learning curve](img/learning_curve.png)

Observations:
1. Complex models approximate better than simple models (the expected error is lower once enough data is available).
2. Generalisation ability can be gauged by the discrepancy between $E_{in}$ and $E_{out}$. Simple models have a smaller generalisation error (they generalise better).

So which one is better? Match the model to the resources that you have. For example:
1. It depends on the number of examples $N$. If $N$ is small, you cannot afford a complex model.
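Learning curves like these can be approximated by simulation. The sketch below assumes a noisy target $y = \sin(\pi x) + \epsilon$, a "simple" model (degree-1 polynomial), and a "complex" model (degree-5 polynomial). These choices, the noise level `SIGMA`, and the helper names are illustrative assumptions, not taken from the figure.

```python
import numpy as np

rng = np.random.default_rng(1)
SIGMA = 0.3  # assumed noise standard deviation

def sample(n):
    # Draw n noisy examples from the assumed target.
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(np.pi * x) + SIGMA * rng.normal(size=n)
    return x, y

def learning_curve(degree, sizes, n_trials=200, n_test=2000):
    e_in, e_out = [], []
    for n in sizes:
        ein_runs, eout_runs = [], []
        for _ in range(n_trials):
            x, y = sample(n)
            coeffs = np.polyfit(x, y, degree)
            x_test, y_test = sample(n_test)  # fresh data estimates E_out
            ein_runs.append(np.mean((np.polyval(coeffs, x) - y) ** 2))
            eout_runs.append(
                np.mean((np.polyval(coeffs, x_test) - y_test) ** 2))
        e_in.append(np.mean(ein_runs))
        e_out.append(np.mean(eout_runs))
    return np.array(e_in), np.array(e_out)

sizes = [10, 20, 50, 100, 200]
curves = {degree: learning_curve(degree, sizes) for degree in (1, 5)}

for degree, (e_in, e_out) in curves.items():
    for n, ein, eout in zip(sizes, e_in, e_out):
        print(f"degree={degree} N={n:4d} E_in={ein:.3f} E_out={eout:.3f}")
```

For large $N$ the complex model settles at a lower error, while for small $N$ its gap between $E_{in}$ and $E_{out}$ is much wider than that of the simple model.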

![vc analysis vs bias variance tradeoff](img/vc&bv.png)

### Overfitting

Overfitting is different from bad generalisation.
Overfitting can happen within a single model, for example when it is trained for too long.

Overfitting: when $E_{in}$ goes down and $E_{out}$ goes up. -> fitting the noise.

Even when the target function has no noise, overfitting can still occur.
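A small experiment can make the definition concrete. The sketch below is a hedged illustration under assumed settings: a target $\sin(\pi x)$ with additive Gaussian noise, $N = 15$ points, and polynomial fits of degree 2 and degree 10. Here $E_{out}$ is measured against the noise-free target on a dense grid. None of these specifics come from the notes.

```python
import numpy as np

rng = np.random.default_rng(2)

def f(x):
    # Assumed target function for this sketch.
    return np.sin(np.pi * x)

def run_trials(degrees=(2, 10), n=15, sigma=0.5, n_trials=300):
    totals = {d: {"E_in": 0.0, "E_out": 0.0} for d in degrees}
    x_test = np.linspace(-1.0, 1.0, 1001)
    for _ in range(n_trials):
        x = rng.uniform(-1.0, 1.0, size=n)
        y = f(x) + sigma * rng.normal(size=n)  # noisy training labels
        for d in degrees:
            coeffs = np.polyfit(x, y, d)
            fit_train = np.polyval(coeffs, x)
            fit_test = np.polyval(coeffs, x_test)
            totals[d]["E_in"] += np.mean((fit_train - y) ** 2)
            totals[d]["E_out"] += np.mean((fit_test - f(x_test)) ** 2)
    # Average the accumulated errors over all trials.
    return {d: {k: v / n_trials for k, v in errs.items()}
            for d, errs in totals.items()}

results = run_trials()
for d, errs in results.items():
    print(f"degree={d:2d} E_in={errs['E_in']:.3f} E_out={errs['E_out']:.3f}")
```

Moving from degree 2 to degree 10 lowers $E_{in}$ but raises $E_{out}$ on average, which is the signature of overfitting described above.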

#### Impact of noise level and target complexity

![overfitting](img/overfitting.png)

Note: red indicates overfitting.

The first figure shows the impact of stochastic noise.
The error that arises when the target is too complex for the model to capture is called deterministic noise.

Observations:
1. As the number of data points goes up, overfitting goes down.
2. As stochastic noise or deterministic noise goes up, overfitting goes up.

**Definition of deterministic noise**: the part of $f$ that $H$ cannot capture.

Its main differences from stochastic noise are:

1. Deterministic noise depends on $H$.
2. It is fixed for a given $x$.

**However**, for finite $N$, the learning algorithm still tries to fit the noise (including the deterministic noise).

### Noise and bias-variance 

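With a noisy target $y = f(x) + \epsilon(x)$, where $\epsilon(x)$ has zero mean and variance $\sigma^2$ and is independent of the dataset $D$, the expectation is taken over both $D$ and $\epsilon$:

\begin{align}
E_{D,\epsilon}[(g^D(x)-y)^2] & = E_{D,\epsilon}[(g^D(x)-\bar g(x) + \bar g(x)-f(x) - \epsilon(x))^2] \\
& = E_{D,\epsilon}[(g^D(x)-\bar g(x))^2 + (\bar g(x)-f(x))^2 + \epsilon(x)^2 + \text{cross terms}] \\
& = E_D[(g^D(x)-\bar g(x))^2] + (\bar g(x)-f(x))^2 + \sigma^2
\end{align}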

Bias = deterministic noise; it is fixed for a given hypothesis set!

$\sigma^2$: stochastic noise

The model tries to capture both, and that is why there is a variance term: the model cannot tell signal from noise.
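The three-term decomposition above can be checked numerically. This sketch assumes a target $\sin(\pi x)$, Gaussian noise with standard deviation `SIGMA`, datasets of 5 points, and a constant model fitted by averaging the labels. All of these are illustrative choices rather than anything fixed by the notes.

```python
import numpy as np

rng = np.random.default_rng(3)
SIGMA = 0.5  # assumed standard deviation of the stochastic noise

def f(x):
    # Assumed target function for this sketch.
    return np.sin(np.pi * x)

n_datasets, n_points = 4000, 5
x_grid = np.linspace(-1.0, 1.0, 101)

# Train a constant model on each noisy dataset: g^D(x) = mean of the labels.
x = rng.uniform(-1.0, 1.0, size=(n_datasets, n_points))
y = f(x) + SIGMA * rng.normal(size=x.shape)
g = y.mean(axis=1, keepdims=True)
preds = np.broadcast_to(g, (n_datasets, x_grid.size))

g_bar = preds.mean(axis=0)                 # average hypothesis
var = np.mean(preds.var(axis=0))           # variance term
bias = np.mean((g_bar - f(x_grid)) ** 2)   # bias (deterministic noise)

# Expected squared error against fresh noisy labels y = f(x) + eps.
y_test = f(x_grid) + SIGMA * rng.normal(size=preds.shape)
expected_error = np.mean((preds - y_test) ** 2)

print(f"var + bias + sigma^2 = {var + bias + SIGMA**2:.3f}")
print(f"simulated error      = {expected_error:.3f}")
```

The variance term here is larger than it would be with noiseless labels, because each fitted constant absorbs part of the training noise.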


### How to deal with overfitting - regularization & validation

$E_{out}(h) = E_{in}(h) +$ overfit penalty

Regularization estimates the overfit penalty, while validation estimates the out-of-sample error directly.

On a validation set $(x_1, y_1), \dots, (x_K, y_K)$, the error is $E_{val}(h) = \frac{1}{K}\sum_{k=1}^K e(h(x_k),y_k)$

So how reliable is this estimate?

$E[E_{val}(h)] = E_{out}(h)$

$var[E_{val}(h)] = \frac{\sigma^2}{K}$, where $\sigma^2$ here denotes the variance of the pointwise error $e(h(x),y)$.

This implies that when $K$ is small, we get a bad estimate (the variance is too high).

If $K$ is very large, the estimate is reliable, but since less data is left for the training set, we end up with a reliable estimate of a bad model.
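The effect of $K$ on the reliability of the estimate can be simulated. The sketch below covers only the first half of the tradeoff (the variance of $E_{val}$). It assumes a fixed hypothesis with a binary error measure and a true out-of-sample error of 0.3, so each validation point is an error with probability `E_OUT`. These values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
E_OUT = 0.3                    # assumed true out-of-sample error
SIGMA2 = E_OUT * (1 - E_OUT)   # variance of one pointwise binary error

def simulate_e_val(K, n_repeats=20000):
    # Each repeat draws a fresh validation set of K points and
    # returns the fraction of points on which h makes an error.
    return rng.binomial(K, E_OUT, size=n_repeats) / K

stats = {}
for K in (10, 100, 1000):
    e_val = simulate_e_val(K)
    stats[K] = (e_val.mean(), e_val.var())
    print(f"K={K:5d} mean={stats[K][0]:.4f} "
          f"var={stats[K][1]:.6f} sigma^2/K={SIGMA2 / K:.6f}")
```

The mean stays close to $E_{out}$ for every $K$, while the spread shrinks in proportion to $1/K$.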

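#### Difference between test set and validation set

The test set is unbiased because it plays no part in any learning decision. The validation set has an optimistic bias because it is used to make choices, such as selecting a model.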
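The optimistic bias can be seen in a toy simulation. The sketch assumes $M = 10$ candidate hypotheses that all have the same true error of 0.5 under a binary error measure, each evaluated on the same number $K = 25$ of validation points. Errors are drawn independently per hypothesis for simplicity, and the numbers are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(5)
E_OUT = 0.5   # assumed true error of every candidate hypothesis
K = 25        # validation set size
M = 10        # number of candidate hypotheses
N_REPEATS = 20000

# Row r holds the validation errors of the M candidates in repeat r.
e_val = rng.binomial(K, E_OUT, size=(N_REPEATS, M)) / K

# Using the set once on a single fixed hypothesis (test-set style).
single = e_val[:, 0].mean()

# Using the set to pick the best-looking hypothesis (validation style).
selected = e_val.min(axis=1).mean()

print(f"fixed hypothesis:    mean error estimate = {single:.3f}")
print(f"selected hypothesis: mean error estimate = {selected:.3f}")
```

The estimate for a fixed hypothesis stays near 0.5, while the reported error of the selected hypothesis falls well below it even though every candidate is equally bad.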