## Assessing Performance

---
### Defining how we assess performance

- Measuring loss
 - Loss function: $L(y, f_{\hat{w}}(X))$
   - $y$: actual value
   - $f_{\hat{w}}(X)$: predicted value $\hat{y}$
 - examples:
   - Absolute error: $L(y, f_{\hat{w}}(X)) = |y-f_{\hat{w}}(X)|$
   - Squared error: $L(y, f_{\hat{w}}(X)) = (y-f_{\hat{w}}(X))^2$
 - "Remember that all models are wrong; the practical question is how wrong do they have to be to not be useful." George Box, 1987
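The two example losses can be written directly as functions. This is a minimal sketch; the function names and the sample numbers are illustrative assumptions, not part of the notes.

```python
def absolute_error(y: float, y_hat: float) -> float:
    """Absolute error loss |y - y_hat|."""
    return abs(y - y_hat)


def squared_error(y: float, y_hat: float) -> float:
    """Squared error loss (y - y_hat)^2."""
    return (y - y_hat) ** 2


# Hypothetical example: actual value 3.0, predicted value 5.0
print(absolute_error(3.0, 5.0))  # 2.0
print(squared_error(3.0, 5.0))   # 4.0
```

Squared error penalizes large misses more heavily than absolute error does, which is the main practical difference between the two.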

---
### 3 measures of loss and their trends with model complexity

#### Training error
- Compute training error
  1. Define a loss function $L(y, f_{\hat{w}}(X))$
    - E.g., squared error, absolute error, RMSE (root mean squared error), ...
  2. Training error
    - average loss on houses in training set
    - $\frac{1}{N} \sum_{i=1}^{N} L(y_i, f_{\hat{w}}(X_i))$
    
- Training error vs. model complexity
 - training error decreases as model complexity increases
- Is training error a good measure of predictive performance?
 - issue: Training error is overly optimistic because $\hat{w}$ was fit to training data
 - *having small training error does not imply having good predictive performance*
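The two steps for computing training error can be sketched as follows. The targets and predictions here are made-up numbers standing in for a fitted model's output on its own training set, and `average_loss` is a hypothetical helper name.

```python
import math


def squared_error(y, y_hat):
    """Step 1: the chosen loss function."""
    return (y - y_hat) ** 2


def average_loss(loss, ys, preds):
    """Step 2: average loss, (1/N) * sum of loss(y_i, prediction_i)."""
    return sum(loss(y, p) for y, p in zip(ys, preds)) / len(ys)


# Hypothetical training targets and the fitted model's predictions for them
y_train = [1.0, 2.0, 3.0, 4.0]
pred_train = [1.5, 2.0, 2.5, 5.0]

training_mse = average_loss(squared_error, y_train, pred_train)
training_rmse = math.sqrt(training_mse)  # RMSE is in the same units as y
print(training_mse, training_rmse)
```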

#### Generalization (true) error
- Formally: generalization error = $E_{X, y}[L(y, f_{\hat{w}}(X))]$
 - average over all possible $(X, y)$ pairs, weighted by how likely each is
- Generalization error vs. model complexity
 - generalization error decreases at first, then starts increasing once the model becomes too complex
 - Can't compute! (It's an ideal quantity: it requires averaging over every possible $(X, y)$ pair.)
 
#### Test error
- Approximating generalization error
- Hold out some $(X, y)$ that are *not* used for fitting the model
- test error
 - average loss on houses in test set
 - $\frac{1}{N_{test}} \sum_{i \in \text{test set}} L(y_i, f_{\hat{w}}(X_i))$
 - $\hat{w}$ was fit by minimizing RSS on the training data, not the test data!
- Training, true, & test error vs. model complexity
 - test error is a noisy approximation of generalization error
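A small sketch of the hold-out idea: fit on the training points only, then average the loss separately on each set. The data values are invented, and a simple least-squares line is assumed as the model.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = w0 + w1 * x (minimizes RSS on the given data)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    w1 = sxy / sxx
    w0 = y_mean - w1 * x_mean
    return w0, w1


def mean_squared_error(w, xs, ys):
    """Average squared error loss of the line w = (w0, w1) on (xs, ys)."""
    w0, w1 = w
    return sum((y - (w0 + w1 * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Made-up data: the test points are held out and never used for fitting
x_train, y_train = [0.0, 1.0, 2.0, 3.0], [0.1, 0.9, 2.2, 2.8]
x_test, y_test = [0.5, 1.5, 2.5], [0.7, 1.3, 2.6]

w_hat = fit_line(x_train, y_train)                      # uses training data only
training_error = mean_squared_error(w_hat, x_train, y_train)
test_error = mean_squared_error(w_hat, x_test, y_test)  # noisy estimate of generalization error
print(training_error, test_error)
```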

#### Overfitting
- A model with estimated params $\hat{w}$ overfits if there exists a model with estimated params $w'$ such that
 - training error($\hat{w}$) < training error($w'$)
 - test error($\hat{w}$) > test error($w'$)
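The two conditions translate into a simple check. The function name and the error values below are hypothetical, chosen only to illustrate the definition.

```python
def overfits(train_err_hat, test_err_hat, train_err_alt, test_err_alt):
    """True if w_hat has lower training error but higher test error than w'."""
    return train_err_hat < train_err_alt and test_err_hat > test_err_alt


# Hypothetical errors: w_hat fits the training data better but tests worse than w'
print(overfits(train_err_hat=0.1, test_err_hat=0.9,
               train_err_alt=0.3, test_err_alt=0.5))  # True
```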
 
<img src="./figures/w3-f1.png" width=400>

#### Training/test split
- Too few training points
 - $\hat{w}$ poorly estimated
- Too few test points
 - test error is a bad approximation of generalization error
- Rule of thumb
 - Typically, just enough test points to form a reasonable estimate of generalization error
 - If this leaves too few for training, use other methods like *cross validation*
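A random split can be sketched with the standard library alone. The helper name `train_test_split`, the 20% test fraction, and the toy data are assumptions for illustration; the notes do not prescribe a fraction.

```python
import random


def train_test_split(data, test_fraction=0.2, seed=0):
    """Shuffle a copy of data and hold out a fraction of it as the test set."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Toy (x, y) pairs
data = [(x, 2.0 * x) for x in range(10)]
train, test = train_test_split(data, test_fraction=0.2)
print(len(train), len(test))  # 8 2
```

Shuffling before splitting matters when the data is stored in some order (e.g., by date or price), since otherwise the test set would not resemble the training set.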

---
### 3 sources of error and the bias-variance tradeoff

1. Noise
2. Bias
3. Variance

#### Noise
- Data inherently noisy
- $y_i = f_{w(true)}(X_i) + \epsilon_i$
- **Irreducible error**: $\epsilon_i$
<img src="./figures/w3-f2.png" width=400>

#### Bias
- Assume we fit a constant function (= simple model, low complexity model)
 - Over all possible size $N$ training sets, what do I expect my fit to be?
 - Bias is the difference between this average fit and the true function
 - $Bias(X) = f_{w(true)}(X) - f_{\bar{w}}(X)$
- *Is our approach flexible enough to capture $f_{w(true)}(X)$? If not, error in predictions*
 - **low complexity -> high bias**

<img src="./figures/w3-f3.png" width=350 align=left>
<img src="./figures/w3-f4.png" width=400 align=right>

#### Variance
- How much do specific fits vary from the expected fit?
 - $Var(X) = E_{train}[(f_{\hat{w}(train)}(X) - f_{\bar{w}}(X))^2]$
- *Can specific fits vary widely? If so, erratic predictions*
 - **low complexity -> low variance**

<img src="./figures/w3-f5.png" width=350 align=left>
<img src="./figures/w3-f6.png" width=380 align=right>

- Assume we fit a high-order polynomial (= high complexity model)
 - The variation between these fits is really large

- **high complexity -> high variance**

<img src="./figures/w3-f7.png" width=350 align=left>
<img src="./figures/w3-f8.png" width=350 align=right>

- **high complexity -> low bias**

<img src="./figures/w3-f9.png" width=400>

#### Bias-variance tradeoff
- Bias decreases and variance increases as complexity grows
- $MSE = bias^2 + variance$
- the goal is finding a sweet spot
- Just like with generalization error, we **cannot** compute bias and variance
 - because we do not know the true function and cannot observe every possible training set
<img src="./figures/w3-f10.png" width=400>
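Bias and variance cannot be computed on real data, but they can be estimated in a simulation where the true function is known. The sketch below assumes a made-up true function $f(x) = 2x$ with Gaussian noise, draws many training sets of size $N$, and compares a constant fit (low complexity) with a line fit (higher complexity) at one target input. All names and settings are illustrative assumptions.

```python
import random


def true_f(x):
    # Assumed true function for the simulation (unknown in practice)
    return 2.0 * x


def sample_training_set(rng, n, sigma):
    xs = [rng.uniform(0.0, 1.0) for _ in range(n)]
    ys = [true_f(x) + rng.gauss(0.0, sigma) for x in xs]
    return xs, ys


def fit_constant(xs, ys):
    """Low complexity: predict the mean of y everywhere."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Higher complexity: least-squares line y = w0 + w1 * x."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    w1 = sxy / sxx
    w0 = y_mean - w1 * x_mean
    return lambda x: w0 + w1 * x


def bias_sq_and_variance(fit, x_t, n=20, sigma=0.5, n_sets=2000, seed=0):
    """Estimate bias^2 and variance of the fit at x_t over many training sets."""
    rng = random.Random(seed)
    preds = []
    for _ in range(n_sets):
        xs, ys = sample_training_set(rng, n, sigma)
        preds.append(fit(xs, ys)(x_t))
    avg_pred = sum(preds) / n_sets  # approximates the average fit at x_t
    bias_sq = (true_f(x_t) - avg_pred) ** 2
    variance = sum((p - avg_pred) ** 2 for p in preds) / n_sets
    return bias_sq, variance


const_bias_sq, const_var = bias_sq_and_variance(fit_constant, x_t=1.0)
line_bias_sq, line_var = bias_sq_and_variance(fit_line, x_t=1.0)
print("constant fit: bias^2", const_bias_sq, "variance", const_var)
print("line fit:     bias^2", line_bias_sq, "variance", line_var)
```

In this setup the constant fit should show the larger squared bias and the line fit the larger variance, matching the trend described above.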

#### Error vs. amount of data (for a fixed model complexity)
- true error
 - with few points, $\hat{w}$ is poorly estimated, so true error is high
 - error decreases as data is added until it reaches a limit set by bias and noise
- training error
 - with few data points, a fixed complexity model can fit these points reasonably well, so training error starts low and rises as data is added
- In the limit, the curve will flatten out to how well the model can fit the true relationship $f_{true}$
- In the limit, true error = training error
<img src="./figures/w3-f11.png" width=500>
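The two curves can be reproduced in a small simulation. This sketch assumes a made-up true function $f(x) = x^2$, a line as the fixed-complexity model, and a large fresh sample as a stand-in for true error (which cannot be computed exactly). Names and settings are hypothetical.

```python
import random


def true_f(x):
    # Assumed true relationship; a line cannot match it exactly, so some bias remains
    return x ** 2


def sample(rng, n, sigma):
    xs = [rng.uniform(0.0, 1.0) for _ in range(n)]
    ys = [true_f(x) + rng.gauss(0.0, sigma) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    """Least-squares fit of y = w0 + w1 * x; returns (w0, w1)."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    w1 = sxy / sxx
    return y_mean - w1 * x_mean, w1


def mse(w, xs, ys):
    w0, w1 = w
    return sum((y - (w0 + w1 * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def average_errors(n_train, sigma=0.5, n_trials=200, n_eval=500, seed=0):
    """Average training error and approximate true error over many training sets."""
    rng = random.Random(seed)
    train_total, true_total = 0.0, 0.0
    for _ in range(n_trials):
        xs, ys = sample(rng, n_train, sigma)
        w = fit_line(xs, ys)
        eval_xs, eval_ys = sample(rng, n_eval, sigma)  # fresh data approximates true error
        train_total += mse(w, xs, ys)
        true_total += mse(w, eval_xs, eval_ys)
    return train_total / n_trials, true_total / n_trials


for n in (5, 20, 200):
    train_err, true_err = average_errors(n)
    print(f"N={n}: training error {train_err:.3f}, approx. true error {true_err:.3f}")
```

With these settings the gap between the two errors should shrink as $N$ grows, with both approaching a floor set by the noise variance plus the part of $f$ a line cannot capture.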

---
### OPTIONAL ADVANCED MATERIAL: Formally defining and deriving the 3 sources of error

#### Accounting for training set randomness
- Training set was just a random sample of N houses sold
- What if *N other houses* had been sold and recorded?
- Ideally, want performance `averaged over all possible training sets` of size N
<img src="./figures/w3-f12.png" width=400>

#### Expected prediction error
- $E_{\text{training set}}[\text{generalization error of }\hat{w}(\text{training set})]$
 - $E_{\text{training set}}$: averaging over all training sets (weighted by how likely each is)
 - $\hat{w}$: parameters fit on a specific training set

- Prediction error at target input:
  1. Loss at target $X_{t}$ (e.g. 2640 sq.ft.)
  2. Squared error loss $L(y, f_{\hat{w}}(X)) = (y - f_{\hat{w}}(X))^2$
  
#### Sum of 3 sources of error
- **Average prediction error at $X_t$**
 - $\sigma^2 + [bias(f_{\hat{w}}(X_t))]^2 + var(f_{\hat{w}}(X_t))$

<img src="./figures/w3-f13.png" width=400>


#### Error variance of the model ($\sigma^2$)
- $y = f_{w(true)}(X) + \epsilon$
 - $\epsilon$: all other factors out there in the world are captured by noise term
 - $E[\epsilon] = 0$
 
- Error variance $\sigma^2$
 - the spread of noise you're likely to see at any point in the input space.

- `Irreducible error`
 - we have no control over it, no matter how complex a model we specify or which algorithm we use to fit it
 - even with $X$ as the input to our prediction, there is inherently some noise in how observations are generated in the world, and nothing we do can remove it


<img src="./figures/w3-f14.png" width=400>

#### Bias of function estimator
- Average estimated function $f_{\bar{w}}(X)$
 - averaged over all possible training data sets of size $N$ that I might get.
- True function $f_{w(true)}(X)$

- Bias
 - $bias(f_{\hat{w}}(X_t)) = f_{w(true)}(X_t) - f_{\bar{w}}(X_t)$
 - it enters the average prediction error $\sigma^2 + [bias(f_{\hat{w}}(X_t))]^2 + var(f_{\hat{w}}(X_t))$ as
 - `bias squared`, which puts it on the same scale (squared units of $y$) as the other terms ($\sigma^2, var(f_{\hat{w}}(X_t))$)

<img src="./figures/w3-f15.png" width=370 align=left>
<img src="./figures/w3-f16.png" width=350 align=right>

#### Variance of function estimator

- Variance
 - Over all possible fits, how much do they deviate from the expected fit?
 - how much variation is there in the training dataset specific fits across all training datasets we might see?
 - $var(f_{\hat{w}}(X_t)) = E_{train}[(f_{\hat{w}(train)}(X_t) - f_{\bar{w}}(X_t))^2]$
 
<img src="./figures/w3-f17.png" width=370 align=left>
<img src="./figures/w3-f18.png" width=350 align=right>

#### Deriving expected prediction error
- Expected prediction error
 - $E_{\text{train}}[\text{generalization error of }\hat{w}(\text{train})]$
 - $E_{\text{train}}[E_{X, y}[L(y, f_{\hat{w}(train)}(X))]]$


1. Look at target $X_{t}$
2. Consider $L(y, f_{\hat{w}}(X)) = (y - f_{\hat{w}}(X))^2$


- **Expected prediction error at $X_t$**
 - = $E_{\text{train}, y_t}[(y_t - f_{\hat{w}(train)}(X_t))^2]$

<img src="./figures/w3-f18.png" width=400>
<img src="./figures/w3-f19.png" width=400>
<img src="./figures/w3-f20.png" width=400>
<img src="./figures/w3-f21.png" width=400>
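One way to get from the expected prediction error at $X_t$ to the three sources of error is sketched below. It assumes $y_t = f_{w(true)}(X_t) + \epsilon_t$ with $E[\epsilon_t] = 0$ and $E[\epsilon_t^2] = \sigma^2$, and additionally assumes that $\epsilon_t$ is independent of the training set (an assumption stated here for the sketch).

Step 1: separate the noise from the error of the fitted function.

$$
\begin{aligned}
E_{train, y_t}\left[(y_t - f_{\hat{w}(train)}(X_t))^2\right]
&= E_{train, y_t}\left[\left(\epsilon_t + (f_{w(true)}(X_t) - f_{\hat{w}(train)}(X_t))\right)^2\right] \\
&= E[\epsilon_t^2] + 2\,E[\epsilon_t]\,E_{train}\left[f_{w(true)}(X_t) - f_{\hat{w}(train)}(X_t)\right] + E_{train}\left[(f_{w(true)}(X_t) - f_{\hat{w}(train)}(X_t))^2\right] \\
&= \sigma^2 + E_{train}\left[(f_{w(true)}(X_t) - f_{\hat{w}(train)}(X_t))^2\right]
\end{aligned}
$$

Step 2: split the remaining term around the average fit $f_{\bar{w}}(X_t) = E_{train}[f_{\hat{w}(train)}(X_t)]$.

$$
\begin{aligned}
E_{train}\left[(f_{w(true)}(X_t) - f_{\hat{w}(train)}(X_t))^2\right]
&= E_{train}\left[\left((f_{w(true)}(X_t) - f_{\bar{w}}(X_t)) + (f_{\bar{w}}(X_t) - f_{\hat{w}(train)}(X_t))\right)^2\right] \\
&= (f_{w(true)}(X_t) - f_{\bar{w}}(X_t))^2 + 2\,(f_{w(true)}(X_t) - f_{\bar{w}}(X_t))\,E_{train}\left[f_{\bar{w}}(X_t) - f_{\hat{w}(train)}(X_t)\right] + E_{train}\left[(f_{\hat{w}(train)}(X_t) - f_{\bar{w}}(X_t))^2\right] \\
&= [bias(f_{\hat{w}}(X_t))]^2 + var(f_{\hat{w}}(X_t))
\end{aligned}
$$

The cross term in step 2 vanishes because $E_{train}[f_{\hat{w}(train)}(X_t)] = f_{\bar{w}}(X_t)$ by definition. Combining both steps gives $\sigma^2 + [bias(f_{\hat{w}}(X_t))]^2 + var(f_{\hat{w}}(X_t))$.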

---
### Data split for model selection

#### The regression/ML workflow
1. Model selection
 - Often, need to **choose tuning parameters** $\lambda$ controlling model complexity (e.g., degree of polynomial)
2. Model assessment
 - Having selected a model, **assess the generalization error**

#### Hypothetical implementation (`Overly optimistic!`)

<img src="./figures/w3-f23.png" width=300>

1. Model selection
 - For each considered model complexity $\lambda$:
   - Estimate parameters $\hat{w}_{\lambda}$ on **training data**
   - Assess performance of $\hat{w}_{\lambda}$ on **test data**
   - Choose $\lambda^{*}$ to be $\lambda$ with **lowest test error**


2. Model assessment
 - Compute test error of $\hat{w}_{\lambda^*}$ (fitted model for selected complexity $\lambda^*$) to approximate generalization error
 

- **Issue:** This is just like fitting $\hat{w}$ and assessing its performance on the same training data
 - $\lambda^*$ was selected to minimize **test error** (i.e., $\lambda^*$ was fit on test data)
 - If test data is not representative of the whole world, then $\hat{w}_{\lambda^*}$ will typically perform worse than **test error** indicates

#### Practical implementation(`Solution`)

<img src="./figures/w3-f24.png" width=300>

- **Solution :** Create two "test" sets!
 - Select $\lambda^*$ such that $\hat{w}_{\lambda^*}$ minimizes error on **validation set**
 - Approximate generalization error of $\hat{w}_{\lambda^*}$ using **test set**
- How should the data be split between training set, validation set, and test set?
 - No hard and fast rule, no one answer
 - Typical splits are 80:10:10, 50:25:25 etc.
 - This assumes you have enough data to do this type of split and still get reasonable estimates of your model parameters and reasonable notions of how different model complexities compare.
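The three-way workflow can be sketched end to end. As a stand-in for "model complexity", this example uses a ridge penalty $\lambda$ on a one-parameter model $y = w x$; that choice, the candidate values, and the synthetic data are all assumptions made for illustration and do not come from the notes.

```python
import random


def fit_ridge_slope(xs, ys, lam):
    """Ridge fit of y = w * x (no intercept): w = sum(x*y) / (sum(x^2) + lam)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def mse(w, xs, ys):
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)


def unzip(pairs):
    xs, ys = zip(*pairs)
    return list(xs), list(ys)


# Synthetic data, generated in random order, then split 80:10:10
rng = random.Random(0)
inputs = [rng.uniform(0.0, 1.0) for _ in range(100)]
data = [(x, 2.0 * x + rng.gauss(0.0, 0.5)) for x in inputs]
train, valid, test = data[:80], data[80:90], data[90:]
x_tr, y_tr = unzip(train)
x_va, y_va = unzip(valid)
x_te, y_te = unzip(test)

# 1. Model selection: fit each candidate on training data, compare on validation data
candidates = [0.0, 0.1, 1.0, 10.0, 100.0]
fits = {lam: fit_ridge_slope(x_tr, y_tr, lam) for lam in candidates}
best_lam = min(candidates, key=lambda lam: mse(fits[lam], x_va, y_va))

# 2. Model assessment: the test set is touched only once, after selection
test_error = mse(fits[best_lam], x_te, y_te)
print("selected lambda:", best_lam, "test error:", test_error)
```

Because the test set plays no part in choosing $\lambda^*$, the final test error is not biased by the selection step the way it is in the hypothetical implementation.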

---
### What you can do now...
- Describe what a loss function is and give examples
- Contrast training, generalization, and test error
- Compute training and test error given a loss function
- Discuss issue of assessing performance on training set
- Describe tradeoffs in forming training/test splits
- List and interpret the 3 sources of avg. prediction error
 - Irreducible error, bias, and variance
- Discuss issue of selecting model complexity on test data and then using test error to assess generalization error
- Motivate use of a validation set for selecting tuning parameters (e.g., model complexity)
- Describe overall regression workflow