# Estimating test error

As we have seen, _training error_ is often a poor indicator of how well a machine learning model will perform when confronted with new data.  Today, we will look at the problem of estimating the _test error_ of a model.  Estimating test error is critical for building machine learning models that give us reliable predictions on new input data.

In [None]:
import numpy as np
import plotnine as pn
import pandas as pd
from numpy.random import default_rng
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, cross_val_score, KFold


## 1. Data

Let's start by loading a dataset that comprises 100 observations of the function we investigated in detail during the last class period:

$$ y = \frac{1}{2} \cos(3\pi x) + \frac{3\pi x}{5} + \epsilon ,$$

where $Var[\epsilon] = 0.5$.

In [None]:
df = pd.read_csv('weird_function_data.csv')
#df = pd.read_csv('/blue/zoo4926/share/Jupyter_Content/data/weird_function_data.csv')
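
If the CSV file is not at hand, data of the same form can be simulated directly from the equation above. This is only a sketch: the uniform range for $x$, the Gaussian noise, the column names `x` and `y`, and the function name `simulate_weird_function` are all assumptions, and the real file may differ on each point.

```python
import numpy as np
import pandas as pd
from numpy.random import default_rng

def simulate_weird_function(n=100, noise_var=0.5, seed=0):
    """Draw n noisy observations of y = 0.5*cos(3*pi*x) + 3*pi*x/5 + eps."""
    rng = default_rng(seed)
    # Assumption: x is uniform on [0, 1]; the range in the CSV may differ.
    x = rng.uniform(0.0, 1.0, size=n)
    # Var[eps] = noise_var, so the standard deviation is its square root.
    # Assumption: the noise is Gaussian.
    eps = rng.normal(0.0, np.sqrt(noise_var), size=n)
    y = 0.5 * np.cos(3 * np.pi * x) + 3 * np.pi * x / 5 + eps
    # Assumption: hypothetical column names 'x' and 'y'.
    return pd.DataFrame({'x': x, 'y': y})

sim_df = simulate_weird_function()
```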

## 2. Inspect the training error

Let's fit polynomials of various degree to the full dataset and look at how the training error, as defined by $MSE$, changes.  Remember,

$$ MSE = \frac{1}{N} \sum_{i=1}^N (y_i - \hat{y}_i)^2 .$$

**Challenge:** Implement a function to calculate $MSE$, given a model, an $x$ matrix, and an array $y$ of "true" values.
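
One possible solution is sketched here, together with a loop that fits polynomials of degree 1 through 8 and records the training $MSE$ of each. The data are a synthetic stand-in (an assumption, not the course dataset), and the names `x_demo`, `y_demo`, and `train_mse` are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

def mse(model, x, y):
    """Mean squared error of a fitted model's predictions on (x, y)."""
    residuals = np.asarray(y) - model.predict(x)
    return np.mean(residuals ** 2)

# Synthetic stand-in data drawn from the lecture's function (assumed x range).
rng = np.random.default_rng(0)
x_demo = rng.uniform(0, 1, size=(100, 1))
y_demo = (0.5 * np.cos(3 * np.pi * x_demo[:, 0])
          + 3 * np.pi * x_demo[:, 0] / 5
          + rng.normal(0, np.sqrt(0.5), size=100))

# Fit one polynomial per degree to the full dataset and record training MSE.
train_mse = {}
for degree in range(1, 9):
    poly_model = Pipeline([
        ('poly', PolynomialFeatures(degree=degree, include_bias=False)),
        ('ols', LinearRegression()),
    ])
    poly_model.fit(x_demo, y_demo)
    train_mse[degree] = mse(poly_model, x_demo, y_demo)

for degree, err in train_mse.items():
    print(f"degree {degree}: training MSE = {err:.4f}")
```

Because each lower-degree polynomial is a special case of the higher-degree ones, the training $MSE$ of a least-squares fit cannot go up as the degree increases, which is exactly why it is a poor guide to test error.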

## 3. The validation set method

One way to estimate test error is to randomly split the training dataset into two parts, designating one part as the _training set_ and the other part as the _validation set_.  We train our model on the training set and use the validation set to estimate the test error.

In [None]:
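# x and y are the predictor matrix and response array taken from df.
# Half of the observations are held out as the validation set.
x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.5)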

**Question:** Do you see any problems with this approach?
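
One way to explore this question is to repeat the random split several times and watch how the validation $MSE$ moves. The sketch below does this for a simple linear fit on synthetic stand-in data (an assumption, not the course dataset); the names `x_demo`, `y_demo`, and `val_mse` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data drawn from the lecture's function (assumed x range).
rng = np.random.default_rng(1)
x_demo = rng.uniform(0, 1, size=(100, 1))
y_demo = (0.5 * np.cos(3 * np.pi * x_demo[:, 0])
          + 3 * np.pi * x_demo[:, 0] / 5
          + rng.normal(0, np.sqrt(0.5), size=100))

# Repeat the 50/50 split with 20 different seeds.
val_mse = []
for seed in range(20):
    x_tr, x_va, y_tr, y_va = train_test_split(
        x_demo, y_demo, test_size=0.5, random_state=seed)
    fit = LinearRegression().fit(x_tr, y_tr)
    # Validation MSE: error on the half the model never saw.
    val_mse.append(np.mean((y_va - fit.predict(x_va)) ** 2))

val_mse = np.array(val_mse)
print(f"validation MSE over 20 splits: "
      f"min = {val_mse.min():.3f}, max = {val_mse.max():.3f}")
```

The spread between the smallest and largest values comes entirely from which observations happened to land in each half, and every fit here used only half of the available data.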

## 4. $k$-fold Cross-validation

_$k$-fold cross-validation_ is a technique that overcomes some of the problems with using a single train/validation split.  The idea is that we randomly split our dataset into $k$ (approximately) equal-sized subsets.  We use each subset as the validation set in turn, training on the remaining $k - 1$ subsets each time.  This yields a total of $k$ estimates of our test error, which we then average to obtain our final test error estimate.  (Note that the way I am using our custom $MSE$ function in `cross_val_score()` is a bit of a hack.  See the [scikit-learn documentation](https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter) for the approved way of doing this.)

In [None]:
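# Shuffle before splitting so that fold membership is random.
kf = KFold(n_splits=10, shuffle=True)
# mse(model, x, y) is our custom function from section 2; res holds one score per fold.
res = cross_val_score(model, x, y, cv=kf, scoring=mse)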
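
For comparison, here is a sketch of the same calculation using scikit-learn's built-in `'neg_mean_squared_error'` scorer. Scikit-learn scorers follow a "greater is better" convention, so this scorer returns the negative of the $MSE$ and we flip the sign at the end. The data are again a synthetic stand-in, and the degree-3 polynomial and the names `x_demo`, `y_demo`, `cv_scores`, and `cv_mse` are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data drawn from the lecture's function (assumed x range).
rng = np.random.default_rng(2)
x_demo = rng.uniform(0, 1, size=(100, 1))
y_demo = (0.5 * np.cos(3 * np.pi * x_demo[:, 0])
          + 3 * np.pi * x_demo[:, 0] / 5
          + rng.normal(0, np.sqrt(0.5), size=100))

# Assumed model for the demonstration: a cubic polynomial fit.
cubic = Pipeline([
    ('poly', PolynomialFeatures(degree=3, include_bias=False)),
    ('ols', LinearRegression()),
])

kf = KFold(n_splits=10, shuffle=True, random_state=0)
# One (negated) MSE per fold.
cv_scores = cross_val_score(cubic, x_demo, y_demo, cv=kf,
                            scoring='neg_mean_squared_error')
# Flip the sign and average over the folds to get the test error estimate.
cv_mse = -cv_scores.mean()
print(f"10-fold CV estimate of test MSE: {cv_mse:.4f}")
```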

## 5. Leave-one-out cross-validation

What if, for $k$-fold cross-validation, we use $k = N$, where $N$ is the number of observations in our dataset?  Think for a moment about how this would work:
  * How big will our validation sets be?
  * How many model fits will we need?
 
This scenario is called _leave-one-out cross-validation_.
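
Scikit-learn provides a `LeaveOneOut` splitter that can be passed to `cross_val_score()` in place of `KFold`. The sketch below uses a small synthetic dataset of 30 observations (an assumption, chosen to keep the run time short) and a plain linear fit; the names `x_demo`, `y_demo`, and `loo_scores` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Small synthetic stand-in dataset (assumed x range and sample size).
rng = np.random.default_rng(3)
n_obs = 30
x_demo = rng.uniform(0, 1, size=(n_obs, 1))
y_demo = (0.5 * np.cos(3 * np.pi * x_demo[:, 0])
          + 3 * np.pi * x_demo[:, 0] / 5
          + rng.normal(0, np.sqrt(0.5), size=n_obs))

loo = LeaveOneOut()
# Each split holds out exactly one observation, so there are n_obs model fits.
loo_scores = cross_val_score(LinearRegression(), x_demo, y_demo, cv=loo,
                             scoring='neg_mean_squared_error')
loo_mse = -loo_scores.mean()
print(f"number of model fits: {len(loo_scores)}")
print(f"LOOCV estimate of test MSE: {loo_mse:.4f}")
```

Each entry of `loo_scores` is the negated squared error for a single held-out observation, so the average is taken over $N$ one-point validation sets.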

## 6. So which is best?

Short answer: bias and variance again, plus computational considerations!

## 7. Final thoughts

How do these methods connect to other methods for evaluating and comparing models?

## 8. Practice

Using the tools we discussed today, revisit the problem of predicting a car's fuel efficiency ($mpg$) from its engine's horsepower ($pw$).  Earlier, we looked at linear and quadratic models.  Would a higher-degree polynomial work better?  Don't forget to use `StandardScaler()` in your `Pipeline`.

In [None]:
df = pd.read_csv('data/auto_mpg.csv')
#df = pd.read_csv('/blue/zoo4926/share/Jupyter_Content/data/auto_mpg.csv')
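
As a starting point, the comparison across polynomial degrees could be organized along the following lines. This is a sketch under several assumptions: the helper name `cv_mse_by_degree` is hypothetical, the pipeline scales the predictor before expanding it into polynomial terms (one reasonable ordering, not the only one), and the demonstration uses made-up horsepower and mpg values rather than the columns of `auto_mpg.csv`, whose names are not shown here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

def cv_mse_by_degree(x, y, degrees, n_splits=10, seed=0):
    """Return {degree: k-fold CV estimate of test MSE} for polynomial fits."""
    # Use the same folds for every degree so the comparison is fair.
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    results = {}
    for degree in degrees:
        model = Pipeline([
            ('scale', StandardScaler()),
            ('poly', PolynomialFeatures(degree=degree, include_bias=False)),
            ('ols', LinearRegression()),
        ])
        scores = cross_val_score(model, x, y, cv=kf,
                                 scoring='neg_mean_squared_error')
        results[degree] = -scores.mean()
    return results

# Made-up data with a curved relationship (NOT the auto_mpg dataset).
rng = np.random.default_rng(4)
hp_demo = rng.uniform(50, 230, size=(200, 1))
mpg_demo = (55 - 0.4 * hp_demo[:, 0] + 0.001 * hp_demo[:, 0] ** 2
            + rng.normal(0, 2.0, size=200))

degree_mse = cv_mse_by_degree(hp_demo, mpg_demo, degrees=range(1, 7))
for degree, err in degree_mse.items():
    print(f"degree {degree}: CV MSE = {err:.3f}")
```

With the real data, `x` would be the horsepower column reshaped to two dimensions and `y` the mpg column.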