# Evaluating Regression Models

## Training and Test Data

The data for which we know the label $y$ is called the **training data.**

The data for which we don't know $y$ (and want to predict it) is called the **test data.**

In [1]:
import pandas as pd

df = pd.read_csv("data/bordeaux.csv", index_col="year")
df_train = df.loc[:1980].copy()
df_test = df.loc[1981:].copy()

Let's separate the inputs $X$ from the labels $y$.

In [2]:
X_train = df_train[["win", "summer"]]
y_train = df_train["price"]

X_test = df_test[["win", "summer"]]

## $K$-Nearest Neighbors

We've seen one machine learning model: $k$-nearest neighbors.

In [4]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor

pipeline = make_pipeline(
    StandardScaler(),
    KNeighborsRegressor(n_neighbors=5)
)

pipeline.fit(X=X_train, y=y_train)
pipeline.predict(X=X_test)

array([35.8, 54. , 52.2, 18.4, 35.6, 13.2, 37. , 51.4, 36.6, 36.6, 40.6])

*Today*: How do we know if this model is any good?

## Prediction Error

If the true labels are $y_1, \dots, y_n$ and our model predicts $\hat{y}_1, \dots, \hat{y}_n,$ how do we measure how well our model did?

- **mean squared error (MSE)**

$$
\mathsf{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2
$$

- **mean absolute error (MAE)**

$$
\mathsf{MAE} = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|
$$
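As a quick check of the two formulas, here is a minimal sketch on made-up numbers (the arrays `y` and `y_hat` are hypothetical and not taken from the wine data). It computes each metric by hand with NumPy and compares the result against Scikit-Learn's `mean_squared_error` and `mean_absolute_error`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical true labels and predictions, for illustration only.
y = np.array([10.0, 20.0, 30.0, 40.0])
y_hat = np.array([12.0, 18.0, 33.0, 40.0])

# MSE: average of the squared differences.
mse = np.mean((y - y_hat) ** 2)

# MAE: average of the absolute differences.
mae = np.mean(np.abs(y - y_hat))

# The Scikit-Learn functions should agree with the hand calculations.
mse_sklearn = mean_squared_error(y_true=y, y_pred=y_hat)
mae_sklearn = mean_absolute_error(y_true=y, y_pred=y_hat)

print(mse, mae)
```

The errors here are $-2, 2, -3, 0$, so the MSE is $(4 + 4 + 9 + 0)/4 = 4.25$ and the MAE is $(2 + 2 + 3 + 0)/4 = 1.75$. The one large error counts for more in the MSE than in the MAE because it is squared.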

Calculating MSE or MAE requires data where true labels are known. Where can we find such data?

## Training Error

On the training data, the true labels $y_1, \dots, y_n$ are known.

Let's calculate the **training error** of our model.

In [5]:
pipeline.fit(X=X_train, y=y_train)
y_train_ = pipeline.predict(X=X_train)
((y_train - y_train_) ** 2).mean()

np.float64(207.24148148148146)

There's also a Scikit-Learn function for that!

In [6]:
from sklearn.metrics import mean_squared_error
mean_squared_error(y_true=y_train, y_pred=y_train_)

207.24148148148146

How do we interpret this MSE of $207.24$? <br />
Remember, we are predicting the price of wine. So the model is off by 207.24 squared dollars on average.

The square root, called the **root mean squared error (RMSE)**, is easier to interpret because it is in dollars: the model's typical error is about $\sqrt{207.24} \approx \$14.40$.
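The square root of the MSE is often called the root mean squared error (RMSE). A minimal sketch of the calculation, reusing the training MSE printed above; the helper `rmse` is a hypothetical convenience function, not part of Scikit-Learn:

```python
import numpy as np
from sklearn.metrics import mean_squared_error


def rmse(y_true, y_pred):
    """Hypothetical helper: square root of the mean squared error."""
    return np.sqrt(mean_squared_error(y_true=y_true, y_pred=y_pred))


# Training MSE reported above, in squared dollars.
train_mse = 207.24148148148146

# Taking the square root brings the error back to dollars.
train_rmse = np.sqrt(train_mse)
print(round(train_rmse, 2))
```

With the notebook's variables in scope, `rmse(y_train, y_train_)` should give the same value.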

## The Problem with Training Error

What's the training error of a $1$-nearest neighbor model?

In [7]:
pipeline = make_pipeline(
    StandardScaler(),
    KNeighborsRegressor(n_neighbors=1)
)
pipeline.fit(X=X_train, y=y_train)
y_train_ = pipeline.predict(X=X_train)
mean_squared_error(y_true=y_train, y_pred=y_train_)

0.0

Why did this happen? <br />
The 1-nearest neighbor to any observation in the training data is the observation itself!

A 1-nearest neighbor model will always be perfect on the training data (as long as no two training observations have identical inputs but different labels). But is it necessarily the best model?
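The same effect shows up on any dataset, not just the wine data. Here is a minimal sketch on synthetic data (the data-generating process is an assumption made up for illustration). It compares the training MSE for $k = 1$ and $k = 5$.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)

# Synthetic inputs: continuous draws, so no two rows coincide.
X = rng.uniform(0, 10, size=(30, 2))

# Synthetic labels: a linear signal plus noise (an assumed toy model).
y = X[:, 0] + rng.normal(scale=2.0, size=30)

train_mse = {}
for k in (1, 5):
    model = KNeighborsRegressor(n_neighbors=k).fit(X, y)
    # Predict on the same data the model was fit to.
    train_mse[k] = mean_squared_error(y_true=y, y_pred=model.predict(X))

print(train_mse)
```

The training MSE for $k = 1$ is exactly zero even though the labels contain pure noise. A training error of zero therefore says nothing about how well the model captures the underlying signal.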

## Test Error

We don't need to know how well our model does on *training data.*

We want to know how well it will do on *test data.*

In general, test error $>$ training error, because the model was fit to the training data but has never seen the test data.

Analogy: A professor posts a practice exam before an exam.
- If the actual exam is the same as the practice exam, how many points will students miss? That's training error.
- If the actual exam is different from the practice exam, how many points will students miss? That's test error.

It's always easier to answer questions that you've seen before than questions you haven't seen.
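In a simulation we can generate labels for new data, which is not possible with real test data. The sketch below uses an assumed toy data-generating process (the `simulate` function is hypothetical). It fits a 5-nearest neighbor model many times and compares its average error on the data it was fit to with its average error on fresh data from the same process.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)


def simulate(n):
    """Hypothetical toy process: a sine signal plus Gaussian noise."""
    X = rng.uniform(0, 10, size=(n, 1))
    y = np.sin(X[:, 0]) + rng.normal(scale=0.5, size=n)
    return X, y


train_errors, new_errors = [], []
for _ in range(50):
    X_tr, y_tr = simulate(100)    # data the model is fit to
    X_new, y_new = simulate(100)  # fresh data the model never sees

    model = KNeighborsRegressor(n_neighbors=5).fit(X_tr, y_tr)
    train_errors.append(mean_squared_error(y_tr, model.predict(X_tr)))
    new_errors.append(mean_squared_error(y_new, model.predict(X_new)))

avg_train = np.mean(train_errors)
avg_new = np.mean(new_errors)
print(avg_train, avg_new)
```

Averaged over the repetitions, the error on fresh data should come out larger than the training error. In any single repetition the gap can be smaller or even reversed by chance, which is why the claim holds "in general" rather than always.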

*Now:* How do we estimate the test error?

## Validation Set

The training data is the only data we have where the true labels $y$ are known.

So one way to estimate the test error is to not use all of the training data to fit the model, leaving the remaining data for estimating the test error.

## Implementing the Validation Set