# Chapter 5.3 Training Error and Overfitting

In the last section, you learned how to build regression models. In this section, you will learn how to evaluate the quality of predictions from a model.

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
pd.options.display.max_rows = 5

housing = pd.read_csv("/data301/data/AmesHousing.txt", sep="\t")
housing

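| | Order | PID | MS SubClass | MS Zoning | Lot Frontage | Lot Area | Street | Alley | Lot Shape | Land Contour | ... | Pool Area | Pool QC | Fence | Misc Feature | Misc Val | Mo Sold | Yr Sold | Sale Type | Sale Condition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 526301100 | 20 | RL | 141.0 | 31770 | Pave | | IR1 | Lvl | ... | 0 | | | | 0 | 5 | 2010 | WD | Normal | 215000 |
| 1 | 2 | 526350040 | 20 | RH | 80.0 | 11622 | Pave | | Reg | Lvl | ... | 0 | | MnPrv | | 0 | 6 | 2010 | WD | Normal | 105000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2928 | 2929 | 924100070 | 20 | RL | 77.0 | 10010 | Pave | | Reg | Lvl | ... | 0 | | | | 0 | 4 | 2006 | WD | Normal | 170000 |
| 2929 | 2930 | 924151050 | 60 | RL | 74.0 | 9627 | Pave | | Reg | Lvl | ... | 0 | | | | 0 | 11 | 2006 | WD | Normal | 188000 |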


## Performance Metrics for Regression Models

To evaluate the performance of a regression model, we compare the predicted labels from the model against the true labels. Since the labels are quantitative, it makes sense to calculate the difference between each predicted label $\hat y_i$ and the corresponding true label $y_i$, and then average the squared differences. This measure of error is known as **mean squared error** (or **MSE**, for short):

$$ 
\begin{align*}
\textrm{MSE} &= \textrm{mean of } (y_i - \hat y_i)^2.
\end{align*}
$$ 

MSE is difficult to interpret because its units are the square of the units of $y$. To make MSE more interpretable, it is common to take the _square root_ of the MSE to obtain the **root mean squared error** (or RMSE, for short):

$$ 
\begin{align*}
\textrm{RMSE} &= \sqrt{\textrm{MSE}}.
\end{align*}
$$ 

The RMSE is in the same units as $y$ and measures how far off a "typical" prediction is.


Another common measure of error is the **mean absolute error** (or **MAE**, for short):

$$ 
\begin{align*}
\textrm{MAE} &= \textrm{mean of } |y_i - \hat y_i|.
\end{align*}
$$ 

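Like the RMSE, the MAE is in the same units as $y$ and measures how far off a "typical" prediction is. Because the RMSE squares the errors before averaging them, it penalizes large errors more heavily than the MAE does. There are other metrics that can be used to measure the quality of a regression model, but these are the most common ones.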
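
To make the three definitions concrete, here is a minimal sketch that computes them with `numpy`. The helper name `regression_errors` and the four prices are made up for illustration; they are not part of the housing analysis.

```python
import numpy as np

def regression_errors(y_true, y_pred):
    """Return (MSE, RMSE, MAE) for two equal-length sequences of labels."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    diffs = y_true - y_pred
    mse = np.mean(diffs ** 2)        # mean of squared differences
    rmse = np.sqrt(mse)              # back in the units of y
    mae = np.mean(np.abs(diffs))     # mean of absolute differences
    return mse, rmse, mae

# Made-up sale prices (in dollars), for illustration only.
y_true = [200000, 150000, 300000, 250000]
y_pred = [210000, 140000, 330000, 250000]

mse, rmse, mae = regression_errors(y_true, y_pred)
```

In this toy example the RMSE (about \$16,600) is larger than the MAE (\$12,500), because squaring gives extra weight to the single \$30,000 miss.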

## Training Error

To calculate the MSE, RMSE, or MAE, we need data where the true labels are known. Where do we find such data? One natural source of labeled data is the training data, since we need the true labels to be able to train a model.

For $k$-nearest neighbors, the training data is the data from which the $k$-nearest neighbors are selected to make predictions. So to calculate the training RMSE, we do the following for each observation in the training data:

1. Find its $k$-nearest neighbors in the training data.
2. Average the labels of the $k$-nearest neighbors to obtain the predicted label.
3. Subtract the predicted label from the true label.

Then, we can, for example, square these differences and take their mean to obtain the MSE. Taking the square root of the MSE gives the RMSE.

Let's calculate the training error for a 10-nearest neighbors model for house price using a subset of features from the Ames housing data set. First, let's get the predicted labels for each house in the training data.

In [None]:
# Features in our model. All quantitative, except Neighborhood.
features = ["Lot Area", "Gr Liv Area",
            "Full Bath", "Half Bath",
            "Bedroom AbvGr", 
            "Year Built", "Yr Sold",
            "Neighborhood"]

X_train = pd.get_dummies(housing[features])
y_train = housing["SalePrice"]

In [None]:
# Standardize the variables.
X_train_std = (X_train - X_train.mean()) / X_train.std()

# Use the average price of the 10 nearest neighbors as the prediction.
def get_10NN_prediction(obs):
    """Given a new observation (standardized), find its
       10-nearest neighbors in the training data and 
       return the average label.
    """
    # Omit the square root, since only the order of the distances matters.
    dists = ((obs - X_train_std) ** 2).sum(axis=1)
    i_nearest = dists.sort_values().index[:10]
    return y_train[i_nearest].mean()

# Apply the function above to each row in the training data.
y_train_pred = X_train_std.apply(get_10NN_prediction, axis=1)
y_train_pred

Finally, we compare the predicted labels against the actual labels. Note that the actual labels for the training data are found in `y_train`.

In [None]:
mse = ((y_train - y_train_pred) ** 2).mean()
mse

This number is very large and not very interpretable (because it is in units of "dollars squared"). Let's take the square root to obtain the RMSE.

In [None]:
rmse = np.sqrt(mse)
rmse

The RMSE says that a typical prediction from our model is off by about \$33,000. Not great, but not too bad when an average house is worth about \$180,000.

### The Problem with Training Error

Training error is not a great measure of the quality of a model. To see why, consider a 1-nearest neighbor regression model. Before you read on, can you guess what the training error of a 1-nearest neighbor regression model will be?

In [None]:
# Use the price of the 1 nearest neighbor as the prediction.
def get_1NN_prediction(obs):
    dists = ((obs - X_train_std) ** 2).sum(axis=1)
    return y_train[dists.idxmin()]

# Apply the function above to each row in the training data.
y_train_pred = X_train_std.apply(get_1NN_prediction, axis=1)

# Calculate the difference between the predictions and the true labels.
mse = ((y_train - y_train_pred) ** 2).mean()
rmse = np.sqrt(mse)
rmse

The training error of this model seems too good to be true. Can our model really be off by just \$1000 on average?

The error is small only because each point in the training data is at distance 0 from itself, so its nearest neighbor is almost always the point itself! In fact, if we look at the vector of differences between the true and predicted labels, we see that most of the differences are zero.

In [None]:
y_train - y_train_pred

Why isn't the MSE exactly equal to 0, then? That is because there may be multiple houses in the training data with the exact same values of the features, so there may be multiple observations that are a distance of 0.0 away. Any one of these observations can be the "1-nearest neighbor". If the house that we select to be the nearest neighbor happens to be different from the house we are predicting for, then there is no guarantee that its price will be the same.
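
The effect of ties can be reproduced on a tiny example. The sketch below uses made-up data (`X_toy`, `y_toy`) in which the first two rows have identical features but different labels. It relies on the fact that `idxmin()` returns the first index label at which the minimum occurs.

```python
import pandas as pd

# Toy standardized features: rows 0 and 1 are identical (hypothetical data).
X_toy = pd.DataFrame({"x1": [0.5, 0.5, -1.0],
                      "x2": [1.0, 1.0, 0.0]})
y_toy = pd.Series([100, 120, 90])

def get_1NN_prediction_toy(obs):
    dists = ((obs - X_toy) ** 2).sum(axis=1)
    # With ties at distance 0, idxmin() picks the first tied row.
    return y_toy[dists.idxmin()]

y_toy_pred = X_toy.apply(get_1NN_prediction_toy, axis=1)
errors = y_toy - y_toy_pred
```

Row 1 is tied with row 0 at distance 0, so it is assigned row 0's label and its training error is nonzero, even though row 1 itself is in the training data.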

How many houses did the model get wrong using 1-nearest neighbors?

In [None]:
((y_train - y_train_pred) != 0).sum()

The model nailed the price exactly for all but 22 of the 2930 houses. Clearly, training error is too optimistic about the performance of the 1-nearest neighbor model.

## Test Error

The problem with training error is that it reports the error of the model on data that it has already seen. The purpose of building a machine learning model is to predict the labels for new data---data that it did not train on. We would like to measure how well our model would perform on new data. Unfortunately, acquiring new data is expensive. How can we emulate the process of evaluating our model on new data, using just the data that we have?

We can split our data into two halves: a **training set**, which will be used to train the model, and a **test set**, which will be used to evaluate the model. To split our data into training and test sets, we can use the `.sample()` method in `pandas`, which selects rows at random. Let's use this to split our data into two equal halves, which we will call `train` and `test`.

In [None]:
train = housing.sample(frac=.5)
test = housing.drop(train.index)

train
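
Because `.sample()` draws rows at random, the split above changes every time the cell is run. One way to make it repeatable is to pass the `random_state` argument. The sketch below shows this on a small made-up `DataFrame` (`toy`), and checks that the two sets do not overlap. It assumes the index labels are unique, as they are in `housing`.

```python
import pandas as pd

# A small stand-in for the housing data (hypothetical values).
toy = pd.DataFrame({
    "Gr Liv Area": [1500, 900, 2100, 1200, 1750, 1000],
    "SalePrice": [215000, 105000, 320000, 150000, 240000, 120000],
})

# random_state fixes the random draw, so the split is the same on every run.
toy_train = toy.sample(frac=.5, random_state=0)
toy_test = toy.drop(toy_train.index)

# Every row lands in exactly one of the two sets.
overlap = toy_train.index.intersection(toy_test.index)
```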

Now let's use this train/test split to estimate the **test error** of a 10-nearest neighbors model.

First, we extract the variables we need. We have to be careful if we have categorical features, because categories that appear in the test set may not appear in the training set (and vice versa). To be cautious, we will call `get_dummies` on both sets together and then split them again. Notice the use of `.iloc` below: we split by position rather than by label, because our indexes were scrambled by the train/test split.

In [None]:
X_train = train[features]
X_test = test[features]

y_train = train["SalePrice"]
y_test = test["SalePrice"]

X = pd.get_dummies(pd.concat([X_train, X_test]))
X_train = X.iloc[:len(X_train)]
X_test = X.iloc[len(X_train):]

X_train
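
To see why the two sets are encoded together, consider the following sketch. The data is made up: the category `"Gilbert"` appears only in the toy test set, so encoding the two sets separately produces different columns.

```python
import pandas as pd

# Hypothetical toy data: "Gilbert" appears only in the test rows.
toy_train = pd.DataFrame({"Neighborhood": ["NAmes", "OldTown", "NAmes"]})
toy_test = pd.DataFrame({"Neighborhood": ["Gilbert", "NAmes"]})

# Encoding separately gives the two sets different columns.
cols_train_separate = list(pd.get_dummies(toy_train).columns)
cols_test_separate = list(pd.get_dummies(toy_test).columns)

# Encoding together, then splitting by position, keeps the columns aligned.
combined = pd.get_dummies(pd.concat([toy_train, toy_test]))
X_toy_train = combined.iloc[:len(toy_train)]
X_toy_test = combined.iloc[len(toy_train):]
```

With mismatched columns, distances between a test observation and the training observations would be computed over different sets of features. After the combined encoding, both `X_toy_train` and `X_toy_test` have the same three dummy columns.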

Now, let's standardize the training data. Remember that we have to standardize any new data in exactly the same way, so we need to also standardize the test data using the mean and SD of the _training_ data.

In [None]:
X_train_std = (X_train - X_train.mean()) / X_train.std()
X_test_std = (X_test - X_train.mean()) / X_train.std()
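
A small worked example may help here. The sketch below uses made-up values of a single feature and standardizes both sets with the training mean and SD. Note that `pandas` computes the sample SD (dividing by $n - 1$) by default.

```python
import pandas as pd

# Hypothetical toy feature values.
toy_train = pd.DataFrame({"Gr Liv Area": [1000.0, 1500.0, 2000.0]})
toy_test = pd.DataFrame({"Gr Liv Area": [1500.0, 2500.0]})

train_mean = toy_train.mean()   # 1500
train_sd = toy_train.std()      # 500 (sample SD, ddof=1)

toy_train_std = (toy_train - train_mean) / train_sd
toy_test_std = (toy_test - train_mean) / train_sd
```

The test house with 1500 square feet gets the standardized value 0, the same value as the training house with 1500 square feet. Had we used the mean and SD of the test set instead, two houses with identical living areas would have ended up with different standardized values.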

We are now in a position to predict on the test set and evaluate the test error. To do this, we need to find, for each observation in the test set, its 10 nearest neighbors in the training set.

In [None]:
def get_10NN_prediction(obs):
    """Given a new observation (standardized), find its
       10-nearest neighbors in the training set and 
       return the average label.
    """
    dists = ((obs - X_train_std) ** 2).sum(axis=1)
    i_nearest = dists.sort_values().index[:10]
    return y_train[i_nearest].mean()

# Apply the function above to each row in the test set.
y_test_pred = X_test_std.apply(get_10NN_prediction, axis=1)
y_test_pred

Notice that the observations that we are predicting on (i.e., the test set) are completely distinct from the observations that we use to obtain the predictions (i.e., the training set).

Finally, let's calculate the test RMSE.

In [None]:
mse = ((y_test - y_test_pred) ** 2).mean()
rmse = np.sqrt(mse)
rmse

Notice that the test error is higher than the training error that we calculated earlier. This is usually the case: it is harder for a model to predict for new observations than for observations it has already seen!

## Cross Validation

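One downside of the test error above is that only half of the data was used to evaluate the model (and only the other half to train it). As a result, the estimate is noisy: a different random split could give a noticeably different number.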

There is a cheap way to obtain a second opinion of how well our model will do on new data. Previously, we split our data at random into two halves, training the model on the first half and evaluating it using the second half. Because the model has not already seen the second half of the data, this is a valid measure of how well the model would perform on new data. 

But the way we split our data was arbitrary. We can just as well swap the roles of the two halves, training the model on the _second_ half and evaluating it using the _first_ half. As long as the model is being evaluated on data that is different from the data that was used to train it, we have a valid measure of how well our model would perform on new data. A schematic of this approach, known as **cross-validation**, is shown below.

<img src="cross-validation.png" />

Because we will be doing the same computations twice, just with different data, let's wrap the $k$-nearest neighbors algorithm above in a function called `get_test_error()` that computes the test error given training and test data.

In [None]:
def get_test_error(X_train, y_train, X_test, y_test):
    
    # standardize the data with respect to the training set
    X_train_std = (X_train - X_train.mean()) / X_train.std()
    X_test_std = (X_test - X_train.mean()) / X_train.std()
    
    # a function that returns the 10NN prediction for a given observation
    def get_10NN_prediction(obs):
        dists = ((obs - X_train_std) ** 2).sum(axis=1)
        i_nearest = dists.sort_values().index[:10]
        return y_train[i_nearest].mean()

    # get the predictions for the test set
    y_test_pred = X_test_std.apply(get_10NN_prediction, axis=1) 
    
    # calculate the RMSE
    mse = ((y_test - y_test_pred) ** 2).mean()
    rmse = np.sqrt(mse)
    return rmse

If we apply this function to the training and test sets from earlier, we get the same estimate of the test error.

In [None]:
get_test_error(X_train, y_train, X_test, y_test)

But if we reverse the roles of the training and test sets, we get another estimate of the test error.

In [None]:
get_test_error(X_test, y_test, X_train, y_train)

Now we have two somewhat independent estimates of the test error. It is common to average the two to obtain an overall estimate of the test error, called the **cross-validation test error**. Notice that the cross-validation error uses each observation in the data exactly once as a test observation. We make a prediction for each observation, but always using a model that was trained on data that does not include that observation.
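
The whole two-fold procedure can be collected into one function. The sketch below is one possible way to do this, not the only one. The names `knn_test_rmse` and `two_fold_cv_rmse`, the parameter `k`, and the synthetic data are all assumptions introduced for illustration.

```python
import numpy as np
import pandas as pd

def knn_test_rmse(X_train, y_train, X_test, y_test, k=10):
    """Test RMSE of a k-nearest neighbors regression model."""
    # Standardize both sets with the training mean and SD.
    mean, sd = X_train.mean(), X_train.std()
    X_train_std = (X_train - mean) / sd
    X_test_std = (X_test - mean) / sd

    def predict(obs):
        dists = ((obs - X_train_std) ** 2).sum(axis=1)
        i_nearest = dists.sort_values().index[:k]
        return y_train.loc[i_nearest].mean()

    y_test_pred = X_test_std.apply(predict, axis=1)
    return np.sqrt(((y_test - y_test_pred) ** 2).mean())

def two_fold_cv_rmse(X, y, k=10, random_state=0):
    """Use each random half once as the test set; average the two RMSEs."""
    first = X.sample(frac=.5, random_state=random_state).index
    second = X.index.difference(first)
    rmse_a = knn_test_rmse(X.loc[first], y.loc[first],
                           X.loc[second], y.loc[second], k)
    rmse_b = knn_test_rmse(X.loc[second], y.loc[second],
                           X.loc[first], y.loc[first], k)
    return (rmse_a + rmse_b) / 2

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({"area": rng.uniform(800, 3000, size=40),
                      "baths": rng.integers(1, 4, size=40).astype(float)})
y_toy = (100 * X_toy["area"] + 10000 * X_toy["baths"]
         + rng.normal(0, 5000, size=40))

cv_rmse = two_fold_cv_rmse(X_toy, y_toy, k=3)
```

For the housing data, the same function could be called on the dummy-encoded features and `housing["SalePrice"]`, provided that the features are encoded before the split so that both halves have the same columns.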

# Exercises

**Exercise 1.** Use cross-validation to estimate the test error of a 1-nearest neighbor regression model on the housing price data. How does a 1-nearest neighbor model compare to a 10-nearest neighbor model in terms of test error?

In [None]:
# TYPE YOUR CODE HERE.

**Exercise 2.** Train a $k$-nearest neighbors regression model to predict the tip using the Tips dataset (`/data301/data/tips.csv`). Calculate the training and test errors of your model.

In [None]:
# TYPE YOUR CODE HERE.