## k-Fold Cross-Validation

In this notebook we use 10-fold cross-validation to estimate the out-of-sample error of two regression models (linear and quadratic regression).
We will use sklearn's `cross_val_score`, which takes care of all the tasks associated with $k$-fold CV:
* splitting the dataset into folds,
* iterating over all the folds and choosing a different one as the test set at each iteration,
* training the model on the remaining $k-1$ folds,
* and finally computing the model's error on the left-out fold.

This function returns an array with one score per fold; with the scoring function we use below, these are the $k$ test MSEs with their sign flipped.
To get the final MSE of our model, we average these $k$ values and multiply the result by $-1$.
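The steps listed above can be sketched as an explicit loop. This is only an illustration of the idea, not sklearn's internal implementation. The synthetic data and the names `X_demo`, `y_demo`, `fold_mses`, `manual_mse` and `auto_mse` are assumptions made to keep the example self-contained.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data (an assumption, used only to keep the sketch self-contained)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

model = LinearRegression()
kf = KFold(n_splits=10, shuffle=True, random_state=0)

fold_mses = []
for train_idx, test_idx in kf.split(X_demo):
    # Train a fresh copy of the model on the k-1 training folds
    fold_model = clone(model)
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    # Compute the error on the left-out fold
    preds = fold_model.predict(X_demo[test_idx])
    fold_mses.append(mean_squared_error(y_demo[test_idx], preds))

# Average the k test MSEs
manual_mse = np.mean(fold_mses)

# The same estimate through cross_val_score (note the sign flip)
scores = cross_val_score(model, X_demo, y_demo,
                         scoring='neg_mean_squared_error', cv=kf)
auto_mse = -np.mean(scores)
```

With the same `KFold` object, `manual_mse` and `auto_mse` should agree up to floating-point error.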

**Technical note**: As a design decision by sklearn's authors, `cross_val_score` expects a *scoring* function, i.e., a function which is higher when the model performs better and lower when the model performs poorly. The MSE, on the other hand, is a *loss* function: it is low for good models and high for bad models. The simplest way to turn a loss function into a scoring function is to take its negative (multiply by -1). That's why the scoring function is referred to as `neg_mean_squared_error` (*neg* stands for negative), and why the values it produces are negative or zero.
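As a small check of the sign convention, the sketch below compares the plain MSE with the value produced by the `neg_mean_squared_error` scorer on the same fitted model. The synthetic data and the names `X_demo`, `y_demo`, `loss` and `score` are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import get_scorer, mean_squared_error

# Synthetic data (an assumption, used only for illustration)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=0.5, size=50)

fitted = LinearRegression().fit(X_demo, y_demo)

# The loss: lower is better
loss = mean_squared_error(y_demo, fitted.predict(X_demo))

# The score: higher is better, so it is the loss with its sign flipped
score = get_scorer('neg_mean_squared_error')(fitted, X_demo, y_demo)
```

Here `score` should equal `-loss`, which is why the averaged result is multiplied by -1 to recover the MSE.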

**Note**: I am going to fix the *random seed* used by `KFold` so that this notebook is reproducible: two people running it should now get the same results.
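To see what fixing the seed buys us, here is a minimal sketch that builds the folds twice with the same `random_state` and checks that they are identical. The helper `held_out_folds` and the toy array `rows` are hypothetical names introduced for this example.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy dataset of 20 rows (an assumption, used only for illustration)
rows = np.arange(20).reshape(-1, 1)

def held_out_folds(seed):
    """Return the test indices of each fold for a given random seed."""
    kf = KFold(n_splits=5, shuffle=True, random_state=seed)
    return [test_idx.tolist() for _, test_idx in kf.split(rows)]

# Same seed: the folds are identical from one run to the next
same_seed_match = held_out_folds(0) == held_out_folds(0)

# Different seeds: the folds will almost certainly differ
different_seed_match = held_out_folds(0) == held_out_folds(1)
```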

In [1]:
import pandas as pd

In [2]:
d = pd.read_csv('auto-mpg.csv')

In [3]:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import KFold, cross_val_score
import numpy as np

In [4]:
X = d.drop('mpg', axis=1)
y = d.mpg

### Creating the two models using sklearn's pipelines

In [5]:
linear_model = make_pipeline(
    StandardScaler(),
    LinearRegression()
)

In [6]:
quadratic_model = make_pipeline(
    PolynomialFeatures(degree=2),
    StandardScaler(),
    LinearRegression()
)
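
As a side illustration of what makes the second pipeline "quadratic", the sketch below applies `PolynomialFeatures(degree=2)` to a single toy row with two features. The input values and the name `expanded` are assumptions made for the example.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One toy row with two features, a=2 and b=3 (illustrative values)
toy_row = np.array([[2.0, 3.0]])

# degree=2 adds a constant column, the squares and the pairwise product
expanded = PolynomialFeatures(degree=2).fit_transform(toy_row)
# Columns are: 1, a, b, a^2, a*b, b^2
```

The linear regression at the end of the pipeline is then fitted on these expanded columns, so it stays linear in its coefficients while being quadratic in the original features.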

### Perform k-Fold CV and average the (negative) scores

Note how we use `cv=KFold()` to tell `cross_val_score` that we want to use $k$-Fold CV as our cross-validation method.
We pass the `n_splits=10` parameter (i.e., we do **10**-fold CV) and `shuffle=True` to make sure we shuffle rows before creating folds.
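
To illustrate the effect of `shuffle=True`, here is a small sketch on a toy array of 20 rows (an assumption for this example). Without shuffling, `KFold` cuts the rows into consecutive blocks, so any ordering present in the file ends up in the folds.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy dataset of 20 rows (an assumption, used only for illustration)
rows = np.arange(20).reshape(-1, 1)

# Without shuffling, the first test fold is simply the first block of rows
unshuffled = KFold(n_splits=10)
first_test_unshuffled = next(iter(unshuffled.split(rows)))[1]

# With shuffling, rows are assigned to folds at random (seed fixed for reproducibility)
shuffled = KFold(n_splits=10, shuffle=True, random_state=0)
first_test_shuffled = next(iter(shuffled.split(rows)))[1]
```

In the unshuffled case the first test fold contains rows 0 and 1; in the shuffled case it contains two rows picked at random.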

In [7]:
# Using random_state=0 to make the notebook reproducible
sol_linear = cross_val_score(
    estimator=linear_model,
    X=X, y=y,
    scoring='neg_mean_squared_error',
    cv=KFold(n_splits=10, shuffle=True, random_state=0))

In [8]:
-1 * np.mean(sol_linear)

11.507884923299752

In [9]:
# Using random_state=0 to make the notebook reproducible
sol_quadratic = cross_val_score(
    estimator=quadratic_model,
    X=X, y=y,
    scoring='neg_mean_squared_error',
    cv=KFold(n_splits=10, shuffle=True, random_state=0))

In [10]:
-1 * np.mean(sol_quadratic)

12.737040297390678

The linear regression model has the lower cross-validated MSE (about 11.51 versus 12.74 for the quadratic model): it is the model we will use in production!

In [11]:
# Refit the selected model on the full dataset before using it in production
winner = linear_model.fit(X, y)