In [47]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures
%matplotlib inline

### 5.3.1 The Validation Set Approach
We explore the use of the validation set approach in order to estimate the
test error rates that result from fitting various linear models on the Auto
data set.
Before we begin, we set a seed for the random number generator, so that
the reader will obtain precisely the same results as those shown below. In
Python this is done with np.random.seed() or, for scikit-learn utilities, by
passing a random_state argument. It is generally a good idea to set a random
seed when performing an analysis such as cross-validation that contains an
element of randomness, so that the results obtained can be reproduced
precisely at a later time.
We begin by using the train_test_split() function to split the set of
observations into two halves, by selecting a random subset of 196
observations out of the original 392 observations. We refer to these
observations as the training set.

In [42]:
# Load dataset
# We later want to fit using horsepower, so remove its rows with NA
auto = pd.read_csv('Data/Auto.csv', na_values='?')
auto = auto.dropna()

X = np.array(auto["horsepower"]).reshape(-1, 1)
Y = np.array(auto["mpg"]).reshape(-1, 1)

# Set NumPy's global seed; the split below is made reproducible by its own
# random_state argument
np.random.seed(1)

# Train and test set
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.5, random_state=0)

We then fit a linear regression using only the observations in the training set, and calculate the MSE on the test set.

In [43]:
model = LinearRegression().fit(X_train, y_train)
predictions = model.predict(X_test)

mse = mean_squared_error(y_test, predictions)
display(f"The MSE for linear regression is {mse}")

'The MSE for linear regression is 23.61661706966988'

Therefore, the estimated test MSE for the linear regression fit is 23.62. We can also estimate the test error for the quadratic and cubic regressions.

In [38]:
# Quadratic regression
poly2 = PolynomialFeatures(2)

X_v2= poly2.fit_transform(X)

X_train_v2, X_test_v2, y_train_v2, y_test_v2 = train_test_split(X_v2, Y, test_size=0.5, random_state=0)

model = LinearRegression().fit(X_train_v2, y_train_v2)
predictions_v2 = model.predict(X_test_v2)

mse_v2 = mean_squared_error(y_test_v2, predictions_v2)
display(f"The MSE for quadratic regression is {mse_v2}")

# Cubic regression

poly3 = PolynomialFeatures(3)

X_v3= poly3.fit_transform(X)

X_train_v3, X_test_v3, y_train_v3, y_test_v3 = train_test_split(X_v3, Y, test_size=0.5, random_state=0)

model = LinearRegression().fit(X_train_v3, y_train_v3)
predictions_v3 = model.predict(X_test_v3)

mse_v3 = mean_squared_error(y_test_v3, predictions_v3)
f"The MSE for cubic regression is {mse_v3}"


'The MSE for quadratic regression is 18.763031346897684'

'The MSE for cubic regression is 18.79694163262019'

These error rates are 18.76 and 18.80, respectively. If we choose a different
training set instead, then we will obtain somewhat different errors on the
validation set.

In [45]:
# Linear regression
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.5, random_state=3)
model = LinearRegression().fit(X_train, y_train)
predictions = model.predict(X_test)
mse = mean_squared_error(y_test, predictions)
display(f"The MSE for linear regression is {mse}")
# Quadratic regression
poly2 = PolynomialFeatures(2, include_bias = False)
X_v2= poly2.fit_transform(X)
X_train_v2, X_test_v2, y_train_v2, y_test_v2 = train_test_split(X_v2, Y, test_size=0.5, random_state=3)
model = LinearRegression().fit(X_train_v2, y_train_v2)
predictions_v2 = model.predict(X_test_v2)
mse_v2 = mean_squared_error(y_test_v2, predictions_v2)
display(f"The MSE for quadratic regression is {mse_v2}")
# Cubic regression
poly3 = PolynomialFeatures(3, include_bias = False)
X_v3= poly3.fit_transform(X)
X_train_v3, X_test_v3, y_train_v3, y_test_v3 = train_test_split(X_v3, Y, test_size=0.5, random_state=3)
model = LinearRegression().fit(X_train_v3, y_train_v3)
predictions_v3 = model.predict(X_test_v3)
mse_v3 = mean_squared_error(y_test_v3, predictions_v3)
f"The MSE for cubic regression is {mse_v3}"


'The MSE for linear regression is 20.755407959228602'

'The MSE for quadratic regression is 16.945106759516108'

'The MSE for cubic regression is 16.974378328026475'

Using this split of the observations into a training set and a validation
set, we find that the validation set error rates for the models with linear,
quadratic, and cubic terms are 20.76, 16.95, and 16.97, respectively.
These results are consistent with our previous findings: a model that
predicts mpg using a quadratic function of horsepower performs better than
a model that involves only a linear function of horsepower, and there is
little evidence in favor of a model that uses a cubic function of horsepower.
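
To see how much the validation set estimate moves from one split to the next, the comparison can be repeated over several seeds. The following is only a sketch under stated assumptions: it uses synthetic data in place of Auto, and the names `validation_mse`, `x_demo`, `y_demo`, and `mse_table` are hypothetical helpers introduced for illustration, not part of the lab.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for the Auto data (an assumption for illustration only):
# a curved relationship between one predictor and the response.
rng = np.random.default_rng(0)
x_demo = rng.uniform(50, 200, size=(400, 1))
y_demo = (70 - 0.6 * x_demo[:, 0] + 0.0018 * x_demo[:, 0] ** 2
          + rng.normal(0, 3, size=400))

def validation_mse(degree, X, y, seed):
    """Validation-set MSE of a polynomial fit for one random 50/50 split."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.5, random_state=seed)
    model = make_pipeline(
        PolynomialFeatures(degree, include_bias=False), LinearRegression())
    model.fit(X_tr, y_tr)
    return mean_squared_error(y_te, model.predict(X_te))

# One row per seed, one column per polynomial degree (1, 2, 3).
seeds = range(10)
mse_table = np.array(
    [[validation_mse(d, x_demo, y_demo, s) for d in (1, 2, 3)] for s in seeds])

print(mse_table.round(2))
print("mean over splits:", mse_table.mean(axis=0).round(2))
```

Each row of `mse_table` corresponds to one split, so reading down a column shows how much a single validation estimate depends on which observations happened to land in the training half.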

### 5.3.2 Leave-One-Out Cross-Validation
The LOOCV estimate can be computed automatically for any scikit-learn
estimator by passing cross_val_score a number of folds equal to the number
of observations.

In [55]:
def cv_poly_reg(degree, X, Y):
    poly = PolynomialFeatures(degree = degree, include_bias = False)
    X = poly.fit_transform(X)
    model = LinearRegression()
    # cv equal to the number of observations gives leave-one-out CV
    scores = cross_val_score(model, X, Y, cv= X.shape[0], scoring = "neg_mean_squared_error" )
    # The scorer returns negated MSE values, so flip the sign
    mse = -np.mean(scores)
    display(f"The MSE for degree {degree} regression is {mse}")

for degree in range(1,6):
    cv_poly_reg(degree, X, Y)

'The MSE for degree 1 regression is 24.231513517929226'

'The MSE for degree 2 regression is 19.24821312448967'

'The MSE for degree 3 regression is 19.334984064029197'

'The MSE for degree 4 regression is 19.424430310319195'

'The MSE for degree 5 regression is 19.03321350576735'

We see a sharp drop in the estimated test MSE between
the linear and quadratic fits, but then no clear improvement from using
higher-order polynomials.
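
Because scikit-learn scorers follow a "higher is better" convention, the `neg_mean_squared_error` scorer returns negated MSE values, and the sign has to be flipped to recover the usual MSE. The sketch below checks this on a small synthetic data set (an assumption for illustration, not the Auto data) by comparing `cross_val_score` with an explicit `LeaveOneOut` splitter against a hand-written leave-one-out loop. The variable names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Small synthetic data set (assumed for illustration only).
rng = np.random.default_rng(1)
x_small = rng.uniform(0, 10, size=(30, 1))
y_small = 2.0 + 1.5 * x_small[:, 0] + rng.normal(0, 1, size=30)

# The scorer returns negated squared errors, one per held-out observation.
neg_scores = cross_val_score(
    LinearRegression(), x_small, y_small,
    cv=LeaveOneOut(), scoring="neg_mean_squared_error")
loocv_mse = -neg_scores.mean()

# The same quantity by hand: refit n times, each time predicting the
# single observation that was left out.
squared_errors = []
for i in range(len(y_small)):
    keep = np.arange(len(y_small)) != i
    fit = LinearRegression().fit(x_small[keep], y_small[keep])
    pred = fit.predict(x_small[i:i + 1])[0]
    squared_errors.append((y_small[i] - pred) ** 2)
manual_mse = np.mean(squared_errors)

print(loocv_mse, manual_mse)
```

The two numbers should agree up to floating-point error, which confirms that the library result is the ordinary LOOCV mean squared error with its sign reversed.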

### 5.3.3 k-Fold Cross-Validation

The same function can also be used to implement k-fold CV. Below we use k = 10, a common choice for k, on the Auto data set. To be explicit, we reuse the code above with 10 folds instead of n folds in the cross-validation.
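
A minimal sketch of what the 10-fold version might look like follows. The helper `kfold_poly_mse` and the synthetic arrays `x_syn` and `y_syn` are hypothetical names introduced here so that the block runs on its own; in the notebook, the `X` and `Y` arrays built from the Auto data would be passed instead. Shuffling the folds with a fixed `random_state` is a choice made for this sketch, not something the text prescribes.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

def kfold_poly_mse(degree, X, y, k=10, seed=1):
    """Estimate the test MSE of a polynomial fit with k-fold CV."""
    model = make_pipeline(
        PolynomialFeatures(degree, include_bias=False), LinearRegression())
    # Shuffle before splitting so the folds do not depend on row order.
    folds = KFold(n_splits=k, shuffle=True, random_state=seed)
    scores = cross_val_score(
        model, X, y, cv=folds, scoring="neg_mean_squared_error")
    # Scores are negated MSE values, so flip the sign.
    return -scores.mean()

# Synthetic stand-in for the Auto data (assumed for illustration only).
rng = np.random.default_rng(2)
x_syn = rng.uniform(50, 200, size=(300, 1))
y_syn = (70 - 0.6 * x_syn[:, 0] + 0.0018 * x_syn[:, 0] ** 2
         + rng.normal(0, 3, size=300))

kfold_results = {d: kfold_poly_mse(d, x_syn, y_syn) for d in range(1, 6)}
for d, mse in kfold_results.items():
    print(f"degree {d}: 10-fold CV MSE = {mse:.2f}")
```

With k = 10 the model is refit only ten times per degree rather than n times, which is the main practical advantage over LOOCV.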