# ISLR_5-3-1_to_3

### 5-3-1 The Validation Set Approach

In [None]:
from __future__ import print_function
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split,cross_val_score
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
import matplotlib
matplotlib.style.use('ggplot')
%matplotlib inline

In [None]:
# load clean version of data set
df = pd.read_csv('../Data/Auto-cleaned.csv')
df.head(3)

In [None]:
df.plot.scatter(x='horsepower',y='mpg')

In [None]:
# split data with fixed random state
x = df.horsepower.values
x = x.reshape((len(x),1))
y = df.mpg.values
y = y.reshape((len(y),1))
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=196,random_state=0)

In [None]:
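# fit linear regression & compute MSE on the training and validation sets
LR = LinearRegression()
LR.fit(x_train,y_train)
y_pred = LR.predict(x_train)
MSE = mean_squared_error(y_train,y_pred)
print('linear, training, mean square error = %0.2f' % MSE)
# the validation set approach scores the model on the held-out half
MSE = mean_squared_error(y_test,LR.predict(x_test))
print('linear, validation, mean square error = %0.2f' % MSE)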

In [None]:
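# fit quadratic regression & compute MSE on the training and validation sets
poly = PolynomialFeatures(degree=2)
X_train = poly.fit_transform(x_train)
LR.fit(X_train,y_train)
y_pred = LR.predict(X_train)
MSE = mean_squared_error(y_train,y_pred)
print('quadratic, training, mean square error = %0.2f' % MSE)
# apply the same polynomial expansion to the held-out half before scoring
X_test = poly.transform(x_test)
MSE = mean_squared_error(y_test,LR.predict(X_test))
print('quadratic, validation, mean square error = %0.2f' % MSE)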

### 5-3-2 Leave-One-Out Cross-Validation

In [None]:
MSE = cross_val_score(LR,x,y,cv=len(y),scoring='neg_mean_squared_error')
print('number of cross-validation folds = %d' % len(MSE))
print('average MSE test error %0.2f' % MSE.mean())

The average MSE score is negative! This output suggests to me that computer scientists have a greater influence on scikit-learn than mathematicians; I believe that, for most mathematicians, a negative MSE would be a showstopper. The minus sign is there to keep the scikit-learn API uniform and pluggable: routines such as grid search always maximize a score, but MSE is an error that we want to minimize. Negating it turns the minimization into a maximization, because the largest negated MSE (the one closest to zero) corresponds to the smallest MSE. For further details see:

 MSE is negative when returned by cross_val_score #2439 
 https://github.com/scikit-learn/scikit-learn/issues/2439
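To convince myself that the sign flip is the only thing going on, here is a minimal sketch that compares the scores from `cross_val_score` with per-fold MSE values computed by hand. The data are synthetic stand-ins (an assumption, not the Auto data set), and the names `x_demo`, `y_demo`, `folds` and `manual` are made up for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score

# synthetic stand-in data (assumption: NOT the Auto data set)
rng = np.random.RandomState(0)
x_demo = rng.uniform(50, 200, size=(100, 1))
y_demo = 40.0 - 0.15 * x_demo[:, 0] + rng.normal(scale=3.0, size=100)

# unshuffled 5-fold split, reused for both computations
folds = KFold(n_splits=5)
scores = cross_val_score(LinearRegression(), x_demo, y_demo, cv=folds,
                         scoring='neg_mean_squared_error')

# recompute the MSE of each fold by hand for comparison
manual = []
for train_idx, test_idx in folds.split(x_demo):
    model = LinearRegression().fit(x_demo[train_idx], y_demo[train_idx])
    y_hat = model.predict(x_demo[test_idx])
    manual.append(mean_squared_error(y_demo[test_idx], y_hat))
manual = np.array(manual)

print('scores from cross_val_score:', scores)
print('per-fold MSE by hand       :', manual)
```

The two arrays should agree up to the sign, so multiplying the scores by -1.0 recovers the ordinary MSE.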

In [None]:
%%time
# leave-one-out cross-validation to select optimal polynomial degree
results = []
for k in range(5):
    poly = PolynomialFeatures(degree=k+1)
    X = poly.fit_transform(x)
    MSE = -1.0*cross_val_score(LR,X,y,cv=len(y),scoring='neg_mean_squared_error')
    results.append([k+1,MSE.mean()])

In [None]:
results=pd.DataFrame(results,columns=['degree','average test MSE'])
results

There is a sharp improvement at degree 2 (quadratic fit) and not much improvement after that. A good rule of thumb is to use the simplest model that we find acceptable, so we choose the quadratic fit. (Note: I converted the MSE scores back to positive numbers by multiplying by -1.0, so no one will laugh at me.) Make sure you understand the issue before you manually change the sign of a score, and carefully document what you have done!
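The leave-one-out loop above refits the model once per observation for every degree. For least-squares fits the textbook's LOOCV discussion gives a shortcut that needs only a single fit: divide each residual by one minus its leverage, then average the squares. Below is a minimal sketch that checks the shortcut against brute-force LOOCV. It uses synthetic data (an assumption, not the Auto data set), and `x_demo`, `y_demo`, `brute` and `shortcut` are names invented for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.preprocessing import PolynomialFeatures

# synthetic stand-in data (assumption: NOT the Auto data set)
rng = np.random.RandomState(1)
x_demo = rng.uniform(-2, 2, size=(60, 1))
y_demo = (1.0 + 2.0 * x_demo[:, 0] - 1.5 * x_demo[:, 0] ** 2
          + rng.normal(scale=0.5, size=60))

# quadratic design matrix; PolynomialFeatures adds the constant column
X_demo = PolynomialFeatures(degree=2).fit_transform(x_demo)

# brute force: one refit per observation
brute = -cross_val_score(LinearRegression(), X_demo, y_demo,
                         cv=LeaveOneOut(),
                         scoring='neg_mean_squared_error').mean()

# shortcut: a single least-squares fit plus the leverage of each point
Q, _ = np.linalg.qr(X_demo)            # hat matrix is H = Q Q^T
leverage = np.sum(Q ** 2, axis=1)      # diagonal of H
beta = np.linalg.lstsq(X_demo, y_demo, rcond=None)[0]
residuals = y_demo - X_demo.dot(beta)
shortcut = np.mean((residuals / (1.0 - leverage)) ** 2)

print('brute-force LOOCV MSE = %0.4f' % brute)
print('leverage shortcut MSE = %0.4f' % shortcut)
```

The shortcut only applies to ordinary least squares, so it does not replace the general `cross_val_score` loop for other estimators.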

### 5-3-3 k-Fold Cross-Validation

In [None]:
%%time
# 10 fold cross-validation to select optimal polynomial degree
results = []
for k in range(5):
    poly = PolynomialFeatures(degree=k+1)
    X = poly.fit_transform(x)
    MSE = -1.0*cross_val_score(LR,X,y,cv=10,scoring='neg_mean_squared_error')
    results.append([k+1,MSE.mean()])

In [None]:
results=pd.DataFrame(results,columns=['degree','average test MSE'])
results

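The errors are higher than with leave-one-out cross-validation. Part of this is expected, because each model in 10-fold cross-validation is trained on a smaller training set (about 90% of the data). The textbook, however, reports errors that are about the same for the two methods, and I am not sure why the numbers above do not match better. One possible contributor: with an integer `cv`, `cross_val_score` splits a regression data set into consecutive, unshuffled folds, so any ordering of the rows in the file affects the result. Note that 10-fold cross-validation is much faster than leave-one-out cross-validation.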
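To see how much fold assignment alone can matter, here is a minimal sketch that compares consecutive folds with shuffled folds on the same data. The data are synthetic and deliberately sorted by the predictor, as a stand-in for a file whose rows are ordered (an assumption; I have not checked whether this explains the Auto results). The names `x_sorted`, `y_sorted`, `plain` and `shuffled` are invented for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# synthetic, curved data sorted by x (assumption: NOT the Auto data set)
rng = np.random.RandomState(0)
x_sorted = np.sort(rng.uniform(50, 200, size=200)).reshape(-1, 1)
y_sorted = (60.0 - 0.45 * x_sorted[:, 0] + 0.002 * x_sorted[:, 0] ** 2
            + rng.normal(scale=1.0, size=200))

model = LinearRegression()  # a straight line, deliberately too simple

# consecutive folds: what cv=10 does for a regressor
plain = KFold(n_splits=10)
# randomly assigned folds with a fixed seed for reproducibility
shuffled = KFold(n_splits=10, shuffle=True, random_state=0)

mse_plain = -cross_val_score(model, x_sorted, y_sorted, cv=plain,
                             scoring='neg_mean_squared_error').mean()
mse_shuffled = -cross_val_score(model, x_sorted, y_sorted, cv=shuffled,
                                scoring='neg_mean_squared_error').mean()

print('10-fold MSE, consecutive folds = %0.2f' % mse_plain)
print('10-fold MSE, shuffled folds    = %0.2f' % mse_shuffled)
```

With consecutive folds on sorted rows, the first and last folds force the model to extrapolate, which inflates the error. Passing a shuffled `KFold` object as `cv` in the degree-selection loop would be one way to test whether that is what happens here.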