## The hold-out validation method

We split the dataset into a training set and a test set using `train_test_split`.
We then train two models (linear regression and quadratic regression) on the training set and assess their quality on the test set.
The model with the lower error on the test set is our "winner".
Finally, we retrain the winning model on the entire dataset, and it is ready to be used in production.

To simulate what would happen in production, I create an extra data point (not in the original `auto-mpg.csv` dataset) and see what our model predicts for it.
This extra point describes the Seat Marbella car.

**Note**: I am going to fix the *random seed* used by `train_test_split` so that this notebook is reproducible: two people running it should get the same results.
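
The effect of fixing the seed can be checked in isolation. The following is a minimal sketch on toy data (the arrays `X_toy` and `y_toy` are made up for illustration and are not part of the notebook): two calls to `train_test_split` with the same `random_state` should return the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for the real features and target (hypothetical values)
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# Two independent calls with the same random_state
_, X_test_a, _, y_test_a = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0)
_, X_test_b, _, y_test_b = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0)

# Both calls select exactly the same test rows
same_split = (np.array_equal(X_test_a, X_test_b)
              and np.array_equal(y_test_a, y_test_b))
print(same_split)
```

Without `random_state`, each call would draw a fresh shuffle, so the test-set errors reported further down could differ from run to run.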

In [1]:
import pandas as pd

In [2]:
d = pd.read_csv('auto-mpg.csv')

In [3]:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

In [4]:
X = d.drop('mpg', axis=1)
y = d.mpg

### Splitting the dataset into training and test sets

In [5]:
# Using random_state=0 to make the notebook reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

### Creating the two models using sklearn's pipelines

In [6]:
linear_model = make_pipeline(
    StandardScaler(),
    LinearRegression()
)

In [7]:
quadratic_model = make_pipeline(
    PolynomialFeatures(degree=2),
    StandardScaler(),
    LinearRegression()
)

### Training on the training set

In [8]:
linear_model.fit(X_train, y_train)

Pipeline(steps=[('standardscaler', StandardScaler()),
                ('linearregression', LinearRegression())])

In [9]:
quadratic_model.fit(X_train, y_train)

Pipeline(steps=[('polynomialfeatures', PolynomialFeatures()),
                ('standardscaler', StandardScaler()),
                ('linearregression', LinearRegression())])

### Getting predictions on the test set

In [10]:
linear_yhat = linear_model.predict(X_test)

In [11]:
quadratic_yhat = quadratic_model.predict(X_test)

### Estimating the MSE of the models on the test set

In [12]:
mean_squared_error(y_test, linear_yhat)

10.17069369566702

In [13]:
mean_squared_error(y_test, quadratic_yhat)

6.825965811077711

The quadratic regression model has the lower test MSE (about 6.83, versus 10.17 for the linear model): it is the model we will use in production!
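
Since MSE is measured in squared mpg, it can help to look at its square root (RMSE), which is in mpg. The sketch below only reuses the two MSE values printed by cells 12 and 13; the variable names are introduced here for illustration.

```python
import math

# Test-set MSE values copied from the outputs of cells 12 and 13
linear_mse = 10.17069369566702
quadratic_mse = 6.825965811077711

# RMSE is the square root of MSE, so it is expressed in mpg
linear_rmse = math.sqrt(linear_mse)
quadratic_rmse = math.sqrt(quadratic_mse)

# Relative reduction in MSE when moving from the linear to the quadratic model
mse_reduction = 1 - quadratic_mse / linear_mse

print(f"Linear RMSE: {linear_rmse:.2f} mpg")
print(f"Quadratic RMSE: {quadratic_rmse:.2f} mpg")
print(f"MSE reduction: {mse_reduction:.1%}")
```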

### Retraining the winning model on the entire dataset

In [14]:
winner = quadratic_model.fit(X, y)
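
One detail worth knowing: `fit` returns the estimator itself, so after the cell above `winner` and `quadratic_model` are the same object, and the version fitted only on the training set is overwritten. If both fitted models are wanted, one possible approach is `sklearn.base.clone`, sketched here on synthetic data (`X_all`, `y_all`, `model`, and `final_model` are hypothetical names, not from the notebook).

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data (hypothetical), not the auto-mpg dataset
rng = np.random.default_rng(0)
X_all = rng.normal(size=(50, 2))
y_all = X_all[:, 0] ** 2 + X_all[:, 1]

model = make_pipeline(
    PolynomialFeatures(degree=2),
    StandardScaler(),
    LinearRegression()
)

# fit() returns the pipeline itself, not a new object
same_object = model.fit(X_all[:40], y_all[:40]) is model

# clone() builds an unfitted copy with the same hyperparameters,
# so fitting it on all the data leaves `model` untouched
final_model = clone(model).fit(X_all, y_all)
separate_object = final_model is not model

print(same_object, separate_object)
```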

## Example of using the model in production

In [15]:
seat_marbella = [
    4,           # Cylinders
    899 * 0.061, # Displacement: 899 cc => cubic inches
    41,          # Horsepower
    680 * 2.20,  # Weight: 680 kg => pounds
    19.2,        # Acceleration: time to go from 0 to 100 km/h, roughly 0 to 60 mph
    83,          # Year: 1983
    2,           # Origin: 2 (Europe)
]
seat_marbella_lkm = 5.1 # Litres per 100 km
seat_marbella_mpg = (100 * 3.78) / (1.61 * seat_marbella_lkm) # litres per 100 km => miles per US gallon
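
The fuel-consumption conversion can also be written as a small helper. This is a sketch, assuming that the dataset's `mpg` column uses US gallons (as the factor 3.78 above suggests); the function name `l_per_100km_to_mpg` and the constant names are hypothetical.

```python
LITRES_PER_US_GALLON = 3.785411784  # exact by definition of the US gallon
KM_PER_MILE = 1.609344              # exact by definition of the international mile

def l_per_100km_to_mpg(l_per_100km: float) -> float:
    """Convert litres per 100 km to miles per US gallon."""
    km_per_litre = 100 / l_per_100km
    miles_per_litre = km_per_litre / KM_PER_MILE
    return miles_per_litre * LITRES_PER_US_GALLON

mpg = l_per_100km_to_mpg(5.1)
print(f"{mpg:.3f}")
```

With the unrounded constants the result is about 46.1 mpg, slightly different from the value obtained with the rounded factors 3.78 and 1.61.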

In [16]:
predicted_seat_marbella_mpg = winner.predict([seat_marbella])

In [17]:
print(f"Real value: {seat_marbella_mpg:.3f}, "
      f"Predicted value: {predicted_seat_marbella_mpg[0]:.3f}")

Real value: 46.036, Predicted value: 47.372
