## The hold-out validation method

We split the dataset into training and test data, using `train_test_split`.
We then train two models (linear regression and quadratic regression) on the training data and assess their quality on the test data.
The model with the lower error on the test data will be our "winner".
In the end, we retrain the winning model on the entire dataset and we are ready to use it in production.

To simulate what would happen in production, I create an extra data point (not originally in the `auto-mpg.csv` dataset) and see what our model predicts.
This extra point describes the Seat Marbella car.

**Difference from the HoldOut notebook**: in this notebook we treat the "origin" column as categorical.
Therefore, we will apply one-hot encoding to it.
Because we want neither to standardise 0-1 columns nor to take polynomial features of them, we now have to split our preprocessing pipeline:
* The categorical column "origin" only gets one-hot encoded.
* All other columns, which are numeric, get the standard treatment: polynomial features (in the case of the quadratic model) and standardisation.

To achieve this result we use scikit-learn's `ColumnTransformer`, which allows us to apply different preprocessing steps to different columns.
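
As a minimal sketch of how `ColumnTransformer` routes columns to different steps, the example below uses a small toy table. The values and the names `toy` and `toy_preprocessing` are made up for illustration and are not taken from `auto-mpg.csv`.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data (hypothetical values, not from auto-mpg.csv)
toy = pd.DataFrame({
    'origin': [1, 2, 3, 1],
    'weight': [3500.0, 2200.0, 2100.0, 4000.0],
    'hp': [130.0, 70.0, 65.0, 150.0],
})

toy_preprocessing = ColumnTransformer(transformers=[
    # 'origin' is only one-hot encoded
    ('categorical', OneHotEncoder(), ['origin']),
    # the numeric columns are only standardised
    ('numerical', StandardScaler(), ['weight', 'hp']),
])

toy_transformed = toy_preprocessing.fit_transform(toy)

# 3 one-hot columns for 'origin' followed by 2 standardised numeric columns
print(toy_transformed.shape)  # (4, 5)
```

The transformed columns appear in the order in which the transformers are listed, so the one-hot columns come first here.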

**Note**: I am going to fix the *random seed* used by `train_test_split` so that this notebook is reproducible: two people running it should get the same results.
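
To illustrate what fixing the seed buys us, here is a small sketch on a made-up array (`values` is hypothetical, not the notebook's data): calling `train_test_split` twice with the same `random_state` returns the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

values = np.arange(10)  # made-up data, for illustration only

# Same seed => same shuffle => same split on every call
first_train, first_test = train_test_split(
    values, test_size=0.2, random_state=0)
second_train, second_test = train_test_split(
    values, test_size=0.2, random_state=0)

print(np.array_equal(first_test, second_test))  # True
```

Without `random_state`, each call would draw a different shuffle, and the reported errors would change from run to run.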

In [1]:
import pandas as pd

In [2]:
d = pd.read_csv('auto-mpg.csv')
# `origin` holds integer codes: 1 = america, 2 = europe, 3 = japan.
# We keep the codes as they are and only mark the column as categorical.
d.origin = pd.Categorical(d.origin)

In [3]:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler, PolynomialFeatures, OneHotEncoder
from sklearn.pipeline import make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

In [4]:
X = d.drop('mpg', axis=1)
y = d.mpg

### Splitting the dataset into training and test sets

In [5]:
# Using random_state=0 to make the notebook reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

### Creating the two models using sklearn's pipelines

In [6]:
categorical_columns = [
    col for col in X.columns if d[col].dtype == 'category']
numerical_columns = [
    col for col in X.columns if col not in categorical_columns]

In [7]:
linear_preprocessing = ColumnTransformer(transformers=[
    ('categorical', OneHotEncoder(), categorical_columns),
    ('numerical', StandardScaler(), numerical_columns)
])

linear_model = make_pipeline(
    linear_preprocessing,
    LinearRegression()
)

In [8]:
quadratic_preprocessing = ColumnTransformer(transformers=[
    ('categorical', OneHotEncoder(), categorical_columns),
    ('numerical',
     make_pipeline(PolynomialFeatures(degree=2), StandardScaler()),
     numerical_columns)
])

quadratic_model = make_pipeline(
    quadratic_preprocessing,
    LinearRegression()
)
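
To make the effect of `PolynomialFeatures(degree=2)` concrete, the sketch below expands a single made-up sample with two features, a = 2 and b = 3. The output columns are the constant term, the original features, and all degree-2 products. By the same counting, the six numeric columns of this dataset should expand to 28 columns (1 constant, 6 linear, 6 squares and 15 pairwise products).

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One made-up sample with two features: a = 2, b = 3
toy_row = np.array([[2.0, 3.0]])

# Columns of the result: 1, a, b, a^2, a*b, b^2
expanded = PolynomialFeatures(degree=2).fit_transform(toy_row)
print(expanded)  # [[1. 2. 3. 4. 6. 9.]]
```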

### Training on the training set

In [9]:
linear_model.fit(X_train, y_train)

Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('categorical',
                                                  OneHotEncoder(), ['origin']),
                                                 ('numerical', StandardScaler(),
                                                  ['cylinders', 'displacement',
                                                   'hp', 'weight',
                                                   'acceleration', 'year'])])),
                ('linearregression', LinearRegression())])

In [10]:
quadratic_model.fit(X_train, y_train)

Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('categorical',
                                                  OneHotEncoder(), ['origin']),
                                                 ('numerical',
                                                  Pipeline(steps=[('polynomialfeatures',
                                                                   PolynomialFeatures()),
                                                                  ('standardscaler',
                                                                   StandardScaler())]),
                                                  ['cylinders', 'displacement',
                                                   'hp', 'weight',
                                                   'acceleration', 'year'])])),
                ('linearregression', LinearRegression())])

### Getting predictions on the test set

In [11]:
linear_yhat = linear_model.predict(X_test)

In [12]:
quadratic_yhat = quadratic_model.predict(X_test)

### Estimating the MSE of the models on the test set

In [13]:
mean_squared_error(y_test, linear_yhat)

10.022424640636537

In [14]:
mean_squared_error(y_test, quadratic_yhat)

6.357424643443906

The quadratic regression model has the lower test MSE (about 6.36 against 10.02 for the linear model): it is the model we will use in production!
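
As a reminder of what the comparison measures, the sketch below computes the MSE by hand on made-up numbers (`y_true` and `y_pred` are hypothetical) and checks it against `mean_squared_error`. It also takes the square root of the two test MSEs reported above, which expresses the errors in mpg: roughly 3.17 for the linear model and 2.52 for the quadratic one.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up targets and predictions, for illustration only
y_true = np.array([20.0, 30.0, 25.0])
y_pred = np.array([22.0, 27.0, 25.0])

# MSE = mean of the squared residuals: (4 + 9 + 0) / 3
manual_mse = np.mean((y_true - y_pred) ** 2)
sklearn_mse = mean_squared_error(y_true, y_pred)
print(manual_mse, sklearn_mse)

# Root of the test MSEs reported in this notebook, in mpg units
linear_rmse = np.sqrt(10.022424640636537)
quadratic_rmse = np.sqrt(6.357424643443906)
print(f"{linear_rmse:.2f} {quadratic_rmse:.2f}")  # 3.17 2.52
```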

### Retraining the winning model on the entire dataset

In [15]:
winner = quadratic_model.fit(X, y)

## Example of using the model in production

In [16]:
seat_marbella = pd.DataFrame({
    'cylinders': [4],
    'displacement': [899 * 0.061],  # cc => cubic inches
    'hp': [41],
    'weight': [680 * 2.20],  # kg => lb
    'acceleration': [19.2],
    'year': [83],
    'origin': [2]  # 2 = europe
})
seat_marbella_lkm = 5.1  # Litres per 100 km
# litres/100 km => miles per gallon (3.78 litres per gallon, 1.61 km per mile)
seat_marbella_mpg = (100 * 3.78) / (1.61 * seat_marbella_lkm)

In [17]:
predicted_seat_marbella_mpg = winner.predict(seat_marbella)

In [18]:
print(f"Real value: {seat_marbella_mpg:.3f}, "
      f"Predicted value: {predicted_seat_marbella_mpg[0]:.3f}")

Real value: 46.036, Predicted value: 47.583
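
The unit conversion used for the real value can be wrapped in a small helper. This is only a sketch: the function name `l_per_100km_to_mpg` and the constant names are hypothetical, and the constants are the same rounded values used in the cell above.

```python
# Rounded conversion factors, as used in the notebook
LITRES_PER_GALLON = 3.78
KM_PER_MILE = 1.61


def l_per_100km_to_mpg(l_per_100km):
    """Convert fuel consumption in litres per 100 km to miles per gallon."""
    miles_per_100km = 100 / KM_PER_MILE
    gallons_per_100km = l_per_100km / LITRES_PER_GALLON
    return miles_per_100km / gallons_per_100km


print(f"{l_per_100km_to_mpg(5.1):.3f}")  # 46.036
```

Note that the two measures are inversely related: a lower litres-per-100-km figure corresponds to a higher mpg figure.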
