# Cross-Validation

A better way to test models.

A validation set is typically about 20% of the available data. A small validation set makes model scores noisy, because the result depends heavily on which rows happen to land in it. Unfortunately, adding more validation data means taking away from the training data, and vice versa.

Cross-validation breaks the full dataset into chunks (folds) and runs several experiments, each using a different fold for validation and the remaining folds for training. In the end every row has been used for validation exactly once, which gives a more accurate measure of model quality.
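
As a rough illustration of the fold mechanics, the sketch below splits ten row indices into five folds in plain Python. The helper name `k_fold_indices` is hypothetical and the folds are contiguous and unshuffled; it is only meant to show the idea, not how scikit-learn implements its splitters.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k (train, validation) index pairs.

    Hypothetical helper for illustration; folds are contiguous and unshuffled.
    """
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds = []
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        folds.append((train_idx, val_idx))
        start += size
    return folds


folds = k_fold_indices(n_samples=10, k=5)
for fold_number, (train_idx, val_idx) in enumerate(folds):
    print(f"Fold {fold_number}: validate on {val_idx}, train on {train_idx}")
```

Each index shows up in exactly one validation list, so every row contributes to the validation score once.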

Cross-validation is worth using on smaller datasets. On larger datasets, where the validation set is already sufficiently large, it is usually unnecessary and can be very slow.
One way to check is to run cross-validation once on the larger dataset: if the scores are very similar across folds, a single validation set is probably enough.

In [1]:
import pandas as pd

# Read the data
data = pd.read_csv('../data/melb_data.csv')

# Select subset of predictors
cols_to_use = ['Rooms', 'Distance', 'Landsize', 'BuildingArea', 'YearBuilt']
X = data[cols_to_use]

# Select target
y = data.Price

Define a pipeline. Cross-validation is very difficult to do without one.

In [2]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

my_pipeline = Pipeline(steps=[
    ('preprocessor', SimpleImputer()),
    ('model', RandomForestRegressor(n_estimators=50, random_state=0))
])
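
To see what the pipeline saves, here is a sketch of the same kind of evaluation written by hand. It assumes synthetic stand-in data (`X_demo`, `y_demo`, not the Melbourne dataset) and uses `KFold` for the splits; the point is that the imputer has to be refit on the training rows of every fold, which is easy to get wrong manually.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold

# Synthetic stand-in data (an assumption, not the Melbourne dataset).
rng = np.random.RandomState(0)
X_demo = rng.rand(100, 3)
y_demo = X_demo @ np.array([1.0, 2.0, 3.0]) + rng.normal(scale=0.1, size=100)
X_demo[rng.rand(100, 3) < 0.1] = np.nan  # knock out roughly 10% of the values

fold_maes = []
for train_idx, val_idx in KFold(n_splits=5).split(X_demo):
    # Fit the imputer on the training fold only, then reuse it on the validation fold.
    imputer = SimpleImputer()
    X_train = imputer.fit_transform(X_demo[train_idx])
    X_val = imputer.transform(X_demo[val_idx])

    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(X_train, y_demo[train_idx])
    fold_maes.append(mean_absolute_error(y_demo[val_idx], model.predict(X_val)))

print(fold_maes)
```

With a pipeline, the fit and transform bookkeeping for each fold is handled automatically.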

Next, obtain cross-validation scores with the cross_val_score() function.
Set the number of folds with the cv parameter.

In [3]:
from sklearn.model_selection import cross_val_score

# Multiply by -1 since sklearn calculates *negative* MAE
scores = -1 * cross_val_score(my_pipeline, X, y, cv=5, scoring='neg_mean_absolute_error')

print("MAE scores:\n", scores)

MAE scores:
 [301628.7893587  303164.4782723  287298.331666   236061.84754543
 260383.45111427]


Here we selected 'neg_mean_absolute_error' as the measure of model quality. Other options are listed in the scikit-learn documentation: http://scikit-learn.org/stable/modules/model_evaluation.html
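
As a small sketch of swapping the metric, the snippet below runs cross_val_score() with three scoring strings from that documentation page. The data comes from `make_regression` and the model is a plain `LinearRegression`; both are assumptions chosen to keep the example fast, not part of the original notebook.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data (an assumption, used only to keep the example quick).
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
demo_model = LinearRegression()

mean_by_metric = {}
for scoring in ["neg_mean_absolute_error", "neg_mean_squared_error", "r2"]:
    demo_scores = cross_val_score(demo_model, X_demo, y_demo, cv=5, scoring=scoring)
    mean_by_metric[scoring] = demo_scores.mean()
    print(scoring, mean_by_metric[scoring])
```

The error-based scorers are reported as negative numbers, following scikit-learn's convention that higher scores are better, while 'r2' is reported as is.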

The cross-validation scores are typically averaged to give a single measure of model quality.

In [4]:
print("Average MAE score (across experiments):")
print(scores.mean())

Average MAE score (across experiments):
277707.3795913405
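
It can also help to look at how much the folds disagree, not just at their average. The sketch below recomputes the mean from the five MAE values printed earlier and adds the sample standard deviation using Python's `statistics` module; reporting the spread is a suggestion here, not something the original notebook does.

```python
import statistics

# Per-fold MAE values copied from the cross_val_score output shown earlier.
fold_scores = [301628.7893587, 303164.4782723, 287298.331666,
               236061.84754543, 260383.45111427]

mean_mae = statistics.mean(fold_scores)
std_mae = statistics.stdev(fold_scores)  # sample standard deviation across folds
print(f"MAE: {mean_mae:.0f} +/- {std_mae:.0f}")
```

A large spread relative to the mean suggests that a single train/validation split could give a misleading score.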
