# Validation Sets

### Introduction

In previous lessons we learned why we split our data into a training set and a test set.  The training set is used to fit the model and the test set is used to evaluate how well the model performs.  It turns out we should really split our data one additional time, so that we have a training set, a validation set, and a test set.

In this lesson we'll learn why we split into three groups and what each group is responsible for.

### Setting up our data and models

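For this lesson, we'll use the Boston housing dataset.  We start by splitting our data into two groups.  (Note that `load_boston` was removed in scikit-learn 1.2, so the code below needs an earlier version.)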

In [13]:
from sklearn.datasets import load_boston
dataset = load_boston()

X = dataset['data']
y = dataset['target']

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=3)

And then we train our model on the training set.

In [14]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

### Review: Why split into two groups

Now, if we score the model, we see that our $r^2$ score is higher on the training set than on the test set.

In [15]:
model.score(X_train, y_train)

0.7419034960343789

In [16]:
model.score(X_test, y_test)

0.7147895265576858

This is expected.  It occurs because when we train our model, the model simply tries to fit to the training data.  The training data has a degree of randomness in it, and the model fits to that randomness as well as to the underlying trend.  Because we do not expect the same randomness to occur again, our model tends to match the training data better than holdout data it hasn't seen.

### Choosing a model

Now this seems like a good strategy.  But then comes the task of selecting features for our model.  In this dataset, we have thirteen different features, but in many datasets we will have hundreds of features.

In [17]:
X.shape

(506, 13)

When we select features for our model, we do so by trying models with different collections of features and evaluating how well each performs on the test set.  Then we generally choose the model that performs best on the test set.

Now if we use this model to predict future data, will it perform as well as it did on the test set?  Probably not.

The problem is that when selecting our features, we may be trying hundreds if not thousands of different models.  And then, by evaluating a model based on how it performs on the test set, we may be selecting a model that just happens to match up to the randomness in the data of the test set.  And of course we don't expect to see that randomness again in future data.
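The effect described above can be simulated. The following is a minimal sketch under stated assumptions, not part of the original lesson: the features and the target are independent noise, each candidate model is a one-feature least-squares line, and the helper names (`r2_scores`, `run_trial`) and the sample sizes are hypothetical choices for illustration. Because nothing here has real predictive power, any positive score for the selected model comes from matching the randomness of the test set.

```python
import numpy as np

def r2_scores(X, y, slopes, intercepts):
    """r^2 of each one-feature model y ~ slope * x + intercept (one per column of X)."""
    predictions = X * slopes + intercepts
    ss_res = ((y[:, None] - predictions) ** 2).sum(axis=0)
    ss_tot = ((y - y.mean()) ** 2).sum()
    return 1 - ss_res / ss_tot

def run_trial(rng, n_train=100, n_test=50, n_future=5000, n_features=500):
    # Features and target are independent noise (an assumption of this sketch).
    X_train, y_train = rng.normal(size=(n_train, n_features)), rng.normal(size=n_train)
    X_test, y_test = rng.normal(size=(n_test, n_features)), rng.normal(size=n_test)

    # Fit one least-squares line per feature on the training set.
    X_centered = X_train - X_train.mean(axis=0)
    slopes = X_centered.T @ (y_train - y_train.mean()) / (X_centered ** 2).sum(axis=0)
    intercepts = y_train.mean() - slopes * X_train.mean(axis=0)

    # Select the candidate that scores best on the test set.
    test_scores = r2_scores(X_test, y_test, slopes, intercepts)
    best = int(np.argmax(test_scores))

    # Score the selected model on fresh data that played no part in the selection.
    # Every feature is noise, so one new noise column stands in for the chosen feature.
    x_future, y_future = rng.normal(size=(n_future, 1)), rng.normal(size=n_future)
    future_score = r2_scores(x_future, y_future, slopes[best], intercepts[best])[0]
    return test_scores[best], future_score

rng = np.random.default_rng(0)
results = np.array([run_trial(rng) for _ in range(30)])
mean_selected_test_score, mean_future_score = results.mean(axis=0)

print(f"mean test score of the selected model:   {mean_selected_test_score:.3f}")
print(f"mean score of the same model on new data: {mean_future_score:.3f}")
```

Averaged over the trials, the score that was used for selection tends to look better than the score the same model earns on new data, even though no candidate is actually better than any other.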

### Our remedy

We can fix this by splitting our data into three groups: a training set, a validation set, and a test set.

* **Training set**  We use the training set to fit each of the models.
* **Validation set**  We use the validation set to see how well each model performs on data it has not yet seen.
* **Test set**  Because we want to avoid selecting a model that happens to perform well on the validation set due to random chance, when we are done choosing our model, we evaluate our final model on the test set.
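The three roles can be sketched end to end. This is an illustrative sketch rather than the lesson's own code: it assumes synthetic data from `make_regression` in place of the lesson's dataset, and the candidate models (linear regressions on the first `k` features) and the names `candidates` and `val_scores` are hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption of this sketch).
X, y = make_regression(n_samples=500, n_features=13, n_informative=5,
                       noise=20.0, random_state=0)

# Hold out the test set first, then carve a validation set out of the rest.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=1)

# Hypothetical candidates: models that use the first k features, for k = 1..13.
candidates = [list(range(k)) for k in range(1, X.shape[1] + 1)]

# Training set: fit every candidate.  Validation set: compare them.
val_scores = {}
for features in candidates:
    model = LinearRegression().fit(X_train[:, features], y_train)
    val_scores[tuple(features)] = model.score(X_val[:, features], y_val)

best_features = list(max(val_scores, key=val_scores.get))

# Test set: used once, only for the model we have already chosen.
final_model = LinearRegression().fit(X_train[:, best_features], y_train)
test_score = final_model.score(X_test[:, best_features], y_test)

print("chosen features:", best_features)
print("validation score:", val_scores[tuple(best_features)])
print("test score:", test_score)
```

The test score is computed only once, after the choice is made, so it is not inflated by the comparison across candidates in the way the validation score can be.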

Splitting the data into these three groups takes two calls to `train_test_split`.

In [1]:
from sklearn.datasets import load_boston
dataset = load_boston()

X = dataset['data']
y = dataset['target']

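from sklearn.model_selection import train_test_split
# First hold out 20% of the data as the test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Then split the remaining 80% into a training set (60% of it) and a validation set (40% of it).
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.4, random_state=1)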

So we can see that we started with a dataset of the following shape.

In [2]:
X.shape

(506, 13)

And we have split our dataset into the following group sizes.

In [5]:
X_train.shape
# (242, 13)

X_val.shape
# (162, 13)

X_test.shape
# (102, 13)

(102, 13)
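The group sizes follow from the two `test_size` fractions. As a small worked check, assuming that `train_test_split` rounds the held-out count up for a fractional `test_size` and gives the remainder to the other group, the arithmetic can be reproduced with the standard library alone (`split_sizes` is a hypothetical helper, not a scikit-learn function):

```python
from math import ceil

def split_sizes(n_samples, test_size):
    """Return (kept, held_out) counts for a fractional test_size, rounding held_out up."""
    n_held_out = ceil(test_size * n_samples)
    return n_samples - n_held_out, n_held_out

# First split: 20% of all 506 rows become the test set.
n_rest, n_test = split_sizes(506, 0.2)

# Second split: 40% of the remaining rows become the validation set.
n_train, n_val = split_sizes(n_rest, 0.4)

print(n_train, n_val, n_test)
```

Note that the second fraction applies to the rows left after the first split, so a `test_size` of 0.4 there corresponds to roughly 32% of the full dataset, not 40%.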

Once we have segmented our data, the ideal move is to store our test data in its own file, and never look at this file until we are ready to evaluate our final model.  We can do the same for our training and validation sets, so that while we develop our model we only load data that is not in our test set.

In [27]:
import pandas as pd
train_df = pd.DataFrame(X_train, columns = dataset['feature_names'])
train_df.loc[:, 'y'] = y_train
train_df.to_feather('./train.feather')

In [31]:
validate_df = pd.DataFrame(X_val, columns = dataset['feature_names'])
validate_df.loc[:, 'y'] = y_val
validate_df.to_feather('./validate.feather')

> We can also choose to keep the training and validation sets combined, and later split them.  The key is just to have one set of data -- our test set -- that we store in a separate file and do not look at until the end.

In [32]:
test_df = pd.DataFrame(X_test, columns = dataset['feature_names'])
test_df.loc[:, 'y'] = y_test
test_df.to_feather('./test.feather')

### Summary

In this lesson, we learned about an issue that can occur when developing a machine learning model.  We typically develop a model by training many different candidate models and evaluating how each performs on a holdout set.  With each model that we try and then evaluate on the holdout set, we increase the chance of finding a model that performs well only because it matches the randomness in the holdout set.  To correct for this, we split our data into three groups.  The first group is for training, the second is for comparing models, and the third is for evaluating how well our chosen model will perform on data that neither it nor we have previously seen.

### Resources

[Slate Gelman](https://slate.com/technology/2013/07/statistics-and-psychology-multiple-comparisons-give-spurious-results.html)

[Garden of Forking Paths](https://www.americanscientist.org/article/the-statistical-crisis-in-science#)