# Test Sets

### Review

In the last lesson, we saw how we can correct for our different sources of error.  We know that, given the chance, our models will fit to the randomness in the training data.  To correct for this, we set aside a holdout set and choose the complexity of the model based on how the model scores on that holdout set.

Let's review this and show one last component to evaluating our machine learning model.

1. We load our feature data sets.  

In this lesson, each of our feature datasets has length 150.

In [1]:
from data import input_temps, temps_and_is_weekends, temps_weekends_and_ages, customers_with_errors

In [2]:
feature_datasets = [input_temps, temps_and_is_weekends, temps_weekends_and_ages]

In [4]:
list(map(lambda dataset: len(dataset), feature_datasets))

[150, 150, 150]

2. Then we split these datasets into training and holdout sets.

In [5]:
split_datasets = []
for dataset in feature_datasets:
    training_data = dataset[:100]
    holdout_data = dataset[100:]
    split_dataset = (training_data, holdout_data)
    split_datasets.append(split_dataset)
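
In the steps that follow, the features and the targets (`customers_with_errors`) are sliced separately, so both slices need to use the same cut point. Here is a minimal sketch of one way to keep them aligned. The helper `split_at` and the stand-in data are hypothetical and are not the lesson's datasets.

```python
def split_at(features, targets, cut):
    """Split features and targets at the same row index (hypothetical helper)."""
    if len(features) != len(targets):
        raise ValueError("features and targets must have the same length")
    training = (features[:cut], targets[:cut])
    holdout = (features[cut:], targets[cut:])
    return training, holdout


# Made-up stand-in data: 150 rows with one feature each.
example_features = [[temp] for temp in range(150)]
example_targets = [3 * temp for temp in range(150)]

(train_X, train_y), (holdout_X, holdout_y) = split_at(example_features, example_targets, 100)
print(len(train_X), len(holdout_X))
```

Returning the targets alongside the features means a later change to the cut point only has to be made in one place.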

3. We train each model on its respective training set.

In [7]:
from sklearn.linear_model import LinearRegression
training_models = []
for dataset in split_datasets:
    model = LinearRegression()
    training_feature_data = dataset[0]
    model.fit(training_feature_data, customers_with_errors[0:100])
    training_models.append(model)
training_models

[LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False),
 LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False),
 LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)]

In [8]:
list(map(lambda training_model: training_model.coef_, training_models))

[array([3.12480048]),
 array([ 3.11082792, 45.65739545]),
 array([ 3.09483918, 47.24264466,  0.45487237])]

4. We evaluate our models on the training and holdout datasets.

In [9]:
split_models_with_data = list(zip(training_models, split_datasets))

In [10]:
training_scores = []
for split_model_with_data in split_models_with_data:
    selected_model = split_model_with_data[0]
    training_data = split_model_with_data[1][0]
    training_score = selected_model.score(training_data, customers_with_errors[:100])
    training_scores.append(training_score)

In [11]:
holdout_scores = []
for split_model_with_data in split_models_with_data:
    selected_model = split_model_with_data[0]
    holdout_data = split_model_with_data[1][1]
    holdout_score = selected_model.score(holdout_data, customers_with_errors[100:])
    holdout_scores.append(holdout_score)

In [14]:
training_scores

[0.8924818514798912, 0.9457660708435309, 0.9491031227430928]

In [15]:
holdout_scores

[0.6179093480262522, 0.7780004499162101, 0.7689173517214827]
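
For `LinearRegression`, the `score` method returns the coefficient of determination (R squared): one minus the ratio of the model's squared error to the squared error of always predicting the mean. The sketch below computes that quantity by hand on made-up numbers. The helper `r_squared` is hypothetical and is only meant to show what the scores above measure.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    residual_ss = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    total_ss = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - residual_ss / total_ss


actual = [1, 2, 3]  # made-up target values

perfect_score = r_squared(actual, [1, 2, 3])    # predictions match exactly
mean_only_score = r_squared(actual, [2, 2, 2])  # always predict the mean
reversed_score = r_squared(actual, [3, 2, 1])   # worse than predicting the mean
print(perfect_score, mean_only_score, reversed_score)
```

A score of 1 means a perfect fit, 0 means no better than predicting the mean, and the score can go negative when the predictions are worse than that.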

Then after looking at our training and holdout scores, we choose the model where the holdout score peaks, which is our second model.

| model                  | train score | holdout score |
| ---------------------- |:-----------:|:-------------:|
| temps                  | .892        | .618          |
| temps, weekend         | .946        | .778          |
| temps, weekend, ages   | .949        | .769          |
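
The selection rule, "choose the model where the holdout score peaks", can be written out in a few lines. This sketch copies the holdout scores printed above. The list `model_names` is only a set of labels added here for readability.

```python
model_names = ["temps", "temps, weekend", "temps, weekend, ages"]
holdout_scores = [0.6179093480262522, 0.7780004499162101, 0.7689173517214827]

# Index of the highest holdout score.
best_index = max(range(len(holdout_scores)), key=lambda index: holdout_scores[index])
print(model_names[best_index])
```

This prints `temps, weekend`, the second model.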

## Just a little bit more

So we choose our second model, as it scores the best on the holdout set, and place it into production.  Temperature and weekend data are our best combination of features for predicting the number of customers.  But then we see that our holdout score tends to overstate how our model performs in production.

This may be surprising, but it is yet another case of overfitting.  In the first case, when we trained our models, we risked overfitting because a model can simply find parameter values that match the randomness in the training data.  In the second case, when we tune our model by choosing among different sets of features (and, as we'll see, by making other modifications), the tuning itself can overfit to our holdout set.  Tuning and tweaking a model can take hours, so the phenomenon where we settle on choices that just randomly happen to best fit our holdout set is not uncommon.
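
A toy simulation can make this concrete. It is a sketch under strong assumptions and does not use the lesson's regression models: every "model variant" here is a coin-flipping classifier with no real skill, and all labels are made up. We pick the variant that scores best on a small holdout set, then check it on fresh data.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

n_variants, n_holdout, n_fresh = 200, 50, 1000

# Made-up binary labels for a small holdout set and a larger fresh set.
holdout_labels = [rng.randint(0, 1) for _ in range(n_holdout)]
fresh_labels = [rng.randint(0, 1) for _ in range(n_fresh)]


def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    matches = sum(p == y for p, y in zip(predictions, labels))
    return matches / len(labels)


# Each variant guesses at random, so its true accuracy is about 0.5.
variants = []
for _ in range(n_variants):
    holdout_predictions = [rng.randint(0, 1) for _ in range(n_holdout)]
    fresh_predictions = [rng.randint(0, 1) for _ in range(n_fresh)]
    variants.append((holdout_predictions, fresh_predictions))

# "Tuning": keep whichever variant happens to score best on the holdout set.
holdout_accuracies = [accuracy(h, holdout_labels) for h, _ in variants]
best_index = max(range(n_variants), key=lambda i: holdout_accuracies[i])

best_holdout_accuracy = holdout_accuracies[best_index]
best_fresh_accuracy = accuracy(variants[best_index][1], fresh_labels)
print(best_holdout_accuracy, best_fresh_accuracy)
```

The selected variant looks better than chance on the holdout set only because it was chosen for that, and its accuracy on the fresh data falls back toward 0.5. The same selection effect, in milder form, is why a holdout score used for tuning tends to be optimistic.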

Because of this, data scientists segment their datasets into three groups: a training set, a validation set, and a test set.  
* The training set
The training set is used to fit the parameters of each variation of the model.  Because these *parameter values* are subject to overfitting against the training set, especially as the flexibility of our model increases, we use a validation set to protect against this.  

* The validation set
After the models are trained, we look at their scores on the validation set to choose the tuning parameters of our model.  Here our tuning parameter is just which features we include.  However, as we learn to tweak our models more and more, we run the risk of choosing settings that just happen to work well with our validation set but won't perform as well in production.

* The test set
Because of this, after we select the tuning parameters, we run the chosen model against another holdout set of data that we have not yet used, called our test set.  Because our test set was not used to train our models or to choose tuning parameters, we expect the test score to most closely approximate how our model performs in production.  We can also use our test set to compare performance across machine learning algorithms, such as comparing our linear regression model against a random forest model.


### Training, Validation and Tests in Action

So now let's go through our machine learning process again, but this time, segmenting our dataset into training data, validation data, and testing data.

We have our same `feature_datasets` with one, two, and three features.

In [17]:
train_validate_test_datasets = []
for dataset in feature_datasets:
    training_data = dataset[:90]
    validation_data = dataset[90:120]
    test_data = dataset[120:]
    train_validate_test_dataset = (training_data, validation_data, test_data)
    train_validate_test_datasets.append(train_validate_test_dataset)

We can see that our first dataset, with one feature, is split into three groups.

In [18]:
len(train_validate_test_datasets[0])

3
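
The split above slices by position, which assumes the rows are not stored in a meaningful order. If they were (sorted by date, for example), one option is to shuffle the row indices once before cutting. The sketch below shows that idea with a hypothetical helper, `shuffled_three_way_split`, and made-up stand-in data. It is not how the lesson's data was prepared.

```python
import random


def shuffled_three_way_split(features, targets, n_train, n_validate, seed=0):
    """Shuffle row indices once, then cut them into three aligned parts."""
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)
    cuts = (
        indices[:n_train],
        indices[n_train:n_train + n_validate],
        indices[n_train + n_validate:],
    )
    # Apply the same indices to features and targets so rows stay paired.
    return [([features[i] for i in part], [targets[i] for i in part]) for part in cuts]


# Made-up stand-in data: 150 rows with one feature each.
example_features = [[temp] for temp in range(150)]
example_targets = [3 * temp for temp in range(150)]

training, validation, test = shuffled_three_way_split(example_features, example_targets, 90, 30)
print([len(part[0]) for part in (training, validation, test)])
```

Each row lands in exactly one of the three parts, and the sizes match the 90 / 30 / 30 split used in this lesson.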

Now we can train each of our models with the training data.

In [21]:
from sklearn.linear_model import LinearRegression
updated_training_models = []
for dataset in train_validate_test_datasets:
    model = LinearRegression()
    training_feature_data = dataset[0]
    model.fit(training_feature_data, customers_with_errors[0:90])
    updated_training_models.append(model)
updated_training_models

[LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False),
 LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False),
 LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)]

In [27]:
list(map(lambda model: model.coef_, updated_training_models))

[array([3.15452937]),
 array([ 3.14877452, 46.60849399]),
 array([ 3.11976833, 48.8545893 ,  0.4403944 ])]

Then we can evaluate the models with the validation sets to select the number of features.

In [24]:
models_with_trisplits = list(zip(updated_training_models, train_validate_test_datasets))

In [25]:
validation_scores = []
for trisplit_model_with_data in models_with_trisplits:
    selected_model = trisplit_model_with_data[0]
    validation_data = trisplit_model_with_data[1][1]
    validation_score = selected_model.score(validation_data, customers_with_errors[90:120])
    validation_scores.append(validation_score)

In [26]:
validation_scores

[0.14886339953110272, 0.4943257519541469, 0.5002740763480484]

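Finally, we select a model and calculate its test score.  The third model's validation score (0.500) is only slightly higher than the second's (0.494), so we again select the simpler second model.  The loop below scores all three models on the test set, but the score we report is the one for the selected model.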

In [30]:
test_scores = []
for trisplit_model_with_data in models_with_trisplits:
    selected_model = trisplit_model_with_data[0]
    test_data = trisplit_model_with_data[1][2]
    test_score = selected_model.score(test_data, customers_with_errors[120:])
    test_scores.append(test_score)

In [31]:
test_scores

[0.16574724727928036, 0.47415508688958946, 0.49213683932072705]

We expect the numbers in our test scores to better approximate how our model will perform in production.
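
One way to make the order of operations explicit is to select using the validation scores only, and then look up the test score for that one model. The sketch below reuses the scores printed above. It assumes a hypothetical selection rule: pick the model with the fewest features whose validation score is within a hypothetical `tolerance` of the best. That rule and its threshold are assumptions for illustration, not something this lesson prescribes.

```python
# Scores copied from the outputs above, ordered from fewest to most features.
validation_scores = [0.14886339953110272, 0.4943257519541469, 0.5002740763480484]
test_scores = [0.16574724727928036, 0.47415508688958946, 0.49213683932072705]

tolerance = 0.01  # hypothetical threshold, chosen only for illustration

best_validation_score = max(validation_scores)

# First (simplest) model whose validation score is close enough to the best.
selected_index = next(
    index
    for index, score in enumerate(validation_scores)
    if best_validation_score - score <= tolerance
)

# The test score is looked up only after the selection is fixed.
reported_test_score = test_scores[selected_index]
print(selected_index, reported_test_score)
```

With these numbers the rule selects index 1, the second model. The test scores play no part in the selection, which is what lets the reported score stand in for performance on unseen data.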

### Summary