# Validation and Test

## Learning objectives

Understand the following terms:

- cross validation
- test set
- random seeding
- data leakage

## The test set

> Machine learning models are only useful if they can __generalise__ to make good predictions on unseen examples

To estimate how well a model will perform on unseen data, we split our initial dataset into two different sets. One is for training on, and the other is for testing on.

> The testing set is used for evaluating whether a model meets our requirements and estimating real world performance. That is all. It is not for making choices about our model.

Scikit-learn provides a function `train_test_split()` in its `model_selection` module to split our data.

In [None]:
from sklearn import datasets
from sklearn.model_selection import train_test_split

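# Note: load_boston was removed in scikit-learn 1.2, so this line needs an older
# version; on newer versions swap in another regression dataset, for example
# datasets.load_diabetes(return_X_y=True) (results will differ).
X, y = datasets.load_boston(return_X_y=True)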

print(f"Number of samples in dataset: {len(X)}")

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

print("Number of samples in:")
print(f"    Training: {len(y_train)}")
print(f"    Testing: {len(y_test)}")

## The validation set

If we want to choose between two models, we can't base those decisions on the testing set. If we did, our choices would be biased towards doing well on that particular set, and the test score would no longer be an honest estimate of performance on unseen data.

Making choices based on the testing set is like seeing the answers on a test.

Instead, we create another set, called the __validation set__. This is used for comparing models or different options for the same model. We call this __cross validation__.

> We make the validation set by further splitting the training set.

In [None]:
# Split the training set further; the test set stays untouched
X_train, X_validation, y_train, y_validation = train_test_split(
    X_train, y_train, test_size=0.3
)

print("Number of samples in:")
print(f"    Training: {len(y_train)}")
print(f"    Validation: {len(y_validation)}")
print(f"    Testing: {len(y_test)}")

> Pay attention: people commonly fail to understand the difference between the validation set and the testing set.

The difference between the validation set and the testing set is that we use the validation set to make choices about our models, but not the test set.

Such choices may include:
- Should I deploy the linear regression model or the neural network?
- Should I use the model which was trained on all the features or just the 3 I picked?
- Any choice of hyperparameter (which you will learn about shortly)

Let's see this in practice.

In [None]:
import numpy as np
# ML algorithms you will learn about later, don't panic
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

np.random.seed(2)

models = [
    DecisionTreeRegressor(splitter="random"),
    SVR(),
    LinearRegression()
]

for model in models:
    model.fit(X_train, y_train)

    y_train_pred = model.predict(X_train)
    y_validation_pred = model.predict(X_validation)
    y_test_pred = model.predict(X_test)

    train_loss = mean_squared_error(y_train, y_train_pred)
    validation_loss = mean_squared_error(y_validation, y_validation_pred)
    test_loss = mean_squared_error(y_test, y_test_pred)

    print(
        f"{model.__class__.__name__}: "
        f"Train Loss: {train_loss} | Validation Loss: {validation_loss} | "
        f"Test Loss: {test_loss}"
    )

## Analysis

As you can see, the best validation loss belongs to linear regression, so this is the model we should choose. Unfortunately, it turns out that on the "real" (test) data it performs worse than the decision tree.

Once again: usually we won't have the test loss available when choosing a model, but now you know that this technique is imperfect (we will see how to mitigate those effects later on).
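One widely used way to depend less on a single validation split is k-fold cross-validation, where the train/validation split is repeated several times and the losses are averaged. The sketch below is only an assumption about what such a mitigation could look like, not necessarily the approach covered later. It uses synthetic data from `make_regression` rather than the dataset above, and the names `X_demo`, `y_demo`, `folds` and `fold_losses` are invented for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data, invented for this illustration
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)

# Five folds: each sample is used for validation exactly once
folds = KFold(n_splits=5, shuffle=True, random_state=0)

# scikit-learn scorers follow "higher is better", so the MSE comes back negated
scores = cross_val_score(
    LinearRegression(), X_demo, y_demo, cv=folds, scoring="neg_mean_squared_error"
)
fold_losses = -scores

print(f"Validation loss per fold: {fold_losses}")
print(f"Mean validation loss: {fold_losses.mean()}")
```

Comparing models on the mean over folds makes the choice less sensitive to which samples happened to land in one particular validation set.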

## Seeding

Above we had this line: `np.random.seed(2)`. It is actually really important and we should know what is going on.

### Pseudo-random number generators

Many machine learning algorithms use random initialization (for example, to set the starting parameters of a linear regression model trained by gradient descent). Depending on the algorithm, this can have a more or less severe effect on the result.

- Each time you run an algorithm that relies on randomness, the result may vary to some degree
- Random number generators use a so-called `seed`, a numerical value that determines which values will be generated
- For each run to be the same (or to show some phenomenon, like we did above), we should __always__ seed all functions that use random numbers

The last point is easy in `numpy` and `sklearn`, as it takes a single line. Most frameworks offer seeding in a similar way.
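As a minimal sketch of what seeding buys us, the snippet below draws random numbers twice with the same seed and splits a toy dataset twice with the same `random_state`. The arrays `X_demo` and `y_demo` are made up for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Seeding NumPy's global generator: the same seed yields the same numbers
np.random.seed(2)
first_draw = np.random.rand(3)
np.random.seed(2)
second_draw = np.random.rand(3)
print(np.array_equal(first_draw, second_draw))

# Toy data, invented for this illustration
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# scikit-learn functions accept a `random_state` argument for the same purpose
_, X_test_a, _, _ = train_test_split(X_demo, y_demo, test_size=0.3, random_state=2)
_, X_test_b, _, _ = train_test_split(X_demo, y_demo, test_size=0.3, random_state=2)
print(np.array_equal(X_test_a, X_test_b))
```

Passing `random_state` explicitly keeps a split reproducible even if other code has changed the global NumPy state in the meantime.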

### Why initialise a seed?

- When you want your experiments to be reproducible (especially important in Machine Learning)
- To be sure the outcome will not change between runs

> Always set a random seed to make sure your results are repeatable when some part of the code involves random numbers being generated.
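A single seeded run is repeatable, but it can still be a lucky or unlucky one. When the code depends heavily on randomness, one option is to repeat the experiment over several seeds and report the mean. The sketch below assumes synthetic data from `make_regression` and an arbitrary list of seeds, both chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data and a fixed split, invented for this illustration
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# Train the same randomised model once per seed and collect the losses
seeds = [0, 1, 2, 3, 4]
losses = []
for seed in seeds:
    model = DecisionTreeRegressor(splitter="random", random_state=seed)
    model.fit(X_tr, y_tr)
    losses.append(mean_squared_error(y_val, model.predict(X_val)))

print(f"Mean validation loss: {np.mean(losses)}")
print(f"Standard deviation: {np.std(losses)}")
```

The standard deviation gives a rough idea of how much of the difference between two models could be down to the seed alone.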

## Data Leakage

__Data leakage__ is when a model has access, during training, to information about the validation or testing sets. In the real world, when facing totally new examples, our model will not know anything about the incoming data. That means we need to carefully preserve that separation in training too.

> Never make any decisions about your model design using the test set

The most obvious cause of data leakage is bad data splitting, for example:
- some samples appear in both the training and validation sets
- the model is simply evaluated on the training data

But there are less obvious ways to cause data leakage as well. One is to compute some statistics (like the mean) or new features (like the difference from the mean) on the whole dataset before splitting it, and then to use those values in training after the split. Those statistics carry information about the validation and testing data that has since been split off.

Let's see an example in action...

In [None]:
def calculate_validation_loss(X_train, y_train, X_validation, y_validation):
    model = LinearRegression()

    # Train on the given training data, evaluate on the validation data
    model.fit(X_train, y_train)
    y_validation_pred = model.predict(X_validation)
    validation_loss = mean_squared_error(y_validation, y_validation_pred)

    print(f"Validation loss: {validation_loss}")
    
# Without data leakage, train on train, validate on validation
calculate_validation_loss(X_train, y_train, X_validation, y_validation)

# With data leakage: 50 validation samples added to the training data
fail_X_train = np.concatenate((X_train, X_validation[:50]))
fail_y_train = np.concatenate((y_train, y_validation[:50]))

calculate_validation_loss(fail_X_train, fail_y_train, X_validation, y_validation)

As expected, because the model saw part of the validation data during training, it __falsely__ appears to perform better on it.
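The less obvious leak described earlier, statistics computed before splitting, can be sketched in the same spirit. The example below uses scikit-learn's `StandardScaler` on synthetic data (an assumption made for illustration, not part of the lesson's dataset) and contrasts fitting the scaler on all samples with fitting it on the training samples only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features, invented for this illustration
rng = np.random.default_rng(0)
X_all = rng.normal(loc=5.0, scale=2.0, size=(200, 3))

X_tr, X_val = train_test_split(X_all, test_size=0.3, random_state=0)

# Leaky: the mean and variance are computed from every sample,
# so they include information about the validation rows
leaky_scaler = StandardScaler().fit(X_all)

# Safe: statistics come from the training rows only, and the same
# transformation is then applied to the validation rows
safe_scaler = StandardScaler().fit(X_tr)
X_tr_scaled = safe_scaler.transform(X_tr)
X_val_scaled = safe_scaler.transform(X_val)

print("Mean from all data:     ", leaky_scaler.mean_)
print("Mean from training only:", safe_scaler.mean_)
```

The two means differ only slightly here, but the principle is the same as in the example above: anything fitted on validation or test rows has already seen data it is later evaluated on.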

## Summary

- The validation set is used to choose the best algorithm, the best set of arguments (hyperparameters) for an algorithm, etc.
- The test set is used to check how our algorithm performs on unseen data
- __As we tune algorithms according to the `validation` dataset, we cannot use it to check performance__
- A `seed` is used to ensure reproducibility. Running an experiment multiple times is also a good idea if our code depends heavily on random initialization (we can take the mean of the results)
- Data leakage is information from `validation` (or `test`) leaking into training
- Data leakage leads to falsely good results and should be avoided
- Rule of thumb: imagine you only have the training dataset when doing preprocessing. Any statistic you compute must come from the training data alone, never from `validation` or `test`