# Validation and Testing

## Introduction

> ML models are only useful if they can __generalise__ and make good predictions on unseen data.

To estimate the performance of a model on unseen data, the initial dataset is usually split into two sets: one for training and the other for testing.

## The Test Set

> The test set is used for evaluating if a model meets the desired requirements and for estimating its real-world performance. It is not employed for making choices about the model.

Scikit-learn provides the function `train_test_split()` in its `model_selection` module to split our data.

In [None]:
from sklearn import datasets
from sklearn.model_selection import train_test_split

# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version
X, y = datasets.load_boston(return_X_y=True)

print(f"Number of samples in dataset: {len(X)}")

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

print("Number of samples in:")
print(f"    Training: {len(y_train)}")
print(f"    Testing: {len(y_test)}")

## The Validation Set

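The decision to choose between two models cannot be based on the testing set. If it were, the choice would be biased towards whatever happens to score well on that particular set, and the test loss would no longer be an honest estimate of performance on unseen data.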

Making a decision based on the testing set is analogous to seeing the answers on a test.

Instead, we create another set, called the __validation set__. This set is used for comparing models, or different options for the same model. The process is often referred to as __cross validation__.

> We make the validation set by further splitting the training set.

In [None]:
# Split the training set (not the test set) so the test set stays untouched
X_train, X_validation, y_train, y_validation = train_test_split(
    X_train, y_train, test_size=0.3
)

print("Number of samples in:")
print(f"    Training: {len(y_train)}")
print(f"    Validation: {len(y_validation)}")
print(f"    Testing: {len(y_test)}")
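
The two-step split described above can also be sketched on its own. This is a minimal illustration under stated assumptions: the data comes from `make_regression` as a synthetic stand-in, and the names `X_demo`, `X_tr`, `X_val`, `X_te` and the fixed `random_state` values are hypothetical choices, not part of the original notebook.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Synthetic stand-in dataset (assumption): 500 samples, 10 features
X_demo, y_demo = make_regression(
    n_samples=500, n_features=10, noise=10.0, random_state=0
)

# Step 1: hold out 30% of the data as the test set
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# Step 2: carve the validation set out of the remaining training data
X_tr, X_val, y_tr, y_val = train_test_split(
    X_tr, y_tr, test_size=0.3, random_state=0
)

print("Number of samples in:")
print(f"    Training: {len(y_tr)}")
print(f"    Validation: {len(y_val)}")
print(f"    Testing: {len(y_te)}")
```

Every sample ends up in exactly one of the three sets, and the test set is never touched again until the final evaluation.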

## Validation vs Test Sets

People commonly misunderstand the difference between the validation set and the test set.

The difference is that the validation set, not the test set, is employed to make choices about models.

Such choices may include the following:

- Should I deploy the linear regression model or the neural network?
- Should I use the model that was trained on all the features, or the one trained on just the three I selected?
- Which hyperparameters should I select? (You will learn about these shortly.)

Let's see this in action.

In [None]:
import numpy as np
# These ML algorithms are covered later in the course, so don't panic
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

np.random.seed(2)

models = [
    DecisionTreeRegressor(splitter="random"),
    SVR(),
    LinearRegression()
]

for model in models:
    model.fit(X_train, y_train)

    y_train_pred = model.predict(X_train)
    y_validation_pred = model.predict(X_validation)
    y_test_pred = model.predict(X_test)

    train_loss = mean_squared_error(y_train, y_train_pred)
    validation_loss = mean_squared_error(y_validation, y_validation_pred)
    test_loss = mean_squared_error(y_test, y_test_pred)

    print(
        f"{model.__class__.__name__}: "
        f"Train Loss: {train_loss} | Validation Loss: {validation_loss} | "
        f"Test Loss: {test_loss}"
    )

### Analysis

As you can observe, the best validation loss is for linear regression, so this is the model we should choose. Unfortunately, it turns out that on 'real' (test) data it performs worse than the decision tree.

Once again: in practice, we would not have the test loss available when making this choice. It is shown here so that you know this technique is imperfect (we will see how to mitigate these effects later on).
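
The selection step itself can be sketched as follows: pick the candidate with the lowest validation loss, and only then compute the test loss once for the chosen model. This is an illustrative sketch on synthetic data; the `candidates` dictionary, the variable names and the `make_regression` dataset are assumptions made for the example.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in dataset (assumption)
X_demo, y_demo = make_regression(
    n_samples=400, n_features=8, n_informative=5, noise=15.0, random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_tr, y_tr, test_size=0.3, random_state=1
)

# Hypothetical candidate models to choose between
candidates = {
    "tree": DecisionTreeRegressor(random_state=1),
    "linear": LinearRegression(),
}

# Compare the candidates on the validation set only
validation_losses = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    validation_losses[name] = mean_squared_error(y_val, model.predict(X_val))

# Choose by validation loss, then report the test loss once for the winner
best_name = min(validation_losses, key=validation_losses.get)
test_loss = mean_squared_error(y_te, candidates[best_name].predict(X_te))

print(f"Chosen model: {best_name}")
print(f"Validation loss: {validation_losses[best_name]:.2f}")
print(f"Test loss: {test_loss:.2f}")
```

The test loss plays no part in the choice; it is only reported afterwards as the estimate of real-world performance.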

## Seeding

In the code above, note the following line: `np.random.seed(2)`. This code line is important, and its role will be covered here.

### Pseudo-random number generators

Many ML algorithms employ randomness, for example to initialise the parameters of a linear regression model. Depending on the algorithm, this may have a more or less severe effect on the result.

- Each time you run an algorithm that uses randomness, the result may vary to some degree.
- Random number generators employ a `seed`, which is a numerical value that determines what values will be generated.
- For each run to be the same (or to reproduce a phenomenon like the one above), we should __always__ seed every function that uses random numbers.

The last point is quite easy in `numpy` and `sklearn`, as it takes a single line. Seeding in this way is common to most frameworks.
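
As a small sketch of what seeding does, the snippet below re-seeds NumPy's global generator and passes `random_state` to `train_test_split`. The toy array `data` and the variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Re-seeding the global NumPy generator replays the same sequence
np.random.seed(2)
first = np.random.rand(3)
np.random.seed(2)
second = np.random.rand(3)
print(np.array_equal(first, second))

# Many scikit-learn functions accept `random_state` for the same purpose
data = np.arange(20).reshape(10, 2)  # toy data (assumption)
a_train, a_test = train_test_split(data, test_size=0.3, random_state=2)
b_train, b_test = train_test_split(data, test_size=0.3, random_state=2)
print(np.array_equal(a_test, b_test))
```

Both comparisons succeed because the same seed produces the same sequence of random numbers, and therefore the same split.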

### Benefits of seed initialisation

- To ensure the reproducibility of experiments, which is particularly important in ML.
- To ensure an equal outcome for all runs.

> Always set a random seed to make sure your results are repeatable when some part of the code involves random numbers being generated.
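
A fixed seed makes a single run repeatable. When the result depends heavily on randomness, one option is to repeat the experiment over several seeds and report the mean. The sketch below assumes a synthetic dataset and a hypothetical helper, `validation_loss_for_seed`; neither comes from the original notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in dataset and a fixed split (assumptions)
X_demo, y_demo = make_regression(
    n_samples=300, n_features=5, noise=10.0, random_state=0
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)


def validation_loss_for_seed(seed):
    """Train a randomised tree with the given seed and return its validation MSE."""
    model = DecisionTreeRegressor(splitter="random", random_state=seed)
    model.fit(X_tr, y_tr)
    return mean_squared_error(y_val, model.predict(X_val))


# Same data, different seeds: only the model's internal randomness changes
seeds = [0, 1, 2, 3, 4]
losses = [validation_loss_for_seed(seed) for seed in seeds]

print(f"Mean validation loss: {np.mean(losses):.2f}")
print(f"Standard deviation:   {np.std(losses):.2f}")
```

Each individual run is still reproducible because every seed is fixed, while the mean gives a steadier picture than any single run.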

## Data Leakage

__Data leakage__ refers to a situation where a model has access, during training, to information from the validation or test set. In the real world, the model will know nothing in advance about the new examples it receives, so this separation must be carefully preserved during training.

> Never make any decisions about your model design using the test set.

### Causes

Data leakage can be caused by bad data splitting, including the following cases:

- Some samples are both in training and validation.
- The model is evaluated based on its performance on the training data.

Let's see an example in action...

In [None]:
def calculate_validation_loss(X_train, y_train, X_validation, y_validation):
    model = LinearRegression()

    # Train on the given training data, then evaluate on the validation data
    model.fit(X_train, y_train)
    y_validation_pred = model.predict(X_validation)
    validation_loss = mean_squared_error(y_validation, y_validation_pred)

    print(f"Validation loss: {validation_loss}")
    
# Without data leakage, train on train, validate on validation
calculate_validation_loss(X_train, y_train, X_validation, y_validation)

# With data leakage, 50 samples from validation added
fail_X_train = np.concatenate((X_train, X_validation[:50]))
fail_y_train = np.concatenate((y_train, y_validation[:50]))

calculate_validation_loss(fail_X_train, fail_y_train, X_validation, y_validation)

As expected, because the model saw part of the validation data during training, it __falsely__ appears to perform better on it.
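
One simple way to catch this kind of leakage is to count the rows that two splits share. The helper below, `count_shared_rows`, is a hypothetical sketch rather than a library function; it compares rows byte for byte, so it assumes both arrays have the same dtype and only detects exact duplicates.

```python
import numpy as np


def count_shared_rows(a, b):
    """Count the rows of `b` that also appear, exactly, in `a`."""
    rows_a = {row.tobytes() for row in np.ascontiguousarray(a)}
    return sum(row.tobytes() in rows_a for row in np.ascontiguousarray(b))


# Toy data (assumption): two independent random splits
rng = np.random.default_rng(0)
train = rng.normal(size=(100, 4))
validation = rng.normal(size=(30, 4))

# Deliberately leak 10 validation rows into the training data
leaky_train = np.concatenate((train, validation[:10]))

print(f"Shared rows (clean split): {count_shared_rows(train, validation)}")
print(f"Shared rows (leaky split): {count_shared_rows(leaky_train, validation)}")
```

A non-zero count for a split that is supposed to be disjoint is a sign that some samples are in both sets.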

## Conclusion
At this point, you should have a good understanding of the following:

- The validation set is used to find the best algorithms, the best set of arguments to those algorithms, and so on.
- The test set is used to check how our algorithm performs on unseen data.
- __Because we tune algorithms according to the `validation` dataset, we cannot use it to check performance.__
- A `seed` is used to ensure reproducibility. Multiple runs of an experiment are also useful if the code depends heavily on random initialisation (we can take the mean of the results).
- Data leakage is information from `validation` (or `test`) leaking into training.
- Data leakage leads to falsely good results and should be avoided.
- Rule of thumb: when preprocessing, imagine you only have the training dataset. Any statistic you calculate must come from the training data alone, never from `validation` or `test`.
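
The rule of thumb can be sketched with feature scaling: the statistics are computed from the training split only and then applied unchanged to the validation split. `StandardScaler` is used here as one example of a preprocessing step, and the toy data and variable names are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy feature matrix (assumption)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(200, 3))

X_tr, X_val = train_test_split(X_demo, test_size=0.3, random_state=0)

scaler = StandardScaler()
# Mean and standard deviation are computed from the training data only
X_tr_scaled = scaler.fit_transform(X_tr)
# The validation data is transformed with those same training statistics
X_val_scaled = scaler.transform(X_val)

print("Training mean after scaling:  ", X_tr_scaled.mean(axis=0))
print("Validation mean after scaling:", X_val_scaled.mean(axis=0))
```

Calling `fit` or `fit_transform` on the validation or test data would let their statistics influence preprocessing, which is a form of data leakage.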