the reason not to evaluate the models on the same data they were trained on quickly became evident: after just a few epochs, all three models began to overfit, that is, their performance on never-before-seen data started stalling (or worsening) compared to their performance on the training data, which always improves as training progresses

in machine learning, the goal is to achieve models that generalize, that is, models that perform well on never-before-seen data, and overfitting is the central obstacle; you can only control what you can observe, so it is crucial to be able to reliably measure the generalization power of the model

##### Training, Validation and Test Sets

- splitting the available data into three sets:
    - training set
        - train the model
    - validation set
        - evaluate the model while tuning it
    - test set
        - test the model one final time, once it is ready
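
a minimal sketch of the three-way split above, assuming the samples sit in a NumPy array; the helper name `three_way_split` and the 60/20/20 proportions are illustrative assumptions, not something these notes prescribe

```python
import numpy as np

def three_way_split(data, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle `data` and cut it into training, validation and test sets.

    Hypothetical helper: the name and the default fractions are assumptions.
    """
    rng = np.random.default_rng(seed)
    shuffled = data[rng.permutation(len(data))]  # shuffled copy, original untouched

    num_test = int(len(data) * test_fraction)
    num_val = int(len(data) * val_fraction)

    test_data = shuffled[:num_test]
    validation_data = shuffled[num_test:num_test + num_val]
    training_data = shuffled[num_test + num_val:]
    return training_data, validation_data, test_data

data = np.arange(100)
training_data, validation_data, test_data = three_way_split(data)
print(len(training_data), len(validation_data), len(test_data))  # 60 20 20
```

the test set is cut off first and then left alone; only the training and validation sets take part in the tuning loop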

developing a model always involves tuning its configuration (its hyperparameters, so called to distinguish them from the parameters, which are the network's weights); this tuning uses the performance of the model on the validation set as a feedback signal

in essence, this tuning is a form of learning: a search for a good configuration in some parameter space; as a result, tuning the configuration of the model based on its performance on the validation set can quickly result in 'overfitting to the validation set', even though the model is never directly trained on it

central to this phenomenon is the notion of information leaks: every time you tune a hyperparameter of the model based on the model's performance on the validation set, some information about the validation data leaks into the model
- if you do this only once, for one parameter, very few bits of information will leak, and the validation set will remain reliable for evaluating the model
- if you repeat this many times (running one experiment, evaluating on the validation set, and modifying the model as a result), you will leak an increasingly significant amount of information about the validation set into the model

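you end up with a model that performs artificially well on the validation data, because that is what it was optimized for; since what matters is performance on completely new data, the model must be evaluated on the test set, and it must never have had access to any information about the test set, even indirectly, otherwise the measure of generalization will be flawed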
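
a small simulation can make the leak visible; this is only a sketch under stated assumptions: the data is pure noise (so no model can truly beat about 50% accuracy), and each random weight vector stands in for one round of "tune, then check on the validation set"; the names and sizes are made up for illustration

```python
import numpy as np

rng = np.random.default_rng(0)

# Pure-noise problem: the features carry no information about the labels,
# so no model can truly do better than about 50% accuracy.
num_features = 20
x_val, y_val = rng.normal(size=(200, num_features)), rng.integers(0, 2, size=200)
x_test, y_test = rng.normal(size=(200, num_features)), rng.integers(0, 2, size=200)

def accuracy(weights, x, y):
    """Accuracy of a linear classifier that predicts 1 when x @ weights > 0."""
    return np.mean((x @ weights > 0) == y)

best_weights, best_val_acc = None, 0.0
for trial in range(500):
    weights = rng.normal(size=num_features)   # one "configuration" to try
    val_acc = accuracy(weights, x_val, y_val)
    if val_acc > best_val_acc:                # keep whatever looks best on validation
        best_weights, best_val_acc = weights, val_acc

test_acc = accuracy(best_weights, x_test, y_test)
print(f"validation accuracy of the selected configuration: {best_val_acc:.2f}")
print(f"test accuracy of the same configuration:          {test_acc:.2f}")
```

the selected configuration was never trained on the validation data, yet picking the best of many trials is enough to inflate its validation score; the test set, which played no part in the selection, gives the honest number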

##### Three Classic Evaluation Recipes
- simple hold-out validation
- k-fold validation
- iterated k-fold validation with shuffling

##### Simple Hold-Out Validation

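- set apart some fraction of the data as the test set
- train on the remaining data
- evaluate on the test set
- to prevent information leaks, don't tune the model based on the test set; also reserve a validation set for tuning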

![simple_hold_out_validation_split](./h_o_val_split.png)

In [None]:
# hold-out validation
# assumes `data` and `test_data` are NumPy arrays and `get_model()` returns a fresh, untrained model

import numpy as np

num_validation_samples = 10000

np.random.shuffle(data) # shuffling the data is usually appropriate

validation_data = data[:num_validation_samples] # defines the validation set
training_data = data[num_validation_samples:] # defines the training set

model = get_model()
model.train(training_data) # trains a model on the training data
validation_score = model.evaluate(validation_data) # and evaluates it on the validation data

# at this point you can tune the model,
# retrain it, evaluate it, tune it again

# once you've tuned the hyperparameters,
# it's common to train the final model from scratch on all non-test data available
model = get_model()
model.train(np.concatenate([training_data,
                            validation_data]))
test_score = model.evaluate(test_data)

flaws:
- if little data is available, then the validation and test sets may contain too few samples to be statistically representative of the data at hand
- this is easy to recognize: different random shuffling rounds of the data before splitting end up yielding very different measures of model performance
- k-fold validation and iterated k-fold validation are two ways to address this
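
the instability described above can be checked with a quick experiment; the sketch below assumes a tiny synthetic two-class dataset and a nearest-centroid classifier written in plain NumPy, both chosen only to keep the example self-contained

```python
import numpy as np

def nearest_centroid_accuracy(x_train, y_train, x_val, y_val):
    """Fit one centroid per class on the training split, score on the validation split."""
    classes = np.unique(y_train)
    centroids = np.stack([x_train[y_train == c].mean(axis=0) for c in classes])
    # distance from every validation sample to every centroid
    distances = np.linalg.norm(x_val[:, None, :] - centroids[None, :, :], axis=2)
    predictions = classes[distances.argmin(axis=1)]
    return np.mean(predictions == y_val)

rng = np.random.default_rng(0)

# Small synthetic two-class dataset (60 samples, overlapping classes).
x = np.concatenate([rng.normal(0.0, 1.0, size=(30, 2)),
                    rng.normal(1.0, 1.0, size=(30, 2))])
y = np.array([0] * 30 + [1] * 30)

scores = []
for round_index in range(20):
    order = rng.permutation(len(x))          # a different shuffle each round
    val_idx, train_idx = order[:12], order[12:]
    scores.append(nearest_centroid_accuracy(x[train_idx], y[train_idx],
                                            x[val_idx], y[val_idx]))

print(f"hold-out accuracy over 20 shuffles: min={min(scores):.2f}, max={max(scores):.2f}")
```

a wide gap between the minimum and the maximum is the warning sign: with only 12 validation samples, a single hold-out score says little about the model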

##### K-Fold Validation

- split the data into K partitions of equal size
- for each partition i, train a model on the remaining K-1 partitions
- and evaluate it on partition i
- your final score is then the average of the K scores obtained

this method is helpful when the performance of the model shows significant variance based on the train-test split; like hold-out validation, this method doesn't exempt you from using a distinct validation set for model calibration

![three_fold_validation](./3_f_vali.png)

In [None]:
# k-fold cross-validation
# assumes `data` and `test_data` are NumPy arrays and `get_model()` returns a fresh, untrained model

import numpy as np

k = 4
num_validation_samples = len(data) // k

np.random.shuffle(data)

validation_scores = []
for fold in range(k):
    start = num_validation_samples * fold
    stop = num_validation_samples * (fold + 1)
    validation_data = data[start:stop] # select the validation data partition
    training_data = np.concatenate([data[:start],
                                    data[stop:]]) # use the remainder of the data as training data
                                                  # (np.concatenate joins the slices; + would sum arrays elementwise)
    model = get_model() # create a brand new instance of the model (untrained)
    model.train(training_data)
    validation_score = model.evaluate(validation_data)
    validation_scores.append(validation_score)

validation_score = np.average(validation_scores) # validation score: the average of the validation scores of the k folds

model = get_model() # trains the final model on all non-test data available
model.train(data)
test_score = model.evaluate(test_data)

##### Iterated K-Fold Validation with Shuffling

- for situations in which you have relatively little data available but need to evaluate the model as precisely as possible (extremely helpful in Kaggle competitions):
    - apply K-fold validation multiple times
    - shuffle the data every time before splitting it K ways
    - the final score is the average of the scores obtained at each run of K-fold validation
    - this means training and evaluating P x K models (where P is the number of iterations), which can be expensive
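
the steps above can be sketched as follows; `iterated_k_fold` and its `train_and_score` callback are hypothetical names introduced here for illustration, and the dummy callback only counts calls instead of training a real model

```python
import numpy as np

def iterated_k_fold(data, train_and_score, k=4, iterations=3, seed=0):
    """Average validation score over `iterations` shuffled runs of k-fold validation.

    `train_and_score(training_data, validation_data)` is a hypothetical callback
    that trains a fresh model and returns its validation score.
    """
    rng = np.random.default_rng(seed)
    run_scores = []
    for _ in range(iterations):
        shuffled = data[rng.permutation(len(data))]   # reshuffle before every run
        folds = np.array_split(shuffled, k)           # k nearly equal partitions
        fold_scores = []
        for i in range(k):
            validation_data = folds[i]
            training_data = np.concatenate(folds[:i] + folds[i + 1:])
            fold_scores.append(train_and_score(training_data, validation_data))
        run_scores.append(np.mean(fold_scores))
    return float(np.mean(run_scores))                 # P x K models were trained in total

calls = []
def dummy_train_and_score(training_data, validation_data):
    calls.append((len(training_data), len(validation_data)))
    return 1.0  # stand-in score; a real callback would train and evaluate a model

data = np.arange(20)
score = iterated_k_fold(data, dummy_train_and_score, k=4, iterations=3)
print(score, len(calls))  # 1.0 12
```

with P = 3 iterations and K = 4 folds the callback runs 12 times, which is where the cost of this protocol comes from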

##### Things to Keep in Mind

when choosing an evaluation protocol:
- data representativeness
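    - both the training set and the test set should be representative of the data at hand
        - for instance, when classifying images of digits and starting from an array of samples ordered by their class, taking the first 80% as the training set and the remaining 20% as the test set leaves the training set with only classes 0-7 and the test set with only classes 8-9; this seems like a ridiculous mistake, but it's surprisingly common
        - you should usually randomly shuffle the data before splitting it into training and test sets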
- the arrow of time
    - if trying to predict the future given the past (e.g. tomorrow's weather, stock movements, and so on), do not randomly shuffle the data before splitting it, to avoid a temporal leak
        - otherwise the model will effectively be trained on data from the future
        - always make sure all data in the test set is posterior to the data in the training set
- redundancy in the data
    - if some data points appear twice (fairly common with real-world data), then shuffling the data and splitting it into a training set and a validation set will result in redundancy between the two; in effect, you would be testing on part of the training data, which is the worst thing you can do
    - make sure the training set and the validation set are disjoint
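
two of the points above lend themselves to a short sketch: a chronological split for time-ordered data, and removing duplicates before a shuffled split; the helper names, the toy arrays and the fractions are assumptions made for illustration only

```python
import numpy as np

def chronological_split(samples, timestamps, test_fraction=0.2):
    """Split so that every test sample is at least as recent as every training sample."""
    order = np.argsort(timestamps, kind="stable")     # oldest first, no shuffling
    num_test = int(len(samples) * test_fraction)
    split = len(samples) - num_test
    return samples[order[:split]], samples[order[split:]]

def drop_duplicates_then_split(samples, val_fraction=0.25, seed=0):
    """Remove repeated rows first, so the shuffled split cannot share samples."""
    unique_samples = np.unique(samples, axis=0)
    rng = np.random.default_rng(seed)
    shuffled = unique_samples[rng.permutation(len(unique_samples))]
    num_val = int(len(shuffled) * val_fraction)
    return shuffled[num_val:], shuffled[:num_val]

# Time series: the test set is the most recent 20% of the samples.
timestamps = np.array([5, 1, 4, 2, 3, 9, 7, 8, 6, 10])
samples = timestamps * 10
train, test = chronological_split(samples, timestamps)
print(train.max() < test.min())  # True

# Redundant data: rows 0 and 2 are identical.
rows = np.array([[1, 2], [3, 4], [1, 2], [5, 6], [7, 8]])
train_rows, val_rows = drop_duplicates_then_split(rows)
overlap = {tuple(r) for r in train_rows} & {tuple(r) for r in val_rows}
print(len(overlap))  # 0
```

exact-match deduplication only catches identical rows; near-duplicates (for example, the same image saved twice with different compression) would need a domain-specific check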