# Evaluating machine-learning models
The chapter discusses the importance of splitting data into a training set, validation set, and test set in order to measure the generalization power of machine learning models. Overfitting is a common problem that occurs when a model performs well on training data but poorly on unseen data. The chapter emphasizes the importance of evaluating models on never-before-seen data to achieve models that generalize well. Strategies for mitigating overfitting and maximizing generalization are also discussed. The section focuses on how to measure generalization and how to evaluate machine-learning models.

# Training, validation, and test sets
To evaluate a machine learning model, the available data is split into three sets: 
1. **Training.**
2. **Validation.**
3. **Test.**

The model is trained on the training data and evaluated on the validation data, and that validation score is used to tune the model's configuration (its hyperparameters). Every adjustment made in response to the validation score leaks some information about the validation set into the model. Repeated often enough, this overfits the model to the validation set and makes it perform artificially well there. Therefore, a completely different, never-before-seen dataset (the test set) should be used to evaluate the model's generalization. Splitting data three ways may seem straightforward, but a few more careful recipes come in handy when little data is available. Three classic ones are simple **hold-out validation**, **K-fold validation**, and **iterated K-fold validation with shuffling**.

# Simple Hold-Out Validation
Set apart some fraction of your data as your test set. Train on the remaining data, and evaluate on the test set. As you saw in the previous sections, in order to prevent information leaks, you shouldn’t tune your model based on the test set, and therefore you should also reserve a validation set.
Schematically, hold-out validation looks like the figure below. The listing that follows the figure shows a simple implementation.

In [1]:
from IPython.display import Image
Image(url='https://image.ibb.co/f6f1my/hold_out_validation.png')

In [None]:
# Hold-out validation
# Assumes `data` (a NumPy array of samples), a held-out `test_data` array,
# and a `get_model()` factory are defined elsewhere.
import numpy as np

num_validation_samples = 10000

# Shuffling the data is usually appropriate.
np.random.shuffle(data)

# Defines the validation set
validation_data = data[:num_validation_samples]

# Defines the training set (everything after the validation samples)
training_data = data[num_validation_samples:]

# Trains a model on the training data, and evaluates it on the validation data
model = get_model()
model.train(training_data)
validation_score = model.evaluate(validation_data)

# At this point you can tune your model,
# retrain it, evaluate it, tune it again...

# Once you’ve tuned your hyperparameters, it’s common to train your final
# model from scratch on all non-test data available.
model = get_model()
model.train(np.concatenate([training_data, validation_data]))
test_score = model.evaluate(test_data)

This code splits the non-test portion of the dataset, data, into a validation set and a training set. The test set, test_data, is assumed to have been set aside beforehand. The number of validation samples is set to num_validation_samples, which is 10000 in this case. The data is shuffled randomly, the first num_validation_samples samples become the validation set, and the rest become the training set. A model is created using get_model() and trained on the training set. The validation set is then used to evaluate the model and obtain validation_score. Once tuning based on the validation score is finished, a fresh model is created with get_model() and trained on the concatenated training and validation sets. Finally, this model is evaluated once on the test set to obtain test_score.
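
The listing above assumes the test set already exists. As a minimal sketch of how all three sets could be carved out of a single array, the hypothetical helper below (`three_way_split`, with its `val_fraction`, `test_fraction`, and `seed` parameters, is an illustration and not part of the original listing) permutes the sample indices and slices them into three disjoint groups.

```python
import numpy as np


def three_way_split(data, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle `data` along axis 0 and return (train, validation, test) arrays."""
    data = np.asarray(data)
    rng = np.random.default_rng(seed)
    # Permute indices instead of shuffling in place, so the caller's array is untouched.
    indices = rng.permutation(len(data))

    num_test = int(len(data) * test_fraction)
    num_val = int(len(data) * val_fraction)

    # Three disjoint index ranges: test first, then validation, then training.
    test_idx = indices[:num_test]
    val_idx = indices[num_test:num_test + num_val]
    train_idx = indices[num_test + num_val:]
    return data[train_idx], data[val_idx], data[test_idx]


samples = np.arange(100).reshape(50, 2)  # 50 toy samples with 2 features each
train, val, test = three_way_split(samples)
print(train.shape, val.shape, test.shape)  # (30, 2) (10, 2) (10, 2)
```

Because the three index ranges never overlap, no sample can appear in more than one set, provided the data itself contains no duplicates.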

# K-Fold Validation
With this approach, you split your data into K partitions of equal size. For each partition i, train a model on the remaining K – 1 partitions, and evaluate it on partition i. Your final score is then the average of the K scores obtained. This method is helpful when the performance of your model shows significant variance depending on your train-test split. Like hold-out validation, this method doesn’t exempt you from using a distinct validation set for model calibration.
Schematically, K-fold cross-validation looks like the figure below. The listing that follows the figure shows a simple implementation.

In [2]:
from IPython.display import Image
Image(url='https://image.ibb.co/bUm7Ry/k_fold_validation.png')

In [None]:
# K-fold validation
# Assumes `data` (a NumPy array of samples), a held-out `test_data` array,
# and a `get_model()` factory are defined elsewhere.
import numpy as np

k = 4
num_validation_samples = len(data) // k

np.random.shuffle(data)

validation_scores = []
for fold in range(k):
    # Selects the validation data partition
    validation_data = data[num_validation_samples * fold:
                           num_validation_samples * (fold + 1)]

    # Uses the remainder of the data as training data. np.concatenate joins the
    # two slices; on NumPy arrays the + operator would add them element-wise.
    training_data = np.concatenate([data[:num_validation_samples * fold],
                                    data[num_validation_samples * (fold + 1):]])

    # Creates a brand-new instance of the model (untrained)
    model = get_model()
    model.train(training_data)
    validation_score = model.evaluate(validation_data)
    validation_scores.append(validation_score)

# Validation score: average of the validation scores of the k folds
validation_score = np.average(validation_scores)

# Trains the final model on all non-test data available
model = get_model()
model.train(data)
test_score = model.evaluate(test_data)

This code performs k-fold cross-validation to evaluate a machine learning model. It first sets k to 4 and calculates the number of validation samples per fold (num_validation_samples) by integer-dividing the length of the data by k. Then, it shuffles the data randomly using np.random.shuffle().

Next, the code initializes an empty list, validation_scores, to store the score obtained for each fold. The for loop runs for k iterations. In each iteration it selects one partition of the data as the validation data and uses the rest as training data. The get_model() function is called to create a new, untrained model for each iteration, and the model is trained on the training data using model.train(). The validation score is then calculated using model.evaluate() on the validation data and appended to the validation_scores list. After the loop, the k scores are averaged into a single validation score. A final model is then trained on all non-test data and evaluated once on test_data.
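
When the number of samples is not a multiple of k, `len(data) // k` leaves a few trailing samples that are never used for validation. As a hedged sketch of one way around this, the hypothetical generator below (`k_fold_indices` is an illustrative name, not from the original listing) uses `np.array_split`, which accepts uneven division, so every sample lands in exactly one validation fold.

```python
import numpy as np


def k_fold_indices(num_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs; each sample is in exactly one validation fold."""
    if k < 2:
        raise ValueError("k must be at least 2")
    rng = np.random.default_rng(seed)
    indices = rng.permutation(num_samples)
    # array_split tolerates uneven division: fold sizes differ by at most one.
    folds = np.array_split(indices, k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx


# 10 samples in 4 folds gives validation folds of sizes 3, 3, 2, 2.
for train_idx, val_idx in k_fold_indices(num_samples=10, k=4):
    print(len(train_idx), len(val_idx))
```

Working with indices rather than slices of the data also makes it easy to apply the same split to a separate array of labels.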

# Iterated K-Fold Validation With Shuffling
This one is for situations in which you have relatively little data available and you need to evaluate your model as precisely as possible. I’ve found it to be extremely helpful in Kaggle competitions. It consists of applying K-fold validation multiple times, shuffling the data every time before splitting it K ways. The final score is the average of the scores obtained at each run of K-fold validation. Note that you end up training and evaluating P × K models (where P is the number of iterations you use), which can be very expensive.
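
The procedure can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: `iterated_k_fold` and its `score_fn(train, val)` callback are hypothetical names, and `toy_score` is a stand-in for "train a fresh model on `train`, then evaluate it on `val`".

```python
import numpy as np


def iterated_k_fold(data, k, num_iterations, score_fn, seed=0):
    """Average score_fn(train, val) over num_iterations shuffled K-fold runs."""
    data = np.asarray(data)
    rng = np.random.default_rng(seed)
    run_scores = []
    for _ in range(num_iterations):
        # Reshuffle before every run so each run uses different partitions.
        folds = np.array_split(rng.permutation(len(data)), k)
        fold_scores = []
        for i in range(k):
            train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
            fold_scores.append(score_fn(data[train_idx], data[folds[i]]))
        # Score of one K-fold run: the average over its k folds.
        run_scores.append(np.mean(fold_scores))
    # Final score: the average over all runs.
    return float(np.mean(run_scores))


def toy_score(train, val):
    # Hypothetical stand-in for training and evaluating a model:
    # predict the training mean and report the negative mean squared error.
    prediction = train.mean()
    return -float(np.mean((val - prediction) ** 2))


values = np.random.default_rng(1).normal(size=40)
print(iterated_k_fold(values, k=4, num_iterations=3, score_fn=toy_score))
```

With `k=4` and `num_iterations=3`, `score_fn` is called 12 times, which is the P × K cost mentioned above.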

# Things to Keep in Mind
1. **Data representativeness:** Both the training set and the test set should be representative of the data at hand. For example, if you are classifying images of digits and the samples are ordered by class, splitting at a fixed position can leave one set with only certain classes and the other set with the remaining classes, so the test score says little about what the model has learned. Therefore, it is usually recommended to randomly shuffle your data before splitting it into training and testing sets.
2. **The arrow of time:** When predicting the future based on past data, shuffling the data before splitting it into training and testing sets can cause a "temporal leak" by introducing future data into the training set. To avoid this, it's important to ensure that all the data in the test set is from a later time period than the data in the training set.
3. **Redundancy in your data:** Duplicate data points are common in real-world data. If you shuffle such data and split it into training and validation sets, there may be overlap between the two sets, leading to testing on part of the training data, which is undesirable. To avoid this, make sure that the training and validation sets do not overlap.
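
Points 2 and 3 can each be handled with a few lines of NumPy. The sketch below is illustrative only: `chronological_split` and `drop_duplicate_rows` are hypothetical helpers, and the second one assumes that "duplicate" means an exactly identical row, which may be too strict for real-world data.

```python
import numpy as np


def chronological_split(samples, timestamps, test_fraction=0.2):
    """Split without shuffling so that no test sample is older than any training sample."""
    samples = np.asarray(samples)
    order = np.argsort(timestamps, kind="stable")  # oldest first
    num_test = int(len(samples) * test_fraction)
    split = len(samples) - num_test
    return samples[order[:split]], samples[order[split:]]


def drop_duplicate_rows(samples):
    """Remove exact duplicate rows, keeping the first occurrence of each in original order."""
    samples = np.asarray(samples)
    _, first_idx = np.unique(samples, axis=0, return_index=True)
    return samples[np.sort(first_idx)]


timestamps = np.array([3, 1, 2, 5, 4])
samples = np.array([[30], [10], [20], [50], [40]])
train, test = chronological_split(samples, timestamps, test_fraction=0.4)
print(train.ravel(), test.ravel())  # [10 20 30] [40 50]

rows = np.array([[1, 2], [3, 4], [1, 2]])
print(drop_duplicate_rows(rows).tolist())  # [[1, 2], [3, 4]]
```

Removing duplicates before splitting guarantees that an identical sample cannot end up on both sides of the split.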