# Validation and Testing

## Why do we need to split the data into training and testing sets?

- It is not needed at all; we can just use all the data for both training and testing
- So that the model can be evaluated on data it has not seen before, which shows how well it generalizes ***
- So that we can use the testing set to tune the hyperparameters of the model
- So that we have a set of data that can be discarded because it could be contaminated
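
As a rough illustration of why held-out data matters, the sketch below compares accuracy on the training set with accuracy on the test set. The choice of `DecisionTreeClassifier` and the fixed `random_state=0` are assumptions made for this example only; they are not part of the quiz.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# An unconstrained decision tree can memorize the training set.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = model.score(X_train, y_train)  # accuracy on data the model has seen
test_acc = model.score(X_test, y_test)     # accuracy on held-out data
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
```

Only the test score says anything about how the model behaves on new data; the training score alone can look perfect even when the model has simply memorized its inputs.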

## What is happening in the next lines of code?

```python
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer

cancer_data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer_data.data, cancer_data.target, test_size=0.2)
```

- We are loading the breast cancer dataset and splitting it into training and testing sets, using only 20% of the whole dataset
- We are loading the breast cancer dataset and splitting it into training and testing sets, using only 80% of the whole dataset
- We are loading the breast cancer dataset and splitting it into training and testing sets, with 80% of the data for training and 20% for testing ***
- We are loading the breast cancer dataset and splitting it into training and testing sets, with 20% of the data for training and 80% for testing


## We can split the test set into a validation set and a test set. What is the purpose of the validation set?

- The validation set gives us additional data for the training set
- The validation set helps us obtain a model that is not biased towards the training set
- The validation set helps us obtain a model that is not biased towards the test set ***
- The validation set won't make any difference when comparing models; it is just a set of data that we can discard in case the test set is contaminated


## What is cross-validation?

- A method to split the data into training and testing sets
- A method to split the data into training, validation and testing sets
- A method to split the data into training and validation sets
- A method to repeatedly split the data into training and testing sets, holding out a different part of the data each time ***
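
One common form of cross-validation is k-fold, sketched below with scikit-learn's `KFold` and `cross_val_score`. The model (a scaled logistic regression), the number of folds and the seed are assumptions chosen for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# 5 folds: each fold is held out once while the other 4 are used for fitting.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

scores = cross_val_score(model, X, y, cv=cv)  # one accuracy score per fold
print(scores, scores.mean())
```

Averaging the per-fold scores gives an estimate that depends less on one particular split than a single train/test partition does.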

## Assuming we have a dataset with 300 samples, what will be the length of the training set, validation set and test set if we use the following code?

```python
from sklearn.model_selection import train_test_split

# X and y are assumed to hold the 300 samples and their labels.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
X_test, X_val, y_test, y_val = train_test_split(X_test, y_test, test_size=0.5)
```

- `210, 30, 60`
- `210, 45, 45`
- `240, 30, 30` ***
- `240, 45, 15`
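
The arithmetic can be checked directly: 20% of 300 is 60 held-out samples, and splitting those 60 in half leaves 30 for each of the two smaller sets. The sketch below uses placeholder arrays of zeros (an assumption, since the question does not define `X` and `y`) and fixed seeds.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 300 samples, 4 features (the values are irrelevant here).
X = np.zeros((300, 4))
y = np.zeros(300)

# First split: 300 -> 240 training + 60 held out.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
# Second split: 60 held out -> 30 test + 30 validation.
X_test, X_val, y_test, y_val = train_test_split(
    X_test, y_test, test_size=0.5, random_state=0
)

print(len(X_train), len(X_val), len(X_test))  # 240 30 30
```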

## In summary, what is the purpose of the training, validation and test sets?

- The training set to select the best model, the validation set to evaluate the final model on unseen data and the test set to optimize the model
- The training set to evaluate the final model on unseen data, the validation set to select the best model and the test set to optimize the model
- The training set to optimize the model, the validation set to evaluate the final model on unseen data and the test set to select the best model
- The training set to optimize the model, the validation set to select the best model and the test set to evaluate the final model on unseen data ***
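
The three roles can be sketched as a small workflow. The 60/20/20 split, the `KNeighborsClassifier` model and the candidate values of `n_neighbors` are all assumptions chosen for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=0
)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0
)

# Training set: fit (optimize) one model per candidate hyperparameter value.
candidates = {
    k: KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    for k in (1, 5, 15)
}

# Validation set: select the best candidate.
best_k = max(candidates, key=lambda k: candidates[k].score(X_val, y_val))

# Test set: evaluate the selected model once, on data used for nothing else.
test_acc = candidates[best_k].score(X_test, y_test)
print(f"best k: {best_k}, test accuracy: {test_acc:.3f}")
```

Because the test set plays no part in fitting or selection, its score is not inflated by either step.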


## Which of the following strategies is the best way to make your process repeatable?

- Using the same dataset for training and testing
- Using the same random seed for each step that involves randomness ***
- Using the same model with the same hyperparameters
- Using as many different models as possible
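
In scikit-learn, many functions and estimators that involve randomness accept a `random_state` argument. A minimal sketch with toy data (the arrays and the seed value `42` are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Same seed, same split: the two calls return identical partitions.
a_train, a_test, _, _ = train_test_split(X, y, test_size=0.3, random_state=42)
b_train, b_test, _, _ = train_test_split(X, y, test_size=0.3, random_state=42)

print(np.array_equal(a_test, b_test))  # True
```

The same idea applies to any other random step in the process, such as shuffling in cross-validation or model initialization.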

## What is considered as data leakage?

- Having too many features that make the training process slow
- Having information from the test set in the training set ***
- Having too many samples that make the amount of data too big
- Having the data split into sets with different sizes


## Which of the following can cause data leakage?

- Using a random seed when splitting the data into training and testing sets
- Not using a random seed when splitting the data into training and testing sets
- Computing the mean of the whole dataset and then splitting the data into training and testing sets ***
- Computing the mean of the training set and the test set independently after splitting the data
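
One common way to avoid this kind of leakage is to split first and compute preprocessing statistics from the training set only. The sketch below assumes standardization with `StandardScaler` as the preprocessing step and uses synthetic data; both are illustrative choices.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))  # synthetic placeholder data
y = rng.integers(0, 2, size=100)

# Split before computing any statistics.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit the scaler on the training set only, so test-set statistics never
# influence preprocessing; then apply the same transformation to both sets.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# The stored mean is the training mean, not the mean of the whole dataset.
print(np.allclose(scaler.mean_, X_train.mean(axis=0)))
```

Calling `fit` on the full dataset before splitting would let the test samples shape the mean and standard deviation used during training.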
