# K-Fold Cross Validation & Grid Search

## Learning objectives

By the end of this lesson, you should understand:

- Grid search and how to use it to find hyperparameters
- K-Fold cross-validation

## Grid Search

One __algorithmic__ approach for finding hyperparameters is __grid search__.

> In Grid Search, a grid consisting of possible hyperparameters is employed, and each combination is used to learn the algorithm of choice and validate the results.

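Consider the example grid for RandomForest:

| `n_estimators`  | `criterion`  | `min_samples_leaf` |
|:--:|:---:|:---:|
| 10  | "squared_error"  | 2  |
| 50  | "absolute_error"  | 1  |
| 100  |   |   |

Each column lists the candidate values of one hyperparameter (the two criteria were called "mse" and "mae" in scikit-learn releases before 1.0). Every trial takes one value from each column, so that all combinations are checked (`3 * 2 * 2 = 12` here). __Please note that the grid does not have to be a 'square':__ each hyperparameter may have a different number of candidate values, and any number of different hyperparameters can be checked together.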

### Example

Here, we implement a simple Grid Search.

- Create a dictionary containing the hyperparameter names and their allowed values (as a `list`). Use at most `4` values per hyperparameter:
    - `n_estimators` with the values `[10, 50, 100]` (or others, but no more than `1000`).
    - `criterion` (check the possible values in the documentation).
    - `min_samples_leaf` with the values `[2, 1]`.

### Code understanding

- What is the role of `yield`, and when should it be used?
- What do we call a function that contains `yield`?
- What kind of function is `grid_search`?
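As a hint for the questions above, here is a minimal sketch of how `yield` and `yield from` behave. The function names `count_up` and `count_up_twice` are made up for illustration and are not part of the lesson code.

```python
import typing


def count_up(limit: int) -> typing.Iterator[int]:
    """Produce values lazily, one per request."""
    for value in range(limit):
        # Execution pauses here and resumes when the next value is requested
        yield value


def count_up_twice(limit: int) -> typing.Iterator[int]:
    """`yield from` delegates to another iterable and yields each of its items."""
    yield from count_up(limit)
    yield from count_up(limit)


generator = count_up(3)
first = next(generator)  # runs the body up to the first `yield`
rest = list(generator)  # consumes the remaining values
twice = list(count_up_twice(2))
print(first, rest, twice)
```

Nothing is computed until a value is requested, which is why this style suits a search over many combinations.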

In [None]:
import itertools
import typing


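def grid_search(hyperparameters: typing.Dict[str, typing.Iterable]):
    """Yield one dict per combination of the given hyperparameter values."""
    keys, values = zip(*hyperparameters.items())
    # itertools.product generates the Cartesian product of all value lists
    yield from (dict(zip(keys, v)) for v in itertools.product(*values))


grid = {
    "n_estimators": [10, 50, 100],
    # Named "mse" and "mae" before scikit-learn 1.0; those names were removed in 1.2
    "criterion": ["squared_error", "absolute_error"],
    "min_samples_leaf": [2, 1],
}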

for i, hyperparams in enumerate(grid_search(grid)):
    print(i, hyperparams)

### Grid search evaluation

Now that we have a grid-creating function, we can evaluate RandomForest:

In [None]:
import numpy as np
from sklearn import datasets
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

# Load the data and hold out 20% of it as the validation set
X, y = datasets.fetch_california_housing(return_X_y=True)
X_train, X_validation, y_train, y_validation = train_test_split(X, y, test_size=0.2)

best_hyperparams, best_loss = None, np.inf

for hyperparams in grid_search(grid):
    model = RandomForestRegressor(**hyperparams)
    model.fit(X_train, y_train)

    y_validation_pred = model.predict(X_validation)
    validation_loss = mean_squared_error(y_validation, y_validation_pred)

    print(f"H-Params: {hyperparams} Validation loss: {validation_loss}")
    if validation_loss < best_loss:
        best_loss = validation_loss
        best_hyperparams = hyperparams

print(f"Best loss: {best_loss}")
print(f"Best hyperparameters: {best_hyperparams}")

### Analysis

Above, you can see how to find the best model based on the validation loss. Note, however, that this method has some drawbacks:

- There is only one validation set (what would happen if we chose a different part of the data as the validation set?).
- Every combination must be checked (this can be costly and inefficient, as checking a __larger number__ of hyperparameter combinations does not guarantee an improved score).

Fortunately, there are some solutions.
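To see the first drawback in practice, the sketch below repeats a single train/validation split with different random seeds and prints the validation loss each time. It uses a small synthetic dataset from `make_regression` and a plain `LinearRegression` model; both are assumptions chosen to keep the example fast and are not the lesson's setup.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Small synthetic regression problem (assumption: stands in for real data)
X_toy, y_toy = make_regression(n_samples=100, n_features=5, noise=20.0, random_state=0)

split_losses = []
for seed in range(5):
    # Only the split changes between iterations; the model type stays the same
    X_tr, X_val, y_tr, y_val = train_test_split(
        X_toy, y_toy, test_size=0.2, random_state=seed
    )
    toy_model = LinearRegression().fit(X_tr, y_tr)
    split_losses.append(mean_squared_error(y_val, toy_model.predict(X_val)))

for seed, split_loss in enumerate(split_losses):
    print(f"seed={seed} validation loss={split_loss:.1f}")
print(f"spread: {max(split_losses) - min(split_losses):.1f}")
```

The model and the data are identical in every iteration, so any difference between the printed losses comes purely from which samples ended up in the validation set.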

## K-Fold Cross Validation

> K-Fold splits the dataset into `K` parts (folds); each part is used once for validation, while the remaining parts are used for training.

The illustration below will aid your understanding:

![](images/k-fold.svg)

### Algorithmic flow

1. Split the dataset into `K` parts (`K` is predefined, usually `5` or `3`).
2. Set `i=0`.
3. Take `parts[i]` as the validation set and the rest as the training set.
4. Train on the training dataset.
5. Calculate the metrics (in our case, the loss) on the validation set and save them.
6. Increment `i` and repeat from step 3 until the last part has been used as the validation set.
7. Take the mean of the validation results.
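The splitting part of this flow can be illustrated on six toy samples. The sketch below uses scikit-learn's `KFold` purely to print which indices land in each set; the variable names `samples` and `folds` are illustrative.

```python
import numpy as np
from sklearn.model_selection import KFold

samples = np.arange(6)  # six toy samples, identified by their index

folds = []
for train_idx, validation_idx in KFold(n_splits=3).split(samples):
    # Each part is the validation set exactly once; the rest is training data
    folds.append((train_idx.tolist(), validation_idx.tolist()))
    print(f"train={train_idx.tolist()} validation={validation_idx.tolist()}")
```

With the default settings (no shuffling), the validation sets are the consecutive blocks `[0, 1]`, `[2, 3]` and `[4, 5]`.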

### Reasons for K-Fold

K-Fold is common practice in ML and the de facto standard for the following reason:

- Evaluation on a single validation set can be quite noisy. Depending on which data end up in the validation set, the results may vary (sometimes significantly). This can give us a __false impression of the model's performance__ (usually the model performs worse than expected).

### Drawbacks

K-Fold is not standard in the deep-learning community because it is __time-intensive__. For `5` splits, we have to fit `5` separate models. This is not feasible for larger models (which may be trained for weeks).

### Example

Here, we will implement K-Fold from scratch. As in the Grid Search case, we will write it as a generator, this time with a plain __`yield`__ inside a loop rather than `yield from`.

- Define a `k_fold` function taking `dataset` and `n_splits` (the number of splits). Annotate `n_splits` as `int` and set its default value to `5`.
- Use [`np.array_split`](https://numpy.org/doc/stable/reference/generated/numpy.array_split.html) to split the `dataset` into `n_splits` __chunks__.
- In a `for` loop over `i`:
    - use the `i`-th chunk as the validation dataset.
    - collect the remaining chunks: those before index `i` and those from index `i + 1` onwards.
    - concatenate these chunks to create the training data.
    - `yield` a tuple containing the `training` and `validation` datasets.

Finally, inspect the code carefully:

In [None]:
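def k_fold(dataset, n_splits: int = 5):
    """Yield (training, validation) pairs, using each chunk once for validation."""
    chunks = np.array_split(dataset, n_splits)
    for i in range(n_splits):
        # All chunks except the i-th one form the training data
        training = chunks[:i] + chunks[i + 1 :]
        validation = chunks[i]
        yield np.concatenate(training), validation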

# K-Fold evaluation
loss = 0
n_splits = 5
for (X_train, X_validation), (y_train, y_validation) in zip(
    k_fold(X, n_splits), k_fold(y, n_splits)
):
    # Hyperparameters?
    model = RandomForestRegressor()
    model.fit(X_train, y_train)

    y_validation_pred = model.predict(X_validation)
    fold_loss = mean_squared_error(y_validation, y_validation_pred)
    loss += fold_loss
    print(f"Fold loss: {fold_loss}")

# We divide by the number of splits to obtain the mean
print(f"K-Fold estimated loss: {loss / n_splits}")
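As a quick sanity check, `k_fold` can be run on a toy array to confirm that the training and validation sets never overlap and that every sample is validated on exactly once. The function is repeated here only so that the snippet runs on its own; `toy` and `seen` are illustrative names.

```python
import numpy as np


def k_fold(dataset, n_splits: int = 5):
    # Same logic as the `k_fold` defined above, repeated for self-containment
    chunks = np.array_split(dataset, n_splits)
    for i in range(n_splits):
        yield np.concatenate(chunks[:i] + chunks[i + 1 :]), chunks[i]


toy = np.arange(10)
seen = []
for training, validation in k_fold(toy, n_splits=3):
    # No sample may appear in both sets of the same fold
    assert not set(training.tolist()) & set(validation.tolist())
    seen.extend(validation.tolist())
    print(f"train size={len(training)} validation={validation.tolist()}")

# Across all folds, every sample was used for validation exactly once
assert sorted(seen) == toy.tolist()
```

Note that the chunks are contiguous: this `k_fold` does not shuffle. If the rows are ordered in some meaningful way, it may be worth shuffling `X` and `y` with the same permutation before splitting.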

### Analysis

- As you can observe, the loss can vary __considerably__ from fold to fold: compare the largest and the smallest fold losses printed above.
- Averaging over the folds gives a more reliable estimate than a single validation split.

> Regarding hyperparameter search, Grid Search can be mixed with K-Fold validation. This is the standard approach, and it is offered by `sklearn`.

Combining them only requires two nested loops. This also makes it clearer why the approach is rarely used in deep learning:

- The number of models to fit is `O(n*m)`, where `n` is the number of hyperparameter combinations and `m` is the number of splits we intend to use (in our simple case, `3 * 2 * 2 = 12` combinations times `5` splits gives `60` models).
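For reference, the scikit-learn version of this combination is `GridSearchCV`. The sketch below is a minimal illustration on a small synthetic dataset with a deliberately tiny grid; the data, the grid values and the variable names are assumptions chosen so that it runs quickly.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Small synthetic dataset (assumption: stands in for the lesson's data)
X_toy, y_toy = make_regression(n_samples=120, n_features=5, noise=10.0, random_state=0)

# Deliberately tiny grid so the example runs quickly
param_grid = {"n_estimators": [5, 10], "min_samples_leaf": [1, 2]}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=3,  # number of folds
    scoring="neg_mean_squared_error",  # scores are maximised, hence the negated MSE
)
search.fit(X_toy, y_toy)

n_fits = len(search.cv_results_["params"]) * search.n_splits_
print(f"Combinations x folds = {n_fits} fits")
print(f"Best hyperparameters: {search.best_params_}")
print(f"Best mean MSE: {-search.best_score_:.2f}")
```

Here `4` combinations and `3` folds give `12` fits, which mirrors the `n*m` count above. `GridSearchCV` also accepts an `n_jobs` argument to spread the fits over several cores.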

## Grid Search + K-Fold

### Example

Here, we will perform a Grid Search with K-Fold within.

- Iterate over `grid_search(grid)` in a `for` loop.
    - Set `loss` to `0`.
    - Iterate over the training and validation datasets via `zip` as we saw above.
        - Create the `RandomForestRegressor` model with `hyperparams` via __dictionary unpacking__ with `**` (check online for more information).
        - Fit the model to the training data.
        - Predict on the validation dataset.
        - Calculate the MSE loss for this fold.
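        - Add the fold loss to `loss`.
    - Divide `loss` by `n_splits` to obtain the mean loss for this combination.
    - Keep the combination with the lowest mean loss.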

In [None]:
# K-Fold evaluation
best_hyperparams, best_loss = None, np.inf
n_splits = 5
# Grid search goes first
for hyperparams in grid_search(grid):
    loss = 0
    # Instead of validation we use K-Fold
    for (X_train, X_validation), (y_train, y_validation) in zip(
        k_fold(X, n_splits), k_fold(y, n_splits)
    ):
        model = RandomForestRegressor(**hyperparams)
        model.fit(X_train, y_train)

        y_validation_pred = model.predict(X_validation)
        fold_loss = mean_squared_error(y_validation, y_validation_pred)
        loss += fold_loss
    # Take the mean of all the folds as the final validation score
    total_loss = loss / n_splits
    print(f"H-Params: {hyperparams} Loss: {total_loss}")
    if total_loss < best_loss:
        best_loss = total_loss
        best_hyperparams = hyperparams

# See the final results
print(f"Best loss: {best_loss}")
print(f"Best hyperparameters: {best_hyperparams}")

### Analysis

- In our run, the results changed drastically: a different criterion and a different number of estimators came out as the best.
- Although the results may still vary between runs, they fluctuate considerably less than in the single-split case.
- In our run, the estimated loss was also considerably worse (more than double) than the single-split estimate.
- The results provide a __better estimation__ of the algorithm's behaviour.

## K-Fold Variants

There are many variants of K-Fold. As always, we urge you to explore them on your own. Here are some variants:

- [Stratified K-Fold](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html): used when __classification__ is carried out with an __unbalanced__ dataset (e.g. `20` examples for class `0` and `500` for class `1`).
- [Time Series cross validation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html): should be used when working with time data (where events follow each other).
- Leave One Out Cross Validation (LOOCV): similar to K-Fold, but the `validation` set is a single sample.
    - Highly inefficient as we have to fit as many models as there are samples (over `20,000` for the California housing data used here).
    - Can provide noisy (high-variance) estimates of the performance.
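The first two variants can be illustrated on toy data. The sketch below only inspects the indices produced by scikit-learn's `StratifiedKFold` and `TimeSeriesSplit`; the toy labels and the variable names are made up for the example.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, TimeSeriesSplit

# Imbalanced toy labels: 4 samples of class 0 and 8 samples of class 1
labels = np.array([0] * 4 + [1] * 8)
features = np.zeros((len(labels), 1))  # placeholder features; only labels matter here

class_counts = []
for _, validation_idx in StratifiedKFold(n_splits=4).split(features, labels):
    # Every validation fold keeps the 1:2 class ratio of the full dataset
    class_counts.append(np.bincount(labels[validation_idx]).tolist())
print(class_counts)

time_folds = []
for train_idx, validation_idx in TimeSeriesSplit(n_splits=3).split(np.arange(6)):
    # Training data always precedes the validation data in time
    time_folds.append((train_idx.tolist(), validation_idx.tolist()))
print(time_folds)
```

In the stratified case, each validation fold contains one sample of class `0` and two of class `1`. In the time-series case, the training window grows while the validation sample always lies after it.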

## More On Grid Search

### Merits

- Easy parallelisation (each hyperparameter combination can be run on a different core or even a different computer).
- Simple and useful when the __search space__ (set of all the hyperparameters that might be worth examining) is small.

### Demerits

- Every combination of hyperparameters is tried, and previous results do not inform the next trials, so bad combinations are tried as well. Explore alternatives, such as genetic algorithms, in which each trial provides feedback to the hyperparameter-optimising algorithm.
- Requires discrete values. Some algorithms (e.g. [Random Search](https://en.wikipedia.org/wiki/Random_search)) can work on probability distributions, which are sampled to obtain values.
- Impractical for a large hyperparameter space, because the number of combinations grows __exponentially__ with the number of hyperparameters.
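To illustrate the second point, here is a minimal from-scratch sketch of random search, in which each hyperparameter is described by a sampling function rather than a fixed list. The `random_search` function, the `samplers` dictionary and the chosen ranges are all hypothetical and only mirror the style of `grid_search` above.

```python
import random
import typing


def random_search(
    samplers: typing.Dict[str, typing.Callable[[random.Random], typing.Any]],
    n_trials: int,
    seed: int = 0,
) -> typing.Iterator[typing.Dict[str, typing.Any]]:
    """Yield `n_trials` hyperparameter sets, each drawn from the given samplers."""
    rng = random.Random(seed)
    for _ in range(n_trials):
        yield {name: sample(rng) for name, sample in samplers.items()}


# Each hyperparameter is described by a sampling function (ranges are assumptions)
samplers = {
    "n_estimators": lambda rng: rng.randint(10, 100),  # integer in [10, 100]
    "min_samples_leaf": lambda rng: rng.randint(1, 5),  # integer in [1, 5]
    "max_features": lambda rng: rng.uniform(0.1, 1.0),  # continuous fraction
}

trials = list(random_search(samplers, n_trials=4))
for trial in trials:
    print(trial)
```

Unlike the grid, the number of trials is chosen up front and does not grow with the number of hyperparameters. scikit-learn provides a ready-made counterpart in `RandomizedSearchCV`.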

## Using the Optimal Hyperparameters

Having established the best hyperparameters, the model should be __re-trained with them on the whole dataset__.

Validation is no longer required for the following reasons:

- We have already found a good model and estimated its performance on the left-out dataset (for a more rigorous approach, read up on [Nested Cross Validation](https://machinelearningmastery.com/nested-cross-validation-for-machine-learning-with-python/)).
- The __samples are precious__. We intend to use all (or as many as possible) samples to train the model so it gathers more knowledge about the task.

__Note:__ For some models, validation may be required __regardless__, particularly for those that tend to overfit the data (neural networks are the prime example; XGBoost is another example). It is task-specific, and with time, your understanding will improve.

> As a sanity check, you should always leave a small chunk of the data (say `5%`) and validate on that after finding the hyperparameters.

Lastly, save the model for re-use.
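A minimal sketch of these last two steps follows, assuming `joblib` is used for persistence. The synthetic data, the `best_hyperparams` values and the file name are placeholders for the lesson's dataset and search result.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data and hyperparameters (assumptions: stand-ins for the lesson's
# dataset and for the `best_hyperparams` found by the search)
X_all, y_all = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)
best_hyperparams = {"n_estimators": 10, "min_samples_leaf": 2}

# Re-train on the whole dataset with the chosen hyperparameters
final_model = RandomForestRegressor(random_state=0, **best_hyperparams)
final_model.fit(X_all, y_all)

# Persist the fitted model and load it back
model_path = os.path.join(tempfile.mkdtemp(), "random_forest.joblib")
joblib.dump(final_model, model_path)
restored_model = joblib.load(model_path)

same = np.allclose(final_model.predict(X_all), restored_model.predict(X_all))
print(f"Restored model gives identical predictions: {same}")
```

A saved model should be loaded with a scikit-learn version compatible with the one that created it.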

## Conclusion

- __Grid Search__ in conjunction with __K-Fold__ is a simple way to find the hyperparameters of our model.
- __K-Fold__ creates many `(train, validation)` sets. Using it with Grid Search helps us be __far more confident__ about the results.
- __Grid Search__ has its downsides, and it might be worth checking out other options.
- __Always employ domain knowledge__: if you know that some set of hyperparameters will fail, do not try it. Check the most promising combinations first. Check all of them if you have no sensible idea or, even better, ask an expert.