# Hyperparameters, K-Fold Cross Validation & Grid search

## Learning objectives

Understand what is going on with:

- Hyperparameters
- Grid search and how to use it for hyperparameter finding
- K-Fold cross-validation

One of the steps in machine learning system design is optimizing the model. But wait: we said earlier that the model learns its parameters by optimizing the loss function, so what more is there to do? Some settings cannot be learned by the model this way; these are called hyperparameters. We can tune the hyperparameters to find the best model for our problem.

## Hyperparameters

Up to this point we have only dealt with __parameters__ (for example, the parameters of linear regression are its weights).

> Hyperparameters control the learning process. In contrast to parameters, they either cannot or should not be learned from data. They are usually set and fixed before the training process starts.

E.g. the learning rate should not be learned from the data
- finding the best learning rate would require training the model to completion with many different learning rates
- this would be computationally expensive

E.g. the order of polynomial features to pass to the model (e.g. $x^2$, $x^3$) should not be learned from the data
- Hyperparameters like this control the "representational capacity" of the model, i.e. how complex an input-output relationship the model can represent. E.g. can the model represent wavy relationships or only straight-line relationships?
- If this hyperparameter were to be optimised by the model, it would obviously choose to be able to represent more complex input-output relationships so that it can perform best on the data, but this will cause it to overfit to the data and not generalise (more on this later).
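
To make the parameter/hyperparameter split concrete, here is a minimal sketch on made-up data (the sine curve, noise level and the degrees tried are all assumptions chosen for illustration). The polynomial degree is picked by us before fitting, while the weights are learned by `np.polyfit`. Notice that the training error never gets worse as the degree grows, which is exactly why the model would always "choose" more capacity if it could.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 30)
# Synthetic data (an assumption for this sketch): a wavy curve plus noise
y = np.sin(3 * x) + rng.normal(scale=0.1, size=x.shape)

train_errors = {}
for degree in (1, 3, 9):  # hyperparameter: set by us before fitting
    weights = np.polyfit(x, y, deg=degree)  # parameters: learned from the data
    predictions = np.polyval(weights, x)
    train_errors[degree] = float(np.mean((predictions - y) ** 2))
    print(f"degree={degree} training MSE={train_errors[degree]:.4f}")
```

A low training error for the highest degree says nothing about how well that model generalises; that has to be checked on data the model has not seen.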

Other hyperparameters include:
- learning rate
- batch size
- regularisation parameter (more on this later)
- order of the polynomial features included 

Hyperparameters are essential and prevalent in machine learning; see the code cell below:

In [None]:
from sklearn.ensemble import RandomForestRegressor

regressors = [
    # Note: "mae" was renamed to "absolute_error" in scikit-learn 1.0
    RandomForestRegressor(n_estimators=10, criterion="mae"),
    RandomForestRegressor(n_estimators=50, min_samples_leaf=2),
    RandomForestRegressor(),
]

Above we have a single machine learning method called Random Forest, used here for regression (we will get to how it works in the next module).

What interests us are the `__init__` parameters we provided (`n_estimators`, `criterion`, `min_samples_leaf`). __They are examples of hyperparameters__ that you set before fitting the model to data.
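
If you want to see every hyperparameter a scikit-learn estimator exposes, including the defaults you did not set, `get_params()` returns them as a dictionary. A small sketch (the values `50` and `2` are just the ones used above):

```python
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=50, min_samples_leaf=2)

# get_params() returns a dict of all constructor arguments and their current values
params = model.get_params()
print(params["n_estimators"], params["min_samples_leaf"])
print(sorted(params))  # names of all available hyperparameters
```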

What can happen after setting them incorrectly?
- our algorithm may __under/overfit__ (more details later)
- it might not converge __at all__ in some cases

When we do it right (at least more or less) we can observe:
- improved convergence & faster training time
- lower loss & better performance on test data

You can probably tell by now how crucial hyperparameters are. The question is: how do we find good values for them?

Other examples of hyperparameters include:
- batch size
- polynomial degree for linear/logistic regression (we will see more about those)

## Finding hyperparameters

There are a few ways to find them:
- experience - after some time you get an idea of what should work and what might not; this is especially common in deep learning
- algorithmic (we will focus on this one) - we try a set of possible hyperparameters and choose the best one
- mix of both - you know the boundaries that should yield good results (say `64 < batch_size < 1024`) but you are not certain about the exact value, so you employ an algorithmic approach to find it

Later you will get enough info to get started with the last approach.
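
As a tiny illustration of the mixed approach, the bounds from experience can be turned into a short candidate list for an algorithmic search. Restricting the candidates to powers of two is an assumption made for this sketch (a common convention for batch sizes), not a requirement.

```python
lower, upper = 64, 1024  # bounds known from experience (both exclusive)

# Assumption: only powers of two are considered as candidates
candidates = [2 ** e for e in range(1, 12) if lower < 2 ** e < upper]
print(candidates)  # [128, 256, 512]
```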

## Grid Search

One __algorithmic__ way to find hyperparameters is __grid search__:

> Grid Search takes a grid of possible hyperparameter values; each combination is used to train the algorithm of choice and validate the results

An example grid for the RandomForest you saw above might be:

| `n_estimators`  | `criterion`  | `min_samples_leaf` |
|:--:|:---:|:---:|
| 10  | "mse"  | 2  |
| 50  | "mae"  | 1  |
| 100  |   |   |

Each trial is built by picking one value from every column, so that all combinations are checked. __Please notice the grid doesn't have to be "square"__: each hyperparameter can have a different number of candidate values, and any number of different hyperparameters can be checked together.
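
The number of combinations is simply the product of the column lengths. A quick check for the grid in the table (the dictionary name `example_grid` is just for this sketch):

```python
import math

example_grid = {
    "n_estimators": [10, 50, 100],
    "criterion": ["mse", "mae"],
    "min_samples_leaf": [2, 1],
}

# One factor per hyperparameter: 3 * 2 * 2
n_combinations = math.prod(len(values) for values in example_grid.values())
print(n_combinations)  # 12
```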

## Exercise

Let's implement a simple Grid Search. To outline what we have to do:
- Create a dictionary containing hyperparameter names and their allowed values (as a `list`). Use at most `4` values:
    - `n_estimators` with values `[10, 50, 100]` (or others, no more than `1000`)
    - `criterion` (check possible values)
    - `min_samples_leaf` with values `[2, 1]`

### Code understanding

- What is `yield` for? 
- When should we use it? 
- What is the name of the function which ends with `yield`?
- Can you explain the `grid_search` function? (We will want someone to try and explain as much as they can)
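
If you want something small to experiment with before answering, here is a toy function that uses `yield` (the name `countdown` is made up for this sketch). Try stepping through it with `next` and see when the function body actually runs.

```python
def countdown(start: int):
    """Produce start, start - 1, ..., 1, one value at a time."""
    while start > 0:
        yield start  # execution pauses here until the next value is requested
        start -= 1


gen = countdown(3)
first = next(gen)  # runs the body up to the first yield
rest = list(gen)   # resumes where it paused and collects the remaining values
print(first, rest)  # 3 [2, 1]
```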

In [None]:
import itertools
import typing


def grid_search(hyperparameters: typing.Dict[str, typing.Iterable]):
    keys, values = zip(*hyperparameters.items())
    yield from (dict(zip(keys, v)) for v in itertools.product(*values))


grid = {
    "n_estimators": [10, 50, 100],
    # Note: named "squared_error" / "absolute_error" in scikit-learn >= 1.0
    "criterion": ["mse", "mae"],
    "min_samples_leaf": [2, 1],
}

for i, hyperparams in enumerate(grid_search(grid)):
    print(i, hyperparams)

### Grid search evaluation

Now that we have our grid creating function we can evaluate RandomForest and see how it works out:

In [None]:
import numpy as np
from sklearn import datasets
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version
X, y = datasets.load_boston(return_X_y=True)
X_train, X_validation, y_train, y_validation = train_test_split(X, y, test_size=0.2)

best_hyperparams, best_loss = None, np.inf

for hyperparams in grid_search(grid):
    model = RandomForestRegressor(**hyperparams)
    model.fit(X_train, y_train)

    y_validation_pred = model.predict(X_validation)
    validation_loss = mean_squared_error(y_validation, y_validation_pred)

    print(f"H-Params: {hyperparams} Validation loss: {validation_loss}")
    if validation_loss < best_loss:
        best_loss = validation_loss
        best_hyperparams = hyperparams

print(f"Best loss: {best_loss}")
print(f"Best hyperparameters: {best_hyperparams}")

## Analysis

Above you can see how one can find the best model based on validation loss. There are some drawbacks to this method though, namely:
- there is only one validation set; what would happen if we chose a different part of the data for validation?
- every combination has to be checked (which can be costly and inefficient, as __a lot__ of the checked hyperparameters will not improve the score)
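
To get a feel for the first drawback, here is a minimal sketch on synthetic data (the linear relationship, the noise level and the 80/20 split are assumptions for illustration). The same model is fitted five times, each time with a different random choice of validation samples, and the validation loss is recorded.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=60)
y = 2 * x + rng.normal(scale=0.5, size=x.shape)  # synthetic data

losses = []
for seed in range(5):
    # A different random 80/20 split for every seed
    order = np.random.default_rng(seed).permutation(len(x))
    val_idx, train_idx = order[:12], order[12:]

    # Fit a straight line on the training part only
    slope, intercept = np.polyfit(x[train_idx], y[train_idx], deg=1)
    val_pred = slope * x[val_idx] + intercept
    losses.append(float(np.mean((val_pred - y[val_idx]) ** 2)))

print([round(loss, 3) for loss in losses])
```

The model and the data are identical in every run; only the split changes, yet the reported loss moves around.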

Luckily there are some solutions...

## K-Fold cross validation

What is it?

> K-Fold splits the dataset into multiple parts; each part is used in turn for validation while the remaining parts are used for training

The picture below will help you understand what this is about:

![](images/k-fold.svg)

This is how it goes algorithmically:
1. split the dataset into `K` parts (`K` is predefined, usually `5` or `3`)
2. set `i=0`
3. take `parts[i]` as the validation set and the rest as training data
4. train on the training part of the data
5. calculate metrics (loss in our case) on the validation set and save them
6. increment `i` and repeat from step 3 until every part has been used for validation
7. take the mean of the validation results
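
Before working with real data it can help to trace the steps on plain indices. The sketch below is only an illustration (the helper name `k_fold_indices` is made up, and it returns a list rather than using `yield`); it shows which samples land in the validation set for each fold.

```python
def k_fold_indices(n_samples: int, k: int) -> list:
    """Return a list of (training_indices, validation_indices), one pair per fold."""
    indices = list(range(n_samples))
    fold_size, remainder = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        # The first `remainder` folds get one extra sample when k does not divide n_samples
        stop = start + fold_size + (1 if i < remainder else 0)
        validation = indices[start:stop]
        training = indices[:start] + indices[stop:]
        folds.append((training, validation))
        start = stop
    return folds


folds = k_fold_indices(n_samples=6, k=3)
for training, validation in folds:
    print("train:", training, "validation:", validation)
```

Every sample appears in exactly one validation set, and in the training set of all the other folds.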

### But why?

K-Fold is a common practice in machine learning and a de facto standard. Here are the reasons:
- evaluation on a single validation set can be really noisy; depending on which data ends up in the validation set, the results may vary (sometimes wildly)
- this gives us a __false impression of model performance__. Usually the model does worse than expected

### But why not?

... It isn't standard in deep learning community though, because:
- K-Fold takes a lot of time. In case of `5` splits we have to fit `5` separate models. For larger models (which may be trained for weeks) it is infeasible

## Exercise

We can implement K-Fold easily from scratch, just as we did with grid search, __BUT THIS TIME YOU WILL DO IT ON YOUR OWN WITH `yield`!__

- Define `k_fold` function taking `dataset` and `n_splits` (which is the number of splits). Set `n_splits` default value to `5` and specify the `type` as `int`
- Use [`np.array_split`](https://numpy.org/doc/stable/reference/generated/numpy.array_split.html) to split `dataset` into `n_splits` __chunks__
- In a `for` loop:
    - Get `i`-th chunk as validation dataset
    - Get the rest of the chunks (those before `i` and those from `i + 1` onwards)
    - Concatenate those chunks creating training data
    - `yield` a tuple containing `training` and `validation` dataset
    
Finally, go over the rest of this code and check what is going on (we will ask one of you to explain what you see!)

In [None]:
def k_fold(dataset, n_splits: int = 5):
    chunks = np.array_split(dataset, n_splits)
    for i in range(n_splits):
        training = chunks[:i] + chunks[i + 1 :]
        validation = chunks[i]
        yield np.concatenate(training), validation

# K-Fold evaluation
loss = 0
n_splits = 5
for (X_train, X_validation), (y_train, y_validation) in zip(
    k_fold(X, n_splits), k_fold(y, n_splits)
):
    # What happened to hyperparameters?
    model = RandomForestRegressor()
    model.fit(X_train, y_train)

    y_validation_pred = model.predict(X_validation)
    fold_loss = mean_squared_error(y_validation, y_validation_pred)
    loss += fold_loss
    print(f"Fold loss: {fold_loss}")

# We divide by number of splits to get the mean
print(f"K-Fold estimated loss: {loss / n_splits}")

## Analysis

- You can see that loss for each fold varies __a lot__. `46` and `8` is a giant difference
- Thanks to K-Fold we have more reliable statistics

But what about hyperparameter search? You've probably guessed it:
> Grid Search can be combined with K-Fold validation. This is the standard approach and `sklearn` provides it out of the box

We just need two nested loops. Now you can probably see the downside more clearly (and why it isn't used in deep learning):
- Complexity will be `O(n*m)` where `n` is the number of hyperparameter combinations and `m` is the number of splits we want to use. That is `60` models in our simple case (`12` combinations times `5` splits)!
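
For reference, this is roughly what the `sklearn` version looks like, using `GridSearchCV`. It is a sketch under assumptions: the data is synthetic (`make_regression`), the grid is deliberately smaller than ours to keep it fast, and the names `X_demo`, `y_demo` and `param_grid` are made up for this example. Note that scikit-learn scorers follow a "higher is better" convention, hence `neg_mean_squared_error` and the sign flip at the end.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data, only for illustration
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

param_grid = {"n_estimators": [10, 50], "min_samples_leaf": [1, 2]}
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_squared_error",  # negated so that higher is better
    cv=5,  # 5-fold cross-validation for every combination
)
search.fit(X_demo, y_demo)

print(f"Best hyperparameters: {search.best_params_}")
print(f"Best mean validation MSE: {-search.best_score_}")
```

Here `4` combinations times `5` folds gives `20` fits, plus one final refit on all the data with the best combination (the default behaviour).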

## Grid Search + K-Fold

Okay, let's mix those folks together in an...

## Exercise

Now we will do Grid Search with K-Fold inside. Here is what you should do:
- Iterate over `grid_search(grid)` in `for` loop
    - Set `loss` to `0`
    - Iterate over training and validation dataset via `zip` as we saw above
        - Create `RandomForestRegressor` model with `hyperparams` via __dictionary unpacking__ with `**` (check the internet for more info)
        - Fit model to training data
        - Predict on validation
        - Calculate mean squared error loss for this fold
        - Add fold loss to `loss`

In [None]:
# K-Fold evaluation
best_hyperparams, best_loss = None, np.inf
n_splits = 5
# Grid search goes first
for hyperparams in grid_search(grid):
    loss = 0
    # Instead of validation we use K-Fold
    for (X_train, X_validation), (y_train, y_validation) in zip(
        k_fold(X, n_splits), k_fold(y, n_splits)
    ):
        model = RandomForestRegressor(**hyperparams)
        model.fit(X_train, y_train)

        y_validation_pred = model.predict(X_validation)
        fold_loss = mean_squared_error(y_validation, y_validation_pred)
        loss += fold_loss
    # Take mean from all folds as final validation score
    total_loss = loss / n_splits
    print(f"H-Params: {hyperparams} Loss: {total_loss}")
    if total_loss < best_loss:
        best_loss = total_loss
        best_hyperparams = hyperparams

# And see our final results
print(f"Best loss: {best_loss}")
print(f"Best hyperparameters: {best_hyperparams}")

## Analysis

- Our results changed drastically: a different criterion and a different number of estimators come out best.
- The results may still vary from run to run, but far less (in magnitude) than before
- The estimated loss is also much worse (more than double)
- __But our results better estimate the algorithm's behaviour__. Without K-Fold we would be happy with a loss of around `9`, only to be negatively surprised much later. __It is good to know beforehand!__

## K-Fold variants

There are a lot of variants of K-Fold, as always we urge you to explore more on your own, here are some of them:
- [Stratified K-Fold](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html) used when we do __classification__ and our dataset is __unbalanced__ (e.g. `20` examples for class `0` and `500` for class `1`)
- [Time Series cross validation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html) - should be used when we work with time data (where events follow one another)
- Leave One Out Cross Validation (LOOCV) - just like K-Fold, but each `validation` set is a single sample. Very inefficient, as we have to fit as many models as there are samples (over `500` in our case!), and the resulting estimate can be noisy (high variance)

And many, many more; go and search if you are interested (also see the challenges if you want some guidance for your own research).
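
As a quick illustration of the stratified variant, here is a sketch that reuses the class counts from the example above (`20` samples of class `0` and `500` of class `1`). The features are placeholders, since only the labels matter for how the split is made.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

labels = np.array([0] * 20 + [1] * 500)
features = np.zeros((len(labels), 1))  # placeholder features, not used for stratification

splitter = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
minority_counts = []
for train_idx, val_idx in splitter.split(features, labels):
    # Count how many minority-class samples ended up in this validation fold
    minority_counts.append(int((labels[val_idx] == 0).sum()))

print(minority_counts)  # [4, 4, 4, 4, 4]
```

Each validation fold keeps the original class ratio, so no fold ends up without minority-class samples.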

## More about Grid Search

### Positives

- Can be easily parallelized (each hyperparameter combination can be run on a different core or even a different computer)
- Simple and useful when __search space__ (set of all hyperparameters which might be worth checking out) is small

### Negatives

- Every combination of hyperparameters is tried, and previous results do not inform the next trials. Hence bad combinations will be tried (and most are bad, trust me). You can check out other approaches, such as genetic algorithms, where each trial gives feedback to the hyperparameter optimizing algorithm
- Needs discrete values. Some algorithms (e.g. [Random Search](https://en.wikipedia.org/wiki/Random_search)) can work on probability distributions, which are in turn sampled to get values
- Impractical for large hyperparameter spaces, because the number of combinations grows __exponentially__ with the number of hyperparameters.
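
To show what "sampling from distributions" can look like, here is a minimal random-search style sketch using only the standard library. The ranges, the extra `max_features` hyperparameter and the helper name `sample_hyperparams` are assumptions made for illustration; each sampled dictionary could be passed to `RandomForestRegressor(**hyperparams)` just like the output of `grid_search`.

```python
import random


def sample_hyperparams(rng: random.Random) -> dict:
    """Draw one hyperparameter combination from (assumed) ranges."""
    return {
        "n_estimators": rng.randint(10, 1000),    # integer, both ends inclusive
        "min_samples_leaf": rng.randint(1, 10),
        "max_features": rng.uniform(0.1, 1.0),    # continuous: a fraction of the features
    }


rng = random.Random(0)  # seeded so the trials are reproducible
trials = [sample_hyperparams(rng) for _ in range(5)]
for hyperparams in trials:
    print(hyperparams)
```

The number of trials is chosen by us and does not grow with the number of hyperparameters, and continuous values need no discretisation.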


## What to do with best hyperparameters

Now that we have some confidence that our model is the best one, we should take the hyperparameters we've found and __train the model on the whole dataset once more__.

Why no more validation? Because:
- We already found a good model and estimated how it performs on a left-out set (for a more rigorous approach check out [Nested Cross Validation](https://machinelearningmastery.com/nested-cross-validation-for-machine-learning-with-python/))
- __Samples are precious__. We want to use all (or as many as possible) samples to train our model so it gathers more knowledge about the task.

For some models validation might be required __anyway__, especially for those which tend to overfit on our data (neural networks are the prime example but XGBoost can be considered as such as well). It is task specific and you will get to know more later.

> As a sanity check you might always leave some small part of data (`5%` say) and validate on that after you've found your hyperparameters

As a last step, we should save our model for later re-use.
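
A minimal sketch of these last two steps, assuming synthetic data and a made-up search result (`best` below is hypothetical, not the outcome of our search). The model is trained on all available samples and then serialised with the standard library `pickle` module; in practice you would write the bytes to a file.

```python
import pickle

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for "the whole dataset"
rng = np.random.default_rng(0)
X_all = rng.normal(size=(100, 3))
y_all = X_all @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

best = {"n_estimators": 50, "min_samples_leaf": 2}  # hypothetical search result
final_model = RandomForestRegressor(**best, random_state=0).fit(X_all, y_all)

blob = pickle.dumps(final_model)  # use pickle.dump(final_model, file) to save to disk
restored = pickle.loads(blob)

# The restored model makes the same predictions as the original one
print(np.array_equal(final_model.predict(X_all), restored.predict(X_all)))
```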

## Summary

- Hyperparameters are parameters controlling behaviour of our algorithm __which cannot be learned__ (or cannot be at the current stage of our knowledge)
- __Grid Search__ in conjunction with __K-Fold__ can be simply used to find hyperparameters of our model
- __K-Fold__ creates many `(train, validation)` sets. Using it with Grid Search helps us be __way more confident__ about the results
- __Grid Search__ has its downsides and it might be worth checking other options out
- __Always employ domain knowledge__ - if you know some set of hyperparameters will fail, do not try them out. Check the most promising combinations. Check all if you have no sensible idea or, even better, ask an expert.