<a href="https://colab.research.google.com/github/ShaunakSen/Data-Science-and-Machine-Learning/blob/master/Hyperopt_A_Conceptual_Explanation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## A Conceptual Explanation of Bayesian Hyperparameter Optimization for Machine Learning


> Based on the article by Will Koehrsen: https://towardsdatascience.com/a-conceptual-explanation-of-bayesian-model-based-hyperparameter-optimization-for-machine-learning-b8172278050f

---

Following are four common methods of hyperparameter optimization for machine learning in order of increasing efficiency:

1. Manual
2. Grid search
3. Random search
4. Bayesian model-based optimization

![](https://miro.medium.com/max/2000/1*E0_THdPH2NfKB37JUQB8Eg.png)

Validation Errors comparing random search and a model based approach on LFW (left) and PubFig83 (right)

These figures compare validation error for hyperparameter optimization of an image classification neural network, with random search in grey and Bayesian optimization (using the Tree-structured Parzen Estimator, or TPE) in green. Lower is better: a smaller validation set error generally means better test set performance, and a smaller number of trials means less time invested. Clearly, there are significant advantages to Bayesian methods, and these graphs, along with other impressive results, convinced me it was time to take the next step and learn model-based hyperparameter optimization.

> The one-sentence summary of Bayesian hyperparameter optimization is: build a probability model of the objective function and use it to select the most promising hyperparameters to evaluate in the true objective function.

### Hyperparameter Optimization

The aim of hyperparameter optimization in machine learning is to find the hyperparameters of a given machine learning algorithm that return the best performance as measured on a validation set. (Hyperparameters, in contrast to model parameters, are set by the machine learning engineer before training. The number of trees in a random forest is a hyperparameter while the weights in a neural network are model parameters learned during training. I like to think of hyperparameters as the model settings to be tuned.)


Hyperparameter optimization is represented in equation form as:

![](https://miro.medium.com/max/780/1*QR4_VOfAAWLVe2I0nqwtTg.png)

Here f(x) represents an objective score to minimize, such as RMSE or error rate, evaluated on the validation set; x* is the set of hyperparameters that yields the lowest value of the score, and x can take on any value in the domain X. **In simple terms, we want to find the model hyperparameters that yield the best score on the validation set metric.**
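For readers who cannot load the image, the same statement can be written in LaTeX. This is reconstructed from the description above, with $\mathcal{X}$ standing for the domain X:

$$
x^* = \arg\min_{x \in \mathcal{X}} f(x)
$$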

The problem with hyperparameter optimization is that evaluating the objective function to find the score is extremely expensive. Each time we try different hyperparameters, we have to train a model on the training data, make predictions on the validation data, and then calculate the validation metric. With a large number of hyperparameters and complex models such as ensembles or deep neural networks that can take days to train, this process quickly becomes intractable to do by hand!
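To make the train-predict-evaluate cycle concrete, here is a minimal sketch of an objective function. Everything in it is an illustrative assumption rather than part of the original article: the synthetic data, the tiny k-nearest-neighbours regressor, and the names `make_data`, `knn_predict`, and `objective`. A real objective would train a far more expensive model, which is exactly why each call is costly.

```python
import math
import random


def make_data(n, rng):
    """Synthetic 1-D regression data: y = sin(x) + noise (purely illustrative)."""
    xs = [rng.uniform(0, 6) for _ in range(n)]
    return [(x, math.sin(x) + rng.gauss(0, 0.2)) for x in xs]


def knn_predict(train, x, k):
    """Predict by averaging the targets of the k training points closest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def objective(k, train, valid):
    """Validation RMSE for one hyperparameter setting k: the f(x) to minimize."""
    # "Training" k-NN only stores the data; a real model would be fitted here.
    squared_errors = [(knn_predict(train, x, k) - y) ** 2 for x, y in valid]
    return math.sqrt(sum(squared_errors) / len(valid))


rng = random.Random(0)
train, valid = make_data(200, rng), make_data(50, rng)

# Every entry costs one full train-predict-evaluate cycle.
scores = {k: objective(k, train, valid) for k in (1, 5, 25, 100)}
best_k = min(scores, key=scores.get)
```

Here `k` plays the role of x and the returned RMSE plays the role of f(x); trying four values of `k` already means four complete evaluations.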

Grid search and random search are slightly better than manual tuning because we set up a grid of model hyperparameters and run the train-predict-evaluate cycle automatically in a loop while we do more productive things (like feature engineering). However, even these methods are relatively inefficient because they do not choose the next hyperparameters to evaluate based on previous results. **Grid and random search are completely uninformed by past evaluations, and as a result, often spend a significant amount of time evaluating “bad” hyperparameters.**

For example, if we have the following graph with a lower score being better, where does it make sense to concentrate our search? If you said below 200 estimators, then you already have the idea of Bayesian optimization! We want to focus on the most promising hyperparameters, and if we have a record of evaluations, then it makes sense to use this information for our next choice.

![](https://miro.medium.com/max/1120/1*MiNXGrkk5BbjfkNAXZQSNA.png)

Random and grid search pay no attention to past results at all and would keep searching across the entire range of the number of estimators even though it’s clear the optimal answer (probably) lies in a small region!
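The uninformed nature of both methods is easy to see in code. The sketch below is a hedged illustration, not the article's implementation: `objective` is a made-up stand-in for validation error (lowest near 150 estimators and a depth of 10), and the grid values are chosen only for the example.

```python
import itertools
import random


def objective(params):
    """Hypothetical stand-in for validation error; a real one would train a model."""
    return ((params["n_estimators"] - 150) / 100) ** 2 + (
        (params["max_depth"] - 10) / 10
    ) ** 2


grid = {
    "n_estimators": [100, 200, 300, 400, 500, 600],
    "max_depth": [2, 5, 10, 15, 20],
}


def grid_search(grid):
    """Evaluate every combination; the trial list is fixed before any score is seen."""
    keys = list(grid)
    trials = [dict(zip(keys, values)) for values in itertools.product(*grid.values())]
    return min(trials, key=objective), len(trials)


def random_search(grid, n_trials, seed=0):
    """Evaluate random combinations; each draw ignores all earlier scores."""
    rng = random.Random(seed)
    trials = [
        {key: rng.choice(values) for key, values in grid.items()}
        for _ in range(n_trials)
    ]
    return min(trials, key=objective), len(trials)


best_grid, n_grid = grid_search(grid)
best_random, n_random = random_search(grid, n_trials=10)
```

In both functions the list of trials is built without ever consulting a score, so the search spends as much effort at 600 estimators as it does near the promising region.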






### Bayesian Optimization

Bayesian approaches, in contrast to random or grid search, keep track of past evaluation results, which they use to form a probabilistic model mapping hyperparameters to a probability of a score on the objective function:

![](https://miro.medium.com/max/614/1*u00KlxHhd1fz6-Jaeou6PA.png)

In the literature, this model is called a “surrogate” for the objective function and is represented as p(y | x). The surrogate is much easier to optimize than the objective function, so Bayesian methods choose the next set of hyperparameters to evaluate on the actual objective function by selecting the hyperparameters that perform best on the surrogate. In other words:


1. Build a surrogate probability model of the objective function
2. Find the hyperparameters that perform best on the surrogate
3. Apply these hyperparameters to the true objective function
4. Update the surrogate model incorporating the new results
5. Repeat steps 2–4 until max iterations or time is reached
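The five steps above can be sketched as a short loop. This is a deliberately crude illustration under stated assumptions, not a Gaussian Process or TPE implementation: the toy `objective`, the kernel-weighted `surrogate`, the distance-based uncertainty term, and the "mean minus uncertainty" selection rule are all simplifications invented for this example.

```python
import math
import random


def objective(x):
    """Stand-in for an expensive train/validate cycle (hypothetical toy function)."""
    return (x - 2.0) ** 2 + math.sin(5 * x)


def surrogate(x, history, bandwidth=0.5):
    """Cheap model of the objective: a kernel-weighted mean of past scores, plus
    an uncertainty that grows with distance from the nearest evaluated point."""
    weights = [math.exp(-(((x - xi) / bandwidth) ** 2)) for xi, _ in history]
    total = sum(weights)
    if total < 1e-12:  # far from every observation: fall back to the plain mean
        mean = sum(yi for _, yi in history) / len(history)
    else:
        mean = sum(w * yi for w, (_, yi) in zip(weights, history)) / total
    uncertainty = min(abs(x - xi) for xi, _ in history)
    return mean, uncertainty


def smbo(n_iter=20, low=0.0, high=5.0, seed=0):
    rng = random.Random(seed)
    # Step 1: seed the history (and hence the surrogate) with a few random trials.
    history = [(x, objective(x)) for x in (rng.uniform(low, high) for _ in range(3))]

    def acquisition(x):
        mean, uncertainty = surrogate(x, history)
        return mean - uncertainty  # low predicted score or high uncertainty wins

    for _ in range(n_iter):  # Step 5: repeat until the budget is spent
        # Step 2: find the candidate that looks best on the cheap surrogate.
        candidates = [rng.uniform(low, high) for _ in range(200)]
        x_next = min(candidates, key=acquisition)
        # Step 3: spend one expensive evaluation on the true objective.
        y_next = objective(x_next)
        # Step 4: the surrogate reads from the history, so appending updates it.
        history.append((x_next, y_next))
    return min(history, key=lambda pair: pair[1]), history


(best_x, best_y), history = smbo()
```

Note that the loop calls `objective` only 23 times in total, while `surrogate` is called thousands of times; that trade is the whole point of the method.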


The aim of Bayesian reasoning is to become “less wrong” with more data, which these approaches do by continually updating the surrogate probability model after each evaluation of the objective function.

At a high level, Bayesian optimization methods are efficient because they choose the next hyperparameters in an informed manner. The basic idea is: **spend a little more time selecting the next hyperparameters in order to make fewer calls to the objective function**. In practice, the time spent selecting the next hyperparameters is inconsequential compared to the time spent in the objective function. By evaluating hyperparameters that appear more promising from past results, Bayesian methods can find better model settings than random search in fewer iterations.

If there’s one thing to take away from this article it’s that Bayesian model-based methods can find better hyperparameters in less time because they reason about the best set of hyperparameters to evaluate based on past trials.

For a good visual description of what occurs in Bayesian optimization, take a look at the images below. The first shows an initial estimate of the surrogate model, in black with associated uncertainty in gray, after two evaluations. Clearly, the surrogate model is a poor approximation of the actual objective function, shown in red:

![](https://miro.medium.com/max/1400/1*RQ-pAwQ88yC904QppChGPQ.png)

The next image shows the surrogate function after 8 evaluations. Now the surrogate almost exactly matches the true function. Therefore, if the algorithm selects the hyperparameters that maximize the surrogate, they will likely yield very good results on the true evaluation function.

![](https://miro.medium.com/max/1400/1*bSLAe1LCj3mMKfaZsQWCrw.png)

Bayesian methods have always made sense to me because they operate in much the same way we do: we form an initial view of the world (called a prior) and then we update our model based on new experiences (the updated model is called a posterior). Bayesian hyperparameter optimization takes that framework and applies it to finding the best value of model settings!


### Sequential Model-Based Optimization

Sequential model-based optimization (SMBO) methods are a formalization of Bayesian optimization. The “sequential” refers to running trials one after another, each time trying better hyperparameters by applying Bayesian reasoning and updating a probability model (the surrogate).


There are five aspects of model-based hyperparameter optimization:


1. A domain of hyperparameters over which to search (in Hyperopt, built from `hp` expressions such as `hp.uniform`)
2. An objective function which takes in hyperparameters and outputs a score that we want to minimize (or maximize) (we define this)
3. **The surrogate model of the objective function**
4. **A criterion, called a selection function, for evaluating which hyperparameters to choose next from the surrogate model** (steps 3 and 4 are essentially what goes on under the hood)
5. A history consisting of (score, hyperparameter) pairs used by the algorithm to update the surrogate model (the `Trials` object in Hyperopt)

There are several variants of SMBO methods **that differ in steps 3–4**, namely, how they build a surrogate of the objective function and the criterion used to select the next hyperparameters. Several common choices for the surrogate model are **Gaussian Processes, Random Forest Regressions, and Tree-structured Parzen Estimators (TPE)**, while the most common choice for step 4 is **Expected Improvement**. In this post, we will focus on TPE and Expected Improvement.
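As a rough illustration of what a selection function computes, the sketch below evaluates Expected Improvement for a minimization problem, assuming the surrogate returns a Gaussian prediction (a mean and standard deviation) at each candidate, as a Gaussian Process would. That Gaussian assumption and the function names are choices made for this example; TPE arrives at its selection rule by a different route.

```python
import math


def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)


def normal_cdf(z):
    """Standard normal cumulative distribution."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


def expected_improvement(mean, std, best_so_far):
    """Expected amount by which a new score undercuts best_so_far (minimization),
    assuming the surrogate predicts a Gaussian N(mean, std**2) at this point."""
    if std <= 0:
        # No uncertainty: the improvement is known exactly.
        return max(best_so_far - mean, 0.0)
    z = (best_so_far - mean) / std
    return (best_so_far - mean) * normal_cdf(z) + std * normal_pdf(z)


# Two candidates with the same predicted score, slightly worse than the best so far.
confident = expected_improvement(mean=0.30, std=0.01, best_so_far=0.25)
uncertain = expected_improvement(mean=0.30, std=0.20, best_so_far=0.25)
```

The uncertain candidate scores higher even though both have the same predicted mean, which is how this criterion balances exploiting good regions against exploring poorly understood ones.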

#### Domain

In the case of random search and grid search, the domain of hyperparameters we search is a grid. An example for a random forest is shown below:

``` python

hyperparameter_grid = {
    'n_estimators': [100, 200, 300, 400, 500, 600],
    'max_depth': [2, 5, 10, 15, 20, 25, 30, 35, 40],
    'min_samples_leaf': [1, 2, 3, 4, 5, 6, 7, 8]
}
```

For a model-based approach, the domain consists of probability distributions. As with a grid, *this lets us encode domain knowledge into the search process by placing greater probability in regions where we think the true best hyperparameters lie*. If we wanted to express the above grid as a probability distribution, it may look something like this:

![](https://miro.medium.com/max/552/1*luY6Ahh7uttR4quIcgOCBw.png)

![](https://miro.medium.com/max/552/1*YfoPLKK8_WXIsRaQ7zcSjg.png)

![](https://miro.medium.com/max/552/1*e6cIETdFd1rzD9ivofNJqw.png)

> Here we have a uniform, log-normal, and normal distribution. These are informed by prior practice/knowledge (for example the learning rate domain is usually a log-normal distribution over several orders of magnitude).

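A domain made of distributions can be sketched with Python's standard `random` module. The hyperparameter names, ranges, and distribution parameters below are hypothetical and chosen only to show one uniform, one log-normal, and one normal distribution; they are not taken from the figures above.

```python
import math
import random


def sample_domain(rng):
    """Draw one hyperparameter set from a domain of probability distributions."""
    return {
        # Uniform: no prior preference anywhere in the range.
        "n_estimators": int(rng.uniform(100, 600)),
        # Log-normal: spreads draws over several orders of magnitude around 0.01.
        "learning_rate": rng.lognormvariate(math.log(0.01), 1.0),
        # Normal: concentrates draws near a value believed to work well.
        "max_depth": max(1, round(rng.gauss(20, 5))),
    }


rng = random.Random(42)
samples = [sample_domain(rng) for _ in range(1000)]
```

Unlike the grid, every value in a range can be drawn, and the shape of each distribution determines where most of the draws fall.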
