*Credit*: This example is taken from the `skopt` [Bayesian Optimization tutorial](https://scikit-optimize.github.io/notebooks/bayesian-optimization.html).

In [0]:
import numpy as np
import matplotlib.pyplot as plt
!pip install scikit-optimize
from skopt import gp_minimize

# Black-box function optimization with skopt

Scikit-Optimize, or [skopt](https://scikit-optimize.github.io/), is a simple and efficient library to minimize (very) expensive and noisy black-box functions. It implements several methods for sequential model-based optimization.

Alternative libraries include [Spearmint](https://github.com/HIPS/Spearmint), [Hyperopt](http://hyperopt.github.io/hyperopt/), [Optuna](https://optuna.org) and [BoTorch](https://github.com/pytorch/botorch).

Black-box algorithms do not need any knowledge of the gradient. These libraries provide algorithms that are more powerful and scale better than the *Nelder-Mead* simplex algorithm we saw in the *Constrained Optimization* notebook. Modern black-box (or sequential model-based) optimization algorithms are increasingly popular for optimizing the *hyperparameters* (user-tuned "knobs") of machine learning models.

Let's assume the following noisy function $f$:

In [0]:
noise_level = 0.1

def f(x, noise_level=noise_level):
    return np.sin(5 * x[0]) * (1 - np.tanh(x[0] ** 2)) + np.random.randn() * noise_level

In `skopt`, an objective function $f$ is assumed to take a 1D vector $x$, passed as an array-like such as a list, and to return a scalar $f(x)$. This is why `f` above indexes `x[0]` even though the problem has a single dimension.
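
A minimal sketch of this calling convention, restated with the standard library so that it runs on its own. The name `f_demo` is hypothetical and only mirrors `f` above; setting `noise_level=0.0` makes the result deterministic.

```python
import math
import random

def f_demo(x, noise_level=0.1):
    # x is a sequence with one entry per dimension; this problem only has x[0]
    x0 = x[0]
    noise = random.gauss(0.0, noise_level)
    return math.sin(5 * x0) * (1 - math.tanh(x0 ** 2)) + noise

point = [0.5]                           # a 1D point is still passed as a list
value = f_demo(point, noise_level=0.0)  # noise switched off for the demo
print(type(value).__name__, value)
```

Passing a bare float such as `f_demo(0.5)` would fail, because the function indexes its argument.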

In [0]:
# Plot the noise-free f(x) with a 95% noise band (+/- 1.96 * noise_level)
x = np.linspace(-2, 2, 400).reshape(-1, 1)
fx = [f(x_i, noise_level=0.0) for x_i in x]
plt.plot(x, fx, "r--", label="True (unknown)")
plt.fill(np.concatenate([x, x[::-1]]),
         np.concatenate(([fx_i - 1.9600 * noise_level for fx_i in fx], 
                         [fx_i + 1.9600 * noise_level for fx_i in fx[::-1]])),
         alpha=.2, fc="r", ec="None")
plt.legend()
plt.grid()

Note that this function is differentiable, and its derivative would not be hard to work out analytically. For now, let's pretend that the gradient is unavailable, which rules out gradient-based optimization.

Bayesian Optimization based on Gaussian Process regression is implemented in `skopt.gp_minimize` and can be carried out as follows:

In [0]:
res = gp_minimize(f,                  # the function to minimize
                  [(-2.0, 2.0)],      # the bounds on each dimension of x
                  acq_func="EI",      # the acquisition function
                  n_calls=15,         # the number of evaluations of f 
                  n_random_starts=5,  # the number of random initialization points
                  noise=0.1**2,       # the noise variance (optional)
                  random_state=123)   # the random seed
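
The call above selects `acq_func="EI"`, which stands for expected improvement. As a rough sketch, and not `skopt`'s own implementation, the usual closed form of expected improvement for minimization can be written with the standard library. The names `expected_improvement`, `mu`, `sigma`, `f_best` and `xi` are illustrative, and the default `xi=0.01` is an assumption.

```python
import math

def expected_improvement(mu, sigma, f_best, xi=0.01):
    """Expected improvement (minimization) at one candidate point.

    mu, sigma: posterior mean and standard deviation of the surrogate there.
    f_best: lowest objective value observed so far.
    xi: margin by which the candidate is asked to beat f_best.
    """
    improvement = f_best - mu - xi
    if sigma <= 0.0:
        # No uncertainty: the improvement is known exactly (limit as sigma -> 0)
        return max(improvement, 0.0)
    z = improvement / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))          # standard normal CDF
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)   # standard normal PDF
    return improvement * cdf + sigma * pdf

# Same predicted mean, different uncertainty: the more uncertain point scores higher
ei_certain = expected_improvement(mu=0.0, sigma=0.1, f_best=0.0)
ei_uncertain = expected_improvement(mu=0.0, sigma=1.0, f_best=0.0)
print(ei_certain, ei_uncertain)
```

The two terms show the trade-off: the first rewards points whose predicted mean is already below the best value, the second rewards points where the surrogate is uncertain.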

Accordingly, the approximated minimum is found to be:

In [0]:
print("x^*=%.4f, f(x^*)=%.4f" % (res.x[0], res.fun))

For further inspection of the results, the returned `res` object (a SciPy `OptimizeResult`) exposes the following attributes:

- `x [list]`: location of the minimum.
- `fun [float]`: function value at the minimum.
- `models`: surrogate models used for each iteration.
- `x_iters [list of lists]`: location of function evaluation for each iteration.
- `func_vals [array]`: function value for each iteration.
- `space [Space]`: the optimization space.
- `specs [dict]`: parameters passed to the function.

In [0]:
print(res)

Together these attributes can be used to visually inspect the results of the minimization, such as the convergence trace or the acquisition function at the last iteration:

In [0]:
from skopt.plots import plot_convergence
plot_convergence(res);
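
The convergence trace is the best objective value found after each call, that is, the running minimum of `res.func_vals`. A small sketch of that computation follows; the numbers are made up for illustration and merely stand in for `res.func_vals`.

```python
from itertools import accumulate

# Made-up objective values in evaluation order (a stand-in for res.func_vals)
func_vals = [0.31, -0.12, 0.05, -0.48, -0.20, -0.51, -0.47]

# Best value seen after each call: the running minimum
best_so_far = list(accumulate(func_vals, min))
print(best_so_far)
```

The trace never increases, and each drop marks a call that improved on the best value seen so far.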

## Branin-Hoo function

The [Branin-Hoo function](http://www.sfu.ca/~ssurjano/branin.html) is a popular optimization benchmark. 

The 2-dimensional function has the form
$$
f(\mathbf{x}) = a(x_2 - bx_1^2 + cx_1 - r)^2 + s(1-t)\cos(x_1) + s.
$$

The recommended values of $a$, $b$, $c$, $r$, $s$ and $t$ are $a=1$, $b = 5.1 / (4 \pi^2)$, $c= 5/\pi$, $r=6$, $s=10$ and $t=1/(8\pi)$. 

The function is usually evaluated on the square $x_1 \in [-5, 10], x_2 \in [0, 15]$.

It has three global minima. Using the recommended parameters above, $f(\mathbf{x}^*)=0.397887$ at $\mathbf{x}^* = (-\pi, 12.275), (\pi, 2.275)$ and $(9.42478, 2.475)$.

In [0]:
def branin(x, a=1, b=5.1 / (4 * np.pi**2), c=5. / np.pi,
           r=6, s=10, t=1. / (8 * np.pi)):
    """Branin-Hoo function is defined on the square x1 ∈ [-5, 10], x2 ∈ [0, 15].

    It has three minima with f(x*) = 0.397887 at x* = (-pi, 12.275),
    (+pi, 2.275), and (9.42478, 2.475).

    More details: <http://www.sfu.ca/~ssurjano/branin.html>
    """
    return (a * (x[1] - b * x[0] ** 2 + c * x[0] - r) ** 2 +
            s * (1 - t) * np.cos(x[0]) + s)
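
As a quick sanity check, the three minima quoted above can be evaluated directly. The sketch below restates the same formula with the standard library so that it runs on its own; `branin_std` is a hypothetical name, not part of `skopt`.

```python
import math

def branin_std(x, a=1.0, b=5.1 / (4 * math.pi ** 2), c=5.0 / math.pi,
               r=6.0, s=10.0, t=1.0 / (8 * math.pi)):
    # Same formula as branin above, written with math instead of numpy
    return (a * (x[1] - b * x[0] ** 2 + c * x[0] - r) ** 2
            + s * (1 - t) * math.cos(x[0]) + s)

minima = [(-math.pi, 12.275), (math.pi, 2.275), (9.42478, 2.475)]
values = [branin_std(m) for m in minima]
for m, v in zip(minima, values):
    print(m, round(v, 6))
```

At each of these points the squared term vanishes and $\cos(x_1) = -1$, so the value reduces to $s\,t = 10/(8\pi) \approx 0.397887$.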

### Exercise

Adapt the Bayesian Optimization with GP regression example above to minimize the Branin-Hoo function over its usual domain.

### Exercise

Experiment with different settings of the acquisition function (parameter `acq_func`). What is an acquisition function? You can find a great write-up on Bayesian Optimization on [Martin Krasser's blog](http://krasserm.github.io/2018/03/21/bayesian-optimization/).