# <h1 align="center">Introduction to Bayesian Optimization</h1>
<h3 align="right">Linjian Li</h3>
<h3 align="right">2018-12-22</h3>


## Some Application Examples $f(x)$
- Hyperparameter optimization
- Recommending ads

<img src="guess optimum.png">

## Features
- Global optimization
- For Black-box functions
- Does not require derivatives



## Gaussian Process
- a stochastic process
- any finite subcollection of random variables has a multivariate Gaussian distribution
- **mean function $m(\cdot)$** and **covariance function (kernel) $k(\cdot, \cdot)$**
    <img src="usual covariance functions.png">
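As a minimal sketch of a covariance function, the block below implements the squared-exponential (RBF) kernel for scalar inputs and builds a covariance matrix from it. The function names `rbf_kernel` and `cov_matrix` and the parameters `length_scale` and `sigma_f` are illustrative assumptions, not taken from the figure above.

```python
import math

def rbf_kernel(x1, x2, length_scale=1.0, sigma_f=1.0):
    """Squared-exponential covariance between two scalar inputs."""
    return sigma_f ** 2 * math.exp(-0.5 * ((x1 - x2) / length_scale) ** 2)

def cov_matrix(xs_a, xs_b, kernel=rbf_kernel):
    """Matrix of kernel values k(a, b) for every pair (a in xs_a, b in xs_b)."""
    return [[kernel(a, b) for b in xs_b] for a in xs_a]

# Covariance of three hypothetical input locations with themselves.
X = [0.0, 1.0, 2.0]
K = cov_matrix(X, X)
```

Nearby inputs get a covariance close to `sigma_f ** 2`, and the covariance decays toward zero as the inputs move apart, which is what makes the sampled functions smooth.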

### Posterior

$$
\begin{pmatrix}\mathbf{y} \\ \mathbf{f}_*\end{pmatrix} \sim \mathcal{N}
\left(\begin{pmatrix}\boldsymbol{\mu} \\ \boldsymbol{\mu_*} \end{pmatrix},
\begin{pmatrix}\mathbf{K}_y & \mathbf{K}_* \\ \mathbf{K}_*^T & \mathbf{K}_{**}\end{pmatrix}
\right)\tag{1}\label{eq1}
$$

$$
\begin{align*}
p(\mathbf{f}_* \lvert \mathbf{X}_*,\mathbf{X},\mathbf{y}) 
&= \int{p(\mathbf{f}_* \lvert \mathbf{X}_*,\mathbf{f})p(\mathbf{f} \lvert \mathbf{X},\mathbf{y})}\ d\mathbf{f} \\ 
&= \mathcal{N}(\mathbf{f}_* \lvert \boldsymbol{\mu}_*, \boldsymbol{\Sigma}_*)
\end{align*} \tag{2}\label{eq2}
$$

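$$
\begin{align*}
\boldsymbol{\mu}_* &= \mathbf{K}_*^T \mathbf{K}_y^{-1} \mathbf{y} \tag{4}\label{eq4}\\
\boldsymbol{\Sigma}_* &= \mathbf{K}_{**} - \mathbf{K}_*^T \mathbf{K}_y^{-1} \mathbf{K}_* \tag{5}\label{eq5}
\end{align*}
$$

Here $\boldsymbol{\mu}_*$ and $\boldsymbol{\Sigma}_*$ are the posterior mean and covariance, written for a zero prior mean.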
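The posterior mean $\mathbf{K}_*^T \mathbf{K}_y^{-1} \mathbf{y}$ and the diagonal of the posterior covariance $\mathbf{K}_{**} - \mathbf{K}_*^T \mathbf{K}_y^{-1} \mathbf{K}_*$ can be sketched in plain Python as follows. This is an illustration under stated assumptions: a zero prior mean, an RBF kernel with unit variance, scalar inputs, and a small `noise` term added to the diagonal of $\mathbf{K}_y$ for numerical stability. All function names are hypothetical.

```python
import math

def rbf(a, b, length_scale=1.0):
    """Squared-exponential kernel for scalar inputs (assumed kernel)."""
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)

def cholesky(A):
    """Lower-triangular L with L L^T = A, for symmetric positive-definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L

def solve_spd(L, b):
    """Solve (L L^T) x = b by forward, then backward substitution."""
    n = len(b)
    z = [0.0] * n
    for i in range(n):
        z[i] = (b[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (z[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def gp_posterior(X, y, X_star, noise=1e-6):
    """Posterior mean and variance at each test point in X_star."""
    K_y = [[rbf(a, b) + (noise if i == j else 0.0)
            for j, b in enumerate(X)] for i, a in enumerate(X)]
    L = cholesky(K_y)
    alpha = solve_spd(L, y)                      # K_y^{-1} y
    mu, var = [], []
    for xs in X_star:
        k_star = [rbf(a, xs) for a in X]         # one column of K_*
        v = solve_spd(L, k_star)                 # K_y^{-1} k_*
        mu.append(sum(k * a for k, a in zip(k_star, alpha)))
        var.append(rbf(xs, xs) - sum(k * vi for k, vi in zip(k_star, v)))
    return mu, var

# Hypothetical training data: three observations of sin(x).
X_train = [0.0, 1.0, 2.0]
y_train = [math.sin(x) for x in X_train]
mu_star, var_star = gp_posterior(X_train, y_train, [1.0, 50.0])
```

At an observed input the mean reproduces the observation and the variance collapses toward zero. Far from all data the prediction falls back to the prior: mean 0 and variance 1.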


<img src="GP.png">

The gray dashed line is the unobserved true function. <br/>
The green solid line is the **mean**, our guess of the true function. <br/>
The translucent green areas show the **standard deviation** of our guess.

So which point should be sampled next?
- Exploitation-exploration (mean-variance) trade-off
    - Exploitation: sample where the posterior mean is already good
    - Exploration: sample where the posterior variance is high
- Acquisition functions


## Acquisition functions $\alpha(x;D)$
Acquisition functions are functions of the input $x$, parameterized by the
- observed data $D$
- GP hyperparameters $\theta$

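Popular acquisition functions (here $\tau$ is the improvement threshold, typically the best value observed so far, and $\Phi$ and $\phi$ are the standard normal CDF and PDF)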
- probability of improvement (PI)
    - $\alpha_{PI}(x; D_{n}) \gets P(f(x)>\tau) = \Phi(\frac{\mu_{n}(x)-\tau}{\sigma_{n}(x)})$
- expected improvement (EI)
    - $\alpha_{EI}(x; D_{n}) \gets E[I(x,f(x),\theta)] = (\mu_{n}(x)-\tau)\Phi(\frac{\mu_{n}(x)-\tau}{\sigma_{n}(x)})+\sigma_{n}(x)\phi(\frac{\mu_{n}(x)-\tau}{\sigma_{n}(x)})$
- upper confidence bound (UCB) 
    - $\alpha_{UCB}(x; D_{n}) \gets \mu_{n}(x)+\beta_{n}\sigma_{n}(x)$
- predictive entropy search (PES)
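The PI, EI, and UCB formulas above translate almost directly into code. The sketch below assumes a maximization problem, takes the posterior mean `mu` and standard deviation `sigma` at a single point as plain floats, and uses the standard library's `statistics.NormalDist` for $\Phi$ and $\phi$. The function names and the default `beta=2.0` are illustrative assumptions.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1

def probability_of_improvement(mu, sigma, tau):
    """P(f(x) > tau) under the posterior N(mu, sigma^2)."""
    if sigma <= 0.0:
        return 1.0 if mu > tau else 0.0
    return _STD_NORMAL.cdf((mu - tau) / sigma)

def expected_improvement(mu, sigma, tau):
    """E[max(f(x) - tau, 0)] under the posterior N(mu, sigma^2)."""
    if sigma <= 0.0:
        return max(mu - tau, 0.0)
    z = (mu - tau) / sigma
    return (mu - tau) * _STD_NORMAL.cdf(z) + sigma * _STD_NORMAL.pdf(z)

def upper_confidence_bound(mu, sigma, beta=2.0):
    """Optimistic estimate: mean plus beta standard deviations."""
    return mu + beta * sigma

# A point whose mean equals the threshold: improvement is a coin flip.
pi = probability_of_improvement(0.0, 1.0, 0.0)
ei = expected_improvement(0.0, 1.0, 0.0)
ucb = upper_confidence_bound(1.0, 0.5, beta=2.0)
```

The `sigma <= 0.0` branches handle points with no remaining uncertainty, where the formulas would otherwise divide by zero.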

The following image shows $\alpha_{EI}$ when the aim is to find the **minimum** of $f(x)$.
<img src="AC_to_find_minimum.png">


## Steps to Optimize Function $f(x)$
1. Place a prior on $f$ (using a GP)
2. Maximize an acquisition function to determine the next query point $x_t$
    $$ x_t = \operatorname*{argmax}_x \alpha(x; D_{t-1}) $$
3. Evaluate $y_t = f(x_t)$
4. Augment the data: $D_{t} = \{ D_{t-1}, (x_{t},y_{t}) \}$
5. Update the posterior distribution (using the GP)
6. Go to step 2
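The loop above can be sketched end to end in plain Python. This is a minimal illustration, not a tuned implementation, and it makes several assumptions: a hypothetical one-dimensional `objective` with its maximum at $x = 2$, a zero-mean GP with a unit-variance RBF kernel, expected improvement as the acquisition function, and a fixed candidate `grid` in place of a real inner optimizer.

```python
import math
from statistics import NormalDist

def objective(x):
    """Hypothetical black-box function; its maximum is at x = 2."""
    return math.exp(-(x - 2.0) ** 2)

def rbf(a, b):
    return math.exp(-0.5 * (a - b) ** 2)

def cholesky(A):
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(A[i][i] - s) if i == j else (A[i][j] - s) / L[j][j]
    return L

def solve_spd(L, b):
    """Solve (L L^T) x = b by forward, then backward substitution."""
    n = len(b)
    z = [0.0] * n
    for i in range(n):
        z[i] = (b[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (z[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def fit_gp(X, y, jitter=1e-6):
    """Return a function mapping x to the posterior (mean, std) given the data."""
    K_y = [[rbf(a, b) + (jitter if i == j else 0.0)
            for j, b in enumerate(X)] for i, a in enumerate(X)]
    L = cholesky(K_y)
    alpha = solve_spd(L, y)
    def predict(x_new):
        k_star = [rbf(a, x_new) for a in X]
        v = solve_spd(L, k_star)
        mu = sum(k * a for k, a in zip(k_star, alpha))
        var = 1.0 - sum(k * vi for k, vi in zip(k_star, v))
        return mu, math.sqrt(max(var, 1e-12))  # clamp round-off negatives
    return predict

def expected_improvement(mu, sigma, tau):
    z = (mu - tau) / sigma
    std_normal = NormalDist()
    return (mu - tau) * std_normal.cdf(z) + sigma * std_normal.pdf(z)

def bayes_opt(f, X_init, grid, n_iter=8):
    X = list(X_init)
    y = [f(x) for x in X]
    for _ in range(n_iter):
        predict = fit_gp(X, y)                  # posterior given the data so far
        tau = max(y)                            # best observation so far
        x_next = max(grid, key=lambda x: expected_improvement(*predict(x), tau))
        X.append(x_next)                        # augment the data
        y.append(f(x_next))                     # evaluate the objective
    best = max(range(len(y)), key=y.__getitem__)
    return X[best], y[best], X

grid = [i * 0.05 for i in range(101)]           # candidates on [0, 5]
best_x, best_y, history = bayes_opt(objective, [0.0, 5.0], grid)
```

Refitting the GP from scratch in every iteration keeps the sketch short. A practical implementation would also fit the kernel hyperparameters and optimize the acquisition function with a proper optimizer rather than a grid.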


## Conclusion
- Bayesian optimization finds the minimum (or maximum) of difficult non-convex functions with relatively **few evaluations**,<br/>
    at the cost of performing **more computation** to determine the next point to try.


## Example
This example finds the **maximum** of the function $f(x)$.
<img src="iterations.png">


## References
(Asterisks indicate how much each reference contributed to these notes. More is better; the maximum is 5.)
- https://github.com/krasserm/bayesian-machine-learning/blob/master/bayesian_optimization.ipynb $(*****)$
- https://github.com/krasserm/bayesian-machine-learning/blob/master/gaussian_processes.ipynb $(*****)$
- https://www.quora.com/How-does-Bayesian-optimization-work $(*****)$
- http://katbailey.github.io/post/gaussian-processes-for-dummies/ $(*****)$
- Shahriari, B., Swersky, K., Wang, Z., Adams, R. P., & De Freitas, N. (2016). Taking the human out of the loop: A review of bayesian optimization. Proceedings of the IEEE, 104(1), 148-175. $(****)$
- https://www.youtube.com/watch?v=-Ekte3my510 $(***)$
- https://towardsdatascience.com/shallow-understanding-on-bayesian-optimization-324b6c1f7083 $(*)$ 