In [None]:
import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_context('paper')
sns.set_style('white')

# Hands-on Activity 23.1: Maximum Mean - A Bad Information Acquisition Function

## Objectives

+ Develop intuition about the maximum mean as an information acquisition function and why you should never use it.

## Working 1D Example

It is easier to introduce the ideas using an example.
Let's work with a synthetic 1D function defined in $[0,1]$:

In [None]:
def f(x):
    return 4 * (1. - np.sin(6 * x + 8 * np.exp(6 * x - 7.))) 

x = np.linspace(0, 1)
plt.plot(x, f(x), linewidth=2)
plt.xlabel('$x$')
plt.ylabel('$y$');

We wish to maximize this function (of course, in reality, you wouldn't see the function).
Let us generate some starting data:

In [None]:
np.random.seed(123456) # For reproducibility
n_init = 3
X = np.random.rand(n_init) # In 1D you don't have to use LHS
Y = f(X)
plt.plot(X, Y, 'kx', markersize=10, markeredgewidth=2)
plt.xlabel('$x$')
plt.ylabel('$y$');

Assume that we do some kind of Bayesian regression, using the data we have so far.
Here, we will do GPR, but any Bayesian regression would actually work.
We will not work right now with the full predictive $p(f(\cdot)|\mathcal{D}_{n})$, but with the point-predictive distribution:
$$
p(y|\mathbf{x},\mathcal{D}_{n}) = \mathcal{N}\left(y|m_{n}(\mathbf{x}), \sigma^2_{n}(\mathbf{x})\right),
$$
where $m_{n}(\mathbf{x})$ and $\sigma^2_{n}(\mathbf{x})$ are the predictive mean and variance respectively.
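
For reference, GPR gives both quantities in closed form. The following is only a sketch under two assumptions not stated above: a zero-mean prior with covariance function $k$, and noise-free observations $\mathbf{y}_{1:n}$ at the inputs $\mathbf{x}_{1:n}$. The symbols $\mathbf{K}_{n}$ and $\mathbf{k}_{n}(\mathbf{x})$ are notation introduced here for illustration:
$$
m_{n}(\mathbf{x}) = \mathbf{k}_{n}(\mathbf{x})^{T}\mathbf{K}_{n}^{-1}\mathbf{y}_{1:n},
\qquad
\sigma^2_{n}(\mathbf{x}) = k(\mathbf{x},\mathbf{x}) - \mathbf{k}_{n}(\mathbf{x})^{T}\mathbf{K}_{n}^{-1}\mathbf{k}_{n}(\mathbf{x}),
$$
where $\mathbf{K}_{n}$ is the $n\times n$ matrix with entries $k(\mathbf{x}_{i},\mathbf{x}_{j})$ and $\mathbf{k}_{n}(\mathbf{x})$ is the vector with entries $k(\mathbf{x},\mathbf{x}_{i})$.
Under these assumptions, at an observed input $\mathbf{x}_{i}$ the mean reproduces the observation, $m_{n}(\mathbf{x}_{i}) = y_{i}$, and the variance vanishes, $\sigma^2_{n}(\mathbf{x}_{i}) = 0$.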

Here is an example with GPR:

In [None]:
import GPy
# The kernel we use
k = GPy.kern.RBF(1, lengthscale=0.15, variance=4.)
gpr = GPy.models.GPRegression(X[:, None], Y[:, None], k)
# Assuming that we know there is no measurement noise:
gpr.likelihood.variance.constrain_fixed(1e-16)
# You can evaluate the predictive distribution anywhere:
m, sigma2 = gpr.predict(x[:, None])
# And you can visualize the results as follows
# Standard deviation
sigma = np.sqrt(sigma2)
# Lower and upper bounds of the 95% predictive interval
l = m - 1.96 * sigma
u = m + 1.96 * sigma
fig, ax = plt.subplots(dpi=100)
plt.plot(x, f(x), 'r--', linewidth=2, label='True function')
ax.plot(X, Y, 'kx', markersize=10, markeredgewidth=2, label='Observations')
ax.plot(x, m, label='GP mean')
ax.fill_between(x, l.flatten(), u.flatten(), color=sns.color_palette()[0], alpha=0.25,
                label='GP 95% pred. int.')
ax.set_xlabel('$x$')
ax.set_ylabel('$y$')
plt.legend(loc='best');

Now, the question is this: "Where should we evaluate the function next if our goal is to maximize the function?"
Let's start with the naive assumption that we should evaluate the function at the point that maximizes the Gaussian process posterior mean, i.e., at the maximum of the blue GP mean line in the plot above.
In other words, the information acquisition function is the posterior mean of the Gaussian process, and the next point is $\mathbf{x}_{n+1} = \arg\max_{\mathbf{x}} m_{n}(\mathbf{x})$.
Let's see what happens.

In [None]:
def maximize_naive(f, gpr, X_design, max_it=6):
    """
    Optimize f using a limited number of evaluations.
    
    :param f:        The function to optimize.
    :param gpr:      A Gaussian process model to use for representing our state of knowledge.
    :param X_design: The set of candidate points for identifying the maximum.
    :param max_it:   The maximum number of iterations.
    """
    for count in range(max_it):
        m, sigma2 = gpr.predict(X_design)
        sigma = np.sqrt(sigma2)
        l = m - 1.96 * sigma
        u = m + 1.96 * sigma
        i = np.argmax(m)
        X = np.vstack([gpr.X, X_design[i:(i+1), :]])
        y = np.vstack([gpr.Y, [f(X_design[i, :])]])
        gpr.set_XY(X, y)
        fig, ax = plt.subplots(dpi=100)
        ax.plot(gpr.X, gpr.Y, 'kx', markersize=10, markeredgewidth=2,
                label='Observations')
        ax.plot(x, f(x), 'r--', label='True function')
        # Mark the selected design point (use X_design, not the global grid x)
        ax.plot(X_design[i, 0], f(X_design[i, 0]), 'go', label='Next observation')
        ax.plot(x, m, label='GP mean')
        ax.fill_between(X_design.flatten(), l.flatten(), u.flatten(), 
                        color=sns.color_palette()[0], alpha=0.25,
                        label='GP 95% pred. int.')
        ax.set_xlabel('$x$')
        ax.set_ylabel('$y$')
        ax.set_title('BGO iteration #{0:d}'.format(count+1))
        plt.legend(loc='best')

In [None]:
# Run the algorithm
k = GPy.kern.RBF(1, lengthscale=0.15, variance=4.)
gpr = GPy.models.GPRegression(X[:, None], Y[:, None], k)
maximize_naive(f, gpr, x[:, None], max_it=10)

Observe that the algorithm misses the real maximum.
Instead it finds a local maximum and gets trapped by it.
This is because using the posterior mean as an acquisition function focuses too much on **exploiting** the currently available information and fails to **explore** regions of the input space that we haven't visited.
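
One way to see the exploitation bias numerically is to compare where the posterior mean is largest with where the predictive standard deviation is largest. The sketch below is not the GPy model used above: it is a minimal NumPy re-implementation of a zero-mean, noise-free GP with the same RBF hyperparameters. The helper names (`rbf`, `gp_posterior`), the small `jitter` term, and the initial design `idx` are assumptions made for illustration.

```python
import numpy as np

# Kernel hyperparameters copied from the activity's RBF kernel.
LENGTHSCALE, VARIANCE = 0.15, 4.0

def f(x):
    # The synthetic objective used throughout this activity.
    return 4 * (1. - np.sin(6 * x + 8 * np.exp(6 * x - 7.)))

def rbf(a, b):
    # Squared-exponential covariance between two sets of 1D inputs.
    d = a[:, None] - b[None, :]
    return VARIANCE * np.exp(-0.5 * (d / LENGTHSCALE) ** 2)

def gp_posterior(X_obs, Y_obs, X_new, jitter=1e-10):
    # Zero-mean, noise-free GP posterior mean and standard deviation.
    # The jitter only stabilizes the linear solve (an assumption of this sketch).
    K = rbf(X_obs, X_obs) + jitter * np.eye(len(X_obs))
    K_new = rbf(X_new, X_obs)
    mean = K_new @ np.linalg.solve(K, Y_obs)
    var = VARIANCE - np.sum(K_new * np.linalg.solve(K, K_new.T).T, axis=1)
    return mean, np.sqrt(np.clip(var, 0.0, None))

X_design = np.linspace(0, 1, 50)
idx = np.array([5, 20, 40])  # hypothetical initial design, chosen on the grid
X_obs, Y_obs = X_design[idx], f(X_design[idx])

m, s = gp_posterior(X_obs, Y_obs, X_design)
i_mean = np.argmax(m)  # what the maximum-mean rule selects
i_unc = np.argmax(s)   # where the model is least certain

print('max-mean pick:  x = {0:.3f}, sigma = {1:.3f}'.format(X_design[i_mean], s[i_mean]))
print('most uncertain: x = {0:.3f}, sigma = {1:.3f}'.format(X_design[i_unc], s[i_unc]))
```

With this hypothetical design, the maximum-mean pick should land on or next to the best point already observed, where the predictive standard deviation is close to zero, so the new evaluation adds almost no information. The rule never looks at `s`, which is why it has no reason to visit the poorly explored regions.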

### Questions

+ Experiment with different numbers of initial observations. How many do you need for the algorithm to actually converge to the global maximum?