# Gaussian process regression basics [optional]

This notebook presents basic Gaussian process (GP) regression using only the `numpy` package to keep the example as simple as possible. 
Note that you do not need to know Gaussian processes in detail to be able to complete the rest of the tutorial, so this step is optional.

While a complete introduction to Gaussian processes is beyond the scope of this tutorial, we try to describe the very basics. For more information and theoretical background, we recommend the excellent book [Gaussian Processes for Machine Learning](http://www.gaussianprocess.org/gpml/).
The aim here is just to build some intuition about how the model works in practice.

In supervised learning we observe input-output pairs $(\mathbf{x}, y)$ and we assume $y = f(\mathbf{x})$ for some unknown function $f$, possibly corrupted by noise.
The goal of learning is to estimate $f$ as closely as possible from the observed data.
A Bayesian approach is to infer a distribution over functions given the data, $p(f|\mathbf{X},\mathbf{y})$, and use it to make predictions at new inputs: $p(y_*|\mathbf{x}_*,\mathbf{X},\mathbf{y}) = \int p(y_*|f,\mathbf{x}_*)p(f|\mathbf{X},\mathbf{y})\,df$.

A Gaussian process is a generalisation of the Gaussian distribution that describes a distribution over functions and is fully specified by its mean and covariance function:

$$
f \sim GP(m(\mathbf{x}),k(\mathbf{x},\mathbf{x'}))
$$

where $m(\mathbf{x})$ is the mean function and $k(\mathbf{x},\mathbf{x'})$ is the covariance function, also known as the kernel.

## Dependencies

In [None]:
# imports
import matplotlib.pyplot as plt
import numpy as np

## Example data

We start by generating some example data. To keep it simple, we have just one input variable `x` and one output variable `y`. We make `y` a nonlinear function of `x` to better illustrate the flexibility of the GP regression model. 

In [None]:
np.random.seed(0)

N = 20  # number of data points
f = lambda x: np.sin(4 * x) * np.sin(5 * x)  # true function
x = np.random.uniform(0.1, 1.5, N)  # random values of x
y = f(x) + np.random.normal(0, .05, N)  # noisy observations
x1 = np.linspace(0, 1.6, 200)  # grid of x values for plotting

plt.figure(figsize=(12,4))
plt.plot(x1, f(x1), color='k', label="true function")
plt.plot(x, y, 'o', color='k', label="noisy observations")
plt.xlabel("x"); plt.ylabel("y")
plt.legend()
plt.show()

## GP prior distribution

Here we define the GP prior distribution, that is, the distribution over functions before we observe any data.

We will use a very simple zero mean function:

$$
m(\mathbf{x}) = 0
$$

Even with a zero mean function, the GP is usually flexible enough to fit a wide variety of functions.

We also define the widely used squared exponential kernel (also known as RBF):

$$
k_{SE}(x, x') = \sigma_f^2 \exp \big( -\frac{(x - x')^2}{2l^2} \big)
$$

where $\sigma_f^2$ is the variance parameter and $l$ is the length scale parameter.

Finally, we define a simple noise kernel in the code below.

In [None]:
class SquaredExponentialKernel:
    
    def __init__(self, variance, length_scale):
        self.variance = variance
        self.length_scale = length_scale
        
    def kernel_function(self, x1, x2):
        z = (x1 - x2)**2 / (2 * self.length_scale**2)
        return self.variance * np.exp(-z)
    
    def __call__(self, X, Z=None):
        """Compute covariance matrix."""
        if Z is None:
            Z = X
        N, M = len(X), len(Z)
        K = np.zeros((N, M))
        # naive double loop over all pairs of inputs (clear but slow)
        for i in range(N):
            for j in range(M):
                K[i, j] = self.kernel_function(X[i], Z[j])
        return K
    

class NoiseKernel:
    
    def __init__(self, variance):
        self.variance = variance
        
    def __call__(self, X):
        """Compute covariance matrix."""
        return self.variance * np.eye(len(X))
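
The double loop in `SquaredExponentialKernel.__call__` is easy to read but becomes slow for large inputs. As a minimal sketch, assuming one-dimensional inputs stored as NumPy arrays, the same covariance matrix can be built with broadcasting. The function name `squared_exponential` is hypothetical and not part of the classes above.

```python
import numpy as np


def squared_exponential(X, Z=None, variance=1.0, length_scale=0.1):
    """Squared exponential covariance matrix for 1-D inputs, without Python loops."""
    X = np.asarray(X, dtype=float)
    Z = X if Z is None else np.asarray(Z, dtype=float)
    # Broadcasting builds the (N, M) matrix of pairwise differences x_i - z_j.
    diff = X[:, None] - Z[None, :]
    return variance * np.exp(-diff**2 / (2 * length_scale**2))


x_demo = np.linspace(0.0, 1.0, 5)
K_demo = squared_exponential(x_demo, variance=2.0, length_scale=0.3)
print(K_demo.shape)     # (5, 5)
print(np.diag(K_demo))  # every diagonal entry equals the variance, since k(x, x) = variance
```

For the small grids used in this notebook the loop version is adequate; the difference starts to matter with thousands of points.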

Now we can draw some sample functions from the GP prior distribution and plot them. You can change the `variance` and `length_scale` parameters in the code below to see how they affect the sampled functions. For example, try changing `length_scale` by factors of ten.

In [None]:
def draw_samples_from_gp_prior(n_samples=5, variance=1.0, length_scale=0.1):
    # define kernel
    kernel = SquaredExponentialKernel(variance=variance, length_scale=length_scale)
    # draw samples
    samples = []
    for _ in range(n_samples):
        samples.append(np.random.multivariate_normal(np.zeros(len(x1)), kernel(x1)))
    std = np.sqrt(variance)
    # plot
    plt.figure(figsize=(12,4))
    plt.fill_between((0.0, 1.6), (2*std, 2*std), (-2*std, -2*std), color="C0", alpha=0.3, label="uncertainty (2*std)")
    for i,fx1 in enumerate(samples):
        plt.plot(x1, fx1, color="C0", linestyle="--", label="prior samples" if i == 0 else "")
    plt.xlabel("x"); plt.ylabel("y")
    plt.legend()
    plt.show()
    
draw_samples_from_gp_prior(n_samples=5, variance=1.0, length_scale=0.1)
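
Evaluated on a finite grid, the GP prior is just a multivariate Gaussian, which is why `np.random.multivariate_normal` works above. Squared exponential covariance matrices on dense grids can be numerically close to singular, so a common alternative is to add a small diagonal "jitter" and sample through a Cholesky factor. The sketch below illustrates this; the function name `sample_gp_prior` and the jitter value `1e-6` are assumptions chosen for illustration, not part of the notebook.

```python
import numpy as np


def sample_gp_prior(x_grid, variance=1.0, length_scale=0.1, n_samples=5, jitter=1e-6, seed=0):
    """Draw samples from a zero-mean GP prior with a squared exponential kernel."""
    rng = np.random.default_rng(seed)
    diff = x_grid[:, None] - x_grid[None, :]
    K = variance * np.exp(-diff**2 / (2 * length_scale**2))
    # A small diagonal jitter (assumed value) keeps the matrix numerically positive definite.
    L = np.linalg.cholesky(K + jitter * np.eye(len(x_grid)))
    # If z ~ N(0, I), then L @ z ~ N(0, L @ L.T) = N(0, K + jitter * I).
    z = rng.standard_normal((len(x_grid), n_samples))
    return (L @ z).T  # shape (n_samples, len(x_grid))


x_grid = np.linspace(0, 1.6, 200)
prior_samples = sample_gp_prior(x_grid)
print(prior_samples.shape)  # (5, 200)
```

Each row of `prior_samples` can be plotted against `x_grid` in the same way as the samples above.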

## GP posterior distribution (with noise)

We now condition on the observed data to compute the GP posterior and make predictions of the mean $\mu$ and the (co)variance $\Sigma$.
For the simple GP regression model, this step can be computed in closed form with the following expressions:

\begin{align}
p(\mathbf{f}_*|\mathbf{X}_*,\mathbf{X},\mathbf{y}) &= \mathcal{N}(\mathbf{f}_*|\mathbf{\mu},\mathbf{\Sigma}) \\
\mathbf{\mu} &= m(\mathbf{X}_*) + \mathbf{K}^T_* \mathbf{K}^{-1} (\mathbf{y} - m(\mathbf{X})) = \mathbf{K}^T_* \mathbf{K}^{-1} \mathbf{y} \\
\mathbf{\Sigma} &= \mathbf{K}_{**} - \mathbf{K}^T_* \mathbf{K}^{-1} \mathbf{K}_*
\end{align}

where $\mathbf{K} = k(\mathbf{X},\mathbf{X}) + \sigma^2 \mathbf{I}$, $\mathbf{K}_* = k(\mathbf{X},\mathbf{X}_*)$ and $\mathbf{K}_{**}=k(\mathbf{X}_*,\mathbf{X}_*)$, with covariance function (kernel) $k$ and noise level (variance) $\sigma^2$. The last equality for $\mathbf{\mu}$ holds because we assume a zero mean function, $m(\mathbf{x}) = 0$.

(In practice, a Cholesky decomposition of $\mathbf{K}$ is typically used in place of the explicit matrix inverse, since it is cheaper and numerically more stable.)

In [None]:
def predict(variance, length_scale, noise_level, n_samples=5):
    # kernels
    kernel = SquaredExponentialKernel(variance=variance, length_scale=length_scale)
    noise = NoiseKernel(variance=noise_level)
    # inference
    K = kernel(x) + noise(x)
    K_inv = np.linalg.inv(K)
    K1 = kernel(x, x1)
    K11 = kernel(x1, x1)
    mu = K1.T @ K_inv @ y
    Sigma = K11 - K1.T @ K_inv @ K1
    # sample posterior
    samples = []
    for _ in range(n_samples):
        fx1 = np.random.multivariate_normal(mu, Sigma)
        samples.append(fx1)
    sigma = np.sqrt(Sigma.diagonal())
    return mu, sigma, samples
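
The remark about the Cholesky decomposition can be made concrete with a short sketch. It factorises $\mathbf{K} = \mathbf{L}\mathbf{L}^T$ and replaces every product with $\mathbf{K}^{-1}$ by linear solves against $\mathbf{L}$. The names `se_kernel` and `predict_cholesky` are hypothetical, and the data are regenerated here only so that the block runs on its own.

```python
import numpy as np


def se_kernel(a, b, variance, length_scale):
    """Squared exponential kernel matrix for 1-D inputs."""
    diff = a[:, None] - b[None, :]
    return variance * np.exp(-diff**2 / (2 * length_scale**2))


def predict_cholesky(x_train, y_train, x_test, variance, length_scale, noise_level):
    """GP posterior mean and standard deviation without forming K^{-1} explicitly."""
    K = se_kernel(x_train, x_train, variance, length_scale) + noise_level * np.eye(len(x_train))
    K_s = se_kernel(x_train, x_test, variance, length_scale)
    L = np.linalg.cholesky(K)  # K = L @ L.T
    # alpha = K^{-1} y, obtained from two solves with the factor L
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    V = np.linalg.solve(L, K_s)  # V.T @ V equals K_s.T @ K^{-1} @ K_s
    mu = K_s.T @ alpha
    # diagonal of K_** is the kernel variance, because k(x, x) = variance
    var = variance - np.sum(V**2, axis=0)
    return mu, np.sqrt(np.maximum(var, 0.0))  # clip tiny negative round-off


rng = np.random.default_rng(0)
x_train = rng.uniform(0.1, 1.5, 20)
y_train = np.sin(4 * x_train) * np.sin(5 * x_train) + rng.normal(0, 0.05, 20)
x_test = np.linspace(0, 1.6, 200)
mu_chol, sigma_chol = predict_cholesky(x_train, y_train, x_test,
                                       variance=1.0, length_scale=0.2, noise_level=0.01)
```

To stay within `numpy`, this sketch uses the general-purpose `np.linalg.solve`; with SciPy available, `scipy.linalg.solve_triangular` or `scipy.linalg.cho_solve` would exploit the triangular structure of the factor.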

Let us plot the model predictions for various values of the length scale parameter to see how it behaves. 

Try changing the values of the `variance`, `length_scale` and `noise_level` parameters in the code below to see how they affect the predictions of the model.

In [None]:
for ls in [0.1, 0.2, 0.3, 0.4, 0.5]:
    mu, sigma, samples = predict(variance=1.0, length_scale=ls, noise_level=0.01)
    # plot
    plt.figure(figsize=(12,4))
    plt.title(f"length_scale={ls}")
    plt.plot(x1, f(x1), color="k", label="true function")
    plt.plot(x, y, 'o', color="k", label="noisy observations")
    plt.plot(x1, mu, color="C0", label="prediction")
    plt.fill_between(x1, mu + 2 * sigma, mu - 2 * sigma, color="C0", alpha=0.3, label="uncertainty (2*std)")
    for i,fx1 in enumerate(samples):
        plt.plot(x1, fx1, color="C0", linestyle="--", alpha=0.5, label="posterior samples" if i==0 else "")
    plt.legend(loc=1)
    plt.show()

As you can see, some parameter values make the model fit the data better than others. Thus, the problem of learning a good model of the data consists of finding suitable kernel parameters for the GP model.
In practice, the kernel parameters can be optimised automatically with gradient-based methods, typically by maximising the marginal likelihood of the data.
Fortunately, many GP packages are available, so we do not have to implement this ourselves.
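
To give a feel for what such packages optimise, here is a minimal sketch of the log marginal likelihood of a zero-mean GP, $\log p(\mathbf{y}|\mathbf{X}) = -\frac{1}{2}\mathbf{y}^T\mathbf{K}^{-1}\mathbf{y} - \frac{1}{2}\log|\mathbf{K}| - \frac{n}{2}\log 2\pi$, which is a commonly used objective. A coarse grid search over the length scale stands in for a gradient-based optimiser. The function name `log_marginal_likelihood` is hypothetical, and the data are regenerated so that the block runs on its own.

```python
import numpy as np


def log_marginal_likelihood(x_train, y_train, variance, length_scale, noise_level):
    """log p(y | X) for a zero-mean GP with a squared exponential kernel."""
    n = len(x_train)
    diff = x_train[:, None] - x_train[None, :]
    K = variance * np.exp(-diff**2 / (2 * length_scale**2)) + noise_level * np.eye(n)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))  # K^{-1} y
    # log|K| = 2 * sum(log(diag(L))), so 0.5 * log|K| = sum(log(diag(L)))
    return -0.5 * y_train @ alpha - np.sum(np.log(np.diag(L))) - 0.5 * n * np.log(2 * np.pi)


rng = np.random.default_rng(0)
x_train = rng.uniform(0.1, 1.5, 20)
y_train = np.sin(4 * x_train) * np.sin(5 * x_train) + rng.normal(0, 0.05, 20)

# Coarse grid search over the length scale (a stand-in for gradient-based optimisation).
length_scales = np.array([0.1, 0.2, 0.3, 0.4, 0.5])
scores = np.array([log_marginal_likelihood(x_train, y_train, 1.0, ls, 0.01)
                   for ls in length_scales])
best = length_scales[np.argmax(scores)]
for ls, score in zip(length_scales, scores):
    print(f"length_scale={ls:.1f}  log marginal likelihood={score:.2f}")
print("best length_scale on this grid:", best)
```

A higher value means the kernel parameters explain the observations better while penalising unnecessary model complexity through the $\log|\mathbf{K}|$ term.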