# Gaussian Process Regression

In this example, we return to our "straight line" mock dataset, and investigate a "model-free model" for it: a Gaussian Process. The idea is to find a flexible model that can _interpolate_ between the data we have, in order to predict future data lying in the gaps, or beyond the observed domain.

In [None]:
%load_ext autoreload
%autoreload 2

from __future__ import print_function
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline

plt.rcParams['figure.figsize'] = (6.0, 6.0)
plt.rcParams['savefig.dpi'] = 100

## The Data

* Let's generate a simple Cepheids-like dataset: observations of $y$ with reported uncertainties $\sigma_y$, at given $x$ values.


> You can make a different dataset by choosing a different random seed, to see how the conclusions change from dataset to dataset.

In [None]:
from straightline_utils import *

(x, y, sigmay) = generate_data(seed=13)

plot_yerr(x, y, sigmay)

## Fitting a Gaussian Process

Let's follow [Jake VanderPlas' `astroML` example](http://www.astroml.org/book_figures/chapter8/fig_gp_example.html#book-fig-chapter8-fig-gp-example), to see how to work with the `scikit-learn` v0.17 Gaussian Process model.

In [None]:
from sklearn.gaussian_process import GaussianProcess

### Defining a GP

First we define a kernel function, for populating the covariance matrix of our GP. To avoid confusion, a Gaussian kernel is referred to as a "squared exponential":

In [None]:
def my_squared_exponential(x1, x2, theta0):
    return np.exp(-theta0 * (x1 - x2) ** 2)
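
Written out, the kernel above is a function of the separation between two points only. Assuming the relation `theta0 = 0.5 / h**2` used in the next cell, it can be expressed in terms of the correlation length $h$:

$$
k(x_1, x_2) = \exp\left[-\theta_0 \,(x_1 - x_2)^2\right],
\qquad
\theta_0 = \frac{1}{2h^2}
\quad\Longrightarrow\quad
k(x_1, x_2) = \exp\left[-\frac{(x_1 - x_2)^2}{2h^2}\right]
$$

Note that $k(x, x) = 1$, so every point has unit prior variance, and the correlation between two points falls to $e^{-1/2} \approx 0.61$ at a separation of $h$.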

Now, let's draw some samples from the unconstrained process. Each sample is a function $y(x)$, which we evaluate on a grid. We'll need to assert a value for the kernel hyperparameter $h$, which sets the correlation length between the datapoints. That will allow us to compute a covariance matrix that captures the correlations between datapoints. We also need a mean function, which for simplicity we'll set to zero: once the data are rescaled to have zero mean, this is equivalent to using the mean observed $y$ value.

In [None]:
np.random.seed(1)
xgrid = np.linspace(0, 350, 100)
print("y(x) will be predicted on a grid of length", len(xgrid))

h = 10.0
theta0 = 0.5 / h**2
print("Correlation parameters h, theta0: ",h, theta0)

# Set up the mean function and covariance matrix. Note the broadcasting to a 2D array,
# achieved by passing in xgrid[:, None], which has shape (100, 1).

mu = np.zeros(len(xgrid))
C = my_squared_exponential(xgrid, xgrid[:, None], theta0)

print("Set up covariance matrix C with shape ", C.shape)

# Draw three sample y(x) functions:
draws = np.random.multivariate_normal(mu, C, 3)

print("Drew 3 samples, stored in an array with shape ", draws.shape)
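
To see what a multivariate normal draw involves, here is a minimal NumPy-only sketch that produces equivalent samples by hand, using a Cholesky factor of the covariance matrix. The `jitter` value and the helper name `squared_exponential` are assumptions made for this illustration; they do not appear in the cell above, and `np.random.multivariate_normal` may use a different factorization internally.

```python
import numpy as np

def squared_exponential(x1, x2, theta0):
    """Same functional form as the kernel defined earlier."""
    return np.exp(-theta0 * (x1 - x2) ** 2)

rng = np.random.RandomState(1)
xgrid = np.linspace(0, 350, 100)
h = 10.0
theta0 = 0.5 / h**2

# Covariance matrix via broadcasting: (100,) against (100, 1) -> (100, 100).
C = squared_exponential(xgrid, xgrid[:, None], theta0)

# A small "jitter" on the diagonal (an assumed value) keeps the nearly
# singular matrix numerically positive definite.
jitter = 1e-8
L = np.linalg.cholesky(C + jitter * np.eye(len(xgrid)))

# If z ~ N(0, I), then L z ~ N(0, C): each row of `draws` is one sample y(x).
z = rng.standard_normal((3, len(xgrid)))
draws = np.dot(z, L.T)

print(draws.shape)
```

The draws will not match those from `np.random.multivariate_normal` number for number, but they follow the same distribution up to the tiny added jitter.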

Let's plot these, to see what our prior looks like.

In [None]:
# Start a 4-panel figure:
fig = plt.figure()

# Plot our three prior draws:
ax = fig.add_subplot(221)
ax.plot(xgrid, draws[0].T, '-r')
ax.plot(xgrid, draws[1].T, '-g')
ax.plot(xgrid, draws[2].T, '-b', label='Prior $y(x)$')
ax.set_xlim(0, 350)
ax.set_ylim(-5, 5)
ax.set_xlabel('$x$')
ax.set_ylabel('$y(x)$')
ax.legend(fontsize=8)

Each sampled $y(x)$ is a draw from a multivariate Gaussian whose covariance matrix has unit variance on the diagonal and off-diagonal elements determined by the covariance function. Try changing `h` to see what happens to the smoothness of the samples.
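
One way to anticipate the effect of `h` without replotting is to compute the correlation between neighboring grid points. The sketch below assumes the same kernel and the same 100-point grid on $[0, 350]$; the function name `neighbor_correlation` is made up for this illustration.

```python
import numpy as np

def neighbor_correlation(h, dx):
    """Correlation between two points a distance dx apart, for correlation length h."""
    theta0 = 0.5 / h**2
    return np.exp(-theta0 * dx**2)

dx = 350.0 / 99  # spacing of a 100-point grid on [0, 350]
for h in (1.0, 10.0, 100.0):
    print("h = %6.1f  ->  neighbor correlation = %.4f" % (h, neighbor_correlation(h, dx)))
```

A correlation close to 1 means adjacent values of $y(x)$ move together, giving smooth curves; a correlation close to 0 means adjacent values are nearly independent, giving noisy ones.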

For our data to be well interpolated by this Gaussian Process, it will need to be rescaled such that it has zero mean and unit variance. There are [standard methods for doing this](http://scikit-learn.org/stable/modules/preprocessing.html), but we'll do this rescaling here for transparency - and so we know what to add back in later!

In [None]:
class Rescaler(object):
    """Rescale data to zero mean and unit variance, and back again."""
    def __init__(self, y, err):
        self.original_data = y
        self.original_err = err
        self.mean = np.mean(y)
        self.std = np.std(y)
        self.transform()

    def transform(self):
        # Shift and scale the data; the uncertainties are only scaled.
        self.y = (self.original_data - self.mean) / self.std
        self.err = self.original_err / self.std

    def invert(self, scaled_y, scaled_err):
        # Map rescaled values and uncertainties back to the original units.
        return (scaled_y * self.std + self.mean, scaled_err * self.std)

In [None]:
rescaled = Rescaler(y, sigmay)
print('Mean, variance of rescaled data: ',np.round(np.mean(rescaled.y)), np.round(np.var(rescaled.y)))

Check that we can undo the scaling, for any `y` and `sigmay`: 

In [None]:
y2, sigmay2 = rescaled.invert(rescaled.y, rescaled.err)
print('Maximum differences in y, sigmay, after round trip: ',np.max(np.abs(y2 - y)), np.max(np.abs(sigmay2 - sigmay)))
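
For comparison, the same rescaling could be done with one of the standard `scikit-learn` preprocessing tools. This is a sketch using `StandardScaler`, with made-up arrays `y_demo` and `sigmay_demo` standing in for the notebook's `y` and `sigmay`. Note that the scaler knows nothing about uncertainties, so those still have to be divided by the scale by hand.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-ins for the notebook's y and sigmay arrays.
y_demo = np.array([110.0, 95.0, 160.0, 140.0, 205.0])
sigmay_demo = np.array([10.0, 12.0, 8.0, 15.0, 9.0])

# StandardScaler expects a 2D array of shape (n_samples, n_features).
scaler = StandardScaler()
y_scaled = scaler.fit_transform(y_demo[:, None]).ravel()

# The uncertainties are only divided by the scale; no mean is subtracted.
sigmay_scaled = sigmay_demo / scaler.scale_[0]

# Round trip back to the original units.
y_back = scaler.inverse_transform(y_scaled[:, None]).ravel()
sigmay_back = sigmay_scaled * scaler.scale_[0]

print("Mean, std of rescaled data:", y_scaled.mean(), y_scaled.std())
```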

### Constraining the GP

Now, using the same covariance function, let's "fit" the GP by constraining each draw from the GP to go through our data points. Let's first look at how this would work for two data points with no uncertainty.

In [None]:
# Choose two of our datapoints:
x1 = np.array([x[10], x[12]])
rescaled_y1 = np.array([rescaled.y[10], rescaled.y[12]])
rescaled_sigmay1 = np.array([rescaled.err[10], rescaled.err[12]])

# Instantiate a GP model:
gp1 = GaussianProcess(corr='squared_exponential', theta0=theta0,
                      thetaL=0.0001, thetaU=1000.0,
                      random_state=0)

# Fit it to our two noiseless datapoints:
gp1.fit(x1[:, None], rescaled_y1)

# Now predict y(x) everywhere on our xgrid: 
rescaled_ygrid1, MSE1 = gp1.predict(xgrid[:, None], eval_MSE=True)
rescaled_ygrid1_err = np.sqrt(MSE1)

# And undo scaling:
ygrid1, ygrid1_err = rescaled.invert(rescaled_ygrid1, rescaled_ygrid1_err)
y1, sigmay1 = rescaled.invert(rescaled_y1, rescaled_sigmay1)

In [None]:
ax = fig.add_subplot(222)
ax.plot(xgrid, ygrid1, '-', color='gray', label='Posterior mean $y(x)$')
ax.fill_between(xgrid, ygrid1 - ygrid1_err, ygrid1 + ygrid1_err, color='gray', alpha=0.3)
ax.plot(x1, y1, '.k', ms=6, label='Noiseless constraints')
ax.set_xlim(0, 350)
ax.set_ylim(0, 250)
ax.set_xlabel('$x$')
ax.legend(fontsize=8)
fig

In the absence of information, the GP defaults to its mean function, which we chose to be a constant. 
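
This behavior follows directly from the standard formulae for conditioning a Gaussian. The sketch below implements them in NumPy for a zero-mean, unit-variance GP with the squared exponential kernel. The function `condition_noiseless` and the two datapoints are hypothetical, chosen for illustration, and this is the textbook calculation rather than a reproduction of what `scikit-learn` does internally.

```python
import numpy as np

def squared_exponential(x1, x2, theta0):
    return np.exp(-theta0 * (x1 - x2) ** 2)

def condition_noiseless(x_train, y_train, x_test, theta0):
    """Posterior mean and variance of a zero-mean, unit-variance GP
    constrained to pass exactly through (x_train, y_train)."""
    K = squared_exponential(x_train, x_train[:, None], theta0)      # (n, n)
    K_star = squared_exponential(x_train, x_test[:, None], theta0)  # (m, n)
    mean = K_star.dot(np.linalg.solve(K, y_train))
    # Prior variance is k(x, x) = 1; subtract what the data explain.
    var = 1.0 - np.sum(K_star * np.linalg.solve(K, K_star.T).T, axis=1)
    return mean, np.clip(var, 0.0, None)

# Hypothetical rescaled datapoints (not taken from the notebook's dataset).
x_train = np.array([100.0, 150.0])
y_train = np.array([-0.5, 0.8])
x_test = np.array([100.0, 150.0, 340.0])

mean, var = condition_noiseless(x_train, y_train, x_test, theta0=0.5 / 10.0**2)
print("Posterior mean:    ", mean)
print("Posterior variance:", var)
```

At the two constrained points the posterior mean reproduces the data and the variance vanishes. At $x = 340$, many correlation lengths from either point, the mean returns to zero (the prior mean in rescaled units) and the variance returns to one.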

### Including Observational Uncertainties

The mechanism for including uncertainties is a little esoteric: `scikit-learn` wants to be given a "nugget" to add to the diagonal elements of the correlation matrix.

In [None]:
# Choose two of our datapoints:
x2 = np.array([x[10], x[12]])
rescaled_y2 = np.array([rescaled.y[10], rescaled.y[12]])
rescaled_sigmay2 = np.array([rescaled.err[10], rescaled.err[12]])

# Instantiate a GP model, including a nugget of observational errors:
gp2 = GaussianProcess(corr='squared_exponential', theta0=theta0,
                      thetaL=0.0001, thetaU=1000.0,
                      nugget=(rescaled_sigmay2 / rescaled_y2) ** 2, 
                      random_state=0)

# Fit it to our two noisy datapoints:
gp2.fit(x2[:, None], rescaled_y2)

# Now predict y(x) everywhere on our xgrid: 
rescaled_ygrid2, MSE2 = gp2.predict(xgrid[:, None], eval_MSE=True)
rescaled_ygrid2_err = np.sqrt(MSE2)

# And undo scaling:
ygrid2, ygrid2_err = rescaled.invert(rescaled_ygrid2, rescaled_ygrid2_err)
y2, sigmay2 = rescaled.invert(rescaled_y2, rescaled_sigmay2)

In [None]:
ax = fig.add_subplot(223)
ax.plot(xgrid, ygrid2, '-', color='gray', label='Posterior mean $y(x)$')
ax.fill_between(xgrid, ygrid2 - ygrid2_err, ygrid2 + ygrid2_err, color='gray', alpha=0.3)
ax.errorbar(x2, y2, sigmay2, fmt='.k', ms=6, label='Noisy constraints')
ax.set_xlim(0, 350)
ax.set_ylim(0, 250)
ax.set_xlabel('$x$')
ax.set_ylabel('$y(x)$')
ax.legend(fontsize=8)
fig
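
The effect of observational noise can also be sketched directly in NumPy. The version below assumes the textbook formulation, in which the noise variances $\sigma_y^2$ (in rescaled units) are added to the diagonal of the covariance matrix of the training points. This is not necessarily how the `nugget` is defined or applied inside `scikit-learn`, and the function `condition_noisy` and its inputs are hypothetical.

```python
import numpy as np

def squared_exponential(x1, x2, theta0):
    return np.exp(-theta0 * (x1 - x2) ** 2)

def condition_noisy(x_train, y_train, sigma_train, x_test, theta0):
    """Posterior mean and variance of a zero-mean, unit-variance GP given
    observations with independent Gaussian errors sigma_train."""
    K = squared_exponential(x_train, x_train[:, None], theta0)
    # Assumption: noise variances enter additively on the diagonal.
    K_noisy = K + np.diag(sigma_train ** 2)
    K_star = squared_exponential(x_train, x_test[:, None], theta0)
    mean = K_star.dot(np.linalg.solve(K_noisy, y_train))
    var = 1.0 - np.sum(K_star * np.linalg.solve(K_noisy, K_star.T).T, axis=1)
    return mean, var

# Hypothetical rescaled datapoints and uncertainties.
x_train = np.array([100.0, 150.0])
y_train = np.array([-0.5, 0.8])
sigma_train = np.array([0.3, 0.3])

# Evaluate the posterior at the training locations themselves.
mean, var = condition_noisy(x_train, y_train, sigma_train, x_train,
                            theta0=0.5 / 10.0**2)
print("Posterior mean at the data:    ", mean)
print("Posterior variance at the data:", var)
```

With noise included, the posterior mean no longer passes exactly through the data: it is pulled part of the way back toward the prior mean, and the posterior variance at the datapoints stays above zero.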

### Using all the Data

Now let's extend the above example to use all of our datapoints. This additional information should pull the predictions further away from the initial mean function.

In [None]:
# Use all of our datapoints:
x3 = x
rescaled_y3 = rescaled.y
rescaled_sigmay3 = rescaled.err

# Instantiate a GP model, including a nugget of observational errors:
gp3 = GaussianProcess(corr='squared_exponential', theta0=theta0,
                      thetaL=0.0001, thetaU=1000.0,
                      nugget=(rescaled_sigmay3 / rescaled_y3) ** 2,
                      random_state=0)

# Fit it to our noisy datapoints:
gp3.fit(x3[:, None], rescaled_y3)

# Now predict y(x) everywhere on our xgrid: 
rescaled_ygrid3, MSE3 = gp3.predict(xgrid[:, None], eval_MSE=True)
rescaled_ygrid3_err = np.sqrt(MSE3)

# And undo scaling:
ygrid3, ygrid3_err = rescaled.invert(rescaled_ygrid3, rescaled_ygrid3_err)
y3, sigmay3 = rescaled.invert(rescaled_y3, rescaled_sigmay3)

# We have fit for the `h` parameter: print the result here:
print("Best-fit h =", np.sqrt(0.5 / gp3.theta_[0]))
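
How might a value of `h` be chosen from the data? A common approach is to maximize the marginal likelihood of the data with respect to the kernel hyperparameter. The sketch below does this by brute force on a grid, in NumPy, for a zero-mean, unit-variance GP with additive noise. The synthetic dataset and the function `log_marginal_likelihood` are assumptions for illustration; the optimizer and likelihood used by `scikit-learn` may differ in detail.

```python
import numpy as np

def squared_exponential(x1, x2, theta0):
    return np.exp(-theta0 * (x1 - x2) ** 2)

def log_marginal_likelihood(h, x, y, sigma):
    """log p(y | h) for a zero-mean, unit-variance GP with Gaussian noise."""
    theta0 = 0.5 / h**2
    K = squared_exponential(x, x[:, None], theta0) + np.diag(sigma ** 2)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(K, y)
    # log det K = 2 * sum(log(diag(L))), so -0.5 * log det K is the middle term.
    return (-0.5 * y.dot(alpha)
            - np.sum(np.log(np.diag(L)))
            - 0.5 * len(x) * np.log(2.0 * np.pi))

# Hypothetical rescaled data: a smooth trend plus noise (not the notebook's dataset).
rng = np.random.RandomState(0)
x = np.sort(rng.uniform(0.0, 350.0, 30))
sigma = 0.2 * np.ones_like(x)
y = np.sin(x / 60.0) + sigma * rng.standard_normal(len(x))

# Evaluate on a logarithmic grid of trial correlation lengths and keep the best.
h_grid = np.logspace(0.0, 3.0, 61)
logL = np.array([log_marginal_likelihood(h, x, y, sigma) for h in h_grid])
h_best = h_grid[np.argmax(logL)]
print("Maximum-likelihood h on the grid:", h_best)
```

Very small `h` treats every point as independent, while very large `h` forces a nearly constant function; the likelihood penalizes both extremes.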

In [None]:
ax = fig.add_subplot(224)
ax.plot(xgrid, ygrid3, '-', color='gray', label='Posterior mean $y(x)$')
ax.fill_between(xgrid, ygrid3 - ygrid3_err, ygrid3 + ygrid3_err, color='gray', alpha=0.3)
ax.errorbar(x3, y3, sigmay3, fmt='.k', ms=6, label='Data')
ax.set_xlim(0, 350)
ax.set_ylim(0, 250)
ax.set_xlabel('$x$')
ax.legend(fontsize=8)
fig

We now see the Gaussian Process model providing a smooth interpolation between the points. Shown here is the posterior mean curve; samples drawn from the GP will show fluctuations, but all will be plausible under our assumptions.

## Endnote

In `scikit-learn` v0.18, the `GaussianProcessRegressor` model provides the ability to draw samples from the posterior PDF $P(y(x))$, and a more intuitive API for constructing kernels:

In [None]:
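# from sklearn.gaussian_process import GaussianProcessRegressor
# from sklearn.gaussian_process.kernels import RBF
# kernel = RBF(length_scale=10, length_scale_bounds=(0.01, 100.0))
# gp0 = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=9)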
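
A fuller sketch of how those calls might be used is given below. It runs on a small synthetic dataset (hypothetical, standing in for `rescaled.y` and `rescaled.err`), passes the noise variances through the `alpha` argument, and draws posterior samples with `sample_y`. The kernel settings are the ones in the commented lines above; everything else is an assumption for illustration, and this has not been tuned for the straight line dataset.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Hypothetical rescaled data standing in for rescaled.y and rescaled.err.
rng = np.random.RandomState(0)
x_demo = np.sort(rng.uniform(0.0, 350.0, 20))
err_demo = 0.2 * np.ones_like(x_demo)
y_demo = np.sin(x_demo / 60.0) + err_demo * rng.standard_normal(len(x_demo))

kernel = RBF(length_scale=10, length_scale_bounds=(0.01, 100.0))

# alpha is added to the diagonal of the kernel matrix: here, the noise variances.
gp0 = GaussianProcessRegressor(kernel=kernel, alpha=err_demo ** 2,
                               n_restarts_optimizer=9, random_state=0)
gp0.fit(x_demo[:, None], y_demo)

# Posterior mean and standard deviation on a prediction grid.
xgrid = np.linspace(0.0, 350.0, 100)
mean, std = gp0.predict(xgrid[:, None], return_std=True)

# Three draws from the posterior, one per column.
samples = gp0.sample_y(xgrid[:, None], n_samples=3, random_state=0)

print("Fitted kernel:", gp0.kernel_)
```

Here the per-point noise variance is supplied in absolute (rescaled) units, which is arguably easier to reason about than the relative `nugget` used earlier.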