## Session #2: Gaussian process regression

#### Let us now actually use a Gaussian process and do more than just sample from it!

We will define a regression model with a Gaussian process. 

The standard linear regression model uses training data $(X, y)$ to predict the value of the target variable at an unseen input $x_*$ by learning the underlying function $f(x)$:

$$
f(x) = x^T w
$$

We assume that the observations $y$ come with additive Gaussian noise on top, such that: 

$$
y(x) = f(x) + \epsilon
$$

$$
\epsilon \sim \mathcal{N}(0, \sigma_n^2)
$$

To find this underlying function we start off with a Gaussian process prior with zero mean and squared exponential covariance function.
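
For reference, the squared exponential covariance function can be written out as below. This is the parametrisation that `calculate_covariance_matrix` uses later in this notebook, where the parameter $l$ divides the squared distance directly. Compared with the common textbook form $\exp(-\lVert x_p - x_q \rVert^2 / (2\ell^2))$, $l$ therefore plays the role of a squared length scale $\ell^2$:

$$
k(x_p, x_q) = \exp\left(-\frac{\lVert x_p - x_q \rVert^2}{2\, l}\right)
$$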

## Your task: 

I provide some training data $(X, y)$ over a given range. The $y$-values are noisy observations of the underlying function $f$.
Your task is to 

a) define the GP prior using the training and the testing $x$-values

b) condition the GP prior on the training data to get the predictive distribution

c) play with the length scale of the covariance to get a good fit. 

In [None]:
import numpy as np
import matplotlib.pyplot as plt 
%matplotlib inline
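# 'seaborn-deep' was renamed to 'seaborn-v0_8-deep' in matplotlib 3.6; use whichever name is available
style = 'seaborn-v0_8-deep' if 'seaborn-v0_8-deep' in plt.style.available else 'seaborn-deep'
plt.style.use(style)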

In [None]:
def calculate_covariance_matrix(x_p, x_q, l=.01): 
    # for a convenient computation, we need a two dimensional array
    if x_p.ndim < 2: 
        x_p = x_p.reshape(-1 ,1)
    if x_q.ndim < 2: 
        x_q = x_q.reshape(-1, 1)
    # calculate the squared distance: x^2 - 2xy + y^2
    square_dist = np.sum(x_p ** 2, 1).reshape(-1, 1) + np.sum(x_q ** 2, 1) - 2 * np.dot(x_p, x_q.T)
    
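    # squared exponential kernel: exp(-0.5 * squared distance / l)
    # note: l divides the squared distance directly, so it acts like a squared length scale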
    return np.exp((-0.5 / l) * square_dist)

def secret_function(x, omega=2.): 
    return np.sin(omega * x)

### Here comes the data: 

In [None]:
# training data 
n_train = 20
sigma_noise = .2
# np.linspace needs an integer number of samples, hence the integer division
xtrain = np.hstack((np.linspace(-5, 2, n_train // 2), np.linspace(4, 9, n_train // 2)))
ytrain = secret_function(xtrain) + np.random.standard_normal(xtrain.size) * sigma_noise

# we take many test points for better visualization
n_test = 1000
xtest = np.linspace(-10, 10, n_test)
ytest = secret_function(xtest)

In [None]:
plt.figure(figsize=(15, 5))
plt.plot(xtest, ytest, label='underlying function f(x)')
plt.plot(xtrain, ytrain, 'o', label='training data')
plt.title('Training data and underlying function')
plt.ylabel('y')
plt.xlabel('x')
plt.legend();

### a) Define the GP prior
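Because $m(x) = 0$ for the prior, all you need to define it is the covariance matrix of the training data. Since we observe noisy values $\mathbf{y}$ rather than $\mathbf{f}$ itself, the noise variance is added on the diagonal.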

\begin{align}
f \sim \mathcal{GP}(0, k(\mathbf{x}, \mathbf{x'})), \qquad \mathbf{y} \sim \mathcal{N}(\mathbf{0}, K(X, X) + \sigma_n^2 \mathbf{I})
\end{align}


In [None]:
# use the training data to define the covariance matrix  


What now? This covariance matrix contains only the training data points $X$. To make predictions we need to incorporate the test data $X_*$ as well. So actually we need the joint covariance matrix of the training and the test data: 

\begin{align}
\begin{bmatrix}
	\mathbf{y} \\ \mathbf{f_*}
\end{bmatrix}
 \sim \mathcal{N}\left(0, 
	\begin{bmatrix} 
		K(X, X) + \sigma_n^2 \mathbf{I} & K(X,X_*) \\
		K(X_*,X) & K(X_*, X_*)
	\end{bmatrix} \right)
\end{align}
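
The block structure of this joint covariance can be assembled with `np.block`. The following is a minimal sketch on toy inputs, not the notebook's data: the helper `sq_exp_kernel` and all numeric values are illustrative assumptions, with the kernel written in the same parametrisation as `calculate_covariance_matrix`.

```python
import numpy as np

def sq_exp_kernel(x_p, x_q, l=1.0):
    """Squared exponential kernel exp(-0.5 * (x_p - x_q)^2 / l) for 1-D inputs."""
    diff = x_p.reshape(-1, 1) - x_q.reshape(1, -1)
    return np.exp(-0.5 / l * diff ** 2)

# toy inputs (illustrative values only)
x_train = np.array([-1.0, 0.0, 1.5])
x_test = np.linspace(-2.0, 2.0, 5)
sigma_n = 0.2

# the four blocks of the joint covariance
K_xx = sq_exp_kernel(x_train, x_train) + sigma_n ** 2 * np.eye(x_train.size)
K_xs = sq_exp_kernel(x_train, x_test)   # K(X, X_*)
K_ss = sq_exp_kernel(x_test, x_test)    # K(X_*, X_*)

# K(X_*, X) is the transpose of K(X, X_*)
joint_cov = np.block([[K_xx, K_xs],
                      [K_xs.T, K_ss]])
print(joint_cov.shape)
```

The result is a symmetric matrix of size (number of training points + number of test points) squared, with the noise variance appearing only on the diagonal of the training block.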

In [None]:
# define the joint prior 
# by combining training and test data covariance matrices into a single large matrix like in the equation above



We could now sample from this GP like before, but we don't. 

Rather, we will calculate the predictive distribution of $f_*$ to make predictions for new values $x_*$. 

### b) Calculate the predictive distribution

We get the predictive distribution by conditioning the joint prior on the training data $(X, y)$. The predictive distribution is again, guess what, a Gaussian: 

\begin{align}
\mathbf{f}_* \mid X_*, X, \mathbf{y} &\sim \mathcal{N} (m(\mathbf{x}) , \Sigma)
\end{align}


\begin{align}
m(\mathbf{x}) &= K(X_*, X) [K(X, X) + \sigma_n^2 \mathbf{I}]^{-1}\mathbf{y} \\
\end{align}

\begin{align}
\Sigma &= K(X_*, X_*) - K(X_*, X) [K(X, X) + \sigma_n^2 \mathbf{I}]^{-1}K(X,X_*)
\end{align}
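
The two formulas translate almost directly into NumPy. The sketch below is one possible way to evaluate them and swaps the explicit matrix inverse for a Cholesky factorisation, which is a common and numerically more stable choice. The names `sq_exp_kernel` and `gp_posterior_cholesky` and the toy data are assumptions made for illustration.

```python
import numpy as np

def sq_exp_kernel(x_p, x_q, l=1.0):
    """Squared exponential kernel exp(-0.5 * (x_p - x_q)^2 / l) for 1-D inputs."""
    diff = x_p.reshape(-1, 1) - x_q.reshape(1, -1)
    return np.exp(-0.5 / l * diff ** 2)

def gp_posterior_cholesky(x_train, y_train, x_test, sigma_n=0.1, l=1.0):
    """Predictive mean and covariance, computed without forming an explicit inverse."""
    K = sq_exp_kernel(x_train, x_train, l) + sigma_n ** 2 * np.eye(x_train.size)
    K_s = sq_exp_kernel(x_train, x_test, l)    # K(X, X_*)
    K_ss = sq_exp_kernel(x_test, x_test, l)    # K(X_*, X_*)

    L = np.linalg.cholesky(K)                  # K = L L^T
    # alpha = K^{-1} y via two solves against the Cholesky factor
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_s.T @ alpha                       # K(X_*, X) K^{-1} y

    v = np.linalg.solve(L, K_s)                # L^{-1} K(X, X_*)
    cov = K_ss - v.T @ v                       # K(X_*, X_*) - K(X_*, X) K^{-1} K(X, X_*)
    return mean, cov

# toy data (illustrative values only)
rng = np.random.default_rng(0)
x_train = np.linspace(-3.0, 3.0, 8)
y_train = np.sin(2.0 * x_train) + 0.2 * rng.standard_normal(x_train.size)
x_test = np.linspace(-4.0, 4.0, 50)

mean, cov = gp_posterior_cholesky(x_train, y_train, x_test, sigma_n=0.2, l=0.5)
# per-point standard deviation; clip guards against tiny negative round-off
std = np.sqrt(np.clip(np.diag(cov), 0.0, None))
```

Both routes compute the same quantities. The factorisation mainly pays off when the training covariance is close to singular, for example with very small noise.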

In [None]:
# Use the formulas above to define the mean function and the covariance matrix of the predictive distribution
# the mean function


# the covariance matrix. 



Because the predictive distribution is Gaussian, the mean and the covariance completely define our estimate of the underlying function $f$. The mean is our prediction, and the variance at each test point (the diagonal of the covariance matrix) quantifies our uncertainty about that prediction.

Plot the prediction for $f(x\_test)$ and the corresponding variance or standard deviation at every position, e.g., as a shaded area around the prediction (check out `plt.fill_between`).

In [None]:
# get the standard deviation of each individual x_test from the covariance matrix: 


# plot the prediction m(x) 
plt.figure(figsize=(15, 5))

# and the covariance for every x: plt.fill_between

# plot the training data 

# plot the underlying function 
plt.title('Mean and variance of the predictive distr. with training data points');


### c) Play with the length scale parameter of the covariance function to get a better fit

You can use the plotting function below if you want. It just takes the data and the mean and variance of the predictive distribution and plots the results. 
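
Besides eyeballing the plots, one way to compare length scales is to sweep a few candidate values and measure how far the predictive mean lands from the known underlying function. The sketch below assumes a toy setup that mirrors this notebook (a sine with `omega = 2` and the same training ranges). The helper names and the candidate values are illustrative assumptions, and comparing against the true function is only possible here because we generated the data ourselves.

```python
import numpy as np

def sq_exp_kernel(x_p, x_q, l):
    """Squared exponential kernel exp(-0.5 * (x_p - x_q)^2 / l) for 1-D inputs."""
    diff = x_p.reshape(-1, 1) - x_q.reshape(1, -1)
    return np.exp(-0.5 / l * diff ** 2)

def posterior_mean(x_train, y_train, x_test, sigma_n, l):
    """Predictive mean K(X_*, X) [K(X, X) + sigma_n^2 I]^{-1} y."""
    K = sq_exp_kernel(x_train, x_train, l) + sigma_n ** 2 * np.eye(x_train.size)
    K_star = sq_exp_kernel(x_test, x_train, l)
    return K_star @ np.linalg.solve(K, y_train)

# toy data resembling the notebook's setup
rng = np.random.default_rng(1)
sigma_n = 0.2
x_train = np.hstack((np.linspace(-5, 2, 10), np.linspace(4, 9, 10)))
y_train = np.sin(2.0 * x_train) + sigma_n * rng.standard_normal(x_train.size)
x_test = np.linspace(-5, 9, 200)
y_true = np.sin(2.0 * x_test)

# sweep candidate length scale parameters and record the error of the mean
candidates = [0.01, 0.1, 0.5, 1.0, 5.0]
rmse = {}
for l in candidates:
    m = posterior_mean(x_train, y_train, x_test, sigma_n, l)
    rmse[l] = float(np.sqrt(np.mean((m - y_true) ** 2)))

best_l = min(rmse, key=rmse.get)
for l in candidates:
    print(f"l = {l:<5} RMSE = {rmse[l]:.3f}")
print("smallest RMSE at l =", best_l)
```

Very small values make the kernel nearly diagonal, so the mean falls back to zero between data points, while very large values oversmooth the sine.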

In [None]:
# calculate the predictive distribution with a better length scale parameter


# plot the result



## If you are stuck, check out the two functions below. They give a solution for a) and b) and let you solve c) by yourself.

In [None]:
def gp_regression(xtrain, ytrain, xtest, sigma_noise=.1, l=.1): 

    # calculate the covariance matrix 
    k11 = calculate_covariance_matrix(xtrain, xtrain, l=l) + sigma_noise ** 2 * np.eye(xtrain.shape[0])
    k12 = calculate_covariance_matrix(xtrain, xtest, l=l)
    k22 = calculate_covariance_matrix(xtest, xtest, l=l)
    k21 = calculate_covariance_matrix(xtest, xtrain, l=l)
    
    # Use the formulas above to define the mean function and the covariance matrix of the predictive distribution
    # the mean function
    invers_training_K = np.linalg.inv(k11)
    m = k21.dot(invers_training_K).dot(ytrain)
    # the covariance matrix. 
    sigma = k22 - k21.dot(invers_training_K).dot(k12)
    
    return m.squeeze(), sigma.squeeze()

def plot_gp_regression_results(m, sigma, xtrain, ytrain, xtest, ytest): 

    std = np.sqrt(np.diag(sigma))
    
    upper_std = np.squeeze(m) + std
    lower_std = np.squeeze(m) - std
    
    plt.figure(figsize=(15, 5))
    plt.fill_between(xtest, upper_std, lower_std, alpha=0.4)
    plt.plot(xtest, m, 'r', label='Prediction mean')
    plt.plot(xtrain, ytrain, 'go', label='data')
    plt.plot(xtest, ytest, label='underlying function f(x)')
    plt.title('Mean and variance of the predictive distr. with training data points')
    plt.legend(loc=0);

In [None]:

# calculate the mean and variance of the predictive distribution: 
cov_length_scale = .5
mean, variance = gp_regression(xtrain, ytrain, xtest, sigma_noise=sigma_noise, l=cov_length_scale)

# plot the results 
plot_gp_regression_results(mean, variance, xtrain, ytrain, xtest, ytest)