In [1]:
import numpy as np
import matplotlib.pyplot as plt

# Gaussian processes
---

We will concentrate on regression tasks here.

### Parametric vs Non-parametric

One way to see the regression problem is through linear regression techniques where we try to approximate the function $f(x)$ by a function $y(x,w) = w^T \phi(x)$ where $\phi(x)$ is a projection of $x$ into a higher dimensional space (which can also be seen as a collection of functions $\phi_i(x)$ called basis functions).

The **parametric** approach to regression learns the weights $w$ that minimize the squared error (optionally plus a quadratic regularization term). A new point $x$ is then assigned the value $y(x) = w^T \phi(x)$.

The **non-parametric** approach instead defines a measure of similarity between points via a kernel function $k(x,x')$. When a new point arrives, we assign it a value based on its similarity with the examples of the training set, for which we know both $x_n$ and the corresponding $t_n$:

* for any new point $x$, the value $t$ is a combination of the values $t_n$ for the samples $x_n$
* this combination is based on the similarity $x$ has with the different values $x_n$: the more similar, the closer the values

### Kernel methods

If we are interested in non-parametric methods, we can view $w$ as a linear combination of the samples:

&emsp; $w = \sum_n a_n x_n = X^T a$ where $X$ is the *design matrix* (each row is a sample).

Substituting this into the least-squares cost function $J(w) = \frac{1}{2} \sum_n (w^T x_n - t_n)^2$, we get:

&emsp; $J(a) = \frac{1}{2} \sum (x_n^T X^T a - t_n)^2 = \frac{1}{2} \sum (t_n^2 + a^T X x_n x_n^T X^T a - 2 x_n^T X^T a t_n)$

&emsp; $J(a) = \frac{1}{2} t^T t + \frac{1}{2} a^T X X^T X X^T a - a^T X X^T t$

&emsp; $J(a) = \frac{1}{2} t^T t + \frac{1}{2} a^T K K a - a^T K t$

where $K = X X^T$ is the Gram matrix:

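&emsp; $(K)_{n,m} = \sum_k X_{nk} X_{mk} = \phi(x_n)^T \phi(x_m)$, the inner product of the feature representations of the inputs $x_n$ and $x_m$ (in the derivation above, $x_n$ stands for the feature vector $\phi(x_n)$, which is the $n$-th row of $X$)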

We set the gradient to zero to find the best parameters $a$ (assuming $K$ is invertible):

&emsp; $\nabla_a J(a) = K K a - K t = 0 \implies a = K^{-1} t$

**Result:** We can therefore predict the value of new inputs from their similarity with the training samples, with a result equivalent to that of the linear regression method:

&emsp; $a = K^{-1} t$

&emsp; $w = X^T a$

&emsp; $y(x) = w^T \phi(x) = k(x)^T K^{-1} t$

&emsp; $k(x)^T = (k(x, x_1),..., k(x, x_N))$

&emsp; $(K)_{n,m} = \phi(x_n)^T \phi(x_m) = k(x_n, x_m)$

This is what we call the **dual** view of regression. Instead of inverting a matrix of size $D \times D$, where $D$ is the dimension of the feature space that $\phi$ maps into (as in the usual linear regression formula), we invert a similarity matrix of size $N \times N$, where $N$ is the number of samples. This is advantageous when $D$ is much larger than $N$, and it even makes infinite-dimensional feature spaces possible.
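The equivalence between the two views can be checked numerically. The sketch below is only an illustration under assumptions that are not in the text above: the design matrix is random toy data, and $N < D$ is chosen so that $K = X X^T$ is invertible. In that setting the dual weights $w = X^T K^{-1} t$ should coincide with the minimum-norm least-squares solution returned by `np.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: N samples already mapped to D features (rows are phi(x_n)).
N, D = 5, 8
X = rng.normal(size=(N, D))   # design matrix
t = rng.normal(size=N)        # targets

# Dual solution: a = K^{-1} t, then w = X^T a.
K = X @ X.T                   # Gram matrix, N x N
a = np.linalg.solve(K, t)
w_dual = X.T @ a

# Primal solution: minimum-norm least-squares weights.
w_primal, *_ = np.linalg.lstsq(X, t, rcond=None)

# Prediction for a new feature vector phi(x), computed both ways.
phi_x = rng.normal(size=D)
k_x = X @ phi_x               # k(x)_n = phi(x_n)^T phi(x)
y_dual = k_x @ a              # k(x)^T K^{-1} t
y_primal = w_primal @ phi_x   # w^T phi(x)

print(np.allclose(w_dual, w_primal), np.isclose(y_dual, y_primal))
```

Note that the dual computation only ever touches the features through inner products (`X @ X.T` and `X @ phi_x`), which is what allows them to be replaced by a kernel function.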

### Gaussian processes

GPs offer yet another view that leads to similar formulas. Let us go back to the parametric approach of estimating the value $t$ of an input point $x$. We can marginalize over the values of $w$:

&emsp; $p(t|x) = \int p(t|x,w) p(w|D) dw$, with $p(w|D) \propto p(D|w) p(w)$ (the Bayesian approach, as opposed to the MAP or ML approaches)

GPs offer an alternative. Instead of placing a prior on $w$ and marginalizing over $w$, we can marginalize directly over the function $y(x)$:

&emsp; $p(t|x) = \int p(t|x,y) p(y|D) dy$, where $y$ is a function

This looks intractable at first (the space of functions is huge), but we can find our way through it using some basic observations and assumptions:

* $p(w)$ is Gaussian, centered at 0, with isotropic covariance:
    * $p(w) = N(w|0,\alpha^{-1}I_D)$
    * $Cov[w] = E[w w^T] = \alpha^{-1} I_D$
* $y(x_n) = w^T x_n$ is a linear combination of jointly Gaussian variables, and is therefore Gaussian
* $Y=Xw$, the vector of all $y(x_n)$, is a linear transformation of $w$, and is therefore jointly Gaussian

$Y$ therefore follows a multivariate Gaussian distribution of dimension $N$, and we can compute its mean and covariance as follows:

* $E[Y] = X E[w] = 0$
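* $Cov[Y] = E[Y Y^T] = X E[w w^T] X^T = \frac{1}{\alpha} X X^T = K$ where $K$ is the Gram matrix of the kernel $k(x_n, x_m) = \frac{1}{\alpha} \phi(x_n)^T \phi(x_m)$ (the factor $\frac{1}{\alpha}$ is absorbed into the kernel)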

A Gaussian Process is defined as a probability distribution over functions $y(x)$ such that, for any finite set of points $x_1, ..., x_N$, the values $y(x_1), ..., y(x_N)$ have a joint Gaussian distribution. This distribution is completely specified by:

* its mean, which we generally assume to be equal to 0
* its covariance matrix $K$ such that $(K)_{n,m} = k(x_n, x_m)$, which equals $\alpha^{-1} \phi(x_n)^T \phi(x_m)$ for the linear model above
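The covariance formula $Cov[Y] = \alpha^{-1} X X^T$ can be checked by simulation. The following is a minimal sketch, assuming an arbitrary random design matrix and an illustrative value of $\alpha$ (neither comes from the text): it draws many weight vectors from the prior, maps each to $Y = Xw$, and compares the empirical second moment of $Y$ with the analytic matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

alpha = 2.0                      # prior precision of w (illustrative value)
N, D = 4, 3
X = rng.normal(size=(N, D))      # hypothetical design matrix

# Draw many weight vectors w ~ N(0, alpha^{-1} I_D) and map them to Y = X w.
n_draws = 200_000
W = rng.normal(scale=alpha ** -0.5, size=(n_draws, D))
Y = W @ X.T                      # each row is one draw of (y(x_1), ..., y(x_N))

K_empirical = Y.T @ Y / n_draws  # Monte Carlo estimate of E[Y Y^T]
K_analytic = X @ X.T / alpha

max_abs_error = np.abs(K_empirical - K_analytic).max()
print(max_abs_error)
```

The discrepancy should shrink roughly like $1/\sqrt{\text{n\_draws}}$, since it is a Monte Carlo estimate.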



### Regression with Gaussian Processes

So far, we have only modeled the distribution $p(Y)$. Let us assume that the noise on the observed targets is also Gaussian, $p(t_n|y_n) = N(t_n|y_n,\beta^{-1})$, independent and with the same precision $\beta$ for all data points:

&emsp; $p(T|Y) = N(T|Y,\beta^{-1}I_N)$

Since $T = Y + \epsilon$, where $\epsilon$ is Gaussian noise independent of $Y$, and since the sum of two independent Gaussian variables is Gaussian with means and covariances added, the marginal distribution of $T$ is also Gaussian:

&emsp; $p(T) = \int p(T|Y) p(Y) dY = N(T|0,C)$ where $C(x_n, x_m) = k(x_n, x_m) + \beta^{-1} \delta_{nm}$

Now, we want to predict the value $t$ of a new point $x$ through $p(t|x,D)$. We proceed as if we had one additional data point in our data set, and extend the covariance matrix $C_N$ into $C_{N+1}$:

&emsp; $C_{N+1} = \begin{bmatrix} C_N & k \\ k^T & c \end{bmatrix}$ where $k = (k(x,x_1),...,k(x,x_N))^T$ and $c = k(x,x) + \beta^{-1}$

What we get is a new multivariate Gaussian distribution (of dimension $N+1$) over $(t_1, ..., t_N, t)$. Conditioning a joint Gaussian on some of its components yields another Gaussian, so $p(t|x,D)$ is Gaussian, and its mean and variance are given by (**TODO: demonstration**):

* $mean(t) = k(x)^T C_N^{-1} T$ (the same form as the kernel method shown above, with $C_N = K + \beta^{-1} I_N$ in place of $K$)
* $\sigma^2(t) = c - k(x)^T C_N^{-1} k(x) = k(x,x) + \beta^{-1} - k(x)^T C_N^{-1} k(x)$

We can see the mean of $t$ as either:

* A linear combination of the $t_n$: $mean(t) = [k(x)^T C_N^{-1}] T = \sum_n h_n(x) t_n$, with weights $h_n(x)$ that depend on the new point $x$
* A linear combination of kernels: $mean(t) = k(x)^T [C_N^{-1} T] = \sum_n a_n k(x,x_n)$, with coefficients $a = C_N^{-1} T$ that depend only on the training data
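The predictive mean and variance can be sketched in a few lines of NumPy, using the noisy covariance $C_N = K + \beta^{-1} I_N$ defined above. Several choices here are assumptions made for illustration and do not come from the text: the squared-exponential kernel, its length scale, the value of $\beta$, the toy sine data, and the helper names `rbf_kernel` and `gp_predict`.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    """Squared-exponential kernel between two sets of 1-D inputs (an assumed choice)."""
    diff = A[:, None] - B[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)

def gp_predict(x_train, t_train, x_new, beta=25.0, length_scale=1.0):
    """Predictive mean and variance of t at each point of x_new."""
    # C_N = K + beta^{-1} I_N
    C = rbf_kernel(x_train, x_train, length_scale) + np.eye(len(x_train)) / beta
    k = rbf_kernel(x_train, x_new, length_scale)   # shape (N, M), one column per new point
    c = 1.0 + 1.0 / beta                           # k(x, x) = 1 for this kernel
    mean = k.T @ np.linalg.solve(C, t_train)       # k(x)^T C_N^{-1} T
    var = c - np.sum(k * np.linalg.solve(C, k), axis=0)  # c - k(x)^T C_N^{-1} k(x)
    return mean, var

rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 5.0, 20)
t_train = np.sin(x_train) + rng.normal(scale=0.2, size=x_train.shape)

x_new = np.array([2.5, 50.0])    # one point inside the data, one far away from it
mean, var = gp_predict(x_train, t_train, x_new)
print(mean, var)
```

Far from the training data the kernel vector $k(x)$ vanishes, so the prediction falls back to the prior: zero mean and variance $k(x,x) + \beta^{-1}$. Close to the data, the variance is reduced by the term $k(x)^T C_N^{-1} k(x)$.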

### Demonstration

In [None]:
# TODO: show some examples of regression, both in the linear setting (with basis functions) and with Gaussian Processes

# Support Vector Machines
---