# Gaussian Processes

Gaussian processes are a powerful class of models for both regression and classification. Their greatest practical advantage is that they provide an estimate of their own predictive uncertainty.

## Introduction

In supervised learning, we often use parametric models $p(y|X,\theta)$ to explain data and infer optimal values of the parameters $\theta$ via [maximum likelihood](https://en.wikipedia.org/wiki/Maximum_likelihood_estimation) or [maximum a posteriori](https://de.wikipedia.org/wiki/Maximum_a_posteriori) estimation. If needed, we can also infer a full [posterior distribution](https://en.wikipedia.org/wiki/Posterior_probability) $p(\theta |X, y)$ instead of a point estimate $\hat{\theta}$. As data complexity increases, models with a larger number of parameters are usually needed to explain the data reasonably well. Methods that use models with a fixed number of parameters are called parametric methods.

In non-parametric methods, on the other hand, the number of parameters depends on the dataset size. For example, in [Nadaraya-Watson kernel regression](https://en.wikipedia.org/wiki/Kernel_regression), a weight $w_i$ is assigned to each observed target $y_i$, and to predict the target value at a new point $x$, a weighted average is computed:

$$f(x) = \sum_{i=1}^N w_i (x) y_i$$

$$w_i(x) = \frac{\kappa(x,x_i)}{\sum_{i'=1}^N \kappa(x,x_{i'})}$$

Observations that are closer to $x$ have a higher weight than observations that are further away. Weights are computed from $x$ and the observed $x_i$ with a kernel $\kappa$. A special case is k-nearest neighbors (KNN), where the $k$ closest observations have a weight of $1/k$ and all others have a weight of $0$. Non-parametric methods often need to process all training data for prediction and are therefore slower at inference time than parametric methods. On the other hand, training is usually faster, as non-parametric models only need to remember the training data.
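The weighted average above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the Gaussian form of $\kappa$, the `bandwidth` value, the function names and the toy data are all assumptions made for this example.

```python
import numpy as np

def gaussian_kernel(x, x_i, bandwidth=0.5):
    """Gaussian kernel kappa(x, x_i); kernel form and bandwidth are assumed here."""
    return np.exp(-0.5 * ((x - x_i) / bandwidth) ** 2)

def nadaraya_watson(x, X_train, y_train, bandwidth=0.5):
    """Predict f(x) as a kernel-weighted average of the observed targets."""
    k = gaussian_kernel(x, X_train, bandwidth)  # kappa(x, x_i) for every i
    w = k / k.sum()                             # normalized weights w_i(x)
    return np.dot(w, y_train), w                # f(x) and the weights

# Toy 1-D data (hypothetical): four noise-free observations of sin(x)
X_train = np.array([0.0, 1.0, 2.0, 3.0])
y_train = np.sin(X_train)

f_x, w = nadaraya_watson(1.2, X_train, y_train)
```

The weights sum to one, and the observation at $x_i = 1.0$, which is closest to the query point $x = 1.2$, receives the largest weight. Note that the whole training set is needed for every prediction.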

[Gaussian processes](https://en.wikipedia.org/wiki/Gaussian_process) (GPs) are another example of non-parametric methods. Instead of inferring a distribution over the parameters of a parametric function, Gaussian processes can be used to infer a distribution over functions directly. A Gaussian process defines a prior over functions. After having observed some function values, it can be converted into a posterior over functions. Inference of continuous function values in this context is known as GP regression, but GPs can also be used for classification.

A Gaussian process is a [random process](https://en.wikipedia.org/wiki/Stochastic_process) where any point $x \in \mathbb{R}^d$ is assigned a random variable $f(x)$ and where the joint distribution of a finite number of these variables $p(f(x_1),...,f(x_N))$ is itself Gaussian:

$$p(f|X) = \mathcal{N}(f|\mu, K)$$

where
-  $f = (f(x_1),...,f(x_N))$
-  $\mu = (m(x_1),...,m(x_N))$
-  $K_{ij}=\kappa(x_i,x_j)$

Here, $m$ is the mean function. It is common to use $m(x) = 0$, as GPs are flexible enough to model the mean arbitrarily well. $\kappa$ is a positive definite _kernel function_, also called the _covariance function_. Thus, a Gaussian process is a distribution over functions whose shape (for example, smoothness) is defined by $K$. If points $x_i$ and $x_j$ are considered similar by the kernel, the function values at these points, $f(x_i)$ and $f(x_j)$, can be expected to be similar too.
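To make the idea of a prior over functions concrete, the following sketch draws a few samples from $\mathcal{N}(f|\mu, K)$ with $m(x) = 0$. The text above does not fix a particular kernel, so the squared exponential kernel used here, its parameters `length_scale` and `sigma_f`, the input grid and the small jitter term added for numerical stability are all assumptions of this example.

```python
import numpy as np

def squared_exponential(X1, X2, length_scale=1.0, sigma_f=1.0):
    """Squared exponential kernel (an assumed choice for kappa).

    X1 has shape (n, d), X2 has shape (m, d); returns an (n, m) matrix.
    """
    sqdist = (np.sum(X1**2, axis=1)[:, None]
              + np.sum(X2**2, axis=1)[None, :]
              - 2 * X1 @ X2.T)
    return sigma_f**2 * np.exp(-0.5 * sqdist / length_scale**2)

# Finite set of input points x_1, ..., x_N (here N = 50, d = 1)
X = np.linspace(-5, 5, 50).reshape(-1, 1)

mu = np.zeros(len(X))          # mean vector from m(x) = 0
K = squared_exponential(X, X)  # K_ij = kappa(x_i, x_j)

# Draw three samples f ~ N(mu, K) via a Cholesky factor of K.
# The jitter keeps the factorization numerically stable.
rng = np.random.default_rng(0)
L = np.linalg.cholesky(K + 1e-8 * np.eye(len(X)))
samples = mu[:, None] + L @ rng.standard_normal((len(X), 3))  # shape (50, 3)
```

Each column of `samples` is one function drawn from the prior, evaluated at the points in `X`. Under this kernel, a smaller `length_scale` would produce more rapidly varying samples, which is one way the kernel controls the shape of the functions.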

A GP prior $p(f|X)$ can be converted into a GP posterior $p(f|X,y)$ after having observed some data $y$. The posterior can then be used to make predictions $f_*$ given new input $X_*$:

$$
\begin{align}
p(f_*|X_*,X,y) &= \int p(f_*|X_*,f) p(f|X,y) df \\
               &= \mathcal{N}(f_*|\mu_*,\Sigma_*)
\end{align}
$$

Equation $(2)$ is the posterior predictive distribution, which is also Gaussian, with mean $\mu_*$ and covariance $\Sigma_*$. By definition of the GP, the joint distribution of observed data $y$ and predictions $f_*$ is

$$
\begin{pmatrix}
y \\
f_*
\end{pmatrix}
\sim \mathcal{N}
\begin{pmatrix}
0,
\begin{pmatrix}
K_y & K_* \\
K_*^T & K_{**}
\end{pmatrix}
\end{pmatrix}
$$

With $N$ training points and $N_*$ new input points,
-  $K_y = \kappa(X, X) + \sigma_y^2 I = K + \sigma_y^2 I$ is $N \times N$