# Gaussian Processes

Gaussian processes are a powerful method for both regression and classification. Their greatest practical advantage is that they can give a reliable estimate of their own uncertainty.

## Introduction

In supervised learning, we often use parametric models $p(y|X,\theta)$ to explain data and infer optimal values of the parameters $\theta$ via [maximum likelihood](https://en.wikipedia.org/wiki/Maximum_likelihood_estimation) or [maximum a posteriori](https://de.wikipedia.org/wiki/Maximum_a_posteriori) estimation. If needed, we can also infer a full [posterior distribution](https://en.wikipedia.org/wiki/Posterior_probability) $p(\theta |X, y)$ instead of a point estimate $\hat{\theta}$. As data complexity increases, models with more parameters are usually needed to explain the data reasonably well. Methods that use models with a fixed number of parameters are called parametric methods.
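As a minimal sketch of a parametric method, the snippet below fits a straight line by maximum likelihood. It assumes Gaussian noise, in which case maximizing the likelihood is equivalent to least squares. The synthetic data, the true slope and intercept, and the variable names are illustrative assumptions, not part of the original text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from an assumed linear model y = 2x + 1 + Gaussian noise.
X = np.linspace(0.0, 1.0, 50)
y = 2.0 * X + 1.0 + 0.1 * rng.standard_normal(X.shape)

# Design matrix with a bias column, so theta = (slope, intercept).
A = np.column_stack([X, np.ones_like(X)])

# Under Gaussian noise, the maximum likelihood estimate of theta
# is the least-squares solution.
theta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)

print("estimated slope and intercept:", theta_hat)
```

The model has two parameters no matter how many observations are used, which is what makes it parametric.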

In non-parametric methods, on the other hand, the number of parameters depends on the dataset size. For example, in [Nadaraya-Watson kernel regression](https://en.wikipedia.org/wiki/Kernel_regression), a weight $w_i$ is assigned to each observed target $y_i$, and to predict the target value at a new point $x$, a weighted average is computed:

$$f(x) = \sum_{i=1}^N w_i(x) y_i$$

$$w_i(x) = \frac{\kappa(x,x_i)}{\sum_{i'=1}^N \kappa(x,x_{i'})}$$

Observations that are closer to $x$ have a higher weight than observations that are farther away. The weights are computed from $x$ and the observed $x_i$ with a kernel $\kappa$. A special case is k-nearest neighbors (KNN), where the $k$ closest observations have a weight of $1/k$ and all others have a weight of $0$. Non-parametric methods often need to process all training data for prediction and are therefore slower at inference time than parametric methods. On the other hand, training is usually faster, as non-parametric models only need to remember the training data.
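The weighted average above can be sketched in a few lines of NumPy. This is an illustrative sketch for one-dimensional inputs, not a reference implementation: the Gaussian (RBF) kernel, the `length_scale` parameter, the function names, and the toy data are all assumptions made for the example.

```python
import numpy as np

def rbf_kernel(x, x_i, length_scale=1.0):
    """Gaussian (RBF) kernel, one possible choice for kappa (an assumption)."""
    return np.exp(-0.5 * ((x - x_i) / length_scale) ** 2)

def nadaraya_watson(x, X_train, y_train, length_scale=1.0):
    """Predict f(x) as a kernel-weighted average of the observed targets."""
    k = rbf_kernel(x, X_train, length_scale)  # kappa(x, x_i) for every i
    w = k / k.sum()                           # weights w_i(x), summing to 1
    return np.dot(w, y_train)

def knn_regress(x, X_train, y_train, k=3):
    """KNN special case: weight 1/k for the k closest points, 0 otherwise."""
    nearest = np.argsort(np.abs(X_train - x))[:k]
    return y_train[nearest].mean()

# Toy one-dimensional training data (illustrative only).
X_train = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_train = np.sin(X_train)

pred_nw = nadaraya_watson(1.5, X_train, y_train, length_scale=0.5)
pred_knn = knn_regress(1.5, X_train, y_train, k=2)
print(pred_nw, pred_knn)
```

Both functions touch every training point for each prediction, which illustrates why inference is comparatively slow, while "training" amounts to storing `X_train` and `y_train`.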

