## Scalable Gaussian Processes

**Recap Gaussian Process Regression**

In {ref}```GPR-Lemma<lem:gpregrnois>``` we fully described a Gaussian process for regression. As a reminder, we restate the result here:

$$
\begin{aligned}
p(y_* \lvert X^*, X, y) &= \mathcal{N}(y_* \lvert \mu_*, \Sigma_*)\\
\mu_* = \mathbb{E}(f(X^*)) &= K(X^*, X) \big(K(X, X) + \sigma_{\mathrm{noise}}^2 I_n\big)^{-1} y\\
\Sigma_* =\text{Cov}(f(X^*)) &= K(X^*, X^*) - K(X^*, X) \big(K(X, X) + \sigma_{\mathrm{noise}}^2 I_n\big)^{-1} K(X, X^*)
\end{aligned}
$$

We abbreviate $K(X, X)$ as $K$, $K(X^*, X^*)$ as $K_{**}$, $K(X^*, X)$ as $K_{*}^T$ and $K(X, X^*)$ as $K_{*}$, which leads to

$$
\begin{aligned}
p(y_* \lvert X^*, X, y) &= \mathcal{N}(y_* \lvert \mu_*, \Sigma_*)\\
\mu_* &= K_{*}^T \big(K + \sigma_{\mathrm{noise}}^2 I_n\big)^{-1} y\\
\Sigma_* &= K_{**} - K_{*}^T \big(K + \sigma_{\mathrm{noise}}^2 I_n\big)^{-1} K_{*}
\end{aligned}
$$

With this we can predict function values $f_*$ at new inputs $X^*$. The most costly step is computing the inverse of the noisy covariance matrix

$$
\big(K + \sigma_{\mathrm{noise}}^2 I_n\big)^{-1}
$$

which appears in both the mean and the covariance of the Gaussian process.
This matrix has size $n \times n$, where $n$ is the number of training samples. Inverting it costs $\mathcal{O}(n^3)$ operations, which becomes intractable for large $n$.
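
The predictive equations above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the squared-exponential kernel `rbf_kernel`, the function name `gp_predict` and the default hyperparameters are assumptions made for the example. Instead of forming the inverse explicitly, the sketch uses a Cholesky factorization of $K + \sigma_{\mathrm{noise}}^2 I_n$, which is still the $\mathcal{O}(n^3)$ step.

```python
import numpy as np


def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel between the rows of A (a, d) and B (b, d)."""
    sq_dist = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    return variance * np.exp(-0.5 * np.maximum(sq_dist, 0.0) / lengthscale**2)


def gp_predict(X, y, X_star, noise_var=0.1, kernel=rbf_kernel):
    """Return mu_* and Sigma_* of the exact GP posterior at X_star."""
    K = kernel(X, X)                # K,    shape (n, n)
    K_s = kernel(X, X_star)         # K_*,  shape (n, n_*)
    K_ss = kernel(X_star, X_star)   # K_**, shape (n_*, n_*)

    # Factorize K + sigma_noise^2 I = L L^T; this is the O(n^3) step.
    L = np.linalg.cholesky(K + noise_var * np.eye(len(X)))

    # np.linalg.solve is a general solver and does not exploit that L is
    # triangular; it is used here only to keep the sketch NumPy-only.
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # (K + s^2 I)^{-1} y
    V = np.linalg.solve(L, K_s)                          # L^{-1} K_*

    mu = K_s.T @ alpha        # K_*^T (K + s^2 I)^{-1} y
    cov = K_ss - V.T @ V      # K_** - K_*^T (K + s^2 I)^{-1} K_*
    return mu, cov
```

Calling `gp_predict(X, y, X_star)` with `X` of shape `(n, d)`, `y` of shape `(n,)` and `X_star` of shape `(n_*, d)` returns a mean vector of length `n_*` and an `(n_*, n_*)` covariance matrix.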


The posterior can also be written as

$$
\begin{aligned}
p(f_* \lvert X^*, X, y) &= \int p(f_*, f \lvert X^*, X, y)\, df \quad \textrm{marginalize over } f\\
&= \int p(f_* \lvert X^*, X, f, y)\, p(f \lvert X^*, X, y)\, df \quad \textrm{chain rule}\\
&= \int p(f_* \lvert X^*, X, f)\, p(f \lvert X^*, X, y)\, df \quad \textrm{drop } y \textrm{, which is irrelevant given } f
\end{aligned}
$$

**Sparse Gaussian Process Regression**

To avoid the costly inversion of an $n \times n$ matrix, the authors of {cite}```Titsias2009``` propose to use so-called _inducing variables_. They assume the existence of a set of $m$ inducing variables $f_m$ with corresponding inputs $X_m$ that describe the model well enough, as if we had used all $n$ data points. Note that the inducing variables $f_m$ are the function values at the pseudo-inputs $X_m$, which are independent of the training inputs. We define the approximate posterior as
$$
q(f_*) = \int p(f_* | f_m) \phi(f_m) df_m ,
$$

where $\phi(f_m)$ is the approximation of the intractable $p(f_m|y)$ and is defined by

$$
\phi(f_m) = \mathcal{N}(f_m | \mu_m, A_m).
$$
Here $\mu_m$ is the mean and $A_m$ the covariance matrix. The goal is to find optimal values for the mean $\mu_m$ and the covariance $A_m$ as well as the optimal locations of the inducing inputs $X_m$. The approximate posterior is then Gaussian, with mean and covariance given by

$$
\begin{aligned}
q(f_*) &= \mathcal{N}(f_* \lvert \mu_*^q, \Sigma_*^q) \\
\mu_*^q &= K_{*m} K_{mm}^{-1} \mu_m \\
\Sigma_*^q &= K_{**} - K_{*m} K_{mm}^{-1} K_{m*} + K_{*m} K_{mm}^{-1} A_m K_{mm}^{-1} K_{m*}
\end{aligned}
$$

where $K_{mm}=K(X_m,X_m)$, $K_{*m}=K(X^*,X_m)$ and $K_{m*}=K(X_m,X^*)=K_{*m}^T$. Once the optimal values of $\mu_m$, $A_m$ and $X_m$ are known, prediction only requires inverting an $m \times m$ matrix. To find these values, we use a variational approach that minimizes the Kullback-Leibler (KL) divergence between the approximate posterior $q(f)$ and the exact posterior $p(f|y)$ over the latent variables $f$.
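
The predictive equations for $q(f_*)$ translate almost directly into code. The following is a minimal sketch under stated assumptions: the kernel `rbf_kernel`, the function name `sparse_gp_predict` and the `jitter` term added to $K_{mm}$ for numerical stability are illustrative choices, not part of the cited method. The variational parameters `mu_m` and `A_m` are taken as given here.

```python
import numpy as np


def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel between the rows of A (a, d) and B (b, d)."""
    sq_dist = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    return variance * np.exp(-0.5 * np.maximum(sq_dist, 0.0) / lengthscale**2)


def sparse_gp_predict(X_star, X_m, mu_m, A_m, kernel=rbf_kernel, jitter=1e-8):
    """Mean and covariance of q(f_*) for given inducing inputs X_m and phi(f_m)."""
    m = len(X_m)
    # Small diagonal jitter (an assumption of this sketch) keeps K_mm invertible.
    K_mm = kernel(X_m, X_m) + jitter * np.eye(m)   # (m, m)
    K_ms = kernel(X_m, X_star)                     # K_{m*}, (m, n_*)
    K_ss = kernel(X_star, X_star)                  # K_{**}, (n_*, n_*)

    # B = K_mm^{-1} K_{m*}; only an m x m system has to be solved.
    B = np.linalg.solve(K_mm, K_ms)

    mu = B.T @ mu_m                                # K_{*m} K_mm^{-1} mu_m
    cov = K_ss - K_ms.T @ B + B.T @ A_m @ B
    return mu, cov
```

A quick sanity check of the formulas: setting `mu_m` to zero and `A_m` to $K_{mm}$ makes the last two covariance terms cancel, so $q(f_*)$ falls back to the GP prior with covariance $K_{**}$.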



The minimization of the KL divergence is equivalent to the maximization of a lower bound $\mathcal{L}(\mu_m, A_m, X_m)$ on the true log marginal likelihood $\log p(y)$. This lower bound can be optimized by analytically solving for $\mu_m$ and $A_m$. The resulting lower bound after optimization is a function of $X_m$:

$$
\mathcal{L}(X_m) = \log \mathcal{N}(y \lvert 0, \sigma^2_y I + Q_{nn}) - \frac{1}{2 \sigma^2_y}\textrm{Tr}(K_{nn} - Q_{nn})
$$
with $Q_{nn}=K_{nm}K_{mm}^{-1}K_{mn}$, where $K_{nn}=K(X,X)$, $K_{nm}=K(X,X_m)$, $K_{mn}=K_{nm}^T$, and $\sigma_y^2$ denotes the noise variance. The first term on the right-hand side is the approximate log likelihood, and the second is a regularization term that results from the variational approach. The second term can be interpreted as minimizing the error of predicting $f$ from the inducing variables $f_m$: the better $f_m$ represents the function to be modeled, the smaller this error. The optimization therefore tries to find the best positions for the inducing inputs $X_m$. With optimal inducing inputs $X_m$ we can find the values of $\mu_m$ and $A_m$ analytically:

$$
\begin{aligned}
\mu_m &= \frac{1}{\sigma_y^2} K_{mm} \Sigma K_{mn} y \\
A_m &= K_{mm} \Sigma K_{mm}
\end{aligned}
$$
where $\Sigma = (K_{mm} + \sigma_y^{-2} K_{mn} K_{nm})^{-1}$.
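
To make the bound and the closed-form solution concrete, the following sketch evaluates $\mathcal{L}(X_m)$ and the optimal $\mu_m$ and $A_m$ for fixed inducing inputs. It is an illustration under assumptions: the kernel `rbf_kernel`, the helper names and the `jitter` on $K_{mm}$ are hypothetical choices made for the example. For readability it builds the full $n \times n$ matrices, so this direct form still costs $\mathcal{O}(n^3)$; practical implementations rearrange the terms with the matrix inversion lemma so that only $m \times m$ systems are solved.

```python
import numpy as np


def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel between the rows of A (a, d) and B (b, d)."""
    sq_dist = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    return variance * np.exp(-0.5 * np.maximum(sq_dist, 0.0) / lengthscale**2)


def log_gaussian_zero_mean(y, C):
    """log N(y | 0, C), computed from a Cholesky factor of C."""
    L = np.linalg.cholesky(C)
    z = np.linalg.solve(L, y)
    log_det = 2.0 * np.sum(np.log(np.diag(L)))
    return -0.5 * (z @ z + log_det + len(y) * np.log(2.0 * np.pi))


def collapsed_bound_and_optimum(X, y, X_m, noise_var, kernel=rbf_kernel, jitter=1e-8):
    """Return L(X_m) and the optimal mu_m, A_m for fixed inducing inputs X_m."""
    n, m = len(X), len(X_m)
    # Jitter on K_mm is an assumption of this sketch, added for stability.
    K_mm = kernel(X_m, X_m) + jitter * np.eye(m)
    K_mn = kernel(X_m, X)
    K_nn = kernel(X, X)   # only its trace is needed; built in full for clarity

    # Q_nn = K_nm K_mm^{-1} K_mn
    Q_nn = K_mn.T @ np.linalg.solve(K_mm, K_mn)

    # Approximate log likelihood minus the trace regularizer.
    bound = log_gaussian_zero_mean(y, noise_var * np.eye(n) + Q_nn)
    bound -= 0.5 / noise_var * (np.trace(K_nn) - np.trace(Q_nn))

    # Sigma = (K_mm + sigma_y^{-2} K_mn K_nm)^{-1}
    Sigma = np.linalg.inv(K_mm + K_mn @ K_mn.T / noise_var)
    mu_m = K_mm @ Sigma @ K_mn @ y / noise_var
    A_m = K_mm @ Sigma @ K_mm
    return bound, mu_m, A_m
```

In a full method the returned bound would be maximized with respect to $X_m$ (and the kernel hyperparameters) by a gradient-based optimizer; the sketch only evaluates it. When the inducing inputs coincide with the training inputs, $Q_{nn}$ equals $K_{nn}$ up to the jitter, the trace term vanishes and the bound matches the exact log marginal likelihood.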

For a more detailed description of sparse and variational GP, please refer to {cite}```Leibfried2021```.

```{bibliography}
:filter: docname in docnames
:style: plain
```