---
title: "Gaussian Processes"
description: "A Bayesian approach to regression and classification that defines a distribution over functions."
categories: [Regression, Non-parametric]
image: "Figures/15.png"
order: 18
---

## General Principles
Through varying intercepts and slopes, we have seen how to quantify some of the features that generate variation across clusters and covariance among the observations within each cluster. However, the covariance matrix used to account for correlation between clusters inherently assumes linear relationships between them. What if two variables are not linearly related? In that case, we can use a Gaussian Process (GP) to model their relationship.
<!---
Basically, a GP is a varying-slope model with a covariance matrix where each element of the matrix is a [<span style="color:#0D6EFD">kernel function 🛈</span>]{#kernel}.
-->

## Considerations
::: callout-caution
- To capture complex, non-linear relationships in data where the underlying function is smooth but has an unknown functional form, GPs use a [<span style="color:#0D6EFD">kernel 🛈</span>]{#kernel}.
- The choice of kernel and of its hyperparameters can significantly impact results, so a GP requires a kernel function that captures the expected behavior of your data.
- Through the kernel definition, we can incorporate domain knowledge.
- GPs scale poorly with dataset size: the matrix operations cost O(n³) time and O(n²) memory, which can be prohibitive for large datasets and has led to neural networks often being used instead for large non-linear problems.
<!--- Check scale of the GP -->
:::
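
To make the O(n³) scaling concrete, here is a minimal sketch in plain Python (no external libraries) that factorizes a squared-exponential kernel matrix with a textbook Cholesky decomposition and counts the inner multiply-add operations. The function names, the evenly spaced inputs, and the jitter value are assumptions chosen for illustration; they are not part of the BI package.

```python
import math


def sq_exp_kernel(points, eta_sq=1.0, rho_sq=1.0, jitter=1e-6):
    """Squared-exponential covariance matrix for 1-D points (list of lists)."""
    n = len(points)
    return [
        [
            eta_sq * math.exp(-rho_sq * (points[i] - points[j]) ** 2)
            + (jitter if i == j else 0.0)
            for j in range(n)
        ]
        for i in range(n)
    ]


def cholesky_with_count(K):
    """Return the lower Cholesky factor of K and the number of inner multiply-adds."""
    n = len(K)
    L = [[0.0] * n for _ in range(n)]
    ops = 0
    for j in range(n):              # column by column
        for i in range(j, n):       # rows on and below the diagonal
            s = K[i][j]
            for k in range(j):      # subtract contributions of earlier columns
                s -= L[i][k] * L[j][k]
                ops += 1
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L, ops


# Count operations for growing numbers of points
counts = {}
for n in (10, 20, 40):
    K = sq_exp_kernel([float(i) for i in range(n)])
    _, counts[n] = cholesky_with_count(K)
    print(n, counts[n])
```

The inner loop runs (n³ − n)/6 times, giving 165, 1330, and 10660 operations for n = 10, 20, and 40: each doubling of the dataset multiplies the work by roughly eight.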

## Example
Below is an example code snippet demonstrating Gaussian Process regression using the Bayesian Inference (BI) package. The data consist of a count dependent variable (*total_tools*), the number of tools invented on each island, and a continuous independent variable (*population*), the population of each island. The goal is to estimate the effect of population on the total number of tools. We use the distance matrix between the islands in the kernel function to capture the spatial dependence between societies. This example is based on @mcelreath2018statistical.

::: {.panel-tabset group="language"}
## Python

```python
from importlib.resources import files

import pandas as pd
from BI import bi, jnp

# Setup device------------------------------------------------
m = bi(platform='cpu')

# Import Data & Data Manipulation ------------------------------------------------
data_path = m.load.kline2(only_path=True)
m.data(data_path, sep=';')

# Distance matrix between the islands, shipped with the package resources
data_path2 = files('BI.Resources') / 'islandsDistMatrix.csv'
islandsDistMatrix = pd.read_csv(data_path2, index_col=0)

m.data_to_model(['total_tools', 'population'])
m.data_on_model["society"] = jnp.arange(0, 10)  # index observations
m.data_on_model["Dmat"] = islandsDistMatrix.values  # Distance matrix

def model(Dmat, population, society, total_tools):
    a = m.dist.exponential(1, name = 'a')
    b = m.dist.exponential(1, name = 'b')
    g = m.dist.exponential(1, name = 'g')

    # Gaussian Process prior on the society effects k
    etasq = m.dist.exponential(2, name = 'etasq')
    rhosq = m.dist.exponential(0.5, name = 'rhosq')
    SIGMA = etasq * jnp.exp(-rhosq * jnp.square(Dmat))
    # small extra variance on the diagonal (0.01, as passed in the R tab)
    SIGMA = SIGMA.at[jnp.diag_indices(Dmat.shape[0])].add(0.01)
    k = m.dist.multivariate_normal(0, SIGMA, name = 'k')

    lambda_ = a * population**b / g * jnp.exp(k[society])

    m.dist.poisson(lambda_, obs=total_tools)

# Run sampler ------------------------------------------------
m.fit(model)
m.summary()
```

## Python (Built-in function)

```python
from BI import bi, jnp

# Setup device------------------------------------------------
m = bi(platform='cpu')

# Import Data & Data Manipulation ------------------------------------------------
data_path = m.load.kline2(only_path=True)
m.data(data_path, sep=';')

islandsDistMatrix = m.load.islands_dist_matrix(frame = False)['data']

m.data_to_model(['total_tools', 'population'])
m.data_on_model["society"] = jnp.arange(0, 10)  # index observations
m.data_on_model["Dmat"] = islandsDistMatrix  # Distance matrix


def model(Dmat, population, society, total_tools):
    a = m.dist.exponential(1, name = 'a')
    b = m.dist.exponential(1, name = 'b')
    g = m.dist.exponential(1, name = 'g')

    k = m.gaussian.gaussian_process(Dmat)

    lambda_ = a * population**b / g * jnp.exp(k[society])

    m.dist.poisson(lambda_, obs=total_tools)

# Run sampler ------------------------------------------------
m.fit(model)
m.summary()
```

## R
```r
library(BayesianInference)
jnp = reticulate::import('jax.numpy')
pd = reticulate::import('pandas')
# setup platform------------------------------------------------
m=importBI(platform='cpu')

# Import data ------------------------------------------------
m$data(m$load$kline2(only_path=T), sep=';')
islandsDistMatrix = m$load$islands_dist_matrix(frame = FALSE)$data
m$data_to_model(list('total_tools', 'population'))
m$data_on_model$society = jnp$arange(0,10, dtype='int64')
m$data_on_model$Dmat = jnp$array(islandsDistMatrix)


# Define model ------------------------------------------------
model <- function(Dmat, population, society, total_tools){
  a = bi.dist.exponential(1, name = 'a')
  b = bi.dist.exponential(1, name = 'b')
  g = bi.dist.exponential(1, name = 'g')
  
  # non-centered Gaussian Process prior
  etasq = bi.dist.exponential(2, name = 'etasq')
  rhosq = bi.dist.exponential(0.5, name = 'rhosq')
  k = m$gaussian$gaussian_process(Dmat, etasq, rhosq, 0.01)
  
  lambda_ = a * population**b / g * jnp$exp(k[society])
  m$dist$poisson(lambda_, obs=total_tools)
}

# Run MCMC ------------------------------------------------
m$fit(model) # Optimize model parameters through MCMC sampling

# Summary ------------------------------------------------
m$summary() # Get posterior distribution

```

## Julia
```julia
using BayesianInference

# Setup device------------------------------------------------
m = importBI(platform="cpu")

# Import Data & Data Manipulation ------------------------------------------------
# Import
data_path = m.load.kline2(only_path = true)
m.data(data_path, sep=";") 

islandsDistMatrix = m.load.islands_dist_matrix(frame = false)["data"]
m.data_to_model(["total_tools", "population"])
m.data_on_model["society"] = jnp.arange(0,10)# index observations
m.data_on_model["Dmat"] = jnp.array(islandsDistMatrix) # Distance matrix



# Define model ------------------------------------------------
@BI function model(Dmat, population, society, total_tools)
    a = m.dist.exponential(1, name = "a")
    b = m.dist.exponential(1, name = "b")
    g = m.dist.exponential(1, name = "g")

    # Gaussian Process prior on the society effects k
    etasq = m.dist.exponential(2, name = "etasq")
    rhosq = m.dist.exponential(0.5, name = "rhosq")
    SIGMA = etasq * jnp.exp(-rhosq * jnp.square(Dmat))
    # small extra variance on the diagonal (0.01, as passed in the R tab)
    SIGMA = SIGMA.at[jnp.diag_indices(Dmat.shape[0])].add(0.01)
    k = m.dist.multivariate_normal(0, SIGMA, name = "k")

    lambda_ = a * population^b / g * jnp.exp(k[society])

    m.dist.poisson(lambda_, obs=total_tools)

end

# Run mcmc ------------------------------------------------
m.fit(model)  # Optimize model parameters through MCMC sampling

# Summary ------------------------------------------------
m.summary() # Get posterior distributions
```
:::
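
The covariance matrix built inside the model can be inspected on its own. The following is a minimal sketch in plain Python of the same construction, `etasq * exp(-rhosq * Dmat²)` plus a small value on the diagonal. The function name `gp_covariance`, the three-society distance matrix, and the parameter values are hypothetical and chosen only for illustration; the 0.01 default assumes that the last argument passed to `gaussian_process` in the R tab plays this role.

```python
import math


def gp_covariance(dmat, etasq, rhosq, jitter=0.01):
    """Covariance between societies as a function of their pairwise distances."""
    n = len(dmat)
    return [
        [
            etasq * math.exp(-rhosq * dmat[i][j] ** 2)
            + (jitter if i == j else 0.0)  # extra variance on the diagonal only
            for j in range(n)
        ]
        for i in range(n)
    ]


# Hypothetical distances between three societies (not the islands data)
dmat = [
    [0.0, 0.5, 2.0],
    [0.5, 0.0, 1.5],
    [2.0, 1.5, 0.0],
]

K = gp_covariance(dmat, etasq=1.0, rhosq=0.5)
for row in K:
    print([round(value, 3) for value in row])
```

Societies that are close together (distance 0.5) end up with a much larger covariance than societies that are far apart (distance 2.0), which is how the distance matrix lets neighbouring islands share information.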

## Mathematical Details
### *Formula*


The following equation describes the relationship between a normally distributed dependent variable $Y$ and an independent variable $X$, while incorporating a GP for the effect of a variable $Q$:

$$
Y_{[i]} \sim \text{Normal}( \alpha + \beta  X_{[i]} + \gamma_{[Q_{[i]}]}, \sigma)
$$

where:

- $Y_{[i]}$ is the i-th value for the dependent variable $Y$.

- $\alpha$ is the intercept term.

- $\beta$ is the regression coefficient term.

- $X_{[i]}$ is the i-th value for the independent variable $X$.

- $Q_{[i]}$ is an integer-valued independent variable (e.g., year-of-birth, age, year) for observation $i$.

- $\gamma$ is a vector output from a Gaussian process:

$$
\gamma
\sim \text{MVNormal} \left(
Z,
\varsigma\Omega\varsigma
\right)
$$

where:

- $Z$ represents the mean vector of the multivariate normal distribution and is set to [<span style="color:#0D6EFD">zero 🛈</span>]{#kernelMean0}.

- $\varsigma$ is a diagonal matrix of standard deviations. 

- $\Omega$ is a correlation matrix. 


- Multiple kernel functions for $\Omega$ exist and are discussed in the [Note(s)](#notes) section. The most common one is the quadratic (squared exponential) kernel:

$$
\Omega_{[i,j]} = \eta \exp(-\phi^2 D_{[i,j]}^2) 
$$

where:

- $\eta$ is the maximal correlation, reached when $D_{[i,j]} = 0$.
  
- $\phi$ determines the rate at which the correlation declines with distance.
  
- $D_{[i,j]}$ is the distance between the $i$-th and $j$-th categories.
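
As a minimal sketch of how $\varsigma\Omega\varsigma$ is assembled, the plain-Python example below builds $\Omega$ from the quadratic kernel and scales it by a diagonal of standard deviations. The four integer levels of $Q$, the use of $|i - j|$ as distance, and all parameter values are assumptions made for illustration.

```python
import math


def quadratic_kernel(dmat, eta=1.0, phi=0.5):
    """Omega[i][j] = eta * exp(-phi^2 * D[i][j]^2)."""
    n = len(dmat)
    return [
        [eta * math.exp(-(phi ** 2) * dmat[i][j] ** 2) for j in range(n)]
        for i in range(n)
    ]


def scale_correlation(omega, sds):
    """Element-wise version of diag(sds) @ omega @ diag(sds)."""
    n = len(omega)
    return [[sds[i] * omega[i][j] * sds[j] for j in range(n)] for i in range(n)]


# Q takes four integer values (say, four age classes); distance is |i - j|
levels = [0, 1, 2, 3]
dmat = [[abs(a - b) for b in levels] for a in levels]

omega = quadratic_kernel(dmat, eta=1.0, phi=0.5)  # eta = 1 gives a unit diagonal
sds = [0.8, 0.8, 0.8, 0.8]                        # hypothetical diagonal of varsigma
cov = scale_correlation(omega, sds)

for row in cov:
    print([round(value, 3) for value in row])
```

With $\eta = 1$ the diagonal of $\Omega$ is one, so the diagonal of the resulting covariance matrix is simply the squared standard deviation, and off-diagonal entries shrink as the levels of $Q$ get further apart.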
  


### *Bayesian model*
In the Bayesian formulation, we define each parameter with [<span style="color:#0D6EFD">priors 🛈</span>]{#prior}. We can express a Bayesian version of this GP using the following model:

$$
Y_i = \alpha + \beta  X_i + \gamma_{Q_i}
$$

$$
\gamma \sim \text{MVNormal} \left(
\begin{pmatrix}
    0 \\
    \vdots \\
    0
\end{pmatrix},
K
\right)
$$

$$
K_{ij} = \eta^2 \exp(-p^2D_{ij}^2) + \delta_{ij} \sigma^2 
$$

$$
\alpha \sim \text{Normal}(0,1)
$$

$$
\beta \sim \text{Normal}(0,1)
$$

$$
\eta^2 \sim \text{HalfCauchy}(0,1)
$$

$$
p^2 \sim \text{HalfCauchy}(0,1)
$$

where:

- $Y_i$ is the i-th value for the dependent variable $Y$.

- $\alpha$ is the intercept term with a prior of $\text{Normal}(0,1)$.

- $\beta$ is the regression coefficient term with a prior of $\text{Normal}(0,1)$.

- $X_i$ is the i-th value for the independent variable $X$.

- $\gamma_{Q_i}$ is the value of the Gaussian process for the category $Q_i$ of observation $i$.

- $\gamma$ is the latent function modeled by the GP, with a zero mean vector and covariance matrix $K$.

- $K_{ij}$ is the kernel function evaluated at categories $i$ and $j$, where $D_{ij}$ is the distance between them, $\eta^2$ is the maximum covariance, and $p^2$ is the rate at which the covariance declines with squared distance. Both $\eta^2$ and $p^2$ have HalfCauchy(0,1) priors to ensure positive values.

- $\delta_{ij}$ is the Kronecker delta (1 when $i = j$, 0 otherwise), so $\sigma^2$ adds extra variance on the diagonal of $K$ only.
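
As a worked illustration of the kernel above, consider only two categories separated by a distance $d$. The covariance matrix then reduces to:

$$
K =
\begin{pmatrix}
    \eta^2 + \sigma^2 & \eta^2 \exp(-p^2 d^2) \\
    \eta^2 \exp(-p^2 d^2) & \eta^2 + \sigma^2
\end{pmatrix}
$$

so the implied correlation between the two GP values is:

$$
\text{cor}(\gamma_1, \gamma_2) = \frac{\eta^2 \exp(-p^2 d^2)}{\eta^2 + \sigma^2}
$$

Taking the illustrative values $\eta^2 = 1$, $p^2 = 0.5$, and $\sigma^2 = 0.01$ (assumptions for this example, not estimates), a distance of $d = 1$ gives a correlation of about $0.61 / 1.01 \approx 0.60$, while $d = 2$ gives about $0.135 / 1.01 \approx 0.13$. Larger values of $p^2$ make this decline faster, and larger values of $\eta^2$ raise the covariance at every distance.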


## Notes{#notes}
::: callout-note

Common kernel functions include:

- *Radial Basis Function* (RBF) or Squared Exponential Kernel:
$$k(x,x') = \sigma^2 \exp\left(-\frac{||x-x'||^2}{2l^2}\right)$$


- *Rational Quadratic Kernel*, which is equivalent to adding together many RBF kernels with different length scales:
$$k(x,x') = \sigma^2 \left(1 + \frac{||x-x'||^2}{2\alpha l^2}\right)^{-\alpha}$$

- *Periodic kernel* allows for modeling functions that repeat themselves exactly:
$$k(x,x') = \sigma^2 \exp\left(-\frac{2\sin^2(\pi||x-x'||/p)}{l^2}\right)$$

- *Locally Periodic Kernel*:

$$k(x,x') = \sigma^2 \exp\left(-\frac{2\sin^2(\pi||x-x'||/p)}{l^2}\right) \exp\left(-\frac{||x-x'||^2}{2l^2}\right)$$ 

Beyond the choice of kernel, note that any slope or intercept in your model can be defined using a Gaussian Process.

:::
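
The kernels listed in the note can be written as short functions of the distance $r = ||x-x'||$. The sketch below uses plain Python; the function and argument names are illustrative and not part of the BI package, and the rational quadratic kernel is written in its common form with $\alpha$ inside the denominator, under which it approaches the RBF kernel as $\alpha$ grows.

```python
import math


def rbf(r, sigma=1.0, length=1.0):
    """Radial basis function (squared exponential) kernel."""
    return sigma ** 2 * math.exp(-r ** 2 / (2 * length ** 2))


def rational_quadratic(r, sigma=1.0, length=1.0, alpha=1.0):
    """Rational quadratic kernel: a scale mixture of RBF kernels."""
    return sigma ** 2 * (1 + r ** 2 / (2 * alpha * length ** 2)) ** (-alpha)


def periodic(r, sigma=1.0, length=1.0, period=1.0):
    """Periodic kernel: covariance repeats every `period` units of distance."""
    return sigma ** 2 * math.exp(-2 * math.sin(math.pi * r / period) ** 2 / length ** 2)


def locally_periodic(r, sigma=1.0, length=1.0, period=1.0):
    """Periodic kernel damped by an RBF envelope."""
    return periodic(r, sigma, length, period) * rbf(r, 1.0, length)


# Compare the kernels at a few distances
for r in (0.0, 0.5, 1.0, 2.0):
    print(
        r,
        round(rbf(r), 4),
        round(rational_quadratic(r), 4),
        round(periodic(r), 4),
        round(locally_periodic(r), 4),
    )
```

At $r = 0$ every kernel returns $\sigma^2$. The periodic kernel returns to $\sigma^2$ at every multiple of the period, whereas the locally periodic kernel decays there by the RBF factor, which is what allows the repeating pattern to drift over long distances.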



## Reference(s)
::: {#refs}
:::

Duvenaud, D. *The Kernel Cookbook: Advice on Covariance Functions*. <https://www.cs.toronto.edu/~duvenaud/cookbook/>