# Gaussian Process Classification: Complete Theory and Implementation Guide

## Table of Contents
1. [Introduction and Purpose](#introduction)
2. [Mathematical Foundation](#mathematical-foundation)
3. [Prior and Likelihood Definition](#prior-likelihood)
4. [Posterior Approximation via Laplace Method](#posterior-approximation)
5. [Posterior Predictive Distribution](#predictive-distribution)
6. [Implementation Details](#implementation)
7. [Hand Calculations Example](#hand-calculations)
8. [Making Predictions](#predictions)
9. [Code Walkthrough](#code-walkthrough)

<a name="introduction"></a>
## 1. Introduction and Purpose

Gaussian Process Classification (GPC) is a probabilistic, non-parametric approach to binary classification that:

- **Solves**: Binary classification problems where we need to distinguish between two classes
- **Provides**: Probabilistic predictions with uncertainty quantification
- **Excels at**: Learning complex, non-linear decision boundaries without specifying a fixed functional form
- **Use cases**: Medical diagnosis, image classification, anomaly detection

### Key Advantages:
- Automatic complexity control through the Bayesian treatment, which reduces (but does not eliminate) the risk of overfitting
- Uncertainty estimates for predictions
- Flexible, non-parametric model
- Principled Bayesian framework

<a name="mathematical-foundation"></a>
## 2. Mathematical Foundation

The GPC model is defined hierarchically:

$$\begin{align}
y|f(\mathbf{x}) &\sim \text{Bernoulli}[\sigma(f(\mathbf{x}))] \\
f(\mathbf{x}) &\sim \mathcal{GP}(0, k(\mathbf{x}, \mathbf{x}'))
\end{align}$$

Where:
- $f(\mathbf{x})$ is a latent function following a Gaussian Process
- $\sigma(\cdot)$ is the sigmoid function: $\sigma(f) = \frac{1}{1 + e^{-f}}$
- $k(\mathbf{x}, \mathbf{x}')$ is the covariance (kernel) function

<a name="prior-likelihood"></a>
## 3. Prior and Likelihood Definition

### 3.1 Prior Distribution

The prior over the latent function is a Gaussian Process:

$$p(\mathbf{f}) = \mathcal{N}(\mathbf{f}|\mathbf{0}, \mathbf{K})$$

Where $\mathbf{K}_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$ is the kernel matrix.

**Squared Exponential Kernel:**
$$k(\mathbf{x}, \mathbf{x}') = \kappa \exp\left(-\frac{||\mathbf{x} - \mathbf{x}'||^2}{2\ell^2}\right)$$

Parameters:
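- $\kappa$: signal variance, i.e. the prior variance of $f(\mathbf{x})$, which sets the amplitude of the latent function (some references write this prefactor as $\kappa^2$)
- $\ell$: lengthscale, the input distance over which function values stay strongly correlated (larger $\ell$ gives smoother functions)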
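
To make the role of the two parameters concrete, here is a minimal NumPy sketch of the kernel matrix above. The function name `se_kernel_matrix` is a hypothetical helper chosen for illustration, not part of the guide's code base.

```python
import numpy as np

def se_kernel_matrix(X1, X2, kappa=1.0, lengthscale=1.0):
    """Kernel matrix with entries kappa * exp(-||x - x'||^2 / (2 * lengthscale^2))."""
    X1 = np.asarray(X1, dtype=float)
    X2 = np.asarray(X2, dtype=float)
    # Treat 1-D arrays as N points with a single feature
    if X1.ndim == 1:
        X1 = X1[:, None]
    if X2.ndim == 1:
        X2 = X2[:, None]
    # Pairwise squared Euclidean distances, shape (N1, N2)
    sq_dists = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return kappa * np.exp(-0.5 * sq_dists / lengthscale**2)

X = np.array([0.0, 1.0, 2.0])
K_short = se_kernel_matrix(X, X, kappa=1.0, lengthscale=0.5)
K_long = se_kernel_matrix(X, X, kappa=1.0, lengthscale=2.0)
print(np.round(K_short, 3))
print(np.round(K_long, 3))
```

With the longer lengthscale the off-diagonal entries are larger, so distant inputs are more strongly correlated and sampled functions vary more slowly.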

### 3.2 Likelihood

For binary classification with labels $y \in \{0, 1\}$:

$$p(y_n|f_n) = \text{Bernoulli}(y_n|\sigma(f_n)) = \sigma(f_n)^{y_n}(1-\sigma(f_n))^{1-y_n}$$

Log-likelihood:
$$\log p(y_n|f_n) = y_n \log \sigma(f_n) + (1-y_n) \log(1-\sigma(f_n))$$
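
Evaluating this expression literally can produce `log(0)` when $|f_n|$ is large. A small sketch of a numerically safer evaluation, assuming NumPy and using the identities $\log\sigma(f) = -\log(1 + e^{-f})$ and $\log(1-\sigma(f)) = -\log(1 + e^{f})$ (the function name is illustrative):

```python
import numpy as np

def bernoulli_log_lik(y, f):
    """Sum over n of y_n log sigma(f_n) + (1 - y_n) log(1 - sigma(f_n))."""
    y = np.asarray(y, dtype=float)
    f = np.asarray(f, dtype=float)
    # logaddexp(0, x) = log(1 + exp(x)), computed without overflow
    log_sig = -np.logaddexp(0.0, -f)            # log sigma(f)
    log_one_minus_sig = -np.logaddexp(0.0, f)   # log(1 - sigma(f))
    return np.sum(y * log_sig + (1.0 - y) * log_one_minus_sig)

ll_balanced = bernoulli_log_lik([1, 0], [0.0, 0.0])   # two terms of log(0.5)
ll_extreme = bernoulli_log_lik([1], [-1000.0])        # stays finite
print(ll_balanced, ll_extreme)
```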

### 3.3 Joint Distribution

$$p(\mathbf{y}, \mathbf{f}) = p(\mathbf{y}|\mathbf{f})p(\mathbf{f}) = \prod_{n=1}^N p(y_n|f_n) \cdot \mathcal{N}(\mathbf{f}|\mathbf{0}, \mathbf{K})$$

<a name="posterior-approximation"></a>
## 4. Posterior Approximation via Laplace Method

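The Bernoulli likelihood is not conjugate to the Gaussian prior, so the posterior $p(\mathbf{f}|\mathbf{y})$ has no closed form. The Laplace approximation replaces it with a Gaussian centered at the posterior mode: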

$$p(\mathbf{f}|\mathbf{y}) \approx q(\mathbf{f}) = \mathcal{N}(\mathbf{f}|\mathbf{m}, \mathbf{S})$$

### 4.1 Finding the MAP Estimate

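First, find the posterior mode $\mathbf{f}_{\text{MAP}} = \arg\max_{\mathbf{f}} \log p(\mathbf{y}, \mathbf{f})$. The log joint is concave in $\mathbf{f}$ (the Bernoulli log-likelihood is concave and the Gaussian log-prior is a negative-definite quadratic), so this maximum is unique: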

$$\log p(\mathbf{y}, \mathbf{f}) = \sum_{n=1}^N \log p(y_n|f_n) - \frac{1}{2}\mathbf{f}^T\mathbf{K}^{-1}\mathbf{f} - \frac{1}{2}\log|\mathbf{K}| - \frac{N}{2}\log(2\pi)$$

**Gradient:**
$$\nabla_{\mathbf{f}} \log p(\mathbf{y}, \mathbf{f}) = \mathbf{y} - \boldsymbol{\sigma}(\mathbf{f}) - \mathbf{K}^{-1}\mathbf{f}$$

Where $\boldsymbol{\sigma}(\mathbf{f}) = [\sigma(f_1), ..., \sigma(f_N)]^T$

**Hessian:**
$$\nabla_{\mathbf{f}}^2 \log p(\mathbf{y}, \mathbf{f}) = -\boldsymbol{\Lambda} - \mathbf{K}^{-1}$$

Where $\boldsymbol{\Lambda} = \text{diag}(\sigma(f_n)(1-\sigma(f_n)))$
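
With the gradient and Hessian in hand, one common way to find the mode is Newton's method. The sketch below is an illustration under that assumption, not the guide's own optimizer (Section 6.2 uses a generic `minimize` call instead). It rewrites the Newton step as $\mathbf{f}^{\text{new}} = \mathbf{K}(\mathbf{I} + \boldsymbol{\Lambda}\mathbf{K})^{-1}(\boldsymbol{\Lambda}\mathbf{f} + \mathbf{y} - \boldsymbol{\sigma}(\mathbf{f}))$, which is algebraically equal to $\mathbf{f} - \mathbf{H}^{-1}\nabla$ but never forms $\mathbf{K}^{-1}$.

```python
import numpy as np

def sigmoid(f):
    return 1.0 / (1.0 + np.exp(-f))

def newton_map(K, y, max_iter=100, tol=1e-10):
    """Newton iterations for the mode of log p(y, f), starting from f = 0."""
    N = len(y)
    f = np.zeros(N)
    for _ in range(max_iter):
        s = sigmoid(f)
        lam = s * (1.0 - s)          # diagonal of Lambda
        b = lam * f + (y - s)        # Lambda f + gradient of the log-likelihood
        # lam[:, None] * K scales the rows of K, i.e. it equals Lambda @ K
        f_new = K @ np.linalg.solve(np.eye(N) + lam[:, None] * K, b)
        if np.max(np.abs(f_new - f)) < tol:
            return f_new
        f = f_new
    return f

# Small illustration: three 1-D inputs, squared exponential kernel (kappa = 1, l = 1)
X = np.array([0.0, 1.0, 2.0])
K = np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2)
y = np.array([0.0, 1.0, 0.0])
f_map = newton_map(K, y)
print(np.round(f_map, 4))
```

At the returned point the gradient vanishes, which is equivalent to $\mathbf{f} = \mathbf{K}(\mathbf{y} - \boldsymbol{\sigma}(\mathbf{f}))$ and gives a quick convergence check.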

### 4.2 Laplace Approximation Parameters

- Mean: $\mathbf{m} = \mathbf{f}_{\text{MAP}}$
- Covariance: $\mathbf{S} = (\mathbf{K}^{-1} + \boldsymbol{\Lambda})^{-1}$

### 4.3 Numerically Stable Computation

Using the Woodbury identity avoids inverting $\mathbf{K}$ directly:

$$\mathbf{S} = \mathbf{K} - \mathbf{K}\boldsymbol{\Lambda}^{1/2}(\mathbf{I} + \boldsymbol{\Lambda}^{1/2}\mathbf{K}\boldsymbol{\Lambda}^{1/2})^{-1}\boldsymbol{\Lambda}^{1/2}\mathbf{K}$$

Algorithm:
1. Compute $\boldsymbol{\Lambda}^{1/2} = \text{diag}(\sqrt{\sigma(f_n)(1-\sigma(f_n))})$
2. Form $\mathbf{B} = \mathbf{I} + \boldsymbol{\Lambda}^{1/2}\mathbf{K}\boldsymbol{\Lambda}^{1/2}$
3. Compute Cholesky: $\mathbf{B} = \mathbf{L}_B\mathbf{L}_B^T$
4. Solve: $\mathbf{e} = \mathbf{L}_B^{-1}(\boldsymbol{\Lambda}^{1/2}\mathbf{K})$
5. Compute: $\mathbf{S} = \mathbf{K} - \mathbf{e}^T\mathbf{e}$
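
The five steps can be sketched in NumPy as follows. This is a minimal illustration with hypothetical names; it uses the generic `np.linalg.solve` for step 4, whereas a dedicated triangular solver (for example `scipy.linalg.solve_triangular`) would exploit the structure of $\mathbf{L}_B$.

```python
import numpy as np

def laplace_covariance(K, lam):
    """S = (K^{-1} + Lambda)^{-1}, computed without inverting K.

    K   : (N, N) kernel matrix
    lam : (N,) diagonal of Lambda, entries sigma(f_n) * (1 - sigma(f_n))
    """
    N = K.shape[0]
    sqrt_lam = np.sqrt(lam)                                      # step 1
    B = np.eye(N) + sqrt_lam[:, None] * K * sqrt_lam[None, :]    # step 2
    L_B = np.linalg.cholesky(B)                                  # step 3
    e = np.linalg.solve(L_B, sqrt_lam[:, None] * K)              # step 4
    return K - e.T @ e                                           # step 5

# Check against the direct formula on a small, well-conditioned example
X = np.array([0.0, 1.0, 2.0])
K = np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2)
lam = np.array([0.25, 0.20, 0.25])
S_stable = laplace_covariance(K, lam)
S_direct = np.linalg.inv(np.linalg.inv(K) + np.diag(lam))
print(np.max(np.abs(S_stable - S_direct)))
```

$\mathbf{B}$ has eigenvalues of at least 1, so its Cholesky factorization is well conditioned even when $\mathbf{K}$ itself is nearly singular.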

<a name="predictive-distribution"></a>
## 5. Posterior Predictive Distribution

### 5.1 Predictive Distribution for Latent Function

For new input $\mathbf{x}_*$:

$$p(f_*|\mathbf{y}, \mathbf{x}_*) \approx \mathcal{N}(f_*|\mu_*, \sigma_*^2)$$

Where:
- $\mu_* = \mathbf{k}_*^T \mathbf{K}^{-1} \mathbf{m}$
- $\sigma_*^2 = k_{**} - \mathbf{k}_*^T \mathbf{K}^{-1} (\mathbf{K} - \mathbf{S}) \mathbf{K}^{-1} \mathbf{k}_*$

With:
- $\mathbf{k}_* = [k(\mathbf{x}_*, \mathbf{x}_1), ..., k(\mathbf{x}_*, \mathbf{x}_N)]^T$
- $k_{**} = k(\mathbf{x}_*, \mathbf{x}_*)$
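
A minimal sketch of these two formulas for a single test input, assuming NumPy and that $\mathbf{m}$ and $\mathbf{S}$ have already been computed (the function name and the example values are illustrative only). A convenient sanity check: if $\mathbf{x}_*$ coincides with training point $i$, then $\mathbf{K}^{-1}\mathbf{k}_*$ is the $i$-th unit vector, so the formulas must return $\mu_* = m_i$ and $\sigma_*^2 = S_{ii}$.

```python
import numpy as np

def predict_latent(K, k_star, k_ss, m, S):
    """Mean and variance of f_* under the Laplace approximation."""
    v = np.linalg.solve(K, k_star)      # v = K^{-1} k_*  (K is symmetric)
    mu = v @ m                          # k_*^T K^{-1} m
    var = k_ss - v @ (K - S) @ v        # k_** - k_*^T K^{-1} (K - S) K^{-1} k_*
    return mu, var

# Illustrative inputs (not taken from the guide's data)
X = np.array([0.0, 1.0, 2.0])
K = np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2)
m = np.array([-0.2, 0.0, -0.2])
S = np.linalg.inv(np.linalg.inv(K) + np.diag([0.25, 0.25, 0.25]))

# Predict exactly at training point i = 1
mu_1, var_1 = predict_latent(K, K[:, 1], K[1, 1], m, S)
print(mu_1, m[1])
print(var_1, S[1, 1])
```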

### 5.2 Predictive Probability for Class Label

$$p(y_*=1|\mathbf{y}, \mathbf{x}_*) = \int \sigma(f_*) p(f_*|\mathbf{y}, \mathbf{x}_*) df_*$$

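This integral has no closed form. Using the probit approximation $\sigma(f) \approx \Phi(\sqrt{\pi/8}\, f)$, the Gaussian integral can be evaluated analytically: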

$$p(y_*=1|\mathbf{y}, \mathbf{x}_*) \approx \Phi\left(\frac{\mu_*}{\sqrt{\frac{8}{\pi} + \sigma_*^2}}\right)$$

Where $\Phi$ is the standard normal CDF.
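
To get a feel for how close the approximation is, the sketch below compares it with a brute-force trapezoidal estimate of the integral. It uses only the Python standard library; the function names and the test values of $\mu_*$ and $\sigma_*^2$ are illustrative assumptions.

```python
import math

def sigmoid(f):
    return 1.0 / (1.0 + math.exp(-f))

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def probit_approx(mu, var):
    """Closed-form approximation Phi(mu / sqrt(8/pi + var))."""
    return normal_cdf(mu / math.sqrt(8.0 / math.pi + var))

def quadrature(mu, var, n=2001, width=8.0):
    """Trapezoidal estimate of the integral of sigmoid(f) * N(f | mu, var)."""
    sd = math.sqrt(var)
    lo, hi = mu - width * sd, mu + width * sd
    h = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        f = lo + i * h
        weight = 0.5 if i in (0, n - 1) else 1.0   # trapezoid end points
        pdf = math.exp(-0.5 * ((f - mu) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))
        total += weight * sigmoid(f) * pdf * h
    return total

p_approx = probit_approx(1.0, 2.0)
p_numeric = quadrature(1.0, 2.0)
print(p_approx, p_numeric)
```

Larger predictive variance pulls the probability toward 0.5, which is how uncertainty about $f_*$ translates into a less confident class prediction.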

<a name="implementation"></a>
## 6. Implementation Details

### 6.1 Key Classes

```python
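import jax.numpy as jnp
from jax.nn import sigmoid, log_sigmoid


class BernoulliLikelihood:
    """Implements Bernoulli likelihood with sigmoid link"""

    def __init__(self, y):
        # Binary labels in {0, 1}, one per training point
        self.y = jnp.asarray(y, dtype=float)

    def log_lik(self, f):
        # log p(y|f) = y·log(σ(f)) + (1-y)·log(1-σ(f)), where log(1-σ(f)) = log σ(-f).
        # log_sigmoid avoids log(0) when |f| is large.
        return jnp.sum(self.y * log_sigmoid(f) +
                       (1 - self.y) * log_sigmoid(-f))

    def grad(self, f):
        # ∇f log p(y|f) = y - σ(f)
        return self.y - sigmoid(f)

    def hessian(self, f):
        # ∇²f log p(y|f) = -diag(σ(f)(1-σ(f)))
        return jnp.diag(-sigmoid(f) * (1 - sigmoid(f)))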
```

```python
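class GaussianProcessClassification:
    """Main GPC class using Laplace approximation"""

    def __init__(self, X, y, likelihood, kernel, kappa, lengthscale):
        # Store data and parameters
        self.X, self.y = X, y
        self.N = len(X)
        self.likelihood = likelihood(y)
        self.kernel = kernel
        self.kappa, self.lengthscale = kappa, lengthscale

        # Compute kernel matrix with the given hyperparameters.
        # Assumption: construct_kernel accepts kappa and lengthscale;
        # adapt this call to the actual signature of your kernel class.
        self.K = kernel.construct_kernel(X, X, kappa, lengthscale)
        # A small diagonal jitter keeps the Cholesky factorization stable
        self.L = jnp.linalg.cholesky(self.K + 1e-8 * jnp.eye(self.N))

        # Construct Laplace approximation
        self.construct_laplace_approximation()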
```

### 6.2 MAP Optimization

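The code uses the reparameterization $\mathbf{f} = \mathbf{K}\mathbf{a}$, so that the prior term $\mathbf{f}^T\mathbf{K}^{-1}\mathbf{f} = \mathbf{a}^T\mathbf{K}\mathbf{a}$ can be evaluated without inverting $\mathbf{K}$: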

```python
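from scipy.optimize import minimize

def log_joint_a(self, a):
    f = self.K @ a  # Reparameterization: f^T K^{-1} f = a^T K a = a · f
    # log N(f|0, K) without the constant -N/2·log(2π); -0.5·log|K| = -Σ log diag(L)
    log_prior = -0.5 * jnp.sum(a * f) - jnp.sum(jnp.log(jnp.diag(self.L)))
    log_lik = self.likelihood.log_lik(f)
    return log_prior + log_lik

def grad_a(self, a):
    # Chain rule through f = K a: ∇a = K (∇f log p(y|f) - a)
    f = self.K @ a
    return self.K @ (self.likelihood.grad(f) - a)

def compute_f_MAP(self):
    # Minimize the negative log joint over a, then map back to f = K a
    result = minimize(lambda a: -self.log_joint_a(a),
                     jac=lambda a: -self.grad_a(a),
                     x0=jnp.zeros(self.N))
    return self.K @ result.x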
```
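
The gradient passed to the optimizer follows from the chain rule. A short derivation, assuming `grad_a` is the gradient of `log_joint_a` with respect to $\mathbf{a}$:

$$\begin{align}
\log p(\mathbf{y}, \mathbf{K}\mathbf{a}) &= \sum_{n=1}^N \log p(y_n|[\mathbf{K}\mathbf{a}]_n) - \frac{1}{2}\mathbf{a}^T\mathbf{K}\mathbf{a} + \text{const} \\
\nabla_{\mathbf{a}} \log p(\mathbf{y}, \mathbf{K}\mathbf{a}) &= \mathbf{K}\left(\mathbf{y} - \boldsymbol{\sigma}(\mathbf{K}\mathbf{a}) - \mathbf{a}\right)
\end{align}$$

Since $\mathbf{K}$ is positive definite, the gradient vanishes exactly when $\mathbf{a} = \mathbf{y} - \boldsymbol{\sigma}(\mathbf{f})$. At the optimum this gives $\mathbf{f}_{\text{MAP}} = \mathbf{K}(\mathbf{y} - \boldsymbol{\sigma}(\mathbf{f}_{\text{MAP}}))$, and the product $\mathbf{K}^{-1}\mathbf{m}$ in the predictive mean of Section 5.1 is simply the optimal $\mathbf{a}$.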

<a name="hand-calculations"></a>
## 7. Hand Calculations Example

Let's work through a small example with N=3 training points:

### Setup
- Training inputs: $\mathbf{X} = [0, 1, 2]^T$
- Training labels: $\mathbf{y} = [0, 1, 0]^T$
- Kernel parameters: $\kappa = 1, \ell = 1$

### Step 1: Compute Kernel Matrix

$$K_{ij} = \exp\left(-\frac{(x_i - x_j)^2}{2}\right)$$

$$\mathbf{K} = \begin{bmatrix}
1.000 & 0.607 & 0.135 \\
0.607 & 1.000 & 0.607 \\
0.135 & 0.607 & 1.000
\end{bmatrix}$$

### Step 2: Find MAP Estimate

Initialize: $\mathbf{f}^{(0)} = [0, 0, 0]^T$

Iterate using gradient ascent:
$$\mathbf{f}^{(t+1)} = \mathbf{f}^{(t)} + \alpha \nabla \log p(\mathbf{y}, \mathbf{f}^{(t)})$$

At convergence: $\mathbf{f}_{\text{MAP}} \approx [-0.20, -0.04, -0.20]^T$

This point satisfies the stationarity condition $\mathbf{f} = \mathbf{K}(\mathbf{y} - \boldsymbol{\sigma}(\mathbf{f}))$. The values stay close to zero because three observations carry little information relative to the unit-variance prior, and $f_2$ is slightly negative even though $y_2 = 1$ because its two neighbours with label 0 pull it down through the kernel.

### Step 3: Compute Hessian

At MAP:
- $\sigma(f_1) \approx 0.45, \sigma(f_2) \approx 0.49, \sigma(f_3) \approx 0.45$
- $\Lambda_{11} = 0.45 \times 0.55 \approx 0.247$
- $\Lambda_{22} = 0.49 \times 0.51 \approx 0.250$
- $\Lambda_{33} = 0.45 \times 0.55 \approx 0.247$

$$\boldsymbol{\Lambda} = \begin{bmatrix}
0.247 & 0 & 0 \\
0 & 0.250 & 0 \\
0 & 0 & 0.247
\end{bmatrix}$$

### Step 4: Compute Posterior Covariance

Using the stable algorithm:
1. $\boldsymbol{\Lambda}^{1/2} = \text{diag}(0.497, 0.500, 0.497)$
2. $\mathbf{B} = \mathbf{I} + \boldsymbol{\Lambda}^{1/2}\mathbf{K}\boldsymbol{\Lambda}^{1/2}$
3. Compute $\mathbf{S}$ using Cholesky decomposition

### Step 5: Prediction at $x_* = 1.5$

1. Compute $\mathbf{k}_* = [k(1.5, 0), k(1.5, 1), k(1.5, 2)]^T = [e^{-1.125}, e^{-0.125}, e^{-0.125}]^T$
   $$\mathbf{k}_* = [0.325, 0.882, 0.882]^T$$

2. Compute predictive mean (using $\mathbf{K}^{-1}\mathbf{m} = \mathbf{y} - \boldsymbol{\sigma}(\mathbf{m})$ at the MAP):
   $$\mu_* = \mathbf{k}_*^T \mathbf{K}^{-1} \mathbf{m} \approx -0.09$$

3. Compute predictive variance:
   $$\sigma_*^2 = k_{**} - \mathbf{k}_*^T \mathbf{K}^{-1}(\mathbf{K} - \mathbf{S})\mathbf{K}^{-1}\mathbf{k}_* \approx 0.71$$

4. Compute class probability:
   $$p(y_*=1) \approx \Phi\left(\frac{-0.09}{\sqrt{2.546 + 0.71}}\right) \approx 0.48$$
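
The five steps can be checked numerically with a short script. This is a sketch under simplifying assumptions: it uses plain gradient ascent with a fixed step size as in Step 2, and it inverts the 3×3 kernel matrix directly, which is acceptable at this size but not recommended in general. All variable names are illustrative.

```python
import math
import numpy as np

def sigmoid(f):
    return 1.0 / (1.0 + np.exp(-f))

X = np.array([0.0, 1.0, 2.0])
y = np.array([0.0, 1.0, 0.0])
x_star = 1.5

# Step 1: kernel matrix with kappa = 1, lengthscale = 1
K = np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2)
K_inv = np.linalg.inv(K)

# Step 2: gradient ascent on log p(y, f), starting from f = 0
f = np.zeros(3)
alpha = 0.1   # small enough for stable iterations on this problem
for _ in range(5000):
    f = f + alpha * (y - sigmoid(f) - K_inv @ f)

# Step 3: diagonal of Lambda at the mode
s = sigmoid(f)
lam = s * (1.0 - s)

# Step 4: posterior covariance S = (K^{-1} + Lambda)^{-1}
S = np.linalg.inv(K_inv + np.diag(lam))

# Step 5: prediction at x_star (k_** = 1 because kappa = 1)
k_star = np.exp(-0.5 * (x_star - X) ** 2)
v = K_inv @ k_star
mu_star = v @ f
var_star = 1.0 - v @ (K - S) @ v
z = mu_star / math.sqrt(8.0 / math.pi + var_star)
p_star = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

print("f_MAP   :", np.round(f, 3))
print("Lambda  :", np.round(lam, 3))
print("k_star  :", np.round(k_star, 3))
print("mu_star :", round(float(mu_star), 3))
print("var_star:", round(float(var_star), 3))
print("p_star  :", round(float(p_star), 3))
```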

<a name="predictions"></a>
## 8. Making Predictions

### 8.1 Point Predictions

```python
# Predict probabilities
p_test = gpc.predict_y(Xtest)

# Binary classification
ytest_hat = (p_test > 0.5).astype(float)
```

### 8.2 Uncertainty Quantification

```python
# Get predictive distribution
mu, Sigma = gpc.predict_f(Xstar)

# Predictive mean and variance
mean_predictions = mu
variance_predictions = jnp.diag(Sigma)

# 95% credible interval for the latent function f (not for the class probability)
std = jnp.sqrt(variance_predictions)
lower_bound = mu - 1.96 * std
upper_bound = mu + 1.96 * std
```

### 8.3 Sampling from Posterior

```python
# Generate samples from posterior predictive
f_samples = gpc.posterior_samples(Xstar, num_samples=100)

# Convert to probability samples
p_samples = sigmoid(f_samples)
```

<a name="code-walkthrough"></a>
## 9. Code Walkthrough

### 9.1 Complete Training Pipeline

```python
# 1. Setup kernel and likelihood
kernel = StationaryIsotropicKernel(squared_exponential)
likelihood = BernoulliLikelihood

# 2. Create and train GPC model
gpc = GaussianProcessClassification(
    X, y, 
    likelihood, 
    kernel, 
    kappa=3.0,      # Signal variance
    lengthscale=1.0  # Smoothness parameter
)

# 3. Make predictions
mu, Sigma = gpc.predict_f(Xtest)  # Latent function
p_test = gpc.predict_y(Xtest)     # Class probabilities
```

### 9.2 Visualization Example

```python
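import numpy as np
import matplotlib.pyplot as plt

# Create prediction grid (assumes 2-D inputs)
x_grid = np.linspace(-5, 5, 50)
X_grid = np.meshgrid(x_grid, x_grid)
X_flat = np.column_stack([X_grid[0].ravel(), X_grid[1].ravel()])

# Predict on grid
p_grid = np.asarray(gpc.predict_y(X_flat)).reshape(50, 50)

# Plot predicted class probability; the decision boundary is the P(y=1) = 0.5 contour
plt.contourf(X_grid[0], X_grid[1], p_grid, levels=20, cmap='RdBu')
plt.colorbar(label='P(y=1)')
plt.contour(X_grid[0], X_grid[1], p_grid, levels=[0.5], colors='k')
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors='k')
plt.title('GPC Decision Boundary')
plt.show()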

### 9.3 Model Selection

```python
# Grid search over hyperparameters
kappa_values = [0.1, 1.0, 10.0]
lengthscale_values = [0.1, 1.0, 10.0]

best_score = -np.inf
best_params = None

for kappa in kappa_values:
    for lengthscale in lengthscale_values:
        gpc = GaussianProcessClassification(
            X_train, y_train, 
            likelihood, kernel,
            kappa=kappa, 
            lengthscale=lengthscale
        )
        
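        # Evaluate the held-out log-likelihood on the validation set
        # (clip probabilities to avoid log(0) when predictions saturate)
        p_val = np.clip(gpc.predict_y(X_val), 1e-12, 1 - 1e-12)
        log_likelihood = np.sum(
            y_val * np.log(p_val) +
            (1 - y_val) * np.log(1 - p_val)
        )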
        
        if log_likelihood > best_score:
            best_score = log_likelihood
            best_params = (kappa, lengthscale)
```

## Summary

Gaussian Process Classification provides a principled, probabilistic approach to binary classification that:

1. **Models complexity automatically** through the kernel function
2. **Provides uncertainty estimates** via the posterior predictive distribution
3. **Handles non-linear boundaries** without explicit feature engineering
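4. **Scales to moderate-sized datasets** (up to ~10,000 points): the Cholesky factorizations cost $O(N^3)$ time and $O(N^2)$ memory, which is the practical bottleneck for exact inference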

The Laplace approximation makes inference tractable while maintaining the benefits of the Bayesian framework, providing both point predictions and principled uncertainty quantification.