# Day 3: Gaussian Processes for Nonparametric Regression in Trading

## Week 20: Bayesian Methods

---

## Learning Objectives

By the end of this notebook, you will:

1. **Understand Gaussian Process fundamentals** - distributions over functions, prior and posterior
2. **Master kernel functions** - RBF, Matérn, periodic, and custom kernels for financial data
3. **Implement GP regression** - from scratch and using GPyTorch/scikit-learn
4. **Apply GPs to trading problems** - volatility modeling, price prediction, uncertainty quantification
5. **Understand GP limitations** - computational complexity, scalability solutions

---

## Table of Contents

1. [Introduction to Gaussian Processes](#1-introduction)
2. [Mathematical Foundations](#2-mathematical-foundations)
3. [Kernel Functions for Finance](#3-kernel-functions)
4. [GP Regression from Scratch](#4-gp-from-scratch)
5. [GP with GPyTorch](#5-gpytorch)
6. [Trading Applications](#6-trading-applications)
7. [Scalable GP Methods](#7-scalable-gp)
8. [Practice Exercises](#8-exercises)

---

## 1. Introduction to Gaussian Processes <a name="1-introduction"></a>

### What is a Gaussian Process?

A **Gaussian Process (GP)** is a collection of random variables, any finite number of which have a joint Gaussian distribution. GPs provide a **nonparametric** approach to regression that:

- Places a **prior distribution over functions** rather than parameters
- Provides **uncertainty estimates** for predictions
- Is **flexible** and can model complex nonlinear relationships
- Is **data-efficient** for small datasets

### Why GPs for Trading?

| Advantage | Trading Application |
|-----------|--------------------|
| **Uncertainty quantification** | Risk management, position sizing |
| **Nonparametric flexibility** | Regime-adaptive modeling |
| **Smooth interpolation** | Volatility surface modeling |
| **Bayesian framework** | Sequential updating with new data |
| **Small data effectiveness** | Limited historical scenarios |

### GP vs Parametric Models

```
Parametric (e.g., Linear Regression):
- Fixed functional form: f(x) = wx + b
- Learn parameters w, b
- Limited flexibility

Gaussian Process:
- No fixed functional form
- Distribution over all possible functions
- Flexibility controlled by kernel choice
```
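
To make the contrast concrete, here is a minimal sketch that fits both kinds of model to the same noisy sine data. The data, the fixed hyperparameters (`length_scale=1.0`, `noise_var`), and the helper `rbf` are illustrative assumptions, and the comparison is in-sample only.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 25)
f_true = np.sin(x)
y = f_true + 0.05 * rng.standard_normal(x.size)

# Parametric model: fixed form f(x) = w*x + b, two parameters
w, b = np.polyfit(x, y, deg=1)
linear_fit = w * x + b

# GP posterior mean with an RBF kernel (hyperparameters set by hand, not optimized)
def rbf(a, c, length_scale=1.0, variance=1.0):
    return variance * np.exp(-0.5 * (a[:, None] - c[None, :]) ** 2 / length_scale**2)

noise_var = 0.05**2
K = rbf(x, x)
gp_fit = K @ np.linalg.solve(K + noise_var * np.eye(x.size), y)

rmse_linear = np.sqrt(np.mean((linear_fit - f_true) ** 2))
rmse_gp = np.sqrt(np.mean((gp_fit - f_true) ** 2))
print(f"Linear RMSE vs true function: {rmse_linear:.3f}")
print(f"GP RMSE vs true function:     {rmse_gp:.3f}")
```

The straight line cannot bend to follow the sine, whereas the GP's flexibility comes entirely from the kernel and its length scale.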

In [None]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from scipy.linalg import cholesky, cho_solve
from scipy.optimize import minimize
import warnings
warnings.filterwarnings('ignore')

# Set style
plt.style.use('seaborn-v0_8-whitegrid')
np.random.seed(42)

print("Libraries imported successfully!")
print(f"NumPy version: {np.__version__}")

---

## 2. Mathematical Foundations <a name="2-mathematical-foundations"></a>

### Definition

A Gaussian Process is fully specified by its **mean function** $m(x)$ and **covariance function** (kernel) $k(x, x')$:

$$f(x) \sim \mathcal{GP}(m(x), k(x, x'))$$

For any finite set of points $\{x_1, ..., x_n\}$:

$$\begin{bmatrix} f(x_1) \\ \vdots \\ f(x_n) \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} m(x_1) \\ \vdots \\ m(x_n) \end{bmatrix}, \begin{bmatrix} k(x_1, x_1) & \cdots & k(x_1, x_n) \\ \vdots & \ddots & \vdots \\ k(x_n, x_1) & \cdots & k(x_n, x_n) \end{bmatrix} \right)$$
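
As a small illustration of this finite-dimensional view, the sketch below builds the mean vector and covariance matrix for three input points and draws from the resulting multivariate normal. The input locations, the zero mean function, and the helper `rbf` are assumptions chosen for illustration.

```python
import numpy as np

def rbf(x1, x2, length_scale=1.0, variance=1.0):
    """Squared-exponential kernel for 1-D inputs."""
    diff = x1[:, None] - x2[None, :]
    return variance * np.exp(-0.5 * diff**2 / length_scale**2)

# Three arbitrary input locations (illustrative values)
x = np.array([0.0, 0.5, 3.0])

mean_vec = np.zeros(len(x))  # m(x) = 0 at every input
cov_mat = rbf(x, x)          # cov_mat[i, j] = k(x_i, x_j)

# Draw many realizations of (f(x_1), f(x_2), f(x_3))
rng = np.random.default_rng(0)
draws = rng.multivariate_normal(mean_vec, cov_mat, size=10_000)
emp_corr = np.corrcoef(draws.T)

print("Kernel matrix:\n", np.round(cov_mat, 3))
print("Empirical correlation of draws:\n", np.round(emp_corr, 2))
```

Function values at nearby inputs (0.0 and 0.5) are strongly correlated, while the value at 3.0 is almost independent of the other two. This is how the kernel encodes smoothness.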

### GP Prior

Before observing data, the GP prior encodes our beliefs about the function (smoothness, periodicity, typical amplitude) through the choice of mean function and kernel.

### GP Posterior

Given training data $\mathbf{X}, \mathbf{y}$ and test points $\mathbf{X}_*$:

$$\mathbf{f}_* | \mathbf{X}, \mathbf{y}, \mathbf{X}_* \sim \mathcal{N}(\boldsymbol{\mu}_*, \boldsymbol{\Sigma}_*)$$

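Where (assuming a zero prior mean, $m(x) = 0$, and i.i.d. Gaussian observation noise with variance $\sigma_n^2$):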
- $\boldsymbol{\mu}_* = K(\mathbf{X}_*, \mathbf{X})[K(\mathbf{X}, \mathbf{X}) + \sigma_n^2 I]^{-1}\mathbf{y}$
- $\boldsymbol{\Sigma}_* = K(\mathbf{X}_*, \mathbf{X}_*) - K(\mathbf{X}_*, \mathbf{X})[K(\mathbf{X}, \mathbf{X}) + \sigma_n^2 I]^{-1}K(\mathbf{X}, \mathbf{X}_*)$
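
With a single observation the matrix expressions reduce to scalars, which makes them easy to check by hand. A minimal worked example, assuming a zero-mean prior, an RBF kernel with unit variance and unit length scale, and one hypothetical noise-free observation:

```python
import numpy as np

def k(a, b, length_scale=1.0, variance=1.0):
    """Scalar RBF kernel."""
    return variance * np.exp(-0.5 * (a - b) ** 2 / length_scale**2)

x_obs, y_obs = 0.0, 1.0  # single hypothetical observation
x_star = 1.0             # test location
noise_var = 0.0          # noise-free, to keep the arithmetic readable

k_xx = k(x_obs, x_obs) + noise_var  # K(X, X) + sigma_n^2 I
k_sx = k(x_star, x_obs)             # K(X_*, X)
k_ss = k(x_star, x_star)            # K(X_*, X_*)

post_mean = k_sx / k_xx * y_obs
post_var = k_ss - k_sx / k_xx * k_sx

print(f"posterior mean: {post_mean:.4f}")  # exp(-0.5)   = 0.6065
print(f"posterior var:  {post_var:.4f}")   # 1 - exp(-1) = 0.6321
```

The mean is pulled toward the observed value in proportion to the kernel correlation, and the variance shrinks from its prior value of 1 by the squared correlation.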

In [None]:
# Visualize GP Prior - Sampling functions from prior

def rbf_kernel(X1, X2, length_scale=1.0, variance=1.0):
    """
    Radial Basis Function (RBF) / Squared Exponential kernel.
    
    k(x, x') = variance * exp(-0.5 * ||x - x'||^2 / length_scale^2)
    """
    X1 = np.atleast_2d(X1)
    X2 = np.atleast_2d(X2)
    
    # Compute squared Euclidean distances
    sqdist = np.sum(X1**2, axis=1).reshape(-1, 1) + \
             np.sum(X2**2, axis=1) - 2 * np.dot(X1, X2.T)
    
    return variance * np.exp(-0.5 * sqdist / length_scale**2)

# Create test points
X_test = np.linspace(-5, 5, 200).reshape(-1, 1)

# Sample from GP prior with different length scales
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

length_scales = [0.5, 1.0, 2.0]

for ax, ls in zip(axes, length_scales):
    # Compute covariance matrix
    K = rbf_kernel(X_test, X_test, length_scale=ls)
    K += 1e-8 * np.eye(len(X_test))  # Numerical stability
    
    # Sample functions from prior
    samples = np.random.multivariate_normal(np.zeros(len(X_test)), K, size=5)
    
    for i, sample in enumerate(samples):
        ax.plot(X_test, sample, alpha=0.7, label=f'Sample {i+1}')
    
    ax.set_xlabel('x')
    ax.set_ylabel('f(x)')
    ax.set_title(f'GP Prior Samples (length_scale={ls})')
    ax.set_ylim(-3, 3)

plt.suptitle('Effect of Length Scale on GP Prior Samples', fontsize=14, y=1.02)
plt.tight_layout()
plt.show()

print("Shorter length scale → more wiggly functions")
print("Longer length scale → smoother functions")
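
The jitter term added to `K` in the cell above matters because a dense RBF covariance matrix is numerically close to singular. As a sketch of an equivalent sampling route, one can factor $K = LL^\top$ and map standard normal draws through $L$. The jitter of `1e-6`, the grid, and the use of `np.random.default_rng` are assumptions made for this example.

```python
import numpy as np

x = np.linspace(-5, 5, 100)
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)  # RBF, unit variance and length scale

# Without jitter the factorization may fail on round-off; a small diagonal term fixes that
jitter = 1e-6
L = np.linalg.cholesky(K + jitter * np.eye(x.size))  # lower-triangular factor

rng = np.random.default_rng(42)
z = rng.standard_normal((x.size, 2000))  # independent N(0, 1) draws
samples = (L @ z).T                      # each row is one function sampled on the grid

emp_var = samples.var(axis=0)
print(f"Average pointwise variance: {emp_var.mean():.3f} (prior variance is 1)")
```

Because `L @ z` has covariance `L @ L.T`, each row of `samples` is a draw from the same prior as before, up to the jitter.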

In [None]:
# Visualize GP Posterior - Conditioning on observations

def gp_posterior(X_train, y_train, X_test, length_scale=1.0, variance=1.0, noise=1e-4):
    """
    Compute GP posterior mean and covariance.
    
    Parameters:
    -----------
    X_train : training inputs
    y_train : training targets
    X_test : test inputs
    length_scale : kernel length scale
    variance : kernel variance
    noise : observation noise variance
    
    Returns:
    --------
    mu : posterior mean
    cov : posterior covariance
    """
    # Compute kernel matrices
    K = rbf_kernel(X_train, X_train, length_scale, variance) + noise * np.eye(len(X_train))
    K_s = rbf_kernel(X_train, X_test, length_scale, variance)
    K_ss = rbf_kernel(X_test, X_test, length_scale, variance)
    
    # Cholesky decomposition for numerical stability
    L = cholesky(K, lower=True)
    
    # Solve for alpha = K^{-1} y
    alpha = cho_solve((L, True), y_train)
    
    # Posterior mean
    mu = K_s.T @ alpha
    
    # Solve for v = K^{-1} K_s (cho_solve applies the full inverse, not just L^{-1})
    v = cho_solve((L, True), K_s)
    
    # Posterior covariance
    cov = K_ss - K_s.T @ v
    
    return mu, cov

# Generate synthetic training data
np.random.seed(42)
X_train = np.array([-4, -3, -2, -1, 1, 2, 3, 4]).reshape(-1, 1)
y_train = np.sin(X_train).flatten() + 0.1 * np.random.randn(len(X_train))

# Test points
X_test = np.linspace(-5, 5, 200).reshape(-1, 1)

# Compute posterior
mu, cov = gp_posterior(X_train, y_train, X_test, length_scale=1.0, variance=1.0, noise=0.01)
std = np.sqrt(np.diag(cov))

# Plot
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Prior
ax = axes[0]
K_prior = rbf_kernel(X_test, X_test) + 1e-8 * np.eye(len(X_test))
prior_samples = np.random.multivariate_normal(np.zeros(len(X_test)), K_prior, size=5)
for sample in prior_samples:
    ax.plot(X_test, sample, alpha=0.5)
ax.fill_between(X_test.flatten(), -2, 2, alpha=0.2, color='gray', label='Prior uncertainty')
ax.set_xlabel('x')
ax.set_ylabel('f(x)')
ax.set_title('GP Prior (Before Seeing Data)')
ax.set_ylim(-3, 3)

# Posterior
ax = axes[1]
ax.fill_between(X_test.flatten(), mu - 2*std, mu + 2*std, alpha=0.3, color='blue', label='95% CI')
ax.plot(X_test, mu, 'b-', lw=2, label='Posterior mean')
ax.scatter(X_train, y_train, c='red', s=100, zorder=5, label='Training data')
ax.plot(X_test, np.sin(X_test), 'k--', alpha=0.5, label='True function')

# Sample from posterior
cov_stable = cov + 1e-8 * np.eye(len(X_test))
posterior_samples = np.random.multivariate_normal(mu.flatten(), cov_stable, size=3)
for i, sample in enumerate(posterior_samples):
    ax.plot(X_test, sample, alpha=0.4, linestyle='--')

ax.set_xlabel('x')
ax.set_ylabel('f(x)')
ax.set_title('GP Posterior (After Conditioning on Data)')
ax.legend(loc='upper right')
ax.set_ylim(-3, 3)

plt.tight_layout()
plt.show()

print("\nKey observations:")
print("- Uncertainty is reduced near training points")
print("- Uncertainty grows away from data")
print("- Posterior samples pass through/near training points")

---

## 3. Kernel Functions for Finance <a name="3-kernel-functions"></a>

The kernel (covariance function) encodes our assumptions about the function we're modeling.

### Common Kernels

| Kernel | Formula | Properties | Financial Use |
|--------|---------|------------|---------------|
| **RBF/SE** | $k(x,x') = \sigma^2 \exp\left(-\frac{\lVert x-x' \rVert^2}{2l^2}\right)$ | Infinitely differentiable, smooth | General regression |
| **Matérn** | $k(x,x') = \sigma^2\left(1 + \frac{\sqrt{3}\,r}{l}\right)\exp\left(-\frac{\sqrt{3}\,r}{l}\right)$, $r = \lVert x-x' \rVert$ (shown for $\nu = 3/2$) | Adjustable smoothness via $\nu$ | Realistic price paths |
| **Periodic** | $k(x,x') = \sigma^2 \exp\left(-\frac{2\sin^2(\pi \lvert x-x' \rvert/p)}{l^2}\right)$ | Captures periodicity | Seasonality, calendar effects |
| **Rational Quadratic** | $k(x,x') = \sigma^2\left(1 + \frac{\lVert x-x' \rVert^2}{2\alpha l^2}\right)^{-\alpha}$ | Multi-scale | Mixed regime dynamics |
| **Linear** | $k(x,x') = \sigma_b^2 + \sigma_v^2(x-c)(x'-c)$ | Linear trends | Trend modeling |

In [None]:
# Implement various kernel functions

class Kernels:
    """Collection of kernel functions for Gaussian Processes."""
    
    @staticmethod
    def rbf(X1, X2, length_scale=1.0, variance=1.0):
        """Radial Basis Function (Squared Exponential) kernel."""
        X1, X2 = np.atleast_2d(X1), np.atleast_2d(X2)
        sqdist = np.sum(X1**2, 1).reshape(-1, 1) + np.sum(X2**2, 1) - 2 * X1 @ X2.T
        return variance * np.exp(-0.5 * sqdist / length_scale**2)
    
    @staticmethod
    def matern32(X1, X2, length_scale=1.0, variance=1.0):
        """Matérn 3/2 kernel - once differentiable."""
        X1, X2 = np.atleast_2d(X1), np.atleast_2d(X2)
        dist = np.sqrt(np.sum(X1**2, 1).reshape(-1, 1) + np.sum(X2**2, 1) - 2 * X1 @ X2.T + 1e-10)
        scaled_dist = np.sqrt(3) * dist / length_scale
        return variance * (1 + scaled_dist) * np.exp(-scaled_dist)
    
    @staticmethod
    def matern52(X1, X2, length_scale=1.0, variance=1.0):
        """Matérn 5/2 kernel - twice differentiable."""
        X1, X2 = np.atleast_2d(X1), np.atleast_2d(X2)
        dist = np.sqrt(np.sum(X1**2, 1).reshape(-1, 1) + np.sum(X2**2, 1) - 2 * X1 @ X2.T + 1e-10)
        scaled_dist = np.sqrt(5) * dist / length_scale
        return variance * (1 + scaled_dist + scaled_dist**2 / 3) * np.exp(-scaled_dist)
    
    @staticmethod
    def periodic(X1, X2, length_scale=1.0, variance=1.0, period=1.0):
        """Periodic kernel for seasonal patterns."""
        X1, X2 = np.atleast_2d(X1), np.atleast_2d(X2)
        dist = np.abs(X1 - X2.T)
        return variance * np.exp(-2 * np.sin(np.pi * dist / period)**2 / length_scale**2)
    
    @staticmethod
    def rational_quadratic(X1, X2, length_scale=1.0, variance=1.0, alpha=1.0):
        """Rational Quadratic kernel - mixture of RBFs with different length scales."""
        X1, X2 = np.atleast_2d(X1), np.atleast_2d(X2)
        sqdist = np.sum(X1**2, 1).reshape(-1, 1) + np.sum(X2**2, 1) - 2 * X1 @ X2.T
        return variance * (1 + sqdist / (2 * alpha * length_scale**2))**(-alpha)
    
    @staticmethod
    def linear(X1, X2, variance=1.0, variance_bias=1.0, offset=0.0):
        """Linear kernel for linear trends."""
        X1, X2 = np.atleast_2d(X1), np.atleast_2d(X2)
        return variance_bias + variance * (X1 - offset) @ (X2 - offset).T

# Visualize different kernels
X = np.linspace(-5, 5, 200).reshape(-1, 1)

fig, axes = plt.subplots(2, 3, figsize=(15, 10))

kernels_info = [
    ('RBF', lambda X1, X2: Kernels.rbf(X1, X2, length_scale=1.0)),
    ('Matérn 3/2', lambda X1, X2: Kernels.matern32(X1, X2, length_scale=1.0)),
    ('Matérn 5/2', lambda X1, X2: Kernels.matern52(X1, X2, length_scale=1.0)),
    ('Periodic', lambda X1, X2: Kernels.periodic(X1, X2, length_scale=1.0, period=2.0)),
    ('Rational Quadratic', lambda X1, X2: Kernels.rational_quadratic(X1, X2, alpha=0.5)),
    ('Linear', lambda X1, X2: Kernels.linear(X1, X2))
]

for ax, (name, kernel_func) in zip(axes.flatten(), kernels_info):
    K = kernel_func(X, X) + 1e-8 * np.eye(len(X))
    samples = np.random.multivariate_normal(np.zeros(len(X)), K, size=5)
    
    for sample in samples:
        ax.plot(X, sample, alpha=0.7)
    
    ax.set_title(f'{name} Kernel')
    ax.set_xlabel('x')
    ax.set_ylabel('f(x)')

plt.suptitle('GP Prior Samples with Different Kernels', fontsize=14, y=1.02)
plt.tight_layout()
plt.show()
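
The visual differences between these priors come from how quickly each kernel's correlation decays with distance. The sketch below tabulates that decay using the same closed forms as the `Kernels` class, assuming unit variance, unit length scale, and `alpha = 0.5`. The distance grid is arbitrary.

```python
import numpy as np

r = np.array([0.0, 0.5, 1.0, 2.0, 4.0])  # distances |x - x'|
length_scale = 1.0

rbf = np.exp(-0.5 * r**2 / length_scale**2)

s3 = np.sqrt(3) * r / length_scale
matern32 = (1 + s3) * np.exp(-s3)

s5 = np.sqrt(5) * r / length_scale
matern52 = (1 + s5 + s5**2 / 3) * np.exp(-s5)

alpha = 0.5
rational_quadratic = (1 + r**2 / (2 * alpha * length_scale**2)) ** (-alpha)

print(f"{'distance':<19}", r)
for name, vals in [("RBF", rbf), ("Matern 3/2", matern32),
                   ("Matern 5/2", matern52), ("Rational Quadratic", rational_quadratic)]:
    print(f"{name:<19}", np.round(vals, 4))
```

For these settings the rational quadratic keeps the heaviest tail, which is the sense in which it behaves like a mixture of RBF kernels with different length scales.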

In [None]:
# Kernel combinations - weighted sum of kernels

def combined_kernel(X1, X2, config):
    """
    Combine multiple kernels for complex patterns.
    
    Example: Trend + Seasonality + Local variation
    k(x,x') = k_linear(x,x') + k_periodic(x,x') + k_rbf(x,x')
    """
    K = np.zeros((len(X1), len(X2)))
    
    if 'linear' in config:
        K += config['linear']['weight'] * Kernels.linear(X1, X2, **config['linear']['params'])
    
    if 'periodic' in config:
        K += config['periodic']['weight'] * Kernels.periodic(X1, X2, **config['periodic']['params'])
    
    if 'rbf' in config:
        K += config['rbf']['weight'] * Kernels.rbf(X1, X2, **config['rbf']['params'])
    
    return K

# Financial time series kernel: Trend + Seasonality
X = np.linspace(0, 10, 200).reshape(-1, 1)

config = {
    'linear': {'weight': 0.3, 'params': {'variance': 0.5, 'variance_bias': 0.1}},
    'periodic': {'weight': 0.5, 'params': {'length_scale': 1.0, 'period': 2.5}},
    'rbf': {'weight': 0.2, 'params': {'length_scale': 0.5, 'variance': 0.5}}
}

K_combined = combined_kernel(X, X, config) + 1e-8 * np.eye(len(X))

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Samples from combined kernel
ax = axes[0]
samples = np.random.multivariate_normal(np.zeros(len(X)), K_combined, size=5)
for i, sample in enumerate(samples):
    ax.plot(X, sample, alpha=0.7, label=f'Sample {i+1}')
ax.set_xlabel('Time')
ax.set_ylabel('Value')
ax.set_title('GP Samples: Linear + Periodic + RBF Kernel')
ax.legend()

# Kernel matrix visualization
ax = axes[1]
im = ax.imshow(K_combined, cmap='viridis', aspect='auto')
ax.set_xlabel('Time index j')
ax.set_ylabel('Time index i')
ax.set_title('Covariance Matrix K(i,j)')
plt.colorbar(im, ax=ax)

plt.tight_layout()
plt.show()

print("Combined kernels capture multiple patterns:")
print("- Linear: Overall trend")
print("- Periodic: Seasonality/cycles")
print("- RBF: Local variations")
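
Sums are not the only way to combine kernels: the elementwise product of two valid kernels is also a valid kernel. A common example is a "locally periodic" kernel, RBF times periodic, in which the seasonal pattern is allowed to drift over time. The sketch below is self-contained, and its helper functions and parameter values are illustrative assumptions rather than part of the `Kernels` class.

```python
import numpy as np

def rbf(x1, x2, length_scale=1.0):
    d = x1[:, None] - x2[None, :]
    return np.exp(-0.5 * d**2 / length_scale**2)

def periodic(x1, x2, length_scale=1.0, period=1.0):
    d = np.abs(x1[:, None] - x2[None, :])
    return np.exp(-2 * np.sin(np.pi * d / period) ** 2 / length_scale**2)

t = np.linspace(0, 10, 101)  # time grid with step 0.1

# Locally periodic kernel: periodic pattern damped by a slowly decaying RBF envelope
K_lp = rbf(t, t, length_scale=3.0) * periodic(t, t, period=2.5)

# A valid covariance matrix is symmetric positive semi-definite
eigvals = np.linalg.eigvalsh(K_lp)
print(f"Smallest eigenvalue: {eigvals.min():.2e}")

# One full period apart (t = 0 vs t = 2.5) the periodic factor is 1,
# so the correlation equals the RBF envelope alone
print(f"k(0, 2.5) = {K_lp[0, 25]:.4f}")
print(f"k(0, 5.0) = {K_lp[0, 50]:.4f}")
```

Under a pure periodic kernel both printed correlations would be exactly 1. The RBF factor makes cycles that are far apart less tightly coupled.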

---

## 4. GP Regression from Scratch <a name="4-gp-from-scratch"></a>

### Key Components

1. **Kernel computation** - Build covariance matrices
2. **Posterior inference** - Condition on training data
3. **Hyperparameter optimization** - Maximize marginal likelihood
4. **Prediction** - Mean and uncertainty
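
Step 3 relies on the log marginal likelihood, which is usually evaluated through a Cholesky factor instead of an explicit inverse and determinant. The sketch below checks that the Cholesky form agrees with the direct formula on a small synthetic data set. The data, the hyperparameter values, and the function names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-3, 3, 15))
y = np.sin(x) + 0.1 * rng.standard_normal(15)

def rbf(x1, x2, length_scale, variance):
    d = x1[:, None] - x2[None, :]
    return variance * np.exp(-0.5 * d**2 / length_scale**2)

def log_marginal_likelihood(x, y, length_scale, variance, noise_var):
    n = len(x)
    K = rbf(x, x, length_scale, variance) + noise_var * np.eye(n)
    L = np.linalg.cholesky(K)                            # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # K^{-1} y via two solves
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))                 # equals 0.5 * log|K|
            - 0.5 * n * np.log(2 * np.pi))

lml_chol = log_marginal_likelihood(x, y, 1.0, 1.0, 0.01)

# Reference: the same quantity from the textbook formula
K = rbf(x, x, 1.0, 1.0) + 0.01 * np.eye(len(x))
_, logdet = np.linalg.slogdet(K)
lml_direct = (-0.5 * y @ np.linalg.solve(K, y)
              - 0.5 * logdet
              - 0.5 * len(x) * np.log(2 * np.pi))

print(f"Cholesky: {lml_chol:.6f}, direct: {lml_direct:.6f}")
```

Evaluating `log_marginal_likelihood` over a grid of length scales is a quick way to see which values the data support before handing the problem to an optimizer.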

In [None]:
class GaussianProcessRegressor:
    """
    Gaussian Process Regressor from scratch.
    
    Implements:
    - RBF kernel
    - Marginal likelihood optimization
    - Posterior prediction with uncertainty
    """
    
    def __init__(self, kernel='rbf', length_scale=1.0, variance=1.0, noise=1e-4):
        self.kernel = kernel
        self.length_scale = length_scale
        self.variance = variance
        self.noise = noise
        self.X_train = None
        self.y_train = None
        self.L = None  # Cholesky decomposition
        self.alpha = None  # K^{-1} y
        
    def _kernel(self, X1, X2):
        """Compute kernel matrix."""
        X1, X2 = np.atleast_2d(X1), np.atleast_2d(X2)
        sqdist = np.sum(X1**2, 1).reshape(-1, 1) + np.sum(X2**2, 1) - 2 * X1 @ X2.T
        return self.variance * np.exp(-0.5 * sqdist / self.length_scale**2)
    
    def _negative_log_marginal_likelihood(self, theta, X, y):
        """
        Compute negative log marginal likelihood.
        
        log p(y|X,theta) = -0.5 * y^T K^{-1} y - 0.5 * log|K| - n/2 * log(2π)
        """
        self.length_scale = np.exp(theta[0])
        self.variance = np.exp(theta[1])
        self.noise = np.exp(theta[2])
        
        K = self._kernel(X, X) + self.noise * np.eye(len(X))
        
        try:
            L = cholesky(K, lower=True)
        except (np.linalg.LinAlgError, ValueError):
            return 1e10  # Return large value if not positive definite
        
        alpha = cho_solve((L, True), y)
        
        # Log marginal likelihood
        log_likelihood = -0.5 * y @ alpha - np.sum(np.log(np.diag(L))) - 0.5 * len(X) * np.log(2 * np.pi)
        
        return -log_likelihood
    
    def fit(self, X, y, optimize=True):
        """
        Fit the GP model.
        
        Parameters:
        -----------
        X : array-like, shape (n_samples, n_features)
        y : array-like, shape (n_samples,)
        optimize : bool, whether to optimize hyperparameters
        """
        self.X_train = np.atleast_2d(X)
        self.y_train = np.atleast_1d(y)
        
        if optimize:
            # Optimize hyperparameters
            theta0 = np.log([self.length_scale, self.variance, self.noise])
            
            result = minimize(
                self._negative_log_marginal_likelihood,
                theta0,
                args=(self.X_train, self.y_train),
                method='L-BFGS-B',
                bounds=[(-5, 5), (-12, 5), (-10, 1)]  # log-space bounds: length scale, variance, noise
            )
            
            self.length_scale = np.exp(result.x[0])
            self.variance = np.exp(result.x[1])
            self.noise = np.exp(result.x[2])
        
        # Precompute for predictions
        K = self._kernel(self.X_train, self.X_train) + self.noise * np.eye(len(self.X_train))
        self.L = cholesky(K, lower=True)
        self.alpha = cho_solve((self.L, True), self.y_train)
        
        return self
    
    def predict(self, X, return_std=True, return_cov=False):
        """
        Make predictions.
        
        Returns:
        --------
        mu : posterior mean
        std : posterior standard deviation (if return_std=True)
        cov : posterior covariance (if return_cov=True)
        """
        X = np.atleast_2d(X)
        
        K_s = self._kernel(self.X_train, X)
        K_ss = self._kernel(X, X)
        
        # Posterior mean
        mu = K_s.T @ self.alpha
        
        # Posterior covariance
        v = cho_solve((self.L, True), K_s)
        cov = K_ss - K_s.T @ v
        
        if return_cov:
            return mu, cov
        elif return_std:
            std = np.sqrt(np.maximum(np.diag(cov), 1e-10))
            return mu, std
        else:
            return mu
    
    def sample_posterior(self, X, n_samples=5):
        """Sample functions from the posterior."""
        mu, cov = self.predict(X, return_cov=True)
        cov += 1e-8 * np.eye(len(X))  # Numerical stability
        return np.random.multivariate_normal(mu, cov, size=n_samples)

print("GaussianProcessRegressor class defined!")

In [None]:
# Test the GP regressor on synthetic data

# Generate data from a nonlinear function
np.random.seed(42)
n_train = 20

def true_function(x):
    return np.sin(x) + 0.3 * np.sin(3*x)

X_train = np.sort(np.random.uniform(-4, 4, n_train)).reshape(-1, 1)
y_train = true_function(X_train).flatten() + 0.1 * np.random.randn(n_train)

X_test = np.linspace(-5, 5, 200).reshape(-1, 1)

# Fit GP
gp = GaussianProcessRegressor(length_scale=1.0, variance=1.0, noise=0.01)
gp.fit(X_train, y_train, optimize=True)

print(f"Optimized hyperparameters:")
print(f"  Length scale: {gp.length_scale:.4f}")
print(f"  Variance: {gp.variance:.4f}")
print(f"  Noise: {gp.noise:.6f}")

# Predict
mu, std = gp.predict(X_test)

# Plot
plt.figure(figsize=(12, 6))

# Confidence intervals
plt.fill_between(X_test.flatten(), mu - 2*std, mu + 2*std, alpha=0.2, color='blue', label='95% CI')
plt.fill_between(X_test.flatten(), mu - std, mu + std, alpha=0.3, color='blue', label='68% CI')

# Mean prediction
plt.plot(X_test, mu, 'b-', lw=2, label='GP Mean')

# True function
plt.plot(X_test, true_function(X_test), 'k--', lw=1.5, alpha=0.7, label='True function')

# Training data
plt.scatter(X_train, y_train, c='red', s=80, zorder=5, edgecolors='k', label='Training data')

# Posterior samples
samples = gp.sample_posterior(X_test, n_samples=3)
for i, sample in enumerate(samples):
    plt.plot(X_test, sample, '--', alpha=0.5, lw=1)

plt.xlabel('x', fontsize=12)
plt.ylabel('f(x)', fontsize=12)
plt.title('Gaussian Process Regression (from scratch)', fontsize=14)
plt.legend(loc='upper right')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

---

## 5. GP with GPyTorch <a name="5-gpytorch"></a>

GPyTorch provides GPU-accelerated GP models with:
- Scalable inference (inducing points, variational methods)
- Deep kernel learning
- Integration with PyTorch

In [None]:
# Install GPyTorch if needed
try:
    import gpytorch
    import torch
    print(f"GPyTorch version: {gpytorch.__version__}")
    print(f"PyTorch version: {torch.__version__}")
except ImportError:
    print("Installing GPyTorch...")
    !pip install gpytorch
    import gpytorch
    import torch

In [None]:
import torch
import gpytorch
from gpytorch.models import ExactGP
from gpytorch.likelihoods import GaussianLikelihood
from gpytorch.means import ConstantMean, LinearMean
from gpytorch.kernels import ScaleKernel, RBFKernel, MaternKernel, PeriodicKernel
from gpytorch.distributions import MultivariateNormal

class ExactGPModel(ExactGP):
    """
    Standard GP model using GPyTorch.
    
    Components:
    - Mean function: constant
    - Kernel: scaled RBF, Matérn 5/2, or periodic, selected by kernel_type
    - Likelihood: Gaussian noise
    """
    
    def __init__(self, train_x, train_y, likelihood, kernel_type='rbf'):
        super().__init__(train_x, train_y, likelihood)
        
        # Mean function
        self.mean_module = ConstantMean()
        
        # Covariance function
        if kernel_type == 'rbf':
            self.covar_module = ScaleKernel(RBFKernel())
        elif kernel_type == 'matern':
            self.covar_module = ScaleKernel(MaternKernel(nu=2.5))
        elif kernel_type == 'periodic':
            self.covar_module = ScaleKernel(PeriodicKernel())
        else:
            raise ValueError(f"Unknown kernel_type: {kernel_type!r}")
    
    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return MultivariateNormal(mean_x, covar_x)


def train_gp_model(model, likelihood, train_x, train_y, n_iterations=100, lr=0.1):
    """
    Train GP model by maximizing marginal likelihood.
    """
    model.train()
    likelihood.train()
    
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)
    
    losses = []
    
    for i in range(n_iterations):
        optimizer.zero_grad()
        output = model(train_x)
        loss = -mll(output, train_y)
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
        
        if (i + 1) % 20 == 0:
            print(f'Iteration {i+1}/{n_iterations} - Loss: {loss.item():.4f}')
    
    return losses

print("GPyTorch model and training function defined!")

In [None]:
# Train GPyTorch model on synthetic data

# Convert to tensors
train_x = torch.tensor(X_train.flatten(), dtype=torch.float32)
train_y = torch.tensor(y_train, dtype=torch.float32)
test_x = torch.tensor(X_test.flatten(), dtype=torch.float32)

# Initialize model
likelihood = GaussianLikelihood()
model = ExactGPModel(train_x, train_y, likelihood, kernel_type='rbf')

# Train
print("Training GP model...")
losses = train_gp_model(model, likelihood, train_x, train_y, n_iterations=100)

# Print learned hyperparameters
print(f"\nLearned hyperparameters:")
print(f"  Noise: {likelihood.noise.item():.6f}")
print(f"  Output scale: {model.covar_module.outputscale.item():.4f}")
print(f"  Length scale: {model.covar_module.base_kernel.lengthscale.item():.4f}")

In [None]:
# Make predictions
model.eval()
likelihood.eval()

with torch.no_grad(), gpytorch.settings.fast_pred_var():
    # Get predictive distribution
    observed_pred = likelihood(model(test_x))
    
    # Mean and confidence intervals
    mean = observed_pred.mean.numpy()
    lower, upper = observed_pred.confidence_region()
    lower, upper = lower.numpy(), upper.numpy()

# Plot
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Predictions
ax = axes[0]
ax.fill_between(test_x.numpy(), lower, upper, alpha=0.3, color='blue', label='95% CI')
ax.plot(test_x.numpy(), mean, 'b-', lw=2, label='GP Mean')
ax.plot(X_test.flatten(), true_function(X_test), 'k--', alpha=0.7, label='True function')
ax.scatter(train_x.numpy(), train_y.numpy(), c='red', s=80, zorder=5, edgecolors='k', label='Training data')
ax.set_xlabel('x')
ax.set_ylabel('f(x)')
ax.set_title('GPyTorch Predictions')
ax.legend()

# Training loss
ax = axes[1]
ax.plot(losses, 'b-', lw=2)
ax.set_xlabel('Iteration')
ax.set_ylabel('Negative Log Marginal Likelihood')
ax.set_title('Training Loss')
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

---

## 6. Trading Applications <a name="6-trading-applications"></a>

### Application 1: Volatility Surface Modeling

GPs can model the implied volatility surface as a smooth function of strike (expressed below as moneyness) and maturity.
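
The two inputs of a volatility surface usually live on different scales, so a single shared length scale can be restrictive. One common option is an RBF kernel with a separate length scale per input dimension, often called ARD. It is not what the from-scratch `GaussianProcessRegressor` in this notebook implements. The helper `rbf_ard`, the quotes, and the length-scale values below are hypothetical and not fitted to anything.

```python
import numpy as np

def rbf_ard(X1, X2, length_scales, variance=1.0):
    """RBF kernel with one length scale per input dimension."""
    ls = np.asarray(length_scales, dtype=float)
    A = np.atleast_2d(X1) / ls  # rescale each column by its own length scale
    B = np.atleast_2d(X2) / ls
    sqdist = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return variance * np.exp(-0.5 * np.maximum(sqdist, 0.0))  # clamp round-off negatives

# Hypothetical quotes as (moneyness, maturity in years)
atm_short = np.array([[1.00, 0.25]])
otm_short = np.array([[1.10, 0.25]])   # 0.10 away in moneyness
atm_longer = np.array([[1.00, 0.35]])  # 0.10 away in maturity

length_scales = [0.1, 1.0]  # illustrative: short in moneyness, long in maturity

k_moneyness = rbf_ard(atm_short, otm_short, length_scales)[0, 0]
k_maturity = rbf_ard(atm_short, atm_longer, length_scales)[0, 0]

print(f"Correlation across 0.10 of moneyness: {k_moneyness:.4f}")  # exp(-0.5)   = 0.6065
print(f"Correlation across 0.10 of maturity:  {k_maturity:.4f}")   # exp(-0.005) = 0.9950
```

The same numerical step of 0.10 is a large move in moneyness but a small move in maturity, and the per-dimension length scales let the kernel express that.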

In [None]:
# Volatility Surface Modeling with GP

# Generate synthetic implied volatility data
np.random.seed(42)

def true_vol_surface(moneyness, maturity):
    """
    Synthetic volatility surface with smile and term structure.
    
    Vol = base_vol + smile_effect + term_effect + skew_effect
    """
    base_vol = 0.20
    smile = 0.05 * (moneyness - 1)**2  # Smile around ATM
    term = 0.02 * np.sqrt(maturity)  # Term structure
    skew = -0.03 * (moneyness - 1)  # Skew
    return base_vol + smile + term + skew

# Generate training data (market observations)
n_samples = 50
moneyness_train = np.random.uniform(0.8, 1.2, n_samples)
maturity_train = np.random.uniform(0.1, 2.0, n_samples)
X_vol_train = np.column_stack([moneyness_train, maturity_train])
y_vol_train = true_vol_surface(moneyness_train, maturity_train) + 0.005 * np.random.randn(n_samples)

# Create grid for prediction
moneyness_grid = np.linspace(0.8, 1.2, 30)
maturity_grid = np.linspace(0.1, 2.0, 30)
M, T = np.meshgrid(moneyness_grid, maturity_grid)
X_vol_test = np.column_stack([M.ravel(), T.ravel()])

# Fit GP
gp_vol = GaussianProcessRegressor(length_scale=0.3, variance=0.01, noise=0.0001)
gp_vol.fit(X_vol_train, y_vol_train, optimize=True)

# Predict
vol_mean, vol_std = gp_vol.predict(X_vol_test)
vol_mean = vol_mean.reshape(M.shape)
vol_std = vol_std.reshape(M.shape)
vol_true = true_vol_surface(M, T)

print(f"GP Vol Surface - Optimized parameters:")
print(f"  Length scale: {gp_vol.length_scale:.4f}")
print(f"  Variance: {gp_vol.variance:.6f}")
print(f"  Noise: {gp_vol.noise:.8f}")

In [None]:
# Visualize volatility surface
from mpl_toolkits.mplot3d import Axes3D

fig = plt.figure(figsize=(16, 5))

# True surface
ax1 = fig.add_subplot(131, projection='3d')
ax1.plot_surface(M, T, vol_true, cmap='viridis', alpha=0.7)
ax1.scatter(moneyness_train, maturity_train, y_vol_train, c='red', s=30)
ax1.set_xlabel('Moneyness')
ax1.set_ylabel('Maturity')
ax1.set_zlabel('IV')
ax1.set_title('True Vol Surface')

# GP prediction
ax2 = fig.add_subplot(132, projection='3d')
ax2.plot_surface(M, T, vol_mean, cmap='viridis', alpha=0.7)
ax2.scatter(moneyness_train, maturity_train, y_vol_train, c='red', s=30)
ax2.set_xlabel('Moneyness')
ax2.set_ylabel('Maturity')
ax2.set_zlabel('IV')
ax2.set_title('GP Predicted Surface')

# Uncertainty
ax3 = fig.add_subplot(133, projection='3d')
ax3.plot_surface(M, T, vol_std, cmap='Reds', alpha=0.7)
ax3.scatter(moneyness_train, maturity_train, np.zeros_like(y_vol_train), c='black', s=30)
ax3.set_xlabel('Moneyness')
ax3.set_ylabel('Maturity')
ax3.set_zlabel('Std Dev')
ax3.set_title('Prediction Uncertainty')

plt.tight_layout()
plt.show()

# Error analysis
rmse = np.sqrt(np.mean((vol_mean - vol_true)**2))
print(f"\nRMSE: {rmse*100:.2f} vol points")
print(f"Mean uncertainty: {vol_std.mean()*100:.2f} vol points")

### Application 2: Return Prediction with Uncertainty

In [None]:
# Generate synthetic price/return data
np.random.seed(42)

# Simulate a stock with mean-reverting features
n_days = 500
dates = pd.date_range(start='2023-01-01', periods=n_days, freq='B')

# Features: RSI-like, MA deviation, volatility
def generate_features_and_returns(n_days):
    # Generate correlated features
    rsi = 50 + 20 * np.sin(np.linspace(0, 8*np.pi, n_days)) + 10 * np.random.randn(n_days)
    rsi = np.clip(rsi, 0, 100)
    
    ma_dev = 0.02 * np.sin(np.linspace(0, 6*np.pi, n_days)) + 0.01 * np.random.randn(n_days)
    
    volatility = 0.15 + 0.05 * np.abs(np.sin(np.linspace(0, 4*np.pi, n_days))) + 0.02 * np.random.randn(n_days)
    volatility = np.clip(volatility, 0.05, 0.4)
    
    # Generate returns with feature dependency
    base_return = 0.0005  # Daily drift
    rsi_effect = -0.0002 * (rsi - 50)  # Mean reversion
    ma_effect = -0.01 * ma_dev  # MA reversion
    
    # Volatility is annualized; dividing by 16 (about sqrt(252)) converts it to a daily scale
    returns = base_return + rsi_effect + ma_effect + volatility * np.random.randn(n_days) / 16
    
    features = np.column_stack([rsi, ma_dev, volatility])
    return features, returns

features, returns = generate_features_and_returns(n_days)

# Create DataFrame
df = pd.DataFrame({
    'Date': dates,
    'RSI': features[:, 0],
    'MA_Dev': features[:, 1],
    'Volatility': features[:, 2],
    'Return': returns
})
df.set_index('Date', inplace=True)

print(df.head(10))
print(f"\nDataset shape: {df.shape}")

In [None]:
# Train GP for return prediction

# Prepare data
feature_cols = ['RSI', 'MA_Dev', 'Volatility']
X = df[feature_cols].values
y = df['Return'].values

# Normalize features using training-period statistics only (avoids look-ahead leakage)
from sklearn.preprocessing import StandardScaler
train_size = 400
scaler = StandardScaler()
scaler.fit(X[:train_size])
X_scaled = scaler.transform(X)

# Train/test split (chronological)
X_train, X_test = X_scaled[:train_size], X_scaled[train_size:]
y_train, y_test = y[:train_size], y[train_size:]

# Fit GP
gp_returns = GaussianProcessRegressor(length_scale=1.0, variance=0.001, noise=0.0001)
gp_returns.fit(X_train, y_train, optimize=True)

print(f"Optimized hyperparameters:")
print(f"  Length scale: {gp_returns.length_scale:.4f}")
print(f"  Variance: {gp_returns.variance:.6f}")
print(f"  Noise: {gp_returns.noise:.8f}")

# Predict
y_pred, y_std = gp_returns.predict(X_test)

In [None]:
# Visualize predictions
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

test_dates = df.index[train_size:]

# Predictions with uncertainty
ax = axes[0, 0]
ax.fill_between(range(len(y_test)), y_pred - 2*y_std, y_pred + 2*y_std, 
                alpha=0.3, color='blue', label='95% CI')
ax.plot(range(len(y_test)), y_pred, 'b-', lw=1.5, label='GP Prediction')
ax.plot(range(len(y_test)), y_test, 'r-', lw=1, alpha=0.7, label='Actual')
ax.set_xlabel('Day')
ax.set_ylabel('Return')
ax.set_title('GP Return Predictions with Uncertainty')
ax.legend()
ax.axhline(0, color='k', linestyle='--', alpha=0.3)

# Calibration plot
ax = axes[0, 1]
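# y_std is the std of the latent function; observed returns also carry the learned noise variance
y_std_obs = np.sqrt(y_std**2 + gp_returns.noise)
z_scores = (y_test - y_pred) / y_std_obs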
ax.hist(z_scores, bins=30, density=True, alpha=0.7, color='blue')
x_norm = np.linspace(-4, 4, 100)
ax.plot(x_norm, stats.norm.pdf(x_norm), 'r-', lw=2, label='N(0,1)')
ax.set_xlabel('Z-score')
ax.set_ylabel('Density')
ax.set_title('Uncertainty Calibration')
ax.legend()

# Prediction vs Actual
ax = axes[1, 0]
ax.scatter(y_test, y_pred, alpha=0.5, c=y_std, cmap='viridis')
ax.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
ax.set_xlabel('Actual Return')
ax.set_ylabel('Predicted Return')
ax.set_title('Predicted vs Actual (color = uncertainty)')
plt.colorbar(ax.collections[0], ax=ax, label='Std Dev')

# Uncertainty over time
ax = axes[1, 1]
ax.plot(range(len(y_test)), y_std, 'g-', lw=1.5)
ax.set_xlabel('Day')
ax.set_ylabel('Prediction Std Dev')
ax.set_title('Prediction Uncertainty Over Time')
ax.fill_between(range(len(y_test)), 0, y_std, alpha=0.3, color='green')

plt.tight_layout()
plt.show()

# Metrics
rmse = np.sqrt(np.mean((y_test - y_pred)**2))
corr = np.corrcoef(y_test, y_pred)[0, 1]
coverage = np.mean(np.abs(z_scores) < 2)  # Should be ~95%

print(f"\nPerformance Metrics:")
print(f"  RMSE: {rmse:.6f}")
print(f"  Correlation: {corr:.4f}")
print(f"  95% CI Coverage: {coverage:.1%}")

### Application 3: Position Sizing with GP Uncertainty
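
As a quick sanity check on the sizing rule used in the next cell, the sketch below evaluates the capped Kelly fraction $f^* = (\mu - r)/\sigma^2$ for two hypothetical daily forecasts. The helper name `kelly_fraction`, the cap of 2, and the input numbers are illustrative assumptions, not part of the notebook's pipeline.

```python
import numpy as np

def kelly_fraction(mu, sigma, risk_free=0.0, cap=2.0):
    """Capped Kelly fraction f* = (mu - r) / sigma^2 (hypothetical helper)."""
    # Floor the variance so a zero standard deviation cannot cause division by zero
    sigma2 = np.maximum(np.asarray(sigma, dtype=float) ** 2, 1e-12)
    raw = (np.asarray(mu, dtype=float) - risk_free) / sigma2
    return np.clip(raw, -cap, cap)

# Forecast 1: 0.05% expected return, 1% std -> raw Kelly of 5, capped to 2
print(kelly_fraction(0.0005, 0.01))

# Forecast 2: 0.01% expected return, 2% std -> roughly 0.25, below the cap
print(kelly_fraction(0.0001, 0.02))
```

Because the fraction scales with $1/\sigma^2$, doubling the predictive standard deviation cuts the uncapped position by a factor of four, which is why small daily variances hit the cap so easily.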

In [None]:
# Use GP uncertainty for position sizing

def kelly_position_size(predicted_return, std, risk_free=0):
    """
    Kelly criterion position sizing using GP predictions.
    
    f* = (mu - r) / sigma^2
    
    With uncertainty:
    - Conservative: use lower bound of return
    - Scale by confidence
    """
    # Conservative Kelly: use excess return minus 1 std.
    # Note: this lowers every forecast, so weak signals (|mu| < std) map to short positions.
    conservative_return = predicted_return - risk_free - std
    
    # Variance estimate from the GP predictive std (floored to avoid division by zero)
    variance = np.maximum(std ** 2, 1e-12)
    
    # Kelly fraction (capped at 2x long or short)
    kelly = np.clip(conservative_return / variance, -2, 2)
    
    # Further scale by confidence, which falls from 1 toward 0 as std grows
    confidence = 1 / (1 + std * 100)
    position = kelly * confidence
    
    return position

# Calculate positions
positions = kelly_position_size(y_pred, y_std)

# Simple strategy: trade based on GP signal
df_test = df.iloc[train_size:].copy()
df_test['Prediction'] = y_pred
df_test['Uncertainty'] = y_std
df_test['Position'] = positions
df_test['Strategy_Return'] = df_test['Position'].shift(1) * df_test['Return']
df_test['Cumulative_Return'] = (1 + df_test['Strategy_Return']).cumprod()
df_test['BuyHold_Return'] = (1 + df_test['Return']).cumprod()

# Compare strategies
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Cumulative returns
ax = axes[0, 0]
ax.plot(df_test.index, df_test['Cumulative_Return'], 'b-', lw=2, label='GP Strategy')
ax.plot(df_test.index, df_test['BuyHold_Return'], 'r--', lw=2, label='Buy & Hold')
ax.set_xlabel('Date')
ax.set_ylabel('Cumulative Return')
ax.set_title('Strategy Performance')
ax.legend()
ax.grid(True, alpha=0.3)

# Position sizes
ax = axes[0, 1]
ax.plot(df_test.index, df_test['Position'], 'g-', lw=1, alpha=0.7)
ax.axhline(0, color='k', linestyle='--', alpha=0.3)
ax.set_xlabel('Date')
ax.set_ylabel('Position Size')
ax.set_title('GP-Based Position Sizes')
ax.grid(True, alpha=0.3)

# Position vs Uncertainty
ax = axes[1, 0]
scatter = ax.scatter(df_test['Uncertainty'], np.abs(df_test['Position']), 
                     c=df_test['Prediction'], cmap='RdYlGn', alpha=0.6)
ax.set_xlabel('Prediction Uncertainty')
ax.set_ylabel('Absolute Position Size')
ax.set_title('Position Size vs Uncertainty')
plt.colorbar(scatter, ax=ax, label='Predicted Return')

# Return distribution
ax = axes[1, 1]
ax.hist(df_test['Strategy_Return'].dropna(), bins=40, alpha=0.6, color='blue', 
        label=f'GP Strategy (Sharpe: {df_test["Strategy_Return"].mean()/df_test["Strategy_Return"].std()*np.sqrt(252):.2f})')
ax.hist(df_test['Return'], bins=40, alpha=0.6, color='red', 
        label=f'Buy & Hold (Sharpe: {df_test["Return"].mean()/df_test["Return"].std()*np.sqrt(252):.2f})')
ax.set_xlabel('Return')
ax.set_ylabel('Frequency')
ax.set_title('Return Distribution')
ax.legend()

plt.tight_layout()
plt.show()

# Strategy stats
gp_sharpe = df_test['Strategy_Return'].mean() / df_test['Strategy_Return'].std() * np.sqrt(252)
bh_sharpe = df_test['Return'].mean() / df_test['Return'].std() * np.sqrt(252)
gp_final = df_test['Cumulative_Return'].iloc[-1]
bh_final = df_test['BuyHold_Return'].iloc[-1]

print(f"\nStrategy Comparison:")
print(f"  GP Strategy Sharpe: {gp_sharpe:.2f}")
print(f"  Buy & Hold Sharpe: {bh_sharpe:.2f}")
print(f"  GP Final Return: {(gp_final-1)*100:.1f}%")
print(f"  Buy & Hold Return: {(bh_final-1)*100:.1f}%")

---

## 7. Scalable GP Methods <a name="7-scalable-gp"></a>

### Challenge: Computational Complexity

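Exact GP regression costs $O(n^3)$ time and $O(n^2)$ memory, because it must factorize (or invert) the $n \times n$ kernel matrix. For large datasets, the following approximations reduce that cost:

| Method | Complexity | Description |
|--------|------------|-------------|
| **Sparse GP (Inducing Points)** | $O(nm^2)$ | Use $m \ll n$ inducing points |
| **KISS-GP** | $O(n + m \log m)$ | Kernel interpolation on a structured grid of $m$ points |
| **Variational GP** | $O(nm^2)$ | Variational inference over inducing points |
| **Local GP** | $O(k^3)$ per local model | Fit on local neighborhoods of $k$ points |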
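
For the inducing-point row of the table, the FITC predictive equations can be written out as follows. This is a sketch in assumed notation: $Z$ denotes the $m$ inducing inputs, $\sigma_n^2$ the noise variance, and $*$ a test input; the symbols may differ from those used earlier in the notebook.

$$Q_{nn} = K_{nm} K_{mm}^{-1} K_{mn}, \qquad \Lambda = \mathrm{diag}(K_{nn} - Q_{nn}) + \sigma_n^2 I$$

$$\Sigma = \left(K_{mm} + K_{mn} \Lambda^{-1} K_{nm}\right)^{-1}$$

$$\mu_* = K_{*m} \, \Sigma \, K_{mn} \Lambda^{-1} \mathbf{y}, \qquad \sigma_*^2 = k_{**} - K_{*m} K_{mm}^{-1} K_{m*} + K_{*m} \Sigma K_{m*}$$

Since $\Lambda$ is diagonal, the dominant cost is forming $K_{mn} \Lambda^{-1} K_{nm}$, which takes $O(nm^2)$ operations. Only $m \times m$ matrices are ever factorized.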

In [None]:
# Sparse GP Implementation using Inducing Points
from scipy.linalg import cholesky, cho_solve, solve_triangular
from sklearn.cluster import KMeans


class SparseGPRegressor:
    """
    Sparse GP using inducing points (FITC approximation).

    Approximates the full GP by using m inducing points:
    K ≈ Q + diag(K - Q) where Q = K_nm @ K_mm^{-1} @ K_mn
    """

    def __init__(self, n_inducing=50, length_scale=1.0, variance=1.0, noise=1e-4):
        self.n_inducing = n_inducing
        self.length_scale = length_scale
        self.variance = variance
        self.noise = noise
        self.inducing_points = None

    def _kernel(self, X1, X2):
        """RBF kernel."""
        X1, X2 = np.atleast_2d(X1), np.atleast_2d(X2)
        sqdist = np.sum(X1**2, 1).reshape(-1, 1) + np.sum(X2**2, 1) - 2 * X1 @ X2.T
        sqdist = np.maximum(sqdist, 0.0)  # Guard against small negative round-off
        return self.variance * np.exp(-0.5 * sqdist / self.length_scale**2)

    def fit(self, X, y):
        """
        Fit sparse GP.

        Select inducing points via k-means clustering.
        """
        self.X_train = np.atleast_2d(X)
        self.y_train = np.atleast_1d(y)
        n = self.X_train.shape[0]

        # Select inducing points via k-means
        n_inducing = min(self.n_inducing, n)
        kmeans = KMeans(n_clusters=n_inducing, random_state=42, n_init=10)
        kmeans.fit(self.X_train)
        self.inducing_points = kmeans.cluster_centers_

        # Compute kernel matrices
        K_mm = self._kernel(self.inducing_points, self.inducing_points)
        K_nm = self._kernel(self.X_train, self.inducing_points)
        K_nn_diag = np.full(n, self.variance)  # Diagonal of K_nn

        # FITC approximation with whitened cross-covariance V = L_m^{-1} @ K_mn,
        # so that Q = K_nm @ K_mm^{-1} @ K_mn = V.T @ V
        L_m = cholesky(K_mm + 1e-8 * np.eye(n_inducing), lower=True)
        V = solve_triangular(L_m, K_nm.T, lower=True)  # Shape (m, n)
        Q_nn_diag = np.sum(V**2, axis=0)  # Diagonal of Q

        # Lambda = diag(K_nn - Q_nn) + noise
        Lambda = np.maximum(K_nn_diag - Q_nn_diag, 1e-8) + self.noise

        # Woodbury identity: B = I + V @ Lambda^{-1} @ V.T is only m x m.
        # Broadcasting avoids building the n x n matrix diag(Lambda_inv).
        Lambda_inv = 1.0 / Lambda
        B = np.eye(n_inducing) + (V * Lambda_inv) @ V.T
        L_B = cholesky(B, lower=True)

        self.L_m = L_m
        self.L_B = L_B
        self.Lambda_inv = Lambda_inv
        self.K_nm = K_nm

        # Precompute for predictions: alpha = B^{-1} @ V @ Lambda^{-1} @ y
        self.alpha = cho_solve((L_B, True), V @ (Lambda_inv * self.y_train))

        return self

    def predict(self, X, return_std=True):
        """Make predictions using sparse GP."""
        X = np.atleast_2d(X)

        K_sm = self._kernel(X, self.inducing_points)
        V_s = solve_triangular(self.L_m, K_sm.T, lower=True)  # L_m^{-1} @ K_ms

        # Posterior mean: K_sm @ L_m^{-T} @ alpha
        mu = V_s.T @ self.alpha

        if return_std:
            # Posterior variance of the latent function: k_ss - Q_ss + V_s.T @ B^{-1} @ V_s
            V_s_B = cho_solve((self.L_B, True), V_s)
            var = self.variance - np.sum(V_s * V_s, axis=0) + np.sum(V_s * V_s_B, axis=0)
            var = np.maximum(var, 1e-10)
            return mu.flatten(), np.sqrt(var)

        return mu.flatten()

print("SparseGPRegressor defined!")

In [None]:
# Compare Full GP vs Sparse GP
import time

# Generate larger dataset
np.random.seed(42)
n_large = 1000

X_large = np.sort(np.random.uniform(-5, 5, n_large)).reshape(-1, 1)
y_large = np.sin(X_large).flatten() + 0.1 * np.random.randn(n_large)

X_test_large = np.linspace(-5, 5, 200).reshape(-1, 1)

# Full GP
print("Training Full GP...")
start = time.time()
gp_full = GaussianProcessRegressor(length_scale=1.0, variance=1.0, noise=0.01)
gp_full.fit(X_large, y_large, optimize=False)  # Skip optimization for fair comparison
mu_full, std_full = gp_full.predict(X_test_large)
time_full = time.time() - start
print(f"  Time: {time_full:.3f}s")

# Sparse GP with different numbers of inducing points
inducing_counts = [20, 50, 100]
results = []

for n_ind in inducing_counts:
    print(f"\nTraining Sparse GP (m={n_ind})...")
    start = time.time()
    gp_sparse = SparseGPRegressor(n_inducing=n_ind, length_scale=1.0, variance=1.0, noise=0.01)
    gp_sparse.fit(X_large, y_large)
    mu_sparse, std_sparse = gp_sparse.predict(X_test_large)
    time_sparse = time.time() - start
    
    rmse = np.sqrt(np.mean((mu_full - mu_sparse)**2))
    results.append((n_ind, time_sparse, rmse))
    print(f"  Time: {time_sparse:.3f}s, RMSE vs Full: {rmse:.6f}")

In [None]:
# Visualize Full vs Sparse comparison
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Full GP
ax = axes[0, 0]
ax.fill_between(X_test_large.flatten(), mu_full - 2*std_full, mu_full + 2*std_full, alpha=0.3)
ax.plot(X_test_large, mu_full, 'b-', lw=2, label='Full GP')
ax.scatter(X_large[::20], y_large[::20], c='red', s=20, alpha=0.5, label='Data (subset)')
ax.set_title(f'Full GP (n={n_large})')
ax.legend()

# Sparse GP with 50 inducing points
ax = axes[0, 1]
gp_sparse = SparseGPRegressor(n_inducing=50, length_scale=1.0, variance=1.0, noise=0.01)
gp_sparse.fit(X_large, y_large)
mu_sparse, std_sparse = gp_sparse.predict(X_test_large)

ax.fill_between(X_test_large.flatten(), mu_sparse - 2*std_sparse, mu_sparse + 2*std_sparse, alpha=0.3)
ax.plot(X_test_large, mu_sparse, 'b-', lw=2, label='Sparse GP (m=50)')
ax.scatter(gp_sparse.inducing_points, 
           gp_sparse.predict(gp_sparse.inducing_points, return_std=False),
           c='green', s=100, marker='^', label='Inducing points')
ax.set_title('Sparse GP (m=50)')
ax.legend()

# Time comparison
ax = axes[1, 0]
methods = ['Full'] + [f'Sparse (m={r[0]})' for r in results]
times = [time_full] + [r[1] for r in results]
bars = ax.bar(methods, times, color=['red'] + ['blue']*len(results))
ax.set_ylabel('Time (seconds)')
ax.set_title('Computational Time Comparison')
for bar, t in zip(bars, times):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01, f'{t:.3f}s', 
            ha='center', va='bottom')

# Accuracy comparison
ax = axes[1, 1]
ax.plot([r[0] for r in results], [r[2] for r in results], 'bo-', markersize=10, lw=2)
ax.set_xlabel('Number of Inducing Points')
ax.set_ylabel('RMSE vs Full GP')
ax.set_title('Approximation Accuracy')
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nKey Insight: with enough inducing points, the sparse GP closely tracks the full GP at O(n m^2) rather than O(n^3) cost.")

---

## 8. Practice Exercises <a name="8-exercises"></a>

### Exercise 1: Custom Financial Kernel

Implement a kernel that captures both **short-term mean reversion** and **long-term momentum**:

$$k(x, x') = k_{\text{short}}(x, x') + k_{\text{long}}(x, x')$$

where $k_{\text{short}}$ uses a small length scale (mean reversion) and $k_{\text{long}}$ uses a large length scale (momentum).
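
Before sampling from a custom kernel, it can help to confirm that its Gram matrix is symmetric and positive semi-definite, since a sum of valid kernels should stay valid. The sketch below is one way to run that check. `is_valid_gram` and `linear_kernel` are hypothetical names used only for illustration, and the tolerance is an assumption.

```python
import numpy as np

def is_valid_gram(K, tol=1e-8):
    """Return True if K is square, symmetric, and has eigenvalues >= -tol (scaled)."""
    K = np.asarray(K, dtype=float)
    if K.ndim != 2 or K.shape[0] != K.shape[1]:
        return False
    if not np.allclose(K, K.T, atol=tol):
        return False
    # Symmetrize before the eigendecomposition to suppress round-off asymmetry
    eigvals = np.linalg.eigvalsh((K + K.T) / 2)
    return bool(eigvals.min() >= -tol * max(1.0, abs(eigvals.max())))

def linear_kernel(X1, X2):
    """Stand-in kernel for the demo: k(x, x') = x . x'."""
    return X1 @ X2.T

X_demo = np.linspace(0, 10, 50).reshape(-1, 1)
K_demo = linear_kernel(X_demo, X_demo)

print(is_valid_gram(K_demo))               # valid kernel matrix
print(is_valid_gram(K_demo - np.eye(50)))  # subtracting I introduces negative eigenvalues
```

Running the same check on the output of your `dual_scale_kernel` is a cheap test before calling `np.random.multivariate_normal`.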

In [None]:
# Exercise 1: Implement dual-scale kernel

def dual_scale_kernel(X1, X2, 
                      short_length=0.5, short_variance=0.3,
                      long_length=5.0, long_variance=0.7):
    """
    Kernel combining short-term and long-term dynamics.
    
    TODO: Implement this kernel
    
    Hint: Sum two RBF kernels with different length scales
    """
    # Your code here
    pass

# Test your kernel
# X = np.linspace(0, 10, 200).reshape(-1, 1)
# K = dual_scale_kernel(X, X)
# samples = np.random.multivariate_normal(np.zeros(len(X)), K + 1e-8*np.eye(len(X)), size=5)
# plt.plot(X, samples.T)
# plt.title('Dual-Scale Kernel Samples')
# plt.show()

### Exercise 2: GP for Realized Volatility Prediction

Build a GP model to predict next-day realized volatility using:
- Past realized volatility (HAR-like features)
- VIX level
- Return features
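
One way to build the HAR-style inputs listed above is to average realized volatility over trailing daily, weekly (5-day), and monthly (22-day) windows, then pair each row with the following day's value. The sketch below assumes a 1-D array of realized volatility. `har_features` and the window lengths are illustrative choices; VIX and return features would be extra columns aligned the same way.

```python
import numpy as np

def har_features(rv, weekly=5, monthly=22):
    """Build (X, y): X[i] = [RV_d, RV_w, RV_m] known at day t, y[i] = RV at day t+1."""
    rv = np.asarray(rv, dtype=float)
    rows, targets = [], []
    # Start once a full monthly window exists; stop one day early to leave a target
    for t in range(monthly - 1, len(rv) - 1):
        rv_d = rv[t]
        rv_w = rv[t - weekly + 1 : t + 1].mean()
        rv_m = rv[t - monthly + 1 : t + 1].mean()
        rows.append([rv_d, rv_w, rv_m])
        targets.append(rv[t + 1])
    return np.array(rows), np.array(targets)

# Tiny demo on an arbitrary synthetic series
rng = np.random.default_rng(0)
rv_demo = 0.15 + 0.02 * rng.standard_normal(100)
X_har, y_har = har_features(rv_demo)
print(X_har.shape, y_har.shape)  # (78, 3) (78,)
```

Every feature in a row uses data up to and including day $t$ only, so the target at $t+1$ never leaks into the inputs.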

In [None]:
# Exercise 2: GP for Volatility Prediction

# Generate synthetic volatility data
np.random.seed(42)
n_days = 500

# HAR-like volatility dynamics
rv = np.zeros(n_days)
rv[0] = 0.15

for t in range(1, n_days):
    # HAR model: RV_t = c + b_d*RV_{t-1} + b_w*RV_weekly + b_m*RV_monthly + eps
    rv_d = rv[t-1]
    rv_w = np.mean(rv[max(0, t-5):t])   # Trailing weekly (5-day) average
    rv_m = np.mean(rv[max(0, t-22):t])  # Trailing monthly (22-day) average
    
    rv[t] = 0.02 + 0.4*rv_d + 0.3*rv_w + 0.2*rv_m + 0.02*np.random.randn()
    rv[t] = np.clip(rv[t], 0.05, 0.5)

# Create features
# TODO: Build HAR-like features and fit GP to predict next-day volatility

# Your code here
print("Implement GP for volatility prediction")

### Exercise 3: Bayesian Optimization for Hyperparameter Tuning

Use a GP as a surrogate model to optimize trading strategy hyperparameters (e.g., lookback window, entry threshold).
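
A common acquisition function for this kind of search is expected improvement (EI), which trades off a high predicted Sharpe ratio against high GP uncertainty. The sketch below assumes you already have the GP posterior mean and standard deviation at a set of candidate points. `expected_improvement` and the exploration margin `xi` are illustrative names, not part of any library used in this notebook.

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, best_so_far, xi=0.01):
    """EI for maximization: E[max(f - best_so_far - xi, 0)] with f ~ N(mu, sigma^2)."""
    mu = np.atleast_1d(np.asarray(mu, dtype=float))
    sigma = np.atleast_1d(np.asarray(sigma, dtype=float))
    improvement = mu - best_so_far - xi
    ei = np.zeros_like(mu)
    positive = sigma > 0
    z = improvement[positive] / sigma[positive]
    ei[positive] = improvement[positive] * norm.cdf(z) + sigma[positive] * norm.pdf(z)
    # With zero uncertainty, EI reduces to the plain non-negative improvement
    ei[~positive] = np.maximum(improvement[~positive], 0.0)
    return ei

# Three hypothetical candidates: low mean, high mean/low std, high mean/high std
mu_demo = np.array([0.5, 0.8, 0.8])
sigma_demo = np.array([0.3, 0.05, 0.3])
ei_demo = expected_improvement(mu_demo, sigma_demo, best_so_far=0.7)
print(np.argmax(ei_demo))  # index of the candidate to evaluate next
```

In a full loop you would evaluate `strategy_sharpe` at the chosen candidate, append the result to the training set, refit the GP, and repeat.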

In [None]:
# Exercise 3: Bayesian Optimization

def strategy_sharpe(lookback, threshold, returns):
    """
    Simple momentum strategy.
    
    Parameters:
    - lookback: rolling window for momentum signal
    - threshold: entry threshold
    - returns: asset returns
    
    Returns: Sharpe ratio
    """
    lookback = int(lookback)
    momentum = pd.Series(returns).rolling(lookback).mean()
    signal = np.where(momentum > threshold, 1, np.where(momentum < -threshold, -1, 0))
    strat_returns = np.array(signal[:-1]) * np.array(returns[1:])
    
    if np.std(strat_returns) == 0:
        return 0
    
    return np.mean(strat_returns) / np.std(strat_returns) * np.sqrt(252)

# Generate returns
np.random.seed(42)
returns = 0.0005 + 0.02 * np.random.randn(500)

# TODO: Use GP to find optimal (lookback, threshold) that maximizes Sharpe
# Hint: Sample some initial points, fit GP, use acquisition function

# Your code here
print("Implement Bayesian optimization for strategy tuning")

---

## Summary

### Key Takeaways

1. **Gaussian Processes** provide nonparametric regression with built-in uncertainty quantification

2. **Kernel choice** determines GP properties:
   - RBF: Smooth functions
   - Matérn: Adjustable roughness
   - Periodic: Seasonal patterns
   - Combinations: Complex dynamics

3. **Trading applications**:
   - Volatility surface modeling
   - Return prediction with uncertainty
   - Position sizing via Kelly criterion
   - Bayesian optimization for strategy tuning

4. **Scalability solutions**:
   - Sparse GP with inducing points
   - Variational inference
   - Local GP methods

### Financial Use Cases

| Application | Why GP? |
|-------------|----------|
| **Volatility surfaces** | Smooth interpolation, uncertainty |
| **Return prediction** | Uncertainty for risk management |
| **Strategy optimization** | Efficient hyperparameter search |
| **Regime detection** | Nonparametric flexibility |
| **Factor modeling** | Capturing nonlinear relationships |

### Next Steps

- Day 4: **Bayesian Neural Networks** - Combining deep learning with uncertainty
- Explore **Deep Kernel Learning** (GPyTorch)
- Apply to real market data with proper backtesting

In [None]:
print("Day 3 Complete: Gaussian Processes for Trading!")
print("\nKey concepts covered:")
print("  ✓ GP mathematical foundations")
print("  ✓ Kernel functions for finance")
print("  ✓ GP regression from scratch")
print("  ✓ GPyTorch implementation")
print("  ✓ Trading applications")
print("  ✓ Scalable GP methods")