<a href="https://github.com/timeseriesAI/tsai-rs" target="_parent"><img src="https://img.shields.io/badge/tsai--rs-Time%20Series%20AI%20in%20Rust-blue" alt="tsai-rs"/></a>

# PatchTST: A New Transformer for Long-term Time Series Forecasting

This notebook shows how to configure **PatchTST** for time series classification and long-term multivariate forecasting with **tsai-rs**.

Based on:
* Nie, Y., Nguyen, N. H., Sinthong, P., & Kalagnanam, J. (2022). **A Time Series is Worth 64 Words: Long-term Forecasting with Transformers**. arXiv preprint arXiv:2211.14730.
* Presented at ICLR 2023

## What is PatchTST?

PatchTST is a Transformer architecture for time series forecasting that reported state-of-the-art long-term forecasting accuracy when it was published. It introduces two key innovations:

### 1. Patching
Instead of treating each time step as a token, PatchTST groups consecutive time steps into **patches**. This:
- Reduces the number of tokens (improving efficiency)
- Captures local semantic information
- Extends the receptive field of attention
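
The patching step can be sketched in a few lines of NumPy. This is an illustration only, not how tsai-rs implements it internally. `extract_patches` is a hypothetical helper, and the values 96/16/8 are example settings.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def extract_patches(series, patch_len, stride):
    """Split a 1-D series into (possibly overlapping) patches.

    Returns an array of shape (n_patches, patch_len). Trailing steps that do
    not fill a whole patch are dropped (no end padding).
    """
    series = np.asarray(series)
    # All windows of length patch_len, one starting at every time step ...
    windows = sliding_window_view(series, patch_len)
    # ... then keep only every `stride`-th window.
    return windows[::stride]


series = np.arange(96, dtype=np.float32)
patches = extract_patches(series, patch_len=16, stride=8)
print(patches.shape)  # (11, 16): 11 tokens instead of 96
```

Each row of `patches` becomes one input token, so the transformer attends over 11 positions rather than 96.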

### 2. Channel Independence
Each variable (channel) is treated as a separate univariate series and passed through the same Transformer backbone, so the weights are shared across channels. This:
- Reduces model complexity
- Improves generalization
- Allows the model to focus on temporal patterns
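
One way to picture channel independence is as a reshape: every channel of every sample becomes its own univariate row, and the same model is applied to each row. The sketch below shows only that reshaping in NumPy. `to_channel_independent` and `from_channel_independent` are hypothetical helpers, not tsai-rs functions, and the shapes are example values.

```python
import numpy as np


def to_channel_independent(x):
    """Reshape (batch, n_vars, seq_len) -> (batch * n_vars, seq_len)."""
    batch, n_vars, seq_len = x.shape
    return x.reshape(batch * n_vars, seq_len)


def from_channel_independent(y, n_vars):
    """Reshape (batch * n_vars, out_len) -> (batch, n_vars, out_len)."""
    return y.reshape(-1, n_vars, y.shape[-1])


x = np.random.default_rng(0).normal(size=(32, 7, 512))
flat = to_channel_independent(x)   # each row is one univariate series
print(flat.shape)                  # (224, 512)
restored = from_channel_independent(flat, n_vars=7)
print(restored.shape)              # (32, 7, 512)
```

Row `b * n_vars + v` of `flat` holds variable `v` of sample `b`, so the per-channel outputs can be regrouped by sample afterwards.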

## Install tsai-rs

```bash
cd crates/tsai_python
maturin develop --release
```

## Import Libraries

In [None]:
import tsai_rs
import numpy as np
import matplotlib.pyplot as plt

print(f"tsai-rs version: {tsai_rs.version()}")
tsai_rs.my_setup()

## Understanding Patching

The key insight of PatchTST is to divide the time series into patches, similar to how Vision Transformers divide images into patches.

In [None]:
# Visualize how patching works
seq_len = 96  # Input sequence length
patch_len = 16  # Each patch contains 16 time steps
stride = 8  # Step between patch starts; consecutive patches overlap by patch_len - stride = 8 steps

# Calculate number of patches
n_patches = (seq_len - patch_len) // stride + 1

print(f"Input sequence length: {seq_len}")
print(f"Patch length: {patch_len}")
print(f"Stride: {stride}")
print(f"Number of patches: {n_patches}")
print(f"\nThis reduces tokens from {seq_len} to {n_patches} (a {seq_len/n_patches:.1f}x reduction)")
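
The count above assumes no padding. The PatchTST paper pads the end of the sequence with `stride` copies of the last value before patching, which adds one more patch. With a 512-step look-back, patch length 16, and stride 8, that gives the 64 patches the paper's title alludes to. Whether tsai-rs applies this padding is not shown in this notebook, so treat the helper below as a sketch. `count_patches` and its `pad_end` flag are hypothetical names.

```python
def count_patches(seq_len, patch_len, stride, pad_end=False):
    """Number of patches produced from a sequence of length seq_len."""
    if patch_len > seq_len:
        raise ValueError("patch_len must not exceed seq_len")
    if stride < 1:
        raise ValueError("stride must be at least 1")
    if pad_end:
        # Padding with `stride` repeated values lengthens the sequence by `stride`.
        seq_len += stride
    return (seq_len - patch_len) // stride + 1


print(count_patches(96, 16, 8))                 # 11
print(count_patches(512, 16, 8, pad_end=True))  # 64
print(count_patches(336, 16, 8, pad_end=True))  # 42
```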

In [None]:
# Visualize patching
np.random.seed(42)
time_series = np.cumsum(np.random.randn(seq_len)) + 50

fig, axes = plt.subplots(2, 1, figsize=(14, 8))

# Original time series
axes[0].plot(time_series, 'b-', linewidth=1.5)
axes[0].set_title(f'Original Time Series ({seq_len} time steps)')
axes[0].set_xlabel('Time Step')

# Time series with patches highlighted
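# tab20 gives each of the 11 patches its own color (tab10 only has 10, so the last two would match)
colors = plt.cm.tab20(np.linspace(0, 1, n_patches))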
axes[1].plot(time_series, 'gray', alpha=0.3, linewidth=1)

for i in range(n_patches):
    start = i * stride
    end = start + patch_len
    patch_indices = np.arange(start, end)
    axes[1].plot(patch_indices, time_series[patch_indices], color=colors[i], linewidth=2, label=f'Patch {i+1}')
    axes[1].axvspan(start, end-1, alpha=0.1, color=colors[i])

axes[1].set_title(f'Patched Time Series ({n_patches} patches, each {patch_len} steps, stride {stride})')
axes[1].set_xlabel('Time Step')
axes[1].legend(loc='upper right', ncol=3, fontsize=8)

plt.tight_layout()
plt.show()

## PatchTST Configuration

In [None]:
# Load sample data for configuration
dsid = 'NATOPS'
X_train, y_train, X_test, y_test = tsai_rs.get_UCR_data(dsid, return_split=True)

n_vars = X_train.shape[1]
seq_len = X_train.shape[2]
n_classes = len(np.unique(y_train))

print(f"Dataset: {dsid}")
print(f"Variables: {n_vars}, Sequence length: {seq_len}, Classes: {n_classes}")

In [None]:
# PatchTST configuration for classification
patchtst_config = tsai_rs.PatchTSTConfig(
    n_vars=n_vars,
    seq_len=seq_len,
    n_classes=n_classes,
    patch_len=16,       # Length of each patch
    stride=8,           # Stride between patches
    d_model=128,        # Model dimension
    n_heads=4,          # Number of attention heads
    n_layers=3,         # Number of transformer layers
    d_ff=256,           # Feed-forward dimension
    dropout=0.1         # Dropout rate
)

print(f"PatchTST config: {patchtst_config}")

## PatchTST for Long-term Forecasting

PatchTST was originally designed for forecasting. Here's how to configure it for different forecast horizons.

In [None]:
# Forecasting configuration
context_len = 512   # Look-back window
forecast_len = 96   # Prediction horizon
n_forecast_vars = 7  # Number of variables to forecast

# PatchTST for forecasting
patchtst_forecast = tsai_rs.PatchTSTConfig(
    n_vars=n_forecast_vars,
    seq_len=context_len,
    n_classes=forecast_len,  # For forecasting, n_classes = forecast horizon
    patch_len=16,
    stride=8,
    d_model=256,
    n_heads=8,
    n_layers=6,
    d_ff=512,
    dropout=0.2
)

print(f"Forecasting config: {patchtst_forecast}")
print(f"\nContext: {context_len} steps -> Forecast: {forecast_len} steps")
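
A forecasting model with this configuration expects pairs of a 512-step context and a 96-step target. The sketch below shows one way to cut such pairs from a long multivariate series with NumPy. `make_forecast_windows` is a hypothetical helper, not part of tsai-rs, and the random series only stands in for real data.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def make_forecast_windows(series, context_len, forecast_len):
    """Build (context, target) pairs from a (n_vars, n_steps) array.

    Returns X with shape (n_windows, n_vars, context_len) and
    Y with shape (n_windows, n_vars, forecast_len).
    """
    series = np.asarray(series)
    total = context_len + forecast_len
    if series.shape[-1] < total:
        raise ValueError("series is shorter than context_len + forecast_len")
    # Shape (n_vars, n_windows, total) -> (n_windows, n_vars, total)
    windows = sliding_window_view(series, total, axis=-1).transpose(1, 0, 2)
    return windows[..., :context_len], windows[..., context_len:]


series = np.random.default_rng(0).normal(size=(7, 1000))
X, Y = make_forecast_windows(series, context_len=512, forecast_len=96)
print(X.shape, Y.shape)  # (393, 7, 512) (393, 7, 96)
```

Consecutive windows here are shifted by one step. For a train/test split, divide the series in time before windowing so that no target leaks into the training contexts.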

## Comparing Forecast Horizons

In [None]:
# Different forecast horizons as in the paper
horizons = [96, 192, 336, 720]

print(f"{'Horizon':<10} {'Description'}")
print("-" * 40)
for h in horizons:
    if h == 96:
        desc = "Short-term (4 days of hourly data)"
    elif h == 192:
        desc = "Medium-term (8 days of hourly data)"
    elif h == 336:
        desc = "Long-term (2 weeks of hourly data)"
    else:
        desc = "Very long-term (30 days of hourly data)"
    print(f"{h:<10} {desc}")
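
How much wall-clock time a horizon covers depends on the sampling interval of the data. The small helper below converts a horizon in steps to a duration. `horizon_duration` is a hypothetical name, and the hourly and 15-minute intervals are example choices.

```python
from datetime import timedelta


def horizon_duration(n_steps, step):
    """Wall-clock span covered by n_steps samples taken every `step`."""
    return n_steps * step


for h in [96, 192, 336, 720]:
    hourly = horizon_duration(h, timedelta(hours=1))
    quarter_hourly = horizon_duration(h, timedelta(minutes=15))
    print(f"{h:<5} hourly: {hourly.days} days | 15-min: {quarter_hourly}")
```

For example, 96 steps span 4 days of hourly data but only 1 day of 15-minute data.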

In [None]:
# Configure PatchTST for different horizons
configs = {}
for horizon in horizons:
    configs[horizon] = tsai_rs.PatchTSTConfig(
        n_vars=7,
        seq_len=512,
        n_classes=horizon,
        patch_len=16,
        stride=8,
        d_model=256,
        n_heads=8,
        n_layers=6
    )
    print(f"Horizon {horizon}: {configs[horizon]}")

## PatchTST vs Standard TST

Let's compare PatchTST with the standard Time Series Transformer.

In [None]:
# Load data for comparison
dsid = 'NATOPS'
X_train, y_train, X_test, y_test = tsai_rs.get_UCR_data(dsid, return_split=True)

n_vars = X_train.shape[1]
seq_len = X_train.shape[2]
n_classes = len(np.unique(y_train))

# Standard TST
tst_config = tsai_rs.TSTConfig(
    n_vars=n_vars,
    seq_len=seq_len,
    n_classes=n_classes,
    d_model=128,
    n_heads=4,
    n_layers=3
)

# PatchTST
patchtst_config = tsai_rs.PatchTSTConfig(
    n_vars=n_vars,
    seq_len=seq_len,
    n_classes=n_classes,
    patch_len=8,
    stride=4,
    d_model=128,
    n_heads=4,
    n_layers=3
)

print("Standard TST:")
print(f"  Tokens per sequence: {seq_len}")
print(f"  Config: {tst_config}")

n_patches = (seq_len - 8) // 4 + 1  # patch_len=8, stride=4, matching patchtst_config above
print(f"\nPatchTST:")
print(f"  Tokens per sequence: {n_patches}")
print(f"  Config: {patchtst_config}")
print(f"  Token reduction: {seq_len/n_patches:.1f}x")
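
Token count matters because self-attention compares every token with every other token, so the score matrix grows with the square of the sequence length. The sketch below makes that concrete. `attention_scores_per_head` is a hypothetical helper, and the values are examples rather than the shape of a specific dataset.

```python
def attention_scores_per_head(n_tokens):
    """Entries in one head's self-attention score matrix (n_tokens x n_tokens)."""
    return n_tokens * n_tokens


seq_len, patch_len, stride = 96, 8, 4  # example values, not a specific dataset
n_patches = (seq_len - patch_len) // stride + 1

per_step = attention_scores_per_head(seq_len)
per_patch = attention_scores_per_head(n_patches)
print(f"tokens: {seq_len} -> {n_patches}")
print(f"score entries: {per_step} -> {per_patch} ({per_step / per_patch:.1f}x fewer)")
```

Because PatchTST attends within each channel separately, the total cost per multivariate sample also scales with the number of variables.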

## Channel Independence

A key feature of PatchTST is that it processes each variable independently, which reduces complexity and improves generalization.

In [None]:
# Visualize channel independence concept
fig, axes = plt.subplots(2, 2, figsize=(14, 8))

# Create sample multivariate time series
np.random.seed(42)
n_channels = 4
length = 96

# Generate correlated channels
base = np.cumsum(np.random.randn(length))
channels = [base + i * 10 + np.random.randn(length) * 2 for i in range(n_channels)]

# Left: Traditional approach (all channels together)
axes[0, 0].set_title('Traditional: All Channels Together')
for i, ch in enumerate(channels):
    axes[0, 0].plot(ch, label=f'Channel {i+1}')
axes[0, 0].legend()
axes[0, 0].set_xlabel('Time')

# Remaining panels: PatchTST approach (each channel processed independently).
# Only three panels are left after the combined plot, so channel 4 is not drawn on its own.
colors = ['blue', 'orange', 'green', 'red']
independent_axes = [axes[0, 1], axes[1, 0], axes[1, 1]]
for i, ax in enumerate(independent_axes):
    ax.plot(channels[i], color=colors[i])
    ax.set_title(f'PatchTST: Channel {i+1} (Independent)')
    ax.set_xlabel('Time')

plt.suptitle('Channel Independence in PatchTST', fontsize=14)
plt.tight_layout()
plt.show()

## Complete PatchTST Pipeline

In [None]:
# Complete classification pipeline
def patchtst_classification_pipeline(dsid):
    """Complete PatchTST classification pipeline."""
    
    # 1. Load data
    print(f"Loading dataset: {dsid}")
    X_train, y_train, X_test, y_test = tsai_rs.get_UCR_data(dsid, return_split=True)
    
    # 2. Get dimensions
    n_vars = X_train.shape[1]
    seq_len = X_train.shape[2]
    n_classes = len(np.unique(y_train))
    
    print(f"  Shape: {X_train.shape}")
    print(f"  Variables: {n_vars}, Length: {seq_len}, Classes: {n_classes}")
    
    # 3. Standardize
    X_train_std = tsai_rs.ts_standardize(X_train.astype(np.float32), by_sample=True)
    X_test_std = tsai_rs.ts_standardize(X_test.astype(np.float32), by_sample=True)
    
    # 4. Configure PatchTST
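    # Choose patch_len based on sequence length; keep both values at least 1
    # so very short sequences do not produce a zero patch length or stride
    patch_len = max(1, min(16, seq_len // 4))
    stride = max(1, patch_len // 2)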
    
    config = tsai_rs.PatchTSTConfig(
        n_vars=n_vars,
        seq_len=seq_len,
        n_classes=n_classes,
        patch_len=patch_len,
        stride=stride,
        d_model=128,
        n_heads=4,
        n_layers=3,
        d_ff=256,
        dropout=0.1
    )
    print(f"  Config: {config}")
    
    # 5. Create datasets
    train_ds = tsai_rs.TSDataset(X_train_std, y_train)
    test_ds = tsai_rs.TSDataset(X_test_std, y_test)
    
    # 6. Configure training
    learner_config = tsai_rs.LearnerConfig(
        lr=1e-4,
        weight_decay=0.01,
        grad_clip=1.0
    )
    
    print(f"  Ready for training!")
    
    return config, train_ds, test_ds, learner_config

# Run pipeline
config, train_ds, test_ds, learner_config = patchtst_classification_pipeline('NATOPS')
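
Step 3 of the pipeline relies on `tsai_rs.ts_standardize`, whose exact semantics are not shown in this notebook. The sketch below illustrates one common meaning of per-sample standardization: zero mean and unit variance computed over all variables and time steps of each sample. `standardize_by_sample` and `eps` are hypothetical and may not match the library's behavior.

```python
import numpy as np


def standardize_by_sample(x, eps=1e-8):
    """Standardize a (n_samples, n_vars, seq_len) array one sample at a time."""
    x = np.asarray(x, dtype=np.float32)
    mean = x.mean(axis=(1, 2), keepdims=True)
    std = x.std(axis=(1, 2), keepdims=True)
    return (x - mean) / (std + eps)  # eps guards against constant samples


x = np.random.default_rng(0).normal(loc=5.0, scale=3.0, size=(4, 3, 50))
x_std = standardize_by_sample(x)
print(x_std.mean(axis=(1, 2)))  # close to 0 for every sample
print(x_std.std(axis=(1, 2)))   # close to 1 for every sample
```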

## Hyperparameter Recommendations

In [None]:
# Starting-point hyperparameters, loosely based on the settings used in the paper
print("PatchTST Hyperparameter Recommendations")
print("=" * 50)

recommendations = {
    'patch_len': '16 (default), adjust based on seq_len',
    'stride': 'patch_len // 2 (50% overlap)',
    'd_model': '128-512 depending on dataset size',
    'n_heads': '4-16 (must divide d_model)',
    'n_layers': '2-6 depending on complexity',
    'd_ff': '2-4x d_model',
    'dropout': '0.1-0.3',
    'learning_rate': '1e-4 to 1e-3',
    'batch_size': '32-128',
}

for param, rec in recommendations.items():
    print(f"{param:<15}: {rec}")
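
Several of these recommendations are simple constraints that can be checked before building a config. The helper below is a minimal sketch of such a check. `check_patchtst_hparams` is a hypothetical function, not part of tsai-rs, and the `d_ff` range is the rule of thumb from the table above, not a hard requirement.

```python
def check_patchtst_hparams(seq_len, patch_len, stride, d_model, n_heads, d_ff):
    """Return a list of problems found; an empty list means none were detected."""
    problems = []
    if not 1 <= patch_len <= seq_len:
        problems.append("patch_len must be between 1 and seq_len")
    if not 1 <= stride <= patch_len:
        problems.append("stride should be between 1 and patch_len")
    if d_model % n_heads != 0:
        problems.append("n_heads must divide d_model")
    if not 2 * d_model <= d_ff <= 4 * d_model:
        problems.append("d_ff is outside the suggested 2-4x d_model range")
    return problems


print(check_patchtst_hparams(96, 16, 8, d_model=128, n_heads=4, d_ff=256))  # []
print(check_patchtst_hparams(96, 16, 8, d_model=128, n_heads=5, d_ff=256))
```

The second call reports that `n_heads` does not divide `d_model`, which would otherwise only surface when the model is built.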

## When to Use PatchTST

| Scenario | Recommendation |
|----------|---------------|
| Long sequences (>100) | PatchTST (efficient) |
| Short sequences (<50) | Standard TST or CNN |
| Long-term forecasting | PatchTST |
| Many variables | PatchTST (channel independence) |
| Limited compute | PatchTST (fewer tokens) |

## Summary

This notebook demonstrated PatchTST for time series:

### Key Innovations
1. **Patching**: Reduces tokens and captures local patterns
2. **Channel Independence**: Improves efficiency and generalization

### Configuration
```python
config = tsai_rs.PatchTSTConfig(
    n_vars=n_vars,
    seq_len=seq_len,
    n_classes=n_classes,
    patch_len=16,
    stride=8,
    d_model=128,
    n_heads=4,
    n_layers=3
)
```

### Key Benefits
- State-of-the-art for long-term forecasting
- Efficient attention (fewer tokens)
- Better generalization through channel independence

In [None]:
# Quick reference
print("PatchTST Quick Reference")
print("=" * 50)
print("\n# Configuration")
print("config = tsai_rs.PatchTSTConfig(")
print("    n_vars=n_vars,")
print("    seq_len=seq_len,")
print("    n_classes=n_classes,")
print("    patch_len=16,     # Patch length")
print("    stride=8,         # Step between patch starts")
print("    d_model=128,      # Model dimension")
print("    n_heads=4,        # Attention heads")
print("    n_layers=3,       # Transformer layers")
print(")")

print("\n# For forecasting")
print("# n_classes = forecast_horizon")