<a href="https://github.com/timeseriesAI/tsai-rs" target="_parent"><img src="https://img.shields.io/badge/tsai--rs-Time%20Series%20AI%20in%20Rust-blue" alt="tsai-rs"/></a>

# Time Series Classification with Transformers in tsai-rs

This notebook shows how to configure Transformer architectures (TST and PatchTST) for time series classification with **tsai-rs**.

## TST (Time Series Transformer)

Based on:
* Zerveas, G., et al. (2020). **A Transformer-based Framework for Multivariate Time Series Representation Learning**.
* Vaswani, A., et al. (2017). **Attention is all you need**.

Transformers excel at capturing long-range dependencies in sequential data through self-attention mechanisms.
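To make the idea concrete, here is a minimal NumPy sketch of single-head scaled dot-product self-attention. It is an illustration only and does not reflect how tsai-rs implements attention internally; the sizes and the random projection matrices are assumptions chosen for the demo.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over one sequence.

    x: (n_steps, d_model); w_q, w_k, w_v: (d_model, d_head).
    Returns the attended values and the (n_steps, n_steps) attention weights.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])         # similarity between every pair of time steps
    scores -= scores.max(axis=-1, keepdims=True)     # stabilize the softmax numerically
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # each row sums to 1
    return weights @ v, weights

demo_rng = np.random.default_rng(0)
n_steps, dim_model, dim_head = 51, 128, 16           # illustrative sizes
x_demo = demo_rng.normal(size=(n_steps, dim_model))
w_q, w_k, w_v = (demo_rng.normal(size=(dim_model, dim_head)) / np.sqrt(dim_model) for _ in range(3))

attended, attn = self_attention(x_demo, w_q, w_k, w_v)
print(attended.shape, attn.shape)
```

Row `i` of `attn` holds the weights that time step `i` assigns to every other time step, so a distant step can contribute as directly as a neighboring one.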

## Install tsai-rs

```bash
cd crates/tsai_python
maturin develop --release
```

## Import Libraries

In [None]:
import tsai_rs
import numpy as np

print(f"tsai-rs version: {tsai_rs.version()}")
tsai_rs.my_setup()

## TST Configuration Options

Key hyperparameters for TST:

| Parameter | Description | Typical Values | Default |
|-----------|-------------|----------------|--------|
| `d_model` | Model dimension | 64-512 | 128 |
| `n_heads` | Attention heads | 4-16 | 8 |
| `n_layers` | Encoder layers | 2-8 | 3 |
| `d_ff` | Feed-forward dim | 128-2048 | 256 |
| `dropout` | Encoder dropout | 0.0-0.3 | 0.1 |
| `fc_dropout` | Classifier dropout | 0.0-0.8 | 0.0 |
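As a rough guide to how these hyperparameters interact, the sketch below estimates the per-head dimension and the parameter count of one encoder layer. It assumes a standard multi-head attention layer (which requires `d_model` to be divisible by `n_heads`) with biases and two normalization layers; the exact layer layout in tsai-rs may differ, and `encoder_layer_params` is a hypothetical helper, not part of the library.

```python
def encoder_layer_params(d_model, n_heads, d_ff):
    """Approximate trainable parameters in one standard transformer encoder layer."""
    if d_model % n_heads != 0:
        raise ValueError("d_model must be divisible by n_heads")
    attention = 4 * (d_model * d_model + d_model)       # Q, K, V and output projections, with biases
    feed_forward = 2 * d_model * d_ff + d_ff + d_model  # two linear layers, with biases
    norms = 2 * 2 * d_model                             # two normalization layers (scale and shift)
    return attention + feed_forward + norms

# Default values from the table above
dm, nh, nl, dff = 128, 8, 3, 256
head_dim = dm // nh
per_layer = encoder_layer_params(dm, nh, dff)
print(f"head_dim={head_dim}, params/layer={per_layer:,}, encoder total={nl * per_layer:,}")
```

Doubling `d_model` roughly quadruples the attention parameters, which is why `d_model` is usually the first knob to reduce on small datasets.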

## Load Data

In [None]:
# Load multivariate dataset
dsid = 'NATOPS'
X_train, y_train, X_test, y_test = tsai_rs.get_UCR_data(dsid, return_split=True)

# Get dimensions
n_vars = X_train.shape[1]
seq_len = X_train.shape[2]
n_classes = len(np.unique(y_train))

print(f"Dataset: {dsid}")
print(f"X_train shape: {X_train.shape}")
print(f"Variables: {n_vars}, Sequence length: {seq_len}, Classes: {n_classes}")
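Dataset labels are not always integers. Whether tsai-rs expects integer class indices is not stated in this notebook, so treat the following as an optional sketch: `np.unique(..., return_inverse=True)` yields both the class list and a zero-based index per sample. The label values below are made up for illustration.

```python
import numpy as np

# Hypothetical string labels standing in for y_train
raw_labels = np.array(["2.0", "5.0", "2.0", "1.0", "5.0"])

classes, label_idx = np.unique(raw_labels, return_inverse=True)
print(classes)      # sorted unique labels
print(label_idx)    # position of each label in `classes`
print(len(classes)) # same value as len(np.unique(raw_labels))
```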

In [None]:
# Standardize data (recommended for transformers)
X_train_std = tsai_rs.ts_standardize(X_train.astype(np.float32), by_sample=True)
X_test_std = tsai_rs.ts_standardize(X_test.astype(np.float32), by_sample=True)

print(f"Standardized mean: {X_train_std.mean():.6f}")
print(f"Standardized std: {X_train_std.std():.6f}")

## TST Configuration

In [None]:
# Basic TST configuration
tst_basic = tsai_rs.TSTConfig(
    n_vars=n_vars,
    seq_len=seq_len,
    n_classes=n_classes
)
print(f"Basic TST: {tst_basic}")

In [None]:
# Custom TST with specific hyperparameters
tst_custom = tsai_rs.TSTConfig(
    n_vars=n_vars,
    seq_len=seq_len,
    n_classes=n_classes,
    d_model=128,      # Model dimension
    n_heads=8,        # Number of attention heads
    n_layers=3,       # Number of encoder layers
    d_ff=256,         # Feed-forward dimension
    dropout=0.1,      # Encoder dropout
    fc_dropout=0.0    # Classifier dropout
)
print(f"Custom TST: {tst_custom}")

In [None]:
# TST with high dropout (for overfitting prevention)
tst_regularized = tsai_rs.TSTConfig(
    n_vars=n_vars,
    seq_len=seq_len,
    n_classes=n_classes,
    d_model=128,
    n_heads=8,
    n_layers=3,
    d_ff=256,
    dropout=0.3,      # Higher encoder dropout
    fc_dropout=0.8    # High classifier dropout
)
print(f"Regularized TST: {tst_regularized}")

## PatchTST Configuration

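PatchTST splits each time series into fixed-length patches (short subsequences) and feeds those patches, rather than individual time steps, to the transformer as tokens, much as Vision Transformer (ViT) does with image patches.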
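The patching step itself can be sketched in NumPy with a sliding window. This assumes patches are taken without padding, so any trailing time steps that do not fill a patch are dropped; tsai-rs may handle the tail differently. The array is synthetic.

```python
import numpy as np

def extract_patches(x, patch_len, stride):
    """Split the last axis of x into patches of length patch_len taken every `stride` steps."""
    windows = np.lib.stride_tricks.sliding_window_view(x, patch_len, axis=-1)
    return windows[..., ::stride, :]   # keep every `stride`-th window

# Synthetic series: 3 variables, 51 time steps
x_series = np.arange(3 * 51, dtype=np.float32).reshape(3, 51)

patches = extract_patches(x_series, patch_len=16, stride=8)
print(patches.shape)   # (n_vars, n_patches, patch_len)
```

Each patch then plays the role of one token, so the transformer sees `n_patches` tokens per variable instead of one token per time step.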

In [None]:
# Basic PatchTST configuration
patchtst_basic = tsai_rs.PatchTSTConfig(
    n_vars=n_vars,
    seq_len=seq_len,
    n_classes=n_classes,
    patch_len=8,      # Length of each patch
    stride=8          # Stride between patches (non-overlapping)
)
print(f"Basic PatchTST: {patchtst_basic}")

In [None]:
# PatchTST with overlapping patches
patchtst_overlap = tsai_rs.PatchTSTConfig(
    n_vars=n_vars,
    seq_len=seq_len,
    n_classes=n_classes,
    patch_len=16,
    stride=8,         # Overlapping patches
    d_model=128,
    n_heads=8,
    n_layers=3,
    d_ff=256,
    dropout=0.1
)
print(f"PatchTST (overlapping): {patchtst_overlap}")

In [None]:
# Large PatchTST configuration
patchtst_large = tsai_rs.PatchTSTConfig(
    n_vars=n_vars,
    seq_len=seq_len,
    n_classes=n_classes,
    patch_len=16,
    stride=8,
    d_model=256,
    n_heads=16,
    n_layers=4,
    d_ff=512,
    dropout=0.2
)
print(f"Large PatchTST: {patchtst_large}")
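Patching matters mainly because self-attention compares every token with every other token, so the attention matrix grows with the square of the token count. The sketch below compares the two tokenizations for an illustrative long series. It counts entries in a single attention matrix only (per head, per layer) and ignores how variables are combined, so read it as an order-of-magnitude comparison.

```python
def attention_entries(n_tokens):
    """Entries in one attention matrix: every token attends to every token."""
    return n_tokens * n_tokens

long_len, long_patch, long_stride = 512, 16, 8   # illustrative values
n_patch_tokens = (long_len - long_patch) // long_stride + 1

per_step = attention_entries(long_len)
per_patch = attention_entries(n_patch_tokens)
print(f"time-step tokens: {long_len} -> {per_step:,} attention entries")
print(f"patch tokens:     {n_patch_tokens} -> {per_patch:,} attention entries")
print(f"reduction: {per_step / per_patch:.1f}x")
```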

## Comparing Transformer Architectures

In [None]:
# Compare TST vs PatchTST
configs = {
    'TST (basic)': tsai_rs.TSTConfig(
        n_vars=n_vars, seq_len=seq_len, n_classes=n_classes
    ),
    'TST (custom)': tsai_rs.TSTConfig(
        n_vars=n_vars, seq_len=seq_len, n_classes=n_classes,
        d_model=128, n_heads=8, n_layers=3, dropout=0.1
    ),
    'TST (regularized)': tsai_rs.TSTConfig(
        n_vars=n_vars, seq_len=seq_len, n_classes=n_classes,
        d_model=128, n_heads=8, n_layers=3, dropout=0.3, fc_dropout=0.8
    ),
    'PatchTST (basic)': tsai_rs.PatchTSTConfig(
        n_vars=n_vars, seq_len=seq_len, n_classes=n_classes,
        patch_len=8, stride=8
    ),
    'PatchTST (overlap)': tsai_rs.PatchTSTConfig(
        n_vars=n_vars, seq_len=seq_len, n_classes=n_classes,
        patch_len=16, stride=8, d_model=128, n_heads=8
    ),
}

print(f"{'Configuration':<25} {'Details'}")
print("-" * 80)
for name, config in configs.items():
    print(f"{name:<25} {config}")

## Training Configuration

In [None]:
# Learner configuration for transformers
# Note: Transformers typically need lower learning rates
learner_config = tsai_rs.LearnerConfig(
    lr=1e-4,          # Lower lr for transformers
    weight_decay=0.01,
    grad_clip=1.0
)
print(f"Learner config: {learner_config}")
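The `grad_clip=1.0` setting presumably caps gradient magnitude to keep early transformer updates stable. Whether tsai-rs clips by global norm or by value is not stated here; the NumPy sketch below shows the common global-norm variant, with `clip_grad_norm` as a hypothetical helper.

```python
import numpy as np

def clip_grad_norm(grads, max_norm):
    """Scale a list of gradient arrays so their combined L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(float(np.sum(g * g)) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return [g * scale for g in grads], total_norm

demo_grads = [np.array([3.0, 0.0]), np.array([0.0, 4.0])]   # combined norm is 5
clipped, norm_before = clip_grad_norm(demo_grads, max_norm=1.0)
norm_after = np.sqrt(sum(float(np.sum(g * g)) for g in clipped))
print(f"norm before: {norm_before:.2f}, norm after: {norm_after:.2f}")
```

Clipping by norm rescales all gradients by the same factor, so the update direction is preserved and only its length changes.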

In [None]:
# One-cycle scheduler
n_epochs = 100
batch_size = 64
n_samples = X_train.shape[0]
steps_per_epoch = (n_samples + batch_size - 1) // batch_size
total_steps = n_epochs * steps_per_epoch

scheduler = tsai_rs.OneCycleLR.simple(max_lr=1e-4, total_steps=total_steps)

print("Training setup:")
print(f"  Epochs: {n_epochs}")
print(f"  Batch size: {batch_size}")
print(f"  Total steps: {total_steps}")
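For intuition about what a one-cycle schedule does, here is a pure-Python sketch with a cosine warm-up followed by a cosine decay. The parameters `pct_start`, `div_factor`, and `final_div_factor` and their defaults are assumptions borrowed from common one-cycle implementations; the exact shape produced by `tsai_rs.OneCycleLR.simple` may differ.

```python
import math

def one_cycle_lr(step, total_steps, max_lr, pct_start=0.25, div_factor=25.0, final_div_factor=1e4):
    """Learning rate at `step` for a cosine one-cycle schedule (illustrative defaults)."""
    start_lr = max_lr / div_factor
    end_lr = start_lr / final_div_factor
    warmup_steps = max(1, int(pct_start * total_steps))
    if step < warmup_steps:
        # Cosine ramp from start_lr up to max_lr
        t = step / warmup_steps
        return start_lr + (max_lr - start_lr) * (1 - math.cos(math.pi * t)) / 2
    # Cosine decay from max_lr down to end_lr
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps - 1)
    return end_lr + (max_lr - end_lr) * (1 + math.cos(math.pi * min(t, 1.0))) / 2

demo_total_steps = 300   # illustrative step count
schedule = [one_cycle_lr(s, demo_total_steps, max_lr=1e-4) for s in range(demo_total_steps)]
peak_step = schedule.index(max(schedule))
print(f"start={schedule[0]:.2e}, peak={max(schedule):.2e} at step {peak_step}, end={schedule[-1]:.2e}")
```

The warm-up phase is the part that matters most for transformers: starting well below `max_lr` avoids large, noisy updates before the attention weights have settled.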

## Tips for Using Transformers

### 1. Lower Learning Rate

Transformers typically train best with lower learning rates (1e-5 to 1e-4) than CNNs, which often work well at around 1e-3.

In [None]:
# Typical learning rates for different architectures
lr_configs = {
    'InceptionTimePlus': 1e-3,
    'ResNetPlus': 1e-3,
    'TST': 1e-4,
    'PatchTST': 1e-4,
    'RNNPlus': 1e-3,
}

print("Recommended learning rates:")
for arch, lr in lr_configs.items():
    print(f"  {arch}: {lr}")

### 2. Standardize Data by Variable

For transformers, standardizing each variable independently, using statistics computed across the dataset rather than per sample, often works better.

In [None]:
# Standardize by sample (each sample normalized)
X_std_sample = tsai_rs.ts_standardize(X_train.astype(np.float32), by_sample=True)
print(f"By sample - Sample 0 mean: {X_std_sample[0].mean():.6f}")

# Alternative: by_sample=False standardizes with dataset-level statistics
# instead of per-sample ones, which preserves relative differences across samples
X_std_global = tsai_rs.ts_standardize(X_train.astype(np.float32), by_sample=False)
print(f"By dataset - Global mean: {X_std_global.mean():.6f}")
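The difference between the two scopes is easiest to see in plain NumPy. The sketch below uses synthetic data and makes its own assumptions about which axes are reduced; it is not a description of what `tsai_rs.ts_standardize` does internally for either value of `by_sample`.

```python
import numpy as np

scope_rng = np.random.default_rng(0)
# Synthetic data: 10 samples, 3 variables on very different scales, 50 time steps
X_demo = scope_rng.normal(
    loc=[[0.0], [10.0], [-5.0]], scale=[[1.0], [5.0], [0.1]], size=(10, 3, 50)
)
eps = 1e-8

# Per sample: one mean/std per sample, computed over its variables and time steps
per_sample = (X_demo - X_demo.mean(axis=(1, 2), keepdims=True)) / (
    X_demo.std(axis=(1, 2), keepdims=True) + eps
)

# Per variable: one mean/std per variable, computed over all samples and time steps
per_var = (X_demo - X_demo.mean(axis=(0, 2), keepdims=True)) / (
    X_demo.std(axis=(0, 2), keepdims=True) + eps
)

print(per_sample.mean(axis=(1, 2)).round(6))  # ~0 for every sample
print(per_var.std(axis=(0, 2)).round(6))      # ~1 for every variable
```

With per-variable statistics, a low-variance channel ends up on the same scale as a high-variance one, so no single variable dominates the input projection.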

### 3. Use Dropout to Prevent Overfitting

Transformers overfit easily, especially on small datasets. If training loss keeps falling while validation loss rises, increase `dropout` and `fc_dropout`.

In [None]:
# Dropout configurations for different overfitting scenarios
dropout_configs = {
    'No overfitting': {'dropout': 0.1, 'fc_dropout': 0.0},
    'Mild overfitting': {'dropout': 0.2, 'fc_dropout': 0.3},
    'Moderate overfitting': {'dropout': 0.3, 'fc_dropout': 0.5},
    'Severe overfitting': {'dropout': 0.3, 'fc_dropout': 0.8},
}

print("Dropout configurations:")
for scenario, config in dropout_configs.items():
    print(f"  {scenario}: dropout={config['dropout']}, fc_dropout={config['fc_dropout']}")
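To see what these rates mean in practice, here is a NumPy sketch of inverted dropout, the usual formulation in deep learning frameworks. It is assumed, not confirmed, that tsai-rs uses the same formulation; `apply_dropout` is a hypothetical helper.

```python
import numpy as np

def apply_dropout(x, p, rng, training=True):
    """Inverted dropout: zero each element with probability p and rescale the rest."""
    if not training or p == 0.0:
        return x
    keep = rng.random(x.shape) >= p
    return x * keep / (1.0 - p)

drop_rng = np.random.default_rng(0)
activations = np.ones((1000, 128))
for p in (0.1, 0.3, 0.8):
    dropped = apply_dropout(activations, p, drop_rng)
    print(f"p={p}: zeroed={np.mean(dropped == 0):.2f}, mean activation={dropped.mean():.2f}")
```

At `fc_dropout=0.8` only about one in five features reaches the classifier on each training step, which is a strong regularizer and can slow convergence noticeably.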

### 4. Choose Patch Size Wisely (PatchTST)

In [None]:
# Calculate number of patches for different configurations
def calc_n_patches(seq_len, patch_len, stride):
    return (seq_len - patch_len) // stride + 1

patch_configs = [
    (8, 8),    # Non-overlapping small patches
    (16, 16),  # Non-overlapping large patches
    (16, 8),   # Overlapping patches
    (8, 4),    # Dense overlapping
]

print(f"For seq_len={seq_len}:")
print(f"{'Patch':<10} {'Stride':<10} {'# Patches':<15} {'Overlap'}")
print("-" * 50)
for patch_len, stride in patch_configs:
    n_patches = calc_n_patches(seq_len, patch_len, stride)
    overlap = 'No' if patch_len == stride else f'Yes ({patch_len - stride} pts)'
    print(f"{patch_len:<10} {stride:<10} {n_patches:<15} {overlap}")
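The patch-count formula uses floor division, so some trailing time steps may not fall inside any patch. The sketch below reports how many steps are left uncovered, assuming patches are taken without padding (tsai-rs may pad instead). `patch_coverage` is a hypothetical helper and the sequence lengths are illustrative.

```python
def patch_coverage(n_steps, patch_size, step):
    """Return (n_patches, n_uncovered) when patches are taken without padding."""
    if patch_size > n_steps:
        return 0, n_steps
    count = (n_steps - patch_size) // step + 1
    covered = (count - 1) * step + patch_size
    return count, n_steps - covered

for demo_len in (51, 64, 100):   # illustrative sequence lengths
    for patch_size, step in [(8, 8), (16, 8)]:
        count, left = patch_coverage(demo_len, patch_size, step)
        print(f"seq_len={demo_len}, patch={patch_size}, stride={step}: "
              f"{count} patches, {left} steps uncovered")
```

If the uncovered tail carries information, prefer a patch length and stride for which `(seq_len - patch_len)` is a multiple of `stride`.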

## Testing on Different Datasets

In [None]:
datasets = ['ECG200', 'GunPoint', 'FordA', 'NATOPS', 'Wafer']

print(f"{'Dataset':<15} {'Vars':<6} {'Len':<8} {'TST d_model':<12} {'PatchTST patch':<15}")
print("-" * 60)

for dsid in datasets:
    try:
        X_train, y_train, X_test, y_test = tsai_rs.get_UCR_data(dsid, return_split=True)
        n_vars = X_train.shape[1]
        seq_len = X_train.shape[2]
        n_classes = len(np.unique(y_train))
        
        # Suggested configurations
        d_model = 128 if seq_len > 50 else 64
        patch_len = min(16, seq_len // 4) if seq_len >= 16 else seq_len // 2
        
        print(f"{dsid:<15} {n_vars:<6} {seq_len:<8} {d_model:<12} {patch_len:<15}")
    except Exception as e:
        print(f"{dsid:<15} Error: {e}")

## Complete Example: TST Configuration

In [None]:
# Complete TST setup example
dsid = 'NATOPS'
X_train, y_train, X_test, y_test = tsai_rs.get_UCR_data(dsid, return_split=True)

# Get dimensions
n_vars = X_train.shape[1]
seq_len = X_train.shape[2]
n_classes = len(np.unique(y_train))

# Standardize
X_train_std = tsai_rs.ts_standardize(X_train.astype(np.float32), by_sample=True)
X_test_std = tsai_rs.ts_standardize(X_test.astype(np.float32), by_sample=True)

# Create dataset
train_ds = tsai_rs.TSDataset(X_train_std, y_train)
test_ds = tsai_rs.TSDataset(X_test_std, y_test)

# Configure TST
tst_config = tsai_rs.TSTConfig(
    n_vars=n_vars,
    seq_len=seq_len,
    n_classes=n_classes,
    d_model=128,
    n_heads=8,
    n_layers=3,
    d_ff=256,
    dropout=0.3,
    fc_dropout=0.8
)

# Configure training
learner_config = tsai_rs.LearnerConfig(
    lr=1e-4,
    weight_decay=0.01,
    grad_clip=1.0
)

print(f"Dataset: {dsid}")
print(f"TST config: {tst_config}")
print(f"Learner config: {learner_config}")
print("\nReady for training!")

## Complete Example: PatchTST Configuration

In [None]:
# Complete PatchTST setup example
dsid = 'NATOPS'
X_train, y_train, X_test, y_test = tsai_rs.get_UCR_data(dsid, return_split=True)

# Get dimensions
n_vars = X_train.shape[1]
seq_len = X_train.shape[2]
n_classes = len(np.unique(y_train))

# Configure PatchTST
patch_len, stride = 16, 8
patchtst_config = tsai_rs.PatchTSTConfig(
    n_vars=n_vars,
    seq_len=seq_len,
    n_classes=n_classes,
    patch_len=patch_len,
    stride=stride,
    d_model=128,
    n_heads=8,
    n_layers=3,
    d_ff=256,
    dropout=0.1
)

# Calculate number of patches from the same patch_len and stride
n_patches = (seq_len - patch_len) // stride + 1

print(f"Dataset: {dsid}")
print(f"Sequence length: {seq_len}")
print(f"Number of patches: {n_patches}")
print(f"PatchTST config: {patchtst_config}")
print("\nReady for training!")

## Summary

This notebook showed how to configure Transformer architectures in tsai-rs:

### TST (Time Series Transformer)
- Pure transformer for time series
- Each timestep is a token
- Good for capturing long-range dependencies

### PatchTST
- Splits the time series into patches before applying the transformer
- Reduces the number of tokens the transformer sees (better for long series)
- Inspired by Vision Transformer (ViT)

### Key Tips
1. Use lower learning rates (1e-4 to 1e-5)
2. Standardize data (by sample or by variable)
3. Use dropout to prevent overfitting
4. For PatchTST, choose patch size based on sequence length

### When to Use Transformers
- Long sequences with complex patterns
- When capturing long-range dependencies is important
- Multivariate time series with cross-variable relationships

In [None]:
# Quick reference
print("TST Configuration:")
print("  tsai_rs.TSTConfig(n_vars, seq_len, n_classes,")
print("      d_model=128, n_heads=8, n_layers=3,")
print("      d_ff=256, dropout=0.3, fc_dropout=0.8)")

print("\nPatchTST Configuration:")
print("  tsai_rs.PatchTSTConfig(n_vars, seq_len, n_classes,")
print("      patch_len=16, stride=8,")
print("      d_model=128, n_heads=8, n_layers=3)")

print("\nRecommended Learning Rate: 1e-4")