# Chapter 93: Foundation Models for Time-Series

## **Learning Objectives**

By the end of this chapter, you will be able to:

- Understand what foundation models are and how they differ from traditional task‑specific models.
- Identify the emerging landscape of time‑series foundation models (Chronos, Lag‑Llama, Moirai, etc.).
- Explain the pre‑training strategies used for time‑series foundation models (e.g., contrastive learning, masked modeling).
- Apply fine‑tuning techniques to adapt a foundation model to a specific dataset like NEPSE.
- Leverage zero‑shot forecasting capabilities of foundation models for quick prototyping.
- Evaluate the strengths and limitations of foundation models compared to traditional approaches.
- Understand the computational requirements and practical considerations for deploying foundation models.
- Explore future directions and ongoing research in this rapidly evolving field.

---

## **93.1 Introduction to Foundation Models**

Foundation models are large‑scale machine learning models pre‑trained on vast amounts of data, which can then be adapted (fine‑tuned) for a wide range of downstream tasks. The term was coined by Stanford researchers in 2021 and is most closely associated with models such as GPT (for text) and DALL‑E (for images). These models capture general patterns and representations that often transfer to new tasks with little additional training.

In the time‑series domain, foundation models are an emerging paradigm. Instead of training a model from scratch for each new forecasting task (e.g., NEPSE stock prediction, retail sales, energy demand), we can leverage a model pre‑trained on a diverse collection of time series from many domains. This offers several potential benefits:

- **Reduced training time and data requirements**: Fine‑tuning a foundation model often requires less data and compute than training from scratch.
- **Improved performance on small datasets**: The model brings prior knowledge that helps generalise.
- **Zero‑shot forecasting**: In some cases, the model can make reasonable predictions on a new dataset without any fine‑tuning.
- **Unified architecture**: A single model can handle many different time‑series tasks (forecasting, classification, anomaly detection).

However, foundation models also come with challenges: they are computationally expensive to pre‑train, may require careful fine‑tuning, and can be opaque in their reasoning.

In this chapter, we will explore the current state of time‑series foundation models, how to use them for the NEPSE prediction task, and what the future might hold.

---

## **93.2 The Landscape of Time‑Series Foundation Models**

Several foundation models have recently been proposed for time series. Let's briefly review the most prominent ones.

### **93.2.1 Chronos (Amazon)**
Chronos is a family of pre‑trained time‑series forecasting models based on the T5 architecture (encoder‑decoder). It is trained on a large corpus of public time‑series data from various domains. Chronos tokenises time series by scaling and quantising values into a fixed vocabulary, then treats forecasting as a language modeling task.

**Key features**:
- Available in different sizes (tiny, mini, small, base, large).
- Supports probabilistic forecasting (generates multiple samples).
- Can be used zero‑shot or fine‑tuned.

### **93.2.2 Lag‑Llama (ServiceNow, Mila, Morgan Stanley, and collaborators)**
Lag‑Llama is a foundation model for univariate probabilistic forecasting. It is based on the LLaMA architecture (decoder‑only) and uses lagged features as inputs. It is pre‑trained on a large collection of time series and can produce predictions with uncertainty estimates.

**Key features**:
- Decoder‑only transformer.
- Uses lagged values as covariates.
- Strong zero‑shot performance on many benchmarks.

### **93.2.3 Moirai (Salesforce)**
Moirai is a multivariate time‑series foundation model that can handle multiple frequencies and missing values. It uses a transformer architecture with a novel "masked encoder" pre‑training objective.

**Key features**:
- Supports multivariate forecasting.
- Handles series of different frequencies and an arbitrary number of variates.
- Pre‑trained on LOTSA (Large‑scale Open Time Series Archive), a corpus of roughly 27 billion observations.

### **93.2.4 Others**
- **TimesNet**: A task‑specific model that has inspired foundation approaches.
- **UniTime**: A unified model for multiple time‑series tasks.
- **GPT‑4 for time series**: Researchers have explored prompting large language models with numerical data, though results are mixed.

For the NEPSE system, we will focus on Chronos and Lag‑Llama as they are publicly available and well‑documented.

---

## **93.3 Pre‑training Strategies**

Understanding how these models are pre‑trained helps in using them effectively.

### **93.3.1 Data Curation**
Foundation models are trained on massive, diverse collections of time series. For example, Chronos uses data from:

- Monash Time Series Forecasting Repository
- M4, M5 competitions
- UCR Time Series Classification Archive
- Synthetic data (generated from Gaussian processes, a method the Chronos paper calls KernelSynth)

The diversity ensures the model learns general temporal patterns: trends, seasonality, noise, etc.

### **93.3.2 Tokenization**
Time series are continuous, but transformers expect discrete tokens. Chronos addresses this by:

1. Scaling each time series (e.g., by its mean absolute value).
2. Quantising the scaled values into a fixed number of bins (e.g., 4096 bins).
3. Representing each observation as a token (the bin index).

This transforms forecasting into a next‑token prediction task, similar to language modeling.
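
The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration of mean scaling followed by uniform binning, not the Chronos tokenizer itself: the function names and the bin range of [-15, 15] are assumptions chosen for the example.

```python
import numpy as np

def mean_scale_quantize(series, n_bins=4096, low=-15.0, high=15.0):
    """Scale by mean absolute value, then map each value to a uniform bin index."""
    series = np.asarray(series, dtype=np.float64)
    scale = np.mean(np.abs(series))
    scale = scale if scale > 0 else 1.0           # guard against an all-zero series
    scaled = series / scale
    edges = np.linspace(low, high, n_bins + 1)    # n_bins uniform bins on [low, high]
    # Use interior edges only, so token ids fall in 0 .. n_bins - 1
    # (values outside [low, high] are clipped to the first or last bin)
    tokens = np.digitize(scaled, edges[1:-1])
    return tokens, scale, edges

def dequantize(tokens, scale, edges):
    """Map token ids back to bin centres and undo the scaling."""
    centers = (edges[:-1] + edges[1:]) / 2
    return centers[tokens] * scale

prices = np.array([2000.0, 2010.5, 1995.2, 2023.8, 2040.1])  # made-up values
tokens, scale, edges = mean_scale_quantize(prices)
recovered = dequantize(tokens, scale, edges)
```

The round trip is lossy: each recovered value can differ from the original by up to half a bin width times the scale, which is the price paid for a finite vocabulary.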

### **93.3.3 Masked Modeling (Moirai)**
Moirai uses a masked autoencoder approach: random patches of the time series are masked, and the model learns to reconstruct them. This forces the model to capture dependencies across time.
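
To make the idea of patch masking concrete, here is a generic sketch in NumPy. It is not Moirai's implementation: the patch length, mask ratio, zero placeholder, and squared‑error loss are all simplifying assumptions for illustration.

```python
import numpy as np

def mask_random_patches(series, patch_len=4, mask_ratio=0.25, seed=0):
    """Split a series into patches and hide a random subset of them."""
    series = np.asarray(series, dtype=np.float64)
    n_patches = len(series) // patch_len
    patches = series[: n_patches * patch_len].reshape(n_patches, patch_len)
    rng = np.random.default_rng(seed)
    n_masked = max(1, int(round(mask_ratio * n_patches)))
    masked_idx = rng.choice(n_patches, size=n_masked, replace=False)
    mask = np.zeros(n_patches, dtype=bool)
    mask[masked_idx] = True
    model_input = patches.copy()
    model_input[mask] = 0.0   # stands in for a learned [MASK] embedding
    return model_input, patches, mask

def reconstruction_loss(prediction, target, mask):
    """Mean squared error computed on the masked patches only."""
    return float(np.mean((prediction[mask] - target[mask]) ** 2))

series = np.sin(np.linspace(0, 6 * np.pi, 32))
model_input, target, mask = mask_random_patches(series)
```

During pre‑training, a model would receive `model_input` and be penalised only on the patches where `mask` is true, so it has to infer the hidden values from the visible context.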

### **93.3.4 Contrastive Learning**
Some models use contrastive objectives: positive pairs (e.g., different views of the same series) are pulled together, while negative pairs are pushed apart. This learns robust representations.
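
One common contrastive objective is InfoNCE. The sketch below assumes that two encoders (or two augmented views passed through one encoder) have already produced embeddings `z_a` and `z_b`, where row `i` of each comes from the same series. The function name and temperature value are illustrative choices.

```python
import numpy as np

def info_nce_loss(z_a, z_b, temperature=0.1):
    """InfoNCE: row i of z_a should match row i of z_b and no other row."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature                    # cosine similarities, (batch, batch)
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))            # positives sit on the diagonal

rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 16))
# Two nearly identical views of each series: low loss
aligned = info_nce_loss(emb, emb + 0.01 * rng.normal(size=emb.shape))
# Views paired with the wrong series: high loss
shuffled = info_nce_loss(emb, emb[::-1].copy())
```

The loss is small when each embedding is closest to its own positive and large when positives are mismatched, which is exactly the "pull together, push apart" behaviour described above.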

### **93.3.5 Next‑Step Prediction (Lag‑Llama)**
Lag‑Llama is trained to predict the next value given a context window of lagged values, using a causal transformer.
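
The lagged inputs can be pictured as a matrix in which each row holds earlier values of the series for one prediction target. The sketch below uses a hypothetical lag set `(1, 2, 5, 20)`; Lag‑Llama derives its own lag indices from the data frequency, so this only illustrates the construction.

```python
import numpy as np

def build_lag_matrix(series, lags=(1, 2, 5, 20)):
    """Return (X, y): each row of X holds the chosen lagged values for one target in y."""
    series = np.asarray(series, dtype=np.float64)
    max_lag = max(lags)
    y = series[max_lag:]                      # targets start once all lags are available
    X = np.column_stack(
        [series[max_lag - lag : len(series) - lag] for lag in lags]
    )
    return X, y

series = np.arange(30, dtype=np.float64)      # toy series: 0, 1, ..., 29
X, y = build_lag_matrix(series)
```

For the first target, `y[0] = 20`, the row `X[0]` contains the values 1, 2, 5, and 20 steps earlier.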

The result of pre‑training is a set of weights that capture general time‑series patterns. These weights can then be fine‑tuned on specific datasets like NEPSE.

---

## **93.4 Zero‑Shot Forecasting**

One of the most exciting capabilities of foundation models is zero‑shot forecasting: making predictions on a new dataset without any additional training.

### **93.4.1 When Zero‑Shot Works**
Zero‑shot works well when the new dataset's patterns are similar to those seen during pre‑training. For example, if the model has seen many retail sales time series, it may generalise to a new retail dataset. For NEPSE, if the pre‑training data included stock prices, zero‑shot might be reasonable.

### **93.4.2 Example: Zero‑Shot with Chronos**

```python
# pip install git+https://github.com/amazon-science/chronos-forecasting.git
import matplotlib.pyplot as plt
import numpy as np
import torch
from chronos import ChronosPipeline

# Load the pre-trained Chronos model (small version for demo)
pipeline = ChronosPipeline.from_pretrained(
    "amazon/chronos-t5-small",
    device_map="cuda" if torch.cuda.is_available() else "cpu",
    torch_dtype=torch.bfloat16,
)

# Prepare NEPSE data (a single time series)
# Assume we have a pandas Series 'close_prices' with daily close prices
context = torch.tensor(close_prices.values[-100:], dtype=torch.float32)  # last 100 days

# Draw 20 sample paths, each 24 trading days ahead
forecast_samples = pipeline.predict(
    context,  # 1-D tensor of past values, passed positionally
    prediction_length=24,
    num_samples=20,  # number of samples for probabilistic forecast
)

# forecast_samples shape: (batch, num_samples, prediction_length) = (1, 20, 24)
# Take the per-step median over the samples as the point forecast
samples = forecast_samples[0].numpy()
median_forecast = np.median(samples, axis=0)
low, high = np.quantile(samples, [0.1, 0.9], axis=0)  # 80% prediction interval

# Plot history, median forecast, and interval
history_x = range(-len(context), 0)
forecast_x = range(1, 25)
plt.plot(history_x, context.numpy(), label="History")
plt.plot(forecast_x, median_forecast, label="Chronos forecast (median)")
plt.fill_between(forecast_x, low, high, alpha=0.3, label="80% interval")
plt.legend()
plt.show()
```

**Explanation**:

- Chronos is loaded as a pipeline. The model expects a context window (the last N observations) and outputs samples of future values.
- The `predict` method returns multiple samples, allowing us to compute quantiles for probabilistic forecasting.
- In zero‑shot mode, we use the model as is, with no fine‑tuning. The quality of the forecast depends on how similar NEPSE is to the pre‑training data.
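
Turning sample paths into quantile bands takes only a couple of lines. The sketch below uses randomly generated paths as a stand‑in for model output, so it runs without Chronos installed; the helper name `summarize_samples` and the chosen quantile levels are illustrative.

```python
import numpy as np

def summarize_samples(samples, levels=(0.1, 0.5, 0.9)):
    """Collapse (num_samples, horizon) sample paths into per-step quantiles."""
    samples = np.asarray(samples, dtype=np.float64)
    q = np.quantile(samples, levels, axis=0)   # shape (len(levels), horizon)
    return {level: q[i] for i, level in enumerate(levels)}

# Stand-in for model output: 20 random-walk paths over a 24-step horizon
rng = np.random.default_rng(42)
fake_samples = 2000.0 + np.cumsum(rng.normal(0, 10, size=(20, 24)), axis=1)
bands = summarize_samples(fake_samples)
```

With only 20 samples the outer quantiles are noisy; drawing more samples gives smoother bands at the cost of longer inference.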

---

## **93.5 Fine‑Tuning for NEPSE**

If zero‑shot performance is insufficient, we can fine‑tune the foundation model on historical NEPSE data. Fine‑tuning adapts the model's weights to the specific characteristics of our dataset.

### **93.5.1 When to Fine‑Tune**
- You have enough historical data (at least a few hundred observations).
- The data distribution differs from the pre‑training distribution (e.g., unique patterns in NEPSE).
- You need higher accuracy than zero‑shot provides.

### **93.5.2 Fine‑Tuning Chronos**

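`ChronosPipeline` is an inference wrapper and does not expose a one‑line fine‑tuning API. Instead, the `chronos-forecasting` repository ships a training script (`scripts/training/train.py`) driven by a YAML config, and it expects training data in GluonTS Arrow format. The snippet below prepares NEPSE data in that format, following the conversion helper shown in the repository README. Check the README of the version you install, since the script's options may change.

```python
# pip install gluonts pyarrow
from pathlib import Path

import numpy as np
from gluonts.dataset.arrow import ArrowWriter

def convert_to_arrow(path, time_series, compression="lz4"):
    """Write a list of 1-D arrays to a GluonTS Arrow file."""
    # Placeholder start timestamp; replace with real start dates if you have them
    start = np.datetime64("2000-01-01 00:00", "s")
    dataset = [{"start": start, "target": ts} for ts in time_series]
    ArrowWriter(compression=compression).write_to_file(dataset, path=Path(path))

# For NEPSE, we might have multiple stocks, each as a separate time series.
# Assume 'df' has columns 'symbol' and 'close', sorted by date,
# and 'symbols' lists the tickers to include.
train_series = [
    df.loc[df["symbol"] == sym, "close"].to_numpy(dtype=np.float32)
    for sym in symbols
]
convert_to_arrow("./nepse-train.arrow", train_series)

# Next steps (outside this script), as described in the repository README:
# 1. Copy one of the example YAML configs shipped with the training script and
#    point its training data path at ./nepse-train.arrow. Context length,
#    prediction length, learning rate, and number of steps are set there.
# 2. Launch the script, starting from the pre-trained "amazon/chronos-t5-small"
#    weights rather than a random initialisation:
#        python scripts/training/train.py --config /path/to/your-config.yaml
# 3. Load the checkpoint directory the script writes with
#    ChronosPipeline.from_pretrained(...) and call predict() as in 93.4.2.
```

**Explanation**:

- The Arrow file holds one record per stock. The training script samples context/target windows from these records and handles tokenization.
- We train for a limited number of steps with a low learning rate (typical for fine‑tuning).
- After fine‑tuning, the saved checkpoint can be loaded with `ChronosPipeline.from_pretrained` and used for inference as before.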

### **93.5.3 Fine‑Tuning Lag‑Llama**

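Lag‑Llama is distributed as a GitHub repository plus a pre‑trained checkpoint (`lag-llama.ckpt`), and it is fine‑tuned through a GluonTS‑style estimator built on PyTorch Lightning. The sketch below follows the pattern of the project's demo notebooks; argument names may differ between versions, so treat it as a starting point rather than a reference.

```python
# Lag‑Llama fine‑tuning (simplified; run from a clone of the lag-llama repository)
import torch
from lag_llama.gluon.estimator import LagLlamaEstimator

# Read the architecture hyperparameters stored in the pre-trained checkpoint
ckpt = torch.load("lag-llama.ckpt", map_location="cpu", weights_only=False)
model_kwargs = ckpt["hyper_parameters"]["model_kwargs"]

estimator = LagLlamaEstimator(
    ckpt_path="lag-llama.ckpt",  # start from the pre-trained weights
    prediction_length=24,
    context_length=256,
    input_size=model_kwargs["input_size"],
    n_layer=model_kwargs["n_layer"],
    n_embd_per_head=model_kwargs["n_embd_per_head"],
    n_head=model_kwargs["n_head"],
    time_feat=model_kwargs["time_feat"],
    lr=5e-4,
    batch_size=64,
    trainer_kwargs={"max_epochs": 10},  # forwarded to the Lightning Trainer
)

# Prepare a GluonTS dataset for NEPSE (for example a PandasDataset with float32 targets)
train_dataset = ...

# Fine-tune; the returned GluonTS predictor is used for inference
predictor = estimator.train(train_dataset, cache_data=True, shuffle_buffer_length=1000)
```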

---

## **93.6 Evaluation and Comparison**

When using a foundation model, we should compare its performance against traditional approaches (e.g., XGBoost, ARIMA, LSTM trained from scratch).

### **93.6.1 Metrics**
Use the same metrics as before: MAE, RMSE, MAPE. For probabilistic forecasts, use CRPS or pinball loss.
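
For reference, both probabilistic metrics can be computed directly from quantiles or sample paths. The following is a minimal NumPy sketch; the function names are illustrative, and the CRPS uses the standard sample‑based estimate E|X − y| − ½·E|X − X′|.

```python
import numpy as np

def pinball_loss(y_true, y_quantile, q):
    """Average pinball (quantile) loss for quantile level q in (0, 1)."""
    diff = np.asarray(y_true, dtype=np.float64) - np.asarray(y_quantile, dtype=np.float64)
    return float(np.mean(np.maximum(q * diff, (q - 1) * diff)))

def crps_from_samples(y_true, samples):
    """Sample-based CRPS estimate, averaged over time steps.

    samples has shape (num_samples, horizon); y_true has shape (horizon,).
    """
    samples = np.asarray(samples, dtype=np.float64)
    y_true = np.asarray(y_true, dtype=np.float64)
    term1 = np.mean(np.abs(samples - y_true), axis=0)
    term2 = np.mean(np.abs(samples[:, None, :] - samples[None, :, :]), axis=(0, 1))
    return float(np.mean(term1 - 0.5 * term2))

# Under-predicting a high quantile is penalised more than over-predicting it
under = pinball_loss([10.0], [8.0], q=0.9)
over = pinball_loss([10.0], [12.0], q=0.9)

# With identical samples the forecast is a point forecast and CRPS reduces to MAE
point_like = crps_from_samples([2.0, 3.0, 4.0], np.full((5, 3), 2.0))
```

Lower is better for both metrics, and CRPS is in the same units as the target, which makes it directly comparable to MAE from point forecasters.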

### **93.6.2 Experimental Setup**
- **Zero‑shot**: Evaluate directly on test data (no training).
- **Fine‑tuned**: Train on training period, evaluate on test period.
- **Baseline**: Train XGBoost/LSTM on same training period.
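
Whichever models are compared, the split must be chronological so that no test observation precedes the training data. One way to get several test windows from a single series is rolling‑origin evaluation, sketched below with a naive last‑value forecaster as a sanity check. The helper names and the toy series are assumptions for illustration.

```python
import numpy as np

def rolling_origin_splits(n_obs, initial_train, horizon, step=None):
    """Yield (train_idx, test_idx) pairs where the test window always follows the training data."""
    step = step or horizon
    start = initial_train
    while start + horizon <= n_obs:
        yield np.arange(0, start), np.arange(start, start + horizon)
        start += step

def naive_forecast(train_values, horizon):
    """Repeat the last observed value: a sanity-check baseline."""
    return np.full(horizon, train_values[-1])

series = np.linspace(100.0, 199.0, 100)   # made-up, steadily rising series
fold_maes = []
for train_idx, test_idx in rolling_origin_splits(len(series), initial_train=60, horizon=10):
    pred = naive_forecast(series[train_idx], len(test_idx))
    fold_maes.append(float(np.mean(np.abs(series[test_idx] - pred))))
```

Any model that cannot beat the naive forecast across these folds is unlikely to justify its extra cost.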

**Example evaluation**:

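```python
import numpy as np
import torch
import xgboost as xgb
from sklearn.metrics import mean_absolute_error

def evaluate_model(model_fn, X_train, y_train, X_test, y_test):
    """Return the test MAE of model_fn(X_train, y_train, X_test) -> y_pred."""
    y_pred = model_fn(X_train, y_train, X_test)
    return mean_absolute_error(y_test, y_pred)

def make_chronos_fn(pipeline, num_samples=20):
    """Wrap a Chronos pipeline so it matches the shared model_fn signature."""
    def model_fn(X_train, y_train, X_test):
        # Chronos is univariate: it uses only the target history as context
        context = torch.tensor(np.asarray(y_train), dtype=torch.float32)
        samples = pipeline.predict(
            context, prediction_length=len(X_test), num_samples=num_samples
        )
        return np.median(samples[0].numpy(), axis=0)  # median over samples
    return model_fn

# XGBoost baseline
def xgboost_baseline(X_train, y_train, X_test):
    model = xgb.XGBRegressor()
    model.fit(X_train, y_train)
    return model.predict(X_test)

# Compare on the same chronological split.
# 'pipeline' is the zero-shot model from 93.4.2; 'pipeline_ft' is the fine-tuned
# checkpoint from 93.5.2, loaded with ChronosPipeline.from_pretrained.
# Keep the test window short (the Chronos-T5 models target horizons up to 64 steps),
# and make sure X_test holds no test-period values that Chronos does not also see.
split = (X_train, y_train, X_test, y_test)
mae_zs = evaluate_model(make_chronos_fn(pipeline), *split)
mae_ft = evaluate_model(make_chronos_fn(pipeline_ft), *split)
mae_xgb = evaluate_model(xgboost_baseline, *split)

print(f"Zero‑shot MAE: {mae_zs:.2f}")
print(f"Fine‑tuned MAE: {mae_ft:.2f}")
print(f"XGBoost MAE: {mae_xgb:.2f}")
```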

---

## **93.7 Computational Requirements and Practical Considerations**

### **93.7.1 Hardware**
Foundation models are large by time‑series standards. The Chronos‑T5 family ranges from about 8 million parameters (tiny) through roughly 46 million (small) and 200 million (base) to about 710 million (large). The smaller variants can run inference on a CPU, but a GPU makes inference much faster and is strongly recommended for fine‑tuning. For the NEPSE system, a single T4 GPU (available on many cloud platforms) is sufficient for the smaller variants.

### **93.7.2 Latency**
Inference with foundation models is slower than with a simple XGBoost model. For real‑time APIs, you may need to batch requests or use a smaller model. For daily batch predictions, latency is less critical.
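
Before choosing between batching and a smaller model, it helps to measure. The sketch below shows a small timing helper and a chunking helper; `dummy_model` is a placeholder for a real forecaster, and all names here are hypothetical.

```python
import time
import numpy as np

def median_latency(fn, *args, repeats=5):
    """Median wall-clock seconds over a few repeated calls."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        timings.append(time.perf_counter() - start)
    return float(np.median(timings))

def batched(items, batch_size):
    """Yield consecutive chunks of at most batch_size items."""
    for i in range(0, len(items), batch_size):
        yield items[i : i + batch_size]

def dummy_model(batch):
    """Placeholder for a real forecaster: returns each context's last value."""
    return [context[-1] for context in batch]

contexts = [np.arange(n, n + 50, dtype=float) for n in range(10)]
one_by_one = [dummy_model([c])[0] for c in contexts]
in_batches = [y for chunk in batched(contexts, 4) for y in dummy_model(chunk)]
latency = median_latency(dummy_model, contexts)
```

With a real model, swapping `dummy_model` for a function that calls the forecaster on a whole chunk lets you compare per‑request and batched latency on your own hardware.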

### **93.7.3 Cold Start**
Zero‑shot eliminates the need for training data, which is useful for new stocks with little history. However, the model may not capture stock‑specific nuances.

### **93.7.4 Interpretability**
Foundation models are black boxes. If interpretability is required (e.g., for regulatory reasons), they may not be suitable. However, techniques like attention visualisation can provide some insight.

### **93.7.5 Licensing**
Check the license of each foundation model. Some are for research use only; others are Apache 2.0. For a commercial NEPSE system, ensure compliance.

---

## **93.8 Limitations and Challenges**

Foundation models are not a silver bullet. Be aware of:

- **Domain shift**: If NEPSE behaves very differently from the pre‑training data (e.g., extreme volatility, circuit breakers), zero‑shot may fail.
- **Fine‑tuning data requirements**: Fine‑tuning still needs sufficient data; if you have very little, you may overfit.
- **Catastrophic forgetting**: Fine‑tuning can cause the model to forget general knowledge. Use a low learning rate and possibly freeze early layers.
- **Cost**: Pre‑training is prohibitively expensive for most organisations. We rely on publicly released models.
- **Evaluation complexity**: With multiple foundation models emerging, choosing the right one requires careful benchmarking.

---

## **93.9 Future Directions**

The field of time‑series foundation models is evolving rapidly. Expect to see:

- **Larger, more diverse pre‑training datasets**: Including more financial, economic, and climate data.
- **Multivariate foundation models**: Handling multiple correlated time series (e.g., all NEPSE stocks together).
- **Integration with LLMs**: Using language models to incorporate textual information (news, reports) alongside time series.
- **Efficient fine‑tuning techniques**: Like LoRA (Low‑Rank Adaptation) to adapt models with minimal parameters.
- **On‑device deployment**: Smaller, distilled versions for edge devices.
- **Standardised benchmarks**: To fairly compare foundation models.
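
The LoRA idea mentioned above can be illustrated without any deep learning framework. In the sketch below, a frozen weight matrix `W` is augmented with a low‑rank update `B @ A`; only `A` and `B` would be trained. The class name, rank, and scaling are assumptions for illustration, not taken from any particular library.

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / r) * B @ A."""

    def __init__(self, weight, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = weight.shape
        self.weight = weight                               # frozen pre-trained weight
        self.A = rng.normal(0.0, 0.01, size=(rank, d_in))  # trainable
        self.B = np.zeros((d_out, rank))                   # trainable, zero-initialised
        self.scale = alpha / rank

    def __call__(self, x):
        # Base projection plus the low-rank correction
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

    def trainable_fraction(self):
        """Share of parameters that would be updated during fine-tuning."""
        return (self.A.size + self.B.size) / self.weight.size

rng = np.random.default_rng(1)
W = rng.normal(size=(256, 256))
layer = LoRALinear(W)
x = rng.normal(size=(3, 256))
```

Because `B` starts at zero, the adapted layer initially reproduces the pre‑trained layer exactly, and with rank 4 only about 3% as many parameters as `W` are trainable.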

For the NEPSE system, keeping an eye on these developments will help you decide when to upgrade.

---

## **93.10 Best Practices**

1. **Start with zero‑shot**: It's quick and gives a baseline. If performance is acceptable, you may not need fine‑tuning.
2. **Benchmark against simple models**: Ensure the added complexity is justified.
3. **Use fine‑tuning judiciously**: Monitor for overfitting; use a validation set.
4. **Monitor for drift**: Even foundation models can suffer from concept drift. Retrain or fine‑tune periodically.
5. **Consider ensembling**: Combine a foundation model with a traditional model for robustness.
6. **Document everything**: Which model version, fine‑tuning data, and hyperparameters were used.
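
As an example of the ensembling practice, one simple scheme weights each model by the inverse of its validation MAE. The numbers and function names below are made up for illustration.

```python
import numpy as np

def inverse_mae_weights(val_maes):
    """Weights proportional to 1 / validation MAE, normalised to sum to 1."""
    inv = 1.0 / np.asarray(val_maes, dtype=np.float64)
    return inv / inv.sum()

def ensemble_forecast(forecasts, weights):
    """Weighted average of forecasts with shape (n_models, horizon)."""
    return np.asarray(weights) @ np.asarray(forecasts, dtype=np.float64)

# Made-up validation MAEs: foundation model = 10, XGBoost = 30
weights = inverse_mae_weights([10.0, 30.0])
# Made-up 3-step forecasts from the two models
combined = ensemble_forecast(
    [[2000.0, 2010.0, 2020.0],
     [2040.0, 2050.0, 2060.0]],
    weights,
)
```

The weights should be estimated on a validation period that is separate from the final test period; otherwise the ensemble's reported accuracy is optimistic.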

---

## **Chapter Summary**

In this chapter, we explored the emerging paradigm of foundation models for time‑series forecasting. We introduced several models (Chronos, Lag‑Llama, Moirai) and explained their pre‑training strategies. Using the NEPSE prediction system as an example, we demonstrated zero‑shot forecasting and fine‑tuning with Chronos. We discussed the computational requirements, limitations, and future directions.

Foundation models represent a significant shift in how we approach time‑series prediction. They offer the promise of general‑purpose forecasting models that can be adapted to new tasks with minimal effort. For the NEPSE system, they provide a powerful new tool in the forecasting toolbox, complementing the traditional approaches we've built throughout this handbook.

In the next chapter, we will explore **Large Language Models for Time‑Series**, diving deeper into how models like GPT can be used for numerical prediction and reasoning about time‑series data.

---

**End of Chapter 93**

