# Chapter 28: Transformer Models for Time‑Series

## **Learning Objectives**

By the end of this chapter, you will be able to:

- Understand the fundamental limitations of RNNs and CNNs that led to the development of Transformers
- Explain the core components of the Transformer architecture: self‑attention, multi‑head attention, and positional encoding
- Recognize why the attention mechanism is particularly suited for capturing long‑range dependencies in time‑series data
- Implement a simplified Transformer block for time‑series forecasting using TensorFlow/Keras
- Explore specialized Transformer variants (Informer, Autoformer, FEDformer) designed for long sequence forecasting
- Discuss pre‑training strategies and how they can be applied to time‑series
- Apply fine‑tuning techniques to adapt pre‑trained models to the NEPSE prediction task
- Address practical considerations such as computational complexity, overfitting, and data requirements
- Compare Transformer‑based models with RNNs, CNNs, and traditional statistical methods on the NEPSE dataset

---

## **28.1 Introduction to the Transformer Architecture**

The Transformer model, introduced by Vaswani et al. in 2017, revolutionized natural language processing (NLP) by replacing recurrent and convolutional layers with a mechanism called **self‑attention**. Unlike RNNs, which process sequences step‑by‑step, Transformers process all elements of a sequence in parallel. This makes training efficient on parallel hardware, and because every pair of positions is connected directly, long‑range dependencies do not have to be propagated through many recurrent steps, where gradients tend to vanish.

In time‑series forecasting, Transformers have gained popularity because they can model complex temporal patterns across many time steps. For the NEPSE prediction system, a Transformer could theoretically learn relationships between events far apart in time—for example, linking a price movement in January to a movement in June—without the need for carefully engineered lag features.

### **28.1.1 Why Attention for Time‑Series?**

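- **Long‑range dependencies:** Financial markets often exhibit dependencies that span weeks or months. RNNs must carry information through every intermediate step, so signals from distant days tend to fade; self‑attention connects any two time steps in the window directly.
- **Parallelization:** All time steps are processed at once, so training on sequences of hundreds of time steps is typically much faster on a GPU than with an RNN, which must iterate step by step.
- **Interpretability:** Attention weights can be visualized to see which past time points the model weights most heavily when making a prediction, although they are best read as a diagnostic aid rather than a complete explanation.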

However, Transformers come with challenges: they are data‑hungry, computationally intensive for very long sequences (quadratic complexity in sequence length), and prone to overfitting on small datasets. For a single NEPSE stock with a few thousand daily observations, we must apply strong regularization and potentially use pre‑trained models or transfer learning.

---

## **28.2 Attention Mechanism**

The core innovation of the Transformer is the **scaled dot‑product attention**. Given a query `Q`, a key `K`, and a value `V` (all matrices), the attention output is computed as:

`Attention(Q, K, V) = softmax( (Q Kᵀ) / √dₖ ) V`

Intuitively, the dot product between a query and all keys measures how much each value should contribute to the output. The scaling factor `√dₖ` prevents the dot products from growing too large, which would push the softmax into regions of extremely small gradients.

In self‑attention, the queries, keys, and values all come from the same input sequence—each element attends to all others. This allows the model to weigh the importance of other time steps when encoding a particular time step.
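The formula above can be sketched in a few lines of NumPy. This is a minimal illustration, not the Keras implementation: the helper names `softmax` and `scaled_dot_product_attention` are our own, the toy data is random, and the learned projections that normally produce `Q`, `K`, and `V` are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (L_q, d_k), K: (L_k, d_k), V: (L_k, d_v).
    Returns the output (L_q, d_v) and the attention weights (L_q, L_k)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # similarity of every query to every key
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))              # toy sequence: 5 time steps, 4 features

# Self-attention: queries, keys, and values all come from the same sequence
output, weights = scaled_dot_product_attention(X, X, X)
print(output.shape, weights.shape)
print(weights.sum(axis=1))
```

Each row of `weights` describes how strongly one time step attends to every time step in the window, which is the quantity usually plotted when attention is visualized.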

### **28.2.1 Self‑Attention for Time‑Series**

For a time series, each time step (e.g., a vector of features at day `t`) becomes a token. Self‑attention computes a weighted sum of all time steps for each time step, where the weights are determined by the similarity between the steps. This can capture patterns like: "today's return should be influenced by the returns of the past three days, but especially by the day that had a similar volatility spike."

---

## **28.3 Multi‑Head Attention**

Instead of performing a single attention function, the Transformer uses **multi‑head attention**: it linearly projects the queries, keys, and values `h` times with different learned projections, performs attention in parallel, concatenates the results, and projects again. Each head can focus on different types of relationships (e.g., one head might capture short‑term momentum, another might capture weekly seasonality).

Mathematically:

`MultiHead(Q, K, V) = Concat(head₁, …, head_h) Wᴼ`  
where `headᵢ = Attention(Q Wᵢ^Q, K Wᵢ^K, V Wᵢ^V)`
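As a rough sketch of these two equations, the NumPy code below splits one large projection into `num_heads` slices, runs attention per head, and concatenates the results. The function name and the random weight matrices are illustrative assumptions; in a real model the matrices are learned, and Keras handles all of this inside `layers.MultiHeadAttention`.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """X: (L, d_model); all weight matrices: (d_model, d_model).
    Slicing the columns of W_q into num_heads groups is equivalent to
    using a separate projection matrix per head."""
    L, d_model = X.shape
    d_head = d_model // num_heads

    def split(M):
        # (L, d_model) -> (num_heads, L, d_head)
        return M.reshape(L, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(X @ W_q), split(X @ W_k), split(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)    # (h, L, L)
    weights = softmax(scores, axis=-1)
    heads = weights @ V                                    # (h, L, d_head)
    concat = heads.transpose(1, 0, 2).reshape(L, d_model)  # Concat(head_1..head_h)
    return concat @ W_o, weights

rng = np.random.default_rng(0)
L, d_model, h = 20, 8, 4
X = rng.normal(size=(L, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out, weights = multi_head_self_attention(X, W_q, W_k, W_v, W_o, num_heads=h)
print(out.shape, weights.shape)
```

Each of the `h` slices of `weights` is a separate attention pattern over the same 20‑day window.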

---

## **28.4 Positional Encoding**

Since Transformers have no inherent notion of order (they process the sequence as a set), we must inject information about the position of each time step. The original Transformer adds **positional encodings** to the input embeddings. These encodings are usually sine and cosine functions of different frequencies, which allow the model to easily learn to attend by relative position.

For time‑series, we can also use learned positional embeddings or simply concatenate the time index (e.g., day number) as a feature. Sine‑cosine encodings have the advantage of being defined for any position, so they can in principle be applied to sequence lengths not seen during training, although accuracy at unseen lengths is not guaranteed.
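The relative‑position property can be checked numerically. For each frequency, the (sin, cos) pair at position `pos + k` is a fixed rotation of the pair at `pos`, and the rotation depends only on the offset `k`. The sketch below assumes an even `d_model`; the helper names `sinusoidal_encoding` and `shift_encoding` are our own.

```python
import numpy as np

def sinusoidal_encoding(length, d_model):
    """Sine on even feature indices, cosine on odd ones (d_model must be even)."""
    pos = np.arange(length)[:, None]
    i = np.arange(d_model // 2)[None, :]
    omega = 1.0 / np.power(10000.0, 2 * i / d_model)   # one frequency per pair
    pe = np.zeros((length, d_model))
    pe[:, 0::2] = np.sin(pos * omega)
    pe[:, 1::2] = np.cos(pos * omega)
    return pe, omega.ravel()

def shift_encoding(pe_row, omega, k):
    """Rotate each (sin, cos) pair by omega * k to get the encoding k steps later."""
    s, c = pe_row[0::2], pe_row[1::2]
    out = np.empty_like(pe_row)
    out[0::2] = s * np.cos(omega * k) + c * np.sin(omega * k)
    out[1::2] = c * np.cos(omega * k) - s * np.sin(omega * k)
    return out

pe, omega = sinusoidal_encoding(length=60, d_model=8)
shifted = shift_encoding(pe[10], omega, k=5)
print(np.allclose(shifted, pe[15]))
```

Because the same rotation maps day 10 to day 15 and day 40 to day 45, a linear layer can in principle express "five days earlier" independently of the absolute position.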

---

## **28.5 Encoder‑Decoder Architecture**

The original Transformer was designed for sequence‑to‑sequence tasks (e.g., machine translation). It consists of:

- **Encoder:** Processes the input sequence and produces a sequence of hidden representations.
- **Decoder:** Generates the output sequence autoregressively, attending to the encoder's outputs and its own previous outputs.

For time‑series forecasting, we often need only the encoder for tasks like predicting the next value from a window (many‑to‑one). For multi‑step forecasting, we can use an encoder‑decoder where the decoder produces a sequence of future values.
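The difference between the two setups shows up first in how the training targets are built. The sketch below uses a hypothetical helper, `create_multistep_sequences`, that pairs each input window with the next `horizon` values of the first feature column; setting `horizon=1` recovers the many‑to‑one case.

```python
import numpy as np

def create_multistep_sequences(data, seq_length, horizon):
    """data: (T, n_features). The target is column 0 over the next `horizon` steps."""
    X, Y = [], []
    for i in range(len(data) - seq_length - horizon + 1):
        X.append(data[i:i + seq_length, :])
        Y.append(data[i + seq_length:i + seq_length + horizon, 0])
    return np.array(X), np.array(Y)

# Toy series 0, 1, ..., 29 so the alignment is easy to verify by eye
toy = np.arange(30, dtype=float).reshape(-1, 1)
X_ms, Y_ms = create_multistep_sequences(toy, seq_length=20, horizon=5)
print(X_ms.shape, Y_ms.shape)
print(Y_ms[0])   # the five values that follow the first window
```

An encoder‑only model can also predict all `horizon` values at once with a `Dense(horizon)` output layer; a decoder becomes useful mainly when the forecast steps should condition on each other.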

---

## **28.6 Building a Simple Transformer for Time‑Series Forecasting**

We will implement a simplified Transformer encoder for the NEPSE return prediction task (predict next day's return from a window of past returns). We'll use TensorFlow/Keras, building custom layers for multi‑head attention and feed‑forward networks.

### **28.6.1 Data Preparation (Same as Chapter 26)**

We'll reuse the sequence data from Chapter 26: 20‑day windows of returns, scaled.

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Load and prepare NEPSE data (single symbol)
df = pd.read_csv('nepse_data.csv')
df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values(['Symbol', 'Date']).reset_index(drop=True)
symbol = df['Symbol'].unique()[0]
df_stock = df[df['Symbol'] == symbol].copy()

# Compute returns
df_stock['Return'] = df_stock['Close'].pct_change() * 100

# Use only returns as feature for simplicity
feature_columns = ['Return']
df_stock = df_stock.dropna(subset=feature_columns)

# Function to create sequences
def create_sequences(data, seq_length):
    X, y = [], []
    for i in range(len(data) - seq_length):
        X.append(data[i:i+seq_length, :])
        y.append(data[i+seq_length, 0])
    return np.array(X), np.array(y)

seq_length = 20
data = df_stock[feature_columns].values
X, y = create_sequences(data, seq_length)

# Temporal split
split_idx = int(len(X) * 0.8)
X_train, X_test = X[:split_idx], X[split_idx:]
y_train, y_test = y[:split_idx], y[split_idx:]

# Scale features
original_shape = X_train.shape
X_train_2d = X_train.reshape(-1, X_train.shape[-1])
scaler = StandardScaler()
X_train_scaled_2d = scaler.fit_transform(X_train_2d)
X_train_scaled = X_train_scaled_2d.reshape(original_shape)

X_test_2d = X_test.reshape(-1, X_test.shape[-1])
X_test_scaled_2d = scaler.transform(X_test_2d)
X_test_scaled = X_test_scaled_2d.reshape(X_test.shape)

print(f"X_train_scaled shape: {X_train_scaled.shape}")
```

### **28.6.2 Defining the Transformer Encoder Block**

We'll implement a single encoder block with multi‑head self‑attention and a feed‑forward network, followed by layer normalization and residual connections.

```python
def transformer_encoder(inputs, head_size, num_heads, ff_dim, dropout=0):
    """
    A single transformer encoder block.
    Args:
        inputs: (batch, time, features)
        head_size: dimension of each attention head
        num_heads: number of attention heads
        ff_dim: hidden units in feed-forward network
        dropout: dropout rate
    Returns:
        (batch, time, features) after self-attention and FFN
    """
    # Multi-head self-attention
    attention = layers.MultiHeadAttention(
        key_dim=head_size, num_heads=num_heads, dropout=dropout
    )(inputs, inputs)
    attention = layers.Dropout(dropout)(attention)
    attention = layers.LayerNormalization(epsilon=1e-6)(inputs + attention)  # residual + norm

    # Feed-forward network
    ff = layers.Conv1D(filters=ff_dim, kernel_size=1, activation='relu')(attention)
    ff = layers.Dropout(dropout)(ff)
    ff = layers.Conv1D(filters=inputs.shape[-1], kernel_size=1)(ff)  # project back to original dim
    ff = layers.Dropout(dropout)(ff)
    outputs = layers.LayerNormalization(epsilon=1e-6)(attention + ff)  # second residual + norm
    return outputs
```

**Explanation:**

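- `layers.MultiHeadAttention` is a built‑in Keras layer that computes scaled dot‑product attention with multiple heads. We pass the same sequence as query and value (the key defaults to the value), which makes it self‑attention.
- A residual connection followed by layer normalization is applied after each sub‑layer. This is the post‑norm arrangement of the original Transformer; pre‑norm variants, common in more recent models, normalize before the sub‑layer instead.
- The feed‑forward network uses two 1D convolutions with kernel size 1 (equivalent to position‑wise dense layers). The first expands the dimension, the second projects back to the original feature size.
- Dropout is applied for regularization.
- Because `LayerNormalization` normalizes across the feature axis, the block expects inputs with more than one feature. With a single feature, the normalized value is always zero and the block's output no longer depends on its input, so a one‑feature series should first be projected to a wider dimension (for example with a `Dense` layer).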

### **28.6.3 Building the Complete Model**

We'll stack several encoder blocks and then add a global pooling layer and a dense output.

```python
def build_transformer_model(input_shape, head_size, num_heads, ff_dim, num_blocks,
                            dropout=0, d_model=32):
    inputs = keras.Input(shape=input_shape)
    # Project each time step to d_model features before the encoder blocks.
    # With a single input feature, LayerNormalization over the feature axis
    # would output a constant and the model could not learn from its input.
    x = layers.Dense(d_model)(inputs)
    for _ in range(num_blocks):
        x = transformer_encoder(x, head_size, num_heads, ff_dim, dropout)

    # We need a single prediction per sample. Options:
    # - Global average pooling over time
    # - Taking the last time step (every position has attended to the whole
    #   window, so both are valid)
    # Here we use global average pooling.
    x = layers.GlobalAveragePooling1D()(x)
    x = layers.Dropout(dropout)(x)
    outputs = layers.Dense(1)(x)

    model = keras.Model(inputs, outputs)
    return model

input_shape = (X_train_scaled.shape[1], X_train_scaled.shape[2])
model = build_transformer_model(
    input_shape=input_shape,
    head_size=64,
    num_heads=4,
    ff_dim=128,
    num_blocks=2,
    dropout=0.2
)

model.compile(optimizer='adam', loss='mse', metrics=['mae'])
model.summary()
```

**Explanation:**

- We stack two transformer encoder blocks (`num_blocks=2`). Each block has 4 attention heads, each of dimension 64.
- After the encoder, we apply global average pooling over the time dimension to obtain a fixed‑size vector, then a final dense layer.
- Dropout is applied inside each block and after pooling to reduce overfitting.
- The total number of parameters is relatively modest, suitable for our dataset size.

### **28.6.4 Training and Evaluation**

```python
# Validation split (temporal)
val_size = int(len(X_train_scaled) * 0.1)
X_val = X_train_scaled[-val_size:]
y_val = y_train[-val_size:]
X_train_final = X_train_scaled[:-val_size]
y_train_final = y_train[:-val_size]

# Callbacks
early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)
reduce_lr = keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=5, min_lr=1e-6)

history = model.fit(
    X_train_final, y_train_final,
    validation_data=(X_val, y_val),
    epochs=100,
    batch_size=32,
    callbacks=[early_stop, reduce_lr],
    verbose=1
)

# Test evaluation
y_pred = model.predict(X_test_scaled).flatten()
rmse_transformer = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"Transformer Test RMSE: {rmse_transformer:.4f}")
```

**Explanation:**

- We train with early stopping and learning rate reduction on plateau.
- The test RMSE can be compared with LSTM and CNN models from previous chapters.
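Before comparing against other networks, it helps to know what a trivial forecast scores on the same test set. The sketch below is our own addition, with hypothetical helper names and toy numbers; it computes the RMSE of always predicting a zero return and of always predicting the mean training return. In the notebook, the same function could be called with `y_train` and `y_test`.

```python
import numpy as np

def rmse(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def naive_baselines(y_train, y_test):
    """RMSE of two trivial forecasts, as a lower bar for any learned model."""
    y_test = np.asarray(y_test, dtype=float)
    zero = rmse(y_test, np.zeros(len(y_test)))
    mean = rmse(y_test, np.full(len(y_test), np.mean(y_train)))
    return {"zero_return": zero, "train_mean": mean}

# Toy numbers for illustration only (daily returns in percent)
y_train_demo = np.array([0.5, -0.2, 0.1, 0.4])
y_test_demo = np.array([0.3, -0.1, 0.2])
print(naive_baselines(y_train_demo, y_test_demo))
```

A Transformer whose test RMSE is not clearly below these values has not learned anything useful about next‑day returns, however low its training loss.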

---

## **28.7 Time‑Series Transformer Variants**

The vanilla Transformer, while powerful, has limitations for long sequence time‑series forecasting (LSTF): the quadratic complexity of self‑attention with respect to sequence length makes it expensive for sequences of hundreds or thousands of steps. Several variants have been proposed to address this:

### **28.7.1 Informer**

Informer (Zhou et al., 2021) introduces:

- **ProbSparse attention:** Instead of computing attention for all query‑key pairs, it selects only the most important queries based on a sparsity measurement, reducing complexity to O(L log L).
- **Self‑attention distilling:** Downsamples the input by half each layer, further reducing dimensions.
- **Generative style decoder:** Produces long sequences in one forward pass rather than step‑by‑step.

Informer is particularly suited for datasets where long‑term forecasting is needed (e.g., electricity consumption, traffic). For NEPSE, with only daily data, the sequence lengths may not be long enough to benefit from these optimizations, but the ideas are worth noting.
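The query‑selection idea behind ProbSparse attention can be illustrated with a simplified NumPy sketch. It is not the Informer implementation: it computes the full score matrix for clarity, so it does not realize the O(L log L) saving (the paper estimates the measurement from a sample of keys), and the function name and the constant `c` are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def probsparse_self_attention(Q, K, V, c=1.0):
    """Simplified ProbSparse-style attention for Q, K, V of shape (L, d)."""
    L_q, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)          # full matrix, for clarity only
    # Sparsity measurement: a query whose scores are nearly uniform has
    # max close to mean and contributes little beyond an average of V.
    M = scores.max(axis=1) - scores.mean(axis=1)
    u = min(L_q, max(1, int(np.ceil(c * np.log(L_q)))))
    active = np.argsort(M)[-u:]            # the u most "informative" queries
    # Inactive queries fall back to the mean of V; active ones attend fully
    out = np.tile(V.mean(axis=0), (L_q, 1))
    out[active] = softmax(scores[active], axis=-1) @ V
    return out, active

rng = np.random.default_rng(0)
X = rng.normal(size=(96, 16))
out, active = probsparse_self_attention(X, X, X)
print(out.shape, len(active))
```

With 96 time steps, only a handful of queries receive full attention, which is where the saving comes from when the selection itself is done cheaply.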

### **28.7.2 Autoformer**

Autoformer (Wu et al., 2021) replaces the self‑attention mechanism with an **auto‑correlation** mechanism that discovers period‑based dependencies. It uses:

- **Series decomposition:** Decomposes the series into trend and seasonal components.
- **Auto‑correlation:** Computes series similarity based on time delays, which is more efficient and interpretable for seasonality.

Autoformer often outperforms standard Transformers on datasets with clear seasonal patterns. For NEPSE, we might have weak seasonality (e.g., day‑of‑week effects), so Autoformer could be beneficial.
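Both ingredients have simple signal‑processing counterparts, sketched below in NumPy under our own assumptions. `series_decomp` is a moving‑average decomposition with edge padding, and `autocorrelation_fft` computes the autocorrelation of a single series via the FFT. Autoformer itself applies the correlation between learned queries and keys and aggregates time‑delayed values, which this sketch does not attempt.

```python
import numpy as np

def series_decomp(x, kernel_size=5):
    """Split a 1-D series into trend (moving average) and seasonal (remainder)."""
    pad = (kernel_size - 1) // 2
    # Repeat the edge values so the trend has the same length as x
    padded = np.concatenate([
        np.repeat(x[0], pad), x, np.repeat(x[-1], kernel_size - 1 - pad)
    ])
    trend = np.convolve(padded, np.ones(kernel_size) / kernel_size, mode="valid")
    return trend, x - trend

def autocorrelation_fft(x):
    """Autocorrelation at all lags via the FFT (Wiener-Khinchin), O(L log L)."""
    x = x - x.mean()
    n = len(x)
    f = np.fft.rfft(x, n=2 * n)            # zero-pad to avoid circular wrap-around
    acf = np.fft.irfft(f * np.conj(f), n=2 * n)[:n]
    return acf / acf[0]

t = np.arange(200)
series = 0.05 * t + np.sin(2 * np.pi * t / 5)   # linear trend + period-5 cycle
trend, seasonal = series_decomp(series, kernel_size=5)
acf = autocorrelation_fft(seasonal)
best_lag = int(np.argmax(acf[1:20])) + 1        # strongest lag among 1..19
print(best_lag)
```

On a synthetic series with a five‑step cycle, the strongest autocorrelation lag matches the cycle length; on NEPSE returns, any such peak is likely to be far weaker and should be checked against noise.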

### **28.7.3 FEDformer**

FEDformer (Zhou et al., 2022) applies a **frequency‑enhanced** approach, using Fourier and wavelet transforms to capture global and local patterns in the frequency domain. Because it operates on a fixed number of selected frequency modes, its cost grows linearly with sequence length, which makes it attractive for long sequences.
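The core intuition, representing a series by a small set of frequency modes, can be shown with a plain Fourier filter. This is only an illustration under our own assumptions: FEDformer selects modes differently and applies learned weights in the frequency domain, and `keep_frequency_modes` is a hypothetical helper.

```python
import numpy as np

def keep_frequency_modes(x, modes):
    """Reconstruct x from a chosen subset of its Fourier modes."""
    spectrum = np.fft.rfft(x)
    filtered = np.zeros_like(spectrum)
    filtered[modes] = spectrum[modes]      # all other modes are discarded
    return np.fft.irfft(filtered, n=len(x))

rng = np.random.default_rng(0)
t = np.arange(128)
clean = np.sin(2 * np.pi * 4 * t / 128)          # 4 full cycles in the window
noisy = clean + 0.3 * rng.normal(size=t.size)

recon = keep_frequency_modes(noisy, modes=np.arange(8))   # keep 8 of 65 modes
err_noisy = np.sqrt(np.mean((noisy - clean) ** 2))
err_recon = np.sqrt(np.mean((recon - clean) ** 2))
print(err_recon < err_noisy)
```

The cost of working with the retained modes depends on how many are kept, not on the window length, which is the source of the efficiency gain.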

### **28.7.4 When to Use These Variants**

For the NEPSE dataset with daily data and a few thousand samples, the vanilla Transformer with moderate sequence length (e.g., 60 days) is feasible. However, if we were to use higher frequency data (e.g., minute‑by‑minute) or very long windows, these variants would become necessary. Implementing them from scratch is complex; usually we rely on existing implementations (e.g., in GitHub repositories or libraries like `tsai`).

---

## **28.8 Pre‑training and Fine‑tuning for Time‑Series**

Transformers have benefited enormously from pre‑training on large corpora in NLP. For time‑series, pre‑training is an emerging area. The idea is to train a model on a large collection of time series (e.g., many stocks, different markets) to learn general temporal patterns, then fine‑tune on a target series (e.g., a specific NEPSE stock).

### **28.8.1 Pre‑training Strategies**

- **Masked autoencoding:** Mask a portion of the time steps and train the model to reconstruct them (similar to BERT).
- **Contrastive learning:** Learn representations such that different views of the same series are close, while views of different series are far apart.
- **Forecasting pre‑training:** Train on a large dataset to predict future values, then fine‑tune on the target.
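For masked autoencoding, the data side of the objective is straightforward: hide some time steps, then score the reconstruction only where values were hidden. The NumPy sketch below uses hypothetical helper names and a 15% mask ratio chosen for illustration; the model that performs the reconstruction is left out.

```python
import numpy as np

def mask_time_steps(x, mask_ratio=0.15, rng=None):
    """x: (L, n_features). Returns a masked copy and a boolean mask (True = hidden)."""
    rng = np.random.default_rng() if rng is None else rng
    L = x.shape[0]
    n_mask = max(1, int(round(mask_ratio * L)))
    idx = rng.choice(L, size=n_mask, replace=False)
    mask = np.zeros(L, dtype=bool)
    mask[idx] = True
    x_masked = x.copy()
    x_masked[mask] = 0.0       # zero equals the mean after standard scaling
    return x_masked, mask

def masked_mse(x_true, x_reconstructed, mask):
    """Reconstruction loss computed only on the hidden time steps."""
    return float(np.mean((x_true[mask] - x_reconstructed[mask]) ** 2))

rng = np.random.default_rng(42)
window = rng.normal(size=(20, 1))          # one scaled 20-day window
window_masked, mask = mask_time_steps(window, mask_ratio=0.15, rng=rng)
print(int(mask.sum()))
# Loss of a "model" that simply returns its masked input unchanged
print(masked_mse(window, window_masked, mask))
```

During pre‑training, the encoder would receive `window_masked` and be trained to minimize `masked_mse` against the original `window`.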

### **28.8.2 Fine‑tuning on NEPSE**

If we have a pre‑trained model (e.g., trained on US stock data), we could load its weights, replace the final layer, and fine‑tune on NEPSE data. However, pre‑trained models for time series are far less established than their NLP counterparts, and patterns learned on another market may not transfer. In practice, for NEPSE, training from scratch remains the default starting point.

---

## **28.9 Implementation Considerations**

### **28.9.1 Data Requirements**

Transformers generally require more data than RNNs or CNNs. For a single stock with a few thousand samples, a small Transformer (2‑4 layers, 4‑8 heads) with strong regularization may work, but we risk overfitting. We can augment data by:

- Using multiple stocks as separate training samples (treat each stock as independent).
- Applying time‑series augmentation techniques (e.g., jittering, scaling, time warping) – though care must be taken not to break temporal relationships.
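Two of the simpler augmentations can be sketched as follows. The function names and noise levels are illustrative assumptions, and the transforms are meant to be applied to scaled training windows only, never to validation or test data.

```python
import numpy as np

def jitter(x, sigma=0.05, rng=None):
    """Add small Gaussian noise to every value in the window."""
    rng = np.random.default_rng() if rng is None else rng
    return x + rng.normal(0.0, sigma, size=x.shape)

def scale(x, sigma=0.1, rng=None):
    """Multiply the whole window by one random factor close to 1,
    which changes the amplitude but keeps the shape of the path."""
    rng = np.random.default_rng() if rng is None else rng
    return x * rng.normal(1.0, sigma)

rng = np.random.default_rng(0)
window = rng.normal(size=(20, 1))          # one scaled 20-day window
jittered = jitter(window, sigma=0.05, rng=rng)
scaled = scale(window, sigma=0.1, rng=rng)
print(jittered.shape, scaled.shape)
```

Both transforms preserve the ordering of time steps. Whether the target should be transformed along with the window depends on the task, and for returns even small noise can be comparable to the signal, so the noise levels deserve validation.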

### **28.9.2 Regularization**

- Dropout in attention and feed‑forward layers (0.1‑0.3).
- Layer normalization (already used).
- Weight decay (L2 regularization) on dense layers.
- Early stopping.

### **28.9.3 Computational Complexity**

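For sequence length `L`, self‑attention costs O(L²) time and memory per head and per layer, because every time step is scored against every other. With `L=60`, L²=3,600 scores, which is negligible. With `L=500`, L²=250,000, which becomes heavy once multiplied by the number of heads, layers, and samples in a batch. Use efficient variants if needed.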
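To put rough numbers on this, the snippet below estimates the memory needed just to store the attention weights of one layer. It assumes float32 values, 4 heads, and a batch of 32, and ignores activations, gradients, and parameters, so actual usage during training is higher.

```python
def attention_matrix_bytes(seq_length, num_heads, batch_size, bytes_per_value=4):
    """Memory for the attention weights of one layer (float32 by default)."""
    return batch_size * num_heads * seq_length ** 2 * bytes_per_value

for L in (60, 500, 5000):
    mib = attention_matrix_bytes(L, num_heads=4, batch_size=32) / 2**20
    print(f"L={L:>4}: {mib:,.1f} MiB per layer")
```

Multiplying the window length by 10 multiplies this figure by 100, which is why long windows or intraday data quickly call for the efficient variants of Section 28.7.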

### **28.9.4 Positional Encoding**

The Keras `MultiHeadAttention` layer does not add positional encodings; they must be added explicitly. The model in Section 28.6 omits them, so its attention layers treat the window as an unordered set of time steps and cannot distinguish yesterday's return from one 20 days ago. It is better to add the encodings explicitly with a custom layer:

```python
class PositionalEncoding(layers.Layer):
    def __init__(self, sequence_length, d_model):
        super().__init__()
        self.pos_encoding = self.positional_encoding(sequence_length, d_model)
    
    def positional_encoding(self, length, d_model):
        angle_rads = self.get_angles(np.arange(length)[:, np.newaxis],
                                     np.arange(d_model)[np.newaxis, :],
                                     d_model)
        # apply sin to even indices, cos to odd
        angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])
        angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])
        pos_encoding = angle_rads[np.newaxis, ...]  # (1, length, d_model)
        return tf.cast(pos_encoding, dtype=tf.float32)
    
    def get_angles(self, pos, i, d_model):
        angle_rates = 1 / np.power(10000, (2 * (i//2)) / np.float32(d_model))
        return pos * angle_rates
    
    def call(self, inputs):
        return inputs + self.pos_encoding[:, :tf.shape(inputs)[1], :]
```

Apply the layer after an initial linear projection, for example `x = layers.Dense(d_model)(inputs)` followed by `x = PositionalEncoding(seq_length, d_model)(x)`. The projection is needed because the sine‑cosine pattern is spread over several feature dimensions, whereas our input has only one.

---

## **28.10 Practical Example: NEPSE Direction Prediction with Transformer**

Let's adapt the model for binary classification (direction). We'll change the output layer to sigmoid and use binary crossentropy loss.

```python
# Binary target
y_binary = (y > 0).astype(int)
y_train_bin, y_test_bin = y_binary[:split_idx], y_binary[split_idx:]

# Build transformer classifier (same encoder, but output with sigmoid)
def build_transformer_classifier(input_shape, head_size, num_heads, ff_dim, num_blocks, dropout=0):
    inputs = keras.Input(shape=input_shape)
    x = inputs
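    # Project the single input feature to d_model and add positional encodings
    # (PositionalEncoding is defined in Section 28.9.4). Without the projection,
    # LayerNormalization over one feature would output a constant.
    d_model = head_size * num_heads
    x = layers.Dense(d_model)(x)
    x = PositionalEncoding(seq_length, d_model)(x)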
    
    for _ in range(num_blocks):
        x = transformer_encoder(x, head_size, num_heads, ff_dim, dropout)
    
    x = layers.GlobalAveragePooling1D()(x)
    x = layers.Dropout(dropout)(x)
    outputs = layers.Dense(1, activation='sigmoid')(x)
    model = keras.Model(inputs, outputs)
    return model

model_clf = build_transformer_classifier(
    input_shape=input_shape,
    head_size=32,
    num_heads=4,
    ff_dim=64,
    num_blocks=2,
    dropout=0.2
)
model_clf.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

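# Align the binary labels with the temporal train/validation split used above
y_train_bin_final = y_train_bin[:-val_size]
y_val_bin = y_train_bin[-val_size:]

# Use a fresh callback so no state carries over from the regression run
early_stop_clf = keras.callbacks.EarlyStopping(
    monitor='val_loss', patience=10, restore_best_weights=True
)

# Train
history_clf = model_clf.fit(
    X_train_final, y_train_bin_final,
    validation_data=(X_val, y_val_bin),
    epochs=50,
    batch_size=32,
    callbacks=[early_stop_clf],
    verbose=1
)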

# Test accuracy
y_pred_prob = model_clf.predict(X_test_scaled).flatten()
y_pred_class = (y_pred_prob > 0.5).astype(int)
accuracy = np.mean(y_pred_class == y_test_bin)
print(f"Transformer classifier test accuracy: {accuracy:.4f}")
```

---

## **28.11 Chapter Summary**

In this chapter, we introduced Transformer models and their application to time‑series forecasting, using the NEPSE dataset as a concrete example.

- **Transformer architecture:** We explained self‑attention, multi‑head attention, positional encoding, and the encoder‑decoder structure.
- **Implementation:** We built a simplified Transformer encoder in Keras and applied it to predict next‑day returns and direction.
- **Efficient variants:** We discussed Informer, Autoformer, and FEDformer, which are designed for long sequence forecasting.
- **Pre‑training and fine‑tuning:** We touched on the potential of transfer learning for time‑series.
- **Practical considerations:** Data requirements, regularization, and computational complexity are key when using Transformers on relatively small financial datasets.

### **Practical Takeaways for the NEPSE System:**

- For a single stock with a few thousand daily observations, a small Transformer (2‑3 layers, 4‑8 heads) with strong dropout and early stopping can be a competitive model.
- Use positional encodings to retain temporal order.
- Compare Transformer performance with LSTM and CNN baselines on the same test split; on a dataset of this size, the Transformer is not guaranteed to win and usually requires more careful tuning.
- If longer sequences (e.g., 100+ days) are used, consider efficient variants or reduce sequence length.
- For multi‑step forecasting, an encoder‑decoder Transformer may be appropriate, but the increased complexity must be justified by performance gains.

In the next chapter, **Chapter 29: Specialized Time‑Series Architectures**, we will explore other advanced models such as N‑BEATS, DeepAR, Temporal Fusion Transformers, and Gaussian Processes, which are specifically designed for probabilistic forecasting and hierarchical time series.

---

**End of Chapter 28**

