# Week 15: Attention & Transformers for Finance

## 🎯 Learning Objectives

By the end of this week, you will understand:
- **Attention Mechanism**: Focus on relevant parts of input
- **Transformer Architecture**: Self-attention without recurrence
- **Temporal Fusion Transformer**: An attention-based architecture designed for multi-horizon time series forecasting
- **Finance Applications**: Multi-horizon forecasting

---

## Why Transformers?

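- **Parallel processing**: All time steps are processed at once, with no sequential bottleneck as in RNNs
- **Long-range dependencies**: Any two time steps are connected directly by a single attention step
- **Interpretable**: Attention weights show which inputs the model weighted most heavily, though they are a diagnostic rather than a full explanation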

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)
print("✅ Libraries loaded!")
print("📚 Week 15: Attention & Transformers")

---

## Part 1: Attention Mechanism

### Scaled Dot-Product Attention

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

### Components

- **Query (Q)**: What am I looking for?
- **Key (K)**: What do I contain?
- **Value (V)**: What information do I provide?

### 🤔 Simple Explanation

Attention works like a soft search engine. The Query asks a question, each Key describes what an item contains, and the Values are blended in proportion to how strongly each Key matches the Query.

In [None]:
def scaled_dot_product_attention(Q, K, V):
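    """Compute scaled dot-product attention for 2-D inputs.

    Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v).
    Returns the output (n_queries, d_v) and the attention weights (n_queries, n_keys).
    """
    d_k = Q.shape[-1]
    
    # Compute attention scores; dividing by sqrt(d_k) keeps the softmax from saturating as d_k grows
    scores = np.matmul(Q, K.T) / np.sqrt(d_k)
    
    # Numerically stable softmax over the key axis (subtract each row's max first)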
    exp_scores = np.exp(scores - np.max(scores, axis=-1, keepdims=True))
    attention_weights = exp_scores / np.sum(exp_scores, axis=-1, keepdims=True)
    
    # Weighted sum of values
    output = np.matmul(attention_weights, V)
    
    return output, attention_weights

# Example: 5 time steps, 4 features
seq_len = 5
d_model = 4

# In self-attention, Q, K, V come from the same input
X = np.random.randn(seq_len, d_model)

output, attention = scaled_dot_product_attention(X, X, X)

print("Self-Attention Example")
print("="*50)
print(f"Input shape: {X.shape}")
print(f"Output shape: {output.shape}")
print(f"\nAttention weights (each row sums to 1):")
print(attention.round(3))

# Visualize attention
plt.figure(figsize=(8, 6))
plt.imshow(attention, cmap='Blues')
plt.colorbar(label='Attention Weight')
plt.xlabel('Key Position')
plt.ylabel('Query Position')
plt.title('Self-Attention Weights')
plt.show()
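
To see why the formula divides by $\sqrt{d_k}$, the sketch below compares the softmax on raw and on scaled dot products. It assumes query and key entries are independent standard normal draws, which is an illustrative assumption rather than a property of real market features. Under that assumption the raw dot product has variance $d_k$, so unscaled scores grow with the dimension and push the softmax toward a near one-hot distribution.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax along the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)  # local generator; the seed is arbitrary
d_k, n_keys, n_trials = 256, 10, 1000

# Assumption for illustration: i.i.d. standard normal queries and keys
q = rng.standard_normal((n_trials, d_k))
k = rng.standard_normal((n_trials, n_keys, d_k))

raw_scores = np.einsum("td,tnd->tn", q, k)   # (n_trials, n_keys)
scaled_scores = raw_scores / np.sqrt(d_k)

raw_std = raw_scores.std()
scaled_std = scaled_scores.std()

# Average of the largest attention weight per trial; values near 1 mean near one-hot
raw_peak = softmax(raw_scores).max(axis=-1).mean()
scaled_peak = softmax(scaled_scores).max(axis=-1).mean()

print(f"std of raw scores:    {raw_std:.2f} (sqrt(d_k) = {np.sqrt(d_k):.2f})")
print(f"std of scaled scores: {scaled_std:.2f}")
print(f"mean max weight, raw:    {raw_peak:.3f}")
print(f"mean max weight, scaled: {scaled_peak:.3f}")
```

A saturated softmax has very small gradients, which is the usual motivation given for the scaling factor.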

---

## Part 2: Multi-Head Attention

### Multiple Attention Heads

Run attention multiple times with different projections:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(head_1, ..., head_h)W^O$$

Where: $head_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$

### 🤔 Simple Explanation

Each head looks at the data from a different "perspective." One head might focus on short-term patterns, another on long-term trends.

In [None]:
class MultiHeadAttention:
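    def __init__(self, d_model, n_heads):
        # The reshape in forward() requires d_model to split evenly across heads
        if d_model % n_heads != 0:
            raise ValueError("d_model must be divisible by n_heads")
        self.d_model = d_model
        self.n_heads = n_heads
        self.d_k = d_model // n_heads
        
        # Projection matrices (random and untrained, for illustration only)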
        self.W_q = np.random.randn(d_model, d_model) * 0.1
        self.W_k = np.random.randn(d_model, d_model) * 0.1
        self.W_v = np.random.randn(d_model, d_model) * 0.1
        self.W_o = np.random.randn(d_model, d_model) * 0.1
    
    def forward(self, X):
        # Project
        Q = X @ self.W_q
        K = X @ self.W_k
        V = X @ self.W_v
        
        # Split into heads
        seq_len = X.shape[0]
        Q = Q.reshape(seq_len, self.n_heads, self.d_k)
        K = K.reshape(seq_len, self.n_heads, self.d_k)
        V = V.reshape(seq_len, self.n_heads, self.d_k)
        
        # Attention for each head
        heads = []
        for h in range(self.n_heads):
            out, _ = scaled_dot_product_attention(Q[:, h, :], K[:, h, :], V[:, h, :])
            heads.append(out)
        
        # Concatenate and project
        concat = np.concatenate(heads, axis=-1)
        output = concat @ self.W_o
        
        return output

# Test
mha = MultiHeadAttention(d_model=8, n_heads=2)
X = np.random.randn(5, 8)
output = mha.forward(X)
print(f"Multi-Head Attention output shape: {output.shape}")

---

## Part 3: Transformer Architecture

### Encoder Block

1. Multi-Head Self-Attention
2. Add & Normalize (residual connection)
3. Feed-Forward Network
4. Add & Normalize
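
The four steps above can be sketched in NumPy as follows. This is a simplified illustration, not a reference implementation: it uses a single attention head instead of multi-head attention, omits the learnable scale and shift in layer normalization, and uses random untrained weights. The names `encoder_block`, `layer_norm`, and the `params` dictionary keys are hypothetical.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax along the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    """Normalize each position (row) to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def encoder_block(x, params):
    """One encoder block applied to x of shape (seq_len, d_model)."""
    # 1. Self-attention (single head here to keep the sketch short)
    Q, K, V = x @ params["W_q"], x @ params["W_k"], x @ params["W_v"]
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    attn_out = (weights @ V) @ params["W_o"]
    # 2. Add & Normalize (residual connection around attention)
    x = layer_norm(x + attn_out)
    # 3. Position-wise feed-forward network with a ReLU in between
    hidden = np.maximum(0.0, x @ params["W_1"] + params["b_1"])
    ffn_out = hidden @ params["W_2"] + params["b_2"]
    # 4. Add & Normalize (residual connection around the feed-forward network)
    return layer_norm(x + ffn_out)

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 5, 8, 32
params = {
    "W_q": rng.standard_normal((d_model, d_model)) * 0.1,
    "W_k": rng.standard_normal((d_model, d_model)) * 0.1,
    "W_v": rng.standard_normal((d_model, d_model)) * 0.1,
    "W_o": rng.standard_normal((d_model, d_model)) * 0.1,
    "W_1": rng.standard_normal((d_model, d_ff)) * 0.1,
    "b_1": np.zeros(d_ff),
    "W_2": rng.standard_normal((d_ff, d_model)) * 0.1,
    "b_2": np.zeros(d_model),
}

x = rng.standard_normal((seq_len, d_model))
out = encoder_block(x, params)
print(f"Encoder block output shape: {out.shape}")
```

The output has the same shape as the input, which is what allows several such blocks to be stacked.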

### Positional Encoding

Since Transformers have no recurrence, we add position information:

$$PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_{model}})$$
$$PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d_{model}})$$

In [None]:
def positional_encoding(seq_len, d_model):
    """Generate positional encodings"""
    PE = np.zeros((seq_len, d_model))
    position = np.arange(seq_len).reshape(-1, 1)
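    # 1 / 10000^(2i / d_model) for each sine/cosine pair, computed in log space
    div_term = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))
    
    PE[:, 0::2] = np.sin(position * div_term)
    # Slice div_term so an odd d_model (one fewer cosine column) does not raise a shape error
    PE[:, 1::2] = np.cos(position * div_term[:d_model // 2])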
    
    return PE

# Visualize
PE = positional_encoding(50, 64)

plt.figure(figsize=(12, 4))
plt.imshow(PE.T, aspect='auto', cmap='RdBu')
plt.colorbar()
plt.xlabel('Position')
plt.ylabel('Encoding Dimension')
plt.title('Positional Encoding')
plt.show()
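
One property of the sinusoidal formulas is worth checking numerically: the dot product between the encodings at positions $p$ and $q$ depends only on the offset $p - q$, because $\sin a \sin b + \cos a \cos b = \cos(a - b)$. The sketch below verifies this for an offset of 4 at three different absolute positions. The helper name `sinusoidal_pe` is hypothetical, and the sketch assumes an even `d_model`.

```python
import numpy as np

def sinusoidal_pe(seq_len, d_model):
    """Sinusoidal positional encodings; this sketch assumes d_model is even."""
    position = np.arange(seq_len).reshape(-1, 1)
    div_term = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(position * div_term)
    pe[:, 1::2] = np.cos(position * div_term)
    return pe

pe = sinusoidal_pe(50, 64)
similarity = pe @ pe.T  # (50, 50) matrix of dot products between positions

# The same offset of 4 at three different absolute positions
sim_early = similarity[3, 7]
sim_mid = similarity[10, 14]
sim_late = similarity[40, 44]

# Each position's dot product with itself equals d_model / 2
self_sim = np.diag(similarity)

print(f"offset 4 at positions 3, 10, 40: {sim_early:.4f}, {sim_mid:.4f}, {sim_late:.4f}")
print(f"self-similarity (first three positions): {self_sim[:3]}")
```

For a time series this means a fixed lag looks the same to the dot product wherever it falls in the input window.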

---

## Part 4: Transformer for Time Series

### Adaptations for Finance

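1. **Causal masking**: Prevent each time step from attending to later time steps, which would leak future information into the forecast
2. **Temporal features**: Embed calendar and time information (e.g., day of week) alongside the price features
3. **Multi-horizon output**: Predict several future steps at once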
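
Causal masking, the first item above, can be sketched in NumPy by setting the scores for future positions to negative infinity before the softmax. This is a minimal single-head illustration on random data with no learned projections; the function name `causal_attention` is hypothetical.

```python
import numpy as np

def causal_attention(Q, K, V):
    """Scaled dot-product attention where position t attends only to positions <= t."""
    seq_len, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    # True above the diagonal marks the "future" keys for each query row
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # exp(-inf) = 0, so masked positions receive exactly zero weight
    exp_scores = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = exp_scores / exp_scores.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 4))  # 5 time steps, 4 features
out, weights = causal_attention(X, X, X)

# Leakage check: changing the last time step must not change earlier outputs
X_changed = X.copy()
X_changed[4] += 10.0
out_changed, _ = causal_attention(X_changed, X_changed, X_changed)

print("Causal attention weights (upper triangle is zero):")
print(np.round(weights, 3))
print("Earlier outputs unchanged:", np.allclose(out[:4], out_changed[:4]))
```

The leakage check at the end is a useful habit in backtesting: perturb a future observation and confirm that no earlier prediction moves.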

In [None]:
try:
    import torch
    import torch.nn as nn
    
    class TimeSeriesTransformer(nn.Module):
        def __init__(self, input_dim, d_model=64, nhead=4, num_layers=2, forecast_horizon=5):
            super().__init__()
            self.embedding = nn.Linear(input_dim, d_model)
            encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
            self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
            self.fc_out = nn.Linear(d_model, forecast_horizon)
        
        def forward(self, x):
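            # x: (batch, seq_len, features)
            # Note: this minimal model adds no positional encoding and no causal mask,
            # so the order of the earlier time steps in the window does not affect the forecast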
            x = self.embedding(x)
            x = self.transformer(x)
            return self.fc_out(x[:, -1, :])  # Use last position for forecast
    
    # Test
    model = TimeSeriesTransformer(input_dim=5, d_model=32, forecast_horizon=3)
    x = torch.randn(16, 20, 5)  # batch=16, seq_len=20, features=5
    out = model(x)
    print(f"Transformer output shape: {out.shape}")
    
except ImportError:
    print("⚠️ PyTorch not installed")
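
The model above takes inputs of shape `(batch, seq_len, features)` and returns `forecast_horizon` values per sample, so training data has to be sliced into matching input windows and future targets. One possible way to do that is sketched below in NumPy. The helper `make_windows` is hypothetical, the features are synthetic noise, and the target is a simple ramp chosen only so the alignment is easy to verify.

```python
import numpy as np

def make_windows(features, target, seq_len, horizon):
    """Slice a series into (input window, future targets) pairs.

    features: (n_steps, n_features); target: (n_steps,)
    Returns X: (n_samples, seq_len, n_features) and y: (n_samples, horizon).
    """
    n_samples = len(target) - seq_len - horizon + 1
    if n_samples <= 0:
        raise ValueError("series too short for this seq_len and horizon")
    # Each input window covers steps [i, i + seq_len)
    X = np.stack([features[i : i + seq_len] for i in range(n_samples)])
    # Each target covers the next `horizon` steps, strictly after the window
    y = np.stack([target[i + seq_len : i + seq_len + horizon] for i in range(n_samples)])
    return X, y

rng = np.random.default_rng(0)
n_steps, n_features = 100, 5
features = rng.standard_normal((n_steps, n_features))  # synthetic stand-in for real features
target = np.arange(n_steps, dtype=float)               # ramp: target[t] = t

X, y = make_windows(features, target, seq_len=20, horizon=3)
print(f"X shape: {X.shape}, y shape: {y.shape}")
print(f"First window ends at step 19; its targets are steps {y[0]}")
```

When splitting such windows into training and test sets, a chronological split avoids test targets overlapping with training inputs.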

---

## Interview Questions

### Conceptual
1. What advantage does attention have over RNNs?
2. Why do we need positional encoding?
3. What do attention weights tell us about the model?

### Technical
1. Explain the computational complexity of self-attention.
2. How does multi-head attention help?
3. What is causal masking and why is it important?

### Finance-Specific
1. How would you interpret attention weights in a trading context?
2. What are the challenges of using Transformers for financial data?
3. How would you handle multiple assets with Transformers?

---

## Key Takeaways

| Concept | Key Point |
|---------|----------|
| Attention | Weights each input by its relevance to the query |
| Transformers | Process time steps in parallel and connect distant steps directly |
| Finance Use | Multi-horizon forecasts with inspectable attention weights |