# Day 70: Attention-based sequence models and Temporal Fusion

## Introduction

Time series forecasting has evolved dramatically with the advent of deep learning. While traditional recurrent architectures like LSTMs and GRUs have shown promise, they struggle with long sequences and often fail to capture complex temporal dependencies. **Attention mechanisms** have revolutionized this field by allowing models to selectively focus on relevant parts of the input sequence, regardless of their position.

In this lesson, we'll explore how attention mechanisms can be applied to time series forecasting, with a particular focus on **Temporal Fusion Transformers (TFT)**, a state-of-the-art architecture that combines multiple temporal processing components with interpretable attention layers.

### Why Attention for Time Series?

Traditional sequence models process data sequentially, which can lead to:
- **Information loss** over long sequences
- **Difficulty capturing long-range dependencies**
- **Limited interpretability** - we can't see which past values influenced predictions

Attention mechanisms solve these problems by:
- Computing direct connections between all time steps
- Learning which historical points are most relevant for prediction
- Providing interpretable attention weights that show what the model "focuses on"

### Real-World Applications

- **Energy demand forecasting**: Understanding which historical patterns (weekends, holidays, weather events) influence future consumption
- **Stock price prediction**: Identifying which past market events correlate with future movements
- **Retail sales forecasting**: Capturing seasonal patterns and promotional effects
- **Weather prediction**: Modeling complex temporal and spatial dependencies

### Learning Objectives

By the end of this lesson, you will be able to:

1. **Understand** the attention mechanism and its application to time series
2. **Implement** a simple attention layer for sequence modeling
3. **Build** a temporal attention model for forecasting
4. **Visualize** attention weights to interpret model predictions
5. **Apply** attention-based models to real-world time series data

## Theory: Attention Mechanisms for Time Series

### 1. The Attention Mechanism

At its core, attention is a mechanism that computes a weighted sum of values, where the weights are dynamically computed based on relevance.

**Mathematical Formulation:**

Given:
- Query vector $\mathbf{q} \in \mathbb{R}^{d_k}$
- Key vectors $\mathbf{K} \in \mathbb{R}^{n \times d_k}$
- Value vectors $\mathbf{V} \in \mathbb{R}^{n \times d_v}$

The attention mechanism computes:

$$\text{Attention}(\mathbf{q}, \mathbf{K}, \mathbf{V}) = \sum_{i=1}^{n} \alpha_i \mathbf{v}_i$$

where the attention weights $\alpha_i$ are computed as:

$$\alpha_i = \frac{\exp(\text{score}(\mathbf{q}, \mathbf{k}_i))}{\sum_{j=1}^{n} \exp(\text{score}(\mathbf{q}, \mathbf{k}_j))}$$

Common scoring functions include:
- **Dot product**: $\text{score}(\mathbf{q}, \mathbf{k}) = \mathbf{q}^T \mathbf{k}$
- **Scaled dot product**: $\text{score}(\mathbf{q}, \mathbf{k}) = \frac{\mathbf{q}^T \mathbf{k}}{\sqrt{d_k}}$
- **Additive**: $\text{score}(\mathbf{q}, \mathbf{k}) = \mathbf{w}^T \tanh(\mathbf{W}_q\mathbf{q} + \mathbf{W}_k\mathbf{k})$
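
The three scoring functions above can be compared side by side for a single query. The following is a minimal NumPy sketch, not a reference implementation: the function name `attend`, the additive-score parameters `W_q`, `W_k`, `w`, and the hidden size `d_a` are all illustrative assumptions.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attend(q, K, V, score="scaled_dot", W_q=None, W_k=None, w=None):
    """Return (context, weights) for one query q against n keys and values.

    q: (d_k,), K: (n, d_k), V: (n, d_v).
    For the additive score: W_q and W_k are (d_a, d_k), w is (d_a,).
    """
    d_k = q.shape[0]
    if score == "dot":
        s = K @ q
    elif score == "scaled_dot":
        s = K @ q / np.sqrt(d_k)
    elif score == "additive":
        # tanh(W_q q + W_k k_i) for every key i, then project with w
        s = np.tanh(W_q @ q + K @ W_k.T) @ w
    else:
        raise ValueError(f"unknown score: {score}")
    alpha = softmax(s)        # attention weights, sum to 1
    return alpha @ V, alpha   # weighted sum of the value vectors

rng = np.random.default_rng(0)
n, d_k, d_v, d_a = 6, 4, 3, 5
q, K, V = rng.normal(size=d_k), rng.normal(size=(n, d_k)), rng.normal(size=(n, d_v))
W_q, W_k, w = rng.normal(size=(d_a, d_k)), rng.normal(size=(d_a, d_k)), rng.normal(size=d_a)

for name in ("dot", "scaled_dot", "additive"):
    context, alpha = attend(q, K, V, score=name, W_q=W_q, W_k=W_k, w=w)
    print(name, alpha.round(3))
```

All three variants produce weights that sum to 1; they differ only in how the raw relevance score is computed before the softmax.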

### 2. Self-Attention for Sequences

In **self-attention**, the queries, keys, and values all come from the same sequence. For a time series $\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, ..., \mathbf{x}_T]$:

$$\mathbf{Q} = \mathbf{X}\mathbf{W}_Q, \quad \mathbf{K} = \mathbf{X}\mathbf{W}_K, \quad \mathbf{V} = \mathbf{X}\mathbf{W}_V$$

where $\mathbf{W}_Q, \mathbf{W}_K, \mathbf{W}_V$ are learned projection matrices.

The self-attention output is:

$$\text{SelfAttention}(\mathbf{X}) = \text{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_k}}\right)\mathbf{V}$$
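
As a rough illustration of the formula above, here is a NumPy sketch for a single (unbatched) sequence. The projection matrices are random stand-ins for learned weights, and the dimensions are arbitrary assumptions.

```python
import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    """Scaled dot-product self-attention for one sequence X of shape (T, d_in)."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (T, T) pairwise relevance
    scores -= scores.max(axis=-1, keepdims=True)  # for numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)            # row-wise softmax
    return A @ V, A

rng = np.random.default_rng(0)
T, d_in, d_k, d_v = 6, 3, 4, 2
X = rng.normal(size=(T, d_in))
W_Q = rng.normal(size=(d_in, d_k))
W_K = rng.normal(size=(d_in, d_k))
W_V = rng.normal(size=(d_in, d_v))

out, A = self_attention(X, W_Q, W_K, W_V)
print(out.shape, A.shape)  # (6, 2) (6, 6)
```

Row `i` of `A` holds the weights that time step `i` places on every time step in the sequence.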

### 3. Multi-Head Attention

Instead of a single attention function, multi-head attention uses multiple parallel attention "heads" to capture different types of relationships:

$$\text{MultiHead}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{Concat}(\text{head}_1, ..., \text{head}_h)\mathbf{W}_O$$

where each head is:

$$\text{head}_i = \text{Attention}(\mathbf{Q}\mathbf{W}_Q^i, \mathbf{K}\mathbf{W}_K^i, \mathbf{V}\mathbf{W}_V^i)$$
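
A compact way to see the concatenate-then-project structure is the sketch below. It assumes self-attention (so $\mathbf{Q}$, $\mathbf{K}$, and $\mathbf{V}$ are all the input `X`), stacks the per-head projections into arrays of shape `(h, d_model, d_head)`, and uses random weights in place of learned ones.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, W_Q, W_K, W_V, W_O):
    """X: (T, d_model); W_Q, W_K, W_V: (h, d_model, d_head); W_O: (h * d_head, d_model)."""
    Q = np.einsum("td,hde->hte", X, W_Q)                  # (h, T, d_head)
    K = np.einsum("td,hde->hte", X, W_K)
    V = np.einsum("td,hde->hte", X, W_V)
    d_head = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (h, T, T)
    A = softmax(scores, axis=-1)
    heads = A @ V                                         # (h, T, d_head)
    # Concat(head_1, ..., head_h): put the heads side by side for each time step
    concat = heads.transpose(1, 0, 2).reshape(X.shape[0], -1)
    return concat @ W_O, A

rng = np.random.default_rng(0)
T, d_model, h, d_head = 5, 8, 2, 4
X = rng.normal(size=(T, d_model))
W_Q, W_K, W_V = (rng.normal(size=(h, d_model, d_head)) for _ in range(3))
W_O = rng.normal(size=(h * d_head, d_model))

out, A = multi_head_self_attention(X, W_Q, W_K, W_V, W_O)
print(out.shape, A.shape)  # (5, 8) (2, 5, 5)
```

Each head has its own `(T, T)` weight matrix, which is why different heads can specialize in different temporal relationships.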

### 4. Temporal Patterns in Attention

For time series, attention can capture:

1. **Seasonal patterns**: Attending to the same time of day/week/year in historical data
2. **Event-driven patterns**: Focusing on sudden changes or anomalies
3. **Trend patterns**: Weighting recent vs. distant history differently
4. **Causal relationships**: Learning dependencies between different variables over time
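
To make the seasonal case concrete, the toy sketch below hand-builds time-of-day features and shows that dot-product attention from the latest step concentrates on earlier steps at the same hour. Nothing here is learned: the features, the 24-step period, and the sharpening factor of 5.0 are assumptions chosen for illustration.

```python
import numpy as np

period = 24
t = np.arange(96)  # four "days" of hourly steps

# Hypothetical time-of-day features: one point on the unit circle per step
feats = np.stack([np.sin(2 * np.pi * t / period),
                  np.cos(2 * np.pi * t / period)], axis=1)  # (96, 2)

q = feats[-1]                    # query: the most recent step (t = 95)
scores = 5.0 * (feats @ q)       # equals 5 * cos(2*pi*(t - 95) / 24)
weights = np.exp(scores - scores.max())
weights /= weights.sum()         # softmax over all 96 steps

top4 = np.sort(np.argsort(weights)[-4:])
print(top4)  # [23 47 71 95]
```

The four most-attended steps are exactly one, two, and three periods before the query, plus the query itself. A trained model would have to learn projections that produce a similar alignment.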

### 5. Temporal Fusion Transformers (TFT)

TFT is a sophisticated architecture designed specifically for time series forecasting with the following components:

**Key Components:**

1. **Variable Selection Networks**: Learn which input features are most relevant
2. **Static Covariate Encoders**: Process time-invariant features (e.g., store ID, product category)
3. **Temporal Processing**:
   - LSTM encoder for historical sequences
   - LSTM decoder for known future inputs (e.g., calendar features)
4. **Multi-head Self-Attention**: Capture long-range dependencies
5. **Gated Residual Networks (GRN)**: Enable efficient information flow
6. **Quantile Forecasting**: Predict multiple quantiles for uncertainty estimation
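
Quantile forecasts are typically trained with the quantile (pinball) loss, which penalizes under- and over-prediction asymmetrically. The sketch below is a minimal NumPy version; averaging over samples and quantiles is a simplifying assumption and may differ from how a particular TFT implementation normalizes the loss.

```python
import numpy as np

def quantile_loss(y_true, y_pred, quantiles):
    """Mean pinball loss.

    y_true: (n,) observed values
    y_pred: (n, n_q) one predicted value per quantile
    quantiles: sequence of n_q quantile levels in (0, 1)
    """
    q = np.asarray(quantiles, dtype=float)
    err = y_true[:, None] - y_pred                  # positive when we under-predict
    return np.mean(np.maximum(q * err, (q - 1.0) * err))

y_true = np.array([10.0, 12.0, 9.0])
y_pred = np.array([[8.0, 10.0, 12.0],
                   [9.0, 11.0, 14.0],
                   [7.0, 9.5, 11.0]])               # columns: q = 0.1, 0.5, 0.9
print(quantile_loss(y_true, y_pred, [0.1, 0.5, 0.9]))
```

For a high quantile such as 0.9, under-predicting costs nine times as much as over-predicting by the same amount, which pushes that output toward the upper end of the distribution.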

**Architecture Flow:**

```
Input Features
    ↓
Variable Selection (What's important?)
    ↓
Static + Temporal Encoding
    ↓
LSTM Encoder (Past inputs) → LSTM Decoder (Known future inputs)
    ↓
Multi-Head Attention (Long-range dependencies)
    ↓
Quantile Outputs + Attention Weights
```

### 6. Advantages of Attention for Time Series

1. **Interpretability**: Attention weights reveal which historical points influenced predictions
2. **Long-range dependencies**: Direct connections between all time steps
3. **Flexibility**: With masking and explicit time features, can accommodate irregular time series and missing data
4. **Multi-horizon forecasting**: Naturally extends to predicting multiple steps ahead
5. **Variable importance**: Can learn which features matter most

### 7. Challenges

1. **Computational complexity**: $O(T^2)$ for sequence length $T$
2. **Overfitting on small datasets**: Many parameters require sufficient data
3. **Positional encoding**: Need to inject temporal position information
4. **Causality**: Must ensure model doesn't "peek" at future values during training
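
The causality constraint is usually enforced with a lower-triangular mask applied to the scores before the softmax. A minimal sketch, assuming a single unbatched sequence and a hypothetical helper name:

```python
import numpy as np

def causal_attention_weights(Q, K):
    """Attention weights where step i may only attend to steps j <= i.

    Q, K: (T, d_k) arrays for one sequence.
    """
    T, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    allowed = np.tril(np.ones((T, T), dtype=bool))   # True where key index <= query index
    scores = np.where(allowed, scores, -np.inf)      # future positions get zero weight
    scores -= scores.max(axis=-1, keepdims=True)     # the diagonal is always allowed
    w = np.exp(scores)
    return w / w.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 4))
K = rng.normal(size=(5, 4))
W = causal_attention_weights(Q, K)
print(W.round(2))  # upper triangle is all zeros
```

The first row has a single nonzero entry, because the first time step has no history other than itself.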

In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')

# Set random seeds for reproducibility
np.random.seed(42)

# Set plotting style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

print("Libraries imported successfully!")
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")

In [None]:
## Visualize Time Series Components
# Requires `ts_data` from Part 2 (Generate Synthetic Time Series Data)

fig, axes = plt.subplots(4, 1, figsize=(14, 10))

# Plot the complete time series
axes[0].plot(ts_data['time'][:500], ts_data['value'][:500], label='Complete Time Series', color='navy', alpha=0.7)
axes[0].set_title('Complete Time Series (First 500 points)', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Time')
axes[0].set_ylabel('Value')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Plot trend component
axes[1].plot(ts_data['time'][:500], ts_data['trend'][:500], label='Trend Component', color='green', linewidth=2)
axes[1].set_title('Trend Component', fontsize=12, fontweight='bold')
axes[1].set_xlabel('Time')
axes[1].set_ylabel('Value')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

# Plot daily seasonal component
axes[2].plot(ts_data['time'][:200], ts_data['daily_season'][:200], label='Daily Seasonality (24h cycle)', color='orange', linewidth=2)
axes[2].set_title('Daily Seasonal Component', fontsize=12, fontweight='bold')
axes[2].set_xlabel('Time')
axes[2].set_ylabel('Value')
axes[2].legend()
axes[2].grid(True, alpha=0.3)

# Plot weekly seasonal component
axes[3].plot(ts_data['time'][:500], ts_data['weekly_season'][:500], label='Weekly Seasonality (7-day cycle)', color='red', linewidth=2)
axes[3].set_title('Weekly Seasonal Component', fontsize=12, fontweight='bold')
axes[3].set_xlabel('Time')
axes[3].set_ylabel('Value')
axes[3].legend()
axes[3].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Time series decomposition visualized successfully!")

In [None]:
## Part 1: Implementing Basic Attention Mechanism

class SimpleAttention:
    """
    A simple attention mechanism implementation.
    Computes attention weights and weighted sum of values.
    """
    
    def __init__(self, d_k):
        """
        Args:
            d_k: Dimension of keys (used for scaling)
        """
        self.d_k = d_k
    
    def scaled_dot_product_attention(self, Q, K, V, mask=None):
        """
        Compute scaled dot-product attention.
        
        Args:
            Q: Query matrix (batch_size, seq_len_q, d_k)
            K: Key matrix (batch_size, seq_len_k, d_k)
            V: Value matrix (batch_size, seq_len_v, d_v)
            mask: Optional mask to prevent attention to certain positions
            
        Returns:
            output: Attention-weighted values
            attention_weights: Attention weights for visualization
        """
        # Compute attention scores
        scores = np.matmul(Q, K.transpose(0, 2, 1)) / np.sqrt(self.d_k)
        
        # Apply mask if provided (for causal attention)
        if mask is not None:
            scores = np.where(mask == 0, -1e9, scores)
        
        # Apply softmax to get attention weights
        attention_weights = self._softmax(scores, axis=-1)
        
        # Compute weighted sum of values
        output = np.matmul(attention_weights, V)
        
        return output, attention_weights
    
    def _softmax(self, x, axis=-1):
        """Numerically stable softmax"""
        exp_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
        return exp_x / np.sum(exp_x, axis=axis, keepdims=True)

# Test the attention mechanism
print("=== Testing Simple Attention ===\n")

# Create sample data (batch_size=1, seq_len=5, d_model=4)
batch_size = 1
seq_len = 5
d_model = 4

Q = np.random.randn(batch_size, seq_len, d_model)
K = np.random.randn(batch_size, seq_len, d_model)
V = np.random.randn(batch_size, seq_len, d_model)

attention = SimpleAttention(d_k=d_model)
output, weights = attention.scaled_dot_product_attention(Q, K, V)

print(f"Input shapes: Q={Q.shape}, K={K.shape}, V={V.shape}")
print(f"Output shape: {output.shape}")
print(f"Attention weights shape: {weights.shape}")
print(f"\nAttention weights (showing how each position attends to others):")
print(weights[0].round(3))

In [None]:
## Part 2: Generate Synthetic Time Series Data

def generate_time_series(n_samples=1000, noise_level=0.1):
    """
    Generate a synthetic time series with trend, seasonality, and noise.
    
    Args:
        n_samples: Number of time points
        noise_level: Noise scale; the noise standard deviation is 10 * noise_level
        
    Returns:
        DataFrame with time series data
    """
    t = np.arange(n_samples)
    
    # Trend component
    trend = 0.05 * t
    
    # Seasonal components (daily and weekly patterns)
    daily_season = 10 * np.sin(2 * np.pi * t / 24)  # 24-hour cycle
    weekly_season = 5 * np.sin(2 * np.pi * t / 168)  # 7-day cycle
    
    # Random noise
    noise = np.random.normal(0, noise_level * 10, n_samples)
    
    # Combine all components
    value = 50 + trend + daily_season + weekly_season + noise
    
    # Create DataFrame
    df = pd.DataFrame({
        'time': t,
        'value': value,
        'trend': 50 + trend,
        'daily_season': daily_season,
        'weekly_season': weekly_season
    })
    
    return df

# Generate time series
ts_data = generate_time_series(n_samples=1000, noise_level=0.1)

print("Time Series Data Generated:")
print(f"Shape: {ts_data.shape}")
print(f"\nFirst few rows:")
print(ts_data.head(10))
print(f"\nStatistics:")
print(ts_data['value'].describe())

In [None]:
## Part 6: Visualize Attention Weights for Interpretability
# Requires `model` from Part 4 and `X_test`, `y_test` from Part 3

# Select a few test samples to visualize attention
sample_indices = [0, 10, 20]

fig, axes = plt.subplots(len(sample_indices), 2, figsize=(16, 4*len(sample_indices)))

for idx, sample_idx in enumerate(sample_indices):
    # Get the input sequence
    input_seq = X_test[sample_idx]
    actual_value = y_test[sample_idx]
    
    # Get prediction and attention weights
    pred, attn_weights = model.predict(input_seq.reshape(1, -1), return_attention=True)
    
    # Extract attention weights for this sample (average across query positions)
    attn = attn_weights[0].mean(axis=0)  # Average attention across queries
    
    # Plot 1: Input sequence with color-coded attention
    ax1 = axes[idx, 0] if len(sample_indices) > 1 else axes[0]
    time_steps = np.arange(len(input_seq))
    
    # Create scatter plot where point size represents attention weight
    scatter = ax1.scatter(time_steps, input_seq, c=attn, s=attn*500, 
                         cmap='YlOrRd', alpha=0.7, edgecolors='black', linewidth=0.5)
    ax1.plot(time_steps, input_seq, alpha=0.3, color='gray', linestyle='--')
    ax1.set_title(f'Sample {sample_idx}: Input Sequence with Attention Weights\n'
                  f'Actual next value: {actual_value[0]:.3f}, Predicted: {pred[0][0]:.3f}',
                  fontsize=12, fontweight='bold')
    ax1.set_xlabel('Time Step (lookback)')
    ax1.set_ylabel('Normalized Value')
    ax1.grid(True, alpha=0.3)
    plt.colorbar(scatter, ax=ax1, label='Attention Weight')
    
    # Plot 2: Attention weights as bar chart
    ax2 = axes[idx, 1] if len(sample_indices) > 1 else axes[1]
    bars = ax2.bar(time_steps, attn, color='steelblue', alpha=0.7, edgecolor='black')
    
    # Highlight top-5 most attended positions
    top_5_indices = np.argsort(attn)[-5:]
    for i in top_5_indices:
        bars[i].set_color('coral')
    
    ax2.set_title(f'Attention Distribution (Top 5 positions highlighted)', 
                  fontsize=12, fontweight='bold')
    ax2.set_xlabel('Time Step (lookback)')
    ax2.set_ylabel('Attention Weight')
    ax2.grid(True, alpha=0.3, axis='y')
    
    # Add text annotation for top attended position
    top_pos = np.argmax(attn)
    ax2.annotate(f'Max: {attn[top_pos]:.3f}',
                xy=(top_pos, attn[top_pos]),
                xytext=(top_pos, attn[top_pos] + 0.002),
                fontsize=9, ha='center',
                bbox=dict(boxstyle='round,pad=0.3', facecolor='yellow', alpha=0.7))

plt.tight_layout()
plt.show()

print("\n=== Attention Interpretation ===")
print("The visualizations show which historical time steps the model 'pays attention to'")
print("when making predictions. Larger points and higher bars indicate greater importance.")
print("\nWhat to look for:")
print("- Whether recent time steps receive more attention than distant ones")
print("- Whether peaks line up with seasonal lags (e.g., 24 steps back)")
print("- This model has no positional encoding, so its weights depend only on the values in the window")

In [None]:
## Part 5: Evaluate Model and Visualize Results
# Requires `model` and `losses` from Part 4 and `X_test`, `y_test` from Part 3

# Make predictions on test set
y_pred_test, attention_weights_test = model.predict(X_test, return_attention=True)

# Calculate test metrics
mse_test = np.mean((y_pred_test.flatten() - y_test.flatten()) ** 2)
rmse_test = np.sqrt(mse_test)
mae_test = np.mean(np.abs(y_pred_test.flatten() - y_test.flatten()))

print("=== Model Evaluation ===\n")
print(f"Test MSE: {mse_test:.6f}")
print(f"Test RMSE: {rmse_test:.6f}")
print(f"Test MAE: {mae_test:.6f}")

# Visualize training loss
plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
plt.plot(losses, color='blue', linewidth=2)
plt.title('Training Loss Over Time', fontsize=14, fontweight='bold')
plt.xlabel('Epoch')
plt.ylabel('Loss (MSE)')
plt.grid(True, alpha=0.3)

# Visualize predictions vs actual
plt.subplot(1, 2, 2)
n_plot = 100
plt.plot(y_test[:n_plot], label='Actual', color='green', linewidth=2, alpha=0.7)
plt.plot(y_pred_test[:n_plot], label='Predicted', color='red', linewidth=2, alpha=0.7, linestyle='--')
plt.title('Predictions vs Actual (First 100 test samples)', fontsize=14, fontweight='bold')
plt.xlabel('Sample Index')
plt.ylabel('Normalized Value')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
## Part 4: Build Attention-Based Time Series Model
# Requires `seq_length`, `X_train`, and `y_train` from Part 3

class TemporalAttentionModel:
    """
    A simple attention-based forecasting model for time series.
    Uses attention to weight historical time steps for prediction.
    """
    
    def __init__(self, seq_length, d_model=8, learning_rate=0.01):
        """
        Args:
            seq_length: Length of input sequences
            d_model: Dimension of the model
            learning_rate: Learning rate for gradient descent
        """
        self.seq_length = seq_length
        self.d_model = d_model
        self.learning_rate = learning_rate
        
        # Initialize parameters (simplified for demonstration)
        # In practice, these would be more sophisticated
        self.W_q = np.random.randn(1, d_model) * 0.1
        self.W_k = np.random.randn(1, d_model) * 0.1
        self.W_v = np.random.randn(1, d_model) * 0.1
        self.W_out = np.random.randn(d_model, 1) * 0.1
        
        self.attention_weights_history = []
        
    def forward(self, X, return_attention=False):
        """
        Forward pass through the model.
        
        Args:
            X: Input sequences (batch_size, seq_length)
            return_attention: Whether to return attention weights
            
        Returns:
            predictions: Forecasted values
            attention_weights: (optional) Attention weights
        """
        batch_size = X.shape[0]
        
        # Reshape input for matrix operations
        X_reshaped = X.reshape(batch_size, self.seq_length, 1)
        
        # Project inputs to Q, K, V
        Q = np.matmul(X_reshaped, self.W_q)  # (batch, seq_len, d_model)
        K = np.matmul(X_reshaped, self.W_k)
        V = np.matmul(X_reshaped, self.W_v)
        
        # Compute attention scores
        scores = np.matmul(Q, K.transpose(0, 2, 1)) / np.sqrt(self.d_model)
        attention_weights = self._softmax(scores, axis=-1)
        
        # Apply attention to values
        attended = np.matmul(attention_weights, V)
        
        # Pool across sequence dimension (simple mean)
        pooled = np.mean(attended, axis=1)  # (batch, d_model)
        
        # Output projection
        predictions = np.matmul(pooled, self.W_out)  # (batch, 1)
        
        if return_attention:
            return predictions, attention_weights
        return predictions
    
    def _softmax(self, x, axis=-1):
        """Numerically stable softmax"""
        exp_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
        return exp_x / np.sum(exp_x, axis=axis, keepdims=True)
    
    def train(self, X_train, y_train, epochs=50, batch_size=32, verbose=True):
        """
        Train the model using gradient descent.
        
        Args:
            X_train: Training sequences
            y_train: Training targets
            epochs: Number of training epochs
            batch_size: Batch size for training
            verbose: Whether to print progress
        """
        n_samples = X_train.shape[0]
        losses = []
        
        for epoch in range(epochs):
            epoch_loss = 0
            n_batches = 0
            
            # Shuffle data
            indices = np.random.permutation(n_samples)
            X_shuffled = X_train[indices]
            y_shuffled = y_train[indices]
            
            # Mini-batch training
            for i in range(0, n_samples, batch_size):
                X_batch = X_shuffled[i:i+batch_size]
                y_batch = y_shuffled[i:i+batch_size].reshape(-1, 1)
                
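                # Forward pass (keep the attention weights to rebuild the pooled features)
                predictions, attn = self.forward(X_batch, return_attention=True)
                V = np.matmul(X_batch.reshape(len(X_batch), self.seq_length, 1), self.W_v)
                pooled = np.mean(np.matmul(attn, V), axis=1)  # (batch, d_model)
                
                # Compute loss (MSE)
                loss = np.mean((predictions - y_batch) ** 2)
                epoch_loss += loss
                n_batches += 1
                
                # Backward pass: exact MSE gradient for the output projection only.
                # W_q, W_k, and W_v stay at their random initial values in this demo;
                # a full implementation would backpropagate through the attention too.
                error = predictions - y_batch
                grad_W_out = np.matmul(pooled.T, 2 * error / len(X_batch))  # (d_model, 1)
                
                # Gradient descent step
                self.W_out -= self.learning_rate * grad_W_out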
            
            avg_loss = epoch_loss / n_batches
            losses.append(avg_loss)
            
            if verbose and (epoch + 1) % 10 == 0:
                print(f"Epoch {epoch+1}/{epochs}, Loss: {avg_loss:.6f}")
        
        return losses
    
    def predict(self, X, return_attention=False):
        """Make predictions on new data"""
        return self.forward(X, return_attention=return_attention)

# Initialize and train the model
print("=== Training Attention-Based Time Series Model ===\n")

model = TemporalAttentionModel(seq_length=seq_length, d_model=8, learning_rate=0.01)

# Train the model
losses = model.train(X_train, y_train, epochs=50, batch_size=32, verbose=True)

print("\nTraining complete!")

In [None]:
## Part 3: Prepare Data for Attention Model

def create_sequences(data, seq_length, forecast_horizon):
    """
    Create input sequences and target values for time series forecasting.
    
    Args:
        data: Array of time series values
        seq_length: Length of input sequence (lookback window)
        forecast_horizon: Number of steps to predict ahead
        
    Returns:
        X: Input sequences
        y: Target values
    """
    X, y = [], []
    
    for i in range(len(data) - seq_length - forecast_horizon + 1):
        # Input: sequence of length seq_length
        X.append(data[i:i + seq_length])
        # Target: next value(s) after the sequence
        y.append(data[i + seq_length:i + seq_length + forecast_horizon])
    
    return np.array(X), np.array(y)

# Prepare data
seq_length = 50  # Use 50 time steps to predict the next values
forecast_horizon = 1  # Predict 1 step ahead

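# Normalize the data. Fit the scaler on the first 80% of the series only,
# so that test-period statistics do not leak into the training inputs.
scaler = StandardScaler()
values = ts_data['value'].values.reshape(-1, 1)
scaler.fit(values[:int(0.8 * len(values))])
values_normalized = scaler.transform(values).flatten()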

# Create sequences
X, y = create_sequences(values_normalized, seq_length, forecast_horizon)

print(f"Sequence preparation complete:")
print(f"X shape: {X.shape} (samples, sequence_length)")
print(f"y shape: {y.shape} (samples, forecast_horizon)")
print(f"\nExample input sequence (first 10 values): {X[0][:10].round(3)}")
print(f"Corresponding target: {y[0].round(3)}")

# Split into train and test sets
train_size = int(0.8 * len(X))
X_train, X_test = X[:train_size], X[train_size:]
y_train, y_test = y[:train_size], y[train_size:]

print(f"\nTrain set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")

## Hands-On Activity: Compare Attention vs Simple Baseline

Now it's your turn! In this activity, you'll:

1. **Implement a simple baseline model** (a moving average of the most recent values)
2. **Compare performance** with the attention-based model
3. **Experiment with hyperparameters** to improve the attention model
4. **Analyze attention patterns** for different types of time series

### Task 1: Create a Simple Baseline

Implement a naive baseline that predicts the next value as the average of the last N values.
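
The simplest special case is N = 1, often called the persistence forecast: predict that the next value equals the last observed one. The sketch below uses hypothetical helper names and a tiny hand-made array; it assumes inputs shaped like the `(samples, seq_length)` windows used in this lesson.

```python
import numpy as np

def persistence_baseline(X):
    """Predict the next value as the last observed value in each window."""
    return X[:, -1]

def mae(y_true, y_pred):
    """Mean absolute error after flattening both inputs."""
    return np.mean(np.abs(np.ravel(y_true) - np.ravel(y_pred)))

X_demo = np.array([[1.0, 2.0, 3.0],
                   [2.0, 3.0, 4.0]])
y_demo = np.array([[4.0], [5.0]])

print(mae(y_demo, persistence_baseline(X_demo)))  # 1.0
```

On strongly autocorrelated series, persistence can be surprisingly hard to beat, so it is a useful sanity check for any learned model.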

In [None]:
## Activity Solution: Baseline Comparison

# Task 1: Simple Moving Average Baseline
def moving_average_baseline(X, window=5):
    """
    Predict next value as average of last 'window' values.
    
    Args:
        X: Input sequences (n_samples, seq_length)
        window: Number of recent values to average
        
    Returns:
        predictions: Predicted next values
    """
    predictions = np.mean(X[:, -window:], axis=1)
    return predictions

# Task 2: Evaluate baseline
baseline_predictions = moving_average_baseline(X_test, window=10)

baseline_mse = np.mean((baseline_predictions - y_test.flatten()) ** 2)
baseline_rmse = np.sqrt(baseline_mse)
baseline_mae = np.mean(np.abs(baseline_predictions - y_test.flatten()))

print("=== Baseline vs Attention Model Comparison ===\n")
print(f"{'Metric':<15} {'Baseline':<15} {'Attention':<15} {'Improvement':<15}")
print("-" * 60)
print(f"{'MSE':<15} {baseline_mse:<15.6f} {mse_test:<15.6f} {(baseline_mse - mse_test)/baseline_mse*100:>13.2f}%")
print(f"{'RMSE':<15} {baseline_rmse:<15.6f} {rmse_test:<15.6f} {(baseline_rmse - rmse_test)/baseline_rmse*100:>13.2f}%")
print(f"{'MAE':<15} {baseline_mae:<15.6f} {mae_test:<15.6f} {(baseline_mae - mae_test)/baseline_mae*100:>13.2f}%")

# Task 3: Visualize comparison
plt.figure(figsize=(14, 5))

plt.subplot(1, 2, 1)
n_plot = 100
plt.plot(y_test[:n_plot], label='Actual', color='green', linewidth=2.5, alpha=0.8)
plt.plot(baseline_predictions[:n_plot], label='Baseline (Moving Avg)', 
         color='orange', linewidth=2, alpha=0.7, linestyle='-.')
plt.plot(y_pred_test[:n_plot], label='Attention Model', 
         color='red', linewidth=2, alpha=0.7, linestyle='--')
plt.title('Model Comparison: First 100 Test Samples', fontsize=14, fontweight='bold')
plt.xlabel('Sample Index')
plt.ylabel('Normalized Value')
plt.legend(loc='best')
plt.grid(True, alpha=0.3)

# Plot error distributions
plt.subplot(1, 2, 2)
baseline_errors = np.abs(baseline_predictions - y_test.flatten())
attention_errors = np.abs(y_pred_test.flatten() - y_test.flatten())

plt.hist(baseline_errors, bins=30, alpha=0.6, label='Baseline Errors', color='orange', edgecolor='black')
plt.hist(attention_errors, bins=30, alpha=0.6, label='Attention Errors', color='red', edgecolor='black')
plt.title('Error Distribution Comparison', fontsize=14, fontweight='bold')
plt.xlabel('Absolute Error')
plt.ylabel('Frequency')
plt.legend()
plt.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

print("\n✅ Activity Complete!")
print("\nKey Insights:")
print("1. Always compare against a simple baseline before trusting a more complex model")
print("2. Attention provides interpretability through weight visualization")
print("3. This demo trains only the output projection, so its attention pattern is not learned")
print("\n💡 Try experimenting with:")
print("- Different sequence lengths (seq_length)")
print("- Model dimensions (d_model)")
print("- Training epochs and learning rates")
print("- Adding positional encoding to the input sequences")

## Key Takeaways

### Core Concepts

1. **Attention Mechanism**: A powerful technique that computes weighted sums of values based on learned relevance scores, allowing models to focus on the most important parts of the input sequence.

2. **Self-Attention for Time Series**: By computing queries, keys, and values from the same sequence, self-attention enables each time step to directly interact with all other time steps, capturing long-range dependencies without sequential processing.

3. **Scaled Dot-Product Attention**: The most common attention variant uses the formula:
   $$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
   Dividing by $\sqrt{d_k}$ keeps the variance of the scores roughly independent of $d_k$, which prevents softmax saturation in high dimensions.
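
A quick numerical check makes the role of the scaling factor visible. The sketch assumes query and key components are independent standard normal values, which is an idealization of real learned projections.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pairs = 2000
score_std = {}  # d_k -> (std of raw scores, std of scaled scores)

for d_k in (4, 64, 1024):
    q = rng.normal(size=(n_pairs, d_k))
    k = rng.normal(size=(n_pairs, d_k))
    raw = np.sum(q * k, axis=1)        # unscaled dot products
    scaled = raw / np.sqrt(d_k)        # scaled dot products
    score_std[d_k] = (raw.std(), scaled.std())
    print(f"d_k={d_k:5d}  raw std={raw.std():6.2f}  scaled std={scaled.std():5.2f}")
```

Under this assumption the raw scores spread out roughly in proportion to $\sqrt{d_k}$, while the scaled scores stay near unit spread, so the softmax does not collapse onto a single position as the dimension grows.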

### Advantages of Attention for Time Series

- **Interpretability**: Attention weights reveal which historical points influenced each prediction, providing transparency in model decisions
- **Long-range dependencies**: Direct connections between all time steps overcome the limitations of sequential processing
- **Flexibility**: With masking and explicit time features, can accommodate irregular sampling, missing data, and variable-length sequences
- **Parallel computation**: Unlike RNNs, attention can be computed in parallel across all positions

### Practical Considerations

- **Computational complexity**: Standard attention is $O(T^2)$ where $T$ is sequence length. For very long sequences, consider sparse attention or linear attention variants.
- **Positional information**: Self-attention without positional encodings is permutation-equivariant (shuffling the inputs just shuffles the outputs), so positional encodings are crucial for maintaining temporal order.
- **Causality**: In forecasting, use masked attention to prevent the model from "seeing" future values during training.
- **Multi-head attention**: Using multiple attention heads with different learned projections captures diverse temporal patterns.
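
One common way to supply positional information is the sinusoidal encoding from the original Transformer paper, added to the input embeddings. The sketch below is a minimal NumPy version; for time series, learned embeddings or calendar features are also widely used, so treat this as one option rather than the required choice.

```python
import numpy as np

def sinusoidal_positional_encoding(T, d_model):
    """Return a (T, d_model) array of sinusoidal position encodings (d_model must be even)."""
    assert d_model % 2 == 0, "d_model must be even"
    pos = np.arange(T)[:, None]                    # (T, 1)
    i = np.arange(d_model // 2)[None, :]           # (1, d_model / 2)
    angles = pos / (10000 ** (2 * i / d_model))    # one frequency per sin/cos pair
    pe = np.zeros((T, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even columns
    pe[:, 1::2] = np.cos(angles)                   # odd columns
    return pe

pe = sinusoidal_positional_encoding(T=50, d_model=8)
print(pe.shape)        # (50, 8)
print(pe[0])           # position 0: sin terms are 0, cos terms are 1
```

Because every position receives a distinct vector, two time steps with identical values are no longer interchangeable once the encoding is added.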

### When to Use Attention for Time Series

**Good fit:**
- Long sequences where distant past matters (e.g., yearly seasonality in daily data)
- Multi-variate time series with complex interactions
- When interpretability is important
- Irregular or sparse time series

**Consider alternatives:**
- Very short sequences (simple models may suffice)
- Limited training data (attention models have many parameters)
- Real-time constraints with extremely long sequences (computational cost)

### Connection to Temporal Fusion Transformers

TFT extends basic attention with:
- **Variable selection networks** for automatic feature selection
- **Static + temporal encoders** for time-invariant and time-varying features
- **Quantile forecasting** for uncertainty estimation
- **Interpretable multi-head attention** that shares value projections across heads, combined with gating mechanisms

When it was published, TFT reported state-of-the-art results on several forecasting benchmarks, and it demonstrates the power of combining attention with domain-specific architectural choices.
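
The gating idea can be illustrated with a stripped-down Gated Residual Network. This sketch follows the general form described by Lim et al. (a dense layer with ELU, a second dense layer, a gated linear unit, then a residual connection with layer normalization), but it omits the optional context input, dropout, and the learned layer-norm scale and shift. The parameter names are hypothetical and the weights are random.

```python
import numpy as np

def elu(x):
    return np.where(x > 0, x, np.exp(np.minimum(x, 0.0)) - 1.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def grn(a, p):
    """Simplified Gated Residual Network. a: (n, d); p: dict of (d, d) weights and (d,) biases."""
    eta2 = elu(a @ p["W2"] + p["b2"])              # nonlinear transform
    eta1 = eta2 @ p["W1"] + p["b1"]                # linear transform
    gate = sigmoid(eta1 @ p["W4"] + p["b4"])       # GLU gate in (0, 1)
    glu = gate * (eta1 @ p["W5"] + p["b5"])        # gated contribution
    return layer_norm(a + glu)                     # residual connection + normalization

rng = np.random.default_rng(0)
d = 4
params = {name: rng.normal(size=(d, d)) * 0.5 for name in ("W1", "W2", "W4", "W5")}
params.update({name: np.zeros(d) for name in ("b1", "b2", "b4", "b5")})

a = rng.normal(size=(3, d))
print(grn(a, params).shape)  # (3, 4)
```

When the gate saturates near zero, the block reduces to a normalized copy of its input, which is how the network can skip nonlinear processing it does not need.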

## Further Resources

### Foundational Papers

- **[Attention Is All You Need (Vaswani et al., 2017)](https://arxiv.org/abs/1706.03762)** - The original Transformer paper that introduced scaled dot-product attention and multi-head attention. While focused on NLP, the mechanisms apply directly to time series.

- **[Temporal Fusion Transformers (Lim et al., 2021)](https://arxiv.org/abs/1912.09363)** - State-of-the-art attention-based architecture specifically designed for time series forecasting with interpretability.

- **[Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting (Zhou et al., 2021)](https://arxiv.org/abs/2012.07436)** - Addresses the computational challenges of attention for very long sequences with ProbSparse self-attention.

### Implementations and Libraries

- **[PyTorch Temporal Fusion Transformer](https://pytorch-forecasting.readthedocs.io/en/latest/models.html)** - Production-ready implementation of TFT in the pytorch-forecasting library.

- **[Hugging Face Transformers](https://huggingface.co/docs/transformers/index)** - While primarily for NLP, contains reusable attention components that can be adapted for time series.

- **[GluonTS](https://ts.gluon.ai/)** - Amazon's toolkit for probabilistic time series modeling, includes Transformer-based forecasting models.

### Tutorials and Guides

- **[The Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/)** - Excellent visual guide to understanding attention mechanisms and the Transformer architecture.

- **[Time Series Forecasting (TensorFlow Tutorial)](https://www.tensorflow.org/tutorials/structured_data/time_series)** - Official TensorFlow tutorial on windowing time series data and forecasting with linear, dense, convolutional, and recurrent models; useful background before moving to attention-based models.

- **[Attention Mechanisms in Deep Learning (d2l.ai)](https://d2l.ai/chapter_attention-mechanisms/)** - Comprehensive textbook chapter with interactive examples.

### Advanced Topics

- **[Efficient Transformers: A Survey (Tay et al., 2020)](https://arxiv.org/abs/2009.06732)** - Survey of techniques to reduce the $O(T^2)$ complexity of standard attention.

- **[Deep Transformer Models for Time Series Forecasting: The Influenza Prevalence Case (Wu et al., 2020)](https://arxiv.org/abs/2001.08317)** - Applies a Transformer-based model to forecasting influenza-like illness rates.

- **[Enhancing the Locality and Breaking the Memory Bottleneck of Transformer on Time Series Forecasting (Li et al., 2019)](https://arxiv.org/abs/1907.00235)** - Proposes convolutional self-attention and LogSparse attention for long time series.

### Tools and Datasets

- **[Monash Time Series Forecasting Archive](https://forecastingdata.org/)** - Large collection of time series datasets for benchmarking forecasting models.

- **[Weights & Biases](https://wandb.ai/)** - Experiment tracking and visualization tool, excellent for hyperparameter tuning of attention models.

- **[Optuna](https://optuna.org/)** - Hyperparameter optimization framework useful for tuning complex models like TFT.

### Next Steps

1. **Implement positional encoding** to better preserve temporal order information
2. **Experiment with multi-head attention** to capture different temporal patterns simultaneously
3. **Try TFT on real datasets** like electricity demand or retail sales forecasting
4. **Explore sparse attention** variants for handling very long sequences efficiently
5. **Study hybrid models** that combine attention with RNNs or CNNs for enhanced performance