# Week 15 - Day 4: Temporal Transformers for Financial Time Series

## Learning Objectives
- Understand time series specific transformer modifications
- Implement Temporal Fusion Transformer (TFT) concepts
- Explore interpretability mechanisms in transformers
- Build a practical trading prediction system

---

## Why Temporal Transformers?

Standard transformers were designed for NLP, where token order is the only temporal structure and a positional encoding is enough to represent it. Financial time series add further demands:
- **Irregular temporal patterns** (weekends, holidays, varying volatility regimes)
- **Multi-horizon forecasting needs** (predict 1-day, 5-day, 20-day ahead)
- **Multiple input types** (static features, known future inputs, observed values)
- **Need for interpretability** (understanding why models make predictions)

The **Temporal Fusion Transformer (TFT)** addresses these challenges with specialized components.

In [None]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import yfinance as yf
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')

# PyTorch imports
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
import torch.optim as optim

# Sklearn for preprocessing
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Set random seeds for reproducibility
np.random.seed(42)
torch.manual_seed(42)

# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# Plot settings
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

---

## Part 1: Time Series Specific Transformer Modifications

### 1.1 Temporal Positional Encoding

Unlike standard positional encoding, temporal encoding captures:
- **Time-of-day patterns** (market open/close effects)
- **Day-of-week patterns** (Monday effect, Friday positioning)
- **Seasonal patterns** (earnings seasons, year-end rebalancing)
- **Distance-aware encoding** (how far apart observations are in time)
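
As a minimal sketch of where such calendar features come from, the snippet below derives zero-based day-of-week and month indices from plain dates, in the ranges that a 7-entry and a 12-entry embedding table expect. The helper name `calendar_indices` is hypothetical, and it assumes every weekday is a trading day (exchange holidays are ignored), which is a simplification.

```python
from datetime import date, timedelta

def calendar_indices(start, n_days):
    """Return zero-based (day_of_week, month) index lists for `n_days` weekdays.

    Hypothetical helper: weekends are skipped, exchange holidays are not.
    """
    day_of_week, month = [], []
    current = start
    while len(day_of_week) < n_days:
        if current.weekday() < 5:                 # Monday=0 ... Friday=4
            day_of_week.append(current.weekday())
            month.append(current.month - 1)       # calendar 1-12 -> index 0-11
        current += timedelta(days=1)
    return day_of_week, month

# 2024-01-01 is a Monday, so seven weekdays wrap into the next week.
dow, month = calendar_indices(date(2024, 1, 1), 7)
print(dow)    # [0, 1, 2, 3, 4, 0, 1]
print(month)  # [0, 0, 0, 0, 0, 0, 0]
```

Converted to integer tensors with a batch dimension, lists like these could serve as the `day_of_week` and `month` arguments of an embedding-based encoder such as `LearnedTemporalEncoding` in this section.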

In [None]:
class TemporalPositionalEncoding(nn.Module):
    """
    Enhanced positional encoding for time series.
    Combines standard sinusoidal encoding with temporal features.
    """
    def __init__(self, d_model, max_len=5000, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)
        self.d_model = d_model
        
        # Standard sinusoidal positional encoding
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-np.log(10000.0) / d_model))
        
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)  # (1, max_len, d_model)
        
        self.register_buffer('pe', pe)
        
    def forward(self, x):
        """
        Args:
            x: Tensor of shape (batch_size, seq_len, d_model)
        """
        x = x + self.pe[:, :x.size(1), :]
        return self.dropout(x)


class LearnedTemporalEncoding(nn.Module):
    """
    Learned positional encoding that can capture complex temporal patterns.
    Better for financial time series with irregular patterns.
    """
    def __init__(self, d_model, max_len=5000, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)
        
        # Learnable position embeddings
        self.position_embedding = nn.Embedding(max_len, d_model)
        
        # Temporal feature embeddings (day of week, month, etc.)
        self.day_of_week_embedding = nn.Embedding(7, d_model // 4)
        self.month_embedding = nn.Embedding(12, d_model // 4)
        
        # Projection to combine temporal features
        self.temporal_projection = nn.Linear(d_model // 2, d_model)
        
    def forward(self, x, day_of_week=None, month=None):
        """
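        Args:
            x: Input tensor (batch_size, seq_len, d_model)
            day_of_week: Zero-based day indices in [0, 6], shape (batch_size, seq_len)
            month: Zero-based month indices in [0, 11], shape (batch_size, seq_len);
                subtract 1 from calendar months (1-12) before passing them in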
        """
        batch_size, seq_len = x.size(0), x.size(1)
        
        # Position encoding
        positions = torch.arange(seq_len, device=x.device).unsqueeze(0).expand(batch_size, -1)
        pos_encoding = self.position_embedding(positions)
        
        # Add temporal features if provided
        if day_of_week is not None and month is not None:
            dow_emb = self.day_of_week_embedding(day_of_week)
            month_emb = self.month_embedding(month)
            temporal_features = torch.cat([dow_emb, month_emb], dim=-1)
            temporal_encoding = self.temporal_projection(temporal_features)
            pos_encoding = pos_encoding + temporal_encoding
        
        return self.dropout(x + pos_encoding)


# Demonstrate positional encodings
d_model = 64
seq_len = 100

# Create sample input
x = torch.randn(2, seq_len, d_model)

# Test standard encoding
standard_pe = TemporalPositionalEncoding(d_model)
x_encoded = standard_pe(x)
print(f"Input shape: {x.shape}")
print(f"Encoded shape: {x_encoded.shape}")

# Visualize positional encoding patterns
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot sinusoidal patterns
pe_values = standard_pe.pe[0, :50, :16].numpy()
im = axes[0].imshow(pe_values.T, aspect='auto', cmap='RdBu')
axes[0].set_xlabel('Position')
axes[0].set_ylabel('Encoding Dimension')
axes[0].set_title('Sinusoidal Positional Encoding')
plt.colorbar(im, ax=axes[0])

# Plot encoding values at different positions
positions = [0, 10, 25, 49]
for pos in positions:
    axes[1].plot(pe_values[pos, :], label=f'Position {pos}', alpha=0.8)
axes[1].set_xlabel('Encoding Dimension')
axes[1].set_ylabel('Value')
axes[1].set_title('Encoding Values at Different Positions')
axes[1].legend()

plt.tight_layout()
plt.show()
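
The sinusoidal table built in `TemporalPositionalEncoding` follows the standard closed form, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). The sketch below checks that the `exp`/`log` expression used for `div_term` is the same frequency written differently. The function names are hypothetical and only the standard library is used.

```python
import math

def sinusoidal_pe(pos, dim, d_model):
    """Closed-form sinusoidal encoding value at position `pos`, dimension `dim`."""
    pair = dim // 2                                   # dims 2i and 2i+1 share a frequency
    angle = pos / (10000.0 ** (2 * pair / d_model))
    return math.sin(angle) if dim % 2 == 0 else math.cos(angle)

def div_term(pair, d_model):
    """The same frequency in the exp/log form used by the class above."""
    return math.exp((2 * pair) * (-math.log(10000.0) / d_model))

d_model = 64
# Both forms agree up to floating-point rounding.
closed_form = 1.0 / (10000.0 ** (2 * 3 / d_model))
print(abs(div_term(3, d_model) - closed_form) < 1e-12)   # True
# At position 0 every sine channel is 0 and every cosine channel is 1.
print(sinusoidal_pe(0, 0, d_model), sinusoidal_pe(0, 1, d_model))  # 0.0 1.0
```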

### 1.2 Causal (Autoregressive) Attention Mask

For time series prediction, we must ensure the model cannot "look into the future".
This is achieved through **causal masking** in the attention mechanism.
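
To see why an additive `-inf` mask is enough, here is a small standard-library sketch of a single attention row. Scores for future keys are set to `-inf`, and the softmax then assigns them exactly zero weight. The function name `causal_softmax_row` and the score values are made up for illustration.

```python
import math

def causal_softmax_row(scores, query_pos):
    """Softmax over one query's scores with every key after `query_pos` masked to -inf."""
    masked = [s if j <= query_pos else float("-inf") for j, s in enumerate(scores)]
    peak = max(masked)                            # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in masked]   # math.exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

# Query at position 1 of a 4-step sequence: keys 2 and 3 lie in the future.
weights = causal_softmax_row([0.5, 1.0, 2.0, 3.0], query_pos=1)
print(weights)  # the last two entries are exactly 0.0
```

The large raw scores at positions 2 and 3 have no influence on the result, which is the property that prevents look-ahead leakage.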

In [None]:
class CausalSelfAttention(nn.Module):
    """
    Causal (masked) self-attention for autoregressive time series modeling.
    Each position can only attend to previous positions.
    """
    def __init__(self, d_model, n_heads, dropout=0.1):
        super().__init__()
        assert d_model % n_heads == 0
        
        self.d_model = d_model
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads
        
        # Q, K, V projections
        self.query = nn.Linear(d_model, d_model)
        self.key = nn.Linear(d_model, d_model)
        self.value = nn.Linear(d_model, d_model)
        
        # Output projection
        self.out_proj = nn.Linear(d_model, d_model)
        
        self.dropout = nn.Dropout(dropout)
        self.scale = self.head_dim ** -0.5
        
    def generate_causal_mask(self, seq_len, device):
        """Generate an additive mask: 0 on/below the diagonal, -inf above it (future positions)."""
        mask = torch.triu(torch.ones(seq_len, seq_len, device=device), diagonal=1)
        mask = mask.masked_fill(mask == 1, float('-inf'))
        return mask
        
    def forward(self, x, return_attention=False):
        """
        Args:
            x: Input tensor (batch_size, seq_len, d_model)
            return_attention: Whether to return attention weights
        """
        batch_size, seq_len, _ = x.size()
        
        # Linear projections
        Q = self.query(x)
        K = self.key(x)
        V = self.value(x)
        
        # Reshape for multi-head attention
        Q = Q.view(batch_size, seq_len, self.n_heads, self.head_dim).transpose(1, 2)
        K = K.view(batch_size, seq_len, self.n_heads, self.head_dim).transpose(1, 2)
        V = V.view(batch_size, seq_len, self.n_heads, self.head_dim).transpose(1, 2)
        
        # Scaled dot-product attention with causal mask
        attention_scores = torch.matmul(Q, K.transpose(-2, -1)) * self.scale
        
        # Apply causal mask
        causal_mask = self.generate_causal_mask(seq_len, x.device)
        attention_scores = attention_scores + causal_mask
        
        attention_weights = F.softmax(attention_scores, dim=-1)
        attention_weights = self.dropout(attention_weights)
        
        # Apply attention to values
        attention_output = torch.matmul(attention_weights, V)
        
        # Reshape and project
        attention_output = attention_output.transpose(1, 2).contiguous()
        attention_output = attention_output.view(batch_size, seq_len, self.d_model)
        output = self.out_proj(attention_output)
        
        if return_attention:
            return output, attention_weights
        return output


# Visualize causal mask
seq_len = 10
causal_attn = CausalSelfAttention(d_model=64, n_heads=4)
mask = causal_attn.generate_causal_mask(seq_len, device='cpu')

# Convert for visualization (replace -inf with a finite -10 so the heatmap can render it)
mask_viz = mask.clone()
mask_viz[mask_viz == float('-inf')] = -10

plt.figure(figsize=(8, 6))
sns.heatmap(mask_viz.numpy(), annot=True, fmt='.0f', cmap='RdYlGn_r',
            xticklabels=[f't-{seq_len-1-i}' for i in range(seq_len)],
            yticklabels=[f't-{seq_len-1-i}' for i in range(seq_len)])
plt.title('Causal Attention Mask\n(0 = can attend, -10 = blocked; stands in for -inf)')
plt.xlabel('Key Positions')
plt.ylabel('Query Positions')
plt.show()

print("\nCausal mask ensures each position only attends to current and past positions.")
print("This is critical for time series to prevent information leakage.")

---

## Part 2: Temporal Fusion Transformer (TFT) Concepts

The TFT architecture (Lim et al., 2019) introduces several key innovations:

1. **Variable Selection Networks** - Learn which inputs are important
2. **Gated Residual Networks (GRN)** - Flexible nonlinear processing
3. **Interpretable Multi-Head Attention** - Understand temporal patterns
4. **Quantile Outputs** - Probabilistic predictions

### 2.1 Gated Residual Network (GRN)

The GRN is a fundamental building block that provides:
- Skip connections for gradient flow
- Gating mechanism to control information flow
- Flexibility to learn complex patterns
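
For reference, the GRN computation can be written compactly as follows. This is a sketch in the notation of the TFT paper as best recalled here: $a$ is the input, $c$ the optional context, and the symbol names are not used in the code.

$$
\begin{aligned}
\eta_2 &= \mathrm{ELU}(W_2\, a + W_3\, c + b_2) \\
\eta_1 &= W_1\, \eta_2 + b_1 \\
\mathrm{GLU}(\gamma) &= \sigma(W_4\, \gamma + b_4) \odot (W_5\, \gamma + b_5) \\
\mathrm{GRN}(a, c) &= \mathrm{LayerNorm}\big(a + \mathrm{GLU}(\eta_1)\big)
\end{aligned}
$$

In the implementation in this section, `fc1` plays the role of $W_2$, `context_projection` of $W_3$, `fc2` of $W_1$, and the single linear layer inside `GatedLinearUnit` packs $W_4$ and $W_5$ into one matrix that is split with `chunk`. When the input and output widths differ, $a$ is first passed through `skip_projection`.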

In [None]:
class GatedLinearUnit(nn.Module):
    """Gated Linear Unit (GLU) for controlled information flow."""
    def __init__(self, input_dim, output_dim):
        super().__init__()
        self.linear = nn.Linear(input_dim, output_dim * 2)
        
    def forward(self, x):
        x = self.linear(x)
        x, gate = x.chunk(2, dim=-1)
        return x * torch.sigmoid(gate)


class GatedResidualNetwork(nn.Module):
    """
    Gated Residual Network from TFT paper.
    Provides flexible nonlinear processing with skip connections.
    """
    def __init__(self, input_dim, hidden_dim, output_dim, dropout=0.1, context_dim=None):
        super().__init__()
        
        self.input_dim = input_dim
        self.hidden_dim = hidden_dim
        self.output_dim = output_dim
        
        # Primary transformation
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        
        # Context integration (optional)
        self.context_projection = None
        if context_dim is not None:
            self.context_projection = nn.Linear(context_dim, hidden_dim, bias=False)
        
        self.fc2 = nn.Linear(hidden_dim, output_dim)
        
        # Gating mechanism
        self.gate = GatedLinearUnit(output_dim, output_dim)
        
        # Layer normalization
        self.layer_norm = nn.LayerNorm(output_dim)
        
        # Skip connection projection (if dimensions differ)
        self.skip_projection = None
        if input_dim != output_dim:
            self.skip_projection = nn.Linear(input_dim, output_dim)
            
        self.dropout = nn.Dropout(dropout)
        self.elu = nn.ELU()
        
    def forward(self, x, context=None):
        """
        Args:
            x: Input tensor
            context: Optional context tensor for conditioning
        """
        # Skip connection
        residual = x if self.skip_projection is None else self.skip_projection(x)
        
        # Primary transformation
        hidden = self.fc1(x)
        
        # Add context if provided
        if context is not None and self.context_projection is not None:
            hidden = hidden + self.context_projection(context)
            
        hidden = self.elu(hidden)
        hidden = self.fc2(hidden)
        hidden = self.dropout(hidden)
        
        # Gated skip connection
        gated = self.gate(hidden)
        output = self.layer_norm(residual + gated)
        
        return output


# Test GRN
grn = GatedResidualNetwork(input_dim=32, hidden_dim=64, output_dim=32)
x = torch.randn(4, 10, 32)  # (batch, seq_len, features)
output = grn(x)
print(f"GRN Input: {x.shape} -> Output: {output.shape}")

### 2.2 Variable Selection Network (VSN)

The VSN learns to weight different input features based on their importance.
This provides **interpretability** by showing which features the model relies on.
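
The core arithmetic of variable selection is a softmax over per-variable scores followed by a weighted sum of per-variable encodings. The following standard-library sketch shows only that step. The names `softmax` and `select_variables` and all numbers are illustrative assumptions, and the learned GRNs that would produce the scores and encodings are left out.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def select_variables(var_vectors, scores):
    """Blend per-variable vectors using softmax weights over `scores`."""
    weights = softmax(scores)
    dim = len(var_vectors[0])
    combined = [
        sum(w * vec[k] for w, vec in zip(weights, var_vectors))
        for k in range(dim)
    ]
    return combined, weights

# Three variables, each already encoded as a 2-dimensional vector (made-up numbers).
var_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
combined, weights = select_variables(var_vectors, scores=[2.0, 0.0, 0.0])
print(weights)   # the first variable receives the largest weight
print(combined)
```

Because the weights are non-negative and sum to one, they can be read directly as relative importances of the variables at that time step.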

In [None]:
class VariableSelectionNetwork(nn.Module):
    """
    Variable Selection Network from TFT.
    Learns to weight different input variables based on their relevance.
    """
    def __init__(self, input_dim, num_vars, hidden_dim, dropout=0.1, context_dim=None):
        super().__init__()
        
        self.num_vars = num_vars
        self.hidden_dim = hidden_dim
        
        # Individual variable processing
        self.var_grns = nn.ModuleList([
            GatedResidualNetwork(
                input_dim=input_dim // num_vars,
                hidden_dim=hidden_dim,
                output_dim=hidden_dim,
                dropout=dropout
            )
            for _ in range(num_vars)
        ])
        
        # Softmax variable weights GRN
        self.flattened_grn = GatedResidualNetwork(
            input_dim=input_dim,
            hidden_dim=hidden_dim,
            output_dim=num_vars,
            dropout=dropout,
            context_dim=context_dim
        )
        
    def forward(self, x, context=None):
        """
        Args:
            x: Input tensor (batch, seq_len, num_vars * var_dim)
            context: Optional context for conditioning
            
        Returns:
            output: Weighted combination of processed variables
            weights: Variable importance weights (for interpretability)
        """
        batch_size, seq_len, _ = x.size()
        var_dim = x.size(-1) // self.num_vars
        
        # Split input into individual variables
        var_inputs = x.view(batch_size, seq_len, self.num_vars, var_dim)
        
        # Process each variable through its GRN
        var_outputs = []
        for i in range(self.num_vars):
            var_out = self.var_grns[i](var_inputs[:, :, i, :])
            var_outputs.append(var_out)
        
        # Stack processed variables: (batch, seq_len, num_vars, hidden_dim)
        var_outputs = torch.stack(var_outputs, dim=2)
        
        # Compute variable selection weights
        weights = self.flattened_grn(x, context)  # (batch, seq_len, num_vars)
        weights = F.softmax(weights, dim=-1)
        
        # Weighted combination
        weights_expanded = weights.unsqueeze(-1)  # (batch, seq_len, num_vars, 1)
        output = (var_outputs * weights_expanded).sum(dim=2)  # (batch, seq_len, hidden_dim)
        
        return output, weights


# Test Variable Selection Network
num_vars = 5
var_dim = 8
hidden_dim = 32

vsn = VariableSelectionNetwork(
    input_dim=num_vars * var_dim,
    num_vars=num_vars,
    hidden_dim=hidden_dim
)

x = torch.randn(4, 10, num_vars * var_dim)
output, weights = vsn(x)

print(f"VSN Input: {x.shape}")
print(f"VSN Output: {output.shape}")
print(f"Variable Weights: {weights.shape}")
print(f"\nSample weights (first timestep): {weights[0, 0, :].detach().numpy().round(3)}")
print(f"Weights sum to: {weights[0, 0, :].sum().item():.4f}")

### 2.3 Interpretable Multi-Head Attention

TFT uses a modified attention mechanism that enables interpretation of:
- Which past time steps are most important for prediction
- How attention patterns change across different forecasting horizons
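
One reason the averaged attention matrix is meaningful: when all heads share the same values, averaging the attention matrices first and then applying them to the values gives the same result as averaging the per-head outputs. The sketch below checks this on tiny hand-written matrices. The helper names and numbers are assumptions for illustration, not part of TFT.

```python
def matmul(a, b):
    """Plain-Python matrix product of two nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def mean_matrices(mats):
    """Element-wise mean of equally shaped matrices."""
    n = len(mats)
    rows, cols = len(mats[0]), len(mats[0][0])
    return [[sum(m[i][j] for m in mats) / n for j in range(cols)] for i in range(rows)]

# Two causal attention heads over a 2-step sequence (each row sums to 1).
heads = [
    [[1.0, 0.0], [0.5, 0.5]],
    [[1.0, 0.0], [0.2, 0.8]],
]
shared_values = [[1.0, 2.0], [3.0, 4.0]]

avg_then_apply = matmul(mean_matrices(heads), shared_values)
apply_then_avg = mean_matrices([matmul(a, shared_values) for a in heads])
print(avg_then_apply)
print(apply_then_avg)  # same values up to floating-point rounding
```

With separate value projections per head, as in standard multi-head attention, this identity no longer holds, so no single attention matrix would describe the layer's output.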

In [None]:
class InterpretableMultiHeadAttention(nn.Module):
    """
    Interpretable Multi-Head Attention from TFT.
    Shares values across heads for interpretability while keeping
    separate queries and keys for different attention patterns.
    """
    def __init__(self, d_model, n_heads, dropout=0.1):
        super().__init__()
        assert d_model % n_heads == 0
        
        self.d_model = d_model
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads
        
        # Separate Q, K projections per head
        self.query = nn.Linear(d_model, d_model)
        self.key = nn.Linear(d_model, d_model)
        
        # Shared value projection (interpretability)
        self.value = nn.Linear(d_model, self.head_dim)  # Single head for values
        
        # Output projection
        self.out_proj = nn.Linear(self.head_dim, d_model)
        
        self.dropout = nn.Dropout(dropout)
        self.scale = self.head_dim ** -0.5
        
    def forward(self, x, mask=None):
        """
        Args:
            x: Input tensor (batch, seq_len, d_model)
            mask: Optional attention mask
            
        Returns:
            output: Attention output
            attention_weights: Average attention across heads (interpretable)
        """
        batch_size, seq_len, _ = x.size()
        
        # Compute Q, K, V
        Q = self.query(x)  # (batch, seq_len, d_model)
        K = self.key(x)
        V = self.value(x)  # (batch, seq_len, head_dim) - shared across heads
        
        # Reshape Q, K for multi-head
        Q = Q.view(batch_size, seq_len, self.n_heads, self.head_dim).transpose(1, 2)
        K = K.view(batch_size, seq_len, self.n_heads, self.head_dim).transpose(1, 2)
        
        # Compute attention scores
        attention_scores = torch.matmul(Q, K.transpose(-2, -1)) * self.scale
        
        # Apply mask if provided
        if mask is not None:
            attention_scores = attention_scores + mask
            
        attention_weights = F.softmax(attention_scores, dim=-1)
        attention_weights = self.dropout(attention_weights)
        
        # Average attention across heads (for interpretability)
        avg_attention = attention_weights.mean(dim=1)  # (batch, seq_len, seq_len)
        
        # Apply the head-averaged attention to the shared values.
        # Because V is shared, this equals averaging the per-head outputs.
        attention_output = torch.matmul(avg_attention, V)  # (batch, seq_len, head_dim)
        
        # Output projection
        output = self.out_proj(attention_output)
        
        return output, avg_attention


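# Test interpretable attention
int_attn = InterpretableMultiHeadAttention(d_model=64, n_heads=4)
int_attn.eval()  # disable attention dropout so each row of weights sums to 1
x = torch.randn(2, 20, 64)
# Additive causal mask (0 = allowed, -inf = future) so each query sees past keys only
causal_mask = torch.triu(torch.full((20, 20), float('-inf')), diagonal=1)
output, attn_weights = int_attn(x, mask=causal_mask)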

print(f"Input: {x.shape}")
print(f"Output: {output.shape}")
print(f"Attention Weights: {attn_weights.shape}")

# Visualize attention pattern
plt.figure(figsize=(10, 8))
sns.heatmap(attn_weights[0].detach().numpy(), cmap='viridis')
plt.title('Interpretable Attention Weights\n(Averaged across heads)')
plt.xlabel('Key Positions (Past)')
plt.ylabel('Query Positions (Current)')
plt.show()

---

## Part 3: Building a Temporal Transformer for Trading

Now let's combine these components into a complete Temporal Transformer for stock price prediction.

In [None]:
class TemporalTransformerBlock(nn.Module):
    """Single Temporal Transformer block with GRN and attention."""
    
    def __init__(self, d_model, n_heads, d_ff, dropout=0.1):
        super().__init__()
        
        # Interpretable multi-head attention
        self.attention = InterpretableMultiHeadAttention(d_model, n_heads, dropout)
        
        # Post-attention GRN
        self.grn1 = GatedResidualNetwork(d_model, d_ff, d_model, dropout)
        
        # Feed-forward GRN
        self.grn2 = GatedResidualNetwork(d_model, d_ff, d_model, dropout)
        
        self.layer_norm1 = nn.LayerNorm(d_model)
        self.layer_norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x, mask=None):
        # Self-attention with residual
        attn_out, attn_weights = self.attention(x, mask)
        x = self.layer_norm1(x + self.dropout(attn_out))
        x = self.grn1(x)
        
        # Feed-forward with residual
        x = self.grn2(x)
        
        return x, attn_weights


class TemporalFusionTransformer(nn.Module):
    """
    Simplified Temporal Fusion Transformer for stock prediction.
    Combines variable selection, temporal encoding, and interpretable attention.
    """
    
    def __init__(self, 
                 num_features,
                 d_model=64,
                 n_heads=4,
                 n_layers=2,
                 d_ff=128,
                 dropout=0.1,
                 max_seq_len=256,
                 num_quantiles=3):
        super().__init__()
        
        self.num_features = num_features
        self.d_model = d_model
        self.num_quantiles = num_quantiles
        
        # Input embedding
        self.input_projection = nn.Linear(num_features, d_model)
        
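        # Variable Selection (for interpretability)
        # Each scalar feature is tiled to width d_model // num_features in forward(),
        # so the flattened VSN input has num_features * (d_model // num_features)
        # columns, which is smaller than d_model when the division is not exact.
        self.variable_selection = VariableSelectionNetwork(
            input_dim=num_features * (d_model // num_features),
            num_vars=num_features,
            hidden_dim=d_model,
            dropout=dropout
        )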
        
        # Positional encoding
        self.pos_encoding = LearnedTemporalEncoding(d_model, max_seq_len, dropout)
        
        # LSTM for local patterns (TFT uses this before attention)
        # nn.LSTM applies dropout only between stacked layers, so it stays at 0
        # for this single-layer LSTM.
        self.lstm = nn.LSTM(
            input_size=d_model,
            hidden_size=d_model,
            num_layers=1,
            batch_first=True,
            dropout=0.0
        )
        
        # Temporal transformer blocks
        self.transformer_blocks = nn.ModuleList([
            TemporalTransformerBlock(d_model, n_heads, d_ff, dropout)
            for _ in range(n_layers)
        ])
        
        # Output layers
        self.output_grn = GatedResidualNetwork(d_model, d_ff, d_model, dropout)
        
        # Quantile outputs (for probabilistic predictions)
        self.quantile_outputs = nn.Linear(d_model, num_quantiles)
        
        # Point prediction output
        self.point_output = nn.Linear(d_model, 1)
        
    def generate_causal_mask(self, seq_len, device):
        """Generate causal mask for attention."""
        mask = torch.triu(torch.ones(seq_len, seq_len, device=device), diagonal=1)
        mask = mask.masked_fill(mask == 1, float('-inf'))
        return mask
        
    def forward(self, x, return_attention=False, return_var_weights=False):
        """
        Args:
            x: Input tensor (batch, seq_len, num_features)
            return_attention: Return attention weights for interpretation
            return_var_weights: Return variable importance weights
            
        Returns:
            predictions: Point predictions (batch, seq_len, 1)
            quantiles: Quantile predictions (batch, seq_len, num_quantiles)
            attention_weights: List of attention weights per layer (optional)
            var_weights: Variable importance weights (optional)
        """
        batch_size, seq_len, _ = x.size()
        device = x.device
        
        # Input projection
        x_proj = self.input_projection(x)
        
        # Variable selection (with interpretability)
        # Reshape for VSN: treat each feature dimension separately
        x_expanded = x.unsqueeze(-1).expand(-1, -1, -1, self.d_model // self.num_features)
        x_flat = x_expanded.reshape(batch_size, seq_len, -1)
        x_selected, var_weights = self.variable_selection(x_flat)
        
        # Combine projected input with selected features
        x = x_proj + x_selected
        
        # Add positional encoding
        x = self.pos_encoding(x)
        
        # LSTM for local temporal patterns
        lstm_out, _ = self.lstm(x)
        x = x + lstm_out  # Residual connection
        
        # Causal mask
        causal_mask = self.generate_causal_mask(seq_len, device)
        
        # Transformer blocks
        attention_weights_list = []
        for block in self.transformer_blocks:
            x, attn_weights = block(x, causal_mask)
            if return_attention:
                attention_weights_list.append(attn_weights)
        
        # Output processing
        output = self.output_grn(x)
        
        # Generate predictions
        point_pred = self.point_output(output)
        quantile_pred = self.quantile_outputs(output)
        
        results = [point_pred, quantile_pred]
        
        if return_attention:
            results.append(attention_weights_list)
        if return_var_weights:
            results.append(var_weights)
            
        return tuple(results) if len(results) > 2 else (point_pred, quantile_pred)


# Test the model
model = TemporalFusionTransformer(
    num_features=5,
    d_model=64,
    n_heads=4,
    n_layers=2
)

x = torch.randn(4, 30, 5)  # (batch, seq_len, features)
point_pred, quantile_pred, attn_weights, var_weights = model(x, return_attention=True, return_var_weights=True)

print(f"Input shape: {x.shape}")
print(f"Point predictions: {point_pred.shape}")
print(f"Quantile predictions: {quantile_pred.shape}")
print(f"Number of attention weight tensors: {len(attn_weights)}")
print(f"Variable weights shape: {var_weights.shape}")

# Count parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"\nTotal parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")

---

## Part 4: Trading Prediction Practical

### 4.1 Data Preparation

In [None]:
# Download stock data
symbols = ['AAPL', 'MSFT', 'GOOGL', 'SPY']  # Multiple stocks + market index
start_date = '2018-01-01'
end_date = '2024-01-01'

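# Download data
data = {}
for symbol in symbols:
    df = yf.download(symbol, start=start_date, end=end_date, progress=False)
    # Some yfinance releases return MultiIndex columns even for a single ticker,
    # which makes df['Close'] a one-column DataFrame. squeeze() turns that into
    # a Series and leaves an existing Series unchanged.
    data[symbol] = df['Close'].squeeze()
    print(f"Downloaded {symbol}: {len(df)} days")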

# Combine into DataFrame
prices_df = pd.DataFrame(data)
prices_df = prices_df.dropna()

print(f"\nCombined dataset: {prices_df.shape}")
print(f"Date range: {prices_df.index[0]} to {prices_df.index[-1]}")
prices_df.head()

In [None]:
def create_features(prices_df, target_col='AAPL'):
    """
    Create features for the Temporal Transformer.
    
    Features include:
    - Returns (1-day, 5-day, 20-day)
    - Volatility (rolling std)
    - Momentum indicators
    - Cross-asset features
    """
    df = prices_df.copy()
    
    # Returns for all assets
    for col in df.columns:
        df[f'{col}_ret_1d'] = df[col].pct_change(1)
        df[f'{col}_ret_5d'] = df[col].pct_change(5)
        df[f'{col}_ret_20d'] = df[col].pct_change(20)
    
    # Target-specific features
    target = df[target_col]
    
    # Moving averages
    df['ma_ratio_5_20'] = target.rolling(5).mean() / target.rolling(20).mean()
    df['ma_ratio_20_50'] = target.rolling(20).mean() / target.rolling(50).mean()
    
    # Volatility
    df['volatility_20d'] = df[f'{target_col}_ret_1d'].rolling(20).std()
    df['volatility_60d'] = df[f'{target_col}_ret_1d'].rolling(60).std()
    df['vol_ratio'] = df['volatility_20d'] / df['volatility_60d']
    
    # Momentum
    df['momentum_10d'] = target.pct_change(10)
    df['momentum_30d'] = target.pct_change(30)
    
    # RSI-like feature
    delta = target.diff()
    gain = (delta.where(delta > 0, 0)).rolling(14).mean()
    loss = (-delta.where(delta < 0, 0)).rolling(14).mean()
    rs = gain / loss
    df['rsi'] = 100 - (100 / (1 + rs))
    
    # Beta to market (SPY)
    if 'SPY' in df.columns:
        rolling_cov = df[f'{target_col}_ret_1d'].rolling(60).cov(df['SPY_ret_1d'])
        rolling_var = df['SPY_ret_1d'].rolling(60).var()
        df['beta_60d'] = rolling_cov / rolling_var
    
    # Target: Next day return
    df['target'] = df[f'{target_col}_ret_1d'].shift(-1)
    
    # Drop original price columns and NaN
    price_cols = set(prices_df.columns)  # avoid relying on the global `symbols`
    feature_cols = [c for c in df.columns if c not in price_cols]
    df = df[feature_cols].dropna()
    
    return df


# Create features
features_df = create_features(prices_df, target_col='AAPL')
print(f"Feature DataFrame shape: {features_df.shape}")
print(f"\nFeatures:")
print(features_df.columns.tolist())
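
To make the RSI-like feature concrete, here is a hand-checkable sketch of the same arithmetic for a single window, using simple (unsmoothed) averages of gains and losses as `create_features` does. The function name `rsi_simple` and the price list are made up for illustration. Wilder's smoothed RSI, which many charting tools use, would give different numbers.

```python
def rsi_simple(prices, window):
    """RSI from simple average gains/losses over the last `window` price changes."""
    if len(prices) <= window:
        raise ValueError("need more than `window` prices")
    deltas = [b - a for a, b in zip(prices[:-1], prices[1:])][-window:]
    avg_gain = sum(d for d in deltas if d > 0) / window
    avg_loss = sum(-d for d in deltas if d < 0) / window
    if avg_loss == 0:
        # Mirrors the pandas expression: 0/0 gives NaN, gain/0 gives RSI = 100.
        return float("nan") if avg_gain == 0 else 100.0
    rs = avg_gain / avg_loss
    return 100 - 100 / (1 + rs)

# Changes are +1, -0.5, +1, -0.5, so avg gain = 0.5, avg loss = 0.25, RS = 2.
print(rsi_simple([10.0, 11.0, 10.5, 11.5, 11.0], window=4))  # about 66.67
```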

In [None]:
class TimeSeriesDataset(Dataset):
    """Dataset for sequence-to-sequence time series prediction."""
    
    def __init__(self, features, targets, seq_len=60, pred_len=1):
        self.features = torch.FloatTensor(features)
        self.targets = torch.FloatTensor(targets)
        self.seq_len = seq_len
        self.pred_len = pred_len
        
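    def __len__(self):
        return len(self.features) - self.seq_len - self.pred_len + 2
    
    def __getitem__(self, idx):
        x = self.features[idx:idx + self.seq_len]
        # `target` is already shifted to the next-day return in create_features,
        # so the label for a window is stored at the window's last row.
        # Starting at idx + seq_len would skip one extra day ahead.
        last = idx + self.seq_len - 1
        y = self.targets[last:last + self.pred_len]
        return x, y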


def prepare_data(df, seq_len=60, train_ratio=0.7, val_ratio=0.15):
    """
    Prepare train/val/test datasets with proper scaling.
    Uses time-based split (no shuffling for time series).
    """
    # Separate features and target
    feature_cols = [c for c in df.columns if c != 'target']
    X = df[feature_cols].values
    y = df['target'].values.reshape(-1, 1)
    
    # Time-based split
    n = len(df)
    train_end = int(n * train_ratio)
    val_end = int(n * (train_ratio + val_ratio))
    
    X_train, y_train = X[:train_end], y[:train_end]
    X_val, y_val = X[train_end:val_end], y[train_end:val_end]
    X_test, y_test = X[val_end:], y[val_end:]
    
    # Scale features (fit on train only)
    scaler_X = StandardScaler()
    scaler_y = StandardScaler()
    
    X_train = scaler_X.fit_transform(X_train)
    X_val = scaler_X.transform(X_val)
    X_test = scaler_X.transform(X_test)
    
    y_train = scaler_y.fit_transform(y_train)
    y_val = scaler_y.transform(y_val)
    y_test = scaler_y.transform(y_test)
    
    # Create datasets
    train_dataset = TimeSeriesDataset(X_train, y_train, seq_len)
    val_dataset = TimeSeriesDataset(X_val, y_val, seq_len)
    test_dataset = TimeSeriesDataset(X_test, y_test, seq_len)
    
    return train_dataset, val_dataset, test_dataset, scaler_X, scaler_y, feature_cols


# Prepare data
SEQ_LEN = 60  # Use 60 days of history
BATCH_SIZE = 32

train_dataset, val_dataset, test_dataset, scaler_X, scaler_y, feature_cols = prepare_data(
    features_df, seq_len=SEQ_LEN
)

# Create data loaders
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False)

print(f"Number of features: {len(feature_cols)}")
print(f"Training samples: {len(train_dataset)}")
print(f"Validation samples: {len(val_dataset)}")
print(f"Test samples: {len(test_dataset)}")

# Check a batch
x_batch, y_batch = next(iter(train_loader))
print(f"\nBatch shapes: X={x_batch.shape}, y={y_batch.shape}")
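
The key detail in `prepare_data` is that the scaling statistics come from the training slice only and are then reused unchanged on later data. The standard-library sketch below shows the effect on a toy series. The helper names are hypothetical, and `pstdev` is used because it is the population standard deviation, which is what `StandardScaler` computes.

```python
from statistics import mean, pstdev

def fit_scaler(train_values):
    """Return (mean, population std) estimated from the training slice only."""
    return mean(train_values), pstdev(train_values)

def transform(values, mu, sigma):
    """Standardize values with statistics that were fitted elsewhere."""
    return [(v - mu) / sigma for v in values]

series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
train_end = len(series) * 7 // 10          # chronological 70% split, no shuffling
train, later = series[:train_end], series[train_end:]

mu, sigma = fit_scaler(train)              # fitted on 1..7 only
print(mu, sigma)                           # 4.0 2.0
print(transform(later, mu, sigma))         # [2.0, 2.5, 3.0]
```

The later values land well outside the range seen in training, which is expected: fitting the scaler on the full series would hide that drift by leaking future information into the training features.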

### 4.2 Model Training
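
Training in this section combines an MSE term with a quantile (pinball) loss. As a quick illustration of why that loss produces quantile estimates, the sketch below evaluates it for a single target and prediction. The function name `pinball_loss` and the numbers are illustrative only.

```python
def pinball_loss(target, prediction, q):
    """Quantile (pinball) loss for a single target/prediction pair."""
    error = target - prediction
    return max(q * error, (q - 1) * error)

# Under-prediction by 1.0 (target above prediction):
print(pinball_loss(1.0, 0.0, q=0.9))  # 0.9 -> heavily penalized for the 90% quantile
print(pinball_loss(1.0, 0.0, q=0.1))  # 0.1 -> lightly penalized for the 10% quantile

# Over-prediction by 1.0 reverses the asymmetry (about 0.1 and 0.9 respectively).
print(pinball_loss(0.0, 1.0, q=0.9))
print(pinball_loss(0.0, 1.0, q=0.1))
```

For `q = 0.5` both directions cost the same, so the median output is trained with half the absolute error.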

In [None]:
class QuantileLoss(nn.Module):
    """Quantile loss for probabilistic predictions."""
    
    def __init__(self, quantiles=(0.1, 0.5, 0.9)):
        super().__init__()
        self.quantiles = quantiles
        
    def forward(self, predictions, targets):
        """
        Args:
            predictions: (batch, seq_len, num_quantiles)
            targets: (batch, seq_len, 1)
        """
        losses = []
        for i, q in enumerate(self.quantiles):
            pred_q = predictions[:, :, i:i+1]
            error = targets - pred_q
            loss = torch.max(q * error, (q - 1) * error)
            losses.append(loss.mean())
        return sum(losses) / len(losses)


def train_model(model, train_loader, val_loader, epochs=50, lr=0.001, device='cpu'):
    """
    Train the Temporal Transformer model.
    """
    model = model.to(device)
    
    # Loss functions
    mse_loss = nn.MSELoss()
    quantile_loss = QuantileLoss(quantiles=[0.1, 0.5, 0.9])
    
    # Optimizer and LR scheduler (gradient clipping is applied inside the training loop)
    optimizer = optim.AdamW(model.parameters(), lr=lr, weight_decay=0.01)
    scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=5, factor=0.5)
    
    # Training history
    history = {'train_loss': [], 'val_loss': [], 'train_mse': [], 'val_mse': []}
    best_val_loss = float('inf')
    best_model_state = None
    
    for epoch in range(epochs):
        # Training phase
        model.train()
        train_losses = []
        train_mses = []
        
        for x_batch, y_batch in train_loader:
            x_batch = x_batch.to(device)
            y_batch = y_batch.to(device)
            
            optimizer.zero_grad()
            
            # Forward pass (use last timestep for prediction)
            point_pred, quantile_pred = model(x_batch)
            
            # Use only the last timestep predictions
            point_pred_last = point_pred[:, -1:, :]  # (batch, 1, 1)
            quantile_pred_last = quantile_pred[:, -1:, :]  # (batch, 1, 3)
            
            # Compute losses
            mse = mse_loss(point_pred_last.squeeze(-1), y_batch.squeeze(-1))
            q_loss = quantile_loss(quantile_pred_last, y_batch)
            loss = mse + 0.5 * q_loss
            
            # Backward pass
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            optimizer.step()
            
            train_losses.append(loss.item())
            train_mses.append(mse.item())
        
        # Validation phase
        model.eval()
        val_losses = []
        val_mses = []
        
        with torch.no_grad():
            for x_batch, y_batch in val_loader:
                x_batch = x_batch.to(device)
                y_batch = y_batch.to(device)
                
                point_pred, quantile_pred = model(x_batch)
                point_pred_last = point_pred[:, -1:, :]
                quantile_pred_last = quantile_pred[:, -1:, :]
                
                mse = mse_loss(point_pred_last.squeeze(-1), y_batch.squeeze(-1))
                q_loss = quantile_loss(quantile_pred_last, y_batch)
                loss = mse + 0.5 * q_loss
                
                val_losses.append(loss.item())
                val_mses.append(mse.item())
        
        # Record metrics
        train_loss = np.mean(train_losses)
        val_loss = np.mean(val_losses)
        train_mse = np.mean(train_mses)
        val_mse = np.mean(val_mses)
        
        history['train_loss'].append(train_loss)
        history['val_loss'].append(val_loss)
        history['train_mse'].append(train_mse)
        history['val_mse'].append(val_mse)
        
        # Update learning rate
        scheduler.step(val_loss)
        
        # Save best model
        if val_loss < best_val_loss:
            best_val_loss = val_loss
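            # Clone tensors: state_dict().copy() is shallow and would track later updates
            best_model_state = {k: v.detach().clone() for k, v in model.state_dict().items()}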
        
        if (epoch + 1) % 10 == 0:
            print(f"Epoch {epoch+1}/{epochs} | "
                  f"Train Loss: {train_loss:.6f} | Val Loss: {val_loss:.6f} | "
                  f"Train MSE: {train_mse:.6f} | Val MSE: {val_mse:.6f}")
    
    # Load best model (skipped if no epoch produced a comparable validation loss)
    if best_model_state is not None:
        model.load_state_dict(best_model_state)
    
    return model, history


# Initialize model
num_features = len(feature_cols)

model = TemporalFusionTransformer(
    num_features=num_features,
    d_model=64,
    n_heads=4,
    n_layers=2,
    d_ff=128,
    dropout=0.1,
    max_seq_len=SEQ_LEN + 10,
    num_quantiles=3
)

print(f"Model initialized with {sum(p.numel() for p in model.parameters()):,} parameters")

In [None]:
# Train the model
model, history = train_model(
    model, 
    train_loader, 
    val_loader, 
    epochs=50, 
    lr=0.001,
    device=device
)

# Plot training history
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].plot(history['train_loss'], label='Train Loss', alpha=0.8)
axes[0].plot(history['val_loss'], label='Val Loss', alpha=0.8)
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Loss')
axes[0].set_title('Training & Validation Loss')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

axes[1].plot(history['train_mse'], label='Train MSE', alpha=0.8)
axes[1].plot(history['val_mse'], label='Val MSE', alpha=0.8)
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('MSE')
axes[1].set_title('Training & Validation MSE')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

### 4.3 Model Evaluation and Interpretation

In [None]:
def evaluate_model(model, test_loader, scaler_y, device='cpu'):
    """
    Evaluate the model and collect predictions with attention weights.
    """
    model.eval()
    model = model.to(device)
    
    all_predictions = []
    all_targets = []
    all_quantiles = []
    all_attention = []
    all_var_weights = []
    
    with torch.no_grad():
        for x_batch, y_batch in test_loader:
            x_batch = x_batch.to(device)
            y_batch = y_batch.to(device)
            
            # Get predictions with interpretability outputs
            point_pred, quantile_pred, attn_weights, var_weights = model(
                x_batch, return_attention=True, return_var_weights=True
            )
            
            # Extract last timestep predictions
            point_pred_last = point_pred[:, -1, 0].cpu().numpy()
            quantile_pred_last = quantile_pred[:, -1, :].cpu().numpy()
            targets = y_batch[:, 0, 0].cpu().numpy()
            
            all_predictions.extend(point_pred_last)
            all_targets.extend(targets)
            all_quantiles.extend(quantile_pred_last)
            all_attention.append([a.cpu().numpy() for a in attn_weights])
            all_var_weights.append(var_weights.cpu().numpy())
    
    # Convert to arrays
    predictions = np.array(all_predictions)
    targets = np.array(all_targets)
    quantiles = np.array(all_quantiles)
    
    # Inverse transform to get actual returns
    predictions_orig = scaler_y.inverse_transform(predictions.reshape(-1, 1)).flatten()
    targets_orig = scaler_y.inverse_transform(targets.reshape(-1, 1)).flatten()
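    # Reshape so the single-target scaler sees one column, then restore (n, num_quantiles)
    quantiles_orig = scaler_y.inverse_transform(quantiles.reshape(-1, 1)).reshape(quantiles.shape)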
    
    # Calculate metrics
    mse = mean_squared_error(targets_orig, predictions_orig)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(targets_orig, predictions_orig)
    
    # Direction accuracy
    direction_correct = (np.sign(predictions_orig) == np.sign(targets_orig)).mean()
    
    # Quantile coverage (if calibrated: ~90% for each one-sided check, ~80% for the interval)
    lower_coverage = (targets_orig >= quantiles_orig[:, 0]).mean()
    upper_coverage = (targets_orig <= quantiles_orig[:, 2]).mean()
    interval_coverage = ((targets_orig >= quantiles_orig[:, 0]) & 
                         (targets_orig <= quantiles_orig[:, 2])).mean()
    
    results = {
        'predictions': predictions_orig,
        'targets': targets_orig,
        'quantiles': quantiles_orig,
        'attention': all_attention,
        'var_weights': np.concatenate(all_var_weights, axis=0),
        'metrics': {
            'MSE': mse,
            'RMSE': rmse,
            'MAE': mae,
            'Direction Accuracy': direction_correct,
            'Lower Quantile Coverage (≥10%)': lower_coverage,
            'Upper Quantile Coverage (≤90%)': upper_coverage,
            '80% Interval Coverage': interval_coverage
        }
    }
    
    return results


# Evaluate model
results = evaluate_model(model, test_loader, scaler_y, device)

# Print metrics
print("="*50)
print("TEST SET EVALUATION METRICS")
print("="*50)
for metric, value in results['metrics'].items():
    if 'Coverage' in metric or 'Accuracy' in metric:
        print(f"{metric}: {value:.2%}")
    else:
        print(f"{metric}: {value:.6f}")

In [None]:
# Visualize predictions
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# 1. Predictions vs Actual (time series)
n_show = min(200, len(results['targets']))  # avoid a length mismatch on short test sets
ax = axes[0, 0]
ax.plot(results['targets'][:n_show], label='Actual', alpha=0.7, linewidth=1)
ax.plot(results['predictions'][:n_show], label='Predicted', alpha=0.7, linewidth=1)
ax.fill_between(range(n_show), 
                results['quantiles'][:n_show, 0], 
                results['quantiles'][:n_show, 2], 
                alpha=0.2, label='80% Interval')
ax.axhline(y=0, color='gray', linestyle='--', alpha=0.5)
ax.set_xlabel('Time')
ax.set_ylabel('Return')
ax.set_title('Predictions vs Actual Returns (Test Set)')
ax.legend()
ax.grid(True, alpha=0.3)

# 2. Scatter plot
ax = axes[0, 1]
ax.scatter(results['targets'], results['predictions'], alpha=0.3, s=10)
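# Draw the diagonal over the observed data range rather than a fixed +/-10% window
lims = [min(results['targets'].min(), results['predictions'].min()),
        max(results['targets'].max(), results['predictions'].max())]
ax.plot(lims, lims, 'r--', label='Perfect Prediction')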
ax.set_xlabel('Actual Return')
ax.set_ylabel('Predicted Return')
ax.set_title('Prediction Scatter Plot')
ax.legend()
ax.grid(True, alpha=0.3)

# 3. Cumulative returns (long/short on the sign of the prediction; ignores transaction costs)
ax = axes[1, 0]
signal = np.sign(results['predictions'])
strategy_returns = signal * results['targets']
cumulative_strategy = (1 + strategy_returns).cumprod()
cumulative_buy_hold = (1 + results['targets']).cumprod()

ax.plot(cumulative_strategy, label='Transformer Strategy', linewidth=2)
ax.plot(cumulative_buy_hold, label='Buy & Hold', linewidth=2)
ax.set_xlabel('Time')
ax.set_ylabel('Cumulative Return')
ax.set_title('Strategy Performance vs Buy & Hold')
ax.legend()
ax.grid(True, alpha=0.3)

# 4. Error distribution
ax = axes[1, 1]
errors = results['predictions'] - results['targets']
ax.hist(errors, bins=50, density=True, alpha=0.7, edgecolor='black')
ax.axvline(x=0, color='red', linestyle='--', label='Zero Error')
ax.set_xlabel('Prediction Error')
ax.set_ylabel('Density')
ax.set_title(f'Prediction Error Distribution\nMean: {errors.mean():.4f}, Std: {errors.std():.4f}')
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Calculate Sharpe Ratio
sharpe_strategy = np.sqrt(252) * strategy_returns.mean() / strategy_returns.std()
sharpe_buyhold = np.sqrt(252) * results['targets'].mean() / results['targets'].std()
print(f"\nSharpe Ratio (Strategy): {sharpe_strategy:.2f}")
print(f"Sharpe Ratio (Buy & Hold): {sharpe_buyhold:.2f}")
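
The Sharpe ratios above use the formula `sqrt(252) * mean / std` on daily returns, which implicitly assumes a zero risk-free rate and 252 trading days per year. The sketch below wraps that same formula in a small helper; the name `annualized_sharpe` and the NaN guard for constant series are illustrative assumptions, not part of the notebook's pipeline.

```python
import numpy as np

def annualized_sharpe(daily_returns, periods_per_year=252):
    """Annualized Sharpe ratio, assuming a zero risk-free rate."""
    r = np.asarray(daily_returns, dtype=float)
    std = r.std()  # population std, matching ndarray.std() used above
    if r.size < 2 or std == 0:
        return float("nan")  # undefined for constant or too-short series
    return float(np.sqrt(periods_per_year) * r.mean() / std)

# Hypothetical daily returns, for illustration only
example = np.array([0.01, -0.005, 0.002, 0.007, -0.003])
print(f"{annualized_sharpe(example):.2f}")
```

With a handful of observations the estimate is very noisy, so it is usually worth reading it alongside the length of the test period.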

---

## Part 5: Interpretability Analysis

One of the key advantages of the Temporal Fusion Transformer is its interpretability.
Let's analyze:
1. Variable importance weights
2. Temporal attention patterns

In [None]:
# Analyze Variable Importance
var_weights = results['var_weights']  # (samples, seq_len, num_vars)

# Average importance across samples and time
avg_importance = var_weights.mean(axis=(0, 1))  # (num_vars,)

# The VSN output may not line up one-to-one with feature_cols, so truncate both
# to the shorter length; treat the feature-to-weight mapping as approximate
num_vars_actual = min(len(feature_cols), len(avg_importance))

# Plot variable importance
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

# Top variable importance
ax = axes[0]
importance_df = pd.DataFrame({
    'Feature': feature_cols[:num_vars_actual],
    'Importance': avg_importance[:num_vars_actual]
}).sort_values('Importance', ascending=True)

colors = plt.cm.viridis(np.linspace(0.2, 0.8, len(importance_df)))
ax.barh(importance_df['Feature'], importance_df['Importance'], color=colors)
ax.set_xlabel('Average Importance Weight')
ax.set_title('Variable Selection Network: Feature Importance')
ax.grid(True, alpha=0.3)

# Variable importance over time (for a sample)
ax = axes[1]
sample_idx = 0
time_importance = var_weights[sample_idx, :, :num_vars_actual]  # (seq_len, num_vars)

# Show only top 5 most important variables
top_indices = np.argsort(avg_importance[:num_vars_actual])[-5:]
for idx in top_indices:
    ax.plot(time_importance[:, idx], label=feature_cols[idx], alpha=0.8)

ax.set_xlabel('Time Step')
ax.set_ylabel('Importance Weight')
ax.set_title('Top 5 Variable Importance Over Time (Sample)')
ax.legend(loc='upper right')
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Analyze Temporal Attention Patterns
# Get attention from a batch
model.eval()
x_sample, _ = next(iter(test_loader))
x_sample = x_sample.to(device)

with torch.no_grad():
    _, _, attn_weights, _ = model(x_sample[:1], return_attention=True, return_var_weights=True)

# Plot attention for each layer
fig, axes = plt.subplots(1, len(attn_weights), figsize=(7*len(attn_weights), 6))

if len(attn_weights) == 1:
    axes = [axes]

for i, attn in enumerate(attn_weights):
    attn_matrix = attn[0].cpu().numpy()  # First sample
    
    im = axes[i].imshow(attn_matrix, cmap='viridis', aspect='auto')
    axes[i].set_xlabel('Key Position (History)')
    axes[i].set_ylabel('Query Position (Current)')
    axes[i].set_title(f'Layer {i+1} Attention Pattern')
    plt.colorbar(im, ax=axes[i])

plt.suptitle('Temporal Attention Patterns\n(How model attends to historical data)', fontsize=14)
plt.tight_layout()
plt.show()

# Analyze attention at the last (prediction) position
last_position_attention = attn_weights[-1][0, -1, :].cpu().numpy()  # Last layer, last position

plt.figure(figsize=(12, 4))
plt.bar(range(len(last_position_attention)), last_position_attention, alpha=0.7)
plt.xlabel('Historical Time Step')
plt.ylabel('Attention Weight')
plt.title('Attention Distribution at Prediction Time\n(Which historical steps matter most for the prediction)')
plt.grid(True, alpha=0.3)

# Mark most attended positions
top_5_positions = np.argsort(last_position_attention)[-5:]
for pos in top_5_positions:
    plt.axvline(x=pos, color='red', linestyle='--', alpha=0.5)

plt.tight_layout()
plt.show()

print("\nMost attended time steps (most recent first):")
for pos in sorted(top_5_positions, reverse=True):
    days_ago = SEQ_LEN - 1 - pos
    print(f"  - Position {pos} ({days_ago} days ago): attention weight = {last_position_attention[pos]:.4f}")

---

## Summary: Key Takeaways

### Time Series Transformer Modifications
1. **Temporal Positional Encoding**: Captures day-of-week, seasonal, and learned patterns
2. **Causal Masking**: Prevents information leakage from future to past
3. **LSTM Integration**: Captures local temporal patterns before global attention
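
To make the causal masking point concrete, here is a minimal NumPy sketch of a masked softmax. The function name `causal_attention_weights` is illustrative and the scores are random; it shows only the masking step, not the full attention layer used in the model.

```python
import numpy as np

def causal_attention_weights(scores):
    """Row-wise softmax over scores with future key positions masked out.

    scores: (seq_len, seq_len) array of raw query-key scores.
    """
    seq_len = scores.shape[0]
    # True strictly above the diagonal, i.e. key position j > query position i
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    masked = np.where(future, -np.inf, scores)
    # Subtract the row max for numerical stability before exponentiating
    masked = masked - masked.max(axis=1, keepdims=True)
    weights = np.exp(masked)
    return weights / weights.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
weights = causal_attention_weights(rng.normal(size=(5, 5)))
print(np.round(weights, 3))
```

Every entry above the diagonal comes out as exactly zero, so the representation at step `t` cannot depend on any input after `t`.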

### TFT Key Components
1. **Variable Selection Network**: Learns feature importance dynamically
2. **Gated Residual Networks**: Flexible nonlinear processing with skip connections
3. **Interpretable Multi-Head Attention**: Shared values enable attention interpretation
4. **Quantile Outputs**: Probabilistic predictions with uncertainty estimation
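
The quantile outputs are trained with the pinball loss from `QuantileLoss`. The following NumPy sketch (the helper name `pinball_loss` and the synthetic data are assumptions for illustration) shows the two properties that make it work: the penalty is asymmetric, and the constant that minimizes it is approximately the empirical quantile.

```python
import numpy as np

def pinball_loss(y_true, y_pred, q):
    """Mean quantile (pinball) loss for a single quantile level q."""
    error = np.asarray(y_true) - np.asarray(y_pred)
    return np.mean(np.maximum(q * error, (q - 1) * error))

y_true = np.array([0.0])
# Same absolute miss of 1.0 in opposite directions, at q = 0.9
under = pinball_loss(y_true, np.array([-1.0]), q=0.9)  # prediction too low
over = pinball_loss(y_true, np.array([1.0]), q=0.9)    # prediction too high
print(f"under-prediction: {under:.2f}, over-prediction: {over:.2f}")

# The constant prediction with the lowest loss sits near the empirical 90% quantile
rng = np.random.default_rng(0)
sample = rng.normal(size=2000)
grid = np.linspace(-3, 3, 601)
losses = [pinball_loss(sample, c, q=0.9) for c in grid]
best = grid[int(np.argmin(losses))]
print(f"grid minimizer: {best:.2f}, empirical quantile: {np.quantile(sample, 0.9):.2f}")
```

At `q = 0.9` a prediction that is too low costs roughly nine times as much as one that is too high by the same amount, which is what pushes the output toward the upper tail.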

### Trading Applications
- **Direction Prediction**: Taken from the sign of the predicted return
- **Risk Management**: Quantile outputs provide prediction intervals
- **Feature Analysis**: VSN shows which factors drive predictions
- **Temporal Patterns**: Attention weights reveal important historical periods

### Practical Considerations
- Start with simpler models (LSTM, GRU) before using transformers
- Transformers tend to pay off with longer sequences and larger feature sets
- Interpretability is valuable for trading strategy validation
- Quantile predictions supply the interval width that risk-adjusted position sizing can use
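
As one possible way to turn quantile outputs into a position size, the sketch below takes direction from the median and shrinks the position as the 10%-90% interval widens. The helper `size_position`, the `target_width` of 2%, and the input numbers are all hypothetical; this is a heuristic for illustration, not a rule taken from the TFT paper or from the notebook.

```python
import numpy as np

def size_position(q10, q50, q90, target_width=0.02, max_position=1.0):
    """Hypothetical heuristic: direction from the median, size from interval width."""
    q10, q50, q90 = (np.asarray(a, dtype=float) for a in (q10, q50, q90))
    # Guard against zero-width or crossed quantiles
    width = np.maximum(q90 - q10, 1e-8)
    # Full size when the interval is at or below target_width, smaller when wider
    size = np.minimum(target_width / width, max_position)
    return np.sign(q50) * size

# Made-up quantile predictions for three days
positions = size_position(
    q10=[-0.01, -0.04, -0.002],
    q50=[0.004, 0.003, -0.001],
    q90=[0.01, 0.05, 0.003],
)
print(np.round(positions, 3))
```

With the evaluation output from earlier, the three arguments would be the columns of `results['quantiles']` in order.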

In [None]:
# Final summary comparison
print("="*60)
print("TEMPORAL TRANSFORMER MODEL SUMMARY")
print("="*60)
print(f"\nModel Architecture:")
print(f"  - Embedding Dimension: 64")
print(f"  - Number of Attention Heads: 4")
print(f"  - Number of Transformer Layers: 2")
print(f"  - Sequence Length: {SEQ_LEN} days")
print(f"  - Total Parameters: {sum(p.numel() for p in model.parameters()):,}")

print(f"\nKey Innovations:")
print(f"  ✓ Variable Selection Network for feature importance")
print(f"  ✓ Learned temporal positional encoding")
print(f"  ✓ Causal attention masking")
print(f"  ✓ LSTM integration for local patterns")
print(f"  ✓ Quantile outputs for uncertainty")
print(f"  ✓ Interpretable attention weights")

print(f"\nTest Performance:")
for metric, value in results['metrics'].items():
    if 'Coverage' in metric or 'Accuracy' in metric:
        print(f"  - {metric}: {value:.2%}")
    else:
        print(f"  - {metric}: {value:.6f}")

print(f"\nStrategy Metrics:")
print(f"  - Strategy Sharpe Ratio: {sharpe_strategy:.2f}")
print(f"  - Buy & Hold Sharpe Ratio: {sharpe_buyhold:.2f}")
print(f"  - Strategy Final Value (per $1 invested): ${cumulative_strategy[-1]:.2f}")
print(f"  - Buy & Hold Final Value (per $1 invested): ${cumulative_buy_hold[-1]:.2f}")

---

## Practice Exercises

1. **Multi-Horizon Forecasting**: Modify the model to predict 1, 5, and 20-day returns simultaneously

2. **Attention Variants**: Implement different attention mechanisms:
   - Sparse attention (attend to only top-k positions)
   - Relative positional attention
   - Local + global attention hybrid

3. **Alternative Data Integration**: Add sentiment features from news or social media

4. **Portfolio Application**: Extend to multiple stocks and predict portfolio weights

5. **Regime Detection**: Use attention patterns to identify market regime changes
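
As a possible starting point for the sparse-attention variant in exercise 2, the NumPy sketch below keeps only the `k` largest scores in each query row before the softmax. The function name `topk_attention_weights` is illustrative, the scores are random, and causal masking is deliberately left out so the top-k step stands alone.

```python
import numpy as np

def topk_attention_weights(scores, k):
    """Softmax over each row of scores, keeping only the k largest entries."""
    scores = np.asarray(scores, dtype=float)
    k = min(k, scores.shape[-1])
    # The k-th largest value in each row serves as the keep threshold
    kth = np.partition(scores, -k, axis=-1)[..., -k][..., None]
    masked = np.where(scores >= kth, scores, -np.inf)
    # Subtract the row max for numerical stability before exponentiating
    masked = masked - masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
weights = topk_attention_weights(rng.normal(size=(4, 10)), k=3)
print((weights > 0).sum(axis=-1))  # number of attended positions per query
```

If several scores tie at the threshold, this version keeps all of them, so a row can attend to more than `k` positions. Combining it with a causal mask means applying the mask first and selecting the top-k among the remaining positions.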

---

## References

1. Lim, B., et al. (2021). "Temporal Fusion Transformers for Interpretable Multi-horizon Time Series Forecasting"
2. Vaswani, A., et al. (2017). "Attention Is All You Need"
3. Li, S., et al. (2019). "Enhancing the Locality and Breaking the Memory Bottleneck of Transformer on Time Series Forecasting"
4. Zhou, H., et al. (2021). "Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting"