# Week 14 - Day 5: Bidirectional and Stacked RNNs

## Learning Objectives
- Understand bidirectional RNNs and their applications in pattern recognition
- Implement stacked/deep LSTM architectures
- Apply dropout regularization for sequential models
- Build a deep LSTM trading model for price prediction

---

## Table of Contents
1. [Introduction to Advanced RNN Architectures](#1-introduction)
2. [Bidirectional RNNs](#2-bidirectional)
3. [Stacked/Deep LSTMs](#3-stacked)
4. [Dropout for Sequences](#4-dropout)
5. [Deep LSTM Trading Model](#5-trading-model)
6. [Key Takeaways](#6-takeaways)

---

## 1. Introduction to Advanced RNN Architectures <a id='1-introduction'></a>

### Why Go Beyond Simple RNNs?

Basic LSTMs and GRUs capture sequential patterns, but three extensions often improve on them:

1. **Bidirectional RNNs**: Process sequences in both forward and backward directions
   - Capture context from both past and future
   - Useful for pattern recognition where full sequence is available

2. **Stacked/Deep LSTMs**: Multiple layers of LSTM cells
   - Learn hierarchical representations
   - Lower layers capture simple patterns, higher layers capture complex abstractions

3. **Dropout regularization**: Reduces overfitting
   - Especially important for deep networks trained on limited financial data

### Architecture Overview

```
Bidirectional:          Stacked:
→ LSTM →                LSTM Layer 3
    ↓    concat              ↑
← LSTM ←                LSTM Layer 2
                             ↑
                        LSTM Layer 1
```

In [None]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import yfinance as yf
from datetime import datetime, timedelta

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error

import warnings
warnings.filterwarnings('ignore')

# Set random seeds for reproducibility
np.random.seed(42)
torch.manual_seed(42)

# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

In [None]:
# Download financial data
ticker = 'SPY'
start_date = '2018-01-01'
end_date = '2024-01-01'

data = yf.download(ticker, start=start_date, end=end_date, progress=False)
prices = data['Close'].values.reshape(-1, 1)

print(f"Data shape: {prices.shape}")
print(f"Date range: {data.index[0].date()} to {data.index[-1].date()}")
print(f"Price range: ${prices.min():.2f} - ${prices.max():.2f}")

In [None]:
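# Parameters
SEQ_LENGTH = 60
TRAIN_SPLIT = 0.8

# Data preprocessing
# Fit the scaler on the training portion only, so test-period prices
# do not leak into the scaling. Test values may fall outside [0, 1].
n_train_rows = int((len(prices) - SEQ_LENGTH) * TRAIN_SPLIT) + SEQ_LENGTH
scaler = MinMaxScaler(feature_range=(0, 1))
scaler.fit(prices[:n_train_rows])
scaled_prices = scaler.transform(prices)

def create_sequences(data, seq_length):
    """Create (window, next value) pairs for one-step-ahead prediction."""
    X, y = [], []
    for i in range(len(data) - seq_length):
        X.append(data[i:i+seq_length])
        y.append(data[i+seq_length])
    return np.array(X), np.array(y)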

# Create sequences
X, y = create_sequences(scaled_prices, SEQ_LENGTH)
print(f"Sequences shape: X={X.shape}, y={y.shape}")

# Train/test split
train_size = int(len(X) * TRAIN_SPLIT)
X_train, X_test = X[:train_size], X[train_size:]
y_train, y_test = y[:train_size], y[train_size:]

print(f"Training samples: {len(X_train)}")
print(f"Testing samples: {len(X_test)}")

In [None]:
# Convert to PyTorch tensors
X_train_tensor = torch.FloatTensor(X_train).to(device)
y_train_tensor = torch.FloatTensor(y_train).to(device)
X_test_tensor = torch.FloatTensor(X_test).to(device)
y_test_tensor = torch.FloatTensor(y_test).to(device)

# Create DataLoaders
BATCH_SIZE = 32

train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
test_dataset = TensorDataset(X_test_tensor, y_test_tensor)

train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False)

print(f"Training batches: {len(train_loader)}")
print(f"Test batches: {len(test_loader)}")

---

## 2. Bidirectional RNNs <a id='2-bidirectional'></a>

### Concept

Bidirectional RNNs process sequences in **two directions**:
- **Forward**: From $t_1$ to $t_n$ (past → future)
- **Backward**: From $t_n$ to $t_1$ (future → past)

### Mathematical Formulation

For each timestep $t$:

$$\overrightarrow{h_t} = \text{LSTM}_{forward}(x_t, \overrightarrow{h_{t-1}})$$

$$\overleftarrow{h_t} = \text{LSTM}_{backward}(x_t, \overleftarrow{h_{t+1}})$$

$$h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$$
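
The concatenation above can be checked directly on a small `nn.LSTM`. This is a minimal sketch with arbitrary sizes (batch of 4, 10 timesteps, hidden size 8), chosen only for illustration. It shows that the forward half of the output at the last timestep and the backward half at the first timestep are the two final hidden states.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

hidden_size = 8  # illustrative size, not the one used later in the notebook
lstm = nn.LSTM(input_size=1, hidden_size=hidden_size,
               batch_first=True, bidirectional=True)

x = torch.randn(4, 10, 1)  # (batch, seq_len, input_size)
out, (h_n, c_n) = lstm(x)

print(out.shape)  # (batch, seq_len, 2 * hidden_size)
print(h_n.shape)  # (num_directions, batch, hidden_size) for a single layer

# Forward direction finishes at the last timestep,
# backward direction finishes at the first timestep.
forward_final = out[:, -1, :hidden_size]
backward_final = out[:, 0, hidden_size:]

fwd_matches = torch.allclose(forward_final, h_n[0], atol=1e-6)
bwd_matches = torch.allclose(backward_final, h_n[1], atol=1e-6)
print(fwd_matches, bwd_matches)
```

One consequence: at the last timestep, the backward half of `out` has only seen a single input, so a sequence summary is usually built from `h_n` rather than from `out[:, -1, :]`.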

### When to Use Bidirectional RNNs

| Use Case | Suitable? | Reason |
|----------|-----------|--------|
| Pattern recognition (post-hoc) | ✅ Yes | Full sequence available |
| Real-time prediction | ❌ No | Future data unavailable |
| Backtesting analysis | ✅ Yes | Historical patterns |
| Regime classification | ✅ Yes | Identify patterns after the fact |
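
The "future data unavailable" row can be made concrete with a small experiment. The sketch below, using arbitrary toy sizes, changes only the final timestep of a sequence and checks whether the outputs at earlier timesteps move. For a unidirectional LSTM they should not; for a bidirectional one they should.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

uni = nn.LSTM(input_size=1, hidden_size=4, batch_first=True)
bi = nn.LSTM(input_size=1, hidden_size=4, batch_first=True, bidirectional=True)

x = torch.randn(1, 6, 1)
x_changed = x.clone()
x_changed[0, -1, 0] += 1.0  # alter only the final ("future") timestep

with torch.no_grad():
    uni_a, _ = uni(x)
    uni_b, _ = uni(x_changed)
    bi_a, _ = bi(x)
    bi_b, _ = bi(x_changed)

# Compare outputs at every timestep before the altered one
uni_earlier_changed = not torch.allclose(uni_a[:, :-1], uni_b[:, :-1])
bi_earlier_changed = not torch.allclose(bi_a[:, :-1], bi_b[:, :-1])

print(f"Unidirectional earlier outputs changed: {uni_earlier_changed}")
print(f"Bidirectional earlier outputs changed:  {bi_earlier_changed}")
```

Running a bidirectional layer over a fixed window of past prices, as the models in this notebook do, does not leak information, because the entire window precedes the prediction target.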

In [None]:
class BidirectionalLSTM(nn.Module):
    """
    Bidirectional LSTM for sequence pattern recognition.
    
    The output dimension is doubled (hidden_size * 2) due to
    concatenation of forward and backward hidden states.
    """
    
    def __init__(self, input_size, hidden_size, num_layers=1, dropout=0.0):
        super(BidirectionalLSTM, self).__init__()
        
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        
        # Bidirectional LSTM
        self.lstm = nn.LSTM(
            input_size=input_size,
            hidden_size=hidden_size,
            num_layers=num_layers,
            batch_first=True,
            bidirectional=True,  # Key parameter!
            dropout=dropout if num_layers > 1 else 0.0
        )
        
        # Output layer (hidden_size * 2 for bidirectional)
        self.fc = nn.Linear(hidden_size * 2, 1)
        
    def forward(self, x):
        # x shape: (batch, seq_len, input_size)
        
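        # lstm_out shape: (batch, seq_len, hidden_size * 2)
        # h_n shape: (num_layers * 2, batch, hidden_size)
        lstm_out, (h_n, c_n) = self.lstm(x)
        
        # Concatenate the final forward and backward states of the top layer.
        # lstm_out[:, -1, :] is not a full summary: its backward half has
        # only seen the last timestep.
        last_output = torch.cat((h_n[-2], h_n[-1]), dim=1)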
        
        # Predict
        out = self.fc(last_output)
        return out

# Instantiate model
bi_lstm = BidirectionalLSTM(
    input_size=1,
    hidden_size=64,
    num_layers=1
).to(device)

print(bi_lstm)
print(f"\nTotal parameters: {sum(p.numel() for p in bi_lstm.parameters()):,}")

In [None]:
# Visualize bidirectional processing
fig, axes = plt.subplots(1, 2, figsize=(14, 4))

# Left: Unidirectional concept
ax1 = axes[0]
timesteps = np.arange(10)
ax1.plot(timesteps, np.sin(timesteps * 0.5), 'b-o', linewidth=2, markersize=8)
for i in range(len(timesteps)-1):
    ax1.annotate('', xy=(timesteps[i+1], np.sin((i+1)*0.5)), 
                 xytext=(timesteps[i], np.sin(i*0.5)),
                 arrowprops=dict(arrowstyle='->', color='blue', lw=2))
ax1.set_title('Unidirectional LSTM\n(Forward Only)', fontsize=12)
ax1.set_xlabel('Time Step')
ax1.set_ylabel('Hidden State Flow')

# Right: Bidirectional concept
ax2 = axes[1]
ax2.plot(timesteps, np.sin(timesteps * 0.5), 'b-o', linewidth=2, markersize=8, label='Forward')
ax2.plot(timesteps, np.sin(timesteps * 0.5) - 0.3, 'r-s', linewidth=2, markersize=8, label='Backward')
for i in range(len(timesteps)-1):
    # Forward arrows
    ax2.annotate('', xy=(timesteps[i+1], np.sin((i+1)*0.5)), 
                 xytext=(timesteps[i], np.sin(i*0.5)),
                 arrowprops=dict(arrowstyle='->', color='blue', lw=1.5))
    # Backward arrows
    ax2.annotate('', xy=(timesteps[i], np.sin(i*0.5) - 0.3), 
                 xytext=(timesteps[i+1], np.sin((i+1)*0.5) - 0.3),
                 arrowprops=dict(arrowstyle='->', color='red', lw=1.5))
ax2.set_title('Bidirectional LSTM\n(Forward + Backward)', fontsize=12)
ax2.set_xlabel('Time Step')
ax2.legend()

plt.tight_layout()
plt.show()

---

## 3. Stacked/Deep LSTMs <a id='3-stacked'></a>

### Why Stack LSTM Layers?

Stacking multiple LSTM layers creates a **hierarchical representation**:

1. **Layer 1**: Captures low-level patterns (trends, simple oscillations)
2. **Layer 2**: Combines low-level patterns into mid-level features
3. **Layer 3+**: Learns complex, abstract representations

### Architecture

```
Input Sequence: [x₁, x₂, ..., xₜ]
        ↓
    LSTM Layer 1  →  h₁⁽¹⁾, h₂⁽¹⁾, ..., hₜ⁽¹⁾  (simple patterns)
        ↓
    LSTM Layer 2  →  h₁⁽²⁾, h₂⁽²⁾, ..., hₜ⁽²⁾  (complex patterns)
        ↓
    LSTM Layer 3  →  h₁⁽³⁾, h₂⁽³⁾, ..., hₜ⁽³⁾  (abstract features)
        ↓
    Dense Layer   →  Output
```
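
In PyTorch, the diagram above maps onto the two return values of `nn.LSTM`. The following sketch, with illustrative sizes only, shows that `out` holds the top layer's hidden state at every timestep, while `h_n` holds the final hidden state of every layer.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

num_layers, hidden_size = 3, 8  # illustrative values
lstm = nn.LSTM(input_size=1, hidden_size=hidden_size,
               num_layers=num_layers, batch_first=True)

x = torch.randn(4, 10, 1)  # (batch, seq_len, input_size)
out, (h_n, c_n) = lstm(x)

print(out.shape)  # (batch, seq_len, hidden_size): top layer, all timesteps
print(h_n.shape)  # (num_layers, batch, hidden_size): all layers, last timestep

# The last timestep of `out` is the final hidden state of the top layer
top_matches = torch.allclose(out[:, -1, :], h_n[-1], atol=1e-6)
print(top_matches)
```

This is why the models in this section feed `out[:, -1, :]` into the dense layer: it is the top layer's summary of the whole window.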

### Guidelines for Depth

| Training Samples | Suggested Layers (rule of thumb) |
|------------------|----------------------------------|
| < 1,000 samples | 1 layer |
| 1,000 - 10,000 | 1-2 layers |
| 10,000 - 100,000 | 2-3 layers |
| > 100,000 | 3-4+ layers |
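
To see why depth should track dataset size, it helps to count parameters. The helper names below are hypothetical, and the formula assumes PyTorch's `nn.LSTM` layout: four gates, two bias vectors per layer, no projection. It is a sketch for building intuition, not a replacement for counting `model.parameters()`.

```python
def lstm_layer_params(input_size, hidden_size):
    """Parameters in one unidirectional LSTM layer (PyTorch layout, no projection)."""
    gates = 4  # input, forget, cell, output
    input_weights = gates * hidden_size * input_size
    recurrent_weights = gates * hidden_size * hidden_size
    biases = 2 * gates * hidden_size  # bias_ih and bias_hh
    return input_weights + recurrent_weights + biases


def stacked_lstm_params(input_size, hidden_size, num_layers):
    """Layer 1 reads the input; every later layer reads the previous hidden state."""
    total = lstm_layer_params(input_size, hidden_size)
    total += (num_layers - 1) * lstm_layer_params(hidden_size, hidden_size)
    return total


for depth in (1, 2, 3, 4):
    print(f"Layers: {depth} | LSTM parameters: {stacked_lstm_params(1, 64, depth):,}")
```

With a hidden size of 64, each extra layer adds 33,280 parameters, roughly double the first layer's 17,152, so a few thousand training windows are quickly outnumbered.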

In [None]:
class StackedLSTM(nn.Module):
    """
    Stacked (Deep) LSTM with configurable number of layers.
    
    Each layer passes its output sequence to the next layer,
    enabling hierarchical feature learning.
    """
    
    def __init__(self, input_size, hidden_size, num_layers, dropout=0.0):
        super(StackedLSTM, self).__init__()
        
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        
        # Stacked LSTM (PyTorch handles stacking automatically)
        self.lstm = nn.LSTM(
            input_size=input_size,
            hidden_size=hidden_size,
            num_layers=num_layers,
            batch_first=True,
            dropout=dropout if num_layers > 1 else 0.0
        )
        
        self.fc = nn.Linear(hidden_size, 1)
        
    def forward(self, x):
        # Initialize hidden states
        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size).to(x.device)
        c0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size).to(x.device)
        
        # Forward propagate LSTM
        out, _ = self.lstm(x, (h0, c0))
        
        # Get last timestep output
        out = self.fc(out[:, -1, :])
        return out

# Compare different depths
depths = [1, 2, 3, 4]
for depth in depths:
    model = StackedLSTM(input_size=1, hidden_size=64, num_layers=depth)
    params = sum(p.numel() for p in model.parameters())
    print(f"Layers: {depth} | Parameters: {params:,}")

In [None]:
class ManualStackedLSTM(nn.Module):
    """
    Manually stacked LSTM layers for fine-grained control.
    
    Useful when you want different hidden sizes per layer
    or custom connections between layers.
    """
    
    def __init__(self, input_size, hidden_sizes, dropout=0.2):
        super(ManualStackedLSTM, self).__init__()
        
        self.layers = nn.ModuleList()
        self.dropouts = nn.ModuleList()
        
        # First layer
        self.layers.append(
            nn.LSTM(input_size, hidden_sizes[0], batch_first=True)
        )
        self.dropouts.append(nn.Dropout(dropout))
        
        # Subsequent layers
        for i in range(1, len(hidden_sizes)):
            self.layers.append(
                nn.LSTM(hidden_sizes[i-1], hidden_sizes[i], batch_first=True)
            )
            self.dropouts.append(nn.Dropout(dropout))
        
        self.fc = nn.Linear(hidden_sizes[-1], 1)
        
    def forward(self, x):
        # Process through each layer
        for lstm, dropout in zip(self.layers, self.dropouts):
            x, _ = lstm(x)
            x = dropout(x)
        
        # Get last timestep
        out = self.fc(x[:, -1, :])
        return out

# Example: Pyramid architecture (decreasing hidden sizes)
pyramid_lstm = ManualStackedLSTM(
    input_size=1,
    hidden_sizes=[128, 64, 32],  # Decreasing sizes
    dropout=0.2
).to(device)

print("Pyramid LSTM Architecture:")
print(pyramid_lstm)
print(f"\nTotal parameters: {sum(p.numel() for p in pyramid_lstm.parameters()):,}")

---

## 4. Dropout for Sequences <a id='4-dropout'></a>

### Types of Dropout in RNNs

1. **Standard Dropout**: Applied between LSTM layers
   - Different mask for each timestep
   - Can disrupt temporal information flow

2. **Variational Dropout**: Same mask across all timesteps
   - Preserves temporal consistency
   - Better for sequence modeling

3. **Recurrent Dropout**: Applied to recurrent connections
   - Drops hidden-to-hidden connections

### Dropout Placement

```
Input → LSTM₁ → Dropout → LSTM₂ → Dropout → LSTM₃ → Output
                  ↑                 ↑
           Between layers    Between layers
```

### Best Practices

- **Rate**: 0.2-0.5 typically works well
- **Position**: After LSTM output, not on input
- **Training only**: Disabled during inference
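
The "training only" point is handled by `model.train()` and `model.eval()`. A minimal sketch with a constant input makes the behavior visible: in training mode, surviving entries are scaled by `1 / (1 - p)`; in evaluation mode, dropout is the identity.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

p = 0.5
dropout = nn.Dropout(p)
x = torch.ones(2, 5, 4)  # (batch, seq_len, features)

dropout.train()
train_out = dropout(x)  # entries are either 0 or 1 / (1 - p)

dropout.eval()
eval_out = dropout(x)   # unchanged

kept = train_out[train_out != 0]
print(f"Kept values in train mode: {kept.unique().tolist()}")
print(f"Eval output equals input: {torch.equal(eval_out, x)}")
```

Forgetting to call `model.eval()` before generating predictions leaves dropout active and makes the outputs noisy.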

In [None]:
class VariationalDropout(nn.Module):
    """
    Variational Dropout for sequences.
    
    Uses the same dropout mask across all timesteps,
    preserving temporal correlations.
    """
    
    def __init__(self, dropout_rate=0.2):
        super(VariationalDropout, self).__init__()
        self.dropout_rate = dropout_rate
        
    def forward(self, x):
        if not self.training or self.dropout_rate == 0:
            return x
        
        # x shape: (batch, seq_len, features)
        # Create mask for (batch, 1, features) - same mask for all timesteps
        mask = torch.bernoulli(
            torch.full((x.size(0), 1, x.size(2)), 1 - self.dropout_rate,
                       device=x.device, dtype=x.dtype)
        )
        
        # Expand mask across timesteps and apply
        mask = mask.expand_as(x)
        return x * mask / (1 - self.dropout_rate)

# Demonstrate difference
sample_seq = torch.randn(2, 5, 4)  # (batch=2, seq_len=5, features=4)

# Standard dropout - different mask each time
std_dropout = nn.Dropout(0.5)
std_dropout.train()

# Variational dropout - same mask across timesteps
var_dropout = VariationalDropout(0.5)
var_dropout.train()

print("Standard Dropout (different mask per timestep):")
out_std = std_dropout(sample_seq)
print(f"Non-zero pattern varies: {(out_std[0] != 0).int()}")

print("\nVariational Dropout (same mask across timesteps):")
out_var = var_dropout(sample_seq)
print(f"Consistent pattern: {(out_var[0] != 0).int()}")

In [None]:
class LSTMWithVariationalDropout(nn.Module):
    """
    LSTM with Variational Dropout between layers.
    """
    
    def __init__(self, input_size, hidden_size, num_layers, dropout=0.2):
        super(LSTMWithVariationalDropout, self).__init__()
        
        self.layers = nn.ModuleList()
        self.dropouts = nn.ModuleList()
        
        # Build layers
        for i in range(num_layers):
            input_dim = input_size if i == 0 else hidden_size
            self.layers.append(
                nn.LSTM(input_dim, hidden_size, batch_first=True)
            )
            if i < num_layers - 1:  # No dropout after last layer
                self.dropouts.append(VariationalDropout(dropout))
        
        self.fc = nn.Linear(hidden_size, 1)
        
    def forward(self, x):
        for i, lstm in enumerate(self.layers):
            x, _ = lstm(x)
            if i < len(self.dropouts):
                x = self.dropouts[i](x)
        
        return self.fc(x[:, -1, :])

# Create model
var_lstm = LSTMWithVariationalDropout(
    input_size=1,
    hidden_size=64,
    num_layers=3,
    dropout=0.3
).to(device)

print(var_lstm)

---

## 5. Deep LSTM Trading Model <a id='5-trading-model'></a>

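This section combines the earlier ideas into a single model:

1. **Bidirectional first layer**: Reads each input window in both directions
2. **Stacked unidirectional layers**: Build higher-level features on top
3. **Dropout**: Regularizes the projection output and the final hidden state

The version below uses the closing price as its only feature and has no attention mechanism. Technical indicators as extra features and attention over timesteps are natural extensions.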

In [None]:
class DeepLSTMTradingModel(nn.Module):
    """
    Deep LSTM for price prediction.
    
    Features:
    - Bidirectional first layer for pattern recognition
    - Linear projection back to hidden_size
    - Stacked unidirectional layers for prediction
    - Standard dropout (nn.Dropout) for regularization
    """
    
    def __init__(self, input_size, hidden_size=128, num_layers=3, dropout=0.3):
        super(DeepLSTMTradingModel, self).__init__()
        
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        
        # Layer 1: Bidirectional for pattern recognition
        self.bi_lstm = nn.LSTM(
            input_size=input_size,
            hidden_size=hidden_size,
            num_layers=1,
            batch_first=True,
            bidirectional=True
        )
        
        # Projection from bidirectional output to hidden_size
        self.projection = nn.Linear(hidden_size * 2, hidden_size)
        
        # Layer 2+: Stacked unidirectional LSTMs
        self.stacked_lstm = nn.LSTM(
            input_size=hidden_size,
            hidden_size=hidden_size,
            num_layers=num_layers - 1,
            batch_first=True,
            dropout=dropout if num_layers > 2 else 0.0
        )
        
        # Dropout
        self.dropout = nn.Dropout(dropout)
        
        # Output layers
        self.fc1 = nn.Linear(hidden_size, hidden_size // 2)
        self.fc2 = nn.Linear(hidden_size // 2, 1)
        
        # Activation
        self.relu = nn.ReLU()
        
    def forward(self, x):
        # Bidirectional layer
        bi_out, _ = self.bi_lstm(x)
        
        # Project to hidden_size
        projected = self.relu(self.projection(bi_out))
        projected = self.dropout(projected)
        
        # Stacked layers
        stacked_out, _ = self.stacked_lstm(projected)
        
        # Get last output
        last_output = stacked_out[:, -1, :]
        last_output = self.dropout(last_output)
        
        # Dense layers
        out = self.relu(self.fc1(last_output))
        out = self.fc2(out)
        
        return out

# Create model
model = DeepLSTMTradingModel(
    input_size=1,
    hidden_size=128,
    num_layers=3,
    dropout=0.3
).to(device)

print(model)
print(f"\nTotal parameters: {sum(p.numel() for p in model.parameters()):,}")

In [None]:
def train_model(model, train_loader, test_loader, epochs=50, lr=0.001):
    """
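    Training loop with early stopping and learning rate scheduling.

    Note: test_loader doubles as the validation set for both, so the
    reported test metrics are optimistic.
    """
    criterion = nn.MSELoss()
    optimizer = optim.Adam(model.parameters(), lr=lr, weight_decay=1e-5)
    # `verbose` is omitted: it is deprecated in recent PyTorch releases
    scheduler = optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode='min', factor=0.5, patience=5
    )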
    
    train_losses = []
    test_losses = []
    best_test_loss = float('inf')
    patience_counter = 0
    patience = 10
    
    for epoch in range(epochs):
        # Training
        model.train()
        train_loss = 0.0
        
        for X_batch, y_batch in train_loader:
            optimizer.zero_grad()
            outputs = model(X_batch)
            loss = criterion(outputs, y_batch)
            loss.backward()
            
            # Gradient clipping
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            
            optimizer.step()
            train_loss += loss.item()
        
        train_loss /= len(train_loader)
        train_losses.append(train_loss)
        
        # Evaluation
        model.eval()
        test_loss = 0.0
        
        with torch.no_grad():
            for X_batch, y_batch in test_loader:
                outputs = model(X_batch)
                loss = criterion(outputs, y_batch)
                test_loss += loss.item()
        
        test_loss /= len(test_loader)
        test_losses.append(test_loss)
        
        # Learning rate scheduling
        scheduler.step(test_loss)
        
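        # Early stopping check
        if test_loss < best_test_loss:
            best_test_loss = test_loss
            patience_counter = 0
            # Snapshot the best weights. state_dict().copy() is shallow and
            # would keep tracking the live parameters, so clone each tensor.
            best_model_state = {
                k: v.detach().clone() for k, v in model.state_dict().items()
            }
        else:
            patience_counter += 1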
        
        if (epoch + 1) % 10 == 0:
            print(f"Epoch [{epoch+1}/{epochs}] "
                  f"Train Loss: {train_loss:.6f} | "
                  f"Test Loss: {test_loss:.6f}")
        
        if patience_counter >= patience:
            print(f"\nEarly stopping at epoch {epoch+1}")
            break
    
    # Restore the best weights, whether or not early stopping triggered
    model.load_state_dict(best_model_state)
    return train_losses, test_losses

# Train the model
print("Training Deep LSTM Trading Model...\n")
train_losses, test_losses = train_model(
    model, train_loader, test_loader, epochs=100, lr=0.001
)

In [None]:
# Plot training history
fig, axes = plt.subplots(1, 2, figsize=(14, 4))

# Loss curves
ax1 = axes[0]
ax1.plot(train_losses, label='Train Loss', linewidth=2)
ax1.plot(test_losses, label='Test Loss', linewidth=2)
ax1.set_xlabel('Epoch')
ax1.set_ylabel('MSE Loss')
ax1.set_title('Training Progress')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Log scale
ax2 = axes[1]
ax2.semilogy(train_losses, label='Train Loss', linewidth=2)
ax2.semilogy(test_losses, label='Test Loss', linewidth=2)
ax2.set_xlabel('Epoch')
ax2.set_ylabel('MSE Loss (log scale)')
ax2.set_title('Training Progress (Log Scale)')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Generate predictions
model.eval()

with torch.no_grad():
    train_predictions = model(X_train_tensor).cpu().numpy()
    test_predictions = model(X_test_tensor).cpu().numpy()

# Inverse transform
train_predictions = scaler.inverse_transform(train_predictions)
test_predictions = scaler.inverse_transform(test_predictions)
y_train_actual = scaler.inverse_transform(y_train)
y_test_actual = scaler.inverse_transform(y_test)

# Calculate metrics
train_rmse = np.sqrt(mean_squared_error(y_train_actual, train_predictions))
test_rmse = np.sqrt(mean_squared_error(y_test_actual, test_predictions))
train_mae = mean_absolute_error(y_train_actual, train_predictions)
test_mae = mean_absolute_error(y_test_actual, test_predictions)

print("Model Performance:")
print(f"{'Metric':<15} {'Train':<15} {'Test':<15}")
print("-" * 45)
print(f"{'RMSE':<15} ${train_rmse:<14.2f} ${test_rmse:<14.2f}")
print(f"{'MAE':<15} ${train_mae:<14.2f} ${test_mae:<14.2f}")

In [None]:
# Visualize predictions
fig, axes = plt.subplots(2, 1, figsize=(14, 10))

# Full timeline
ax1 = axes[0]
dates = data.index[SEQ_LENGTH:]
train_dates = dates[:train_size]
test_dates = dates[train_size:]

ax1.plot(train_dates, y_train_actual, 'b-', label='Train Actual', alpha=0.7)
ax1.plot(train_dates, train_predictions, 'g-', label='Train Predicted', alpha=0.7)
ax1.plot(test_dates, y_test_actual, 'b-', alpha=0.7)
ax1.plot(test_dates, test_predictions, 'r-', label='Test Predicted', alpha=0.7)
ax1.axvline(x=test_dates[0], color='black', linestyle='--', label='Train/Test Split')
ax1.set_xlabel('Date')
ax1.set_ylabel('Price ($)')
ax1.set_title(f'{ticker} Price Prediction - Deep LSTM Model')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Test set zoom
ax2 = axes[1]
ax2.plot(test_dates, y_test_actual, 'b-', label='Actual', linewidth=2)
ax2.plot(test_dates, test_predictions, 'r--', label='Predicted', linewidth=2)
ax2.fill_between(test_dates.to_numpy(), 
                  y_test_actual.flatten(), 
                  test_predictions.flatten(),
                  alpha=0.3, color='gray', label='Prediction Error')
ax2.set_xlabel('Date')
ax2.set_ylabel('Price ($)')
ax2.set_title('Test Set Predictions (Zoomed)')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Compare different architectures
def evaluate_architecture(model_class, model_params, name):
    """Train and evaluate a model architecture."""
    model = model_class(**model_params).to(device)
    
    criterion = nn.MSELoss()
    optimizer = optim.Adam(model.parameters(), lr=0.001)
    
    # Quick training (reduced epochs for comparison)
    for epoch in range(30):
        model.train()
        for X_batch, y_batch in train_loader:
            optimizer.zero_grad()
            outputs = model(X_batch)
            loss = criterion(outputs, y_batch)
            loss.backward()
            optimizer.step()
    
    # Evaluate
    model.eval()
    with torch.no_grad():
        test_pred = model(X_test_tensor).cpu().numpy()
    
    test_pred = scaler.inverse_transform(test_pred)
    rmse = np.sqrt(mean_squared_error(y_test_actual, test_pred))
    
    params = sum(p.numel() for p in model.parameters())
    
    return {'name': name, 'rmse': rmse, 'params': params}

# Define architectures to compare
architectures = [
    (StackedLSTM, {'input_size': 1, 'hidden_size': 64, 'num_layers': 1}, 'Simple LSTM (1 layer)'),
    (StackedLSTM, {'input_size': 1, 'hidden_size': 64, 'num_layers': 2, 'dropout': 0.2}, 'Stacked LSTM (2 layers)'),
    (StackedLSTM, {'input_size': 1, 'hidden_size': 64, 'num_layers': 3, 'dropout': 0.2}, 'Stacked LSTM (3 layers)'),
    (BidirectionalLSTM, {'input_size': 1, 'hidden_size': 64, 'num_layers': 1}, 'Bidirectional LSTM'),
    (DeepLSTMTradingModel, {'input_size': 1, 'hidden_size': 64, 'num_layers': 3, 'dropout': 0.2}, 'Deep Trading Model'),
]

print("Comparing architectures (30 epochs each)...\n")
results = []

for model_class, params, name in architectures:
    result = evaluate_architecture(model_class, params, name)
    results.append(result)
    print(f"{name}: RMSE=${result['rmse']:.2f}, Params={result['params']:,}")

# Create comparison dataframe
results_df = pd.DataFrame(results)
results_df = results_df.sort_values('rmse')
print("\nRanking by RMSE:")
print(results_df.to_string(index=False))

In [None]:
# Visualize architecture comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# RMSE comparison
ax1 = axes[0]
colors = plt.cm.viridis(np.linspace(0, 0.8, len(results_df)))
bars = ax1.barh(results_df['name'], results_df['rmse'], color=colors)
ax1.set_xlabel('RMSE ($)')
ax1.set_title('Model Comparison: Test RMSE')
ax1.grid(True, alpha=0.3, axis='x')

# Add value labels
for bar, rmse in zip(bars, results_df['rmse']):
    ax1.text(bar.get_width() + 0.1, bar.get_y() + bar.get_height()/2,
             f'${rmse:.2f}', va='center', fontsize=10)

# Parameters vs RMSE
ax2 = axes[1]
ax2.scatter(results_df['params'], results_df['rmse'], s=200, c=colors, edgecolors='black')
for _, row in results_df.iterrows():
    ax2.annotate(row['name'].split('(')[0].strip(), 
                 (row['params'], row['rmse']),
                 textcoords="offset points", xytext=(5,5), fontsize=9)
ax2.set_xlabel('Number of Parameters')
ax2.set_ylabel('RMSE ($)')
ax2.set_title('Parameters vs Performance')
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

---

## 6. Key Takeaways <a id='6-takeaways'></a>

### Bidirectional RNNs

| Aspect | Details |
|--------|--------|
| **When to Use** | Post-hoc analysis, pattern recognition, regime identification |
| **When NOT to Use** | Real-time prediction (future unavailable) |
| **Output Size** | Doubles hidden size (forward + backward concatenated) |
| **Trade-off** | Better context understanding vs. 2x parameters |
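
The "2x parameters" trade-off is easy to verify for a single layer. This sketch uses a hypothetical `count_params` helper and the same sizes as the `BidirectionalLSTM` example (`input_size=1`, `hidden_size=64`).

```python
import torch.nn as nn


def count_params(module):
    """Total number of parameters in a module."""
    return sum(p.numel() for p in module.parameters())


uni = nn.LSTM(input_size=1, hidden_size=64, batch_first=True)
bi = nn.LSTM(input_size=1, hidden_size=64, batch_first=True, bidirectional=True)

uni_params = count_params(uni)
bi_params = count_params(bi)

print(f"Unidirectional: {uni_params:,}")
print(f"Bidirectional:  {bi_params:,}")
```

Any layer that consumes the bidirectional output also grows, since its input width becomes `hidden_size * 2`.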

### Stacked LSTMs

| Aspect | Details |
|--------|--------|
| **Purpose** | Hierarchical feature learning |
| **Depth Guidelines** | Match to data size (1-4 layers typically) |
| **Architecture** | Can use decreasing sizes (pyramid) |
| **Regularization** | Essential as depth increases |

### Dropout for Sequences

| Type | Application | Best For |
|------|-------------|----------|
| Standard | Between layers | General regularization |
| Variational | Same mask across time | Temporal consistency |
| Recurrent | Hidden-to-hidden | Preventing co-adaptation |

### Practical Recommendations

1. **Start Simple**: Begin with 1-2 layer LSTM, add complexity if needed
2. **Use Dropout**: 0.2-0.3 between LSTM layers
3. **Bidirectional**: Only when full sequence is available at inference
4. **Monitor Overfitting**: Deep models need more regularization
5. **Gradient Clipping**: Helps keep training stable in deep recurrent networks (this notebook uses max_norm=1.0)
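
To illustrate the last recommendation, the sketch below applies `clip_grad_norm_` to a small LSTM. The loss is scaled up artificially so that the gradients are likely to be large; the model sizes and the scaling factor are arbitrary choices for the demonstration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.LSTM(input_size=1, hidden_size=16, num_layers=2, batch_first=True)
x = torch.randn(8, 20, 1)

out, _ = model(x)
loss = (100.0 * out).pow(2).mean()  # artificial scaling to inflate gradients
loss.backward()

# Returns the total gradient norm measured before clipping
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# Recompute the total norm from the (now clipped) gradients
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))

print(f"Gradient norm before clipping: {norm_before.item():.4f}")
print(f"Gradient norm after clipping:  {norm_after.item():.4f}")
```

Clipping rescales all gradients by the same factor, so the update direction is preserved and only its length is capped.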

---

### Next Steps
- **Day 6**: Sequence-to-Sequence Models
- **Day 7**: Advanced Training Techniques