# Week 14, Day 3: GRU Networks

## Gated Recurrent Units for Financial Time Series

**Learning Objectives:**
1. Understand GRU architecture (reset and update gates)
2. Compare GRU vs LSTM networks
3. Learn when to use each architecture
4. Implement GRU in PyTorch
5. Build a volatility prediction model

---

## 1. GRU Architecture

### What is a GRU?

The **Gated Recurrent Unit (GRU)** was introduced by Cho et al. (2014) as a simpler alternative to the LSTM. It combines the forget and input gates into a single **update gate** and merges the cell state and hidden state.

### GRU Gates

GRU has **two gates** (vs LSTM's three):

#### 1. Reset Gate ($r_t$)
Controls how much of the previous hidden state feeds into the candidate state (values near 0 discard it):
$$r_t = \sigma(W_r \cdot [h_{t-1}, x_t] + b_r)$$

#### 2. Update Gate ($z_t$)
Controls how much of the new candidate state to use:
$$z_t = \sigma(W_z \cdot [h_{t-1}, x_t] + b_z)$$

### GRU Equations

**Candidate hidden state:**
$$\tilde{h}_t = \tanh(W_h \cdot [r_t \odot h_{t-1}, x_t] + b_h)$$

**Final hidden state:**
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$
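
To make the interpolation concrete, the sketch below applies the final-state equation to a single hidden unit. The numbers and the helper name `gru_blend` are made up for illustration; they do not come from a trained model.

```python
import torch

def gru_blend(h_prev, h_tilde, z_t):
    """Final GRU state: h_t = (1 - z_t) * h_prev + z_t * h_tilde."""
    return (1 - z_t) * h_prev + z_t * h_tilde

# Hypothetical values for one hidden unit
h_prev = torch.tensor([0.50])    # previous hidden state h_{t-1}
h_tilde = torch.tensor([-0.20])  # candidate state

for z in (0.0, 0.25, 1.0):
    h_t = gru_blend(h_prev, h_tilde, torch.tensor([z]))
    print(f"z_t = {z:.2f} -> h_t = {h_t.item():+.3f}")
```

With $z_t = 0$ the old state is copied through unchanged, with $z_t = 1$ it is replaced by the candidate, and $z_t = 0.25$ gives $0.75 \times 0.50 + 0.25 \times (-0.20) = 0.325$.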

### Visual Intuition

```
         ┌─────────────────────────────────────┐
         │              GRU Cell               │
         │                                     │
    x_t ─┼──┬──────────┬──────────┬───────────┤
         │  │          │          │           │
         │  ▼          ▼          ▼           │
         │ ┌──┐      ┌──┐      ┌─────┐       │
         │ │z_t│     │r_t│     │ h̃_t │       │
         │ │  │      │  │      │     │       │
         │ └──┘      └──┘      └─────┘       │
         │  │          │          │           │
         │  └──────────┴──────────┘           │
         │              │                     │
         │              ▼                     │
 h_{t-1}─┼─────────►  (×)  ──────────────────►├── h_t
         │                                     │
         └─────────────────────────────────────┘
```

In [None]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
import yfinance as yf
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import warnings
warnings.filterwarnings('ignore')

# Set seeds for reproducibility
np.random.seed(42)
torch.manual_seed(42)

# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

## 2. GRU Cell from Scratch

Let's implement a GRU cell manually to understand the mechanics:

In [None]:
class GRUCellFromScratch(nn.Module):
    """
    Manual GRU Cell Implementation for educational purposes.
    
    GRU has 2 gates vs LSTM's 3:
    - Reset gate (r): controls how much past info enters the candidate state
    - Update gate (z): controls how much new info to add
    """
    
    def __init__(self, input_size, hidden_size):
        super(GRUCellFromScratch, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        
        # Reset gate weights
        self.W_xr = nn.Linear(input_size, hidden_size, bias=False)
        self.W_hr = nn.Linear(hidden_size, hidden_size, bias=True)
        
        # Update gate weights
        self.W_xz = nn.Linear(input_size, hidden_size, bias=False)
        self.W_hz = nn.Linear(hidden_size, hidden_size, bias=True)
        
        # Candidate hidden state weights
        self.W_xh = nn.Linear(input_size, hidden_size, bias=False)
        self.W_hh = nn.Linear(hidden_size, hidden_size, bias=True)
        
    def forward(self, x, h_prev):
        """
        Forward pass for single time step.
        
        Args:
            x: input at time t, shape (batch, input_size)
            h_prev: hidden state at t-1, shape (batch, hidden_size)
            
        Returns:
            h_t: new hidden state
        """
        # Reset gate: how much of the past feeds into the candidate state
        r_t = torch.sigmoid(self.W_xr(x) + self.W_hr(h_prev))
        
        # Update gate: how much of new info to use
        z_t = torch.sigmoid(self.W_xz(x) + self.W_hz(h_prev))
        
        # Candidate hidden state (reset gate applied here)
        h_candidate = torch.tanh(self.W_xh(x) + self.W_hh(r_t * h_prev))
        
        # Final hidden state: interpolation between old and new
        h_t = (1 - z_t) * h_prev + z_t * h_candidate
        
        return h_t, {'reset_gate': r_t, 'update_gate': z_t}

# Test our implementation
batch_size, input_size, hidden_size = 2, 4, 8
gru_cell = GRUCellFromScratch(input_size, hidden_size)

x = torch.randn(batch_size, input_size)
h = torch.zeros(batch_size, hidden_size)

h_new, gates = gru_cell(x, h)
print(f"Input shape: {x.shape}")
print(f"Hidden state shape: {h_new.shape}")
print(f"Reset gate shape: {gates['reset_gate'].shape}")
print(f"Update gate shape: {gates['update_gate'].shape}")
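
A cell only handles one time step, so processing a sequence means looping over time and carrying the hidden state forward. The sketch below does this with PyTorch's built-in `nn.GRUCell` and checks the result against `nn.GRU` after copying the weights across. Note that PyTorch uses a slightly different convention from the equations in Section 1: it applies the reset gate after the recurrent matrix product, and its $z_t$ weights the old state rather than the candidate. A cell written from the Section 1 equations is therefore not expected to match `nn.GRUCell` number for number, even with shared weights. The sizes used here are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch, seq_len, n_in, n_hid = 2, 5, 4, 8

gru = nn.GRU(n_in, n_hid, num_layers=1, batch_first=True)
cell = nn.GRUCell(n_in, n_hid)

# Share weights so both modules compute the same function
with torch.no_grad():
    cell.weight_ih.copy_(gru.weight_ih_l0)
    cell.weight_hh.copy_(gru.weight_hh_l0)
    cell.bias_ih.copy_(gru.bias_ih_l0)
    cell.bias_hh.copy_(gru.bias_hh_l0)

x = torch.randn(batch, seq_len, n_in)
h = torch.zeros(batch, n_hid)

with torch.no_grad():
    steps = []
    for t in range(seq_len):
        h = cell(x[:, t, :], h)           # one time step, state carried forward
        steps.append(h)
    unrolled = torch.stack(steps, dim=1)  # (batch, seq_len, hidden)
    full_out, h_n = gru(x)

max_diff = (unrolled - full_out).abs().max().item()
print(f"Max abs difference between unrolled cell and nn.GRU: {max_diff:.2e}")
```

The difference should be at the level of floating-point noise, which confirms that `nn.GRU` is just this loop run inside an optimized kernel.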

## 3. GRU vs LSTM Comparison

### Architectural Differences

| Aspect | LSTM | GRU |
|--------|------|-----|
| **Gates** | 3 (forget, input, output) | 2 (reset, update) |
| **States** | 2 (cell + hidden) | 1 (hidden only) |
| **Parameters** (vs vanilla RNN) | ~4x | ~3x |
| **Computation** | Slower | Faster |
| **Memory** | More | Less |

### Parameter Count Comparison

For input size $n$ and hidden size $h$:
- **LSTM**: $4 \times (n \times h + h^2 + h) = 4(nh + h^2 + h)$
- **GRU**: $3 \times (n \times h + h^2 + h) = 3(nh + h^2 + h)$

GRU therefore has **~25% fewer parameters** than an LSTM with the same $n$ and $h$ (a 3:4 ratio).
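
The formulas can be checked against PyTorch directly. One detail: `nn.GRU` and `nn.LSTM` store two bias vectors per gate block (`bias_ih` and `bias_hh`), so the bias term becomes $2h$ instead of $h$. The sketch below accounts for that; the helper names `rnn_param_count` and `count_params` are illustrative.

```python
import torch.nn as nn

def rnn_param_count(n, h, gates):
    """Single-layer count, with PyTorch's two bias vectors per gate block."""
    return gates * (n * h + h * h + 2 * h)

def count_params(module):
    return sum(p.numel() for p in module.parameters())

n, h = 10, 64
gru_expected = rnn_param_count(n, h, gates=3)
lstm_expected = rnn_param_count(n, h, gates=4)
gru_actual = count_params(nn.GRU(n, h, num_layers=1))
lstm_actual = count_params(nn.LSTM(n, h, num_layers=1))

print(f"GRU : formula {gru_expected:,} | PyTorch {gru_actual:,}")
print(f"LSTM: formula {lstm_expected:,} | PyTorch {lstm_actual:,}")
print(f"GRU / LSTM ratio: {gru_actual / lstm_actual:.2f}")
```

The extra bias vector changes the absolute counts slightly but leaves the 3:4 ratio intact.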

In [None]:
def count_parameters(model):
    """Count trainable parameters in a model."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Compare parameter counts
input_size = 10
hidden_sizes = [32, 64, 128, 256]
num_layers = 2

comparison_data = []

for hidden_size in hidden_sizes:
    lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
    gru = nn.GRU(input_size, hidden_size, num_layers, batch_first=True)
    
    lstm_params = count_parameters(lstm)
    gru_params = count_parameters(gru)
    
    comparison_data.append({
        'Hidden Size': hidden_size,
        'LSTM Params': lstm_params,
        'GRU Params': gru_params,
        'Reduction (%)': (1 - gru_params/lstm_params) * 100
    })

comparison_df = pd.DataFrame(comparison_data)
print("Parameter Comparison (2 layers):")
print(comparison_df.to_string(index=False))

In [None]:
# Speed comparison
import time

def benchmark_model(model, x, num_runs=100):
    """Benchmark forward pass speed."""
    model.eval()
    
    # Warm up
    with torch.no_grad():
        for _ in range(10):
            _ = model(x)
    
    # Benchmark
    start = time.perf_counter()  # monotonic, high-resolution timer
    with torch.no_grad():
        for _ in range(num_runs):
            _ = model(x)
    end = time.perf_counter()
    
    return (end - start) / num_runs * 1000  # ms per forward pass

# Create test input
batch_size, seq_len, input_size, hidden_size = 32, 60, 10, 64
x = torch.randn(batch_size, seq_len, input_size)

lstm = nn.LSTM(input_size, hidden_size, num_layers=2, batch_first=True)
gru = nn.GRU(input_size, hidden_size, num_layers=2, batch_first=True)

lstm_time = benchmark_model(lstm, x)
gru_time = benchmark_model(gru, x)

print(f"\nSpeed Comparison (seq_len={seq_len}, batch={batch_size}):")
print(f"LSTM: {lstm_time:.3f} ms per forward pass")
print(f"GRU:  {gru_time:.3f} ms per forward pass")
print(f"GRU is {(lstm_time/gru_time - 1)*100:.1f}% faster")

## 4. When to Use Each Architecture

### Use GRU When:
- ✅ **Limited computational resources** (fewer parameters)
- ✅ **Smaller datasets** (less overfitting risk)
- ✅ **Shorter sequences** (less need for long-term memory)
- ✅ **Real-time applications** (faster inference)
- ✅ **Quick prototyping** (simpler to tune)

### Use LSTM When:
- ✅ **Very long sequences** (better gradient flow via cell state)
- ✅ **Complex temporal dependencies** (more expressive)
- ✅ **Large datasets** (can utilize extra capacity)
- ✅ **Critical applications** (more battle-tested)

### Financial Applications Guide:

| Task | Recommendation | Reason |
|------|----------------|--------|
| Intraday volatility | GRU | Short sequences, speed matters |
| Long-term trend | LSTM | Needs long memory |
| HFT signals | GRU | Low latency critical |
| Portfolio optimization | Either | Test both |
| Sentiment analysis | GRU | Usually short text |

## 5. GRU Implementation in PyTorch

Let's build a complete GRU model for financial prediction:

In [None]:
class FinancialGRU(nn.Module):
    """
    GRU model for financial time series prediction.
    
    Features:
    - Multi-layer GRU with dropout
    - Batch normalization
    - Configurable output size (prediction made from the final hidden state)
    """
    
    def __init__(self, input_size, hidden_size, num_layers, output_size,
                 dropout=0.2, bidirectional=False):
        super(FinancialGRU, self).__init__()
        
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.bidirectional = bidirectional
        self.num_directions = 2 if bidirectional else 1
        
        # Input batch normalization
        self.batch_norm = nn.BatchNorm1d(input_size)
        
        # GRU layers
        self.gru = nn.GRU(
            input_size=input_size,
            hidden_size=hidden_size,
            num_layers=num_layers,
            batch_first=True,
            dropout=dropout if num_layers > 1 else 0,
            bidirectional=bidirectional
        )
        
        # Output layers
        fc_input_size = hidden_size * self.num_directions
        self.fc = nn.Sequential(
            nn.Linear(fc_input_size, hidden_size),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_size, output_size)
        )
        
    def forward(self, x):
        """
        Forward pass.
        
        Args:
            x: Input tensor of shape (batch, seq_len, input_size)
            
        Returns:
            Output prediction
        """
        batch_size = x.size(0)
        
        # Batch normalization (transpose for BatchNorm1d)
        x = x.permute(0, 2, 1)  # (batch, features, seq_len)
        x = self.batch_norm(x)
        x = x.permute(0, 2, 1)  # (batch, seq_len, features)
        
        # Initialize hidden state
        h0 = torch.zeros(
            self.num_layers * self.num_directions,
            batch_size,
            self.hidden_size
        ).to(x.device)
        
        # GRU forward pass
        gru_out, h_n = self.gru(x, h0)
        
        # Use last hidden state for prediction
        if self.bidirectional:
            # Concatenate forward and backward final states
            h_final = torch.cat((h_n[-2], h_n[-1]), dim=1)
        else:
            h_final = h_n[-1]
        
        # Output layer
        out = self.fc(h_final)
        
        return out

# Test the model
model = FinancialGRU(
    input_size=5,
    hidden_size=64,
    num_layers=2,
    output_size=1,
    dropout=0.2
)

x_test = torch.randn(16, 30, 5)  # (batch, seq_len, features)
y_test = model(x_test)
print(f"Model output shape: {y_test.shape}")
print(f"Total parameters: {count_parameters(model):,}")
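
The model reads its prediction from `h_n` rather than from the output sequence, so it helps to see how the two relate. The sketch below, with arbitrary sizes, checks that for a unidirectional `batch_first` GRU the last time step of the output equals the top layer's final state. For a bidirectional GRU, `h_n[-2]` is the forward direction and `h_n[-1]` the backward direction of the top layer.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden = 6
x = torch.randn(3, 10, 4)  # (batch, seq_len, features)

# Unidirectional: last output step == final hidden state of the top layer
gru = nn.GRU(4, hidden, num_layers=2, batch_first=True)
with torch.no_grad():
    out, h_n = gru(x)
uni_match = torch.allclose(out[:, -1, :], h_n[-1], atol=1e-6)

# Bidirectional: the forward pass ends at the last step,
# the backward pass ends at the first step
bigru = nn.GRU(4, hidden, num_layers=2, batch_first=True, bidirectional=True)
with torch.no_grad():
    out_bi, h_bi = bigru(x)
fwd_match = torch.allclose(out_bi[:, -1, :hidden], h_bi[-2], atol=1e-6)
bwd_match = torch.allclose(out_bi[:, 0, hidden:], h_bi[-1], atol=1e-6)

print(f"h_n shape (uni): {tuple(h_n.shape)}, (bi): {tuple(h_bi.shape)}")
print(f"uni: {uni_match}, bi forward: {fwd_match}, bi backward: {bwd_match}")
```

This is why taking `out[:, -1, :]` from a bidirectional model would give a backward state that has seen only one time step, whereas concatenating `h_n[-2]` and `h_n[-1]` gives both directions after a full pass.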

## 6. Volatility Prediction - Practical Application

We'll build a GRU model that predicts forward 20-day realized volatility (annualized) from daily SPY closing prices.

In [None]:
# Download financial data
ticker = 'SPY'
start_date = '2018-01-01'
end_date = '2024-01-01'

print(f"Downloading {ticker} data...")
data = yf.download(ticker, start=start_date, end=end_date, progress=False)

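# Use Close price. Some yfinance versions return MultiIndex columns
# (field, ticker) even for a single ticker, so reduce to a Series first.
close = data['Close']
if isinstance(close, pd.DataFrame):
    close = close.iloc[:, 0]
df = close.to_frame(name='Close')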

print(f"Data shape: {df.shape}")
print(f"Date range: {df.index[0]} to {df.index[-1]}")
df.head()

In [None]:
# Feature engineering for volatility prediction
def create_volatility_features(df, vol_window=20):
    """
    Create features for volatility prediction.
    
    Args:
        df: DataFrame with Close prices
        vol_window: Window for realized volatility calculation
        
    Returns:
        DataFrame with features and target
    """
    features = pd.DataFrame(index=df.index)
    
    # Returns
    features['returns'] = df['Close'].pct_change()
    features['log_returns'] = np.log(df['Close'] / df['Close'].shift(1))
    
    # Absolute returns (proxy for volatility)
    features['abs_returns'] = features['returns'].abs()
    
    # Historical volatility (rolling std of returns)
    features['vol_5d'] = features['returns'].rolling(5).std() * np.sqrt(252)
    features['vol_10d'] = features['returns'].rolling(10).std() * np.sqrt(252)
    features['vol_20d'] = features['returns'].rolling(20).std() * np.sqrt(252)
    
    # Volatility ratio (short-term / long-term)
    features['vol_ratio'] = features['vol_5d'] / features['vol_20d']
    
    # Price momentum
    features['momentum_5d'] = df['Close'].pct_change(5)
    features['momentum_10d'] = df['Close'].pct_change(10)
    
    # Range-based volatility (Parkinson); skipped when df has no High/Low columns
    if 'High' in df.columns and 'Low' in df.columns:
        features['range'] = (np.log(df['High']) - np.log(df['Low'])) ** 2
        features['parkinson_vol'] = np.sqrt(features['range'].rolling(20).mean() / (4 * np.log(2))) * np.sqrt(252)
    
    # Target: Forward realized volatility (what we want to predict)
    features['target_vol'] = features['returns'].rolling(vol_window).std().shift(-vol_window) * np.sqrt(252)
    
    return features.dropna()

# Create features
features_df = create_volatility_features(df)
print(f"Features shape: {features_df.shape}")
print(f"\nFeature columns:")
print(features_df.columns.tolist())
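
The `rolling(...).std().shift(-vol_window)` construction used for the target is easy to misread, so here is a small check on synthetic returns. It uses a shorter window than the notebook's 20 days and leaves out the annualization factor; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
returns = pd.Series(rng.normal(0.0, 0.01, size=60))  # synthetic daily returns
window = 5

# Same construction as target_vol, without the sqrt(252) factor
forward_vol = returns.rolling(window).std().shift(-window)

# At row t the target should be the std of returns t+1 .. t+window
t = 10
manual = returns.iloc[t + 1 : t + 1 + window].std()

print(f"shifted rolling std at t={t}: {forward_vol.iloc[t]:.6f}")
print(f"manual std of next {window} returns: {manual:.6f}")
print(f"rows without a target: {int(forward_vol.isna().sum())}")
```

The two values agree, so the target at row `t` uses only returns strictly after `t`. The last `window` rows have no target, which is why `dropna()` trims the end of the sample.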

In [None]:
# Visualize volatility
fig, axes = plt.subplots(3, 1, figsize=(14, 10))

# Price
axes[0].plot(df.index, df['Close'], 'b-', linewidth=0.8)
axes[0].set_title(f'{ticker} Close Price', fontsize=12)
axes[0].set_ylabel('Price ($)')
axes[0].grid(True, alpha=0.3)

# Returns
axes[1].plot(features_df.index, features_df['returns'], 'gray', linewidth=0.5, alpha=0.7)
axes[1].axhline(y=0, color='black', linestyle='-', linewidth=0.5)
axes[1].set_title('Daily Returns', fontsize=12)
axes[1].set_ylabel('Return')
axes[1].grid(True, alpha=0.3)

# Volatility
axes[2].plot(features_df.index, features_df['vol_20d'], 'b-', label='20-day Vol', linewidth=1)
axes[2].plot(features_df.index, features_df['target_vol'], 'r--', label='Target (Forward Vol)', linewidth=1, alpha=0.7)
axes[2].set_title('Annualized Volatility', fontsize=12)
axes[2].set_ylabel('Volatility')
axes[2].legend()
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Prepare data for GRU
def prepare_sequences(data, feature_cols, target_col, seq_length):
    """
    Create sequences for time series prediction.
    
    Args:
        data: DataFrame with features and target
        feature_cols: List of feature column names
        target_col: Target column name
        seq_length: Length of input sequences
        
    Returns:
        X, y arrays
    """
    X, y = [], []
    
    features = data[feature_cols].values
    targets = data[target_col].values
    
    # +1 so the final window, ending on the last row, is included
    for i in range(len(data) - seq_length + 1):
        X.append(features[i:i+seq_length])
        y.append(targets[i+seq_length-1])  # Target aligned with the last step of the window
    
    return np.array(X), np.array(y)

# Define features to use
feature_cols = ['returns', 'abs_returns', 'vol_5d', 'vol_10d', 'vol_20d', 
                'vol_ratio', 'momentum_5d', 'momentum_10d']
target_col = 'target_vol'
seq_length = 30  # 30 days of history

# Create sequences
X, y = prepare_sequences(features_df, feature_cols, target_col, seq_length)
print(f"X shape: {X.shape}")
print(f"y shape: {y.shape}")

In [None]:
# Train/validation/test split (temporal)
train_size = int(len(X) * 0.7)
val_size = int(len(X) * 0.15)

X_train = X[:train_size]
y_train = y[:train_size]

X_val = X[train_size:train_size+val_size]
y_val = y[train_size:train_size+val_size]

X_test = X[train_size+val_size:]
y_test = y[train_size+val_size:]

print(f"Train: {X_train.shape[0]} samples")
print(f"Val:   {X_val.shape[0]} samples")
print(f"Test:  {X_test.shape[0]} samples")
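
Because the target is a 20-day forward volatility, the last training targets are computed from returns that fall inside the validation period, and the same happens at the validation/test boundary. One common remedy is to drop a gap of `vol_window` samples before each boundary. The sketch below shows the idea on a stand-in index; `purged_split_bounds` is a hypothetical helper, not part of the notebook, and the gap length is an assumption tied to the 20-day target.

```python
import numpy as np

def purged_split_bounds(n_samples, train_frac=0.70, val_frac=0.15, gap=20):
    """Return train/val/test slices, dropping `gap` samples before each boundary."""
    train_end = int(n_samples * train_frac)
    val_end = train_end + int(n_samples * val_frac)
    return (
        slice(0, train_end - gap),
        slice(train_end, val_end - gap),
        slice(val_end, n_samples),
    )

# Stand-in for the sequence index; in the notebook this would index X and y
idx = np.arange(1000)
train_sl, val_sl, test_sl = purged_split_bounds(len(idx), gap=20)

print(f"Train: {idx[train_sl][0]}..{idx[train_sl][-1]}")
print(f"Val:   {idx[val_sl][0]}..{idx[val_sl][-1]}")
print(f"Test:  {idx[test_sl][0]}..{idx[test_sl][-1]}")
```

Applied to the notebook's arrays this would read `X[train_sl]`, `y[train_sl]`, and so on. It costs a few dozen samples but keeps every training target fully realized before the first validation prediction date.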

In [None]:
# Scale features
scaler_X = StandardScaler()
scaler_y = StandardScaler()

# Fit on training data only
X_train_flat = X_train.reshape(-1, X_train.shape[-1])
scaler_X.fit(X_train_flat)
scaler_y.fit(y_train.reshape(-1, 1))

# Transform all sets
def scale_sequences(X, scaler):
    """Scale 3D sequence data."""
    original_shape = X.shape
    X_flat = X.reshape(-1, original_shape[-1])
    X_scaled = scaler.transform(X_flat)
    return X_scaled.reshape(original_shape)

X_train_scaled = scale_sequences(X_train, scaler_X)
X_val_scaled = scale_sequences(X_val, scaler_X)
X_test_scaled = scale_sequences(X_test, scaler_X)

y_train_scaled = scaler_y.transform(y_train.reshape(-1, 1)).flatten()
y_val_scaled = scaler_y.transform(y_val.reshape(-1, 1)).flatten()
y_test_scaled = scaler_y.transform(y_test.reshape(-1, 1)).flatten()

print("Data scaled successfully")

In [None]:
# Create PyTorch datasets and dataloaders
batch_size = 32

train_dataset = TensorDataset(
    torch.FloatTensor(X_train_scaled),
    torch.FloatTensor(y_train_scaled)
)
val_dataset = TensorDataset(
    torch.FloatTensor(X_val_scaled),
    torch.FloatTensor(y_val_scaled)
)
test_dataset = TensorDataset(
    torch.FloatTensor(X_test_scaled),
    torch.FloatTensor(y_test_scaled)
)

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

print(f"Number of batches - Train: {len(train_loader)}, Val: {len(val_loader)}, Test: {len(test_loader)}")

In [None]:
# Initialize model
model = FinancialGRU(
    input_size=len(feature_cols),
    hidden_size=64,
    num_layers=2,
    output_size=1,
    dropout=0.2,
    bidirectional=False
).to(device)

# Loss and optimizer
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-5)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.5, patience=10)

print(model)
print(f"\nTotal trainable parameters: {count_parameters(model):,}")

In [None]:
# Training function
def train_epoch(model, loader, criterion, optimizer, device):
    model.train()
    total_loss = 0
    
    for X_batch, y_batch in loader:
        X_batch = X_batch.to(device)
        y_batch = y_batch.to(device)
        
        optimizer.zero_grad()
        outputs = model(X_batch).squeeze(-1)  # (batch, 1) -> (batch,), also for a batch of one
        loss = criterion(outputs, y_batch)
        loss.backward()
        
        # Gradient clipping
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        
        optimizer.step()
        total_loss += loss.item()
    
    return total_loss / len(loader)

def evaluate(model, loader, criterion, device):
    model.eval()
    total_loss = 0
    predictions = []
    actuals = []
    
    with torch.no_grad():
        for X_batch, y_batch in loader:
            X_batch = X_batch.to(device)
            y_batch = y_batch.to(device)
            
            outputs = model(X_batch).squeeze(-1)  # (batch, 1) -> (batch,), also for a batch of one
            loss = criterion(outputs, y_batch)
            total_loss += loss.item()
            
            predictions.extend(outputs.cpu().numpy())
            actuals.extend(y_batch.cpu().numpy())
    
    return total_loss / len(loader), np.array(predictions), np.array(actuals)
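
When a model returns shape `(batch, 1)`, how the trailing dimension is removed matters if the final batch happens to contain a single sample. A short illustration with random tensors standing in for model output:

```python
import torch

out_batch = torch.randn(32, 1)   # a regular batch
out_single = torch.randn(1, 1)   # a final batch with one sample

# squeeze() removes every size-1 dimension, squeeze(-1) only the last one
print(out_batch.squeeze().shape, out_batch.squeeze(-1).shape)
print(out_single.squeeze().shape, out_single.squeeze(-1).shape)
```

For the single-sample batch, `squeeze()` yields a 0-dimensional tensor, which no longer matches a target of shape `(1,)` and cannot be iterated when collecting predictions. `squeeze(-1)` keeps the batch dimension in both cases.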

In [None]:
# Training loop
num_epochs = 100
best_val_loss = float('inf')
patience = 20
patience_counter = 0

train_losses = []
val_losses = []

print("Training GRU model...")
print("-" * 50)

for epoch in range(num_epochs):
    train_loss = train_epoch(model, train_loader, criterion, optimizer, device)
    val_loss, _, _ = evaluate(model, val_loader, criterion, device)
    
    train_losses.append(train_loss)
    val_losses.append(val_loss)
    
    # Learning rate scheduling
    scheduler.step(val_loss)
    
    # Early stopping
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        torch.save(model.state_dict(), 'best_gru_model.pth')
        patience_counter = 0
    else:
        patience_counter += 1
    
    if (epoch + 1) % 10 == 0:
        print(f"Epoch [{epoch+1}/{num_epochs}] - Train Loss: {train_loss:.6f}, Val Loss: {val_loss:.6f}")
    
    if patience_counter >= patience:
        print(f"\nEarly stopping at epoch {epoch+1}")
        break

print(f"\nBest validation loss: {best_val_loss:.6f}")

In [None]:
# Plot training history
fig, ax = plt.subplots(figsize=(10, 5))

ax.plot(train_losses, label='Train Loss', linewidth=1.5)
ax.plot(val_losses, label='Validation Loss', linewidth=1.5)
ax.set_xlabel('Epoch')
ax.set_ylabel('Loss (MSE)')
ax.set_title('GRU Training History')
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Load best model and evaluate on test set
model.load_state_dict(torch.load('best_gru_model.pth', map_location=device))

# Get predictions
test_loss, y_pred_scaled, y_true_scaled = evaluate(model, test_loader, criterion, device)

# Inverse transform predictions
y_pred = scaler_y.inverse_transform(y_pred_scaled.reshape(-1, 1)).flatten()
y_true = scaler_y.inverse_transform(y_true_scaled.reshape(-1, 1)).flatten()

# Calculate metrics
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)

# Direction accuracy (volatility increasing or decreasing)
y_true_diff = np.diff(y_true)
y_pred_diff = np.diff(y_pred)
direction_acc = np.mean(np.sign(y_true_diff) == np.sign(y_pred_diff))

print("Test Set Performance:")
print("=" * 40)
print(f"MSE:              {mse:.6f}")
print(f"RMSE:             {rmse:.6f}")
print(f"MAE:              {mae:.6f}")
print(f"R² Score:         {r2:.4f}")
print(f"Direction Acc:    {direction_acc:.2%}")
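
The direction-accuracy metric compares the sign of consecutive changes in the actual and predicted series. A hand-checkable example with made-up values (the function name `direction_accuracy` is illustrative):

```python
import numpy as np

def direction_accuracy(y_true, y_pred):
    """Share of steps where predicted and actual changes have the same sign."""
    true_dir = np.sign(np.diff(y_true))
    pred_dir = np.sign(np.diff(y_pred))
    return float(np.mean(true_dir == pred_dir))

# Actual moves: up, down, up. Predicted moves: up, up, up.
y_true_demo = np.array([0.10, 0.12, 0.11, 0.15])
y_pred_demo = np.array([0.11, 0.12, 0.13, 0.14])

acc = direction_accuracy(y_true_demo, y_pred_demo)
print(f"Direction accuracy: {acc:.2%}")
```

Two of the three moves agree, so the score is 2/3. Note that a series of length `n` yields only `n - 1` comparisons.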

In [None]:
# Visualize predictions
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Time series comparison
ax1 = axes[0, 0]
ax1.plot(y_true, label='Actual Volatility', linewidth=1, alpha=0.8)
ax1.plot(y_pred, label='Predicted Volatility', linewidth=1, alpha=0.8)
ax1.set_title('Volatility Prediction - Test Set', fontsize=12)
ax1.set_xlabel('Time')
ax1.set_ylabel('Annualized Volatility')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Scatter plot
ax2 = axes[0, 1]
ax2.scatter(y_true, y_pred, alpha=0.5, s=20)
ax2.plot([y_true.min(), y_true.max()], [y_true.min(), y_true.max()], 'r--', linewidth=2, label='Perfect Prediction')
ax2.set_xlabel('Actual Volatility')
ax2.set_ylabel('Predicted Volatility')
ax2.set_title(f'Actual vs Predicted (R² = {r2:.3f})', fontsize=12)
ax2.legend()
ax2.grid(True, alpha=0.3)

# Prediction error over time
ax3 = axes[1, 0]
errors = y_pred - y_true
ax3.plot(errors, linewidth=0.8, alpha=0.7)
ax3.axhline(y=0, color='r', linestyle='--', linewidth=1)
ax3.fill_between(range(len(errors)), errors, 0, alpha=0.3)
ax3.set_title('Prediction Error Over Time', fontsize=12)
ax3.set_xlabel('Time')
ax3.set_ylabel('Error (Pred - Actual)')
ax3.grid(True, alpha=0.3)

# Error distribution
ax4 = axes[1, 1]
ax4.hist(errors, bins=50, density=True, alpha=0.7, edgecolor='black')
ax4.axvline(x=0, color='r', linestyle='--', linewidth=2)
ax4.axvline(x=errors.mean(), color='g', linestyle='-', linewidth=2, label=f'Mean = {errors.mean():.4f}')
ax4.set_title('Error Distribution', fontsize=12)
ax4.set_xlabel('Prediction Error')
ax4.set_ylabel('Density')
ax4.legend()
ax4.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 7. GRU vs LSTM Head-to-Head Comparison

In [None]:
# LSTM model for comparison
class FinancialLSTM(nn.Module):
    """LSTM model with same architecture as GRU for fair comparison."""
    
    def __init__(self, input_size, hidden_size, num_layers, output_size, dropout=0.2):
        super(FinancialLSTM, self).__init__()
        
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        
        self.batch_norm = nn.BatchNorm1d(input_size)
        
        self.lstm = nn.LSTM(
            input_size=input_size,
            hidden_size=hidden_size,
            num_layers=num_layers,
            batch_first=True,
            dropout=dropout if num_layers > 1 else 0
        )
        
        self.fc = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_size, output_size)
        )
        
    def forward(self, x):
        batch_size = x.size(0)
        
        x = x.permute(0, 2, 1)
        x = self.batch_norm(x)
        x = x.permute(0, 2, 1)
        
        h0 = torch.zeros(self.num_layers, batch_size, self.hidden_size).to(x.device)
        c0 = torch.zeros(self.num_layers, batch_size, self.hidden_size).to(x.device)
        
        lstm_out, (h_n, c_n) = self.lstm(x, (h0, c0))
        out = self.fc(h_n[-1])
        
        return out

In [None]:
# Train and compare both models
def train_model(model, train_loader, val_loader, num_epochs=50, patience=15):
    """Train a model and return history."""
    criterion = nn.MSELoss()
    optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-5)
    
    best_val_loss = float('inf')
    patience_counter = 0
    train_history = []
    val_history = []
    
    for epoch in range(num_epochs):
        train_loss = train_epoch(model, train_loader, criterion, optimizer, device)
        val_loss, _, _ = evaluate(model, val_loader, criterion, device)
        
        train_history.append(train_loss)
        val_history.append(val_loss)
        
        if val_loss < best_val_loss:
            best_val_loss = val_loss
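            # Clone the tensors: state_dict().copy() is shallow and would keep tracking later updates
            best_weights = {k: v.detach().clone() for k, v in model.state_dict().items()}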
            patience_counter = 0
        else:
            patience_counter += 1
        
        if patience_counter >= patience:
            break
    
    model.load_state_dict(best_weights)
    return train_history, val_history, best_val_loss

# Initialize fresh models
gru_model = FinancialGRU(
    input_size=len(feature_cols),
    hidden_size=64,
    num_layers=2,
    output_size=1,
    dropout=0.2
).to(device)

lstm_model = FinancialLSTM(
    input_size=len(feature_cols),
    hidden_size=64,
    num_layers=2,
    output_size=1,
    dropout=0.2
).to(device)

print("Training GRU model...")
gru_train, gru_val, gru_best = train_model(gru_model, train_loader, val_loader)

print("\nTraining LSTM model...")
lstm_train, lstm_val, lstm_best = train_model(lstm_model, train_loader, val_loader)

print(f"\nGRU Best Val Loss:  {gru_best:.6f}")
print(f"LSTM Best Val Loss: {lstm_best:.6f}")
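
When the best weights are kept in memory for early stopping, as `train_model` does, the snapshot must not share storage with the live parameters. The sketch below uses a tiny `nn.Linear` layer as a stand-in to show the difference between a shallow dictionary copy and a deep copy.

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(2, 1)

shallow = layer.state_dict().copy()        # new dict, same underlying tensors
deep = copy.deepcopy(layer.state_dict())   # independent tensors
before = deep['weight'].clone()

# Simulate a later optimizer step with an in-place update
with torch.no_grad():
    layer.weight.add_(1.0)

shallow_tracks = torch.equal(shallow['weight'], layer.weight)
deep_kept = torch.equal(deep['weight'], before)
print(f"Shallow copy follows the live weights: {shallow_tracks}")
print(f"Deep copy still holds the snapshot:    {deep_kept}")
```

A shallow copy silently ends up holding the final weights rather than the best ones, so restoring it has no effect. Cloning each tensor or using `copy.deepcopy` avoids this.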

In [None]:
# Evaluate both on test set
def get_test_metrics(model, test_loader, scaler_y):
    """Get test metrics for a model."""
    criterion = nn.MSELoss()
    _, y_pred_scaled, y_true_scaled = evaluate(model, test_loader, criterion, device)
    
    y_pred = scaler_y.inverse_transform(y_pred_scaled.reshape(-1, 1)).flatten()
    y_true = scaler_y.inverse_transform(y_true_scaled.reshape(-1, 1)).flatten()
    
    return {
        'RMSE': np.sqrt(mean_squared_error(y_true, y_pred)),
        'MAE': mean_absolute_error(y_true, y_pred),
        'R²': r2_score(y_true, y_pred),
        'predictions': y_pred,
        'actuals': y_true
    }

gru_metrics = get_test_metrics(gru_model, test_loader, scaler_y)
lstm_metrics = get_test_metrics(lstm_model, test_loader, scaler_y)

# Comparison table
comparison = pd.DataFrame({
    'Metric': ['RMSE', 'MAE', 'R²', 'Parameters', 'Epochs Trained'],
    'GRU': [
        f"{gru_metrics['RMSE']:.6f}",
        f"{gru_metrics['MAE']:.6f}",
        f"{gru_metrics['R²']:.4f}",
        f"{count_parameters(gru_model):,}",
        len(gru_train)
    ],
    'LSTM': [
        f"{lstm_metrics['RMSE']:.6f}",
        f"{lstm_metrics['MAE']:.6f}",
        f"{lstm_metrics['R²']:.4f}",
        f"{count_parameters(lstm_model):,}",
        len(lstm_train)
    ]
})

print("\nGRU vs LSTM Comparison:")
print("=" * 50)
print(comparison.to_string(index=False))

In [None]:
# Visual comparison
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Training curves
ax1 = axes[0, 0]
ax1.plot(gru_val, label='GRU', linewidth=2)
ax1.plot(lstm_val, label='LSTM', linewidth=2)
ax1.set_title('Validation Loss During Training', fontsize=12)
ax1.set_xlabel('Epoch')
ax1.set_ylabel('Loss')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Test predictions comparison
ax2 = axes[0, 1]
ax2.plot(gru_metrics['actuals'][:100], 'k-', label='Actual', linewidth=1.5, alpha=0.8)
ax2.plot(gru_metrics['predictions'][:100], 'b--', label='GRU', linewidth=1.5, alpha=0.8)
ax2.plot(lstm_metrics['predictions'][:100], 'r:', label='LSTM', linewidth=1.5, alpha=0.8)
ax2.set_title('Predictions Comparison (First 100 Points)', fontsize=12)
ax2.set_xlabel('Time')
ax2.set_ylabel('Volatility')
ax2.legend()
ax2.grid(True, alpha=0.3)

# Error comparison
ax3 = axes[1, 0]
gru_errors = gru_metrics['predictions'] - gru_metrics['actuals']
lstm_errors = lstm_metrics['predictions'] - lstm_metrics['actuals']
ax3.boxplot([gru_errors, lstm_errors])
ax3.set_xticklabels(['GRU', 'LSTM'])  # avoids the boxplot `labels` argument, renamed in newer Matplotlib
ax3.axhline(y=0, color='r', linestyle='--', linewidth=1)
ax3.set_title('Prediction Error Distribution', fontsize=12)
ax3.set_ylabel('Error')
ax3.grid(True, alpha=0.3)

# Parameter efficiency
ax4 = axes[1, 1]
models = ['GRU', 'LSTM']
params = [count_parameters(gru_model), count_parameters(lstm_model)]
r2_scores = [gru_metrics['R²'], lstm_metrics['R²']]

x_pos = np.arange(len(models))
width = 0.35

ax4_twin = ax4.twinx()
bars1 = ax4.bar(x_pos - width/2, params, width, label='Parameters', color='steelblue', alpha=0.7)
bars2 = ax4_twin.bar(x_pos + width/2, r2_scores, width, label='R² Score', color='coral', alpha=0.7)

ax4.set_xlabel('Model')
ax4.set_ylabel('Parameters', color='steelblue')
ax4_twin.set_ylabel('R² Score', color='coral')
ax4.set_xticks(x_pos)
ax4.set_xticklabels(models)
ax4.set_title('Parameter Efficiency', fontsize=12)

plt.tight_layout()
plt.show()

## 8. Key Takeaways

### GRU Architecture Summary
- **Simpler than LSTM**: 2 gates (reset, update) vs 3 gates
- **No separate cell state**: Hidden state serves both purposes
- **Fewer parameters**: ~25% fewer than an equally sized LSTM, which usually means faster training

### When to Choose GRU
1. **Computational constraints**: Faster training and inference
2. **Smaller datasets**: Less prone to overfitting
3. **Shorter sequences**: Adequate memory capacity
4. **Quick prototyping**: Simpler hyperparameter tuning

### Financial Applications
- ✅ Volatility forecasting
- ✅ Short-term price prediction
- ✅ Real-time trading signals
- ✅ Sentiment analysis

### Best Practices
1. **Always compare**: Test both GRU and LSTM on your specific task
2. **Proper scaling**: StandardScaler or MinMaxScaler, fit on the training set only
3. **Gradient clipping**: Prevent exploding gradients
4. **Early stopping**: Prevent overfitting
5. **Temporal split**: Split chronologically and never shuffle before splitting; shuffling windows inside the training set is fine

---

## 9. Exercises

1. **Bidirectional GRU**: Implement and compare bidirectional GRU for volatility prediction
2. **Multi-step forecasting**: Modify the model to predict 5-day ahead volatility
3. **Feature importance**: Use an attention mechanism to identify important features
4. **Ensemble**: Combine GRU and LSTM predictions
5. **Different assets**: Test the model on crypto or forex data

In [None]:
# Clean up
import os
if os.path.exists('best_gru_model.pth'):
    os.remove('best_gru_model.pth')
    
print("Notebook completed successfully!")