# Week 15 - Day 7: Interview Review - Transformers & Attention

## Overview
Final day of Week 15 covering:
- **10 Transformer Interview Questions** with detailed answers
- **Common Mistakes** and how to avoid them
- **Mini-Project**: Complete Transformer-based stock predictor
- **Week Summary** consolidating all concepts

---

In [None]:
# Import Required Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import yfinance as yf
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import math
import warnings
warnings.filterwarnings('ignore')

# Set random seeds for reproducibility
np.random.seed(42)
torch.manual_seed(42)

# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# Plot settings
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

---
# Part 1: 10 Transformer Interview Questions

## Question 1: What is Self-Attention and Why is it Important?

**Answer:**
Self-attention allows each position in a sequence to attend to all other positions, computing relevance scores dynamically. Unlike RNNs that process sequentially, self-attention captures long-range dependencies in O(1) sequential operations.

**Key Formula:**
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

**Finance Application:** In time series, self-attention can relate today's price directly to events from 100 days ago in a single step, rather than passing the signal through 100 recurrent updates.

In [None]:
# Demonstration: Self-Attention Mechanism
class SelfAttention(nn.Module):
    """Single-head self-attention for interview demonstration."""
    
    def __init__(self, embed_dim):
        super().__init__()
        self.embed_dim = embed_dim
        self.W_q = nn.Linear(embed_dim, embed_dim, bias=False)
        self.W_k = nn.Linear(embed_dim, embed_dim, bias=False)
        self.W_v = nn.Linear(embed_dim, embed_dim, bias=False)
        self.scale = math.sqrt(embed_dim)
    
    def forward(self, x):
        # x: (batch, seq_len, embed_dim)
        Q = self.W_q(x)
        K = self.W_k(x)
        V = self.W_v(x)
        
        # Attention scores
        scores = torch.matmul(Q, K.transpose(-2, -1)) / self.scale
        attn_weights = torch.softmax(scores, dim=-1)
        
        # Weighted sum
        output = torch.matmul(attn_weights, V)
        return output, attn_weights

# Test self-attention
embed_dim = 64
seq_len = 10
batch_size = 2

attn = SelfAttention(embed_dim)
x = torch.randn(batch_size, seq_len, embed_dim)
output, weights = attn(x)

print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")
print(f"Attention weights shape: {weights.shape}")
print(f"\nAttention weights sum (should be 1.0): {weights[0, 0].sum().item():.4f}")

## Question 2: Explain Multi-Head Attention

**Answer:**
Multi-head attention runs multiple attention operations in parallel, each learning different relationship patterns. Heads are concatenated and projected back.

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, ..., \text{head}_h)W^O$$

**Why Multiple Heads?**
- Head 1 might learn short-term momentum
- Head 2 might capture mean reversion patterns
- Head 3 might identify volatility clusters

In [None]:
# Multi-Head Attention Implementation
class MultiHeadAttention(nn.Module):
    """Multi-head attention for interview demonstration."""
    
    def __init__(self, embed_dim, num_heads):
        super().__init__()
        assert embed_dim % num_heads == 0, "embed_dim must be divisible by num_heads"
        
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.scale = math.sqrt(self.head_dim)
        
        self.W_qkv = nn.Linear(embed_dim, 3 * embed_dim, bias=False)
        self.W_o = nn.Linear(embed_dim, embed_dim, bias=False)
    
    def forward(self, x, mask=None):
        batch_size, seq_len, _ = x.shape
        
        # Combined projection
        qkv = self.W_qkv(x)  # (batch, seq, 3*embed)
        qkv = qkv.reshape(batch_size, seq_len, 3, self.num_heads, self.head_dim)
        qkv = qkv.permute(2, 0, 3, 1, 4)  # (3, batch, heads, seq, head_dim)
        Q, K, V = qkv[0], qkv[1], qkv[2]
        
        # Attention
        scores = torch.matmul(Q, K.transpose(-2, -1)) / self.scale
        
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        
        attn_weights = torch.softmax(scores, dim=-1)
        attn_output = torch.matmul(attn_weights, V)
        
        # Concatenate heads
        attn_output = attn_output.transpose(1, 2).reshape(batch_size, seq_len, self.embed_dim)
        output = self.W_o(attn_output)
        
        return output, attn_weights

# Test multi-head attention
mha = MultiHeadAttention(embed_dim=64, num_heads=8)
x = torch.randn(2, 10, 64)
output, weights = mha(x)

print(f"Multi-Head Attention:")
print(f"  Input: {x.shape}")
print(f"  Output: {output.shape}")
print(f"  Attention weights: {weights.shape} (batch, heads, seq, seq)")

## Question 3: Why Scale by √d_k in Attention?

**Answer:**
Without scaling, dot products grow with dimension size, pushing softmax into regions with extremely small gradients.

**Mathematical Explanation:**
- If the components of Q and K are independent with mean 0 and variance 1
- Their dot product (a sum of d_k such products) has variance d_k
- Dividing by √d_k brings the variance back to 1
- This keeps softmax in a region with meaningful gradients
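
The variance argument above can be checked numerically. The sketch below is a minimal illustration, assuming independent standard-normal components; `num_pairs` and `dot_product_variance` are names introduced here for the example only.

```python
import numpy as np

rng = np.random.default_rng(0)
num_pairs = 20_000  # hypothetical sample size: number of (q, k) pairs per dimension

def dot_product_variance(d_k):
    """Empirical variance of q·k, before and after dividing by sqrt(d_k)."""
    q = rng.standard_normal((num_pairs, d_k))
    k = rng.standard_normal((num_pairs, d_k))
    dots = np.sum(q * k, axis=1)
    return dots.var(), (dots / np.sqrt(d_k)).var()

results = {d_k: dot_product_variance(d_k) for d_k in (16, 64, 256)}
for d_k, (raw_var, scaled_var) in results.items():
    print(f"d_k={d_k:4d}  var(q·k)={raw_var:8.2f}  var(q·k/sqrt(d_k))={scaled_var:.3f}")
```

Under these assumptions the unscaled variance should land near d_k and the scaled variance near 1, up to sampling noise.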

In [None]:
# Demonstrate scaling importance
def compare_scaling(d_k_values=(16, 64, 256, 1024)):
    """Show how scaling affects attention distribution."""
    
    fig, axes = plt.subplots(2, len(d_k_values), figsize=(14, 6))
    
    for i, d_k in enumerate(d_k_values):
        # Random Q and K
        Q = torch.randn(1, 10, d_k)
        K = torch.randn(1, 10, d_k)
        
        # Unscaled
        scores_unscaled = torch.matmul(Q, K.transpose(-2, -1))
        attn_unscaled = torch.softmax(scores_unscaled, dim=-1)
        
        # Scaled
        scores_scaled = scores_unscaled / math.sqrt(d_k)
        attn_scaled = torch.softmax(scores_scaled, dim=-1)
        
        # Plot unscaled
        axes[0, i].imshow(attn_unscaled[0].detach().numpy(), cmap='Blues')
        axes[0, i].set_title(f'd_k={d_k} (unscaled)')
        axes[0, i].set_xlabel(f'Max: {attn_unscaled.max():.3f}')
        
        # Plot scaled
        axes[1, i].imshow(attn_scaled[0].detach().numpy(), cmap='Blues')
        axes[1, i].set_title(f'd_k={d_k} (scaled)')
        axes[1, i].set_xlabel(f'Max: {attn_scaled.max():.3f}')
    
    axes[0, 0].set_ylabel('Unscaled')
    axes[1, 0].set_ylabel('Scaled')
    plt.suptitle('Effect of Scaling on Attention Distributions', fontsize=12)
    plt.tight_layout()
    plt.show()

compare_scaling()
print("\nObservation: Without scaling, larger d_k leads to more 'peaky' attention")
print("(one position dominates), reducing model's ability to attend broadly.")

## Question 4: What is Positional Encoding and Why is it Needed?

**Answer:**
Self-attention is permutation equivariant: reordering the input tokens only reorders the outputs, so on its own it treats a sequence as an unordered set. Positional encoding injects position information so the model knows the order of elements.

**Sinusoidal Encoding Formula:**
$$PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d})$$
$$PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d})$$

**Finance Application:** Critical for time series where sequence order represents temporal dynamics.
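
The claim that attention alone ignores order can be checked directly. The sketch below is one possible check, assuming a small `nn.MultiheadAttention` layer with no positional information; the sizes and the permutation are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Self-attention with no positional information (dropout defaults to 0.0).
attn = nn.MultiheadAttention(embed_dim=16, num_heads=4, batch_first=True).eval()

x = torch.randn(1, 6, 16)
perm = torch.tensor([3, 0, 5, 1, 4, 2])  # an arbitrary reordering of the 6 positions

with torch.no_grad():
    out, _ = attn(x, x, x)
    out_perm, _ = attn(x[:, perm], x[:, perm], x[:, perm])

# Permuting the input only permutes the output rows: out_perm should equal out[:, perm].
max_diff = (out_perm - out[:, perm]).abs().max().item()
print(f"max |out_perm - out[:, perm]| = {max_diff:.2e}")
```

A difference at floating-point noise level means the layer cannot tell the two orderings apart, which is why position must be added to the inputs.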

In [None]:
# Positional Encoding Implementation
class PositionalEncoding(nn.Module):
    """Sinusoidal positional encoding."""
    
    def __init__(self, d_model, max_len=5000, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)
        
        # Create positional encoding matrix
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)  # (1, max_len, d_model)
        
        self.register_buffer('pe', pe)
    
    def forward(self, x):
        x = x + self.pe[:, :x.size(1)]
        return self.dropout(x)

# Visualize positional encoding
d_model = 64
pe = PositionalEncoding(d_model, max_len=100, dropout=0.0)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Heatmap
pe_matrix = pe.pe[0, :50, :32].numpy()
im = axes[0].imshow(pe_matrix, aspect='auto', cmap='RdBu')
axes[0].set_xlabel('Embedding Dimension')
axes[0].set_ylabel('Position')
axes[0].set_title('Positional Encoding Heatmap')
plt.colorbar(im, ax=axes[0])

# Individual dimensions
positions = range(50)
for dim in [0, 1, 4, 5, 10, 11]:
    axes[1].plot(positions, pe.pe[0, :50, dim].numpy(), label=f'dim {dim}')
axes[1].set_xlabel('Position')
axes[1].set_ylabel('Encoding Value')
axes[1].set_title('Positional Encoding by Dimension')
axes[1].legend()

plt.tight_layout()
plt.show()

## Question 5: What is the Transformer Architecture?

**Answer:**
The Transformer consists of:
1. **Encoder**: Processes input sequence with self-attention + feed-forward layers
2. **Decoder**: Generates output with masked self-attention + cross-attention + feed-forward

**Key Components:**
- Multi-head attention
- Position-wise feed-forward networks
- Layer normalization
- Residual connections

**For Time Series (Encoder-only):** Regression and classification tasks often use just the encoder.

In [None]:
# Transformer Encoder Layer
class TransformerEncoderLayer(nn.Module):
    """Single transformer encoder layer."""
    
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        
        # Multi-head attention
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, dropout=dropout, batch_first=True)
        
        # Feed-forward network
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
            nn.Dropout(dropout)
        )
        
        # Layer normalization
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, x, mask=None):
        # Self-attention with residual
        attn_output, attn_weights = self.self_attn(x, x, x, attn_mask=mask)
        x = self.norm1(x + self.dropout(attn_output))
        
        # Feed-forward with residual
        ffn_output = self.ffn(x)
        x = self.norm2(x + ffn_output)
        
        return x, attn_weights

# Test encoder layer
enc_layer = TransformerEncoderLayer(d_model=64, num_heads=8, d_ff=256)
x = torch.randn(2, 20, 64)
output, weights = enc_layer(x)

print(f"Transformer Encoder Layer:")
print(f"  Input: {x.shape}")
print(f"  Output: {output.shape}")
print(f"  Attention weights: {weights.shape}")

## Question 6: Layer Normalization vs Batch Normalization?

**Answer:**
| Aspect | LayerNorm | BatchNorm |
|--------|-----------|------------|
| Normalizes across | Features | Batch |
| Batch size dependency | No | Yes |
| Sequence length handling | Independent | Problematic |
| Preferred for | Transformers, NLP | CNNs, fixed-size inputs |

**Why LayerNorm for Transformers?**
- Works with variable sequence lengths
- No running statistics to maintain
- Each sample normalized independently
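
To make the "normalizes across features" point concrete, here is a minimal sketch that recomputes LayerNorm by hand and compares it with `nn.LayerNorm`. It assumes a freshly initialised layer (weight 1, bias 0) and the default `eps=1e-5`; the tensor shape is illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(4, 10, 64) * 3 + 2  # (batch, seq_len, features)

with torch.no_grad():
    # Manual LayerNorm: statistics over the feature axis, separately for every token.
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, keepdim=True, unbiased=False)  # biased variance, as nn.LayerNorm uses
    x_manual = (x - mean) / torch.sqrt(var + 1e-5)

    # A freshly initialised nn.LayerNorm has weight=1 and bias=0, so it should match.
    x_ref = nn.LayerNorm(64)(x)
    max_diff = (x_manual - x_ref).abs().max().item()

    # Normalising one sample alone gives the same result as normalising it in the batch.
    alone = nn.LayerNorm(64)(x[:1])
    batch_independent = torch.allclose(alone, x_ref[:1], atol=1e-6)

print(f"max difference vs nn.LayerNorm: {max_diff:.2e}")
print(f"batch-independent: {batch_independent}")
```

Because no statistic is shared across samples, the result for one sequence does not change with batch size or batch composition.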

In [None]:
# Compare normalization methods
batch_size, seq_len, features = 4, 10, 64
x = torch.randn(batch_size, seq_len, features) * 3 + 2  # Non-standard distribution

# Layer Normalization (across features)
layer_norm = nn.LayerNorm(features)
x_ln = layer_norm(x)

# Batch Normalization (across batch)
batch_norm = nn.BatchNorm1d(features)
x_bn = batch_norm(x.transpose(1, 2)).transpose(1, 2)  # BN expects (N, C, L)

print("Normalization Comparison:")
print(f"\nOriginal - Mean: {x.mean():.4f}, Std: {x.std():.4f}")
print(f"LayerNorm - Mean: {x_ln.mean():.4f}, Std: {x_ln.std():.4f}")
print(f"BatchNorm - Mean: {x_bn.mean():.4f}, Std: {x_bn.std():.4f}")

# Per-sample statistics (LayerNorm normalizes each sample independently)
print(f"\nPer-sample mean after LayerNorm (should be ~0):")
for i in range(batch_size):
    print(f"  Sample {i}: {x_ln[i].mean():.6f}")

## Question 7: What is Causal (Masked) Attention?

**Answer:**
Causal attention prevents positions from attending to future positions, essential for:
- Autoregressive generation
- Time series forecasting (no future information leakage)

**Implementation:** Apply a triangular mask where future positions have -∞ before softmax.

**Finance Critical:** Prevents look-ahead bias in backtesting!
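
One way to test for leakage is to perturb only the future inputs and confirm that earlier outputs do not move. The sketch below assumes a single `nn.MultiheadAttention` layer with a boolean mask; the split at step 5 and the tensor sizes are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
seq_len, d_model = 8, 16
attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True).eval()

# Boolean mask: True marks (query, key) pairs that are NOT allowed (key lies in the future).
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

x = torch.randn(1, seq_len, d_model)
x_changed = x.clone()
x_changed[:, 5:] = torch.randn(1, seq_len - 5, d_model)  # alter steps 5..7 only

with torch.no_grad():
    out, _ = attn(x, x, x, attn_mask=causal_mask)
    out_changed, _ = attn(x_changed, x_changed, x_changed, attn_mask=causal_mask)

past_diff = (out[:, :5] - out_changed[:, :5]).abs().max().item()
future_diff = (out[:, 5:] - out_changed[:, 5:]).abs().max().item()
print(f"change in outputs for steps 0-4: {past_diff:.2e}")
print(f"change in outputs for steps 5-7: {future_diff:.2e}")
```

With the mask in place, outputs for steps 0-4 should be unaffected by edits to steps 5-7. Running the same check without `attn_mask` would show the leak.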

In [None]:
# Causal Mask Implementation
def create_causal_mask(seq_len):
    """Create causal mask for self-attention."""
    mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1)
    mask = mask.masked_fill(mask == 1, float('-inf'))
    return mask

# Visualize causal mask
seq_len = 10
causal_mask = create_causal_mask(seq_len)

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Show mask
mask_visual = torch.where(causal_mask == float('-inf'), torch.tensor(0.0), torch.tensor(1.0))
axes[0].imshow(mask_visual.numpy(), cmap='RdYlGn')
axes[0].set_xlabel('Key Position (can attend to)')
axes[0].set_ylabel('Query Position')
axes[0].set_title('Causal Mask (Green = can attend)')

# Show resulting attention pattern
Q = K = torch.randn(1, seq_len, 64)
scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(64)
scores_masked = scores + causal_mask.unsqueeze(0)
attn = torch.softmax(scores_masked, dim=-1)

axes[1].imshow(attn[0].detach().numpy(), cmap='Blues')
axes[1].set_xlabel('Key Position')
axes[1].set_ylabel('Query Position')
axes[1].set_title('Resulting Attention Weights')

plt.tight_layout()
plt.show()

print("Position 0 can only attend to itself")
print("Position 5 can attend to positions 0-5")
print("This prevents future information leakage!")

## Question 8: Transformer Complexity - Time and Space?

**Answer:**
| Component | Time Complexity | Space Complexity |
|-----------|-----------------|------------------|
| Self-Attention | O(n²d) | O(n² + nd) |
| Feed-Forward | O(nd²) | O(nd + d²) |
| Total per Layer | O(n²d + nd²) | O(n² + nd + d²) |

**Key Insight:** The quadratic cost in sequence length n is the main bottleneck for long sequences.

**Solutions:**
- Sparse attention (Longformer, BigBird)
- Linear attention approximations
- Chunked processing
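
A quick back-of-the-envelope calculation shows why the n² term dominates. The sketch below assumes float32 attention weights, 8 heads, batch size 1 and a single layer; `attention_matrix_bytes` is a helper introduced here for illustration.

```python
def attention_matrix_bytes(seq_len, num_heads=8, batch_size=1, bytes_per_value=4):
    """Bytes needed to hold one layer's attention weights: batch * heads * n * n values."""
    return batch_size * num_heads * seq_len * seq_len * bytes_per_value

sizes = {n: attention_matrix_bytes(n) for n in (256, 1024, 4096, 16384)}
for n, size in sizes.items():
    print(f"n={n:6d}  attention weights: {size / 2**20:10.1f} MiB")
```

Under these assumptions each 4x increase in sequence length multiplies the footprint by 16, from 2 MiB at n=256 to 8 GiB at n=16384, before counting activations kept for backpropagation.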

In [None]:
# Complexity Analysis
import time

def measure_attention_time(seq_lengths, d_model=64, num_heads=8):
    """Measure attention computation time for different sequence lengths."""
    times = []
    
    for seq_len in seq_lengths:
        mha = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        x = torch.randn(1, seq_len, d_model)
        
        # Warm up
        _ = mha(x, x, x)
        
        # Measure
        start = time.time()
        for _ in range(10):
            _ = mha(x, x, x)
        elapsed = (time.time() - start) / 10
        times.append(elapsed)
    
    return times

seq_lengths = [32, 64, 128, 256, 512]
times = measure_attention_time(seq_lengths)

# Plot
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

axes[0].plot(seq_lengths, times, 'bo-', linewidth=2, markersize=8)
axes[0].set_xlabel('Sequence Length')
axes[0].set_ylabel('Time (seconds)')
axes[0].set_title('Attention Computation Time')

# Compare measured scaling with the theoretical O(n²) curve
axes[1].plot(seq_lengths, [t/times[0] for t in times], 'bo-', label='Measured', linewidth=2)
axes[1].plot(seq_lengths, [(s/seq_lengths[0])**2 for s in seq_lengths], 'r--', label='O(n²) theoretical')
axes[1].set_xlabel('Sequence Length')
axes[1].set_ylabel('Relative Time')
axes[1].set_title('Complexity Scaling')
axes[1].legend()

plt.tight_layout()
plt.show()

## Question 9: Pre-Norm vs Post-Norm Architecture?

**Answer:**
- **Post-Norm (Original):** LayerNorm after residual addition
  - `x = LayerNorm(x + Sublayer(x))`
  - Requires careful learning rate warmup
  
- **Pre-Norm (Preferred):** LayerNorm before sublayer
  - `x = x + Sublayer(LayerNorm(x))`
  - More stable training
  - Better gradient flow

In [None]:
# Pre-Norm vs Post-Norm Comparison
class PreNormEncoderLayer(nn.Module):
    """Encoder layer with Pre-LayerNorm (preferred)."""
    
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
            nn.Dropout(dropout)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
    
    def forward(self, x, mask=None):
        # Pre-norm: normalize BEFORE sublayer
        x_norm = self.norm1(x)
        attn_output, _ = self.self_attn(x_norm, x_norm, x_norm, attn_mask=mask)
        x = x + attn_output  # Residual
        
        x_norm = self.norm2(x)
        ffn_output = self.ffn(x_norm)
        x = x + ffn_output  # Residual
        
        return x

class PostNormEncoderLayer(nn.Module):
    """Encoder layer with Post-LayerNorm (original)."""
    
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
            nn.Dropout(dropout)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
    
    def forward(self, x, mask=None):
        # Post-norm: normalize AFTER residual addition
        attn_output, _ = self.self_attn(x, x, x, attn_mask=mask)
        x = self.norm1(x + attn_output)  # Norm after residual
        
        ffn_output = self.ffn(x)
        x = self.norm2(x + ffn_output)  # Norm after residual
        
        return x

print("Pre-Norm Advantages:")
print("  ✓ More stable training")
print("  ✓ Often trains without learning rate warmup")
print("  ✓ Better gradient flow through residual paths")
print("\nPost-Norm Disadvantages:")
print("  ✗ Can have gradient explosion without warmup")
print("  ✗ Sensitive to initialization")

## Question 10: How to Adapt Transformers for Time Series?

**Answer:**
Key adaptations:
1. **Input Embedding:** Linear projection or Conv1D for continuous values
2. **Positional Encoding:** Learnable or sinusoidal for temporal order
3. **Causal Masking:** Prevent future information leakage
4. **Output Head:** Regression head for price prediction
5. **Temporal Features:** Add time-based features (day of week, month, etc.)

**Finance-Specific:**
- Use returns instead of raw prices
- Include volume, volatility as features
- Consider market regime indicators
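
As an illustration of the returns and temporal-feature points above, the sketch below builds log returns and a cyclic day-of-week encoding. The price series is synthetic and the column names are hypothetical; a period of 5 assumes business-day data with no weekend rows.

```python
import numpy as np
import pandas as pd

# Hypothetical daily closes on business days; real data would come from a price feed.
dates = pd.bdate_range("2023-01-02", periods=10)
close = pd.Series(np.linspace(100.0, 109.0, 10), index=dates)

features = pd.DataFrame(index=dates)
features["log_return"] = np.log(close).diff()

# Cyclic encoding places Friday (4) next to Monday (0) on a circle of period 5.
dow = dates.dayofweek.to_numpy()
features["dow_sin"] = np.sin(2 * np.pi * dow / 5)
features["dow_cos"] = np.cos(2 * np.pi * dow / 5)

features = features.dropna()  # the first row has no return
print(features.round(4))
```

The sine/cosine pair avoids the artificial jump that a raw 0-4 integer would create between Friday and the following Monday.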

---
# Part 2: Common Mistakes & How to Avoid Them

## Mistake 1: Information Leakage (Look-Ahead Bias)

In [None]:
# WRONG: No causal mask - can see future
class LeakyTransformer(nn.Module):
    """BAD: This transformer can see the future!"""
    def __init__(self, d_model, num_heads, num_layers):
        super().__init__()
        encoder_layer = nn.TransformerEncoderLayer(d_model, num_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers)
    
    def forward(self, x):
        return self.transformer(x)  # No mask!

# CORRECT: With causal mask
class CausalTransformer(nn.Module):
    """GOOD: Properly masked transformer."""
    def __init__(self, d_model, num_heads, num_layers):
        super().__init__()
        encoder_layer = nn.TransformerEncoderLayer(d_model, num_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers)
    
    def forward(self, x):
        seq_len = x.size(1)
        mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
        mask = mask.to(x.device)
        return self.transformer(x, mask=mask)

print("❌ WRONG: Transformer without causal mask can 'cheat' by seeing future values")
print("✅ CORRECT: Always use causal mask for time series prediction")

## Mistake 2: Forgetting Positional Encoding

In [None]:
# Demonstrate importance of positional encoding
class TransformerNoPE(nn.Module):
    """BAD: No positional encoding."""
    def __init__(self, input_dim, d_model, num_heads, num_layers):
        super().__init__()
        self.embedding = nn.Linear(input_dim, d_model)
        encoder_layer = nn.TransformerEncoderLayer(d_model, num_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers)
        self.output = nn.Linear(d_model, 1)
    
    def forward(self, x):
        x = self.embedding(x)
        x = self.transformer(x)  # No positional info!
        return self.output(x[:, -1])

class TransformerWithPE(nn.Module):
    """GOOD: With positional encoding."""
    def __init__(self, input_dim, d_model, num_heads, num_layers):
        super().__init__()
        self.embedding = nn.Linear(input_dim, d_model)
        self.pos_encoding = PositionalEncoding(d_model)
        encoder_layer = nn.TransformerEncoderLayer(d_model, num_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers)
        self.output = nn.Linear(d_model, 1)
    
    def forward(self, x):
        x = self.embedding(x)
        x = self.pos_encoding(x)  # Add position info!
        x = self.transformer(x)
        return self.output(x[:, -1])

# Test: shuffle the history and check whether the prediction changes.
# Both models read the last position, so keep the final time step in place and
# permute only the earlier ones; eval() turns off dropout for a deterministic comparison.
torch.manual_seed(42)
x = torch.randn(1, 10, 4)
perm = torch.cat([torch.randperm(9), torch.tensor([9])])
x_shuffled = x[:, perm]

model_no_pe = TransformerNoPE(4, 32, 4, 2).eval()
model_with_pe = TransformerWithPE(4, 32, 4, 2).eval()

with torch.no_grad():
    out_no_pe = model_no_pe(x)
    out_no_pe_shuffled = model_no_pe(x_shuffled)
    
    out_with_pe = model_with_pe(x)
    out_with_pe_shuffled = model_with_pe(x_shuffled)

print("Test: Does shuffling the history change the output?")
print(f"\nNo PE - Original: {out_no_pe.item():.4f}, Shuffled: {out_no_pe_shuffled.item():.4f}")
print(f"  Difference: {abs(out_no_pe.item() - out_no_pe_shuffled.item()):.6f}")
print(f"\nWith PE - Original: {out_with_pe.item():.4f}, Shuffled: {out_with_pe_shuffled.item():.4f}")
print(f"  Difference: {abs(out_with_pe.item() - out_with_pe_shuffled.item()):.6f}")
print("\n→ Without PE, the order of past steps doesn't matter! This is wrong for time series.")

## Mistake 3: Improper Scaling of Financial Data

In [None]:
# Common scaling mistakes
print("Common Scaling Mistakes in Finance ML:\n")

# Example data
prices = np.array([100, 105, 102, 108, 110, 115, 112, 120])

# WRONG: Scaling entire dataset (data leakage)
print("❌ WRONG: Fit scaler on ALL data")
scaler_wrong = MinMaxScaler()
prices_wrong = scaler_wrong.fit_transform(prices.reshape(-1, 1))
print(f"   Train sample scaled with future info: {prices_wrong[3, 0]:.4f}")

# CORRECT: Fit only on training data
print("\n✅ CORRECT: Fit scaler only on TRAINING data")
train_prices = prices[:5]
test_prices = prices[5:]

scaler_correct = MinMaxScaler()
scaler_correct.fit(train_prices.reshape(-1, 1))
train_scaled = scaler_correct.transform(train_prices.reshape(-1, 1))
test_scaled = scaler_correct.transform(test_prices.reshape(-1, 1))
print(f"   Train sample (no future info): {train_scaled[3, 0]:.4f}")

# Best practice: Use returns instead of prices
print("\n✅ BETTER: Use returns (scale-free and closer to stationary than prices)")
returns = np.diff(prices) / prices[:-1]
print(f"   Returns: {returns}")
print("   Simple returns are bounded below by -1 and are typically small numbers")

## Mistake 4: Wrong Attention Dimension

In [None]:
# Common dimension mistakes
print("Dimension Mistakes:\n")

# WRONG: d_model not divisible by num_heads
try:
    mha_wrong = nn.MultiheadAttention(embed_dim=65, num_heads=8)
    print("This would fail...")
except AssertionError as e:
    print(f"❌ Error: embed_dim=65 not divisible by num_heads=8")

# CORRECT: d_model divisible by num_heads
mha_correct = nn.MultiheadAttention(embed_dim=64, num_heads=8)
print(f"\n✅ Correct: embed_dim=64 / num_heads=8 = head_dim=8")

# Common head_dim values
print("\nTypical configurations:")
configs = [
    (64, 4, 16),
    (128, 8, 16),
    (256, 8, 32),
    (512, 8, 64),
    (768, 12, 64),  # BERT-base
]
for d_model, heads, head_dim in configs:
    print(f"  d_model={d_model}, heads={heads} → head_dim={head_dim}")

## Mistake 5: Overfitting Due to Small Dataset

In [None]:
# Transformer capacity vs dataset size
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Different model sizes
configs = [
    {"d_model": 32, "num_heads": 4, "num_layers": 2, "d_ff": 64},
    {"d_model": 64, "num_heads": 8, "num_layers": 4, "d_ff": 256},
    {"d_model": 128, "num_heads": 8, "num_layers": 6, "d_ff": 512},
    {"d_model": 256, "num_heads": 8, "num_layers": 8, "d_ff": 1024},
]

print("Model Size vs Recommended Data Size:\n")
print(f"{'Config':<35} {'Params':<12} {'Min Samples':<15}")
print("-" * 62)

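for cfg in configs:
    # Note: TransformerWithPE takes no d_ff argument, so cfg["d_ff"] is not applied here;
    # nn.TransformerEncoderLayer falls back to its default dim_feedforward=2048.
    model = TransformerWithPE(
        input_dim=5,
        d_model=cfg["d_model"],
        num_heads=cfg["num_heads"],
        num_layers=cfg["num_layers"]
    )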
    params = count_parameters(model)
    # Rule of thumb: 10-100x more samples than parameters for good generalization
    min_samples = params * 10
    
    cfg_str = f"d={cfg['d_model']}, h={cfg['num_heads']}, L={cfg['num_layers']}"
    print(f"{cfg_str:<35} {params:>10,} {min_samples:>13,}")

print("\n💡 Tip: Start small! For finance data (~5 years = 1,260 trading days),")
print("   use compact transformers (d_model=32-64, layers=2-4)")

---
# Part 3: Mini-Project - Complete Transformer Stock Predictor

Build a production-ready transformer model for stock price prediction.

In [None]:
# Download Stock Data
print("Downloading stock data...")
ticker = "AAPL"
data = yf.download(ticker, start="2019-01-01", end="2024-01-01", progress=False)
prices = data['Close'].values.reshape(-1, 1)

print(f"Downloaded {len(prices)} days of {ticker} data")
print(f"Date range: {data.index[0].date()} to {data.index[-1].date()}")
print(f"Price range: ${prices.min():.2f} - ${prices.max():.2f}")

In [None]:
# Feature Engineering
def create_features(prices_series, window=20):
    """Create features from price series."""
    df = pd.DataFrame({'close': prices_series.flatten()})
    
    # Returns
    df['returns'] = df['close'].pct_change()
    
    # Moving averages
    df['ma_5'] = df['close'].rolling(5).mean() / df['close'] - 1
    df['ma_20'] = df['close'].rolling(20).mean() / df['close'] - 1
    
    # Volatility
    df['volatility'] = df['returns'].rolling(window).std()
    
    # Momentum
    df['momentum_5'] = df['close'].pct_change(5)
    df['momentum_20'] = df['close'].pct_change(20)
    
    # Drop NaN
    df = df.dropna()
    
    return df

# Create features
df = create_features(prices)
print(f"Features created: {df.columns.tolist()}")
print(f"Samples after feature engineering: {len(df)}")
df.head()

In [None]:
# Prepare Sequences
def create_sequences(data, target_col, seq_length=30, pred_horizon=1):
    """Create sequences for transformer input."""
    features = data.drop(columns=[target_col]).values
    target = data[target_col].values
    
    X, y = [], []
    for i in range(len(data) - seq_length - pred_horizon + 1):
        X.append(features[i:i+seq_length])
        y.append(target[i+seq_length+pred_horizon-1])
    
    return np.array(X), np.array(y)

# Create sequences
SEQ_LENGTH = 30
PRED_HORIZON = 1

X, y = create_sequences(df, target_col='returns', seq_length=SEQ_LENGTH, pred_horizon=PRED_HORIZON)
print(f"X shape: {X.shape}")
print(f"y shape: {y.shape}")

# Train/Val/Test split (time-based)
train_size = int(len(X) * 0.7)
val_size = int(len(X) * 0.15)

X_train, y_train = X[:train_size], y[:train_size]
X_val, y_val = X[train_size:train_size+val_size], y[train_size:train_size+val_size]
X_test, y_test = X[train_size+val_size:], y[train_size+val_size:]

print(f"\nTrain: {len(X_train)}, Val: {len(X_val)}, Test: {len(X_test)}")

In [None]:
# Scale Features (fit only on training data!)
scaler = MinMaxScaler(feature_range=(-1, 1))

# Reshape for scaling
X_train_flat = X_train.reshape(-1, X_train.shape[-1])
scaler.fit(X_train_flat)  # Only fit on training!

# Transform all sets
X_train_scaled = scaler.transform(X_train.reshape(-1, X_train.shape[-1])).reshape(X_train.shape)
X_val_scaled = scaler.transform(X_val.reshape(-1, X_val.shape[-1])).reshape(X_val.shape)
X_test_scaled = scaler.transform(X_test.reshape(-1, X_test.shape[-1])).reshape(X_test.shape)

# Convert to tensors
X_train_t = torch.FloatTensor(X_train_scaled)
y_train_t = torch.FloatTensor(y_train).unsqueeze(1)
X_val_t = torch.FloatTensor(X_val_scaled)
y_val_t = torch.FloatTensor(y_val).unsqueeze(1)
X_test_t = torch.FloatTensor(X_test_scaled)
y_test_t = torch.FloatTensor(y_test).unsqueeze(1)

# DataLoaders
train_dataset = TensorDataset(X_train_t, y_train_t)
val_dataset = TensorDataset(X_val_t, y_val_t)
test_dataset = TensorDataset(X_test_t, y_test_t)

BATCH_SIZE = 32
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE)

print(f"Batch size: {BATCH_SIZE}")
print(f"Training batches: {len(train_loader)}")

In [None]:
# Complete Transformer Model for Stock Prediction
class StockTransformer(nn.Module):
    """
    Production-ready Transformer for stock prediction.
    
    Features:
    - Pre-LayerNorm architecture
    - Causal masking
    - Sinusoidal positional encoding
    - Dropout regularization
    """
    
    def __init__(
        self,
        input_dim,
        d_model=64,
        num_heads=4,
        num_layers=3,
        d_ff=128,
        dropout=0.1,
        max_len=500
    ):
        super().__init__()
        
        self.d_model = d_model
        
        # Input embedding
        self.input_projection = nn.Linear(input_dim, d_model)
        
        # Positional encoding
        self.pos_encoding = PositionalEncoding(d_model, max_len, dropout)
        
        # Transformer encoder layers (Pre-Norm)
        self.layers = nn.ModuleList([
            self._make_layer(d_model, num_heads, d_ff, dropout)
            for _ in range(num_layers)
        ])
        
        # Final layer norm
        self.final_norm = nn.LayerNorm(d_model)
        
        # Output head
        self.output_head = nn.Sequential(
            nn.Linear(d_model, d_model // 2),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(d_model // 2, 1)
        )
        
        self._init_weights()
    
    def _make_layer(self, d_model, num_heads, d_ff, dropout):
        return nn.ModuleDict({
            'attn': nn.MultiheadAttention(d_model, num_heads, dropout=dropout, batch_first=True),
            'ffn': nn.Sequential(
                nn.Linear(d_model, d_ff),
                nn.GELU(),
                nn.Dropout(dropout),
                nn.Linear(d_ff, d_model),
                nn.Dropout(dropout)
            ),
            'norm1': nn.LayerNorm(d_model),
            'norm2': nn.LayerNorm(d_model)
        })
    
    def _init_weights(self):
        """Initialize weights with Xavier uniform."""
        for p in self.parameters():
            if p.dim() > 1:
                nn.init.xavier_uniform_(p)
    
    def _create_causal_mask(self, seq_len, device):
        """Create causal attention mask."""
        mask = torch.triu(torch.ones(seq_len, seq_len, device=device), diagonal=1)
        return mask.bool()
    
    def forward(self, x, return_attention=False):
        batch_size, seq_len, _ = x.shape
        
        # Input projection
        x = self.input_projection(x)
        
        # Add positional encoding
        x = self.pos_encoding(x)
        
        # Causal mask
        mask = self._create_causal_mask(seq_len, x.device)
        
        # Store attention weights for visualization
        attention_weights = []
        
        # Transformer layers (Pre-Norm)
        for layer in self.layers:
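            # Self-attention (per-head weights are computed only when requested;
            # average_attn_weights requires PyTorch >= 1.11)
            x_norm = layer['norm1'](x)
            attn_out, attn_w = layer['attn'](
                x_norm, x_norm, x_norm, attn_mask=mask,
                need_weights=return_attention, average_attn_weights=False
            )
            x = x + attn_out
            if return_attention:
                attention_weights.append(attn_w)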
            
            # Feed-forward
            x_norm = layer['norm2'](x)
            x = x + layer['ffn'](x_norm)
        
        # Final norm
        x = self.final_norm(x)
        
        # Output (use last position)
        output = self.output_head(x[:, -1])
        
        if return_attention:
            return output, attention_weights
        return output

# Initialize model
INPUT_DIM = X_train.shape[-1]
model = StockTransformer(
    input_dim=INPUT_DIM,
    d_model=64,
    num_heads=4,
    num_layers=3,
    d_ff=128,
    dropout=0.1
).to(device)

print(f"Model Parameters: {count_parameters(model):,}")
print(model)
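
As a sanity check on the printed parameter count, the total can be worked out by hand. The sketch below assumes that `count_parameters` counts every trainable tensor, that `PositionalEncoding` keeps its sinusoidal table as a non-trainable buffer, and that `nn.MultiheadAttention` uses its default packed projections with biases. `expected_param_count` is a hypothetical helper, not part of the notebook, and the feature count of 5 is only an example.

```python
def expected_param_count(input_dim, d_model=64, num_layers=3, d_ff=128):
    """Hand-computed trainable parameter count for the architecture above."""
    # Input projection: weight (input_dim x d_model) plus bias.
    input_proj = input_dim * d_model + d_model

    # Multi-head attention: packed Q/K/V projection (3*d*d + 3*d)
    # plus output projection (d*d + d). The head count does not change this.
    attn = 4 * d_model * d_model + 4 * d_model

    # Feed-forward block: two linear layers with biases.
    ffn = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)

    # Two LayerNorms per layer, each with a scale and a shift vector.
    norms = 2 * (2 * d_model)

    per_layer = attn + ffn + norms

    # Final LayerNorm and the two-layer output head.
    final_norm = 2 * d_model
    half = d_model // 2
    head = (d_model * half + half) + (half * 1 + 1)

    return input_proj + num_layers * per_layer + final_norm + head


# Example with a hypothetical feature count of 5.
print(expected_param_count(input_dim=5))
```

If the printed model size differs from this figure, the usual suspects are a learnable positional encoding or a different feature count.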

In [None]:
# Training Setup
criterion = nn.MSELoss()
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

def train_epoch(model, loader, criterion, optimizer):
    model.train()
    total_loss = 0
    for X_batch, y_batch in loader:
        X_batch, y_batch = X_batch.to(device), y_batch.to(device)
        
        optimizer.zero_grad()
        output = model(X_batch)
        loss = criterion(output, y_batch)
        loss.backward()
        
        # Gradient clipping
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        
        optimizer.step()
        total_loss += loss.item()
    
    return total_loss / len(loader)

def evaluate(model, loader, criterion):
    model.eval()
    total_loss = 0
    predictions, actuals = [], []
    
    with torch.no_grad():
        for X_batch, y_batch in loader:
            X_batch, y_batch = X_batch.to(device), y_batch.to(device)
            output = model(X_batch)
            loss = criterion(output, y_batch)
            total_loss += loss.item()
            
            predictions.extend(output.cpu().numpy())
            actuals.extend(y_batch.cpu().numpy())
    
    return total_loss / len(loader), np.array(predictions), np.array(actuals)
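
The scheduler above is configured with `T_max=100`, while the training loop in this notebook runs for at most 50 epochs, so the cosine curve only completes half of its descent. A minimal sketch of the closed-form schedule documented for `CosineAnnealingLR` makes this visible; `cosine_annealing_lr` is a hypothetical helper written for illustration, and it assumes the scheduler is stepped once per epoch with `eta_min=0`.

```python
import math


def cosine_annealing_lr(epoch, base_lr=1e-3, t_max=100, eta_min=0.0):
    """Closed-form cosine annealing: learning rate after `epoch` scheduler steps."""
    cosine = math.cos(math.pi * epoch / t_max)
    return eta_min + (base_lr - eta_min) * (1 + cosine) / 2


for epoch in (0, 25, 50, 100):
    print(f"epoch {epoch:>3}: lr = {cosine_annealing_lr(epoch):.6f}")
```

At epoch 50 the rate is still half of the base rate. Setting `T_max` equal to the number of epochs is one way to let the schedule reach its minimum by the end of training.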

In [None]:
# Training Loop
EPOCHS = 50
best_val_loss = float('inf')
patience = 10
patience_counter = 0
train_losses, val_losses = [], []

print("Training Transformer...")
print(f"{'Epoch':<8} {'Train Loss':<15} {'Val Loss':<15} {'LR':<12}")
print("-" * 50)

for epoch in range(EPOCHS):
    train_loss = train_epoch(model, train_loader, criterion, optimizer)
    val_loss, _, _ = evaluate(model, val_loader, criterion)
    
    train_losses.append(train_loss)
    val_losses.append(val_loss)
    
    # Learning rate scheduling
    scheduler.step()
    current_lr = scheduler.get_last_lr()[0]
    
    # Early stopping
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        torch.save(model.state_dict(), 'best_transformer.pth')
        patience_counter = 0
        marker = "*"
    else:
        patience_counter += 1
        marker = ""
    
    if (epoch + 1) % 5 == 0 or epoch == 0:
        print(f"{epoch+1:<8} {train_loss:<15.6f} {val_loss:<15.6f} {current_lr:<12.6f} {marker}")
    
    if patience_counter >= patience:
        print(f"\nEarly stopping at epoch {epoch+1}")
        break

print(f"\nBest validation loss: {best_val_loss:.6f}")
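
The early-stopping bookkeeping in the loop above can also be factored into a small object, which keeps the training loop shorter and makes the logic easy to test on a toy loss curve. This is a sketch only: `EarlyStopping` is a hypothetical class, not part of the notebook or of PyTorch, and the loss values are made up.

```python
class EarlyStopping:
    """Track the best validation loss and signal when patience runs out."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best_loss = float("inf")
        self.counter = 0

    def step(self, val_loss):
        """Return (improved, should_stop) for the latest validation loss."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.counter = 0
            return True, False
        self.counter += 1
        return False, self.counter >= self.patience


# Toy validation curve: improves twice, then stalls.
stopper = EarlyStopping(patience=3)
for epoch, loss in enumerate([0.9, 0.7, 0.8, 0.75, 0.71, 0.72], start=1):
    improved, stop = stopper.step(loss)
    if stop:
        print(f"stop at epoch {epoch}, best loss {stopper.best_loss}")
        break
```

In the real loop, `improved` would be the signal to save a checkpoint.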

In [None]:
# Plot Training History
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Loss curves
axes[0].plot(train_losses, label='Train Loss', linewidth=2)
axes[0].plot(val_losses, label='Val Loss', linewidth=2)
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('MSE Loss')
axes[0].set_title('Training & Validation Loss')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Log scale
axes[1].semilogy(train_losses, label='Train Loss', linewidth=2)
axes[1].semilogy(val_losses, label='Val Loss', linewidth=2)
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('MSE Loss (log scale)')
axes[1].set_title('Training & Validation Loss (Log Scale)')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Load Best Model and Evaluate
model.load_state_dict(torch.load('best_transformer.pth', map_location=device))
test_loss, predictions, actuals = evaluate(model, test_loader, criterion)

# Calculate metrics
mse = mean_squared_error(actuals, predictions)
rmse = np.sqrt(mse)
mae = mean_absolute_error(actuals, predictions)
r2 = r2_score(actuals, predictions)

# Direction accuracy
pred_direction = (predictions.flatten() > 0).astype(int)
actual_direction = (actuals.flatten() > 0).astype(int)
direction_accuracy = (pred_direction == actual_direction).mean()

print("\n" + "="*50)
print("TEST SET EVALUATION")
print("="*50)
print(f"MSE:  {mse:.6f}")
print(f"RMSE: {rmse:.6f}")
print(f"MAE:  {mae:.6f}")
print(f"R²:   {r2:.4f}")
print(f"Direction Accuracy: {direction_accuracy:.2%}")
print("="*50)
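
To make the metrics above concrete, the sketch below recomputes them with plain NumPy on a four-point toy series. `regression_report` is a hypothetical helper and the numbers are invented; it assumes, as the cell above does, that targets and predictions are returns, so the sign carries the direction.

```python
import numpy as np


def regression_report(actuals, predictions):
    """Recompute MSE, RMSE, MAE, R² and direction accuracy by hand."""
    actuals = np.asarray(actuals, dtype=float).ravel()
    predictions = np.asarray(predictions, dtype=float).ravel()
    errors = predictions - actuals

    mse = np.mean(errors ** 2)
    ss_res = np.sum(errors ** 2)                        # residual sum of squares
    ss_tot = np.sum((actuals - actuals.mean()) ** 2)    # variance around the mean

    return {
        "mse": mse,
        "rmse": np.sqrt(mse),
        "mae": np.mean(np.abs(errors)),
        "r2": 1.0 - ss_res / ss_tot,
        # Fraction of samples where predicted and realized signs agree.
        "direction_accuracy": np.mean((predictions > 0) == (actuals > 0)),
    }


toy_actuals = [0.01, -0.02, 0.03, -0.01]
toy_preds = [0.02, -0.01, -0.01, -0.02]
report = regression_report(toy_actuals, toy_preds)
for name, value in report.items():
    print(f"{name}: {value:.6f}")
```

In this toy case three of four directions are correct, yet R² is negative because one large miss makes the model worse than predicting the mean. Return forecasts often show this combination, so both numbers are worth reporting.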

In [None]:
# Visualize Predictions
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Predictions vs Actuals
axes[0, 0].plot(actuals[:100], label='Actual', alpha=0.7, linewidth=2)
axes[0, 0].plot(predictions[:100], label='Predicted', alpha=0.7, linewidth=2)
axes[0, 0].set_xlabel('Time Step')
axes[0, 0].set_ylabel('Return')
axes[0, 0].set_title('Predictions vs Actuals (First 100 samples)')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# Scatter plot
axes[0, 1].scatter(actuals, predictions, alpha=0.3, s=10)
axes[0, 1].plot([actuals.min(), actuals.max()], [actuals.min(), actuals.max()], 'r--', linewidth=2)
axes[0, 1].set_xlabel('Actual Return')
axes[0, 1].set_ylabel('Predicted Return')
axes[0, 1].set_title(f'Scatter Plot (R² = {r2:.4f})')
axes[0, 1].grid(True, alpha=0.3)

# Error distribution
errors = predictions.flatten() - actuals.flatten()
axes[1, 0].hist(errors, bins=50, edgecolor='black', alpha=0.7)
axes[1, 0].axvline(0, color='red', linestyle='--', linewidth=2)
axes[1, 0].set_xlabel('Prediction Error')
axes[1, 0].set_ylabel('Frequency')
axes[1, 0].set_title(f'Error Distribution (Mean: {errors.mean():.6f})')

# Cumulative return (trading simulation)
# Simple long/short strategy: go long when predicted return > 0, short otherwise
strategy_returns = np.where(predictions.flatten() > 0, actuals.flatten(), -actuals.flatten())
cumulative_strategy = np.cumprod(1 + strategy_returns) - 1
cumulative_buy_hold = np.cumprod(1 + actuals.flatten()) - 1

axes[1, 1].plot(cumulative_buy_hold * 100, label='Buy & Hold', linewidth=2)
axes[1, 1].plot(cumulative_strategy * 100, label='Transformer Strategy', linewidth=2)
axes[1, 1].set_xlabel('Time Step')
axes[1, 1].set_ylabel('Cumulative Return (%)')
axes[1, 1].set_title('Strategy Performance (Test Set)')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"\nStrategy Total Return: {cumulative_strategy[-1]*100:.2f}%")
print(f"Buy & Hold Return: {cumulative_buy_hold[-1]*100:.2f}%")
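
The strategy computed above is a long/short rule: the position is +1 when the predicted return is positive and -1 otherwise, and returns are compounded. A minimal sketch on three invented returns shows the arithmetic; `long_short_cumulative` is a hypothetical helper, and like the cell above it ignores transaction costs and slippage.

```python
import numpy as np


def long_short_cumulative(predictions, actuals):
    """Compounded return of a +1/-1 position driven by the prediction sign."""
    predictions = np.asarray(predictions, dtype=float).ravel()
    actuals = np.asarray(actuals, dtype=float).ravel()
    position = np.where(predictions > 0, 1.0, -1.0)   # long or short, never flat
    strategy_returns = position * actuals
    return np.cumprod(1 + strategy_returns) - 1


toy_actuals = np.array([0.10, -0.10, 0.05])
toy_preds = np.array([0.02, -0.01, -0.03])   # right, right, wrong

strategy = long_short_cumulative(toy_preds, toy_actuals)
buy_hold = np.cumprod(1 + toy_actuals) - 1

print("strategy:", np.round(strategy, 4))
print("buy & hold:", np.round(buy_hold, 4))
```

The two correct calls compound to 21%, and the wrong call on the last step pulls the strategy back to 14.95%, against 3.95% for buy and hold.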

In [None]:
# Visualize Attention Patterns
model.eval()
with torch.no_grad():
    sample = X_test_t[:1].to(device)
    _, attention_weights = model(sample, return_attention=True)

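# Plot attention from last layer
attn = attention_weights[-1][0].cpu().numpy()  # Last layer, first sample
averaged = attn.ndim == 2  # (seq, seq) if PyTorch averaged the weights over heads
if averaged:
    attn = attn[None]  # Treat the averaged map as a single panel

n_panels = min(4, attn.shape[0])
fig, axes = plt.subplots(1, n_panels, figsize=(4 * n_panels, 4), squeeze=False)

for head in range(n_panels):
    ax = axes[0, head]
    im = ax.imshow(attn[head], cmap='Blues', aspect='auto')
    ax.set_xlabel('Key Position')
    ax.set_ylabel('Query Position')
    ax.set_title('Average over heads' if averaged else f'Head {head + 1}')
    plt.colorbar(im, ax=ax)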

plt.suptitle('Attention Patterns (Last Layer)', fontsize=12)
plt.tight_layout()
plt.show()

print("\nAttention Interpretation:")
print("- Diagonal patterns: focus on recent positions")
print("- Horizontal bands: certain positions attended by all queries")
print("- Lower triangle only: causal masking working correctly")
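
The "lower triangle only" pattern can be checked numerically instead of by eye. The sketch below rebuilds a causal softmax in NumPy on random scores and confirms that no weight lands on future positions. `causal_attention_weights` is a hypothetical helper that mirrors the boolean mask used by the model (True above the diagonal means "blocked"); it is not the code path `nn.MultiheadAttention` runs internally.

```python
import numpy as np


def causal_attention_weights(scores):
    """Row-wise softmax over scores with future positions masked out."""
    seq_len = scores.shape[-1]
    # True above the diagonal marks key positions a query must not see.
    mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    masked = np.where(mask, -np.inf, scores)
    # Subtract the row maximum for numerical stability; each row keeps
    # at least its diagonal entry, so the maximum is always finite.
    masked = masked - masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)
    return weights / weights.sum(axis=-1, keepdims=True)


rng = np.random.default_rng(0)
weights = causal_attention_weights(rng.normal(size=(5, 5)))
upper = np.triu(weights, k=1)

print("max weight on future positions:", upper.max())
print("row sums:", weights.sum(axis=-1))
```

The same two checks (zero mass above the diagonal, rows summing to one) can be applied to any attention matrix returned by the model.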

---
# Part 4: Week 15 Summary

## Key Concepts Covered

### Day 1: Attention Mechanisms
- Self-attention fundamentals
- Query, Key, Value projections
- Scaled dot-product attention

### Day 2: Multi-Head Attention
- Parallel attention heads
- Head concatenation and projection
- Different heads learn different patterns

### Day 3: Transformer Architecture
- Encoder-decoder structure
- Positional encoding
- Layer normalization and residual connections

### Day 4: Transformers for Time Series
- Adapting transformers for continuous data
- Causal masking for forecasting
- Feature engineering for financial data

### Day 5: Advanced Topics
- Efficient attention variants
- Pre-norm vs Post-norm
- Hyperparameter tuning

### Day 6: Practical Implementation
- Training strategies
- Regularization techniques
- Model evaluation

### Day 7: Interview Review
- 10 key interview questions
- Common mistakes
- Complete mini-project

In [None]:
# Week 15 Quick Reference Card
reference_card = """
╔══════════════════════════════════════════════════════════════════╗
║                WEEK 15 QUICK REFERENCE CARD                      ║
╠══════════════════════════════════════════════════════════════════╣
║  ATTENTION FORMULA:                                              ║
║  Attention(Q,K,V) = softmax(QK^T / √d_k) × V                     ║
║                                                                  ║
║  MULTI-HEAD:                                                     ║
║  MultiHead = Concat(head_1,...,head_h) × W_O                     ║
║                                                                  ║
║  POSITIONAL ENCODING:                                            ║
║  PE(pos,2i) = sin(pos/10000^(2i/d))                              ║
║  PE(pos,2i+1) = cos(pos/10000^(2i/d))                            ║
║                                                                  ║
║  COMPLEXITY: O(n²d) time, O(n²) space                            ║
║                                                                  ║
║  KEY HYPERPARAMETERS:                                            ║
║  • d_model: 64-512 (must be divisible by num_heads)              ║
║  • num_heads: 4-16                                               ║
║  • num_layers: 2-12                                              ║
║  • d_ff: 2-4 × d_model                                           ║
║  • dropout: 0.1-0.3                                              ║
║                                                                  ║
║  FINANCE CHECKLIST:                                              ║
║  □ Use causal masking (no look-ahead)                            ║
║  □ Scale data (fit only on training)                             ║
║  □ Use returns not raw prices                                    ║
║  □ Time-based train/val/test split                               ║
║  □ Include positional encoding                                   ║
║  □ Start with small model, scale up                              ║
╚══════════════════════════════════════════════════════════════════╝
"""
print(reference_card)
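
The positional-encoding formulas on the card translate directly into a few lines of NumPy. This is a sketch under stated assumptions: `sinusoidal_encoding` is a hypothetical function, it requires an even `d_model`, and the notebook's own `PositionalEncoding` module may differ in details such as dropout or tensor layout.

```python
import numpy as np


def sinusoidal_encoding(max_len, d_model):
    """Return a (max_len, d_model) table of sin/cos encodings (d_model even)."""
    positions = np.arange(max_len)[:, None]            # (max_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]          # (1, d_model / 2): the 2i values
    angles = positions / np.power(10000.0, two_i / d_model)

    table = np.zeros((max_len, d_model))
    table[:, 0::2] = np.sin(angles)   # even dimensions: PE(pos, 2i)
    table[:, 1::2] = np.cos(angles)   # odd dimensions:  PE(pos, 2i+1)
    return table


pe = sinusoidal_encoding(max_len=500, d_model=64)
print(pe.shape)
```

Position 0 encodes as alternating 0 and 1, and low dimensions oscillate quickly while high dimensions change slowly, which gives each position a distinct pattern across dimensions.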

In [None]:
# Final Summary Statistics
print("\n" + "="*60)
print("WEEK 15 COMPLETION SUMMARY")
print("="*60)

summary = {
    "Topics Covered": 7,
    "Interview Questions": 10,
    "Common Mistakes Addressed": 5,
    "Model Parameters": f"{count_parameters(model):,}",
    "Training Samples": len(X_train),
    "Test MSE": f"{mse:.6f}",
    "Test R²": f"{r2:.4f}",
    "Direction Accuracy": f"{direction_accuracy:.2%}"
}

for key, value in summary.items():
    print(f"{key}: {value}")

print("\n" + "="*60)
print("✅ Week 15: Attention & Transformers - COMPLETE!")
print("="*60)
print("\nNext Steps:")
print("  → Week 16: Reinforcement Learning for Trading")
print("  → Apply transformers to multi-asset portfolios")
print("  → Experiment with efficient attention variants")

---
## Additional Resources

### Papers
1. "Attention Is All You Need" - Vaswani et al. (2017)
2. "Temporal Fusion Transformers for Interpretable Multi-horizon Time Series Forecasting" - Lim et al. (2019)
3. "Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting" - Zhou et al. (2021)

### Books
- "Advances in Financial Machine Learning" - Marcos López de Prado
- "Machine Learning for Finance" - Jannes Klaas

### Practice Problems
1. Implement linear attention approximation
2. Add learnable positional encoding
3. Build cross-attention for multi-asset prediction
4. Implement Temporal Fusion Transformer

---
*Week 15 Day 7 Complete - Interview Review & Week Summary*