# Day 5: Neural Network Architecture Design for Trading

## Week 13 - Neural Networks in Quantitative Finance

---

## Learning Objectives

By the end of this notebook, you will understand:

1. **MLP Depth vs Width Tradeoffs** - How to balance network architecture
2. **Skip Connections & Residual Networks** - Combating vanishing gradients
3. **Embedding Layers** - Handling categorical features (sector, asset class)
4. **Multi-Input Architectures** - Combining price, volume, and sentiment data
5. **AutoML & Neural Architecture Search** - Automated architecture optimization
6. **Practical Application** - Design optimal architecture for a factor model

---

## Why Architecture Matters in Trading

In quantitative finance, the choice of neural network architecture can significantly impact:

- **Signal-to-Noise Ratio**: Financial data is inherently noisy; architecture affects extraction of true signals
- **Overfitting Risk**: Markets are non-stationary; simpler architectures may generalize better
- **Inference Speed**: For real-time trading, latency matters
- **Interpretability**: Regulatory requirements may demand model explainability

### European Market Considerations 🇪🇺

- **MiFID II Compliance**: Model governance and explainability requirements
- **Trading Hours**: LSE trades 08:00-16:30 UK time; Euronext and XETRA trade 09:00-17:30 CET
- **Multi-Currency**: Need to handle EUR, GBP, CHF denominated assets
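- **Cross-Market Correlations**: European sessions close hours before the US close, so same-day closes are not synchronous and European indices can appear to lead or lag US markets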

---

## 1. Environment Setup and Imports

In [1]:
# Core libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')

# PyTorch imports
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset, Dataset

# Data acquisition
import yfinance as yf

# Scikit-learn utilities
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Set random seeds for reproducibility
SEED = 42
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)

# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
print(f"PyTorch version: {torch.__version__}")

# Plotting settings
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 10

Using device: cpu
PyTorch version: 2.9.1


---

## 2. Load Market Data

We'll load a diverse set of assets including European stocks to build our multi-asset architecture examples.

In [2]:
# Define tickers: Mix of US and European assets
# European tickers often have suffixes: .L (London), .PA (Paris), .DE (Germany)
tickers = {
    # US Tech
    'AAPL': {'sector': 'Technology', 'asset_class': 'Equity', 'region': 'US'},
    'MSFT': {'sector': 'Technology', 'asset_class': 'Equity', 'region': 'US'},
    'GOOGL': {'sector': 'Technology', 'asset_class': 'Equity', 'region': 'US'},
    # US Finance
    'JPM': {'sector': 'Financials', 'asset_class': 'Equity', 'region': 'US'},
    'GS': {'sector': 'Financials', 'asset_class': 'Equity', 'region': 'US'},
    # European stocks
    'SAP': {'sector': 'Technology', 'asset_class': 'Equity', 'region': 'EU'},  # German tech (US-listed shares, no suffix)
    'ASML': {'sector': 'Technology', 'asset_class': 'Equity', 'region': 'EU'},  # Dutch semiconductor (US-listed shares, no suffix)
    'HSBA.L': {'sector': 'Financials', 'asset_class': 'Equity', 'region': 'EU'},  # UK bank
    # ETFs for broader market
    'SPY': {'sector': 'Index', 'asset_class': 'ETF', 'region': 'US'},
    'EWG': {'sector': 'Index', 'asset_class': 'ETF', 'region': 'EU'},  # Germany ETF
}

# Download data
# Using 3 years of data for robust training
end_date = datetime.now()
start_date = end_date - timedelta(days=3*365)

print(f"Downloading data from {start_date.date()} to {end_date.date()}")
print("-" * 50)

# Download all tickers
price_data = {}
volume_data = {}

for ticker in tickers.keys():
    try:
        # Download OHLCV data
        df = yf.download(ticker, start=start_date, end=end_date, progress=False)
        if len(df) > 100:  # Ensure sufficient data
            # Recent yfinance versions return MultiIndex columns even for a single
            # ticker, so df['Close'] can be a one-column DataFrame. squeeze() turns
            # it into a Series; otherwise pd.DataFrame(price_data) below raises
            # "If using all scalar values, you must pass an index".
            price_data[ticker] = df['Close'].squeeze()
            volume_data[ticker] = df['Volume'].squeeze()
            print(f"✓ {ticker}: {len(df)} days of data")
        else:
            print(f"✗ {ticker}: Insufficient data ({len(df)} days)")
    except Exception as e:
        print(f"✗ {ticker}: Error - {str(e)}")

# Create DataFrames
prices_df = pd.DataFrame(price_data)
volume_df = pd.DataFrame(volume_data)

# Forward fill missing values (holidays may differ between markets)
prices_df = prices_df.ffill().dropna()
volume_df = volume_df.ffill().dropna()

print(f"\nFinal dataset: {len(prices_df)} trading days, {len(prices_df.columns)} assets")
prices_df.tail()

Downloading data from 2023-01-24 to 2026-01-23
--------------------------------------------------
✓ AAPL: 752 days of data
✓ MSFT: 752 days of data
✓ GOOGL: 752 days of data
✓ JPM: 752 days of data
✓ GS: 752 days of data
✓ SAP: 752 days of data
✓ ASML: 752 days of data
✓ HSBA.L: 758 days of data
✓ SPY: 752 days of data
✓ EWG: 752 days of data

---

## 3. Feature Engineering for Neural Networks

We'll create a comprehensive feature set including:
- Technical indicators (momentum, volatility)
- Cross-asset features
- Categorical features (sector, region)

In [None]:
def create_features(prices_df, volume_df, tickers_info, lookback_windows=[5, 10, 20, 60]):
    """
    Create comprehensive feature set for neural network training.
    
    Parameters:
    -----------
    prices_df : pd.DataFrame - Close prices for all assets
    volume_df : pd.DataFrame - Volume data for all assets
    tickers_info : dict - Metadata about each ticker
    lookback_windows : list - Windows for calculating features
    
    Returns:
    --------
    features_dict : dict - Dictionary containing feature DataFrames
    """
    features_dict = {}
    
    for ticker in prices_df.columns:
        price = prices_df[ticker]
        volume = volume_df[ticker] if ticker in volume_df.columns else None
        
        # Initialize feature DataFrame
        features = pd.DataFrame(index=prices_df.index)
        
        # ===== PRICE-BASED FEATURES =====
        # Returns at different horizons
        features['return_1d'] = price.pct_change(1)
        features['return_5d'] = price.pct_change(5)
        features['return_20d'] = price.pct_change(20)
        
        # Momentum features
        for window in lookback_windows:
            # Simple moving average ratio
            sma = price.rolling(window).mean()
            features[f'sma_ratio_{window}'] = price / sma - 1
            
            # Volatility (annualized)
            features[f'volatility_{window}'] = price.pct_change().rolling(window).std() * np.sqrt(252)
            
            # Price momentum
            features[f'momentum_{window}'] = price / price.shift(window) - 1
        
        # RSI (Relative Strength Index)
        delta = price.diff()
        gain = (delta.where(delta > 0, 0)).rolling(window=14).mean()
        loss = (-delta.where(delta < 0, 0)).rolling(window=14).mean()
        rs = gain / loss
        features['rsi'] = 100 - (100 / (1 + rs))
        
        # ===== VOLUME-BASED FEATURES =====
        if volume is not None:
            features['volume_sma_ratio'] = volume / volume.rolling(20).mean()
            features['volume_change'] = volume.pct_change()
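            # Rough on-balance-volume proxy: 10-day mean of signed volume
            # (a rolling average, not a fitted slope, despite the column name)
            features['obv_slope'] = (volume * np.sign(price.diff())).rolling(10).mean()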
        
        # ===== CATEGORICAL FEATURES =====
        # These will be encoded later using embeddings
        if ticker in tickers_info:
            features['sector'] = tickers_info[ticker]['sector']
            features['asset_class'] = tickers_info[ticker]['asset_class']
            features['region'] = tickers_info[ticker]['region']
        
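        # ===== TARGET VARIABLE =====
        # Binary classification: Will price go up in next day?
        # The last row has no next-day price, so its target is left as NaN
        # (and removed by dropna()) instead of being mislabeled as 0
        next_price = price.shift(-1)
        features['target'] = (next_price > price).astype(float).where(next_price.notna())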
        
        # Store features
        features_dict[ticker] = features
    
    return features_dict

# Create features for all assets
features_dict = create_features(prices_df, volume_df, tickers)

# Display sample features
sample_ticker = list(features_dict.keys())[0]
print(f"Features for {sample_ticker}:")
print(f"Shape: {features_dict[sample_ticker].shape}")
features_dict[sample_ticker].dropna().tail()

---

## 4. MLP Depth vs Width Tradeoffs

### Key Concepts

**Width (neurons per layer):**
- More neurons = higher capacity to learn complex patterns
- Risk: Overfitting, especially with limited financial data
- Benefit: Can capture more features simultaneously

**Depth (number of layers):**
- More layers = ability to learn hierarchical features
- Risk: Vanishing gradients, harder to train
- Benefit: Can learn compositional representations

### Trading-Specific Considerations

| Scenario | Recommended Architecture |
|----------|-------------------------|
| Small dataset (<1000 samples) | Shallow & narrow (2 layers, 32-64 neurons) |
| Medium dataset (1000-10000) | Moderate depth (3-4 layers, 64-128 neurons) |
| Large dataset (>10000) | Deeper networks possible (5+ layers) |
| High-frequency signals | Narrower networks (faster inference) |
| Factor models | Wider first layer (capture many factors) |
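
To make the width/depth tradeoff concrete, the sketch below counts trainable parameters of a fully connected network by hand. The helper name `mlp_param_count` is hypothetical (not part of PyTorch), and the input dimension of 20 is an illustrative assumption. It shows that the parameter count grows with the product of adjacent layer widths, so widening layers is far more expensive than stacking narrow ones.

```python
def mlp_param_count(input_dim, hidden_dims, output_dim=1, batch_norm=False):
    """Count trainable parameters of a fully connected network (hypothetical helper)."""
    total = 0
    prev = input_dim
    for width in hidden_dims:
        total += prev * width + width   # weight matrix + bias vector
        if batch_norm:
            total += 2 * width          # BatchNorm scale and shift
        prev = width
    total += prev * output_dim + output_dim  # output layer
    return total


# Two wide layers versus five narrow layers, 20 input features (assumed)
shallow_wide = mlp_param_count(20, [256, 128])
deep_narrow = mlp_param_count(20, [64] * 5)
print(f"Shallow & wide: {shallow_wide:,}")  # 38,401
print(f"Deep & narrow:  {deep_narrow:,}")   # 18,049
```

Under these assumptions the two-layer wide network carries roughly twice the parameters of the five-layer narrow one, which matters when only a few thousand training samples are available.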

In [None]:
class ConfigurableMLP(nn.Module):
    """
    Configurable MLP to experiment with depth and width.
    
    This architecture allows systematic comparison of different
    network configurations for trading applications.
    """
    
    def __init__(self, input_dim, hidden_dims, output_dim=1, 
                 dropout=0.3, activation='relu', batch_norm=True):
        """
        Parameters:
        -----------
        input_dim : int - Number of input features
        hidden_dims : list - List of hidden layer dimensions (defines depth & width)
        output_dim : int - Output dimension (1 for binary classification)
        dropout : float - Dropout rate for regularization
        activation : str - Activation function ('relu', 'leaky_relu', 'elu', 'gelu')
        batch_norm : bool - Whether to use batch normalization
        """
        super(ConfigurableMLP, self).__init__()
        
        self.input_dim = input_dim
        self.hidden_dims = hidden_dims
        
        # Select activation function
        activation_funcs = {
            'relu': nn.ReLU(),
            'leaky_relu': nn.LeakyReLU(0.1),
            'elu': nn.ELU(),
            'gelu': nn.GELU()  # Popular in transformers
        }
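        # Fail loudly on an unknown name instead of silently falling back to ReLU
        if activation not in activation_funcs:
            raise ValueError(f"Unknown activation: {activation!r}")
        self.activation = activation_funcs[activation]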
        
        # Build layers dynamically
        layers = []
        prev_dim = input_dim
        
        for i, hidden_dim in enumerate(hidden_dims):
            # Linear layer
            layers.append(nn.Linear(prev_dim, hidden_dim))
            
            # Batch normalization (helps with training stability)
            if batch_norm:
                layers.append(nn.BatchNorm1d(hidden_dim))
            
            # Activation
            layers.append(self.activation)
            
            # Dropout (regularization crucial for financial data)
            layers.append(nn.Dropout(dropout))
            
            prev_dim = hidden_dim
        
        # Create sequential container
        self.hidden_layers = nn.Sequential(*layers)
        
        # Output layer
        self.output_layer = nn.Linear(prev_dim, output_dim)
    
    def forward(self, x):
        """Forward pass through the network."""
        x = self.hidden_layers(x)
        x = self.output_layer(x)
        return x
    
    def count_parameters(self):
        """Count total trainable parameters."""
        return sum(p.numel() for p in self.parameters() if p.requires_grad)


# Compare different architectures
input_dim = 20  # Number of features

architectures = {
    'Shallow_Wide': [256, 128],           # 2 layers, wide
    'Deep_Narrow': [64, 64, 64, 64, 64],  # 5 layers, narrow
    'Pyramid': [256, 128, 64, 32],        # Decreasing width
    'Hourglass': [64, 128, 64],           # Expand then contract
    'Uniform': [128, 128, 128],           # Same width throughout
}

print("Architecture Comparison:")
print("=" * 60)
print(f"{'Name':<15} {'Layers':<8} {'Params':<12} {'Architecture'}")
print("-" * 60)

for name, hidden_dims in architectures.items():
    model = ConfigurableMLP(input_dim, hidden_dims)
    params = model.count_parameters()
    print(f"{name:<15} {len(hidden_dims):<8} {params:<12,} {hidden_dims}")

print("\n💡 Insight: More parameters ≠ better performance in finance!")

---

## 5. Skip Connections and Residual Networks

### Why Skip Connections Matter in Trading

**The Vanishing Gradient Problem:**
- Deep networks struggle to propagate gradients to early layers
- Financial signals are often weak; we can't afford to lose gradient information

**Benefits of Residual Connections:**
1. **Gradient Highway**: Direct path for gradients to flow
2. **Identity Mapping**: Network can learn to skip unnecessary transformations
3. **Feature Preservation**: Original features available at all depths
4. **Ensemble Effect**: Implicitly creates an ensemble of networks
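
The "gradient highway" idea can be illustrated with a toy scalar network. This is a simplified sketch that ignores nonlinearities and normalization: each plain layer maps `h` to `local_slope * h`, and each residual layer maps `h` to `h + local_slope * h`. The function name and the slope value of 0.1 are assumptions chosen for illustration.

```python
def end_to_end_gradient(local_slope, depth, residual):
    """Gradient of the final output with respect to the input of a toy scalar network.

    By the chain rule the gradient is a product of one factor per layer:
    local_slope for a plain layer, (1 + local_slope) for a residual layer.
    """
    factor = 1.0 + local_slope if residual else local_slope
    grad = 1.0
    for _ in range(depth):
        grad *= factor
    return grad


plain = end_to_end_gradient(0.1, depth=10, residual=False)
skip = end_to_end_gradient(0.1, depth=10, residual=True)
print(f"Plain stack:    {plain:.2e}")
print(f"Residual stack: {skip:.2f}")
```

In the plain stack the gradient shrinks geometrically with depth, while the identity term keeps each residual factor near 1. The same product can also grow when slopes are large, which is one reason residual blocks are usually paired with normalization.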

In [None]:
class ResidualBlock(nn.Module):
    """
    Residual block with skip connection.
    
    Output = F(x) + x, where F is a learned transformation.
    This allows gradients to flow directly through the skip connection.
    """
    
    def __init__(self, dim, dropout=0.3):
        super(ResidualBlock, self).__init__()
        
        # Two-layer transformation (typical for ResNets)
        self.block = nn.Sequential(
            nn.Linear(dim, dim),
            nn.BatchNorm1d(dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(dim, dim),
            nn.BatchNorm1d(dim)
        )
        
        self.activation = nn.ReLU()
    
    def forward(self, x):
        # Skip connection: add input to transformed output
        residual = x
        out = self.block(x)
        out = out + residual  # The magic of residual learning!
        out = self.activation(out)
        return out


class ResidualMLP(nn.Module):
    """
    MLP with residual connections for trading signal prediction.
    
    Architecture:
    Input -> Projection -> [ResBlock] x N -> Output
    """
    
    def __init__(self, input_dim, hidden_dim=128, num_blocks=4, 
                 output_dim=1, dropout=0.3):
        super(ResidualMLP, self).__init__()
        
        # Project input to hidden dimension
        self.input_projection = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.BatchNorm1d(hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout)
        )
        
        # Stack of residual blocks
        self.res_blocks = nn.ModuleList([
            ResidualBlock(hidden_dim, dropout) for _ in range(num_blocks)
        ])
        
        # Output layer
        self.output_layer = nn.Linear(hidden_dim, output_dim)
    
    def forward(self, x):
        # Project to hidden space
        x = self.input_projection(x)
        
        # Pass through residual blocks
        for block in self.res_blocks:
            x = block(x)
        
        # Generate output
        return self.output_layer(x)


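# Compare standard MLP vs Residual MLP
# Note: each residual block holds two linear layers, so with 4 blocks the residual
# model has well over twice the parameters of the 4-layer standard MLP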
print("Standard MLP vs Residual MLP:")
print("=" * 50)

standard_mlp = ConfigurableMLP(input_dim=20, hidden_dims=[128, 128, 128, 128])
residual_mlp = ResidualMLP(input_dim=20, hidden_dim=128, num_blocks=4)

print(f"Standard MLP parameters: {standard_mlp.count_parameters():,}")
print(f"Residual MLP parameters: {sum(p.numel() for p in residual_mlp.parameters()):,}")

print("\n📊 Residual networks often train faster and achieve better optima!")

---

## 6. Embedding Layers for Categorical Features

### Why Embeddings?

In trading, we often have categorical features:
- **Sector**: Technology, Financials, Healthcare, etc.
- **Asset Class**: Equity, Fixed Income, Commodity, FX
- **Region**: US, EU, APAC
- **Exchange**: NYSE, NASDAQ, LSE, Euronext

**One-Hot Encoding Issues:**
- High dimensionality with many categories
- No relationship learned between categories
- Sparse representation is inefficient

**Embedding Benefits:**
- Dense, learned representations
- Captures semantic similarity (Tech and Semiconductors are related)
- Dimensionality reduction
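
An embedding layer is mathematically a one-hot vector multiplied by a weight matrix, implemented as a direct row lookup. The pure-Python sketch below checks that equivalence. The sector list and the 2-D embedding values are made up for illustration; in a trained model the table entries would be learned.

```python
sectors = ["Technology", "Financials", "Healthcare"]
sector_to_idx = {name: i for i, name in enumerate(sectors)}

# Hypothetical 2-D embedding table, one row per sector (values are illustrative)
embedding_table = [
    [0.9, 0.1],    # Technology
    [-0.4, 0.8],   # Financials
    [0.2, -0.7],   # Healthcare
]


def one_hot(index, size):
    """Sparse indicator vector with a single 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def vector_times_matrix(vector, matrix):
    """Row vector times matrix, written out explicitly."""
    n_cols = len(matrix[0])
    return [
        sum(vector[r] * matrix[r][c] for r in range(len(matrix)))
        for c in range(n_cols)
    ]


idx = sector_to_idx["Financials"]
via_one_hot = vector_times_matrix(one_hot(idx, len(sectors)), embedding_table)
via_lookup = embedding_table[idx]
print(via_one_hot, via_lookup)
```

Both routes give the same dense vector; the lookup simply skips the multiplications by zero, which is why embeddings stay cheap even with many categories.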

In [None]:
class TradingEmbeddingMLP(nn.Module):
    """
    MLP with embedding layers for categorical features.
    
    This architecture combines:
    - Numerical features (returns, volatility, etc.)
    - Categorical features via embeddings (sector, asset class, region)
    """
    
    def __init__(self, num_numerical_features, 
                 num_sectors, sector_embed_dim,
                 num_asset_classes, asset_embed_dim,
                 num_regions, region_embed_dim,
                 hidden_dims=[128, 64], output_dim=1, dropout=0.3):
        """
        Parameters:
        -----------
        num_numerical_features : int - Number of numerical input features
        num_sectors : int - Number of unique sectors
        sector_embed_dim : int - Embedding dimension for sectors
        num_asset_classes : int - Number of unique asset classes
        asset_embed_dim : int - Embedding dimension for asset classes
        num_regions : int - Number of unique regions
        region_embed_dim : int - Embedding dimension for regions
        hidden_dims : list - Hidden layer dimensions
        output_dim : int - Output dimension
        dropout : float - Dropout rate
        """
        super(TradingEmbeddingMLP, self).__init__()
        
        # Embedding layers for categorical features
        self.sector_embedding = nn.Embedding(num_sectors, sector_embed_dim)
        self.asset_embedding = nn.Embedding(num_asset_classes, asset_embed_dim)
        self.region_embedding = nn.Embedding(num_regions, region_embed_dim)
        
        # Calculate total input dimension after concatenation
        total_embed_dim = sector_embed_dim + asset_embed_dim + region_embed_dim
        total_input_dim = num_numerical_features + total_embed_dim
        
        # Build MLP layers
        layers = []
        prev_dim = total_input_dim
        
        for hidden_dim in hidden_dims:
            layers.extend([
                nn.Linear(prev_dim, hidden_dim),
                nn.BatchNorm1d(hidden_dim),
                nn.ReLU(),
                nn.Dropout(dropout)
            ])
            prev_dim = hidden_dim
        
        self.mlp = nn.Sequential(*layers)
        self.output_layer = nn.Linear(prev_dim, output_dim)
    
    def forward(self, numerical_features, sector_idx, asset_idx, region_idx):
        """
        Forward pass combining numerical and categorical features.
        
        Parameters:
        -----------
        numerical_features : tensor - Numerical features (batch_size, num_features)
        sector_idx : tensor - Sector indices (batch_size,)
        asset_idx : tensor - Asset class indices (batch_size,)
        region_idx : tensor - Region indices (batch_size,)
        """
        # Get embeddings
        sector_emb = self.sector_embedding(sector_idx)
        asset_emb = self.asset_embedding(asset_idx)
        region_emb = self.region_embedding(region_idx)
        
        # Concatenate all features
        x = torch.cat([numerical_features, sector_emb, asset_emb, region_emb], dim=1)
        
        # Pass through MLP
        x = self.mlp(x)
        return self.output_layer(x)


# Example: Create embedding model for our data
# Define categorical mappings
sectors = ['Technology', 'Financials', 'Healthcare', 'Index', 'Energy']
asset_classes = ['Equity', 'ETF', 'Bond', 'Commodity']
regions = ['US', 'EU', 'APAC']

# Create encoders
sector_encoder = LabelEncoder()
sector_encoder.fit(sectors)

asset_encoder = LabelEncoder()
asset_encoder.fit(asset_classes)

region_encoder = LabelEncoder()
region_encoder.fit(regions)

# Create model
embedding_model = TradingEmbeddingMLP(
    num_numerical_features=15,
    num_sectors=len(sectors),
    sector_embed_dim=4,  # Embed 5 sectors into 4D space
    num_asset_classes=len(asset_classes),
    asset_embed_dim=3,
    num_regions=len(regions),
    region_embed_dim=2,
    hidden_dims=[64, 32]
)

print("Embedding Model Architecture:")
print("=" * 50)
print(embedding_model)
print(f"\nTotal parameters: {sum(p.numel() for p in embedding_model.parameters()):,}")

---

## 7. Multi-Input Architectures

### Combining Multiple Data Sources

Real-world trading systems often combine:
1. **Price Data**: Returns, volatility, technical indicators
2. **Volume Data**: Trading activity, liquidity metrics
3. **Sentiment Data**: News sentiment, social media signals
4. **Fundamental Data**: Earnings, valuations
5. **Alternative Data**: Satellite imagery, web traffic

### Multi-Head Architecture Benefits
- Each head can specialize in processing its data type
- Appropriate preprocessing for each modality
- Feature extraction tailored to data characteristics
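
As a rough sketch of what "appropriate preprocessing for each modality" could look like, the snippet below standardizes returns with statistics fitted on training data only, log-scales raw volume, and clips sentiment scores. The helper names, the sample values, and the assumption that sentiment lives in [-1, 1] are all illustrative choices, not requirements of the architecture.

```python
import math
import statistics


def fit_standardizer(train_values):
    """Estimate mean and standard deviation on training data only (avoids look-ahead)."""
    return statistics.fmean(train_values), statistics.pstdev(train_values)


def standardize(values, mean, std):
    """Apply a previously fitted z-score transform."""
    return [(v - mean) / std for v in values]


def log_scale(volumes):
    """Compress heavy-tailed volume with log(1 + v); zero volume maps to 0."""
    return [math.log1p(v) for v in volumes]


def clip(values, low=-1.0, high=1.0):
    """Bound sentiment scores to an assumed [low, high] range."""
    return [max(low, min(high, v)) for v in values]


train_returns = [0.01, -0.02, 0.03]
mean, std = fit_standardizer(train_returns)

price_inputs = standardize([0.02, -0.01], mean, std)    # new data, training statistics
volume_inputs = log_scale([0, 1_000_000, 25_000_000])
sentiment_inputs = clip([-1.4, 0.2, 2.0])
print(price_inputs, volume_inputs, sentiment_inputs)
```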

In [None]:
class MultiInputTradingNetwork(nn.Module):
    """
    Multi-input architecture combining price, volume, and sentiment data.
    
    Architecture:
    - Price Branch: Processes technical indicators
    - Volume Branch: Processes volume-based features
    - Sentiment Branch: Processes sentiment scores
    - Fusion Layer: Combines all branches for final prediction
    """
    
    def __init__(self, price_dim, volume_dim, sentiment_dim,
                 hidden_dim=64, output_dim=1, dropout=0.3):
        super(MultiInputTradingNetwork, self).__init__()
        
        # ===== PRICE BRANCH =====
        # Deeper network for complex price patterns
        self.price_branch = nn.Sequential(
            nn.Linear(price_dim, hidden_dim),
            nn.BatchNorm1d(hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, hidden_dim),
            nn.BatchNorm1d(hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, hidden_dim // 2)
        )
        
        # ===== VOLUME BRANCH =====
        # Simpler network for volume features
        self.volume_branch = nn.Sequential(
            nn.Linear(volume_dim, hidden_dim // 2),
            nn.BatchNorm1d(hidden_dim // 2),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim // 2, hidden_dim // 4)
        )
        
        # ===== SENTIMENT BRANCH =====
        # Simple processing for sentiment scores
        self.sentiment_branch = nn.Sequential(
            nn.Linear(sentiment_dim, hidden_dim // 4),
            nn.BatchNorm1d(hidden_dim // 4),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim // 4, hidden_dim // 4)
        )
        
        # ===== FUSION LAYERS =====
        # Combine all branches
        fusion_dim = hidden_dim // 2 + hidden_dim // 4 + hidden_dim // 4
        
        self.fusion = nn.Sequential(
            nn.Linear(fusion_dim, hidden_dim),
            nn.BatchNorm1d(hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, hidden_dim // 2),
            nn.ReLU(),
            nn.Linear(hidden_dim // 2, output_dim)
        )
        
        # Attention-based weighting for branches (optional)
        self.branch_attention = nn.Sequential(
            nn.Linear(fusion_dim, 3),
            nn.Softmax(dim=1)
        )
    
    def forward(self, price_features, volume_features, sentiment_features,
                use_attention=False):
        """
        Forward pass through multi-input network.
        
        Parameters:
        -----------
        price_features : tensor - Price-based features
        volume_features : tensor - Volume-based features
        sentiment_features : tensor - Sentiment scores
        use_attention : bool - Whether to use attention weighting
        """
        # Process each branch
        price_out = self.price_branch(price_features)
        volume_out = self.volume_branch(volume_features)
        sentiment_out = self.sentiment_branch(sentiment_features)
        
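        # Concatenate branch outputs
        combined = torch.cat([price_out, volume_out, sentiment_out], dim=1)
        
        # Optionally reweight each branch with learned, input-dependent weights
        # (one simple way to use the branch_attention module defined above)
        if use_attention:
            weights = self.branch_attention(combined)  # (batch_size, 3), rows sum to 1
            combined = torch.cat([
                price_out * weights[:, 0:1],
                volume_out * weights[:, 1:2],
                sentiment_out * weights[:, 2:3]
            ], dim=1)
        
        # Apply fusion
        output = self.fusion(combined)
        
        return output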


# Create and visualize multi-input model
multi_input_model = MultiInputTradingNetwork(
    price_dim=12,      # 12 price features
    volume_dim=4,      # 4 volume features
    sentiment_dim=3,   # 3 sentiment features
    hidden_dim=64
)

print("Multi-Input Trading Network:")
print("=" * 50)
print(f"Price branch input: 12 features")
print(f"Volume branch input: 4 features")
print(f"Sentiment branch input: 3 features")
print(f"Total parameters: {sum(p.numel() for p in multi_input_model.parameters()):,}")

# Test forward pass
batch_size = 32
test_price = torch.randn(batch_size, 12)
test_volume = torch.randn(batch_size, 4)
test_sentiment = torch.randn(batch_size, 3)

output = multi_input_model(test_price, test_volume, test_sentiment)
print(f"\nOutput shape: {output.shape}")

---

## 8. AutoML and Neural Architecture Search (NAS) Basics

### Why AutoML for Trading?

**Challenges in Manual Architecture Design:**
- Exponential search space (depth × width × activations × ...)
- Non-stationary market requires periodic re-optimization
- Human bias towards certain architectures

**AutoML Approaches:**
1. **Grid Search**: Exhaustive but expensive
2. **Random Search**: Often outperforms grid search
3. **Bayesian Optimization**: Efficient exploration
4. **Neural Architecture Search (NAS)**: Learn to design networks
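
To see why grid search becomes expensive, the sketch below enumerates an illustrative six-hyperparameter search space and compares it with a fixed random-search budget. The budget of 20 trials and the helper name `sample_config` are assumptions made for this example.

```python
import itertools
import random

# Illustrative search space: six hyperparameters with a handful of options each
search_space = {
    "num_layers": [2, 3, 4, 5],
    "hidden_dim": [32, 64, 128, 256],
    "dropout": [0.1, 0.2, 0.3, 0.4, 0.5],
    "activation": ["relu", "leaky_relu", "elu", "gelu"],
    "batch_norm": [True, False],
    "learning_rate": [1e-4, 5e-4, 1e-3, 5e-3],
}

# Grid search trains one model per element of the Cartesian product
grid = list(itertools.product(*search_space.values()))
print(f"Grid search configurations: {len(grid):,}")  # 2,560


def sample_config(rng):
    """Draw one configuration uniformly at random from the search space."""
    return {name: rng.choice(options) for name, options in search_space.items()}


# Random search trains only a fixed budget of sampled configurations
rng = random.Random(42)
random_trials = [sample_config(rng) for _ in range(20)]
print(f"Random search configurations: {len(random_trials)}")
```

Even this small space implies 2,560 training runs for an exhaustive grid, whereas random search lets the budget be set independently of the number of hyperparameters.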

### European Regulatory Considerations 🇪🇺

Under MiFID II and incoming AI regulations:
- AutoML decisions must be documented
- Search process should be reproducible
- Selected architecture needs justification
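
One possible way to make a search documented and reproducible is to record the seed, the search space, and every sampled configuration, then fingerprint the record. The sketch below is an assumption about how such a log could look, not a format prescribed by any regulation; `run_logged_search` and the small search space are hypothetical.

```python
import hashlib
import json
import random


def run_logged_search(search_space, n_trials, seed):
    """Sample configurations with a seeded RNG and return a JSON record plus its hash."""
    rng = random.Random(seed)
    trials = []
    for trial_id in range(n_trials):
        # Iterate in sorted key order so sampling does not depend on dict ordering
        config = {name: rng.choice(options)
                  for name, options in sorted(search_space.items())}
        trials.append({"trial": trial_id, "config": config})

    record = {"seed": seed, "search_space": search_space, "trials": trials}
    payload = json.dumps(record, sort_keys=True)
    fingerprint = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return payload, fingerprint


space = {"hidden_dim": [32, 64, 128], "dropout": [0.1, 0.3, 0.5]}
payload_a, hash_a = run_logged_search(space, n_trials=5, seed=7)
payload_b, hash_b = run_logged_search(space, n_trials=5, seed=7)
print(hash_a == hash_b)  # True: same seed and search space give an identical record
```

Storing the JSON payload alongside the selected model gives a reviewer the exact sequence of candidates considered, and the hash makes later tampering or drift easy to detect.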

In [None]:
class SimpleNAS:
    """
    Simple Neural Architecture Search for trading networks.
    
    Uses random search with early stopping for efficiency.
    In practice, consider using libraries like:
    - Optuna
    - Ray Tune
    - NNI (Neural Network Intelligence)
    """
    
    def __init__(self, input_dim, search_space=None, random_state=42):
        self.input_dim = input_dim
        self.random_state = random_state
        np.random.seed(random_state)
        
        # Define search space
        self.search_space = search_space or {
            'num_layers': [2, 3, 4, 5],
            'hidden_dim': [32, 64, 128, 256],
            'dropout': [0.1, 0.2, 0.3, 0.4, 0.5],
            'activation': ['relu', 'leaky_relu', 'elu', 'gelu'],
            'batch_norm': [True, False],
            'learning_rate': [1e-4, 5e-4, 1e-3, 5e-3]
        }
        
        self.results = []
    
    def sample_architecture(self):
        """Sample a random architecture from search space."""
        config = {}
        
        # Sample hyperparameters, casting NumPy scalars to native Python types
        # so that configs print and serialize cleanly
        num_layers = int(np.random.choice(self.search_space['num_layers']))
        hidden_dim = int(np.random.choice(self.search_space['hidden_dim']))
        
        # Create hidden dimensions (same width for every layer)
        config['hidden_dims'] = [hidden_dim] * num_layers
        config['dropout'] = float(np.random.choice(self.search_space['dropout']))
        config['activation'] = str(np.random.choice(self.search_space['activation']))
        config['batch_norm'] = bool(np.random.choice(self.search_space['batch_norm']))
        config['learning_rate'] = float(np.random.choice(self.search_space['learning_rate']))
        
        return config
    
    def build_model(self, config):
        """Build model from configuration."""
        return ConfigurableMLP(
            input_dim=self.input_dim,
            hidden_dims=config['hidden_dims'],
            dropout=config['dropout'],
            activation=config['activation'],
            batch_norm=config['batch_norm']
        )
    
    def evaluate_architecture(self, model, train_loader, val_loader, 
                             config, num_epochs=10):
        """Train and evaluate architecture."""
        device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        model = model.to(device)
        
        # Setup training
        criterion = nn.BCEWithLogitsLoss()
        optimizer = optim.Adam(model.parameters(), lr=config['learning_rate'])
        
        # Training loop with early stopping
        best_val_loss = float('inf')
        patience = 3
        patience_counter = 0
        
        for epoch in range(num_epochs):
            # Training
            model.train()
            for X_batch, y_batch in train_loader:
                X_batch, y_batch = X_batch.to(device), y_batch.to(device)
                
                optimizer.zero_grad()
                outputs = model(X_batch)
                # squeeze(-1) drops only the output dimension, so a batch of size 1 keeps its shape
                loss = criterion(outputs.squeeze(-1), y_batch.float())
                loss.backward()
                optimizer.step()
            
            # Validation
            model.eval()
            val_loss = 0
            with torch.no_grad():
                for X_batch, y_batch in val_loader:
                    X_batch, y_batch = X_batch.to(device), y_batch.to(device)
                    outputs = model(X_batch)
                    val_loss += criterion(outputs.squeeze(-1), y_batch.float()).item()
            
            val_loss /= len(val_loader)
            
            # Early stopping
            if val_loss < best_val_loss:
                best_val_loss = val_loss
                patience_counter = 0
            else:
                patience_counter += 1
                if patience_counter >= patience:
                    break
        
        return best_val_loss
    
    def search(self, train_loader, val_loader, n_trials=10):
        """Run architecture search."""
        print(f"Running NAS with {n_trials} trials...")
        print("=" * 50)
        
        for trial in range(n_trials):
            # Sample architecture
            config = self.sample_architecture()
            model = self.build_model(config)
            
            # Evaluate
            val_loss = self.evaluate_architecture(
                model, train_loader, val_loader, config
            )
            
            # Store results
            result = {
                'config': config,
                'val_loss': val_loss,
                'num_params': model.count_parameters()
            }
            self.results.append(result)
            
            print(f"Trial {trial+1}: Val Loss = {val_loss:.4f}, "
                  f"Layers = {len(config['hidden_dims'])}, "
                  f"Width = {config['hidden_dims'][0]}")
        
        # Find best architecture
        best_result = min(self.results, key=lambda x: x['val_loss'])
        return best_result


# Demonstrate NAS setup (sampling configurations only; no training is run here)
print("Neural Architecture Search Setup:")
print("=" * 50)

nas = SimpleNAS(input_dim=20)

# Show sample architectures
print("\nSample architectures from search space:")
for i in range(3):
    config = nas.sample_architecture()
    print(f"  Config {i+1}: {config['hidden_dims']}, "
          f"dropout={config['dropout']}, act={config['activation']}")

---

## 9. Practical: Design Optimal Architecture for Factor Model

Now we'll put everything together to design an optimal neural network architecture for a multi-factor trading model.

### Factor Model Objectives
1. Predict next-day returns direction
2. Combine multiple feature types (price, volume, categorical)
3. Handle multiple assets simultaneously
4. Include European market considerations
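
The first objective depends on a label that uses only information available at prediction time. The sketch below shows one way such a next-day direction label could be built; `make_direction_target` and the toy price series are hypothetical, and the notebook's own feature-engineering cell remains the authoritative definition of `target`.

```python
import numpy as np
import pandas as pd

def make_direction_target(close: pd.Series) -> pd.Series:
    """Label each day 1.0 if the next close is higher, 0.0 otherwise."""
    # Return realized on the following day, aligned to today's row
    next_ret = close.shift(-1) / close - 1
    # Keep NaN where no next-day return exists (the final row)
    return (next_ret > 0).astype(float).where(next_ret.notna())

# Toy prices for illustration only
prices = pd.Series(
    [100.0, 101.0, 100.5, 102.0],
    index=pd.date_range("2024-01-01", periods=4, freq="B"),
)
target = make_direction_target(prices)
print(target)
```

The last row has no label and should be dropped before training, which is what the `dropna()` call in the data preparation step would do for a column built this way.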

In [None]:
# Prepare data for factor model
def prepare_factor_model_data(features_dict, tickers_info):
    """
    Prepare data for multi-asset factor model training.
    
    Combines features from all assets into a unified dataset.
    """
    all_data = []
    
    # Encode categorical features
    all_sectors = list(set(info['sector'] for info in tickers_info.values()))
    all_assets = list(set(info['asset_class'] for info in tickers_info.values()))
    all_regions = list(set(info['region'] for info in tickers_info.values()))
    
    sector_map = {s: i for i, s in enumerate(all_sectors)}
    asset_map = {a: i for i, a in enumerate(all_assets)}
    region_map = {r: i for i, r in enumerate(all_regions)}
    
    for ticker, features in features_dict.items():
        if ticker not in tickers_info:
            continue
            
        # Get clean data (drop NaN)
        clean_features = features.dropna()
        
        if len(clean_features) < 100:
            continue
        
        # Separate numerical and categorical
        numerical_cols = [col for col in clean_features.columns 
                         if col not in ['sector', 'asset_class', 'region', 'target']]
        
        for idx in clean_features.index:
            row = clean_features.loc[idx]
            
            # Get numerical features
            numerical = row[numerical_cols].values.astype(np.float32)
            
            # Get categorical indices
            sector_idx = sector_map.get(row['sector'], 0)
            asset_idx = asset_map.get(row['asset_class'], 0)
            region_idx = region_map.get(row['region'], 0)
            
            # Get target
            target = row['target']
            
            all_data.append({
                'numerical': numerical,
                'sector_idx': sector_idx,
                'asset_idx': asset_idx,
                'region_idx': region_idx,
                'target': target,
                'ticker': ticker,
                'date': idx
            })
    
    return all_data, sector_map, asset_map, region_map


# Prepare data
print("Preparing factor model data...")
all_data, sector_map, asset_map, region_map = prepare_factor_model_data(
    features_dict, tickers
)

print(f"Total samples: {len(all_data):,}")
print(f"Sectors: {sector_map}")
print(f"Asset classes: {asset_map}")
print(f"Regions: {region_map}")

In [None]:
class FactorModelDataset(Dataset):
    """
    PyTorch Dataset for factor model training.
    
    Handles numerical and categorical features separately.
    """
    
    def __init__(self, data, scaler=None, fit_scaler=False):
        self.data = data
        
        # Extract numerical features for scaling
        numerical_features = np.array([d['numerical'] for d in data])
        
        # Handle scaling
        if fit_scaler:
            self.scaler = StandardScaler()
            self.numerical = self.scaler.fit_transform(numerical_features)
        elif scaler is not None:
            self.scaler = scaler
            self.numerical = self.scaler.transform(numerical_features)
        else:
            self.scaler = None
            self.numerical = numerical_features
        
        # Extract other data
        self.sector_idx = np.array([d['sector_idx'] for d in data])
        self.asset_idx = np.array([d['asset_idx'] for d in data])
        self.region_idx = np.array([d['region_idx'] for d in data])
        self.targets = np.array([d['target'] for d in data])
    
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        return (
            torch.FloatTensor(self.numerical[idx]),
            torch.LongTensor([self.sector_idx[idx]]),
            torch.LongTensor([self.asset_idx[idx]]),
            torch.LongTensor([self.region_idx[idx]]),
            torch.FloatTensor([self.targets[idx]])
        )


# Time-series aware split (no look-ahead bias)
# Sort data by date first
all_data_sorted = sorted(all_data, key=lambda x: x['date'])

# Split: 70% train, 15% validation, 15% test
n = len(all_data_sorted)
train_idx = int(0.7 * n)
val_idx = int(0.85 * n)

train_data = all_data_sorted[:train_idx]
val_data = all_data_sorted[train_idx:val_idx]
test_data = all_data_sorted[val_idx:]

print(f"Train samples: {len(train_data):,}")
print(f"Validation samples: {len(val_data):,}")
print(f"Test samples: {len(test_data):,}")

# Create datasets
train_dataset = FactorModelDataset(train_data, fit_scaler=True)
val_dataset = FactorModelDataset(val_data, scaler=train_dataset.scaler)
test_dataset = FactorModelDataset(test_data, scaler=train_dataset.scaler)

# Create data loaders
batch_size = 64
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

In [None]:
class OptimalFactorModel(nn.Module):
    """
    Optimal neural network architecture for factor model.
    
    Combines:
    - Embedding layers for categorical features
    - Residual connections for better gradient flow
    - A single projection over concatenated numerical and embedded inputs
    - A sigmoid gate ("attention") that reweights hidden units
    """
    
    def __init__(self, num_numerical, num_sectors, num_assets, num_regions,
                 hidden_dim=128, embed_dim=8, num_res_blocks=3, dropout=0.3):
        super(OptimalFactorModel, self).__init__()
        
        # ===== EMBEDDING LAYERS =====
        self.sector_embedding = nn.Embedding(num_sectors, embed_dim)
        self.asset_embedding = nn.Embedding(num_assets, embed_dim)
        self.region_embedding = nn.Embedding(num_regions, embed_dim)
        
        # ===== INPUT PROJECTION =====
        total_input_dim = num_numerical + 3 * embed_dim
        self.input_projection = nn.Sequential(
            nn.Linear(total_input_dim, hidden_dim),
            nn.BatchNorm1d(hidden_dim),
            nn.GELU(),  # Smooth alternative to ReLU; compare both on validation data
            nn.Dropout(dropout)
        )
        
        # ===== RESIDUAL BLOCKS =====
        self.res_blocks = nn.ModuleList([
            ResidualBlock(hidden_dim, dropout) for _ in range(num_res_blocks)
        ])
        
        # ===== FEATURE IMPORTANCE ATTENTION =====
        self.attention = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim // 4),
            nn.ReLU(),
            nn.Linear(hidden_dim // 4, hidden_dim),
            nn.Sigmoid()
        )
        
        # ===== OUTPUT LAYERS =====
        self.output_layers = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim // 2),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim // 2, 1)
        )
    
    def forward(self, numerical, sector_idx, asset_idx, region_idx):
        # Get embeddings (squeeze extra dimension)
        sector_emb = self.sector_embedding(sector_idx.squeeze(1))
        asset_emb = self.asset_embedding(asset_idx.squeeze(1))
        region_emb = self.region_embedding(region_idx.squeeze(1))
        
        # Concatenate all features
        x = torch.cat([numerical, sector_emb, asset_emb, region_emb], dim=1)
        
        # Project to hidden space
        x = self.input_projection(x)
        
        # Pass through residual blocks
        for block in self.res_blocks:
            x = block(x)
        
        # Apply attention
        attention_weights = self.attention(x)
        x = x * attention_weights
        
        # Generate output
        return self.output_layers(x)


# Get numerical feature dimension
num_numerical = train_dataset.numerical.shape[1]

# Create model
model = OptimalFactorModel(
    num_numerical=num_numerical,
    num_sectors=len(sector_map),
    num_assets=len(asset_map),
    num_regions=len(region_map),
    hidden_dim=128,
    embed_dim=8,
    num_res_blocks=3,
    dropout=0.3
).to(device)

print("Optimal Factor Model Architecture:")
print("=" * 50)
print(model)
print(f"\nTotal parameters: {sum(p.numel() for p in model.parameters()):,}")

In [None]:
def train_factor_model(model, train_loader, val_loader, num_epochs=50, 
                       lr=1e-3, patience=10):
    """
    Train the factor model with best practices.
    
    Includes:
    - Learning rate scheduling
    - Early stopping
    - Gradient clipping
    - Training history tracking
    """
    # Binary cross-entropy on logits (unweighted; pass pos_weight if classes are imbalanced)
    criterion = nn.BCEWithLogitsLoss()
    
    # Optimizer with weight decay (L2 regularization)
    optimizer = optim.AdamW(model.parameters(), lr=lr, weight_decay=1e-5)
    
    # Learning rate scheduler
    scheduler = optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode='min', factor=0.5, patience=5
    )
    
    # Training history
    history = {'train_loss': [], 'val_loss': [], 'val_acc': []}
    
    # Early stopping
    best_val_loss = float('inf')
    best_model_state = None
    patience_counter = 0
    
    print("Starting training...")
    print("=" * 60)
    
    for epoch in range(num_epochs):
        # ===== TRAINING =====
        model.train()
        train_loss = 0
        
        for batch in train_loader:
            numerical, sector, asset, region, target = [
                b.to(device) for b in batch
            ]
            
            optimizer.zero_grad()
            
            # Forward pass
            output = model(numerical, sector, asset, region)
            loss = criterion(output, target)
            
            # Backward pass with gradient clipping
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            optimizer.step()
            
            train_loss += loss.item()
        
        train_loss /= len(train_loader)
        
        # ===== VALIDATION =====
        model.eval()
        val_loss = 0
        correct = 0
        total = 0
        
        with torch.no_grad():
            for batch in val_loader:
                numerical, sector, asset, region, target = [
                    b.to(device) for b in batch
                ]
                
                output = model(numerical, sector, asset, region)
                val_loss += criterion(output, target).item()
                
                # Calculate accuracy
                predictions = (torch.sigmoid(output) > 0.5).float()
                correct += (predictions == target).sum().item()
                total += target.size(0)
        
        val_loss /= len(val_loader)
        val_acc = correct / total
        
        # Update history
        history['train_loss'].append(train_loss)
        history['val_loss'].append(val_loss)
        history['val_acc'].append(val_acc)
        
        # Learning rate scheduling
        scheduler.step(val_loss)
        
        # Print progress
        if (epoch + 1) % 5 == 0 or epoch == 0:
            print(f"Epoch {epoch+1:3d}: Train Loss = {train_loss:.4f}, "
                  f"Val Loss = {val_loss:.4f}, Val Acc = {val_acc:.4f}")
        
        # Early stopping
        if val_loss < best_val_loss:
            best_val_loss = val_loss
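            # Clone tensors: state_dict().copy() is shallow and would track later updates
            best_model_state = {k: v.detach().clone() for k, v in model.state_dict().items()}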
            patience_counter = 0
        else:
            patience_counter += 1
            if patience_counter >= patience:
                print(f"\nEarly stopping at epoch {epoch+1}")
                break
    
    # Restore best model
    if best_model_state is not None:
        model.load_state_dict(best_model_state)
    
    return model, history


# Train the model
trained_model, history = train_factor_model(
    model, train_loader, val_loader,
    num_epochs=50, lr=1e-3, patience=10
)

In [None]:
# Plot training history
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Loss plot
axes[0].plot(history['train_loss'], label='Train Loss', linewidth=2)
axes[0].plot(history['val_loss'], label='Validation Loss', linewidth=2)
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Loss')
axes[0].set_title('Training and Validation Loss')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Accuracy plot
axes[1].plot(history['val_acc'], label='Validation Accuracy', 
             linewidth=2, color='green')
axes[1].axhline(y=0.5, color='r', linestyle='--', label='Random Baseline')
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('Accuracy')
axes[1].set_title('Validation Accuracy')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
def evaluate_trading_strategy(model, test_loader, test_data):
    """
    Evaluate the model as a trading strategy.
    
    Calculates:
    - Accuracy metrics
    - Trading performance (returns, Sharpe ratio)
    - Comparison to buy-and-hold
    """
    model.eval()
    
    all_predictions = []
    all_targets = []
    all_probs = []
    
    with torch.no_grad():
        for batch in test_loader:
            numerical, sector, asset, region, target = [
                b.to(device) for b in batch
            ]
            
            output = model(numerical, sector, asset, region)
            probs = torch.sigmoid(output)
            predictions = (probs > 0.5).float()
            
            all_predictions.extend(predictions.cpu().numpy().flatten())
            all_targets.extend(target.cpu().numpy().flatten())
            all_probs.extend(probs.cpu().numpy().flatten())
    
    # Classification metrics
    accuracy = accuracy_score(all_targets, all_predictions)
    
    print("Model Evaluation Results:")
    print("=" * 50)
    print(f"\nClassification Accuracy: {accuracy:.4f}")
    print(f"\nClassification Report:")
    print(classification_report(all_targets, all_predictions, 
                                target_names=['Down', 'Up']))
    
    # Trading performance
    # Convert predictions to positions: 1 = long, -1 = short (based on probability)
    positions = np.array([1 if p > 0.5 else -1 for p in all_probs])
    
    # Get actual returns from test data
    actual_returns = []
    for i, d in enumerate(test_data[:len(all_probs)]):
        # Placeholder: all_data stores only the direction label, so assume a fixed
        # +/-1% move per sample; substitute realized next-day returns for a real backtest
        ret = 0.01 if d['target'] == 1 else -0.01
        actual_returns.append(ret)
    
    actual_returns = np.array(actual_returns)
    
    # Strategy returns
    strategy_returns = positions * actual_returns
    
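    # Calculate performance metrics
    # Note: sqrt(252) annualization assumes one observation per trading day;
    # with pooled multi-asset samples the Sharpe ratio is indicative only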
    cumulative_return = (1 + strategy_returns).cumprod()[-1] - 1
    sharpe_ratio = np.sqrt(252) * strategy_returns.mean() / (strategy_returns.std() + 1e-8)
    
    # Buy and hold baseline
    bh_return = (1 + actual_returns).cumprod()[-1] - 1
    bh_sharpe = np.sqrt(252) * actual_returns.mean() / (actual_returns.std() + 1e-8)
    
    print("\nTrading Strategy Performance:")
    print("-" * 50)
    print(f"Strategy Cumulative Return: {cumulative_return*100:.2f}%")
    print(f"Strategy Sharpe Ratio: {sharpe_ratio:.2f}")
    print(f"\nBuy-and-Hold Cumulative Return: {bh_return*100:.2f}%")
    print(f"Buy-and-Hold Sharpe Ratio: {bh_sharpe:.2f}")
    print(f"\nExcess Return vs B&H: {(cumulative_return - bh_return)*100:.2f}%")
    
    return {
        'accuracy': accuracy,
        'predictions': all_predictions,
        'probabilities': all_probs,
        'strategy_returns': strategy_returns,
        'cumulative_return': cumulative_return,
        'sharpe_ratio': sharpe_ratio
    }


# Evaluate the model
results = evaluate_trading_strategy(trained_model, test_loader, test_data)

In [None]:
# Plot cumulative returns
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Cumulative returns
cumulative_strategy = (1 + results['strategy_returns']).cumprod()
axes[0].plot(cumulative_strategy, label='Neural Network Strategy', linewidth=2)
axes[0].axhline(y=1, color='r', linestyle='--', alpha=0.5, label='Starting Capital')
axes[0].set_xlabel('Trade Number')
axes[0].set_ylabel('Cumulative Return')
axes[0].set_title('Strategy Cumulative Performance')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Prediction probability distribution
axes[1].hist(results['probabilities'], bins=50, edgecolor='black', alpha=0.7)
axes[1].axvline(x=0.5, color='r', linestyle='--', label='Decision Threshold')
axes[1].set_xlabel('Predicted Probability (Up)')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Distribution of Prediction Probabilities')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

---

## 10. Key Takeaways

### Architecture Design Principles for Trading

1. **Start Simple, Scale Carefully**
   - Begin with shallow networks (2-3 layers)
   - Add complexity only when justified by validation performance
   - More parameters = higher overfitting risk with limited data

2. **Use Residual Connections for Deeper Networks**
   - Skip connections help preserve weak financial signals
   - Enable training of deeper architectures
   - Provide implicit regularization

3. **Embeddings for Categorical Features**
   - Learn relationships between sectors/assets
   - More parameter-efficient than one-hot encoding when cardinality is high
   - Enables transfer learning across assets

4. **Multi-Input Architectures**
   - Combine different data types (price, volume, sentiment)
   - Each branch can specialize in its data modality
   - Attention mechanisms can weight branch importance

5. **AutoML Considerations**
   - Use for systematic architecture exploration
   - Document search process for compliance
   - Prefer Bayesian optimization over grid search
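
To make the embedding point concrete, the following sketch counts the parameters needed to feed categorical features into a first hidden layer, once with one-hot inputs and once with embeddings. The function names and the cardinalities (11 sectors, 4 asset classes, 3 regions, 500 tickers) are hypothetical; `embed_dim=8` and `hidden_dim=128` mirror the values used for the factor model.

```python
def first_layer_params_one_hot(num_numerical, cardinalities, hidden_dim):
    """Weights + biases of a Linear layer fed with one-hot encoded categories."""
    in_dim = num_numerical + sum(cardinalities)
    return in_dim * hidden_dim + hidden_dim

def first_layer_params_embedding(num_numerical, cardinalities, embed_dim, hidden_dim):
    """Embedding tables plus a Linear layer fed with the concatenated embeddings."""
    tables = sum(c * embed_dim for c in cardinalities)
    in_dim = num_numerical + len(cardinalities) * embed_dim
    return tables + in_dim * hidden_dim + hidden_dim

# Low-cardinality case: sector, asset class, region (assumed sizes)
low_one_hot = first_layer_params_one_hot(20, [11, 4, 3], 128)
low_embed = first_layer_params_embedding(20, [11, 4, 3], 8, 128)

# High-cardinality case: a per-ticker feature with 500 categories (assumed size)
high_one_hot = first_layer_params_one_hot(20, [500], 128)
high_embed = first_layer_params_embedding(20, [500], 8, 128)

print(low_one_hot, low_embed)    # 4992 5904
print(high_one_hot, high_embed)  # 66688 7712
```

Under these assumptions, embeddings cost slightly more parameters for the three small categorical features but far fewer for the 500-category feature, so the efficiency argument applies mainly to high-cardinality inputs.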

### European Market Considerations 🇪🇺

- Model decisions must be explainable (MiFID II)
- Document architecture choices and rationale
- Consider multi-currency handling in architecture
- Account for different trading hours and holidays

---

## 11. Exercises

### Exercise 1: Architecture Comparison
Implement and compare these architectures on the factor model:
- Wide-shallow (2 layers, 256 neurons)
- Deep-narrow (6 layers, 64 neurons)
- Pyramid (256 → 128 → 64 → 32)
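
As a possible starting point for this exercise, the sketch below builds a plain MLP from a list of layer widths and compares parameter counts. `build_mlp`, `count_parameters`, and `input_dim=20` are assumptions for illustration; the helper handles numerical inputs only and ignores the embedding branch of the factor model.

```python
import torch.nn as nn

def build_mlp(input_dim, hidden_dims, dropout=0.3):
    """Stack Linear -> ReLU -> Dropout for each width, then a single-logit head."""
    layers = []
    prev = input_dim
    for width in hidden_dims:
        layers += [nn.Linear(prev, width), nn.ReLU(), nn.Dropout(dropout)]
        prev = width
    layers.append(nn.Linear(prev, 1))  # one logit for up/down classification
    return nn.Sequential(*layers)

def count_parameters(model):
    """Number of trainable parameters."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

candidates = {
    'wide_shallow': [256, 256],
    'deep_narrow': [64] * 6,
    'pyramid': [256, 128, 64, 32],
}
param_counts = {
    name: count_parameters(build_mlp(20, dims))
    for name, dims in candidates.items()
}
for name, n_params in param_counts.items():
    print(f"{name:>12}: {n_params:,} parameters")
```

Comparing validation loss per parameter, rather than validation loss alone, is one way to keep the comparison fair across such different capacities.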

### Exercise 2: Custom Embeddings
Add additional categorical features:
- Market cap bucket (Small, Mid, Large)
- Volatility regime (Low, Medium, High)
- Day of week
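
One way to derive two of these features as integer indices suitable for `nn.Embedding` is sketched below. The synthetic returns, the 20-day window, and the column names `vol_regime_idx` and `dow_idx` are all assumptions; the regime thresholds are estimated on the first 70% of the sample only, so that later rows do not influence their own labels.

```python
import numpy as np
import pandas as pd

# Synthetic daily returns for illustration only
rng = np.random.default_rng(0)
dates = pd.date_range("2024-01-01", periods=120, freq="B")
returns = pd.Series(rng.normal(0, 0.01, len(dates)), index=dates)

# Rolling volatility; the first 19 rows are NaN during warm-up
rolling_vol = returns.rolling(20).std()

# Tercile thresholds from the training window only (avoids look-ahead)
train_vol = rolling_vol.iloc[: int(0.7 * len(rolling_vol))].dropna()
low_cut, high_cut = train_vol.quantile([1 / 3, 2 / 3])

# 0 = Low, 1 = Medium, 2 = High
vol_regime = pd.cut(
    rolling_vol, bins=[-np.inf, low_cut, high_cut, np.inf], labels=False
)
# 0 = Monday ... 4 = Friday on a business-day index
day_of_week = pd.Series(dates.dayofweek, index=dates)

features = (
    pd.DataFrame({'vol_regime_idx': vol_regime, 'dow_idx': day_of_week})
    .dropna()
    .astype(int)
)
print(features.head())
```

Each new index column would then need its own embedding table, with `num_embeddings` set to 3 for the regime and 5 for the weekday under these assumptions.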

### Exercise 3: Attention Analysis
Extract and visualize the attention weights from the model:
- Which features does the model focus on?
- Does attention vary by sector or region?
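
A forward hook is one way to capture the gate values without modifying the model. The sketch below uses a toy stand-in, `GatedHead` (hypothetical), whose `attention` block has the same layout as the one in `OptimalFactorModel`; the same hook could be registered on `trained_model.attention`.

```python
import torch
import torch.nn as nn

class GatedHead(nn.Module):
    """Toy stand-in with the same sigmoid gate layout as the model's attention block."""

    def __init__(self, hidden_dim=16):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim // 4),
            nn.ReLU(),
            nn.Linear(hidden_dim // 4, hidden_dim),
            nn.Sigmoid(),
        )
        self.out = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        return self.out(x * self.attention(x))

captured = []

def save_gate(module, inputs, output):
    # Store a detached CPU copy of the gate for each forward pass
    captured.append(output.detach().cpu())

toy = GatedHead().eval()
handle = toy.attention.register_forward_hook(save_gate)
with torch.no_grad():
    toy(torch.randn(32, 16))
handle.remove()  # always remove hooks once the values are collected

gate = torch.cat(captured)    # shape: (num_samples, hidden_dim)
mean_gate = gate.mean(dim=0)  # average gate value per hidden unit
print(mean_gate)
```

Note that the gate acts on hidden units rather than on raw input features, so attributing it to individual inputs requires an additional step such as gradient-based attribution. Grouping `gate` by each sample's sector or region index is a direct way to approach the second question.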

### Exercise 4: European-Specific Model
Build a model specifically for European equities:
- Add currency embedding
- Include European trading hours features
- Account for European regulatory requirements

In [None]:
# Exercise space - try implementing the exercises above!

# Example: Exercise 1 starter code
architectures_to_compare = {
    'wide_shallow': [256, 256],
    'deep_narrow': [64, 64, 64, 64, 64, 64],
    'pyramid': [256, 128, 64, 32]
}

# Your code here...
print("Ready for exercises!")

---

## References

1. **He et al. (2015)** - "Deep Residual Learning for Image Recognition" - Original ResNet paper
2. **Vaswani et al. (2017)** - "Attention Is All You Need" - Attention mechanisms
3. **López de Prado (2018)** - "Advances in Financial Machine Learning" - ML in finance
4. **Guo & Berkhahn (2016)** - "Entity Embeddings of Categorical Variables" - Embedding techniques
5. **Elsken et al. (2019)** - "Neural Architecture Search: A Survey" - NAS overview

---

**Next:** Day 6 - Training Best Practices for Neural Networks in Trading