# Day 6: Training Best Practices for Neural Networks in Trading

## Week 13 - Neural Networks in Quantitative Finance

---

## Learning Objectives

By the end of this notebook, you will master:

1. **Data Preprocessing & Standardization** - Proper scaling for financial data
2. **Train/Validation/Test Splits for Time Series** - Avoiding look-ahead bias
3. **Walk-Forward Training Paradigm** - Realistic backtesting methodology
4. **Mini-batch vs Full-batch Training** - Optimization for small financial datasets
5. **Handling Class Imbalance** - Dealing with rare trading signals
6. **Reproducibility** - Seeds, determinism, and experiment tracking
7. **GPU vs CPU Considerations** - Hardware optimization
8. **Practical: Full Training Pipeline** - Production-ready implementation

---

## Why Training Practices Matter in Finance

**Common Pitfalls in Financial ML:**
- Overfitting to historical data
- Look-ahead bias from improper data splits
- Non-reproducible results leading to false confidence
- Class imbalance causing models to ignore rare profitable signals

### European Market Considerations 🇪🇺

- **GDPR**: Data handling and storage requirements whenever a dataset contains personal data (for example client or account-level records)
- **MiFID II**: Algorithm testing and validation documentation
- **Model Governance**: Reproducibility is mandatory for regulatory audits
- **Cross-Border Trading**: Handle different market calendars properly

---

## 1. Environment Setup and Imports

In [1]:
# Core libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
import warnings
import os
import random
import json
import time
warnings.filterwarnings('ignore')

# PyTorch imports
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset, Dataset, WeightedRandomSampler

# Data acquisition
import yfinance as yf

# Scikit-learn utilities
from sklearn.preprocessing import StandardScaler, RobustScaler, MinMaxScaler
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

# Plotting settings
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 10

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

PyTorch version: 2.9.1
CUDA available: False


---

## 2. Reproducibility: Seeds and Determinism

### Why Reproducibility Matters

**Regulatory Requirements:**
- MiFID II requires audit trails for algorithmic trading
- Results must be reproducible for compliance reviews
- Model validation requires consistent behavior

**Practical Benefits:**
- Debug models reliably
- Compare experiments fairly
- Deploy with confidence

In [2]:
def set_all_seeds(seed: int = 42, deterministic: bool = True):
    """
    Set all random seeds for complete reproducibility.
    
    This function ensures reproducible results across:
    - NumPy operations
    - Python random module
    - PyTorch CPU operations
    - PyTorch GPU operations (if available)
    
    Parameters:
    -----------
    seed : int - The seed value to use
    deterministic : bool - If True, enforce deterministic algorithms
                          (may impact performance)
    
    Note: For 100% reproducibility on GPU, you may also need:
    export CUBLAS_WORKSPACE_CONFIG=:4096:8
    """
    # Python's built-in random
    random.seed(seed)
    
    # NumPy
    np.random.seed(seed)
    
    # PyTorch CPU
    torch.manual_seed(seed)
    
    # PyTorch GPU (all GPUs)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
    
    # Deterministic algorithms
    if deterministic:
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    else:
        # Allow cuDNN to find optimal algorithms (faster but non-deterministic)
        torch.backends.cudnn.deterministic = False
        torch.backends.cudnn.benchmark = True
    # Available since PyTorch 1.8; warn_only (PyTorch 1.11+) makes operations
    # without a deterministic implementation warn instead of raising an error
    torch.use_deterministic_algorithms(deterministic, warn_only=True)
    
    # PYTHONHASHSEED only takes effect when set before the interpreter starts.
    # Setting it here affects subprocesses launched from this process, not
    # hashing in the current one.
    os.environ['PYTHONHASHSEED'] = str(seed)
    
    return seed


# Set seeds for this notebook
SEED = set_all_seeds(42)
print(f"All seeds set to: {SEED}")
print(f"Deterministic mode: Enabled")

# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

All seeds set to: 42
Deterministic mode: Enabled
Using device: cpu


In [3]:
# Verify reproducibility
def test_reproducibility(seed=42):
    """
    Test that our seeding works correctly.
    
    Runs the same operation twice and verifies identical results.
    """
    results = []
    
    for run in range(2):
        set_all_seeds(seed)
        
        # NumPy random
        np_random = np.random.randn(5)
        
        # PyTorch random
        torch_random = torch.randn(5)
        
        # PyTorch model initialization
        model = nn.Linear(10, 5)
        weights = model.weight.data.clone()
        
        results.append({
            'np_random': np_random,
            'torch_random': torch_random.numpy(),
            'model_weights': weights.numpy()
        })
    
    # Verify all match
    print("Reproducibility Test Results:")
    print("=" * 40)
    
    np_match = np.allclose(results[0]['np_random'], results[1]['np_random'])
    torch_match = np.allclose(results[0]['torch_random'], results[1]['torch_random'])
    weights_match = np.allclose(results[0]['model_weights'], results[1]['model_weights'])
    
    print(f"NumPy random reproducible: {'✓' if np_match else '✗'}")
    print(f"PyTorch random reproducible: {'✓' if torch_match else '✗'}")
    print(f"Model initialization reproducible: {'✓' if weights_match else '✗'}")
    
    return all([np_match, torch_match, weights_match])

test_reproducibility()

Reproducibility Test Results:
NumPy random reproducible: ✓
PyTorch random reproducible: ✓
Model initialization reproducible: ✓


True
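
Seeding covers the random state, but an audit trail also needs a record of what was run. The sketch below is one possible shape for such a record, using only the standard library. The function names `build_run_record` and `fingerprint_rows` and the field names are illustrative assumptions, not part of any tracking tool.

```python
import hashlib
import json
import platform
import sys
from datetime import datetime, timezone


def fingerprint_rows(rows):
    """Hash an iterable of rows so that a changed input dataset is detectable."""
    digest = hashlib.sha256()
    for row in rows:
        digest.update(repr(row).encode("utf-8"))
    return digest.hexdigest()[:16]


def build_run_record(seed, hyperparams, data_fingerprint):
    """Collect the facts needed to rerun an experiment into one JSON-safe dict."""
    record = {
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "seed": seed,
        "hyperparams": hyperparams,
        "data_fingerprint": data_fingerprint,
        "python_version": sys.version.split()[0],
        "platform": platform.platform(),
    }
    # Hash only the fields that define the experiment, so two runs with the
    # same setup share an ID even if they start at different times.
    identity = json.dumps(
        {k: record[k] for k in ("seed", "hyperparams", "data_fingerprint")},
        sort_keys=True,
    )
    record["run_id"] = hashlib.sha256(identity.encode("utf-8")).hexdigest()[:12]
    return record


# Illustrative values only
rows = [("2024-01-02", 100.0), ("2024-01-03", 101.5)]
record = build_run_record(
    seed=42,
    hyperparams={"batch_size": 32, "lr": 1e-3},
    data_fingerprint=fingerprint_rows(rows),
)
print(json.dumps(record, indent=2))
```

Writing this dictionary to disk next to the model checkpoint ties each set of weights to the seed, settings, and data that produced it.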

---

## 3. Data Acquisition and Loading

In [4]:
# Download diverse market data
# Including European stocks for market considerations

tickers = [
    # US Equities
    'AAPL', 'MSFT', 'GOOGL', 'AMZN', 'META',
    'JPM', 'GS', 'BAC',
    # European Equities (traded on US exchanges or ADRs)
    'SAP', 'ASML', 'NVO',  # European tech/pharma
    # Market indices
    'SPY', 'QQQ', 'EWG',  # Germany ETF
]

# 5 years of data for robust training
end_date = datetime.now()
start_date = end_date - timedelta(days=5*365)

print(f"Downloading data from {start_date.date()} to {end_date.date()}")
print("-" * 50)

# Download all data
data = {}
for ticker in tickers:
    try:
        df = yf.download(ticker, start=start_date, end=end_date, progress=False)
        if len(df) > 500:  # At least 2 years of trading days
            data[ticker] = df
            print(f"✓ {ticker}: {len(df)} trading days")
        else:
            print(f"✗ {ticker}: Insufficient data")
    except Exception as e:
        print(f"✗ {ticker}: Error - {e}")

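# Create price DataFrame (using Close column)
# Recent yfinance releases return MultiIndex columns, so df['Close'] can be a
# one-column DataFrame; building a DataFrame from a dict of those raises
# "ValueError: If using all scalar values, you must pass an index" (the error
# in the output below). squeeze() reduces each one-column frame to a Series.
prices_df = pd.DataFrame({ticker: df['Close'].squeeze() for ticker, df in data.items()})
prices_df = prices_df.ffill().dropna()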

print(f"\nFinal dataset: {len(prices_df)} days, {len(prices_df.columns)} assets")

Downloading data from 2021-01-24 to 2026-01-23
--------------------------------------------------
✓ AAPL: 1255 trading days
✓ MSFT: 1255 trading days
✓ GOOGL: 1255 trading days
✓ AMZN: 1255 trading days
✓ META: 1255 trading days
✓ JPM: 1255 trading days
✓ GS: 1255 trading days
✓ BAC: 1255 trading days
✓ SAP: 1255 trading days
✓ ASML: 1255 trading days
✓ NVO: 1255 trading days
✓ SPY: 1255 trading days
✓ QQQ: 1255 trading days
✓ EWG: 1255 trading days


ValueError: If using all scalar values, you must pass an index

---

## 4. Data Preprocessing and Standardization

### Why Preprocessing Matters for Neural Networks

**Problem without preprocessing:**
- Features on different scales dominate gradient updates
- Slow convergence or failure to converge
- Numerical instability

**Common Scaling Methods:**

| Method | Formula | Best For |
|--------|---------|----------|
| StandardScaler | (x - μ) / σ | Gaussian-like distributions |
| RobustScaler | (x - median) / IQR | Outlier-heavy data |
| MinMaxScaler | (x - min) / (max - min) | Bounded features |
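
To make the three formulas concrete, here is a small hand computation on a made-up column with one extreme value. It uses plain NumPy rather than the scikit-learn classes, and the numbers are illustrative only.

```python
import numpy as np

# Illustrative feature column with one extreme value
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# StandardScaler: (x - mean) / std, using the population std (ddof=0)
standard = (x - x.mean()) / x.std()

# RobustScaler: (x - median) / IQR
q25, q75 = np.percentile(x, [25, 75])
robust = (x - np.median(x)) / (q75 - q25)

# MinMaxScaler: (x - min) / (max - min)
minmax = (x - x.min()) / (x.max() - x.min())

for name, scaled in [("standard", standard), ("robust", robust), ("minmax", minmax)]:
    print(f"{name:>8}: {np.round(scaled, 3)}")
```

With the outlier present, the standard and min-max versions squeeze the four ordinary points into a narrow band, while the robust version keeps them spread from -1 to 0.5 and pushes only the outlier far away.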

### Financial Data Considerations

- **Returns**: Roughly symmetric but fat-tailed; StandardScaler is a reasonable default, with RobustScaler or clipping when outliers dominate
- **Volume**: Heavy tails, use RobustScaler or log transform
- **Prices**: Non-stationary, convert to returns first
- **Volatility**: Right-skewed, consider log transform
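
The transformations in this list can be sketched on synthetic data. The price path and the "volume" series below are simulated purely for illustration, and `skewness` is a small helper defined here, not a library function.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic price path (geometric random walk); not real market data
log_returns_true = rng.normal(loc=0.0003, scale=0.01, size=1000)
prices = 100.0 * np.exp(np.cumsum(log_returns_true))

# Prices drift over time; log returns fluctuate around a stable level
log_returns = np.diff(np.log(prices))


def skewness(a):
    """Sample skewness: the third standardized moment."""
    a = np.asarray(a, dtype=float)
    centered = a - a.mean()
    return float((centered ** 3).mean() / (centered ** 2).mean() ** 1.5)


# Synthetic right-skewed series standing in for volume or volatility
volume = rng.lognormal(mean=10.0, sigma=1.0, size=1000)

print(f"skew(volume)      = {skewness(volume):.2f}")
print(f"skew(log(volume)) = {skewness(np.log(volume)):.2f}")
```

The log transform pulls the long right tail in, which is why it is often applied to volume and volatility before any scaler.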

In [None]:
def create_features_with_target(prices_df, target_horizon=1, lookback_windows=[5, 10, 20, 60]):
    """
    Create features and target for neural network training.
    
    Parameters:
    -----------
    prices_df : pd.DataFrame - Price data (Close prices)
    target_horizon : int - Days ahead for target (default: next day)
    lookback_windows : list - Windows for feature calculation
    
    Returns:
    --------
    features_df : pd.DataFrame - Feature matrix
    targets : pd.Series - Binary targets (1 = up, 0 = down)
    feature_names : list - Names of features
    """
    feature_dfs = []
    
    for ticker in prices_df.columns:
        price = prices_df[ticker]
        
        # Create features
        features = pd.DataFrame(index=prices_df.index)
        
        # Returns at different horizons
        features[f'{ticker}_ret_1d'] = price.pct_change(1)
        features[f'{ticker}_ret_5d'] = price.pct_change(5)
        features[f'{ticker}_ret_20d'] = price.pct_change(20)
        
        # Momentum and mean reversion features
        for window in lookback_windows:
            # Price relative to moving average
            sma = price.rolling(window).mean()
            features[f'{ticker}_sma_ratio_{window}'] = price / sma - 1
            
            # Volatility
            features[f'{ticker}_vol_{window}'] = price.pct_change().rolling(window).std() * np.sqrt(252)
            
            # Momentum
            features[f'{ticker}_mom_{window}'] = price / price.shift(window) - 1
        
        # RSI
        delta = price.diff()
        gain = delta.where(delta > 0, 0).rolling(14).mean()
        loss = (-delta.where(delta < 0, 0)).rolling(14).mean()
        rs = gain / (loss + 1e-10)
        features[f'{ticker}_rsi'] = 100 - (100 / (1 + rs))
        
        feature_dfs.append(features)
    
    # Combine all features
    features_df = pd.concat(feature_dfs, axis=1)
    
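    # Create target using SPY as market proxy
    # Binary: 1 if market goes up, 0 if down
    target_ticker = 'SPY' if 'SPY' in prices_df.columns else prices_df.columns[0]
    future_price = prices_df[target_ticker].shift(-target_horizon)
    targets = (future_price > prices_df[target_ticker]).astype(float)
    # The last target_horizon rows have no future price. A comparison with NaN
    # evaluates to False, so mark those rows missing instead of labelling them "down"
    targets[future_price.isna()] = np.nan
    targets.name = 'target'
    
    # Drop NaN
    valid_idx = features_df.dropna().index.intersection(targets.dropna().index)
    features_df = features_df.loc[valid_idx]
    targets = targets.loc[valid_idx].astype(int)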
    
    return features_df, targets, list(features_df.columns)


# Create features
features_df, targets, feature_names = create_features_with_target(prices_df)

print(f"Features shape: {features_df.shape}")
print(f"Number of features: {len(feature_names)}")
print(f"Target distribution: Up={targets.sum()}, Down={len(targets)-targets.sum()}")
print(f"Class ratio: {targets.mean():.2%} positive")

In [None]:
class FinancialDataPreprocessor:
    """
    Comprehensive data preprocessor for financial features.
    
    Handles:
    - Feature scaling with multiple methods
    - Outlier handling
    - Missing value imputation
    - Per-feature scaling with statistics fitted on training data only
      (crucial for look-ahead bias prevention)
    """
    
    def __init__(self, scaling_method='robust', clip_outliers=True, 
                 outlier_threshold=5.0):
        """
        Parameters:
        -----------
        scaling_method : str - 'standard', 'robust', or 'minmax'
        clip_outliers : bool - Whether to clip extreme values
        outlier_threshold : float - Number of std/IQR for clipping
        """
        self.scaling_method = scaling_method
        self.clip_outliers = clip_outliers
        self.outlier_threshold = outlier_threshold
        
        # Will be fit on training data
        self.scaler = None
        self.feature_stats = {}
        self.is_fitted = False
    
    def fit(self, X):
        """
        Fit preprocessor on training data only.
        
        CRITICAL: Never fit on validation/test data to avoid look-ahead bias!
        """
        X = np.array(X, dtype=np.float32)
        
        # Store feature statistics for outlier handling
        if self.scaling_method == 'robust':
            self.feature_stats['median'] = np.nanmedian(X, axis=0)
            q75 = np.nanpercentile(X, 75, axis=0)
            q25 = np.nanpercentile(X, 25, axis=0)
            self.feature_stats['iqr'] = q75 - q25
        else:
            self.feature_stats['mean'] = np.nanmean(X, axis=0)
            self.feature_stats['std'] = np.nanstd(X, axis=0)
        
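        # Initialize scaler
        if self.scaling_method == 'standard':
            self.scaler = StandardScaler()
        elif self.scaling_method == 'robust':
            self.scaler = RobustScaler()
        elif self.scaling_method == 'minmax':
            self.scaler = MinMaxScaler(feature_range=(-1, 1))
        else:
            raise ValueError(f"Unknown scaling_method: {self.scaling_method!r}")
        
        # Handle NaN, then clip exactly as transform() does so the scaler
        # is fitted on the same distribution it is later applied to
        X_clean = np.nan_to_num(X, nan=0.0)
        if self.clip_outliers:
            X_clean = self._clip_outliers(X_clean)
        self.scaler.fit(X_clean)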
        
        self.is_fitted = True
        return self
    
    def transform(self, X):
        """
        Transform data using fitted parameters.
        """
        if not self.is_fitted:
            raise ValueError("Preprocessor not fitted. Call fit() first.")
        
        X = np.array(X, dtype=np.float32)
        
        # Handle NaN
        X = np.nan_to_num(X, nan=0.0)
        
        # Clip outliers before scaling
        if self.clip_outliers:
            X = self._clip_outliers(X)
        
        # Scale
        X_scaled = self.scaler.transform(X)
        
        return X_scaled.astype(np.float32)
    
    def fit_transform(self, X):
        """Fit and transform in one step."""
        self.fit(X)
        return self.transform(X)
    
    def _clip_outliers(self, X):
        """Clip outliers based on fitted statistics."""
        if self.scaling_method == 'robust':
            median = self.feature_stats['median']
            iqr = self.feature_stats['iqr'] + 1e-8
            lower = median - self.outlier_threshold * iqr
            upper = median + self.outlier_threshold * iqr
        else:
            mean = self.feature_stats['mean']
            std = self.feature_stats['std'] + 1e-8
            lower = mean - self.outlier_threshold * std
            upper = mean + self.outlier_threshold * std
        
        return np.clip(X, lower, upper)


# Demonstrate preprocessing
print("Data Preprocessing Comparison:")
print("=" * 50)

# Sample data for demonstration
sample_features = features_df.iloc[:1000].values

for method in ['standard', 'robust', 'minmax']:
    preprocessor = FinancialDataPreprocessor(scaling_method=method)
    scaled = preprocessor.fit_transform(sample_features)
    
    print(f"\n{method.upper()} Scaler:")
    print(f"  Original range: [{np.nanmin(sample_features):.2f}, {np.nanmax(sample_features):.2f}]")
    print(f"  Scaled range: [{np.min(scaled):.2f}, {np.max(scaled):.2f}]")
    print(f"  Scaled mean: {np.mean(scaled):.4f}, std: {np.std(scaled):.4f}")

---

## 5. Train/Validation/Test Splits for Time Series

### CRITICAL: Avoiding Look-Ahead Bias

**Standard ML Split (WRONG for time series):**
```
Random shuffle → Split into train/val/test
```
This causes data leakage: future data influences predictions about the past.

**Correct Time Series Split:**
```
Time-ordered data → [Train] | [Validation] | [Test]
                     Past      Present       Future
```
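
The difference between the two schemes can be checked numerically. The sketch below uses row positions as a stand-in for dates and measures how many test rows are older than the newest training row; the helper name `share_of_test_before_last_train` is made up for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000                      # rows ordered by time: index 0 is the oldest
time_index = np.arange(n)

# Random split: shuffle, then take the last 20% as "test"
shuffled = rng.permutation(time_index)
rand_train, rand_test = shuffled[:800], shuffled[800:]

# Chronological split: the test set is strictly the most recent 20%
chrono_train, chrono_test = time_index[:800], time_index[800:]


def share_of_test_before_last_train(train_idx, test_idx):
    """Fraction of test rows that are older than the newest training row."""
    return float(np.mean(test_idx < train_idx.max()))


print("random split:       ", share_of_test_before_last_train(rand_train, rand_test))
print("chronological split:", share_of_test_before_last_train(chrono_train, chrono_test))
```

Under the random split nearly every test row predates some training row, so the model has effectively seen the future of the period it is being tested on. Under the chronological split the share is zero.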

### Gap Period

Include a gap between consecutive splits (train/validation and validation/test), at least as long as the target horizon, to:
- Prevent target leakage from autocorrelated data
- Simulate realistic prediction delay
- Account for information propagation time
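
How large the gap needs to be follows from the target definition. In the sketch below, row `t` is assumed to be labelled with the price move from `t` to `t + horizon`. The 5-day horizon and the helper `overlapping_train_rows` are hypothetical and chosen only to make the overlap visible.

```python
def overlapping_train_rows(train_end, val_start, horizon):
    """Training rows whose target looks at prices on or after val_start.

    Row t is labelled with the price move from t to t + horizon, so its
    target depends on data up to index t + horizon.
    """
    return [t for t in range(train_end) if t + horizon >= val_start]


train_end = 700          # training rows are 0..699
horizon = 5              # hypothetical 5-day-ahead target

for gap in (0, 2, 5):
    val_start = train_end + gap
    leaked = overlapping_train_rows(train_end, val_start, horizon)
    print(f"gap={gap}: {len(leaked)} training labels overlap validation")
```

The overlap disappears once the gap is at least as long as the horizon. This is why the gap is usually tied to the prediction horizon rather than picked arbitrarily.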

In [None]:
class TimeSeriesSplitter:
    """
    Time-aware data splitter for financial ML.
    
    Prevents look-ahead bias by maintaining chronological order.
    """
    
    def __init__(self, train_ratio=0.7, val_ratio=0.15, gap_days=5):
        """
        Parameters:
        -----------
        train_ratio : float - Proportion of data for training
        val_ratio : float - Proportion of data for validation
        gap_days : int - Number of days gap between splits (prevents leakage)
        """
        self.train_ratio = train_ratio
        self.val_ratio = val_ratio
        self.gap_days = gap_days
        # test_ratio is implicit: 1 - train_ratio - val_ratio
    
    def split(self, X, y, dates=None):
        """
        Split data chronologically.
        
        Parameters:
        -----------
        X : array-like - Features
        y : array-like - Targets
        dates : array-like - Optional date index for reporting
        
        Returns:
        --------
        dict with train, val, test splits
        """
        n = len(X)
        
        # Calculate split indices
        train_end = int(n * self.train_ratio)
        val_start = train_end + self.gap_days
        val_end = val_start + int(n * self.val_ratio)
        test_start = val_end + self.gap_days
        
        # Create splits
        splits = {
            'train': {
                'X': X[:train_end],
                'y': y[:train_end],
                'indices': range(0, train_end)
            },
            'val': {
                'X': X[val_start:val_end],
                'y': y[val_start:val_end],
                'indices': range(val_start, val_end)
            },
            'test': {
                'X': X[test_start:],
                'y': y[test_start:],
                'indices': range(test_start, n)
            }
        }
        
        # Add dates if provided
        if dates is not None:
            dates = np.array(dates)
            splits['train']['dates'] = dates[:train_end]
            splits['val']['dates'] = dates[val_start:val_end]
            splits['test']['dates'] = dates[test_start:]
        
        return splits
    
    def report(self, splits, dates=None):
        """Print split information."""
        print("Time Series Split Report:")
        print("=" * 60)
        
        for name, data in splits.items():
            n = len(data['X'])
            if 'dates' in data:
                start = pd.Timestamp(data['dates'][0]).strftime('%Y-%m-%d')
                end = pd.Timestamp(data['dates'][-1]).strftime('%Y-%m-%d')
                print(f"{name.upper():8s}: {n:5d} samples | {start} to {end}")
            else:
                print(f"{name.upper():8s}: {n:5d} samples")


# Apply time series split
splitter = TimeSeriesSplitter(train_ratio=0.7, val_ratio=0.15, gap_days=5)
splits = splitter.split(
    features_df.values, 
    targets.values,
    dates=features_df.index
)

splitter.report(splits)

# Visualize the split
fig, ax = plt.subplots(figsize=(14, 4))

# Plot timeline
for name, data in splits.items():
    if 'dates' in data:
        start_idx = data['indices'][0]
        end_idx = data['indices'][-1]
        color = {'train': 'blue', 'val': 'orange', 'test': 'green'}[name]
        ax.axvspan(start_idx, end_idx, alpha=0.3, label=name.upper(), color=color)

ax.set_xlabel('Sample Index')
ax.set_title('Time Series Train/Validation/Test Split')
ax.legend()
plt.tight_layout()
plt.show()

---

## 6. Walk-Forward Training Paradigm

### Why Walk-Forward?

**Standard Training:**
- Train once on historical data
- Test on future data
- Problem: Model becomes stale as markets evolve

**Walk-Forward Training:**
- Train on window of historical data
- Predict next period
- Roll forward and repeat
- Mimics real trading environment

```
Window 1: [Train...............] [Test]
Window 2:     [Train...............] [Test]
Window 3:         [Train...............] [Test]
```

In [None]:
class WalkForwardValidator:
    """
    Walk-forward validation framework for time series.
    
    Implements expanding or rolling window validation.
    """
    
    def __init__(self, n_splits=5, train_period=252, test_period=63,
                 expanding=False, gap=5):
        """
        Parameters:
        -----------
        n_splits : int - Number of walk-forward folds
        train_period : int - Training window size (days)
        test_period : int - Test window size (days)
        expanding : bool - If True, training window expands; if False, rolls
        gap : int - Gap between train and test to prevent leakage
        """
        self.n_splits = n_splits
        self.train_period = train_period
        self.test_period = test_period
        self.expanding = expanding
        self.gap = gap
    
    def split(self, X):
        """
        Generate train/test indices for each fold.
        
        Yields:
        -------
        train_idx, test_idx for each fold
        """
        n = len(X)
        
        # Calculate minimum required data
        min_required = self.train_period + self.gap + self.test_period
        
        # Calculate step size between folds
        if self.n_splits > 1:
            # Available space for all test periods
            available = n - min_required
            step = max(1, available // (self.n_splits - 1))
        else:
            step = 0
        
        for fold in range(self.n_splits):
            if self.expanding:
                # Expanding window: train from start
                train_start = 0
                train_end = self.train_period + fold * step
            else:
                # Rolling window: fixed train size
                train_start = fold * step
                train_end = train_start + self.train_period
            
            # Test period with gap
            test_start = train_end + self.gap
            test_end = min(test_start + self.test_period, n)
            
            if test_end <= test_start:
                break
            
            train_idx = np.arange(train_start, train_end)
            test_idx = np.arange(test_start, test_end)
            
            yield train_idx, test_idx
    
    def visualize(self, X, dates=None):
        """Visualize walk-forward splits."""
        fig, ax = plt.subplots(figsize=(14, 6))
        
        colors = plt.cm.tab10(np.linspace(0, 1, self.n_splits))
        
        for fold, (train_idx, test_idx) in enumerate(self.split(X)):
            # Plot training period
            ax.barh(fold, len(train_idx), left=train_idx[0], 
                   color=colors[fold], alpha=0.5, label=f'Fold {fold+1} Train')
            
            # Plot test period
            ax.barh(fold, len(test_idx), left=test_idx[0],
                   color=colors[fold], alpha=1.0, edgecolor='black')
        
        ax.set_xlabel('Sample Index')
        ax.set_ylabel('Fold')
        ax.set_title('Walk-Forward Validation Splits\n(Light = Train, Dark = Test)')
        ax.set_yticks(range(self.n_splits))
        ax.set_yticklabels([f'Fold {i+1}' for i in range(self.n_splits)])
        
        plt.tight_layout()
        plt.show()


# Create walk-forward validator
wf_validator = WalkForwardValidator(
    n_splits=5,
    train_period=500,  # ~2 years training
    test_period=63,    # ~3 months test
    expanding=True,    # Expanding window
    gap=5
)

# Visualize
wf_validator.visualize(features_df.values)

# Print fold details
print("\nWalk-Forward Fold Details:")
print("=" * 50)
for fold, (train_idx, test_idx) in enumerate(wf_validator.split(features_df.values)):
    print(f"Fold {fold+1}: Train [{train_idx[0]:4d} - {train_idx[-1]:4d}] ({len(train_idx):4d} samples), "
          f"Test [{test_idx[0]:4d} - {test_idx[-1]:4d}] ({len(test_idx):3d} samples)")
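
Generating fold indices is only half of walk-forward validation: anything fitted to data, including the scaler, should be refitted inside each fold on that fold's training rows. The sketch below shows this on a synthetic drifting feature. `rolling_folds` is a simplified stand-in written for this example, not the `WalkForwardValidator` class above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic feature whose level drifts upward over time (illustrative only)
n = 600
X = rng.normal(size=(n, 1)) + np.linspace(0.0, 3.0, n).reshape(-1, 1)


def rolling_folds(n_rows, train_size, test_size, gap):
    """Yield (train_idx, test_idx) pairs with non-overlapping test blocks."""
    start = 0
    while start + train_size + gap + test_size <= n_rows:
        train_idx = np.arange(start, start + train_size)
        test_start = start + train_size + gap
        test_idx = np.arange(test_start, test_start + test_size)
        yield train_idx, test_idx
        start += test_size


for fold, (train_idx, test_idx) in enumerate(rolling_folds(n, 200, 50, 5), start=1):
    # Scaling statistics come from this fold's training rows only
    mu = X[train_idx].mean(axis=0)
    sigma = X[train_idx].std(axis=0) + 1e-8
    X_test_scaled = (X[test_idx] - mu) / sigma
    print(f"Fold {fold}: train mean={mu[0]:.2f}, "
          f"scaled test mean={X_test_scaled.mean():.2f}")
```

Because the feature drifts, the scaled test mean sits above zero in each fold. That shift is what the model would face in live use, and it would be hidden if the scaler were fitted once on the full history.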

---

## 7. Mini-batch vs Full-batch Training

### Batch Size Considerations for Financial Data

| Batch Size | Pros | Cons | Best For |
|------------|------|------|----------|
| Full batch | Stable gradients, deterministic | Memory intensive, slow updates | Very small datasets (<1000) |
| Large batch (256-512) | Efficient GPU utilization | Less regularization | Large datasets with GPU |
| Medium batch (32-128) | Balance of stability and noise | Neither the smoothest gradients nor the strongest regularization | General purpose; most financial datasets |
| Small batch (8-32) | Strong regularization | Noisy gradients | Very small datasets, overfit-prone |

### Financial-Specific Guidelines

- **Small datasets (<5000)**: Use smaller batches (16-32) for regularization
- **Time series**: Split chronologically before batching; shuffle only within the training set, and keep each input window intact for sequence models
- **Class imbalance**: Consider stratified batching

In [None]:
def analyze_batch_size_effects(X, y, batch_sizes=[16, 32, 64, 128, 256]):
    """
    Summarize the effects of different batch sizes on training.
    
    Prints, for each batch size:
    - Number of batches (optimizer steps) per epoch
    - A rough estimate of memory for one input batch
    - A qualitative gradient-noise level
    """
    print("Batch Size Analysis:")
    print("=" * 60)
    print(f"{'Batch Size':>12} | {'Batches/Epoch':>14} | {'Est. GPU Mem':>12} | {'Gradient Noise':>14}")
    print("-" * 60)
    
    n_samples = len(X)
    n_features = X.shape[1]
    
    for batch_size in batch_sizes:
        # Number of batches per epoch
        n_batches = np.ceil(n_samples / batch_size).astype(int)
        
        # Rough memory estimate for one input batch only (float32 = 4 bytes).
        # The factor 3 is a crude allowance for activations and gradients;
        # model weights and optimizer state are not included.
        mem_mb = (batch_size * n_features * 4 * 3) / (1024 * 1024)
        
        # Gradient noise estimation (smaller batch = noisier gradient)
        # The standard deviation of a mini-batch mean scales as
        # 1/sqrt(batch_size); the variance scales as 1/batch_size
        relative_noise = 1.0 / np.sqrt(batch_size)
        noise_level = 'High' if relative_noise > 0.2 else 'Medium' if relative_noise > 0.1 else 'Low'
        
        print(f"{batch_size:>12} | {n_batches:>14} | {mem_mb:>10.2f} MB | {noise_level:>14}")
    
    print("\n💡 Recommendations for financial data:")
    if n_samples < 5000:
        print(f"   Dataset size: {n_samples} (small) → Use batch size 16-32")
    elif n_samples < 50000:
        print(f"   Dataset size: {n_samples} (medium) → Use batch size 32-64")
    else:
        print(f"   Dataset size: {n_samples} (large) → Use batch size 64-256")


# Analyze batch sizes for our data
analyze_batch_size_effects(features_df.values, targets.values)

---

## 8. Handling Class Imbalance in Trading Signals

### The Imbalance Problem in Trading

**Common scenarios:**
- Rare profitable opportunities (5-10% of signals)
- Extreme market events (crashes, rallies)
- Asymmetric targets (large up vs small down)

**Solutions:**
1. **Class Weighting**: Penalize misclassification of minority class
2. **Oversampling**: SMOTE, random oversampling
3. **Undersampling**: Random undersampling (risks losing information)
4. **Threshold Adjustment**: Modify decision threshold
5. **Focal Loss**: Down-weight easy examples
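
Of these options, threshold adjustment needs no retraining. A minimal sketch follows, assuming the model already outputs scores in [0, 1] for a held-out validation set. The data are synthetic, and the helpers `f1_at_threshold` and `best_threshold` are written for this example.

```python
import numpy as np


def f1_at_threshold(y_true, scores, threshold):
    """F1 of the positive class when predicting 1 for scores >= threshold."""
    pred = scores >= threshold
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    fn = np.sum(~pred & (y_true == 1))
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom > 0 else 0.0


def best_threshold(y_true, scores, grid=None):
    """Pick the threshold with the highest F1 on validation data."""
    grid = np.linspace(0.05, 0.95, 19) if grid is None else grid
    f1s = [f1_at_threshold(y_true, scores, t) for t in grid]
    return float(grid[int(np.argmax(f1s))])


# Synthetic validation set: about 10% positives, scores skewed low
rng = np.random.default_rng(0)
y_val = (rng.random(2000) < 0.10).astype(int)
scores = np.clip(0.15 + 0.25 * y_val + rng.normal(0, 0.1, 2000), 0, 1)

t_star = best_threshold(y_val, scores)
print(f"F1 at 0.50: {f1_at_threshold(y_val, scores, 0.5):.3f}")
print(f"F1 at {t_star:.2f}: {f1_at_threshold(y_val, scores, t_star):.3f}")
```

The threshold should be chosen on validation data and then frozen before touching the test set, otherwise the choice itself leaks information. F1 is used here only as an example criterion; a trading application might optimize expected profit per signal instead.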

In [None]:
class ImbalanceHandler:
    """
    Handles class imbalance for trading signal prediction.
    
    Provides multiple strategies:
    - Class weights for loss function
    - Weighted sampling for DataLoader
    - Focal Loss implementation
    """
    
    @staticmethod
    def compute_class_weights(y, strategy='balanced'):
        """
        Compute class weights for imbalanced data.
        
        Parameters:
        -----------
        y : array-like - Binary labels
        strategy : str - 'balanced', 'sqrt', or 'none'
        
        Returns:
        --------
        weights : dict - Class weights {0: w0, 1: w1}
        """
        y = np.array(y)
        n_samples = len(y)
        n_classes = 2
        
        # Count classes (minlength keeps both entries even if one class is absent)
        class_counts = np.bincount(y.astype(int), minlength=n_classes)
        
        if strategy == 'balanced':
            # Inverse frequency weighting
            weights = n_samples / (n_classes * class_counts)
        elif strategy == 'sqrt':
            # Square root of inverse frequency (less aggressive)
            weights = np.sqrt(n_samples / (n_classes * class_counts))
        else:
            weights = np.ones(n_classes)
        
        return {0: weights[0], 1: weights[1]}
    
    @staticmethod
    def create_weighted_sampler(y):
        """
        Create WeightedRandomSampler for balanced batch sampling.
        
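        Each batch then has roughly equal class representation in expectation.
        Sampling is random and with replacement, so temporal order within
        the training set is not preserved.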
        """
        y = np.array(y)
        class_counts = np.bincount(y.astype(int))
        
        # Weight for each sample (inverse of its class frequency)
        weights = 1.0 / class_counts[y.astype(int)]
        weights = torch.FloatTensor(weights)
        
        sampler = WeightedRandomSampler(
            weights=weights,
            num_samples=len(weights),
            replacement=True
        )
        
        return sampler


class FocalLoss(nn.Module):
    """
    Focal Loss for handling class imbalance.
    
    FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t)
    
    gamma > 0 reduces loss for well-classified examples,
    focusing training on hard, misclassified examples.
    """
    
    def __init__(self, alpha=None, gamma=2.0, reduction='mean'):
        """
        Parameters:
        -----------
        alpha : float in [0, 1] or None - Weight on the positive class;
                the negative class is weighted by 1 - alpha
        gamma : float - Focusing parameter (higher = more focus on hard examples)
        reduction : str - 'mean', 'sum', or 'none'
        """
        super(FocalLoss, self).__init__()
        self.alpha = alpha
        self.gamma = gamma
        self.reduction = reduction
    
    def forward(self, inputs, targets):
        # Apply sigmoid to get probabilities
        p = torch.sigmoid(inputs)
        
        # Get probability for true class
        p_t = p * targets + (1 - p) * (1 - targets)
        
        # Calculate focal weight
        focal_weight = (1 - p_t) ** self.gamma
        
        # Binary cross-entropy
        bce = F.binary_cross_entropy_with_logits(
            inputs, targets, reduction='none'
        )
        
        # Apply focal weight
        loss = focal_weight * bce
        
        # Apply alpha weighting
        if self.alpha is not None:
            alpha_t = self.alpha * targets + (1 - self.alpha) * (1 - targets)
            loss = alpha_t * loss
        
        if self.reduction == 'mean':
            return loss.mean()
        elif self.reduction == 'sum':
            return loss.sum()
        else:
            return loss


# Analyze class distribution
print("Class Distribution Analysis:")
print("=" * 50)

y = targets.values
class_counts = np.bincount(y.astype(int))

print(f"Class 0 (Down): {class_counts[0]:,} samples ({class_counts[0]/len(y)*100:.1f}%)")
print(f"Class 1 (Up):   {class_counts[1]:,} samples ({class_counts[1]/len(y)*100:.1f}%)")
print(f"Imbalance ratio: {max(class_counts)/min(class_counts):.2f}:1")

# Compute class weights
weights = ImbalanceHandler.compute_class_weights(y, strategy='balanced')
print(f"\nBalanced class weights: {weights}")

# Visualize
fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(['Down (0)', 'Up (1)'], class_counts, color=['red', 'green'], alpha=0.7)
ax.set_ylabel('Count')
ax.set_title('Class Distribution')
plt.tight_layout()
plt.show()
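
To see what the focusing term in focal loss does, it helps to evaluate the formula at a few probabilities. The sketch below computes the loss without alpha weighting in plain NumPy; `p_t` is the probability the model assigns to the correct class, and the three values are illustrative.

```python
import numpy as np


def bce(p_t):
    """Binary cross-entropy given the probability assigned to the true class."""
    return -np.log(p_t)


def focal(p_t, gamma=2.0):
    """Focal loss without alpha weighting: (1 - p_t)^gamma * BCE."""
    return (1.0 - p_t) ** gamma * bce(p_t)


# Probability the model assigns to the correct class
for label, p_t in [("easy", 0.9), ("uncertain", 0.5), ("hard", 0.1)]:
    print(f"{label:>9}: p_t={p_t:.1f}  BCE={bce(p_t):.4f}  focal={focal(p_t):.4f}")
```

Under plain cross-entropy the hard example costs about 22 times as much as the easy one. With gamma = 2 the ratio grows to well over a thousand, so well-classified samples contribute almost nothing to the gradient.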

---

## 9. GPU vs CPU Considerations

### When to Use GPU

**GPU Recommended:**
- Large models (>100K parameters)
- Large datasets (>50K samples)
- Training deep networks (>5 layers)
- Batch sizes >= 64

**CPU May Be Better:**
- Small models (<10K parameters)
- Small datasets (<5K samples)
- Inference with small batch sizes
- Data loading is the bottleneck
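
The thresholds above are rough rules of thumb, not hard limits. As a minimal sketch, they can be encoded in a small helper. `suggest_device` and its arguments are hypothetical names introduced here for illustration, and the cutoffs simply mirror the two lists above.

```python
def suggest_device(n_params, n_samples, batch_size, cuda_available):
    """Heuristic device choice mirroring the rules of thumb above.

    Hypothetical helper: the thresholds are illustrative, not measured.
    """
    if not cuda_available:
        return "cpu"
    # Small model or small dataset: transfer and launch overhead tends to dominate
    if n_params < 10_000 or n_samples < 5_000:
        return "cpu"
    # Large model, large dataset, or batches big enough to keep the GPU busy
    if n_params > 100_000 or n_samples > 50_000 or batch_size >= 64:
        return "cuda"
    return "cpu"


# A ~30K-parameter MLP on 10K samples with batch_size=64, GPU present
print(suggest_device(30_000, 10_000, 64, cuda_available=True))
# A tiny model on a few thousand daily bars
print(suggest_device(5_000, 2_500, 32, cuda_available=True))
```

In a real run, `torch.cuda.is_available()` would supply `cuda_available`, and the suggestion should be confirmed with an actual timing benchmark on your own hardware.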

In [None]:
def benchmark_cpu_vs_gpu(input_dim=100, hidden_dim=128, n_samples=10000, 
                         batch_size=64, n_epochs=5):
    """
    Benchmark CPU vs GPU performance for neural network training.
    """
    # Create dummy data
    X = torch.randn(n_samples, input_dim)
    y = torch.randint(0, 2, (n_samples, 1)).float()
    
    # Simple model
    class BenchmarkModel(nn.Module):
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(input_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, 1)
            )
        
        def forward(self, x):
            return self.net(x)
    
    results = {}
    devices_to_test = ['cpu']
    if torch.cuda.is_available():
        devices_to_test.append('cuda')
    
    for device_name in devices_to_test:
        device = torch.device(device_name)
        
        # Create model and move to device
        set_all_seeds(42)
        model = BenchmarkModel().to(device)
        optimizer = optim.Adam(model.parameters())
        criterion = nn.BCEWithLogitsLoss()
        
        # Create DataLoader
        dataset = TensorDataset(X, y)
        loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
        
        # Warmup
        for batch_X, batch_y in loader:
            batch_X, batch_y = batch_X.to(device), batch_y.to(device)
            _ = model(batch_X)
            break
        
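        # CUDA kernels run asynchronously, so synchronize before starting the timer
        if device.type == 'cuda':
            torch.cuda.synchronize()
        
        # Benchmark
        start_time = time.perf_counter()
        
        for epoch in range(n_epochs):
            for batch_X, batch_y in loader:
                batch_X, batch_y = batch_X.to(device), batch_y.to(device)
                
                optimizer.zero_grad()
                output = model(batch_X)
                loss = criterion(output, batch_y)
                loss.backward()
                optimizer.step()
        
        # Wait for queued GPU work to finish so the elapsed time is meaningful
        if device.type == 'cuda':
            torch.cuda.synchronize()
        elapsed = time.perf_counter() - start_time
        results[device_name] = elapsed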
    
    # Report
    print("CPU vs GPU Benchmark Results:")
    print("=" * 50)
    print(f"Configuration: {n_samples} samples, batch_size={batch_size}, {n_epochs} epochs")
    print("-" * 50)
    
    for device_name, elapsed in results.items():
        print(f"{device_name.upper():6s}: {elapsed:.3f} seconds")
    
    if 'cuda' in results:
        speedup = results['cpu'] / results['cuda']
        print(f"\nGPU Speedup: {speedup:.2f}x")
    
    return results


# Run benchmark
benchmark_results = benchmark_cpu_vs_gpu()

---

## 10. Practical: Full Production Training Pipeline

Now we'll implement a complete, production-ready training pipeline incorporating all best practices.

In [None]:
class TradingDataset(Dataset):
    """
    PyTorch Dataset for trading data with proper preprocessing.
    """
    
    def __init__(self, X, y, preprocessor=None, fit_preprocessor=False):
        """
        Parameters:
        -----------
        X : array-like - Features
        y : array-like - Targets
        preprocessor : FinancialDataPreprocessor - Fitted preprocessor
        fit_preprocessor : bool - Whether to fit preprocessor (only for training)
        """
        self.y = np.array(y, dtype=np.float32)
        
        if fit_preprocessor:
            self.preprocessor = FinancialDataPreprocessor(scaling_method='robust')
            self.X = self.preprocessor.fit_transform(X)
        elif preprocessor is not None:
            self.preprocessor = preprocessor
            self.X = self.preprocessor.transform(X)
        else:
            self.preprocessor = None
            self.X = np.array(X, dtype=np.float32)
    
    def __len__(self):
        return len(self.X)
    
    def __getitem__(self, idx):
        return (
            torch.FloatTensor(self.X[idx]),
            torch.FloatTensor([self.y[idx]])
        )

In [None]:
class TradingNeuralNetwork(nn.Module):
    """
    Production-ready neural network for trading signals.
    
    Incorporates:
    - Batch normalization
    - Dropout regularization
    - GELU activation (modern alternative to ReLU)
    """
    
    def __init__(self, input_dim, hidden_dims=(128, 64, 32), dropout=0.3):
        super(TradingNeuralNetwork, self).__init__()
        
        layers = []
        prev_dim = input_dim
        
        for hidden_dim in hidden_dims:
            layers.extend([
                nn.Linear(prev_dim, hidden_dim),
                nn.BatchNorm1d(hidden_dim),
                nn.GELU(),
                nn.Dropout(dropout)
            ])
            prev_dim = hidden_dim
        
        self.backbone = nn.Sequential(*layers)
        self.output = nn.Linear(prev_dim, 1)
        
        # Initialize weights
        self._init_weights()
    
    def _init_weights(self):
        """Initialize weights using Xavier/Glorot initialization."""
        for m in self.modules():
            if isinstance(m, nn.Linear):
                nn.init.xavier_uniform_(m.weight)
                if m.bias is not None:
                    nn.init.zeros_(m.bias)
    
    def forward(self, x):
        x = self.backbone(x)
        return self.output(x)

In [None]:
class ProductionTrainer:
    """
    Production-ready training pipeline with all best practices.
    
    Features:
    - Time series aware splits
    - Class imbalance handling
    - Learning rate scheduling
    - Early stopping
    - Gradient clipping
    - Comprehensive logging
    - Model checkpointing
    """
    
    def __init__(self, model, device='cpu', seed=42):
        self.model = model.to(device)
        self.device = device
        self.seed = seed
        self.history = {'train_loss': [], 'val_loss': [], 'val_acc': [], 'val_auc': []}
        self.best_model_state = None
        self.best_val_loss = float('inf')
    
    def train(self, train_loader, val_loader, num_epochs=100, lr=1e-3,
              weight_decay=1e-5, patience=15, class_weights=None,
              use_focal_loss=False, focal_gamma=2.0):
        """
        Full training loop with all best practices.
        """
        # Set seed for reproducibility
        set_all_seeds(self.seed)
        
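        # Setup loss function
        # `is not None` also works if class_weights is a NumPy array, not just a dict
        if use_focal_loss:
            alpha = (class_weights[1] / (class_weights[0] + class_weights[1])
                     if class_weights is not None else 0.5)
            criterion = FocalLoss(alpha=alpha, gamma=focal_gamma)
        elif class_weights is not None: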
            # Use weighted BCE loss
            pos_weight = torch.FloatTensor([class_weights[1] / class_weights[0]]).to(self.device)
            criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)
        else:
            criterion = nn.BCEWithLogitsLoss()
        
        # AdamW applies decoupled weight decay (related to, but not the same as, L2 regularization)
        optimizer = optim.AdamW(self.model.parameters(), lr=lr, weight_decay=weight_decay)
        
        # Learning rate scheduler (`verbose` is deprecated in recent PyTorch releases, so it is omitted)
        scheduler = optim.lr_scheduler.ReduceLROnPlateau(
            optimizer, mode='min', factor=0.5, patience=5
        )
        
        # Early stopping
        patience_counter = 0
        
        print("Training Configuration:")
        print("=" * 60)
        print(f"  Device: {self.device}")
        print(f"  Epochs: {num_epochs}")
        print(f"  Learning Rate: {lr}")
        print(f"  Weight Decay: {weight_decay}")
        print(f"  Patience: {patience}")
        loss_name = ('Focal Loss' if use_focal_loss
                     else 'Weighted BCE Loss' if class_weights is not None
                     else 'BCE Loss')
        print(f"  Loss: {loss_name}")
        print("=" * 60)
        
        for epoch in range(num_epochs):
            # ===== TRAINING PHASE =====
            self.model.train()
            train_loss = 0
            
            for batch_X, batch_y in train_loader:
                batch_X = batch_X.to(self.device)
                batch_y = batch_y.to(self.device)
                
                optimizer.zero_grad()
                output = self.model(batch_X)
                loss = criterion(output, batch_y)
                loss.backward()
                
                # Gradient clipping
                torch.nn.utils.clip_grad_norm_(self.model.parameters(), max_norm=1.0)
                
                optimizer.step()
                train_loss += loss.item()
            
            train_loss /= len(train_loader)
            
            # ===== VALIDATION PHASE =====
            val_metrics = self._evaluate(val_loader, criterion)
            
            # Update history
            self.history['train_loss'].append(train_loss)
            self.history['val_loss'].append(val_metrics['loss'])
            self.history['val_acc'].append(val_metrics['accuracy'])
            self.history['val_auc'].append(val_metrics['auc'])
            
            # Learning rate scheduling
            scheduler.step(val_metrics['loss'])
            
            # Print progress
            if (epoch + 1) % 10 == 0 or epoch == 0:
                current_lr = optimizer.param_groups[0]['lr']
                print(f"Epoch {epoch+1:3d}: Train Loss = {train_loss:.4f}, "
                      f"Val Loss = {val_metrics['loss']:.4f}, "
                      f"Val Acc = {val_metrics['accuracy']:.4f}, "
                      f"Val AUC = {val_metrics['auc']:.4f}, "
                      f"LR = {current_lr:.2e}")
            
            # Early stopping and checkpointing
            if val_metrics['loss'] < self.best_val_loss:
                self.best_val_loss = val_metrics['loss']
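                # state_dict().copy() is shallow: its tensors would keep changing as training continues
                self.best_model_state = {
                    k: v.detach().clone() for k, v in self.model.state_dict().items()
                }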
                patience_counter = 0
            else:
                patience_counter += 1
                if patience_counter >= patience:
                    print(f"\n⚡ Early stopping at epoch {epoch+1}")
                    break
        
        # Restore best model
        if self.best_model_state is not None:
            self.model.load_state_dict(self.best_model_state)
            print("✓ Restored best model weights")
        
        return self.history
    
    def _evaluate(self, loader, criterion):
        """Evaluate model on validation/test data."""
        self.model.eval()
        total_loss = 0
        all_preds = []
        all_targets = []
        all_probs = []
        
        with torch.no_grad():
            for batch_X, batch_y in loader:
                batch_X = batch_X.to(self.device)
                batch_y = batch_y.to(self.device)
                
                output = self.model(batch_X)
                loss = criterion(output, batch_y)
                total_loss += loss.item()
                
                probs = torch.sigmoid(output)
                preds = (probs > 0.5).float()
                
                all_preds.extend(preds.cpu().numpy().flatten())
                all_targets.extend(batch_y.cpu().numpy().flatten())
                all_probs.extend(probs.cpu().numpy().flatten())
        
        # Calculate metrics
        accuracy = accuracy_score(all_targets, all_preds)
        try:
            auc = roc_auc_score(all_targets, all_probs)
        except ValueError:
            # AUC is undefined when only one class is present in the targets
            auc = 0.5
        
        return {
            'loss': total_loss / len(loader),
            'accuracy': accuracy,
            'auc': auc,
            'predictions': all_preds,
            'probabilities': all_probs
        }
    
    def evaluate_trading_performance(self, test_loader, test_returns=None):
        """
        Evaluate model as a trading strategy.
        """
        self.model.eval()
        criterion = nn.BCEWithLogitsLoss()
        
        metrics = self._evaluate(test_loader, criterion)
        
        print("\nTrading Strategy Performance:")
        print("=" * 50)
        print(f"Accuracy: {metrics['accuracy']:.4f}")
        print(f"AUC-ROC: {metrics['auc']:.4f}")
        
        # Classification report (true labels first, predicted labels second)
        preds = [int(p) for p in metrics['predictions']]
        print("\nClassification Report:")
        print(classification_report(
            test_loader.dataset.y[:len(preds)].astype(int),
            preds,
            target_names=['Down', 'Up']
        ))
        
        # Trading metrics (simplified: long/short on every bar, no transaction costs)
        if test_returns is not None:
            # np.asarray lets test_returns be a list, ndarray, or pandas Series
            returns = np.asarray(test_returns, dtype=float)
            positions = np.where(np.asarray(metrics['probabilities']) > 0.5, 1, -1)
            strategy_returns = positions * returns[:len(positions)]
            
            cum_return = (1 + strategy_returns).cumprod()[-1] - 1
            sharpe = np.sqrt(252) * strategy_returns.mean() / (strategy_returns.std() + 1e-8)
            
            print(f"\nCumulative Return: {cum_return*100:.2f}%")
            print(f"Sharpe Ratio: {sharpe:.2f}")
        
        return metrics

In [None]:
# Apply the complete training pipeline

# 1. Time series split
splitter = TimeSeriesSplitter(train_ratio=0.7, val_ratio=0.15, gap_days=5)
splits = splitter.split(features_df.values, targets.values, dates=features_df.index)

# 2. Create datasets with proper preprocessing
train_dataset = TradingDataset(
    splits['train']['X'], 
    splits['train']['y'],
    fit_preprocessor=True
)

val_dataset = TradingDataset(
    splits['val']['X'],
    splits['val']['y'],
    preprocessor=train_dataset.preprocessor
)

test_dataset = TradingDataset(
    splits['test']['X'],
    splits['test']['y'],
    preprocessor=train_dataset.preprocessor
)

# 3. Compute class weights for imbalance handling
class_weights = ImbalanceHandler.compute_class_weights(
    splits['train']['y'], 
    strategy='balanced'
)
print(f"Class weights: {class_weights}")

# 4. Create data loaders
batch_size = 64
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

print(f"\nDataset sizes: Train={len(train_dataset)}, Val={len(val_dataset)}, Test={len(test_dataset)}")

In [None]:
# 5. Create model
input_dim = train_dataset.X.shape[1]
model = TradingNeuralNetwork(
    input_dim=input_dim,
    hidden_dims=[128, 64, 32],
    dropout=0.3
)

print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")

# 6. Train
trainer = ProductionTrainer(model, device=device, seed=SEED)

history = trainer.train(
    train_loader, val_loader,
    num_epochs=100,
    lr=1e-3,
    weight_decay=1e-5,
    patience=15,
    class_weights=class_weights,
    use_focal_loss=False
)

In [None]:
# 7. Plot training history
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Loss
axes[0].plot(history['train_loss'], label='Train', linewidth=2)
axes[0].plot(history['val_loss'], label='Validation', linewidth=2)
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Loss')
axes[0].set_title('Training and Validation Loss')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Accuracy
axes[1].plot(history['val_acc'], label='Validation', linewidth=2, color='green')
axes[1].axhline(y=0.5, color='r', linestyle='--', label='Random', alpha=0.5)
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('Accuracy')
axes[1].set_title('Validation Accuracy')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

# AUC
axes[2].plot(history['val_auc'], label='Validation', linewidth=2, color='purple')
axes[2].axhline(y=0.5, color='r', linestyle='--', label='Random', alpha=0.5)
axes[2].set_xlabel('Epoch')
axes[2].set_ylabel('AUC-ROC')
axes[2].set_title('Validation AUC-ROC')
axes[2].legend()
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# 8. Final evaluation on test set
test_metrics = trainer.evaluate_trading_performance(test_loader)

---

## 11. Key Takeaways

### Training Best Practices Summary

1. **Reproducibility is Non-Negotiable**
   - Set all seeds: NumPy, Python, PyTorch
   - Enable deterministic mode for auditable results
   - Document all hyperparameters

2. **Proper Data Preprocessing**
   - Use RobustScaler for financial data (outlier-resistant)
   - NEVER fit scaler on validation/test data
   - Handle missing values explicitly

3. **Time Series Splits Are Critical**
   - Always maintain chronological order
   - Include gap periods to prevent leakage
   - Use walk-forward validation for realistic assessment

4. **Handle Class Imbalance**
   - Use class weights or focal loss
   - Consider weighted sampling
   - Monitor precision/recall, not just accuracy

5. **Training Stability**
   - Gradient clipping guards against exploding gradients
   - Learning rate scheduling improves convergence
   - Early stopping prevents overfitting
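
To make takeaway 4 concrete, here is a minimal pure-Python sketch of why accuracy alone misleads on imbalanced labels. The toy labels (5 signal days out of 100) and the helper name `precision_recall_accuracy` are assumptions made up for this illustration.

```python
def precision_recall_accuracy(y_true, y_pred):
    """Return (precision, recall, accuracy) for binary labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / len(pairs)
    return precision, recall, accuracy


# Toy data: 5 rare "signal" days out of 100, and a model that never fires
y_true = [1] * 5 + [0] * 95
never_fires = [0] * 100

precision, recall, accuracy = precision_recall_accuracy(y_true, never_fires)
print(f"accuracy={accuracy:.2f}, precision={precision:.2f}, recall={recall:.2f}")
```

The model that never predicts a signal reaches 95% accuracy while its recall is zero, which is exactly the failure mode that class weights and focal loss are meant to counter.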

### European Regulatory Considerations 🇪🇺

- Document all training decisions (MiFID II)
- Ensure reproducibility for audits
- Maintain model versioning and governance
- Test on data from multiple market regimes
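
One lightweight way to document training decisions is to write a small record per run. The sketch below uses only the standard library. The schema, the function name `build_run_record`, and the placeholder dates are assumptions for illustration, not a format prescribed by any regulation.

```python
import hashlib
import json
from datetime import datetime, timezone


def build_run_record(hyperparams, seed, data_range, metrics):
    """Assemble an audit-friendly record of one training run (hypothetical schema)."""
    record = {
        "created_utc": datetime.now(timezone.utc).isoformat(),
        "seed": seed,
        "hyperparams": hyperparams,
        "data_range": data_range,
        "metrics": metrics,
    }
    # Hash only the inputs that define the experiment, so identical configs share an ID
    config_blob = json.dumps(
        {"seed": seed, "hyperparams": hyperparams, "data_range": data_range},
        sort_keys=True,
    )
    record["config_id"] = hashlib.sha256(config_blob.encode("utf-8")).hexdigest()[:12]
    return record


record = build_run_record(
    hyperparams={"lr": 1e-3, "weight_decay": 1e-5, "patience": 15, "batch_size": 64},
    seed=42,
    data_range={"start": "2015-01-01", "end": "2023-12-31"},  # placeholder dates
    metrics={"val_loss": None},  # fill in after training
)
print(json.dumps(record, indent=2))
```

Because the ID depends only on the seed, hyperparameters, and data range, re-running the same configuration yields the same `config_id`, which makes it easy to match a stored model to the settings that produced it.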

---

## 12. Exercises

### Exercise 1: Walk-Forward Validation
Implement full walk-forward training:
- Train separate models for each fold
- Combine predictions across folds
- Report aggregate performance
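
As a possible starting point, the fold boundaries can be generated with a small index helper. This is a sketch under stated assumptions: `walk_forward_folds` is a hypothetical function that works independently of any splitter class defined earlier in the notebook, and it uses a rolling (fixed-length) training window with a gap before each test window.

```python
def walk_forward_folds(n_samples, train_size, test_size, gap=0):
    """Yield (train_indices, test_indices) for a rolling walk-forward scheme.

    Each test window starts `gap` samples after its training window ends,
    and the whole scheme rolls forward by one test window per fold.
    """
    start = 0
    while start + train_size + gap + test_size <= n_samples:
        train_end = start + train_size
        test_start = train_end + gap
        test_end = test_start + test_size
        yield range(start, train_end), range(test_start, test_end)
        start += test_size


folds = list(walk_forward_folds(n_samples=1000, train_size=500, test_size=100, gap=5))
for i, (train_idx, test_idx) in enumerate(folds):
    print(f"Fold {i}: train [{train_idx.start}, {train_idx.stop}) "
          f"-> test [{test_idx.start}, {test_idx.stop})")
```

For each fold you would fit the preprocessor and model on the training indices only, predict on the test indices, and then concatenate the out-of-sample predictions before computing aggregate metrics.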

### Exercise 2: Focal Loss Comparison
Compare BCE loss vs Focal Loss:
- Train identical models with each loss
- Analyze precision/recall tradeoffs
- Determine which works better for your data
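
Before training anything, it can help to see how the two losses differ on a single sample. The sketch below evaluates both in plain Python for a few values of `p_t`, the predicted probability of the true class. It assumes the unweighted case (no alpha term) with gamma = 2, and the function names are made up for this illustration.

```python
import math


def bce(p_t):
    """Binary cross-entropy for one sample, given the probability of the true class."""
    return -math.log(p_t)


def focal(p_t, gamma=2.0):
    """Focal loss for one sample: BCE scaled by (1 - p_t) ** gamma."""
    return (1 - p_t) ** gamma * bce(p_t)


# Easy, borderline, and badly misclassified examples
for p_t in (0.9, 0.6, 0.1):
    ratio = focal(p_t) / bce(p_t)
    print(f"p_t={p_t:.1f}  BCE={bce(p_t):.4f}  focal={focal(p_t):.4f}  ratio={ratio:.2f}")
```

The ratio equals `(1 - p_t) ** gamma`, so a confidently correct sample (`p_t = 0.9`) keeps only 1% of its BCE loss, while a badly misclassified one (`p_t = 0.1`) keeps 81%. This is how focal loss shifts the gradient budget toward hard examples.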

### Exercise 3: Learning Rate Finder
Implement a learning rate finder:
- Train with exponentially increasing LR
- Plot loss vs learning rate
- Find optimal starting LR
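
The "exponentially increasing LR" step can be sketched on its own, without a model. The helper below is a hypothetical building block, not a full solution: it only produces the learning rate for each iteration, with defaults chosen to match the starter function's signature.

```python
def exponential_lr_schedule(init_lr=1e-7, final_lr=1.0, num_iter=100):
    """Return num_iter learning rates spaced evenly on a log scale."""
    if num_iter < 2:
        raise ValueError("num_iter must be at least 2")
    ratio = final_lr / init_lr
    # Each step multiplies the LR by the same factor: ratio ** (1 / (num_iter - 1))
    return [init_lr * ratio ** (i / (num_iter - 1)) for i in range(num_iter)]


lrs = exponential_lr_schedule()
print(f"first={lrs[0]:.1e}, middle={lrs[len(lrs) // 2]:.1e}, last={lrs[-1]:.1e}")
```

Inside the finder loop you would assign `lrs[i]` to each of the optimizer's parameter groups before every batch, record the loss, and stop early if the loss diverges.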

### Exercise 4: European Market Strategy
Build a model for European markets:
- Use European stocks only
- Account for different trading hours
- Handle currency considerations
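
For the currency point, one common approach is to express every return in a single base currency before building features or evaluating a strategy. The sketch below shows the compounding arithmetic only. The function name and the example numbers are illustrative assumptions, not market data.

```python
def return_in_base_currency(local_return, fx_return):
    """Convert a local-currency return into the investor's base currency.

    fx_return is the return of the local currency measured in base-currency
    units over the same period (for example GBP priced in EUR).
    """
    return (1 + local_return) * (1 + fx_return) - 1


# Illustrative numbers: a GBP-listed stock gains 2% while GBP loses 1% against EUR
eur_return = return_in_base_currency(0.02, -0.01)
print(f"Return in EUR: {eur_return:.4%}")
```

The two effects compound rather than add, so the EUR return here is about 0.98%, slightly below the 1% that simple addition would suggest.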

In [None]:
# Exercise space
print("Ready for exercises!")

# Example: Learning Rate Finder starter
def lr_finder(model, train_loader, init_lr=1e-7, final_lr=1, num_iter=100):
    """
    Find optimal learning rate using the LR Range Test.
    
    Implementation hint:
    - Start with very low LR
    - Exponentially increase LR each iteration
    - Track loss at each step
    - Plot loss vs LR (use log scale)
    - Optimal LR is where loss decreases fastest
    """
    # Your code here...
    pass

---

## References

1. **Smith (2018)** - "A disciplined approach to neural network hyper-parameters" - Learning rate, batch size
2. **He et al. (2019)** - "Bag of Tricks for Image Classification" - Training best practices
3. **Lin et al. (2017)** - "Focal Loss for Dense Object Detection" - Focal loss
4. **López de Prado (2018)** - "Advances in Financial Machine Learning" - Walk-forward validation
5. **Goyal et al. (2017)** - "Accurate, Large Minibatch SGD" - Batch size effects

---

**Next:** Day 7 - Interview Review & Project