# ML Model Factory - Unified Pipeline

**Complete ML training pipeline for OHLCV time series with 13 models.**

## ✨ Features (Production Ready)

### 🤖 Model Support (13 Models)
- **Boosting (3):** XGBoost, LightGBM, CatBoost
- **Neural (4):** LSTM, GRU, TCN, Transformer
- **Classical (3):** Random Forest, Logistic Regression, SVM
- **Ensemble (3):** Voting, Stacking, Blending

### 🔬 Advanced Capabilities
- **Transformer Support:** Self-attention with 8 heads by default (4, 8, or 16 selectable) + attention visualization
- **Hyperparameter Tuning:** Optuna integration, 20 trials per model by default (`CV_N_TRIALS`)
- **Cross-Validation:** Purged K-fold for time series (prevents lookahead bias)
- **Ensemble Intelligence:** Diversity analysis, contribution metrics, production recommendations
- **Professional Export:** ONNX conversion, model cards, ZIP packages
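
The purged K-fold idea from the list above can be sketched as follows. This is a minimal, stdlib-only illustration under stated assumptions, not the pipeline's own implementation: the function name `purged_kfold_indices` and its parameters are hypothetical, folds are contiguous blocks, the purge is applied on both sides of each validation fold, and the embargo is applied only after it.

```python
from typing import Iterator, List, Tuple


def purged_kfold_indices(
    n_samples: int, n_splits: int = 5, purge: int = 60, embargo: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, val_idx) for contiguous folds with purge and embargo.

    Hypothetical helper for illustration only.
    """
    if n_splits < 2 or n_splits > n_samples:
        raise ValueError("n_splits must be between 2 and n_samples")
    fold_size, remainder = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # Spread any remainder over the first folds so sizes differ by at most 1.
        stop = start + fold_size + (1 if fold < remainder else 0)
        # Training rows before the fold stop `purge` bars early, so their
        # forward-looking labels cannot overlap the validation window.
        left_end = max(0, start - purge)
        # Training rows after the fold resume only after purge + embargo bars.
        right_start = min(n_samples, stop + purge + embargo)
        train_idx = list(range(0, left_end)) + list(range(right_start, n_samples))
        val_idx = list(range(start, stop))
        yield train_idx, val_idx
        start = stop


folds = list(purged_kfold_indices(n_samples=1000, n_splits=5, purge=60, embargo=100))
train_idx, val_idx = folds[1]
print(f"fold 1: {len(train_idx)} train rows, {len(val_idx)} val rows")
```

With a purge at least as long as the longest label horizon, no training label can be computed from bars that fall inside the validation fold.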

### 📊 Rich Visualizations
- Confusion matrices with class-wise accuracy
- Feature importance (boosting models)
- Learning curves (training/validation loss)
- Prediction distribution analysis
- Per-class precision/recall/F1 metrics
- **Transformer attention heatmaps** (NEW)

### 📈 Evaluation & Analysis
- Test set evaluation with generalization gap analysis
- Out-of-fold predictions for stacking
- Cross-validation with time series purge/embargo
- Ensemble diversity metrics (disagreement, correlation, Q-statistic)
- Production readiness scoring
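
Two of the diversity metrics named above, disagreement and the Q-statistic, have standard pairwise definitions based on which of two classifiers is correct on each sample. The sketch below is illustrative only; `pairwise_diversity` is a hypothetical helper and may differ from how the repository computes these values.

```python
from typing import Sequence, Tuple


def pairwise_diversity(
    y_true: Sequence[int], pred_a: Sequence[int], pred_b: Sequence[int]
) -> Tuple[float, float]:
    """Return (disagreement, q_statistic) for two classifiers.

    Hypothetical helper for illustration only.
    """
    if not (len(y_true) == len(pred_a) == len(pred_b)) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n11 = n10 = n01 = n00 = 0
    for truth, a, b in zip(y_true, pred_a, pred_b):
        a_ok, b_ok = a == truth, b == truth
        if a_ok and b_ok:
            n11 += 1  # both correct
        elif a_ok:
            n10 += 1  # only A correct
        elif b_ok:
            n01 += 1  # only B correct
        else:
            n00 += 1  # both wrong
    # Disagreement: share of samples where exactly one model is correct.
    disagreement = (n10 + n01) / len(y_true)
    # Q-statistic: +1 when the models err on the same samples,
    # negative when they tend to err on different samples.
    denom = n11 * n00 + n01 * n10
    q_stat = (n11 * n00 - n01 * n10) / denom if denom else float("nan")
    return disagreement, q_stat


y_true = [1, 0, -1, 1, 0, 1, -1, 0]
pred_a = [1, 0, -1, 0, 0, 1, 1, 0]
pred_b = [1, 1, -1, 1, 0, 0, 1, 0]
disagreement, q_stat = pairwise_diversity(y_true, pred_a, pred_b)
print(f"disagreement={disagreement:.3f}, Q={q_stat:.3f}")
```

Higher disagreement and lower Q generally indicate base models whose errors are less correlated, which is what an ensemble benefits from.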

## Pipeline Phases
1. **Configuration** - All settings in one place (13 model toggles)
2. **Environment Setup** - Auto-detects Colab vs Local
3. **Phase 1: Data Pipeline** - Clean → Features → Labels → Splits → Scale
4. **Phase 2: Model Training** - Train any of 13 model types
   - 4.1 Train Models
   - 4.2 Training Summary
   - 4.3 Visualizations (5 types)
   - 4.4 Transformer Attention (NEW)
   - 4.5 Test Set Performance
5. **Phase 3: Cross-Validation** - Robust evaluation with tuning (optional)
6. **Phase 4: Ensemble** - Combine models intelligently (optional)
7. **Results & Export** - Professional packages with ONNX

---

# 1. MASTER CONFIGURATION

**Configure ALL settings here. No need to modify any other cells.**
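
A quick way to sanity-check the form values before running later cells is sketched below. `check_config` is a hypothetical helper, not part of the notebook or the repository, and it assumes 5-minute bars when converting the embargo into days.

```python
import math
from typing import List, Sequence, Tuple


def check_config(
    horizons: Sequence[int],
    train_ratio: float,
    val_ratio: float,
    test_ratio: float,
    purge_bars: int,
    embargo_bars: int,
    bar_minutes: int = 5,  # assumption: bars are 5 minutes long
) -> Tuple[List[str], float]:
    """Return (problems, embargo_days) for a set of configuration values."""
    problems = []
    if not math.isclose(train_ratio + val_ratio + test_ratio, 1.0, abs_tol=1e-9):
        problems.append("split ratios must sum to 1.0")
    if purge_bars < max(horizons):
        problems.append("purge_bars is shorter than the longest label horizon")
    # Convert the embargo from bars to calendar days of continuous trading.
    embargo_days = embargo_bars * bar_minutes / (60 * 24)
    return problems, embargo_days


# Default values from the configuration form.
problems, embargo_days = check_config([5, 10, 15, 20], 0.70, 0.15, 0.15, 60, 1440)
print(problems, embargo_days)
```

An empty `problems` list means the ratios and purge length are internally consistent; it says nothing about whether the values suit a particular dataset.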

In [None]:
#@title 1.1 Master Configuration Panel { display-mode: "form" }
#@markdown ## Data Configuration
#@markdown ---

#@markdown ### Contract Selection
SYMBOL = "SI"  #@param ["SI", "MES", "MGC", "ES", "GC", "NQ", "CL", "HG", "ZB", "ZN"]
#@markdown Select ONE contract. Each contract is trained in complete isolation.
#@markdown - **SI** = Silver, **MES** = Micro E-mini S&P, **MGC** = Micro Gold
#@markdown - **ES** = E-mini S&P, **GC** = Gold, **NQ** = E-mini Nasdaq
#@markdown - **CL** = Crude Oil, **HG** = Copper, **ZB/ZN** = Bonds

#@markdown ### Date Range Selection
DATE_RANGE = "2019-2024"  #@param ["2019-2024", "2020-2024", "2021-2024", "2022-2024", "2023-2024", "Full Dataset"]
#@markdown Select the date range for your data

#@markdown ### Data Source
DRIVE_DATA_PATH = "research/data/raw"  #@param {type: "string"}
#@markdown Google Drive path relative to My Drive

#@markdown ### Custom Data File (optional)
CUSTOM_DATA_FILE = ""  #@param {type: "string"}
#@markdown Leave empty for auto-detection, or specify exact filename (e.g., `si_historical_2019_2024.parquet`)

#@markdown ---
#@markdown ## Pipeline Configuration

#@markdown ### Label Horizons (bars)
HORIZONS = "5,10,15,20"  #@param {type: "string"}
#@markdown Comma-separated prediction horizons

#@markdown ### Train/Val/Test Split Ratios
TRAIN_RATIO = 0.70  #@param {type: "number"}
VAL_RATIO = 0.15  #@param {type: "number"}
TEST_RATIO = 0.15  #@param {type: "number"}

#@markdown ### Leakage Prevention
PURGE_BARS = 60  #@param {type: "integer"}
#@markdown Bars to purge around train/val boundary (3x max horizon)
EMBARGO_BARS = 1440  #@param {type: "integer"}
#@markdown Embargo period after validation (1440 five-minute bars = 5 days of continuous trading)

#@markdown ---
#@markdown ## Model Training Configuration

#@markdown ### Training Horizon
TRAINING_HORIZON = 20  #@param [5, 10, 15, 20]
#@markdown Which horizon to train models on

#@markdown ### Model Selection
#@markdown #### Boosting Models
TRAIN_XGBOOST = True  #@param {type: "boolean"}
TRAIN_LIGHTGBM = True  #@param {type: "boolean"}
TRAIN_CATBOOST = True  #@param {type: "boolean"}

#@markdown #### Classical Models
TRAIN_RANDOM_FOREST = False  #@param {type: "boolean"}
TRAIN_LOGISTIC = False  #@param {type: "boolean"}
TRAIN_SVM = False  #@param {type: "boolean"}

#@markdown #### Neural Network Models
TRAIN_LSTM = False  #@param {type: "boolean"}
TRAIN_GRU = False  #@param {type: "boolean"}
TRAIN_TCN = False  #@param {type: "boolean"}
TRAIN_TRANSFORMER = False  #@param {type: "boolean"}

#@markdown #### Ensemble Models
TRAIN_VOTING = False  #@param {type: "boolean"}
TRAIN_STACKING = False  #@param {type: "boolean"}
TRAIN_BLENDING = False  #@param {type: "boolean"}

#@markdown ### Neural Network Settings
SEQUENCE_LENGTH = 60  #@param {type: "slider", min: 30, max: 120, step: 10}
BATCH_SIZE = 256  #@param [64, 128, 256, 512, 1024]
MAX_EPOCHS = 50  #@param {type: "integer"}
EARLY_STOPPING_PATIENCE = 10  #@param {type: "integer"}

#@markdown ### Transformer Settings (when enabled)
TRANSFORMER_SEQUENCE_LENGTH = 128  #@param {type: "integer"}
TRANSFORMER_N_HEADS = 8  #@param [4, 8, 16]
TRANSFORMER_N_LAYERS = 3  #@param [2, 3, 4, 6]
TRANSFORMER_D_MODEL = 256  #@param [128, 256, 512]

#@markdown ### Boosting Settings
N_ESTIMATORS = 500  #@param {type: "integer"}
BOOSTING_EARLY_STOPPING = 50  #@param {type: "integer"}

#@markdown ### Voting Ensemble Configuration (when enabled)
VOTING_BASE_MODELS = "xgboost,lightgbm,catboost"  #@param {type: "string"}
VOTING_WEIGHTS = ""  #@param {type: "string"}
#@markdown Leave weights empty for equal weighting

#@markdown ### Stacking Ensemble Configuration (when enabled)
STACKING_BASE_MODELS = "xgboost,lightgbm,lstm"  #@param {type: "string"}
STACKING_META_LEARNER = "logistic"  #@param ["logistic", "xgboost", "random_forest"]
STACKING_N_FOLDS = 5  #@param {type: "integer"}

#@markdown ### Blending Ensemble Configuration (when enabled)
BLENDING_BASE_MODELS = "xgboost,lightgbm,random_forest"  #@param {type: "string"}
BLENDING_META_LEARNER = "logistic"  #@param ["logistic", "xgboost", "random_forest"]
BLENDING_HOLDOUT_RATIO = 0.2  #@param {type: "number"}

#@markdown ---
#@markdown ## Optional Phases

#@markdown ### Cross-Validation
RUN_CROSS_VALIDATION = False  #@param {type: "boolean"}
CV_N_SPLITS = 5  #@param {type: "integer"}
CV_TUNE_HYPERPARAMS = False  #@param {type: "boolean"}
CV_N_TRIALS = 20  #@param {type: "integer"}

#@markdown ### Ensemble Training
TRAIN_ENSEMBLE = False  #@param {type: "boolean"}
ENSEMBLE_TYPE = "voting"  #@param ["voting", "stacking", "blending"]
ENSEMBLE_META_LEARNER = "logistic"  #@param ["logistic", "random_forest", "xgboost"]

#@markdown ---
#@markdown ## Execution Options

#@markdown ### What to Run
RUN_DATA_PIPELINE = True  #@param {type: "boolean"}
#@markdown Run Phase 1 data pipeline
RUN_MODEL_TRAINING = True  #@param {type: "boolean"}
#@markdown Run Phase 2 model training

#@markdown ### Memory Management
SAFE_MODE = False  #@param {type: "boolean"}
#@markdown Enable for low-memory environments (caps batch size, boosting iterations, and sequence lengths)

# ============================================================
# BUILD CONFIGURATION (DO NOT MODIFY BELOW)
# ============================================================

import os
from datetime import datetime

# Parse horizons (skip empty entries such as a trailing comma)
HORIZON_LIST = [int(h.strip()) for h in HORIZONS.split(',') if h.strip()]

# Parse date range
if DATE_RANGE == "Full Dataset":
    YEAR_START = None
    YEAR_END = None
else:
    years = DATE_RANGE.split('-')
    YEAR_START = int(years[0])
    YEAR_END = int(years[1])

# Build model list
MODELS_TO_TRAIN = []
if TRAIN_XGBOOST: MODELS_TO_TRAIN.append('xgboost')
if TRAIN_LIGHTGBM: MODELS_TO_TRAIN.append('lightgbm')
if TRAIN_CATBOOST: MODELS_TO_TRAIN.append('catboost')
if TRAIN_RANDOM_FOREST: MODELS_TO_TRAIN.append('random_forest')
if TRAIN_LOGISTIC: MODELS_TO_TRAIN.append('logistic')
if TRAIN_SVM: MODELS_TO_TRAIN.append('svm')
if TRAIN_LSTM: MODELS_TO_TRAIN.append('lstm')
if TRAIN_GRU: MODELS_TO_TRAIN.append('gru')
if TRAIN_TCN: MODELS_TO_TRAIN.append('tcn')
if TRAIN_TRANSFORMER: MODELS_TO_TRAIN.append('transformer')
if TRAIN_VOTING: MODELS_TO_TRAIN.append('voting')
if TRAIN_STACKING: MODELS_TO_TRAIN.append('stacking')
if TRAIN_BLENDING: MODELS_TO_TRAIN.append('blending')

# Date range will be auto-detected from data file
DATA_START = None  # Auto-detected
DATA_END = None    # Auto-detected

# Safe mode adjustments
if SAFE_MODE:
    BATCH_SIZE = min(BATCH_SIZE, 64)
    N_ESTIMATORS = min(N_ESTIMATORS, 300)
    SEQUENCE_LENGTH = min(SEQUENCE_LENGTH, 30)
    TRANSFORMER_SEQUENCE_LENGTH = min(TRANSFORMER_SEQUENCE_LENGTH, 64)

# Print configuration summary
print("=" * 70)
print(" ML PIPELINE CONFIGURATION")
print("=" * 70)
print(f"\n  Contract:        {SYMBOL}")
print(f"  Date Range:      {DATE_RANGE}")
if CUSTOM_DATA_FILE:
    print(f"  Custom File:     {CUSTOM_DATA_FILE}")
print(f"  Horizons:        {HORIZON_LIST}")
print(f"  Split Ratios:    {TRAIN_RATIO}/{VAL_RATIO}/{TEST_RATIO}")
print(f"  Training Horizon: H{TRAINING_HORIZON}")
print(f"  Models:          {MODELS_TO_TRAIN if MODELS_TO_TRAIN else 'None selected'}")
if MODELS_TO_TRAIN:
    boosting_models = [m for m in MODELS_TO_TRAIN if m in ['xgboost', 'lightgbm', 'catboost']]
    classical_models = [m for m in MODELS_TO_TRAIN if m in ['random_forest', 'logistic', 'svm']]
    neural_models = [m for m in MODELS_TO_TRAIN if m in ['lstm', 'gru', 'tcn', 'transformer']]
    ensemble_models = [m for m in MODELS_TO_TRAIN if m in ['voting', 'stacking', 'blending']]
    if boosting_models:
        print(f"    Boosting:      {boosting_models}")
    if classical_models:
        print(f"    Classical:     {classical_models}")
    if neural_models:
        print(f"    Neural:        {neural_models}")
    if ensemble_models:
        print(f"    Ensemble:      {ensemble_models}")
print(f"\n  Run Pipeline:    {RUN_DATA_PIPELINE}")
print(f"  Run Training:    {RUN_MODEL_TRAINING}")
print(f"  Cross-Validation: {RUN_CROSS_VALIDATION}")
print(f"  Ensemble:        {TRAIN_ENSEMBLE}")
print(f"  Safe Mode:       {SAFE_MODE}")
print("=" * 70)
print("\nConfiguration complete! Run the next cells sequentially.")

---
# 2. ENVIRONMENT SETUP

Auto-detects Colab vs Local environment and sets up paths.
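
The detection and path layout can also be expressed as two small functions, sketched below. `in_colab` and `resolve_paths` are hypothetical names; the paths mirror the ones used in the setup cell, and checking whether `google.colab` is importable is shown as an alternative to testing for the `/content` directory.

```python
import importlib.util
from pathlib import Path
from typing import Dict


def in_colab() -> bool:
    """Return True when the google.colab package can be imported."""
    try:
        return importlib.util.find_spec("google.colab") is not None
    except (ImportError, ValueError):
        # Raised when the parent `google` package is missing or malformed.
        return False


def resolve_paths(is_colab: bool, drive_data_path: str = "research/data/raw") -> Dict[str, Path]:
    """Return the project, raw-data, and results directories for the environment."""
    if is_colab:
        drive_root = Path("/content/drive/MyDrive")
        return {
            "project_root": Path("/content/research"),
            "raw_data_dir": drive_root / drive_data_path,
            "results_dir": drive_root / "research/experiments",
        }
    project_root = Path.cwd()
    return {
        "project_root": project_root,
        "raw_data_dir": project_root / "data/raw",
        "results_dir": project_root / "experiments",
    }


paths = resolve_paths(in_colab())
for name, path in paths.items():
    print(f"{name}: {path}")
```

Keeping the layout in one function makes it easier to reuse the same paths when cells are run out of order.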

In [None]:
#@title 2.1 Environment Detection & Setup { display-mode: "form" }

import os
import sys
import gc
from pathlib import Path

# ============================================================
# ENVIRONMENT DETECTION
# ============================================================
IS_COLAB = os.path.exists('/content')

print("=" * 70)
print(" ENVIRONMENT SETUP")
print("=" * 70)

if IS_COLAB:
    print("\n[Environment] Google Colab detected")
    
    # Mount Google Drive
    from google.colab import drive
    drive.mount('/content/drive')
    
    # Clone/update repository
    REPO_PATH = Path('/content/research')
    if not REPO_PATH.exists():
        print("\n[Setup] Cloning repository...")
        !git clone https://github.com/Snehpatel101/research.git /content/research
    else:
        print("\n[Setup] Updating repository...")
        !cd /content/research && git pull --quiet
    
    # Set paths
    PROJECT_ROOT = REPO_PATH
    DRIVE_ROOT = Path('/content/drive/MyDrive')
    RAW_DATA_DIR = DRIVE_ROOT / DRIVE_DATA_PATH
    RESULTS_DIR = DRIVE_ROOT / 'research/experiments'
    
    os.chdir(PROJECT_ROOT)
    
else:
    print("\n[Environment] Local environment detected")
    
    PROJECT_ROOT = Path.cwd()  # absolute path, consistent with the fallbacks in later cells
    DRIVE_ROOT = None
    RAW_DATA_DIR = PROJECT_ROOT / 'data/raw'
    RESULTS_DIR = PROJECT_ROOT / 'experiments'
    
    os.chdir(PROJECT_ROOT)

# Add to Python path
if str(PROJECT_ROOT) not in sys.path:
    sys.path.insert(0, str(PROJECT_ROOT))

# Create output directories
SPLITS_DIR = PROJECT_ROOT / 'data/splits/scaled'
EXPERIMENTS_DIR = RESULTS_DIR / 'runs'
EXPERIMENTS_DIR.mkdir(parents=True, exist_ok=True)

print(f"\n  Project Root:  {PROJECT_ROOT}")
print(f"  Raw Data:      {RAW_DATA_DIR}")
print(f"  Splits:        {SPLITS_DIR}")
print(f"  Experiments:   {EXPERIMENTS_DIR}")

In [None]:
#@title 2.2 Install Dependencies { display-mode: "form" }

if IS_COLAB:
    print("[Dependencies] Installing packages...")
    !pip install -q torch xgboost lightgbm catboost optuna ta pywavelets scikit-learn pandas numpy matplotlib tqdm pyarrow numba psutil
    print("[Dependencies] Installation complete!")
else:
    print("[Dependencies] Local environment - assuming packages installed.")
    print("  If needed: pip install torch xgboost lightgbm catboost optuna ta pywavelets scikit-learn pandas numpy matplotlib tqdm pyarrow numba psutil")

# Verify imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from tqdm.auto import tqdm
import warnings
warnings.filterwarnings('ignore')

print(f"\n  pandas: {pd.__version__}")
print(f"  numpy: {np.__version__}")

In [None]:
#@title 2.3 GPU Detection { display-mode: "form" }

import torch

GPU_AVAILABLE = torch.cuda.is_available()
GPU_NAME = None
GPU_MEMORY = 0

print("=" * 70)
print(" HARDWARE DETECTION")
print("=" * 70)

if GPU_AVAILABLE:
    props = torch.cuda.get_device_properties(0)
    GPU_NAME = props.name
    GPU_MEMORY = props.total_memory / (1024**3)
    
    print(f"\n  GPU: {GPU_NAME}")
    print(f"  Memory: {GPU_MEMORY:.1f} GB")
    print(f"  Compute: {props.major}.{props.minor}")
    
    # Adjust batch size based on GPU memory
    if GPU_MEMORY >= 40:
        RECOMMENDED_BATCH = 1024
    elif GPU_MEMORY >= 15:
        RECOMMENDED_BATCH = 512
    else:
        RECOMMENDED_BATCH = 256
    
    print(f"  Recommended batch: {RECOMMENDED_BATCH}")
else:
    print("\n  GPU: Not available (using CPU)")
    print("  Tip: Runtime -> Change runtime type -> GPU")
    RECOMMENDED_BATCH = 128

# Check for neural models without GPU (the Transformer is a neural model too)
NEURAL_MODELS = {'lstm', 'gru', 'tcn', 'transformer'}
selected_neural = set(MODELS_TO_TRAIN) & NEURAL_MODELS
if selected_neural and not GPU_AVAILABLE:
    print(f"\n  [WARNING] Neural models selected but no GPU: {selected_neural}")
    print("  Training will be slow on CPU.")

In [None]:
#@title 2.4 Memory Utilities { display-mode: "form" }

import psutil
import gc
import torch

# Ensure GPU_AVAILABLE is defined (in case cells run out of order)
if 'GPU_AVAILABLE' not in dir():
    GPU_AVAILABLE = torch.cuda.is_available()

def print_memory_status(label: str = "Current"):
    """Print current RAM and GPU memory usage."""
    print(f"\n--- Memory: {label} ---")
    
    # RAM
    ram = psutil.virtual_memory()
    print(f"RAM: {ram.used/1e9:.1f}GB / {ram.total/1e9:.1f}GB ({ram.percent}%)")
    
    # GPU
    if GPU_AVAILABLE:
        allocated = torch.cuda.memory_allocated() / 1e9
        reserved = torch.cuda.memory_reserved() / 1e9
        print(f"GPU: {allocated:.2f}GB allocated, {reserved:.2f}GB reserved")

def clear_memory():
    """Clear RAM and GPU memory."""
    gc.collect()
    if GPU_AVAILABLE:
        torch.cuda.empty_cache()
        torch.cuda.synchronize()
    print("Memory cleared.")

print("Memory utilities loaded.")
print_memory_status("Initial")

---
# 3. PHASE 1: DATA PIPELINE

Processes raw OHLCV data into training-ready datasets.

**Pipeline stages:**
1. Load raw 1-minute data
2. Clean and resample to 5-minute bars
3. Generate 150+ technical features
4. Apply triple-barrier labeling
5. Create train/val/test splits with purge/embargo
6. Scale features (train-only fit)
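
Stage 4 produces the Long/Neutral/Short labels (1, 0, -1) reported later in the notebook. The sketch below shows the general triple-barrier idea under simplifying assumptions: fixed percentage barriers and close prices only. The pipeline's own barrier widths and price fields are not shown in this notebook and may differ; `triple_barrier_labels` is a hypothetical helper.

```python
from typing import List, Sequence


def triple_barrier_labels(
    closes: Sequence[float], horizon: int, up: float, down: float
) -> List[int]:
    """Label each bar 1, -1, or 0 by which barrier is reached first.

    Bars without a full `horizon`-bar lookahead window are not labeled.
    """
    labels = []
    for i in range(len(closes) - horizon):
        entry = closes[i]
        upper = entry * (1.0 + up)    # profit-taking barrier
        lower = entry * (1.0 - down)  # stop-loss barrier
        label = 0                     # vertical barrier: neither level hit in time
        for price in closes[i + 1 : i + 1 + horizon]:
            if price >= upper:
                label = 1
                break
            if price <= lower:
                label = -1
                break
        labels.append(label)
    return labels


closes = [100.0, 100.5, 101.2, 100.8, 99.0, 98.5, 99.2, 100.1]
labels = triple_barrier_labels(closes, horizon=3, up=0.01, down=0.01)
print(labels)
```

Because each label looks `horizon` bars ahead, samples near a split boundary share future information with the next split, which is why stage 5 applies purge and embargo.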

In [None]:
#@title 3.1 Verify Raw Data & Detect Date Range { display-mode: "form" }

import os
import gc
import re
import pandas as pd
from pathlib import Path

# Ensure environment variables are defined (in case cells run out of order)
if 'IS_COLAB' not in dir():
    IS_COLAB = os.path.exists('/content')
if 'RAW_DATA_DIR' not in dir():
    if IS_COLAB:
        RAW_DATA_DIR = Path('/content/drive/MyDrive') / DRIVE_DATA_PATH
    else:
        RAW_DATA_DIR = Path('./data/raw')

print("=" * 70)
print(" RAW DATA VERIFICATION")
print("=" * 70)
print(f"\nLooking for {SYMBOL} data in: {RAW_DATA_DIR}")

# ============================================================
# FLEXIBLE FILE DETECTION
# ============================================================
RAW_DATA_FILE = None

# If custom file specified, use it directly
if CUSTOM_DATA_FILE:
    custom_path = RAW_DATA_DIR / CUSTOM_DATA_FILE
    if custom_path.exists():
        RAW_DATA_FILE = custom_path
        print(f"\n  Using custom file: {CUSTOM_DATA_FILE}")
    else:
        print(f"\n  [WARNING] Custom file not found: {CUSTOM_DATA_FILE}")

# Auto-detect file with flexible patterns
if RAW_DATA_FILE is None and RAW_DATA_DIR.exists():
    symbol_lower = SYMBOL.lower()
    symbol_upper = SYMBOL.upper()
    
    # Build list of all matching files
    matching_files = []
    
    for f in RAW_DATA_DIR.iterdir():
        if f.suffix not in ['.parquet', '.csv']:
            continue
        
        fname_lower = f.name.lower()
        
        # Match the symbol as a whole token (case-insensitive) so that e.g. "ES" does not match "MES" files
        if re.search(rf"(?<![a-z0-9]){re.escape(symbol_lower)}(?![a-z0-9])", fname_lower):
            # Priority scoring: prefer files with date range matching config
            priority = 0
            
            # Boost priority if filename contains the configured date range
            if YEAR_START and YEAR_END:
                date_pattern = f"{YEAR_START}_{YEAR_END}|{YEAR_START}-{YEAR_END}"
                if re.search(date_pattern, fname_lower):
                    priority += 10
            
            # Boost priority for common naming patterns
            if '_1m' in fname_lower or '_1min' in fname_lower:
                priority += 5
            if 'historical' in fname_lower:
                priority += 3
            if f.suffix == '.parquet':
                priority += 2  # Prefer parquet over CSV
            
            matching_files.append((priority, f))
    
    # Sort by priority (highest first) and pick best match
    if matching_files:
        matching_files.sort(key=lambda x: x[0], reverse=True)
        RAW_DATA_FILE = matching_files[0][1]
        
        if len(matching_files) > 1:
            print(f"\n  Found {len(matching_files)} matching files:")
            for pri, f in matching_files[:5]:
                marker = "→" if f == RAW_DATA_FILE else " "
                print(f"    {marker} {f.name} (priority: {pri})")

# ============================================================
# VALIDATE AND LOAD DATA
# ============================================================
if RAW_DATA_FILE:
    size_mb = RAW_DATA_FILE.stat().st_size / 1e6
    print(f"\n  Selected: {RAW_DATA_FILE.name} ({size_mb:.1f} MB)")
    
    # Load and validate
    print("  Loading data...")
    if RAW_DATA_FILE.suffix == '.parquet':
        df_raw = pd.read_parquet(RAW_DATA_FILE)
    else:
        df_raw = pd.read_csv(RAW_DATA_FILE)
    
    print(f"  Rows: {len(df_raw):,}")
    print(f"  Columns: {list(df_raw.columns)}")
    
    # Validate OHLCV columns (case-insensitive)
    required = {'open', 'high', 'low', 'close', 'volume'}
    found = {c.lower() for c in df_raw.columns}
    if required.issubset(found):
        print("  OHLCV columns: ✓ OK")
    else:
        missing = required - found
        print(f"  [ERROR] Missing columns: {missing}")
    
    # ============================================================
    # AUTO-DETECT DATE RANGE FROM DATA
    # ============================================================
    date_col = None
    for c in df_raw.columns:
        if 'date' in c.lower() or 'time' in c.lower():
            date_col = c
            break
    
    if date_col:
        df_raw[date_col] = pd.to_datetime(df_raw[date_col])
        
        # Store globally for pipeline use
        DATA_START = df_raw[date_col].min()
        DATA_END = df_raw[date_col].max()
        DATA_START_YEAR = DATA_START.year
        DATA_END_YEAR = DATA_END.year
        
        print(f"\n  [DATE RANGE DETECTED]")
        print(f"  Start: {DATA_START.strftime('%Y-%m-%d %H:%M')} ({DATA_START_YEAR})")
        print(f"  End:   {DATA_END.strftime('%Y-%m-%d %H:%M')} ({DATA_END_YEAR})")
        print(f"  Span:  {(DATA_END - DATA_START).days:,} days ({DATA_END_YEAR - DATA_START_YEAR + 1} years)")
        
        # Validate against configured date range
        if YEAR_START and YEAR_END:
            if DATA_START_YEAR <= YEAR_START and DATA_END_YEAR >= YEAR_END:
                print(f"  Config Match: ✓ Data covers {DATE_RANGE}")
            else:
                print(f"  [WARNING] Data range ({DATA_START_YEAR}-{DATA_END_YEAR}) differs from config ({DATE_RANGE})")
    else:
        print("  [WARNING] No datetime column found - falling back to configured years")
        DATA_START = None
        DATA_END = None
        DATA_START_YEAR = YEAR_START or 2019
        DATA_END_YEAR = YEAR_END or 2024
    
    del df_raw
    gc.collect()
    
    print("\n  ✓ Data verified and ready for processing!")
else:
    print(f"\n  [ERROR] No data file found for {SYMBOL}!")
    print(f"  Expected location: {RAW_DATA_DIR}")
    print(f"\n  Tried patterns matching '{SYMBOL}' (case-insensitive)")
    print(f"\n  Available files in directory:")
    if RAW_DATA_DIR.exists():
        for f in sorted(RAW_DATA_DIR.iterdir()):
            if f.suffix in ['.csv', '.parquet']:
                print(f"    - {f.name}")
    else:
        print(f"    Directory does not exist!")
    
    print(f"\n  [FIX] Set CUSTOM_DATA_FILE in Section 1 to your exact filename")
    
    RAW_DATA_FILE = None
    DATA_START = None
    DATA_END = None

In [None]:
#@title 3.2 Run Data Pipeline { display-mode: "form" }

import os
import gc
import time
import shutil
from pathlib import Path
from datetime import datetime

# Ensure environment variables are defined
if 'IS_COLAB' not in dir():
    IS_COLAB = os.path.exists('/content')
if 'PROJECT_ROOT' not in dir():
    PROJECT_ROOT = Path('/content/research') if IS_COLAB else Path.cwd()
if 'SPLITS_DIR' not in dir():
    SPLITS_DIR = PROJECT_ROOT / 'data/splits/scaled'
if 'RUN_DATA_PIPELINE' not in dir():
    RUN_DATA_PIPELINE = True
if 'RAW_DATA_FILE' not in dir():
    RAW_DATA_FILE = None

if not RUN_DATA_PIPELINE:
    print("[Skipped] Data pipeline disabled in configuration.")
    print("Set RUN_DATA_PIPELINE = True in Section 1 to enable.")
elif RAW_DATA_FILE is None:
    print("[Error] No raw data file found. Cannot run pipeline.")
    print("Run Section 3.1 first to detect data files.")
else:
    import pandas as pd
    
    print("=" * 70)
    print(" PHASE 1: DATA PIPELINE")
    print("=" * 70)
    print(f"\n  Symbol: {SYMBOL}")
    
    # ============================================================
    # COPY DATA FROM DRIVE TO PROJECT (Colab only)
    # ============================================================
    if IS_COLAB and RAW_DATA_FILE is not None:
        # The pipeline expects data in PROJECT_ROOT/data/raw/
        project_raw_dir = PROJECT_ROOT / 'data/raw'
        project_raw_dir.mkdir(parents=True, exist_ok=True)
        
        # Create standardized filename for pipeline: SYMBOL_1m.<ext> (keeps the source file's extension)
        target_filename = f"{SYMBOL}_1m{RAW_DATA_FILE.suffix}"
        target_path = project_raw_dir / target_filename
        
        # Only copy if source is not already in project directory
        source_in_project = str(RAW_DATA_FILE).startswith(str(PROJECT_ROOT))
        if not source_in_project:
            if not target_path.exists() or target_path.stat().st_size != RAW_DATA_FILE.stat().st_size:
                print(f"\n  [Setup] Copying data from Drive to project...")
                print(f"    From: {RAW_DATA_FILE}")
                print(f"    To:   {target_path}")
                shutil.copy2(RAW_DATA_FILE, target_path)
                print(f"    Done! ({target_path.stat().st_size / 1e6:.1f} MB)")
            else:
                print(f"\n  [Setup] Data already in project: {target_path.name}")
        else:
            print(f"\n  [Setup] Data already in project directory")
    
    # Use auto-detected date range
    if 'DATA_START' in dir() and DATA_START is not None:
        print(f"  Date Range: {DATA_START.strftime('%Y-%m-%d')} to {DATA_END.strftime('%Y-%m-%d')} (auto-detected)")
        start_date_str = DATA_START.strftime('%Y-%m-%d')
        end_date_str = DATA_END.strftime('%Y-%m-%d')
    else:
        print(f"  Date Range: Full dataset (no filter)")
        start_date_str = None
        end_date_str = None
    
    print(f"  Horizons: {HORIZON_LIST}")
    print(f"  Purge: {PURGE_BARS} bars, Embargo: {EMBARGO_BARS} bars")
    
    start_time = time.time()
    
    try:
        from src.phase1.pipeline_config import PipelineConfig
        from src.pipeline.runner import PipelineRunner
        
        # Configure pipeline with auto-detected dates
        # NOTE: auto_scale_purge_embargo=False uses our explicit PURGE_BARS/EMBARGO_BARS
        config = PipelineConfig(
            symbols=[SYMBOL],
            project_root=PROJECT_ROOT,
            label_horizons=HORIZON_LIST,
            train_ratio=TRAIN_RATIO,
            val_ratio=VAL_RATIO,
            test_ratio=TEST_RATIO,
            purge_bars=PURGE_BARS,
            embargo_bars=EMBARGO_BARS,
            start_date=start_date_str,
            end_date=end_date_str,
            allow_batch_symbols=False,  # Single-contract architecture
            auto_scale_purge_embargo=False,  # Use explicit purge/embargo values
        )
        
        # Run pipeline
        runner = PipelineRunner(config)
        success = runner.run()
        
        elapsed = time.time() - start_time
        
        if success:
            print(f"\n  Pipeline completed in {elapsed/60:.1f} minutes")
            
            # Verify output
            if (SPLITS_DIR / 'train_scaled.parquet').exists():
                for split in ['train', 'val', 'test']:
                    df = pd.read_parquet(SPLITS_DIR / f'{split}_scaled.parquet')
                    print(f"  {split}: {len(df):,} samples")
                    del df
                gc.collect()
                print("\n  Data ready for training!")
        else:
            print("\n  [ERROR] Pipeline failed. Check logs above.")
        
        del runner, config
        if 'clear_memory' in dir():
            clear_memory()
        else:
            gc.collect()
        
    except Exception as e:
        print(f"\n  [ERROR] Pipeline failed: {e}")
        import traceback
        traceback.print_exc()

In [None]:
#@title 3.3 Verify Processed Data { display-mode: "form" }

import os
import gc
import pandas as pd
from pathlib import Path

# Ensure environment variables are defined (in case cells run out of order)
if 'IS_COLAB' not in dir():
    IS_COLAB = os.path.exists('/content')
if 'PROJECT_ROOT' not in dir():
    PROJECT_ROOT = Path('/content/research') if IS_COLAB else Path.cwd()
if 'SPLITS_DIR' not in dir():
    SPLITS_DIR = PROJECT_ROOT / 'data/splits/scaled'

print("=" * 70)
print(" PROCESSED DATA VERIFICATION")
print("=" * 70)

# Check for pre-processed data (local) or pipeline output (Colab)
if not IS_COLAB:
    # Local: check pre-processed data
    local_splits = PROJECT_ROOT / 'data/splits/final_correct/scaled'
    if (local_splits / 'train_scaled.parquet').exists():
        SPLITS_DIR = local_splits
        print(f"\nUsing pre-processed data: {SPLITS_DIR}")

if (SPLITS_DIR / 'train_scaled.parquet').exists():
    # Load metadata without keeping DataFrames
    train_df = pd.read_parquet(SPLITS_DIR / 'train_scaled.parquet')
    
    FEATURE_COLS = [c for c in train_df.columns 
                   if not c.startswith(('label_', 'sample_weight', 'quality_', 'datetime', 'symbol'))]
    LABEL_COLS = [c for c in train_df.columns if c.startswith('label_')]
    TRAIN_LEN = len(train_df)
    
    # Label distribution with safety check
    label_dists = {}
    for col in LABEL_COLS:
        label_dists[col] = train_df[col].value_counts().sort_index().to_dict()
    
    del train_df
    
    # Get val/test sizes
    val_df = pd.read_parquet(SPLITS_DIR / 'val_scaled.parquet')
    VAL_LEN = len(val_df)
    del val_df
    
    test_df = pd.read_parquet(SPLITS_DIR / 'test_scaled.parquet')
    TEST_LEN = len(test_df)
    del test_df
    
    gc.collect()
    
    print(f"\nDataset Summary:")
    print(f"  Train: {TRAIN_LEN:,} samples")
    print(f"  Val:   {VAL_LEN:,} samples")
    print(f"  Test:  {TEST_LEN:,} samples")
    print(f"  Total: {TRAIN_LEN + VAL_LEN + TEST_LEN:,} samples")
    print(f"\n  Features: {len(FEATURE_COLS)}")
    print(f"  Labels: {LABEL_COLS}")
    
    print(f"\nLabel Distribution (train):")
    for col, dist in label_dists.items():
        total = sum(dist.values())
        if total == 0:
            print(f"  {col}: No valid samples!")
            continue
        long_pct = dist.get(1, 0) / total * 100
        neutral_pct = dist.get(0, 0) / total * 100
        short_pct = dist.get(-1, 0) / total * 100
        print(f"  {col}: Long={long_pct:.1f}% | Neutral={neutral_pct:.1f}% | Short={short_pct:.1f}%")
    
    # Validate TRAINING_HORIZON is in available labels
    if 'TRAINING_HORIZON' in dir() and 'HORIZON_LIST' in dir():
        if TRAINING_HORIZON not in HORIZON_LIST:
            print(f"\n  [WARNING] TRAINING_HORIZON={TRAINING_HORIZON} not in HORIZON_LIST={HORIZON_LIST}")
            print(f"  Model training may fail. Update TRAINING_HORIZON in Section 1.")
    
    DATA_READY = True
    print("\n  Data verified and ready for training!")
else:
    print("\n[ERROR] Processed data not found!")
    print(f"  Expected: {SPLITS_DIR}/train_scaled.parquet")
    print("  Run Section 3.2 to process raw data.")
    DATA_READY = False

---
# 4. PHASE 2: MODEL TRAINING

Train selected models on the processed data.

In [None]:
#@title 4.1 Train Models { display-mode: "form" }

import os
import gc
import time
import json
from pathlib import Path

# Ensure environment variables are defined
if 'IS_COLAB' not in dir():
    IS_COLAB = os.path.exists('/content')
if 'PROJECT_ROOT' not in dir():
    PROJECT_ROOT = Path('/content/research') if IS_COLAB else Path.cwd()
if 'SPLITS_DIR' not in dir():
    SPLITS_DIR = PROJECT_ROOT / 'data/splits/scaled'
if 'EXPERIMENTS_DIR' not in dir():
    EXPERIMENTS_DIR = PROJECT_ROOT / 'experiments/runs'
if 'RUN_MODEL_TRAINING' not in dir():
    RUN_MODEL_TRAINING = True
if 'DATA_READY' not in dir():
    DATA_READY = (SPLITS_DIR / 'train_scaled.parquet').exists()
if 'MODELS_TO_TRAIN' not in dir():
    MODELS_TO_TRAIN = ['xgboost', 'lightgbm', 'catboost']
if 'HORIZON_LIST' not in dir():
    HORIZON_LIST = [5, 10, 15, 20]
if 'TRAINING_HORIZON' not in dir():
    TRAINING_HORIZON = 20  # same default as Section 1
if 'GPU_AVAILABLE' not in dir():
    import torch
    GPU_AVAILABLE = torch.cuda.is_available()

# Define clear_memory if not available
if 'clear_memory' not in dir():
    def clear_memory():
        gc.collect()
        if GPU_AVAILABLE:
            import torch
            torch.cuda.empty_cache()

# Validate training horizon before starting
horizon_valid = True
if 'TRAINING_HORIZON' in dir() and TRAINING_HORIZON not in HORIZON_LIST:
    print(f"[ERROR] TRAINING_HORIZON={TRAINING_HORIZON} not in processed horizons {HORIZON_LIST}")
    print(f"  Update TRAINING_HORIZON in Section 1 to one of: {HORIZON_LIST}")
    horizon_valid = False

if not RUN_MODEL_TRAINING:
    print("[Skipped] Model training disabled in configuration.")
elif not DATA_READY:
    print("[Error] Data not ready. Run Section 3 first.")
elif not MODELS_TO_TRAIN:
    print("[Error] No models selected. Enable models in Section 1.")
elif not horizon_valid:
    print("[Error] Invalid training horizon. See error above.")
else:
    print("=" * 70)
    print(" PHASE 2: MODEL TRAINING")
    print("=" * 70)
    print(f"\n  Models: {MODELS_TO_TRAIN}")
    print(f"  Horizon: H{TRAINING_HORIZON}")
    
    # Initialize results dict before training loop
    TRAINING_RESULTS = {}
    
    try:
        from src.models import ModelRegistry, Trainer, TrainerConfig
        from src.phase1.stages.datasets.container import TimeSeriesDataContainer
        
        # Load data container
        print("\nLoading data...")
        container = TimeSeriesDataContainer.from_parquet_dir(
            path=SPLITS_DIR,
            horizon=TRAINING_HORIZON
        )
        print(f"  Train: {container.splits['train'].n_samples:,}")
        print(f"  Val: {container.splits['val'].n_samples:,}")
        
        # Train each model with per-model error handling
        for i, model_name in enumerate(MODELS_TO_TRAIN, 1):
            print(f"\n{'='*60}")
            print(f" [{i}/{len(MODELS_TO_TRAIN)}] Training: {model_name.upper()}")
            print("=" * 60)
            
            clear_memory()
            start_time = time.time()
            
            try:
                # Configure model
                if model_name in ['lstm', 'gru', 'tcn', 'transformer']:
                    config = TrainerConfig(
                        model_name=model_name,
                        horizon=TRAINING_HORIZON,
                        sequence_length=SEQUENCE_LENGTH,
                        batch_size=BATCH_SIZE,
                        max_epochs=MAX_EPOCHS,
                        early_stopping_patience=EARLY_STOPPING_PATIENCE,
                        output_dir=EXPERIMENTS_DIR,
                        device="cuda" if GPU_AVAILABLE else "cpu",
                    )
                elif model_name == 'catboost':
                    config = TrainerConfig(
                        model_name=model_name,
                        horizon=TRAINING_HORIZON,
                        output_dir=EXPERIMENTS_DIR,
                        model_config={
                            "iterations": N_ESTIMATORS,
                            "early_stopping_rounds": BOOSTING_EARLY_STOPPING,
                            "use_gpu": False,
                            "task_type": "CPU",
                            "verbose": False,
                        },
                    )
                else:
                    config = TrainerConfig(
                        model_name=model_name,
                        horizon=TRAINING_HORIZON,
                        output_dir=EXPERIMENTS_DIR,
                        model_config={
                            "n_estimators": N_ESTIMATORS,
                            "early_stopping_rounds": BOOSTING_EARLY_STOPPING,
                        } if model_name in ['xgboost', 'lightgbm'] else None,
                    )
                
                # Train
                trainer = Trainer(config)
                results = trainer.run(container)
                elapsed = time.time() - start_time
                
                # Store results
                metrics = results.get('evaluation_metrics', {})
                TRAINING_RESULTS[model_name] = {
                    'metrics': metrics,
                    'time': elapsed,
                    'run_id': results.get('run_id', 'unknown'),
                }
                
                print(f"\n  Accuracy: {metrics.get('accuracy', 0):.2%}")
                print(f"  Macro F1: {metrics.get('macro_f1', 0):.4f}")
                print(f"  Time: {elapsed:.1f}s")
                
                del trainer, config
                
            except Exception as model_error:
                # Per-model error handling - continue to next model
                elapsed = time.time() - start_time
                print(f"\n  [ERROR] {model_name} training failed: {model_error}")
                TRAINING_RESULTS[model_name] = {
                    'metrics': {},
                    'time': elapsed,
                    'run_id': 'failed',
                    'error': str(model_error),
                }
                import traceback
                traceback.print_exc()
            
            clear_memory()
        
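        # Save results. default=str stringifies values json cannot encode
        # (for example NumPy scalars) instead of aborting the dump.
        EXPERIMENTS_DIR.mkdir(parents=True, exist_ok=True)
        results_file = EXPERIMENTS_DIR / 'training_results.json'
        with open(results_file, 'w') as f:
            json.dump(TRAINING_RESULTS, f, indent=2, default=str)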
        
        # Summary
        successful = [m for m, r in TRAINING_RESULTS.items() if r.get('run_id') != 'failed']
        failed = [m for m, r in TRAINING_RESULTS.items() if r.get('run_id') == 'failed']
        print(f"\n  Completed: {len(successful)}/{len(MODELS_TO_TRAIN)} models")
        if failed:
            print(f"  Failed: {failed}")
        print(f"\nResults saved to: {results_file}")
        
        del container
        clear_memory()
        
    except Exception as e:
        print(f"\n[ERROR] Training setup failed: {e}")
        import traceback
        traceback.print_exc()
        clear_memory()
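
Section 4.1 writes `training_results.json` to `EXPERIMENTS_DIR` and marks a failed run with `run_id == 'failed'`. The sketch below shows one way to read that file back in a fresh session and split the model names by outcome. The helper name `load_training_results` is hypothetical and not part of the project; it assumes only the JSON layout produced by the cell above.

```python
import json
from pathlib import Path


def load_training_results(results_file):
    """Read a training_results.json file and split model names by outcome.

    Hypothetical helper: returns (results, successful, failed), or empty
    containers when the file does not exist.
    """
    results_file = Path(results_file)
    if not results_file.exists():
        return {}, [], []

    with open(results_file, "r") as f:
        results = json.load(f)

    # Section 4.1 stores run_id == 'failed' for models whose training raised.
    failed = [m for m, r in results.items() if r.get("run_id") == "failed"]
    successful = [m for m in results if m not in failed]
    return results, successful, failed
```

A later cell could then call `TRAINING_RESULTS, ok, bad = load_training_results(EXPERIMENTS_DIR / 'training_results.json')` instead of retraining, provided the run directories referenced by each `run_id` still exist.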

In [None]:
#@title 4.2 Compare Models { display-mode: "form" }

import pandas as pd
import matplotlib.pyplot as plt

# Ensure TRAINING_RESULTS is defined
if 'TRAINING_RESULTS' not in dir():
    TRAINING_RESULTS = {}

if TRAINING_RESULTS:
    print("=" * 70)
    print(" MODEL COMPARISON")
    print("=" * 70)
    
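    # Build comparison table from successful runs only; failed runs carry no metrics
    failed_models = [m for m, d in TRAINING_RESULTS.items() if d.get('run_id') == 'failed']
    rows = []
    for model, data in TRAINING_RESULTS.items():
        if model in failed_models:
            continue
        metrics = data.get('metrics', {})
        rows.append({
            'Model': model,
            'Accuracy': metrics.get('accuracy', 0),
            'Macro F1': metrics.get('macro_f1', 0),
            'Weighted F1': metrics.get('weighted_f1', 0),
            'Time (s)': data.get('time', 0),
        })
    
    columns = ['Model', 'Accuracy', 'Macro F1', 'Weighted F1', 'Time (s)']
    comparison_df = pd.DataFrame(rows, columns=columns)
    comparison_df = comparison_df.sort_values('Macro F1', ascending=False)
    
    print("\n")
    print(comparison_df.to_string(index=False))
    if failed_models:
        print(f"\n  Failed (excluded): {failed_models}")
    
    # Best model (ranked by Macro F1)
    if not comparison_df.empty:
        best_model = comparison_df.iloc[0]['Model']
        best_f1 = comparison_df.iloc[0]['Macro F1']
        print(f"\n  Best Model: {best_model} (F1: {best_f1:.4f})")
    
    # Visualization
    if len(comparison_df) > 1: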
        fig, axes = plt.subplots(1, 2, figsize=(12, 4))
        
        # Accuracy comparison
        sorted_df = comparison_df.sort_values('Accuracy', ascending=True)
        axes[0].barh(sorted_df['Model'], sorted_df['Accuracy'], color='steelblue')
        axes[0].set_xlabel('Accuracy')
        axes[0].set_title('Model Accuracy')
        axes[0].set_xlim(0, 1)
        
        # Training time
        sorted_df = comparison_df.sort_values('Time (s)', ascending=True)
        axes[1].barh(sorted_df['Model'], sorted_df['Time (s)'], color='coral')
        axes[1].set_xlabel('Training Time (seconds)')
        axes[1].set_title('Training Time')
        
        plt.tight_layout()
        plt.show()
else:
    print("No training results available.")
    print("Run Section 4.1 to train models.")
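
The comparison table ranks models by Macro F1 and also reports Weighted F1. As a worked illustration of why the two can diverge on imbalanced labels, here is a minimal pure-Python sketch using the notebook's label convention (-1 short, 0 neutral, 1 long). The function names are hypothetical, and the pipeline's own metrics come from the trainer rather than from this code.

```python
def per_class_f1(y_true, y_pred, labels=(-1, 0, 1)):
    """Return {label: (f1, support)} computed from two equal-length label lists."""
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # A class that is never predicted and never present scores 0.
        f1 = 2 * tp / denom if denom else 0.0
        scores[label] = (f1, tp + fn)
    return scores


def macro_and_weighted_f1(y_true, y_pred, labels=(-1, 0, 1)):
    """Macro F1 averages classes equally; weighted F1 weights them by support."""
    scores = per_class_f1(y_true, y_pred, labels)
    macro = sum(f1 for f1, _ in scores.values()) / len(labels)
    total = sum(support for _, support in scores.values())
    weighted = sum(f1 * s for f1, s in scores.values()) / total if total else 0.0
    return macro, weighted


# A model that always predicts "neutral" on a neutral-heavy sample:
y_true = [0] * 8 + [1, -1]
y_pred = [0] * 10
macro, weighted = macro_and_weighted_f1(y_true, y_pred)
print(f"macro={macro:.4f} weighted={weighted:.4f}")
```

In this example the accuracy is 80%, yet the macro score stays low because the short and long classes are never predicted, which is why sorting by Macro F1 penalizes a model that collapses onto the majority class.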

In [None]:
#@title 4.3 Visualize Training Results { display-mode: "form" }

import os
import json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pathlib import Path
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, precision_recall_fscore_support

# Visualization toggles
show_confusion_matrix = True  #@param {type: "boolean"}
show_feature_importance = True  #@param {type: "boolean"}
show_learning_curves = True  #@param {type: "boolean"}
show_prediction_dist = True  #@param {type: "boolean"}
show_per_class_metrics = True  #@param {type: "boolean"}

# Ensure environment variables are defined
if 'IS_COLAB' not in dir():
    IS_COLAB = os.path.exists('/content')
if 'PROJECT_ROOT' not in dir():
    PROJECT_ROOT = Path('/content/research') if IS_COLAB else Path.cwd()
if 'EXPERIMENTS_DIR' not in dir():
    EXPERIMENTS_DIR = PROJECT_ROOT / 'experiments/runs'
if 'TRAINING_RESULTS' not in dir():
    TRAINING_RESULTS = {}

# Check if we have training results
if not TRAINING_RESULTS:
    print("No training results available.")
    print("Run Section 4.1 to train models first.")
else:
    print("=" * 70)
    print(" TRAINING VISUALIZATIONS")
    print("=" * 70)
    
    # Filter successful models only
    successful_models = {
        name: data for name, data in TRAINING_RESULTS.items()
        if data.get('run_id') != 'failed' and data.get('metrics')
    }
    
    if not successful_models:
        print("\nNo successful models to visualize.")
        print("All models failed during training.")
    else:
        print(f"\nVisualizing {len(successful_models)} models: {list(successful_models.keys())}")
        
        # ============================================================
        # 1. CONFUSION MATRICES
        # ============================================================
        if show_confusion_matrix:
            print("\n[1/5] Generating confusion matrices...")
            
            n_models = len(successful_models)
            n_cols = min(3, n_models)
            n_rows = (n_models + n_cols - 1) // n_cols
            
            fig, axes = plt.subplots(n_rows, n_cols, figsize=(5*n_cols, 4*n_rows))
            if n_models == 1:
                axes = np.array([axes])
            axes = axes.flatten()
            
            for idx, (model_name, data) in enumerate(successful_models.items()):
                run_id = data.get('run_id', 'unknown')
                predictions_file = EXPERIMENTS_DIR / run_id / 'predictions.json'
                
                if predictions_file.exists():
                    with open(predictions_file, 'r') as f:
                        pred_data = json.load(f)
                    
                    y_true = np.array(pred_data.get('y_true', []))
                    y_pred = np.array(pred_data.get('y_pred', []))
                    
                    if len(y_true) > 0 and len(y_pred) > 0:
                        cm = confusion_matrix(y_true, y_pred, labels=[-1, 0, 1])
                        disp = ConfusionMatrixDisplay(
                            confusion_matrix=cm,
                            display_labels=['Short', 'Neutral', 'Long']
                        )
                        disp.plot(ax=axes[idx], cmap='Blues', values_format='d')
                        axes[idx].set_title(f'{model_name.upper()}', fontweight='bold')
                    else:
                        axes[idx].text(0.5, 0.5, 'No predictions available',
                                     ha='center', va='center')
                        axes[idx].set_title(f'{model_name.upper()}')
                else:
                    axes[idx].text(0.5, 0.5, 'Predictions not found',
                                 ha='center', va='center')
                    axes[idx].set_title(f'{model_name.upper()}')
            
            # Hide extra subplots
            for idx in range(n_models, len(axes)):
                axes[idx].axis('off')
            
            plt.tight_layout()
            plt.show()
        
        # ============================================================
        # 2. FEATURE IMPORTANCE (Top 20)
        # ============================================================
        if show_feature_importance:
            print("\n[2/5] Generating feature importance plots...")
            
            # Models that support feature importance
            boosting_models = ['xgboost', 'lightgbm', 'catboost']
            classical_models = ['random_forest']
            
            fi_models = {
                name: data for name, data in successful_models.items()
                if name in boosting_models + classical_models
            }
            
            if fi_models:
                n_models = len(fi_models)
                n_cols = min(2, n_models)
                n_rows = (n_models + n_cols - 1) // n_cols
                
                fig, axes = plt.subplots(n_rows, n_cols, figsize=(10*n_cols, 6*n_rows))
                if n_models == 1:
                    axes = np.array([axes])
                axes = axes.flatten()
                
                for idx, (model_name, data) in enumerate(fi_models.items()):
                    run_id = data.get('run_id', 'unknown')
                    fi_file = EXPERIMENTS_DIR / run_id / 'feature_importance.json'
                    
                    if fi_file.exists():
                        with open(fi_file, 'r') as f:
                            fi_data = json.load(f)
                        
                        # Convert to DataFrame and sort
                        fi_df = pd.DataFrame(list(fi_data.items()),
                                            columns=['feature', 'importance'])
                        fi_df = fi_df.sort_values('importance', ascending=False).head(20)
                        
                        # Plot
                        axes[idx].barh(range(len(fi_df)), fi_df['importance'], color='steelblue')
                        axes[idx].set_yticks(range(len(fi_df)))
                        axes[idx].set_yticklabels(fi_df['feature'], fontsize=8)
                        axes[idx].invert_yaxis()
                        axes[idx].set_xlabel('Importance', fontweight='bold')
                        axes[idx].set_title(f'{model_name.upper()} - Top 20 Features',
                                          fontweight='bold')
                        axes[idx].grid(axis='x', alpha=0.3)
                    else:
                        axes[idx].text(0.5, 0.5, 'Feature importance not available',
                                     ha='center', va='center')
                        axes[idx].set_title(f'{model_name.upper()}')
                
                # Hide extra subplots
                for idx in range(n_models, len(axes)):
                    axes[idx].axis('off')
                
                plt.tight_layout()
                plt.show()
            else:
                print("  No models with feature importance (boosting/classical models only)")
        
        # ============================================================
        # 3. LEARNING CURVES (Neural Models)
        # ============================================================
        if show_learning_curves:
            print("\n[3/5] Generating learning curves...")
            
            # Neural models that have training history
            neural_models = ['lstm', 'gru', 'tcn', 'transformer']
            
            lc_models = {
                name: data for name, data in successful_models.items()
                if name in neural_models
            }
            
            if lc_models:
                n_models = len(lc_models)
                
                fig, axes = plt.subplots(n_models, 2, figsize=(12, 4*n_models))
                if n_models == 1:
                    axes = axes.reshape(1, -1)
                
                for idx, (model_name, data) in enumerate(lc_models.items()):
                    run_id = data.get('run_id', 'unknown')
                    history_file = EXPERIMENTS_DIR / run_id / 'training_history.json'
                    
                    if history_file.exists():
                        with open(history_file, 'r') as f:
                            history = json.load(f)
                        
                        # Plot loss (column 0) and accuracy (column 1). Each series gets its
                        # own epoch axis, so a missing or shorter series cannot raise a
                        # length-mismatch error in plot().
                        panels = [('loss', 'Loss'), ('acc', 'Accuracy')]
                        for col, (key, label) in enumerate(panels):
                            for split, color in [('train', 'steelblue'), ('val', 'coral')]:
                                series = history.get(f'{split}_{key}', [])
                                axes[idx, col].plot(range(1, len(series) + 1), series,
                                                    label=split.capitalize(), linewidth=2,
                                                    color=color)
                            axes[idx, col].set_xlabel('Epoch', fontweight='bold')
                            axes[idx, col].set_ylabel(label, fontweight='bold')
                            axes[idx, col].set_title(f'{model_name.upper()} - {label}',
                                                     fontweight='bold')
                            axes[idx, col].legend()
                            axes[idx, col].grid(alpha=0.3)
                    else:
                        for col in [0, 1]:
                            axes[idx, col].text(0.5, 0.5, 'History not available',
                                              ha='center', va='center')
                            axes[idx, col].set_title(f'{model_name.upper()}')
                
                plt.tight_layout()
                plt.show()
            else:
                print("  No neural models with training history")
        
        # ============================================================
        # 4. PREDICTION DISTRIBUTION
        # ============================================================
        if show_prediction_dist:
            print("\n[4/5] Generating prediction distribution...")
            
            fig, ax = plt.subplots(figsize=(10, 6))
            
            # Prepare data for stacked bar chart
            model_names = []
            long_counts = []
            neutral_counts = []
            short_counts = []
            
            for model_name, data in successful_models.items():
                run_id = data.get('run_id', 'unknown')
                predictions_file = EXPERIMENTS_DIR / run_id / 'predictions.json'
                
                if predictions_file.exists():
                    with open(predictions_file, 'r') as f:
                        pred_data = json.load(f)
                    
                    y_pred = np.array(pred_data.get('y_pred', []))
                    
                    if len(y_pred) > 0:
                        unique, counts = np.unique(y_pred, return_counts=True)
                        count_dict = dict(zip(unique, counts))
                        total = len(y_pred)
                        
                        model_names.append(model_name)
                        short_counts.append(count_dict.get(-1, 0) / total * 100)
                        neutral_counts.append(count_dict.get(0, 0) / total * 100)
                        long_counts.append(count_dict.get(1, 0) / total * 100)
            
            if model_names:
                x = np.arange(len(model_names))
                width = 0.6
                
                p1 = ax.bar(x, short_counts, width, label='Short (-1)', color='#d62728')
                p2 = ax.bar(x, neutral_counts, width, bottom=short_counts,
                           label='Neutral (0)', color='#7f7f7f')
                p3 = ax.bar(x, long_counts, width,
                           bottom=np.array(short_counts) + np.array(neutral_counts),
                           label='Long (1)', color='#2ca02c')
                
                ax.set_ylabel('Percentage (%)', fontweight='bold')
                ax.set_title('Prediction Distribution by Model', fontweight='bold', fontsize=14)
                ax.set_xticks(x)
                ax.set_xticklabels([m.upper() for m in model_names], rotation=45, ha='right')
                ax.legend(loc='upper right')
                ax.grid(axis='y', alpha=0.3)
                
                # Add percentage labels
                for i, (s, n, l) in enumerate(zip(short_counts, neutral_counts, long_counts)):
                    if s > 5:
                        ax.text(i, s/2, f'{s:.1f}%', ha='center', va='center',
                               fontweight='bold', color='white', fontsize=8)
                    if n > 5:
                        ax.text(i, s + n/2, f'{n:.1f}%', ha='center', va='center',
                               fontweight='bold', color='white', fontsize=8)
                    if l > 5:
                        ax.text(i, s + n + l/2, f'{l:.1f}%', ha='center', va='center',
                               fontweight='bold', color='white', fontsize=8)
                
                plt.tight_layout()
                plt.show()
            else:
                print("  No predictions available for distribution plot")
        
        # ============================================================
        # 5. PER-CLASS METRICS
        # ============================================================
        if show_per_class_metrics:
            print("\n[5/5] Generating per-class metrics...")
            
            fig, axes = plt.subplots(1, 3, figsize=(15, 5))
            
            model_names = []
            short_precision = []
            neutral_precision = []
            long_precision = []
            short_recall = []
            neutral_recall = []
            long_recall = []
            short_f1 = []
            neutral_f1 = []
            long_f1 = []
            
            for model_name, data in successful_models.items():
                run_id = data.get('run_id', 'unknown')
                predictions_file = EXPERIMENTS_DIR / run_id / 'predictions.json'
                
                if predictions_file.exists():
                    with open(predictions_file, 'r') as f:
                        pred_data = json.load(f)
                    
                    y_true = np.array(pred_data.get('y_true', []))
                    y_pred = np.array(pred_data.get('y_pred', []))
                    
                    if len(y_true) > 0 and len(y_pred) > 0:
                        precision, recall, f1, _ = precision_recall_fscore_support(
                            y_true, y_pred, labels=[-1, 0, 1], average=None, zero_division=0
                        )
                        
                        model_names.append(model_name)
                        short_precision.append(precision[0])
                        neutral_precision.append(precision[1])
                        long_precision.append(precision[2])
                        short_recall.append(recall[0])
                        neutral_recall.append(recall[1])
                        long_recall.append(recall[2])
                        short_f1.append(f1[0])
                        neutral_f1.append(f1[1])
                        long_f1.append(f1[2])
            
            if model_names:
                x = np.arange(len(model_names))
                width = 0.25
                
                # Precision
                axes[0].bar(x - width, short_precision, width, label='Short', color='#d62728')
                axes[0].bar(x, neutral_precision, width, label='Neutral', color='#7f7f7f')
                axes[0].bar(x + width, long_precision, width, label='Long', color='#2ca02c')
                axes[0].set_ylabel('Precision', fontweight='bold')
                axes[0].set_title('Precision by Class', fontweight='bold')
                axes[0].set_xticks(x)
                axes[0].set_xticklabels([m.upper() for m in model_names], rotation=45, ha='right')
                axes[0].legend()
                axes[0].grid(axis='y', alpha=0.3)
                axes[0].set_ylim(0, 1)
                
                # Recall
                axes[1].bar(x - width, short_recall, width, label='Short', color='#d62728')
                axes[1].bar(x, neutral_recall, width, label='Neutral', color='#7f7f7f')
                axes[1].bar(x + width, long_recall, width, label='Long', color='#2ca02c')
                axes[1].set_ylabel('Recall', fontweight='bold')
                axes[1].set_title('Recall by Class', fontweight='bold')
                axes[1].set_xticks(x)
                axes[1].set_xticklabels([m.upper() for m in model_names], rotation=45, ha='right')
                axes[1].legend()
                axes[1].grid(axis='y', alpha=0.3)
                axes[1].set_ylim(0, 1)
                
                # F1 Score
                axes[2].bar(x - width, short_f1, width, label='Short', color='#d62728')
                axes[2].bar(x, neutral_f1, width, label='Neutral', color='#7f7f7f')
                axes[2].bar(x + width, long_f1, width, label='Long', color='#2ca02c')
                axes[2].set_ylabel('F1 Score', fontweight='bold')
                axes[2].set_title('F1 Score by Class', fontweight='bold')
                axes[2].set_xticks(x)
                axes[2].set_xticklabels([m.upper() for m in model_names], rotation=45, ha='right')
                axes[2].legend()
                axes[2].grid(axis='y', alpha=0.3)
                axes[2].set_ylim(0, 1)
                
                plt.tight_layout()
                plt.show()
            else:
                print("  No predictions available for per-class metrics")
        
        print("\n" + "=" * 70)
        print(" VISUALIZATIONS COMPLETE")
        print("=" * 70)
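
Section 4.3 opens each run's `predictions.json` separately for the confusion matrix, the prediction distribution, and the per-class metrics. One possible way to read each file only once is a small cached loader, sketched below. The name `load_predictions` is hypothetical, and the length check is an extra safeguard that the cell above does not perform.

```python
import json
from functools import lru_cache
from pathlib import Path


@lru_cache(maxsize=None)
def load_predictions(predictions_file):
    """Return (y_true, y_pred) as tuples, or None if the file is unusable.

    Hypothetical helper. Results are cached per path, so repeated calls for
    the same run do not touch the disk again.
    """
    path = Path(predictions_file)
    if not path.exists():
        return None

    with open(path, "r") as f:
        pred_data = json.load(f)

    y_true = tuple(pred_data.get("y_true", []))
    y_pred = tuple(pred_data.get("y_pred", []))
    # Reject empty or misaligned label lists up front.
    if not y_true or len(y_true) != len(y_pred):
        return None
    return y_true, y_pred
```

Each plotting block could then call `load_predictions(EXPERIMENTS_DIR / run_id / 'predictions.json')` and skip the model when it returns `None`. Because a missing file is cached as `None` too, `load_predictions.cache_clear()` would be needed after retraining in the same session.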

In [None]:
#@title 4.4 Transformer Attention Visualization { display-mode: "form" }

import os
import json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
from scipy.stats import entropy

# Visualization settings
visualize_attention = True  #@param {type: "boolean"}
sample_index = 0  #@param {type: "integer"}
#@markdown Sample index from validation set to visualize
layer_to_visualize = -1  #@param {type: "integer"}
#@markdown Layer index (-1 for last layer, 0 for first)
head_to_visualize = 0  #@param {type: "integer"}
#@markdown Head index to visualize in detail (0-7)

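# Ensure environment variables are defined (same defaults as Sections 4.1 and 4.3).
# Existing values are kept so this cell does not redirect paths set earlier.
if 'IS_COLAB' not in dir():
    IS_COLAB = os.path.exists('/content')
if 'PROJECT_ROOT' not in dir():
    PROJECT_ROOT = Path('/content/research') if IS_COLAB else Path.cwd()
if 'EXPERIMENTS_DIR' not in dir():
    EXPERIMENTS_DIR = PROJECT_ROOT / 'experiments/runs'
if 'SPLITS_DIR' not in dir():
    SPLITS_DIR = PROJECT_ROOT / 'data/splits/scaled'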

if not visualize_attention:
    print("✓ Attention visualization disabled.")
    print("  Enable 'visualize_attention' to see transformer attention patterns.")
elif 'TRAINING_RESULTS' not in dir() or not TRAINING_RESULTS:
    print("⚠ No training results found.")
    print("  Run Section 4.1 (Model Training) first.")
elif TRAINING_RESULTS.get('transformer', {}).get('run_id', 'failed') == 'failed':
    print("⚠ Transformer model not trained, or its training run failed.")
    print("  Enable TRAIN_TRANSFORMER in Section 1 and run Section 4.1.")
else:
    try:
        print("="*80)
        print("TRANSFORMER ATTENTION VISUALIZATION")
        print("="*80)
        
        # Get transformer run ID
        run_id = TRAINING_RESULTS['transformer']['run_id']
        print(f"\n[Loading Model]")
        print(f"  Run ID: {run_id}")
        
        # Load container
        from src.phase1.stages.datasets.container import TimeSeriesDataContainer
        
        container = TimeSeriesDataContainer.from_parquet_dir(
            path=SPLITS_DIR,
            horizon=TRAINING_HORIZON
        )
        
        print(f"  Horizon: {TRAINING_HORIZON}")
        print(f"  Validation samples: {len(container.val_X)}")
        
        # Load trained transformer
        model_path = EXPERIMENTS_DIR / run_id / 'checkpoints'
        
        if not model_path.exists():
            print(f"\n⚠ Model checkpoint not found at {model_path}")
            print("  The model may not have been saved during training.")
        else:
            # Import transformer model
            from src.models import ModelRegistry
            from src.models.config import TrainerConfig
            import torch
            
            # Create model instance with same config
            config = TrainerConfig(
                model_type='transformer',
                horizon=TRAINING_HORIZON,
                seq_len=TRANSFORMER_SEQ_LEN,
                d_model=TRANSFORMER_D_MODEL,
                n_heads=TRANSFORMER_N_HEADS,
                n_layers=TRANSFORMER_N_LAYERS,
                dropout=0.1
            )
            
            model = ModelRegistry.create('transformer', config=config.to_dict())
            
            # Load trained weights
            checkpoint_file = list(model_path.glob('*.pt'))
            if checkpoint_file:
                model.load(model_path)
                print(f"  ✓ Model loaded from {checkpoint_file[0].name}")
            else:
                print(f"\n⚠ No .pt checkpoint files found in {model_path}")
                raise FileNotFoundError("Model checkpoint not found")
            
            # Get validation sample
            print(f"\n[Extracting Sample]")
            print(f"  Sample index: {sample_index}")
            
            if sample_index >= len(container.val_X):
                print(f"  ⚠ Sample index {sample_index} out of range (max: {len(container.val_X)-1})")
                print(f"  Using index 0 instead.")
                sample_index = 0
            
            # Prepare sample
            X_val = container.val_X.iloc[[sample_index]]
            y_val = container.val_y.iloc[sample_index]
            
            # Convert to torch tensor
            device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
            
            # Reshape for transformer: (batch, seq_len, features)
            seq_len = config.seq_len
            n_features = X_val.shape[1] // seq_len
            
            X_tensor = torch.FloatTensor(X_val.values).reshape(1, seq_len, n_features).to(device)
            
            print(f"  Input shape: {X_tensor.shape}")
            print(f"  True label: {y_val}")
            
            # Get model prediction and attention weights
            model.model.eval()
            with torch.no_grad():
                # Forward pass to get attention
                if hasattr(model.model, 'get_attention_weights'):
                    attention_weights, prediction = model.model.get_attention_weights(X_tensor, layer_idx=layer_to_visualize)
                else:
                    # No fallback is implemented: report the limitation and abort
                    print("  ⚠ Model doesn't have get_attention_weights method")
                    print("  Attention visualization requires model modifications")
                    raise NotImplementedError("Attention extraction not implemented")
            
            # Convert to numpy
            attention_weights = attention_weights.cpu().numpy()  # Shape: (batch, n_heads, seq_len, seq_len)
            attention_weights = attention_weights[0]  # Remove batch dim: (n_heads, seq_len, seq_len)
            prediction = prediction.cpu().numpy()[0]
            
            print(f"\n[Model Output]")
            print(f"  Prediction: {prediction.argmax()}")
            print(f"  Confidence: {prediction.max():.2%}")
            print(f"  Attention shape: {attention_weights.shape}")
            
            # Visualize attention heatmaps for all heads
            print(f"\n[Visualizing Attention Patterns]")
            
            n_heads = attention_weights.shape[0]
            n_rows = 2
            n_cols = (n_heads + n_rows - 1) // n_rows  # Ceiling division
            
            # squeeze=False always returns a 2-D array of axes, whatever the grid shape
            fig, axes = plt.subplots(n_rows, n_cols, figsize=(4*n_cols, 8), squeeze=False)
            axes = axes.flatten()
            
            for head_idx in range(n_heads):
                ax = axes[head_idx]
                
                # Plot heatmap
                sns.heatmap(
                    attention_weights[head_idx],
                    cmap='viridis',
                    ax=ax,
                    cbar=True,
                    square=True,
                    vmin=0,
                    vmax=attention_weights[head_idx].max(),
                    cbar_kws={'label': 'Attention Weight'}
                )
                
                ax.set_title(f'Head {head_idx+1}', fontsize=12, fontweight='bold')
                ax.set_xlabel('Key Position (Source)')
                ax.set_ylabel('Query Position (Target)')
                
                # Disable matplotlib grid lines so they do not obscure the heatmap cells
                ax.grid(False)
            
            # Hide unused subplots
            for idx in range(n_heads, len(axes)):
                axes[idx].axis('off')
            
            layer_name = f"Layer {layer_to_visualize}" if layer_to_visualize >= 0 else "Final Layer"
            plt.suptitle(
                f'Transformer Attention Weights - {layer_name}\nSample {sample_index} | True: {y_val} | Pred: {prediction.argmax()}',
                fontsize=14,
                fontweight='bold',
                y=1.02
            )
            plt.tight_layout()
            plt.show()
            
            # Detailed analysis for selected head (validate the index before it is used)
            if head_to_visualize >= n_heads:
                print(f"  ⚠ Head index {head_to_visualize} not available (max: {n_heads-1}); using head 0")
                head_to_visualize = 0
            
            print(f"\n[Attention Analysis - Head {head_to_visualize + 1}]")
            
            head_attention = attention_weights[head_to_visualize]
            
            # Average attention per position (what positions are attended to)
            avg_attention_received = head_attention.mean(axis=0)  # Average over queries
            avg_attention_given = head_attention.mean(axis=1)     # Average over keys
            
            print(f"\n  Most attended positions (received):")
            top_positions = avg_attention_received.argsort()[-5:][::-1]
            for pos in top_positions:
                print(f"    Position {pos:3d}: {avg_attention_received[pos]:.4f}")
            
            print(f"\n  Most attentive positions (given):")
            top_giving = avg_attention_given.argsort()[-5:][::-1]
            for pos in top_giving:
                print(f"    Position {pos:3d}: {avg_attention_given[pos]:.4f}")
            
            # Attention entropy (uniformity)
            attention_entropy = [entropy(head_attention[i]) for i in range(len(head_attention))]
            avg_entropy = np.mean(attention_entropy)
            
            print(f"\n  Attention entropy: {avg_entropy:.4f}")
            print(f"    (Higher = more uniform, Lower = more focused)")
            
            # Interpretability insights
            print(f"\n[Interpretability Insights]")
            
            # Check recency bias
            # max(1, ...) avoids a zero-width window: for seq_len < 10, [-0:] would select the whole sequence
            recent_positions = max(1, seq_len // 10)  # Last 10% of sequence
            recent_attention = avg_attention_received[-recent_positions:].sum()
            
            if recent_attention > 0.3:  # >30% on recent bars
                print(f"  → Strong recency bias ({recent_attention:.1%} on last {recent_positions} positions)")
                print(f"     Model relies heavily on most recent observations")
            
            # Check long-range dependencies
            early_positions = max(1, seq_len // 10)  # First 10% of sequence
            early_attention = avg_attention_received[:early_positions].sum()
            
            if early_attention > 0.15:  # >15% on early bars
                print(f"  → Long-range context ({early_attention:.1%} on first {early_positions} positions)")
                print(f"     Model uses historical information beyond recent bars")
            
            # Check attention focus vs spread
            if avg_entropy < 2.0:
                print(f"  → Focused attention (entropy={avg_entropy:.2f})")
                print(f"     Model concentrates on specific positions")
            elif avg_entropy > 4.0:
                print(f"  → Distributed attention (entropy={avg_entropy:.2f})")
                print(f"     Model spreads attention broadly across sequence")
            
            # Diagonal attention (position attends to itself)
            self_attention = np.diag(head_attention).mean()
            if self_attention > 0.2:
                print(f"  → Self-attention ({self_attention:.1%} average)")
                print(f"     Positions attend to themselves (local context)")
            
            print(f"\n✓ Attention visualization complete")
            
    except FileNotFoundError as e:
        print(f"\n⚠ Error: {e}")
        print("  The transformer model checkpoint was not found.")
        print("  Make sure the model completed training in Section 4.1.")
        
    except NotImplementedError as e:
        print(f"\n⚠ {e}")
        print("  The transformer model needs modifications to extract attention weights.")
        print("  Add a 'get_attention_weights' method to the transformer model class.")
        
    except Exception as e:
        print(f"\n⚠ Error during attention visualization:")
        print(f"  {type(e).__name__}: {e}")
        import traceback
        traceback.print_exc()
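
The entropy thresholds used in Section 4.4 (2.0 and 4.0) are measured in nats, and the maximum possible value is ln(seq_len), so the same threshold means different things for different sequence lengths. A minimal sketch of a length-independent alternative follows. The helper name `normalized_attention_entropy` is hypothetical and not part of the pipeline; it uses the natural logarithm, which is also the default of `scipy.stats.entropy`.

```python
import numpy as np

def normalized_attention_entropy(head_attention):
    """Mean row entropy of one attention head, scaled to [0, 1] by ln(seq_len).

    Hypothetical helper for illustration; expects a (seq_len, seq_len) matrix
    where each row holds the attention a query gives to every key.
    """
    attn = np.asarray(head_attention, dtype=float)
    seq_len = attn.shape[-1]
    if seq_len < 2:
        return 0.0  # A single position has no spread to measure
    # Re-normalize rows defensively so each one sums to 1
    p = attn / attn.sum(axis=-1, keepdims=True)
    # Replace zeros with 1 inside the log so that 0 * log(0) contributes 0
    safe_p = np.where(p > 0, p, 1.0)
    row_entropy = -(p * np.log(safe_p)).sum(axis=-1)
    return float(row_entropy.mean() / np.log(seq_len))

uniform = np.full((8, 8), 1 / 8)   # every query spreads attention evenly
focused = np.eye(8)                # every query attends to one position only
print(normalized_attention_entropy(uniform), normalized_attention_entropy(focused))
```

Values near 1 indicate near-uniform attention and values near 0 indicate sharply focused attention, regardless of sequence length.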

In [None]:
#@title 4.5 Test Set Performance { display-mode: "form" }

import os
import json
import pickle
import joblib
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pathlib import Path
from sklearn.metrics import (
    accuracy_score, f1_score, precision_score, recall_score,
    confusion_matrix, classification_report
)

# Configuration options
run_test_evaluation = True  #@param {type: "boolean"}
show_sample_predictions = True  #@param {type: "boolean"}
n_samples_to_show = 20  #@param {type: "integer"}
show_generalization_gap = True  #@param {type: "boolean"}
save_test_predictions = True  #@param {type: "boolean"}

# Ensure environment variables are defined
if 'IS_COLAB' not in dir():
    IS_COLAB = os.path.exists('/content')
if 'PROJECT_ROOT' not in dir():
    PROJECT_ROOT = Path('/content/research') if IS_COLAB else Path.cwd()
if 'SPLITS_DIR' not in dir():
    SPLITS_DIR = PROJECT_ROOT / 'data/splits/scaled'
if 'EXPERIMENTS_DIR' not in dir():
    EXPERIMENTS_DIR = PROJECT_ROOT / 'experiments/runs'
if 'TRAINING_RESULTS' not in dir():
    TRAINING_RESULTS = {}

# Check if we can run test evaluation
if not run_test_evaluation:
    print("[Skipped] Test evaluation disabled. Enable checkbox above to run.")
elif not TRAINING_RESULTS:
    print("[Error] No training results available.")
    print("Run Section 4.1 to train models first.")
else:
    print("=" * 70)
    print(" TEST SET PERFORMANCE EVALUATION")
    print("=" * 70)
    
    # Filter successful models only
    successful_models = {
        name: data for name, data in TRAINING_RESULTS.items()
        if data.get('run_id') != 'failed' and data.get('metrics')
    }
    
    if not successful_models:
        print("\n[Error] No successful models to evaluate.")
        print("All models failed during training.")
    else:
        try:
            # ============================================================
            # LOAD TEST DATA
            # ============================================================
            print(f"\n[1/5] Loading test data...")
            
            from src.phase1.stages.datasets.container import TimeSeriesDataContainer
            
            container = TimeSeriesDataContainer.from_parquet_dir(
                path=SPLITS_DIR,
                horizon=TRAINING_HORIZON
            )
            
            # Get test data
            test_split = container.splits.get('test')
            if test_split is None:
                print("  [ERROR] Test split not found in container!")
                raise ValueError("Test data not available")
            
            X_test = test_split.features
            y_test = test_split.labels
            
            print(f"  Test samples: {len(X_test):,}")
            print(f"  Features: {X_test.shape[1]}")
            
            # ============================================================
            # RUN PREDICTIONS ON TEST SET
            # ============================================================
            print(f"\n[2/5] Running predictions on test set...")
            
            TEST_RESULTS = {}
            
            for model_name, train_data in successful_models.items():
                print(f"\n  Evaluating: {model_name.upper()}")
                
                try:
                    run_id = train_data.get('run_id', 'unknown')
                    model_dir = EXPERIMENTS_DIR / run_id
                    
                    # Load model from checkpoints
                    checkpoint_dir = model_dir / 'checkpoints'
                    
                    # Try different model file formats
                    model_loaded = False
                    model = None
                    
                    # Method 1: Try pickle format
                    pickle_path = checkpoint_dir / 'model.pkl'
                    if pickle_path.exists():
                        with open(pickle_path, 'rb') as f:
                            model = pickle.load(f)
                        model_loaded = True
                        print(f"    Loaded from: {pickle_path.name}")
                    
                    # Method 2: Try joblib format
                    if not model_loaded:
                        joblib_path = checkpoint_dir / 'model.joblib'
                        if joblib_path.exists():
                            model = joblib.load(joblib_path)
                            model_loaded = True
                            print(f"    Loaded from: {joblib_path.name}")
                    
                    # Method 3: Try PyTorch format (for neural models)
                    if not model_loaded and model_name in ['lstm', 'gru', 'tcn', 'transformer']:
                        torch_path = checkpoint_dir / 'model.pt'
                        if torch_path.exists():
                            import torch
                            from src.models import ModelRegistry
                            
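                            # Recreate model architecture (load_state_dict below fails if these
                            # sizes differ from the configuration the checkpoint was trained with)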
                            model = ModelRegistry.create(model_name, config={
                                'input_size': X_test.shape[1],
                                'hidden_size': 128,
                                'num_layers': 2,
                            })
                            
                            # Load weights
                            state_dict = torch.load(torch_path, map_location='cpu')
                            model.model.load_state_dict(state_dict)
                            model.model.eval()
                            model_loaded = True
                            print(f"    Loaded from: {torch_path.name}")
                    
                    if not model_loaded:
                        print(f"    [WARNING] Model file not found in {checkpoint_dir}")
                        print(f"    Skipping {model_name}")
                        continue
                    
                    # Make predictions
                    if not hasattr(model, 'predict'):
                        print(f"    [WARNING] Model has no predict method")
                        continue
                    
                    # Project model wrappers may return an object exposing `class_predictions`;
                    # plain sklearn-style estimators return the label array directly.
                    pred_result = model.predict(X_test)
                    y_pred = getattr(pred_result, 'class_predictions', pred_result)
                    
                    # Calculate test metrics
                    test_acc = accuracy_score(y_test, y_pred)
                    test_macro_f1 = f1_score(y_test, y_pred, average='macro', zero_division=0)
                    test_weighted_f1 = f1_score(y_test, y_pred, average='weighted', zero_division=0)
                    test_precision = precision_score(y_test, y_pred, average='macro', zero_division=0)
                    test_recall = recall_score(y_test, y_pred, average='macro', zero_division=0)
                    
                    # Per-class metrics
                    per_class_f1 = f1_score(y_test, y_pred, average=None, labels=[-1, 0, 1], zero_division=0)
                    per_class_precision = precision_score(y_test, y_pred, average=None, labels=[-1, 0, 1], zero_division=0)
                    per_class_recall = recall_score(y_test, y_pred, average=None, labels=[-1, 0, 1], zero_division=0)
                    
                    # Confusion matrix
                    cm = confusion_matrix(y_test, y_pred, labels=[-1, 0, 1])
                    
                    # Store results
                    val_metrics = train_data.get('metrics', {})
                    
                    TEST_RESULTS[model_name] = {
                        'test_metrics': {
                            'accuracy': test_acc,
                            'macro_f1': test_macro_f1,
                            'weighted_f1': test_weighted_f1,
                            'precision': test_precision,
                            'recall': test_recall,
                            'per_class_f1': per_class_f1.tolist(),
                            'per_class_precision': per_class_precision.tolist(),
                            'per_class_recall': per_class_recall.tolist(),
                            'confusion_matrix': cm.tolist(),
                        },
                        'val_metrics': val_metrics,
                        'predictions': y_pred.tolist() if hasattr(y_pred, 'tolist') else list(y_pred),
                        'run_id': run_id,
                    }
                    
                    print(f"    Test Accuracy: {test_acc:.2%}")
                    print(f"    Test Macro F1: {test_macro_f1:.4f}")
                    
                    # Save predictions if requested
                    if save_test_predictions:
                        test_pred_file = model_dir / 'test_predictions.json'
                        with open(test_pred_file, 'w') as f:
                            json.dump({
                                'y_true': y_test.tolist() if hasattr(y_test, 'tolist') else list(y_test),
                                'y_pred': y_pred.tolist() if hasattr(y_pred, 'tolist') else list(y_pred),
                                'test_metrics': TEST_RESULTS[model_name]['test_metrics'],
                            }, f, indent=2)
                    
                    del model
                    
                except Exception as model_error:
                    print(f"    [ERROR] Failed to evaluate {model_name}: {model_error}")
                    import traceback
                    traceback.print_exc()
                    continue
            
            # ============================================================
            # DISPLAY COMPARISON TABLE
            # ============================================================
            if TEST_RESULTS:
                print(f"\n[3/5] Model Performance Comparison")
                print("=" * 70)
                
                # Build comparison DataFrame
                comparison_data = []
                for model_name, results in TEST_RESULTS.items():
                    val_metrics = results['val_metrics']
                    test_metrics = results['test_metrics']
                    
                    val_acc = val_metrics.get('accuracy', 0)
                    test_acc = test_metrics.get('accuracy', 0)
                    val_f1 = val_metrics.get('macro_f1', 0)
                    test_f1 = test_metrics.get('macro_f1', 0)
                    
                    # Calculate generalization gap in percentage points (test minus validation)
                    acc_gap = (test_acc - val_acc) * 100
                    f1_gap = (test_f1 - val_f1) * 100
                    
                    comparison_data.append({
                        'Model': model_name,
                        'Val Acc': f"{val_acc:.2%}",
                        'Test Acc': f"{test_acc:.2%}",
                        'Val F1': f"{val_f1:.4f}",
                        'Test F1': f"{test_f1:.4f}",
                        'Acc Gap (%)': f"{acc_gap:+.2f}",
                        'F1 Gap (%)': f"{f1_gap:+.2f}",
                    })
                
                comparison_df = pd.DataFrame(comparison_data)
                
                # Sort by test F1 score (the column holds formatted strings, so compare as floats)
                comparison_df = comparison_df.sort_values(
                    by='Test F1',
                    ascending=False,
                    key=lambda col: col.astype(float)
                )
                
                print("\n")
                print(comparison_df.to_string(index=False))
                
                # Best performing model
                best_model_name = comparison_df.iloc[0]['Model']
                best_test_f1 = comparison_df.iloc[0]['Test F1']
                print(f"\n  Best Model on Test Set: {best_model_name} (F1: {best_test_f1})")
                
                # ============================================================
                # GENERALIZATION ANALYSIS
                # ============================================================
                if show_generalization_gap:
                    print(f"\n[4/5] Generalization Analysis")
                    print("=" * 70)
                    
                    for model_name, results in TEST_RESULTS.items():
                        val_metrics = results['val_metrics']
                        test_metrics = results['test_metrics']
                        
                        val_f1 = val_metrics.get('macro_f1', 0)
                        test_f1 = test_metrics.get('macro_f1', 0)
                        
                        gap_pct = ((test_f1 - val_f1) / val_f1 * 100) if val_f1 > 0 else 0
                        
                        # Grade the relative gap (the output is plain text, so no color is applied)
                        if abs(gap_pct) < 2:
                            status = "✓ EXCELLENT"
                        elif abs(gap_pct) < 5:
                            status = "~ GOOD"
                        else:
                            status = "⚠ POOR"
                        
                        print(f"\n  {model_name.upper()}:")
                        print(f"    Val F1:  {val_f1:.4f}")
                        print(f"    Test F1: {test_f1:.4f}")
                        print(f"    Gap:     {gap_pct:+.2f}% [{status}]")
                
                # ============================================================
                # SAMPLE PREDICTIONS
                # ============================================================
                if show_sample_predictions and n_samples_to_show > 0:
                    print(f"\n[5/5] Sample Predictions (first {n_samples_to_show})")
                    print("=" * 70)
                    
                    # Show actual labels (converted to plain Python values for clean printing)
                    sample_actual = np.asarray(y_test)[:n_samples_to_show].tolist()
                    print(f"\n  Actual:     {sample_actual}")
                    
                    # Show predictions for each model
                    for model_name, results in TEST_RESULTS.items():
                        predictions = results['predictions']
                        sample_pred = predictions[:n_samples_to_show]
                        
                        # Calculate accuracy for this sample
                        matches = sum(1 for a, p in zip(sample_actual, sample_pred) if a == p)
                        sample_acc = matches / len(sample_actual) * 100
                        
                        print(f"  {model_name:12s}: {sample_pred} ({sample_acc:.1f}% match)")
                
                print("\n" + "=" * 70)
                print(" TEST EVALUATION COMPLETE")
                print("=" * 70)
                
                print(f"\n  Evaluated: {len(TEST_RESULTS)} models")
                print(f"  Test samples: {len(X_test):,}")
                
                if save_test_predictions:
                    print(f"  Predictions saved to: {EXPERIMENTS_DIR}/[run_id]/test_predictions.json")
            else:
                print("\n[WARNING] No test results generated.")
                print("All models failed to load or predict.")
            
            # Clean up
            del container, X_test, y_test
            
        except Exception as e:
            print(f"\n[ERROR] Test evaluation failed: {e}")
            import traceback
            traceback.print_exc()
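
Section 4.5 reports the validation-to-test gap in two ways: the comparison table uses an absolute difference in percentage points, while the generalization analysis grades a gap relative to the validation score. The sketch below shows both side by side. The function names are hypothetical; the 2% and 5% thresholds mirror the ones used in the cell above.

```python
def generalization_gap(val_score, test_score):
    """Return (absolute gap in percentage points, relative gap in percent)."""
    abs_pp = (test_score - val_score) * 100
    # Guard against division by zero when the validation score is missing or 0
    rel_pct = (test_score - val_score) / val_score * 100 if val_score > 0 else 0.0
    return abs_pp, rel_pct

def grade_gap(rel_pct):
    """Grade a relative gap with the same thresholds as the analysis step."""
    if abs(rel_pct) < 2:
        return "EXCELLENT"
    if abs(rel_pct) < 5:
        return "GOOD"
    return "POOR"

abs_pp, rel_pct = generalization_gap(val_score=0.50, test_score=0.48)
print(f"{abs_pp:+.2f} pp, {rel_pct:+.2f}% [{grade_gap(rel_pct)}]")
# prints "-2.00 pp, -4.00% [GOOD]"
```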

---
# 5. PHASE 3: CROSS-VALIDATION (Optional)

Run purged K-fold cross-validation for robust model evaluation.
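
As an illustration of the idea, here is a minimal sketch of purged K-fold splitting with NumPy only. It is not the project's `PurgedKFold` implementation, whose exact conventions are not shown here. It assumes contiguous validation blocks, removes `purge_bars` training samples on both sides of each block (their label windows could overlap the block), and removes a further `embargo_bars` samples after it. The function name `purged_kfold_indices` is hypothetical.

```python
import numpy as np

def purged_kfold_indices(n_samples, n_splits=5, purge_bars=10, embargo_bars=5):
    """Yield (train_idx, val_idx) pairs with purge and embargo gaps around each fold."""
    fold_bounds = np.linspace(0, n_samples, n_splits + 1, dtype=int)
    all_idx = np.arange(n_samples)
    for start, stop in zip(fold_bounds[:-1], fold_bounds[1:]):
        val_idx = all_idx[start:stop]
        # Keep training bars far enough before the block (purge) ...
        before = all_idx < start - purge_bars
        # ... and far enough after it (purge plus embargo)
        after = all_idx >= stop + purge_bars + embargo_bars
        yield all_idx[before | after], val_idx

for fold, (train_idx, val_idx) in enumerate(
    purged_kfold_indices(100, n_splits=5, purge_bars=3, embargo_bars=2)
):
    print(f"fold {fold}: train={len(train_idx)}, val={val_idx[0]}..{val_idx[-1]}")
```

The gaps shrink the training set slightly in every fold, which is the price of keeping label information from the validation block out of training.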

In [None]:
#@title 5.1 Run Cross-Validation { display-mode: "form" }

import os
import gc
import numpy as np
import pandas as pd
from pathlib import Path
from tqdm.auto import tqdm

# Ensure environment variables are defined
if 'IS_COLAB' not in dir():
    IS_COLAB = os.path.exists('/content')
if 'PROJECT_ROOT' not in dir():
    PROJECT_ROOT = Path('/content/research') if IS_COLAB else Path.cwd()
if 'SPLITS_DIR' not in dir():
    SPLITS_DIR = PROJECT_ROOT / 'data/splits/scaled'
if 'RUN_CROSS_VALIDATION' not in dir():
    RUN_CROSS_VALIDATION = False
if 'TRAINING_RESULTS' not in dir():
    TRAINING_RESULTS = {}
if 'GPU_AVAILABLE' not in dir():
    import torch
    GPU_AVAILABLE = torch.cuda.is_available()
if 'CV_TUNE_HYPERPARAMS' not in dir():
    CV_TUNE_HYPERPARAMS = False
if 'CV_N_TRIALS' not in dir():
    CV_N_TRIALS = 20

# Define clear_memory if not available
if 'clear_memory' not in dir():
    def clear_memory():
        gc.collect()
        if GPU_AVAILABLE:
            import torch
            torch.cuda.empty_cache()

# Initialize global results dictionaries
if 'CV_RESULTS' not in dir():
    CV_RESULTS = {}
if 'TUNING_RESULTS' not in dir():
    TUNING_RESULTS = {}

if not RUN_CROSS_VALIDATION:
    print("[Skipped] Cross-validation disabled in configuration.")
    print("Set RUN_CROSS_VALIDATION = True in Section 1 to enable.")
else:
    print("=" * 70)
    print(" PHASE 3: CROSS-VALIDATION")
    print("=" * 70)

    try:
        from src.cross_validation import PurgedKFold, PurgedKFoldConfig
        from src.cross_validation.cv_runner import TimeSeriesOptunaTuner
        from src.cross_validation.param_spaces import get_param_space
        from src.phase1.stages.datasets.container import TimeSeriesDataContainer
        from src.models import ModelRegistry
        from sklearn.metrics import f1_score, accuracy_score, precision_score, recall_score

        # Load data
        container = TimeSeriesDataContainer.from_parquet_dir(
            path=SPLITS_DIR,
            horizon=TRAINING_HORIZON
        )

        X, y, _ = container.get_sklearn_arrays('train')
        print(f"\nData: {X.shape[0]:,} samples, {X.shape[1]} features")

        # Configure CV
        cv_config = PurgedKFoldConfig(
            n_splits=CV_N_SPLITS,
            purge_bars=PURGE_BARS,
            embargo_bars=EMBARGO_BARS,
        )
        cv = PurgedKFold(cv_config)

        print(f"CV: {CV_N_SPLITS} folds, purge={PURGE_BARS}, embargo={EMBARGO_BARS}")

        # Get list of successfully trained models
        successful_models = [m for m in TRAINING_RESULTS.keys() if TRAINING_RESULTS[m].get('status') == 'success']

        if not successful_models:
            print("\n[Warning] No trained models found. Train models first in Section 4.")
        else:
            print(f"\nRunning CV for {len(successful_models)} models: {', '.join(successful_models)}")

            # Run CV for ALL trained models
            cv_summary_data = []

            for model_name in tqdm(successful_models, desc="Cross-Validation"):
                print(f"\n{'='*50}")
                print(f"Model: {model_name}")
                print(f"{'='*50}")

                # Hyperparameter tuning (if enabled)
                tuned_params = {}
                if CV_TUNE_HYPERPARAMS:
                    param_space = get_param_space(model_name)

                    if param_space:
                        print(f"  Tuning hyperparameters ({CV_N_TRIALS} trials)...")
                        try:
                            tuner = TimeSeriesOptunaTuner(
                                model_name=model_name,
                                cv=cv,
                                n_trials=CV_N_TRIALS,
                                direction="maximize",
                                metric="f1"
                            )

                            tuning_result = tuner.tune(
                                X=pd.DataFrame(X),
                                y=pd.Series(y),
                                sample_weights=None,
                                param_space=param_space
                            )

                            if not tuning_result.get('skipped', False):
                                tuned_params = tuning_result.get('best_params', {})
                                best_value = tuning_result.get('best_value', 0.0)

                                TUNING_RESULTS[model_name] = {
                                    'best_params': tuned_params,
                                    'best_value': best_value,
                                    'n_trials': tuning_result.get('n_trials', 0)
                                }

                                print(f"    Best F1: {best_value:.4f}")
                                print(f"    Best params: {tuned_params}")
                            else:
                                print(f"    [Skipped] No tuning support or Optuna not installed")

                        except Exception as e:
                            print(f"    [Warning] Tuning failed: {e}")
                    else:
                        print(f"  [Skipped] No param space defined for {model_name}")

                # Get model config (use tuned params if available)
                try:
                    default_config = ModelRegistry.get_model_info(model_name).get('default_config', {})
                except Exception:
                    default_config = {
                        'n_estimators': N_ESTIMATORS,
                        'early_stopping_rounds': BOOSTING_EARLY_STOPPING,
                    }

                model_config = {**default_config, **tuned_params}

                # Run cross-validation
                print(f"  Running {CV_N_SPLITS}-fold CV...")
                fold_scores = []
                fold_details = []

                for fold_idx, (train_idx, val_idx) in enumerate(cv.split(X, y)):
                    X_train, X_val = X[train_idx], X[val_idx]
                    y_train, y_val = y[train_idx], y[val_idx]

                    # Train model
                    model = ModelRegistry.create(model_name, config=model_config)
                    model.fit(X_train, y_train, X_val, y_val)

                    # Evaluate
                    predictions = model.predict(X_val)
                    y_pred = predictions.class_predictions

                    f1 = f1_score(y_val, y_pred, average='macro', zero_division=0)
                    acc = accuracy_score(y_val, y_pred)
                    prec = precision_score(y_val, y_pred, average='macro', zero_division=0)
                    rec = recall_score(y_val, y_pred, average='macro', zero_division=0)

                    fold_scores.append(f1)
                    fold_details.append({
                        'fold': fold_idx,
                        'f1': f1,
                        'accuracy': acc,
                        'precision': prec,
                        'recall': rec,
                        'train_size': len(train_idx),
                        'val_size': len(val_idx)
                    })

                    del model
                    clear_memory()

                # Calculate CV statistics
                mean_f1 = np.mean(fold_scores)
                std_f1 = np.std(fold_scores)
                best_f1 = np.max(fold_scores)

                # Stability grading
                if std_f1 < 0.01:
                    stability = "Excellent"
                elif std_f1 < 0.02:
                    stability = "Good"
                elif std_f1 < 0.04:
                    stability = "Fair"
                else:
                    stability = "Poor"

                # Store results
                CV_RESULTS[model_name] = {
                    'mean_f1': mean_f1,
                    'std_f1': std_f1,
                    'best_f1': best_f1,
                    'fold_scores': fold_scores,
                    'fold_details': fold_details,
                    'stability': stability,
                    'tuned_params': tuned_params
                }

                cv_summary_data.append({
                    'Model': model_name,
                    'CV Mean F1': mean_f1,
                    'CV Std': std_f1,
                    'Best F1': best_f1,
                    'Stability': stability
                })

                print(f"  Mean F1: {mean_f1:.4f} (+/- {std_f1:.4f})")
                print(f"  Best F1: {best_f1:.4f}")
                print(f"  Stability: {stability}")

            # Display CV summary table
            print(f"\n{'='*70}")
            print(" CROSS-VALIDATION SUMMARY")
            print(f"{'='*70}\n")

            cv_summary_df = pd.DataFrame(cv_summary_data)
            cv_summary_df = cv_summary_df.sort_values('CV Mean F1', ascending=False)

            # Format for display
            print(cv_summary_df.to_string(index=False))

            print(f"\n{'='*70}")
            print(f"Cross-validation complete for {len(successful_models)} models")
            print(f"{'='*70}")

        del container, X, y
        clear_memory()

    except Exception as e:
        print(f"\n[ERROR] Cross-validation failed: {e}")
        import traceback
        traceback.print_exc()

In [None]:
#@title 5.2 Hyperparameter Tuning Results { display-mode: "form" }
#@markdown Display hyperparameter tuning results and recommendations.
show_retrain_recommendation = True  #@param {type: "boolean"}
show_optimization_plots = False  #@param {type: "boolean"}

import json
import os
import pandas as pd
import numpy as np
from pathlib import Path

# Ensure environment variables are defined
if 'IS_COLAB' not in dir():
    IS_COLAB = os.path.exists('/content')
if 'TUNING_RESULTS' not in dir():
    TUNING_RESULTS = {}
if 'TRAINING_RESULTS' not in dir():
    TRAINING_RESULTS = {}
if 'CV_TUNE_HYPERPARAMS' not in dir():
    CV_TUNE_HYPERPARAMS = False

print("=" * 70)
print(" HYPERPARAMETER TUNING RESULTS")
print("=" * 70)

# Check if tuning was run
if not CV_TUNE_HYPERPARAMS:
    print("\n[Skipped] Hyperparameter tuning not enabled.")
    print("\nTo enable tuning:")
    print("  1. Set CV_TUNE_HYPERPARAMS = True in Section 1")
    print("  2. Run Cross-Validation (Section 5.1)")
    print(f"  3. Configure CV_N_TRIALS (currently: {globals().get('CV_N_TRIALS', 20)})")
elif not TUNING_RESULTS:
    print("\n[No Data] No tuning results available.")
    print("\nPossible reasons:")
    print("  - Cross-validation hasn't been run yet")
    print("  - No models have param spaces defined")
    print("  - Optuna is not installed")
    print("\nRun Section 5.1 (Cross-Validation) first.")

    # List the models that support tuning, if the project package is importable
    try:
        from src.cross_validation.param_spaces import PARAM_SPACES
        print("\nModels with tuning support:")
        print(f"  {', '.join(PARAM_SPACES.keys())}")
    except ImportError:
        pass
else:
    print(f"\nTuned {len(TUNING_RESULTS)} model(s)")
    print(f"Trials per model: {globals().get('CV_N_TRIALS', 20)}\n")

    # Display results for each model
    for model_name, results in TUNING_RESULTS.items():
        print(f"\n{'='*60}")
        print(f" {model_name.upper()}")
        print(f"{'='*60}")

        best_params = results.get('best_params', {})
        best_value = results.get('best_value', 0.0)
        n_trials = results.get('n_trials', 0)

        print("\nOptimization Summary:")
        print(f"  Trials completed: {n_trials}")
        print(f"  Best F1 score:    {best_value:.4f}")

        if best_params:
            print("\n  Best Parameters:")

            # Get default params for comparison
            try:
                from src.models import ModelRegistry
                model_info = ModelRegistry.get_model_info(model_name)
                default_config = model_info.get('default_config', {})
            except Exception:
                default_config = {}

            # Create parameter comparison table
            param_data = []
            for param_name, tuned_value in best_params.items():
                default_value = default_config.get(param_name)

                # Calculate change relative to the default value
                if default_value is None:
                    change_str = "new"
                elif isinstance(tuned_value, (int, float)) and isinstance(default_value, (int, float)):
                    change_pct = ((tuned_value - default_value) / default_value * 100) if default_value != 0 else 0
                    change_str = f"{change_pct:+.1f}%"
                else:
                    change_str = "changed"

                param_data.append({
                    'Parameter': param_name,
                    'Default': str(default_value) if default_value is not None else 'N/A',
                    'Tuned': str(tuned_value),
                    'Change': change_str
                })

            param_df = pd.DataFrame(param_data)
            print("\n" + param_df.to_string(index=False))

        # Calculate improvement over default
        if model_name in TRAINING_RESULTS:
            default_f1 = TRAINING_RESULTS[model_name].get('metrics', {}).get('macro_f1', 0.0)
            improvement = ((best_value - default_f1) / default_f1 * 100) if default_f1 > 0 else 0

            print("\n  Improvement Analysis:")
            print(f"    Default F1:     {default_f1:.4f}")
            print(f"    Tuned F1:       {best_value:.4f}")
            print(f"    Improvement:    {improvement:+.2f}%")

        print()

    # Show retrain recommendations
    if show_retrain_recommendation:
        print(f"\n{'='*70}")
        print(" RETRAIN RECOMMENDATIONS")
        print(f"{'='*70}\n")

        recommendations = []
        for model_name, results in TUNING_RESULTS.items():
            best_value = results.get('best_value', 0.0)

            # Calculate relative improvement (percent of the default F1)
            if model_name in TRAINING_RESULTS:
                default_f1 = TRAINING_RESULTS[model_name].get('metrics', {}).get('macro_f1', 0.0)
                improvement = ((best_value - default_f1) / default_f1 * 100) if default_f1 > 0 else 0

                recommendations.append({
                    'Model': model_name,
                    'Default F1': default_f1,
                    'Tuned F1': best_value,
                    'Improvement': improvement,
                    'Action': 'RETRAIN' if improvement > 2.0 else 'Optional'
                })

        if recommendations:
            rec_df = pd.DataFrame(recommendations)
            rec_df = rec_df.sort_values('Improvement', ascending=False)

            # Format for display
            rec_df['Default F1'] = rec_df['Default F1'].apply(lambda x: f"{x:.4f}")
            rec_df['Tuned F1'] = rec_df['Tuned F1'].apply(lambda x: f"{x:.4f}")
            rec_df['Improvement'] = rec_df['Improvement'].apply(lambda x: f"{x:+.2f}%")

            print(rec_df.to_string(index=False))

            # Highlight high-priority retrains
            high_priority = [r for r in recommendations if r['Improvement'] > 2.0]
            if high_priority:
                print(f"\n⚠ HIGH PRIORITY: {len(high_priority)} model(s) show >2% improvement:")
                for rec in high_priority:
                    print(f"  - {rec['Model']}: {rec['Improvement']:+.2f}% improvement")
                print("\n  Recommendation: Retrain these models with tuned parameters")
            else:
                print("\n✓ All models performing near-optimally with default parameters")
        else:
            print("No comparison data available (models not trained with defaults)")

    # Optimization plots (optional)
    if show_optimization_plots:
        print(f"\n{'='*70}")
        print(" OPTIMIZATION HISTORY")
        print(f"{'='*70}\n")
        print("[Info] Optimization plots require Optuna visualization.")
        print("      In Colab, install: !pip install optuna plotly")
        print("      Then re-run this cell to see optimization history.")

        try:
            import optuna
            print("\n✓ Optuna available - plots can be generated")
            print("  (Full plot integration coming in next update)")
        except ImportError:
            print("\n✗ Optuna not installed - plots unavailable")

    # Save tuning results (a failed save is reported but does not stop the cell)
    try:
        if 'EXPERIMENTS_DIR' in dir():
            tuning_results_path = EXPERIMENTS_DIR / 'tuning_results.json'
            with open(tuning_results_path, 'w') as f:
                json.dump(TUNING_RESULTS, f, indent=2, default=str)
            print(f"\n[Saved] Tuning results: {tuning_results_path}")
    except Exception as e:
        print(f"\n[Warning] Could not save tuning results: {e}")

    print(f"\n{'='*70}")
    print("Hyperparameter tuning analysis complete")
    print(f"{'='*70}")
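
The retrain recommendation in cell 5.2 compares scores in relative terms: the 2% threshold applies to the percentage change over the default F1, not to absolute F1 points. The sketch below isolates that rule so it can be checked by hand. The function names `relative_improvement` and `retrain_action` are illustrative and not part of the project, and the sketch assumes both scores are macro F1 values measured on comparable data.

```python
def relative_improvement(tuned_score, default_score):
    """Percent change of the tuned score relative to the default score."""
    if default_score <= 0:
        # Mirrors the cell: without a positive baseline, report no improvement.
        return 0.0
    return (tuned_score - default_score) / default_score * 100.0


def retrain_action(tuned_score, default_score, threshold_pct=2.0):
    """Return 'RETRAIN' when the relative gain exceeds the threshold."""
    gain = relative_improvement(tuned_score, default_score)
    return 'RETRAIN' if gain > threshold_pct else 'Optional'


# An absolute gain of 0.02 F1 on a 0.50 baseline is a 4% relative gain.
print(f"{relative_improvement(0.52, 0.50):+.2f}%")
print(retrain_action(0.52, 0.50))
# The same absolute gain on a 0.90 baseline stays under the 2% threshold.
print(retrain_action(0.915, 0.90))
```

Because the threshold is relative, weak baselines cross it with smaller absolute gains than strong ones, which is worth keeping in mind when reading the recommendation table.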

---
# 6. PHASE 4: ENSEMBLE (Optional)

Combine multiple models for improved predictions.
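
As a rough illustration of the simplest combiner used in this phase, soft voting averages the class probabilities of the base models (optionally weighted) and predicts the class with the highest average. The sketch below is a standalone NumPy version written for this explanation. The function name `soft_vote` is hypothetical, and the project's `voting` model may differ in its details.

```python
import numpy as np


def soft_vote(probas, weights=None):
    """Combine per-model class probabilities by (weighted) averaging.

    probas:  list of arrays, each of shape (n_samples, n_classes).
    weights: optional sequence with one weight per model; equal if None.
    Returns (predicted_labels, averaged_probabilities).
    """
    stacked = np.stack([np.asarray(p, dtype=float) for p in probas])
    n_models = stacked.shape[0]
    if weights is None:
        w = np.full(n_models, 1.0 / n_models)
    else:
        w = np.asarray(weights, dtype=float)
        if w.shape != (n_models,):
            raise ValueError("need exactly one weight per model")
        w = w / w.sum()  # normalize so the result is still a distribution
    # Contract the model axis: (n_models,) x (n_models, n_samples, n_classes)
    avg = np.tensordot(w, stacked, axes=1)
    return avg.argmax(axis=1), avg


p1 = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]
p2 = [[0.2, 0.7, 0.1], [0.1, 0.2, 0.7]]
labels_equal, _ = soft_vote([p1, p2])
labels_weighted, _ = soft_vote([p1, p2], weights=[3, 1])
print(labels_equal, labels_weighted)
```

With equal weights the second model's confident votes decide both samples. Weighting the first model three times as heavily flips both predictions, which is why the weights count must match the number of valid base models.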

In [None]:
#@title 6.1 Train Ensemble { display-mode: "form" }
import os
import gc
from pathlib import Path
import numpy as np
import pandas as pd

#@markdown ### Ensemble Training Options
show_base_model_validation = True  #@param {type: "boolean"}
filter_by_cv_stability = False  #@param {type: "boolean"}
show_ensemble_comparison = True  #@param {type: "boolean"}
min_diversity_threshold = 0.1  #@param {type: "number"}

# Ensure environment variables are defined
if 'IS_COLAB' not in dir():
    IS_COLAB = os.path.exists('/content')
if 'PROJECT_ROOT' not in dir():
    PROJECT_ROOT = Path('/content/research') if IS_COLAB else Path.cwd()
if 'SPLITS_DIR' not in dir():
    SPLITS_DIR = PROJECT_ROOT / 'data/splits/scaled'
if 'EXPERIMENTS_DIR' not in dir():
    EXPERIMENTS_DIR = PROJECT_ROOT / 'experiments/runs'
if 'TRAINING_RESULTS' not in dir():
    TRAINING_RESULTS = {}
if 'GPU_AVAILABLE' not in dir():
    import torch
    GPU_AVAILABLE = torch.cuda.is_available()

# Define clear_memory if not available
if 'clear_memory' not in dir():
    def clear_memory():
        gc.collect()
        if GPU_AVAILABLE:
            import torch
            torch.cuda.empty_cache()

# Initialize ensemble results dict
ENSEMBLE_RESULTS = {}

# Filter out failed models from ensemble base models
successful_models = [
    model for model, data in TRAINING_RESULTS.items()
    if data.get('run_id') != 'failed' and data.get('metrics')
]

# Check if any ensemble is enabled
any_ensemble_enabled = TRAIN_VOTING or TRAIN_STACKING or TRAIN_BLENDING

if not any_ensemble_enabled:
    print("[Skipped] No ensemble training enabled.")
    print("Enable TRAIN_VOTING, TRAIN_STACKING, or TRAIN_BLENDING in Section 1.")
elif len(successful_models) < 2:
    print("[Error] Need at least 2 successfully trained models for ensemble.")
    print(f"Successfully trained: {successful_models}")
    if len(TRAINING_RESULTS) > len(successful_models):
        failed = [m for m in TRAINING_RESULTS if m not in successful_models]
        print(f"Failed models (excluded): {failed}")
else:
    print("=" * 70)
    print(" PHASE 4: ENSEMBLE TRAINING")
    print("=" * 70)

    # Helper function to parse base models and validate
    def parse_and_validate_base_models(base_models_str, ensemble_name):
        """Parse comma-separated base models and validate availability."""
        base_model_names = [m.strip() for m in base_models_str.split(',') if m.strip()]

        if show_base_model_validation:
            print(f"\n[{ensemble_name.upper()}] Base Model Validation:")
            print(f"  Requested: {base_model_names}")

        # Validate: only use successfully trained models
        valid_base_models = [m for m in base_model_names if m in successful_models]

        invalid_models = [m for m in base_model_names if m not in successful_models]
        if invalid_models:
            print(f"  ⚠ Skipped (not trained/failed): {invalid_models}")

        # Optionally filter by CV stability.
        # Look the name up in globals(): inside a function, dir() lists only
        # local names, so a dir()-based check would never find CV_RESULTS.
        cv_results = globals().get('CV_RESULTS')
        if filter_by_cv_stability and cv_results:
            stable_models = [
                m for m in valid_base_models
                if m in cv_results and cv_results[m].get('stability') in ['Excellent', 'Good']
            ]
            if len(stable_models) < len(valid_base_models):
                unstable = [m for m in valid_base_models if m not in stable_models]
                print(f"  ⚠ Filtered (low CV stability): {unstable}")
                valid_base_models = stable_models

        if show_base_model_validation:
            print(f"  ✓ Valid base models: {valid_base_models}")

        return valid_base_models

    # Helper function to parse weights
    def parse_weights(weights_str):
        """Parse comma-separated weights string into list of floats."""
        if not weights_str or not weights_str.strip():
            return None
        try:
            return [float(w.strip()) for w in weights_str.split(',')]
        except ValueError:
            print(f"  ⚠ Invalid weights format: {weights_str}")
            return None

    # Helper function to print the metrics shared by all ensemble types
    def report_metrics(ensemble_label, metrics):
        print(f"\n  ✓ {ensemble_label} Ensemble Results:")
        print(f"     Accuracy:  {metrics['accuracy']:.2%}")
        print(f"     Macro F1:  {metrics['macro_f1']:.4f}")
        print(f"     Precision: {metrics['macro_precision']:.4f}")
        print(f"     Recall:    {metrics['macro_recall']:.4f}")

    try:
        from src.phase1.stages.datasets.container import TimeSeriesDataContainer
        from src.models.trainer import Trainer
        from src.models.config import TrainerConfig

        # Load data container
        print("\n[Data Loading]")
        print(f"  Splits directory: {SPLITS_DIR}")
        print(f"  Horizon: {TRAINING_HORIZON}")

        container = TimeSeriesDataContainer.load(SPLITS_DIR, TRAINING_HORIZON)
        print(f"  ✓ Loaded: {container.X_train.shape[0]:,} train samples")

        # ===================================================================
        # TRAIN VOTING ENSEMBLE
        # ===================================================================
        if TRAIN_VOTING:
            print("\n" + "=" * 70)
            print(" VOTING ENSEMBLE")
            print("=" * 70)

            valid_voting_models = parse_and_validate_base_models(VOTING_BASE_MODELS, "voting")

            if len(valid_voting_models) < 2:
                print(f"  ✗ Need at least 2 valid base models (got {len(valid_voting_models)})")
                print("  Skipping Voting ensemble.")
            else:
                # Parse weights if provided
                weights = parse_weights(VOTING_WEIGHTS) if VOTING_WEIGHTS else None
                if weights and len(weights) != len(valid_voting_models):
                    print(f"  ⚠ Weights count ({len(weights)}) != models count ({len(valid_voting_models)})")
                    print("  Using equal weights instead.")
                    weights = None

                # Create config
                voting_config = TrainerConfig(
                    model_name='voting',
                    horizon=TRAINING_HORIZON,
                    model_config={
                        'base_model_names': valid_voting_models,
                        'voting_type': 'soft',  # Soft voting (avg probabilities)
                        'weights': weights
                    },
                    device='cuda' if GPU_AVAILABLE else 'cpu'
                )

                print(f"\n  Base models: {valid_voting_models}")
                if weights:
                    print(f"  Weights: {weights}")
                else:
                    print(f"  Weights: Equal (1/{len(valid_voting_models)})")
                print("  Voting type: soft")

                # Train
                trainer = Trainer(voting_config)
                print("\n  Training Voting ensemble...")
                results = trainer.run(container)

                # Store and display results
                ENSEMBLE_RESULTS['voting'] = results
                report_metrics("Voting", results['metrics'])

                clear_memory()

        # ===================================================================
        # TRAIN STACKING ENSEMBLE
        # ===================================================================
        if TRAIN_STACKING:
            print("\n" + "=" * 70)
            print(" STACKING ENSEMBLE")
            print("=" * 70)

            valid_stacking_models = parse_and_validate_base_models(STACKING_BASE_MODELS, "stacking")

            if len(valid_stacking_models) < 2:
                print(f"  ✗ Need at least 2 valid base models (got {len(valid_stacking_models)})")
                print("  Skipping Stacking ensemble.")
            else:
                # Create config
                stacking_config = TrainerConfig(
                    model_name='stacking',
                    horizon=TRAINING_HORIZON,
                    model_config={
                        'base_model_names': valid_stacking_models,
                        'meta_learner': STACKING_META_LEARNER,
                        'n_folds': STACKING_N_FOLDS,
                        'use_probas': True  # Use class probabilities
                    },
                    device='cuda' if GPU_AVAILABLE else 'cpu'
                )

                print(f"\n  Base models: {valid_stacking_models}")
                print(f"  Meta-learner: {STACKING_META_LEARNER}")
                print(f"  CV folds: {STACKING_N_FOLDS}")

                # Train
                trainer = Trainer(stacking_config)
                print("\n  Training Stacking ensemble...")
                print("  (Generating out-of-fold predictions...)")
                results = trainer.run(container)

                # Store and display results
                ENSEMBLE_RESULTS['stacking'] = results
                report_metrics("Stacking", results['metrics'])

                clear_memory()

        # ===================================================================
        # TRAIN BLENDING ENSEMBLE
        # ===================================================================
        if TRAIN_BLENDING:
            print("\n" + "=" * 70)
            print(" BLENDING ENSEMBLE")
            print("=" * 70)

            valid_blending_models = parse_and_validate_base_models(BLENDING_BASE_MODELS, "blending")

            if len(valid_blending_models) < 2:
                print(f"  ✗ Need at least 2 valid base models (got {len(valid_blending_models)})")
                print("  Skipping Blending ensemble.")
            else:
                # Create config
                blending_config = TrainerConfig(
                    model_name='blending',
                    horizon=TRAINING_HORIZON,
                    model_config={
                        'base_model_names': valid_blending_models,
                        'meta_learner': BLENDING_META_LEARNER,
                        'holdout_ratio': BLENDING_HOLDOUT_RATIO
                    },
                    device='cuda' if GPU_AVAILABLE else 'cpu'
                )

                print(f"\n  Base models: {valid_blending_models}")
                print(f"  Meta-learner: {BLENDING_META_LEARNER}")
                print(f"  Holdout ratio: {BLENDING_HOLDOUT_RATIO:.0%}")

                # Train
                trainer = Trainer(blending_config)
                print("\n  Training Blending ensemble...")
                print("  (Using holdout set for meta-learner...)")
                results = trainer.run(container)

                # Store and display results
                ENSEMBLE_RESULTS['blending'] = results
                report_metrics("Blending", results['metrics'])

                clear_memory()

        # ===================================================================
        # ENSEMBLE vs BASE MODEL COMPARISON
        # ===================================================================
        if show_ensemble_comparison and ENSEMBLE_RESULTS:
            print("\n" + "=" * 70)
            print(" ENSEMBLE PERFORMANCE COMPARISON")
            print("=" * 70)

            for ensemble_name, results in ENSEMBLE_RESULTS.items():
                ensemble_f1 = results['metrics']['macro_f1']
                ensemble_acc = results['metrics']['accuracy']

                # Get base model names from config
                base_model_names = results['config']['base_model_names']

                # Find best base model
                base_f1_scores = {
                    m: TRAINING_RESULTS[m]['metrics']['macro_f1']
                    for m in base_model_names
                }
                best_base_model = max(base_f1_scores, key=base_f1_scores.get)
                best_base_f1 = base_f1_scores[best_base_model]
                best_base_acc = TRAINING_RESULTS[best_base_model]['metrics']['accuracy']

                # Calculate relative improvements (guard against a zero baseline)
                f1_improvement = (ensemble_f1 - best_base_f1) / best_base_f1 * 100 if best_base_f1 > 0 else 0.0
                acc_improvement = (ensemble_acc - best_base_acc) / best_base_acc * 100 if best_base_acc > 0 else 0.0

                print(f"\n[{ensemble_name.upper()}]")
                print(f"  Ensemble F1:    {ensemble_f1:.4f}")
                print(f"  Best Base F1:   {best_base_f1:.4f} ({best_base_model})")
                print(f"  F1 Improvement: {f1_improvement:+.2f}%")
                print(f"  Acc Improvement: {acc_improvement:+.2f}%")

                if f1_improvement > 0:
                    print("  ✓ Ensemble outperforms best base model")
                elif f1_improvement > -1:
                    print("  ≈ Ensemble comparable to best base model")
                else:
                    print("  ⚠ Ensemble underperforms best base model")

        # Summary
        print("\n" + "=" * 70)
        print(" ENSEMBLE TRAINING COMPLETE")
        print("=" * 70)
        print(f"\n  Ensembles trained: {len(ENSEMBLE_RESULTS)}")
        if ENSEMBLE_RESULTS:
            print(f"  Available: {list(ENSEMBLE_RESULTS.keys())}")
            print("\n  ✓ Results stored in ENSEMBLE_RESULTS dict")
            print("  ✓ Ready for ensemble analysis in next cell")
        else:
            print("  No ensembles successfully trained.")

    except Exception as e:
        print(f"\n✗ Ensemble training failed: {e}")
        import traceback
        traceback.print_exc()
        ENSEMBLE_RESULTS = {}
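
The analysis cell that follows (6.2) measures diversity as one minus the average pairwise agreement between base-model predictions. The feature list also names the Q-statistic, which additionally needs the true labels because it compares where two models are right or wrong. A minimal NumPy sketch of both measures follows. The helper names `disagreement` and `q_statistic` are illustrative, and the sketch assumes validation labels are available alongside the stored `val_predictions`.

```python
import numpy as np


def disagreement(pred_a, pred_b):
    """Fraction of samples on which two models predict different classes."""
    return float(np.mean(np.asarray(pred_a) != np.asarray(pred_b)))


def q_statistic(y_true, pred_a, pred_b):
    """Yule's Q-statistic for a pair of classifiers.

    Q lies in [-1, 1]: values near 1 mean the models tend to be right and
    wrong on the same samples, values near 0 or below indicate diversity.
    Returns NaN when the statistic is undefined (zero denominator).
    """
    y_true, pred_a, pred_b = (np.asarray(v) for v in (y_true, pred_a, pred_b))
    a_ok = pred_a == y_true
    b_ok = pred_b == y_true
    n11 = int(np.sum(a_ok & b_ok))    # both correct
    n00 = int(np.sum(~a_ok & ~b_ok))  # both wrong
    n10 = int(np.sum(a_ok & ~b_ok))   # only A correct
    n01 = int(np.sum(~a_ok & b_ok))   # only B correct
    denom = n11 * n00 + n01 * n10
    if denom == 0:
        return float('nan')
    return (n11 * n00 - n01 * n10) / denom


y_true = [0, 1, 2, 1, 0, 2, 1, 0]
pred_a = [0, 1, 2, 1, 0, 0, 0, 1]
pred_b = [0, 1, 2, 0, 1, 2, 1, 1]
print(f"disagreement: {disagreement(pred_a, pred_b):.3f}")
print(f"Q-statistic:  {q_statistic(y_true, pred_a, pred_b):.3f}")
```

In this toy example the two models disagree on half of the samples and their errors mostly fall on different samples, so Q comes out slightly negative. Two identical models would give Q = 1.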

In [None]:
#@title 6.2 Ensemble Analysis & Diversity { display-mode: "form" }
import os
import numpy as np
import pandas as pd
from pathlib import Path
import matplotlib.pyplot as plt
import seaborn as sns

#@markdown ### Analysis Options
show_diversity_metrics = True  #@param {type: "boolean"}
show_base_contributions = True  #@param {type: "boolean"}
show_disagreement_analysis = False  #@param {type: "boolean"}
plot_contribution_charts = True  #@param {type: "boolean"}

# Ensure environment variables
if 'ENSEMBLE_RESULTS' not in dir():
    ENSEMBLE_RESULTS = {}
if 'TRAINING_RESULTS' not in dir():
    TRAINING_RESULTS = {}


def get_val_predictions(model_names):
    """Collect stored validation predictions for the given base models."""
    return {
        m: np.asarray(TRAINING_RESULTS[m]['val_predictions'])
        for m in model_names
        if m in TRAINING_RESULTS and 'val_predictions' in TRAINING_RESULTS[m]
    }


def pairwise_agreements(predictions):
    """Return (model_i, model_j, agreement) for every pair of models."""
    names = list(predictions)
    return [
        (names[i], names[j], np.mean(predictions[names[i]] == predictions[names[j]]))
        for i in range(len(names))
        for j in range(i + 1, len(names))
    ]


if not ENSEMBLE_RESULTS:
    print("[Skipped] No ensemble models trained.")
    print("Enable TRAIN_VOTING, TRAIN_STACKING, or TRAIN_BLENDING in Section 1")
    print("and run Cell 6.1 to train ensembles.")
else:
    print("=" * 70)
    print(" ENSEMBLE ANALYSIS & DIVERSITY")
    print("=" * 70)

    try:
        from src.phase1.stages.datasets.container import TimeSeriesDataContainer

        # Load data for predictions
        if 'SPLITS_DIR' not in dir():
            PROJECT_ROOT = Path('/content/research') if os.path.exists('/content') else Path('.')
            SPLITS_DIR = PROJECT_ROOT / 'data/splits/scaled'

        container = TimeSeriesDataContainer.load(SPLITS_DIR, TRAINING_HORIZON)

        # ===================================================================
        # DIVERSITY ANALYSIS
        # ===================================================================
        if show_diversity_metrics:
            print("\n" + "-" * 70)
            print(" DIVERSITY METRICS")
            print("-" * 70)

            for ensemble_name, results in ENSEMBLE_RESULTS.items():
                print(f"\n[{ensemble_name.upper()}]")

                base_model_names = results['config']['base_model_names']
                print(f"  Base models: {base_model_names}")

                # Get predictions from each base model on validation set
                base_predictions = get_val_predictions(base_model_names)

                if len(base_predictions) >= 2:
                    # Calculate pairwise agreement
                    pairs = pairwise_agreements(base_predictions)
                    avg_agreement = np.mean([agreement for _, _, agreement in pairs])
                    diversity_score = 1 - avg_agreement

                    print(f"\n  Pairwise Agreement: {avg_agreement:.3f}")
                    print(f"  Diversity Score:    {diversity_score:.3f}")

                    # Interpret diversity
                    if diversity_score > 0.3:
                        print("  ✓ Good diversity - models complement each other")
                    elif diversity_score > 0.15:
                        print("  ≈ Moderate diversity - some complementarity")
                    else:
                        print("  ⚠ Low diversity - models may be redundant")

                    # Per-pair agreement (the Q-statistic itself is not computed here)
                    print("\n  Pairwise Diversity Details:")
                    for name_i, name_j, agreement in pairs:
                        print(f"    {name_i} <-> {name_j}: {agreement:.3f} agreement")
                else:
                    print("  ⚠ Predictions not available for diversity analysis")

        # ===================================================================
        # BASE MODEL CONTRIBUTIONS
        # ===================================================================
        if show_base_contributions:
            print("\n" + "-" * 70)
            print(" BASE MODEL CONTRIBUTIONS")
            print("-" * 70)

            contributions_data = []

            for ensemble_name, results in ENSEMBLE_RESULTS.items():
                print(f"\n[{ensemble_name.upper()}]")

                base_model_names = results['config']['base_model_names']

                if ensemble_name == 'voting':
                    # For voting: show explicit weights, or equal weights if none were set
                    weights = results['config'].get('weights')
                    if weights:
                        print("  Voting weights (explicit):")
                    else:
                        print("  Voting weights (equal):")
                        weights = [1.0 / len(base_model_names)] * len(base_model_names)

                    for model, weight in zip(base_model_names, weights):
                        print(f"    {model}: {weight:.3f}")
                        contributions_data.append({
                            'Ensemble': 'Voting',
                            'Model': model,
                            'Contribution': weight
                        })

                elif ensemble_name in ['stacking', 'blending']:
                    # For stacking/blending, true contributions would require access
                    # to meta-learner internals. As a placeholder, estimate them
                    # from each base model's individual F1 score.
                    print(f"  Meta-learner: {results['config'].get('meta_learner', 'unknown')}")
                    print("  Base model contributions (estimated from performance):")

                    contributions = {}
                    for model in base_model_names:
                        if model in TRAINING_RESULTS:
                            contributions[model] = TRAINING_RESULTS[model]['metrics']['macro_f1']

                    # Normalize to sum to 1
                    total = sum(contributions.values())
                    if total > 0:
                        for model in sorted(contributions, key=contributions.get, reverse=True):
                            contrib = contributions[model] / total
                            print(f"    {model}: {contrib:.3f} (based on F1)")
                            contributions_data.append({
                                'Ensemble': ensemble_name.capitalize(),
                                'Model': model,
                                'Contribution': contrib
                            })

            # Plot contributions
            if plot_contribution_charts and contributions_data:
                df_contrib = pd.DataFrame(contributions_data)

                fig, axes = plt.subplots(1, len(ENSEMBLE_RESULTS), figsize=(5 * len(ENSEMBLE_RESULTS), 4))
                if len(ENSEMBLE_RESULTS) == 1:
                    axes = [axes]

                for idx, (ensemble_name, results) in enumerate(ENSEMBLE_RESULTS.items()):
                    ensemble_data = df_contrib[df_contrib['Ensemble'] == ensemble_name.capitalize()]

                    axes[idx].barh(ensemble_data['Model'], ensemble_data['Contribution'])
                    axes[idx].set_xlabel('Contribution')
                    axes[idx].set_title(f'{ensemble_name.capitalize()} Ensemble')
                    # Skip axis scaling when an ensemble has no contribution rows
                    if not ensemble_data.empty:
                        axes[idx].set_xlim(0, ensemble_data['Contribution'].max() * 1.1)

                plt.tight_layout()
                plt.show()

        # ===================================================================
        # ENSEMBLE COMPARISON TABLE
        # ===================================================================
        print("\n" + "-" * 70)
        print(" ENSEMBLE COMPARISON TABLE")
        print("-" * 70)

        comparison_data = []
        for ensemble_name, results in ENSEMBLE_RESULTS.items():
            metrics = results['metrics']
            base_models = results['config']['base_model_names']

            # Calculate diversity if predictions available
            diversity = 0.0
            base_predictions = get_val_predictions(base_models)
            if len(base_predictions) >= 2:
                pairs = pairwise_agreements(base_predictions)
                diversity = 1 - np.mean([agreement for _, _, agreement in pairs])

            # Best base model improvement
            base_f1_scores = {
                m: TRAINING_RESULTS[m]['metrics']['macro_f1']
                for m in base_models if m in TRAINING_RESULTS
            }
            if base_f1_scores:
                best_base_model = max(base_f1_scores, key=base_f1_scores.get)
                best_base_f1 = base_f1_scores[best_base_model]
                improvement = (metrics['macro_f1'] - best_base_f1) / best_base_f1 * 100 if best_base_f1 > 0 else 0.0
            else:
                best_base_model = 'N/A'
                improvement = 0.0

            comparison_data.append({
                'Ensemble': ensemble_name.capitalize(),
                'Accuracy': f"{metrics['accuracy']:.2%}",
                'F1 Score': f"{metrics['macro_f1']:.4f}",
                'Base Models': len(base_models),
                'Diversity': f"{diversity:.3f}",
                'Best Base': best_base_model,
                'Improvement': f"{improvement:+.2f}%"
            })

        df_comparison = pd.DataFrame(comparison_data)
        print("\n", df_comparison.to_string(index=False))

        # ===================================================================
        # RECOMMENDATION
        # ===================================================================
        print("\n" + "-" * 70)
        print(" RECOMMENDATION")
        print("-" * 70)

        # Find best ensemble by F1
        best_ensemble_name = max(
            ENSEMBLE_RESULTS,
            key=lambda x: ENSEMBLE_RESULTS[x]['metrics']['macro_f1']
        )
        best_ensemble = ENSEMBLE_RESULTS[best_ensemble_name]

        print(f"\n  Best Ensemble: {best_ensemble_name.upper()}")
        print("  Metrics:")
        print(f"    - Accuracy: {best_ensemble['metrics']['accuracy']:.2%}")
        print(f"    - F1 Score: {best_ensemble['metrics']['macro_f1']:.4f}")
        print(f"    - Precision: {best_ensemble['metrics']['macro_precision']:.4f}")
        print(f"    - Recall: {best_ensemble['metrics']['macro_recall']:.4f}")

        # Reason
        base_models = best_ensemble['config']['base_model_names']
        print(f"\n  Reason: Highest F1 score among {len(ENSEMBLE_RESULTS)} ensembles")
        print(f"  Base models: {base_models}")

        # Check diversity
        if show_diversity_metrics:
            base_predictions = get_val_predictions(base_models)

            if len(base_predictions) >= 2:
                pairs = pairwise_agreements(base_predictions)
                diversity_score = 1 - np.mean([agreement for _, _, agreement in pairs])

                if diversity_score > 0.3:
                    print(f"  ✓ Good diversity ({diversity_score:.3f}) - models complement each other")
                elif diversity_score > 0.15:
                    print(f"  ≈ Moderate diversity ({diversity_score:.3f})")
                else:
                    print(f"  ⚠ Low diversity ({diversity_score:.3f}) - consider different base models")

        # ===================================================================
        # DISAGREEMENT ANALYSIS
        # ===================================================================
        if show_disagreement_analysis:
            print("\n" + "-" * 70)
            print(" DISAGREEMENT ANALYSIS")
            print("-" * 70)

            for ensemble_name, results in ENSEMBLE_RESULTS.items():
                print(f"\n[{ensemble_name.upper()}]")

                base_model_names = results['config']['base_model_names']

                # Get predictions
                base_predictions = get_val_predictions(base_model_names)

                if len(base_predictions) >= 2:
                    # Stack predictions as (n_models, n_samples)
                    pred_matrix = np.array(list(base_predictions.values()))

                    # A sample is a disagreement if any model differs from the first
                    disagreements = np.any(pred_matrix != pred_matrix[0], axis=0)
                    disagreement_rate = np.mean(disagreements)

                    print(f"  Disagreement rate: {disagreement_rate:.2%}")
                    print(f"  Samples with disagreement: {np.sum(disagreements):,} / {len(disagreements):,}")

                    # Show a few examples
                    disagreement_indices = np.where(disagreements)[0]
                    if len(disagreement_indices) > 0:
                        print("\n  Example disagreements (first 5):")
                        for idx in disagreement_indices[:5]:
                            predictions = {m: base_predictions[m][idx] for m in base_predictions}
                            print(f"    Sample {idx}: {predictions}")
                else:
                    print("  ⚠ Predictions not available")
              print("\n" + "=" * 70)        print(" ENSEMBLE ANALYSIS COMPLETE")        print("=" * 70)        except Exception as e:        print(f"\n✗ Ensemble analysis failed: {e}")        import traceback        traceback.print_exc()
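
The diversity score and disagreement rate reported by the ensemble analysis can be illustrated on toy data. The sketch below is a minimal, dependency-free restatement of the two quantities, not the notebook's own implementation: the function names `pairwise_diversity`, `disagreement_rate`, and `diversity_label` are hypothetical, and it assumes each model's predictions are a flat sequence of class labels. The 0.3 and 0.15 thresholds are the ones printed by the recommendation step.

```python
from itertools import combinations


def pairwise_diversity(predictions):
    """Return 1 - mean pairwise agreement between the models' label sequences."""
    names = list(predictions)
    if len(names) < 2:
        raise ValueError("need predictions from at least two models")
    agreements = []
    for a, b in combinations(names, 2):
        pairs = list(zip(predictions[a], predictions[b]))
        agreements.append(sum(x == y for x, y in pairs) / len(pairs))
    return 1.0 - sum(agreements) / len(agreements)


def disagreement_rate(predictions):
    """Return the fraction of samples on which the models do not all agree."""
    columns = list(zip(*predictions.values()))  # one tuple of labels per sample
    return sum(len(set(col)) > 1 for col in columns) / len(columns)


def diversity_label(score):
    """Map a diversity score to the bands used in the recommendation output."""
    if score > 0.3:
        return "good"
    if score > 0.15:
        return "moderate"
    return "low"


# Toy predictions for three hypothetical base models (labels: -1, 0, 1)
toy_predictions = {
    "model_a": [1, 0, -1, 1],
    "model_b": [1, 0, 1, 1],
    "model_c": [1, 1, 1, 1],
}

diversity = pairwise_diversity(toy_predictions)
rate = disagreement_rate(toy_predictions)
print(f"diversity={diversity:.3f} ({diversity_label(diversity)}), disagreement rate={rate:.2%}")
```

The two numbers answer different questions: diversity averages over model pairs, while the disagreement rate counts samples where at least one model dissents, so the second is usually the larger of the two when there are more than two base models.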

---
# 7. RESULTS & EXPORT

Summary of all results and export options.

In [None]:
#@title 7.1 Final Summary { display-mode: "form" }

import os
from pathlib import Path

# Ensure environment variables are defined
if 'IS_COLAB' not in dir():
    IS_COLAB = os.path.exists('/content')
if 'PROJECT_ROOT' not in dir():
    PROJECT_ROOT = Path('/content/research') if IS_COLAB else Path.cwd()
if 'SPLITS_DIR' not in dir():
    SPLITS_DIR = PROJECT_ROOT / 'data/splits/scaled'
if 'EXPERIMENTS_DIR' not in dir():
    EXPERIMENTS_DIR = PROJECT_ROOT / 'experiments/runs'

print("=" * 70)
print(" PIPELINE SUMMARY")
print("=" * 70)

print(f"\n Configuration:")
print(f"   Symbol: {SYMBOL}")

# Show auto-detected date range (with safety checks)
if 'DATA_START' in dir() and 'DATA_END' in dir() and DATA_START is not None and DATA_END is not None:
    print(f"   Date Range: {DATA_START.strftime('%Y-%m-%d')} to {DATA_END.strftime('%Y-%m-%d')}")
    if 'DATA_START_YEAR' in dir() and 'DATA_END_YEAR' in dir():
        print(f"   Years: {DATA_START_YEAR} - {DATA_END_YEAR}")
else:
    print(f"   Date Range: Not detected (run Section 3.1)")

print(f"   Training Horizon: H{TRAINING_HORIZON}")

if 'TRAIN_LEN' in dir():
    print(f"\n Data:")
    print(f"   Train: {TRAIN_LEN:,} samples")
    if 'VAL_LEN' in dir():
        print(f"   Val: {VAL_LEN:,} samples")
    if 'TEST_LEN' in dir():
        print(f"   Test: {TEST_LEN:,} samples")

if 'TRAINING_RESULTS' in dir() and TRAINING_RESULTS:
    print(f"\n Model Results:")
    for model, data in sorted(TRAINING_RESULTS.items(), 
                              key=lambda x: x[1]['metrics'].get('macro_f1', 0), 
                              reverse=True):
        metrics = data['metrics']
        print(f"   {model}: Acc={metrics.get('accuracy', 0):.2%}, F1={metrics.get('macro_f1', 0):.4f}")
    
    best = max(TRAINING_RESULTS, key=lambda x: TRAINING_RESULTS[x]['metrics'].get('macro_f1', 0))
    print(f"\n Best Model: {best}")

print(f"\n Saved Artifacts:")
print(f"   Data: {SPLITS_DIR}")
print(f"   Models: {EXPERIMENTS_DIR}")

print("\n" + "=" * 70)
print(" PIPELINE COMPLETE")
print("=" * 70)

In [None]:
#@title 7.2 Export Model Package { display-mode: "form" }

import os
import shutil
import joblib
import json
from pathlib import Path
from datetime import datetime
import pandas as pd
import numpy as np

# ============================================================================
# CONFIGURATION
# ============================================================================

# Ensure environment variables are defined
if 'IS_COLAB' not in dir():
    IS_COLAB = os.path.exists('/content')
if 'PROJECT_ROOT' not in dir():
    PROJECT_ROOT = Path('/content/research') if IS_COLAB else Path.cwd()
if 'EXPERIMENTS_DIR' not in dir():
    EXPERIMENTS_DIR = PROJECT_ROOT / 'experiments/runs'
if 'RESULTS_DIR' not in dir():
    RESULTS_DIR = PROJECT_ROOT / 'experiments'
if 'TRAINING_RESULTS' not in dir():
    TRAINING_RESULTS = {}
if 'TEST_RESULTS' not in dir():
    TEST_RESULTS = {}
if 'CV_RESULTS' not in dir():
    CV_RESULTS = {}
if 'ENSEMBLE_RESULTS' not in dir():
    ENSEMBLE_RESULTS = {}

#@markdown ### Export Configuration

export_model = False  #@param {type: "boolean"}
#@markdown Enable to export model package

export_selection = "Best Model"  #@param ["Best Model", "All Models", "Ensembles Only", "Top 3 Models", "Custom Selection"]
#@markdown Select which models to export

custom_models_to_export = ""  #@param {type: "string"}
#@markdown Comma-separated model names (only used if Custom Selection)

export_format = "Standard Package"  #@param ["Standard Package", "Production Bundle", "Research Archive", "Minimal (Model Only)"]
#@markdown Export package type

#@markdown ### Export Options

include_onnx = False  #@param {type: "boolean"}
#@markdown Export to ONNX format for production (XGBoost, LightGBM, CatBoost only)

include_predictions = True  #@param {type: "boolean"}
#@markdown Include validation and test predictions

include_visualizations = True  #@param {type: "boolean"}
#@markdown Include generated plots and charts

include_model_card = True  #@param {type: "boolean"}
#@markdown Generate model cards with performance details

create_zip_archive = True  #@param {type: "boolean"}
#@markdown Create ZIP archive of export package

# ============================================================================
# HELPER FUNCTIONS
# ============================================================================

def get_models_to_export():
    """Determine which models to export based on selection."""
    all_results = {**TRAINING_RESULTS, **ENSEMBLE_RESULTS}

    if not all_results:
        return []

    if export_selection == "Best Model":
        best_model = max(all_results, key=lambda x: all_results[x]['metrics'].get('macro_f1', 0))
        return [best_model]

    elif export_selection == "All Models":
        return list(all_results.keys())

    elif export_selection == "Ensembles Only":
        # Membership in ENSEMBLE_RESULTS is more reliable than matching name substrings
        return [m for m in all_results if m in ENSEMBLE_RESULTS]

    elif export_selection == "Top 3 Models":
        sorted_models = sorted(all_results.items(), key=lambda x: x[1]['metrics'].get('macro_f1', 0), reverse=True)
        return [m[0] for m in sorted_models[:3]]

    elif export_selection == "Custom Selection":
        if not custom_models_to_export:
            print("⚠ Custom selection requires model names in 'custom_models_to_export'")
            return []
        models = [m.strip() for m in custom_models_to_export.split(',')]
        valid_models = [m for m in models if m in all_results]
        if len(valid_models) < len(models):
            invalid = set(models) - set(valid_models)
            print(f"⚠ Invalid models: {invalid}")
        return valid_models

    return []


def generate_model_card(model_name, model_info, test_info=None, cv_info=None):
    """Generate model card in Markdown format."""
    metrics = model_info.get('metrics', {})
    config = model_info.get('config', {})

    card = f"""# Model Card: {model_name.upper()}

## Model Information
- **Type:** {model_info.get('model_type', 'Unknown')}
- **Symbol:** {globals().get('SYMBOL', 'N/A')}
- **Horizon:** {globals().get('TRAINING_HORIZON', 'N/A')} bars
- **Date:** {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}
- **Run ID:** {model_info.get('run_id', 'Unknown')}

## Performance Metrics

### Validation Set
- **Accuracy:** {metrics.get('accuracy', 0):.4f}
- **Macro F1:** {metrics.get('macro_f1', 0):.4f}
- **Precision:** {metrics.get('macro_precision', metrics.get('precision', 0)):.4f}
- **Recall:** {metrics.get('macro_recall', metrics.get('recall', 0)):.4f}

"""

    # Add test results if available
    if test_info:
        test_metrics = test_info.get('metrics', {})
        val_f1 = metrics.get('macro_f1', 0)
        test_f1 = test_metrics.get('macro_f1', 0)
        gap = ((test_f1 - val_f1) / val_f1 * 100) if val_f1 > 0 else 0

        card += f"""### Test Set
- **Accuracy:** {test_metrics.get('accuracy', 0):.4f}
- **Macro F1:** {test_metrics.get('macro_f1', 0):.4f}
- **Precision:** {test_metrics.get('macro_precision', test_metrics.get('precision', 0)):.4f}
- **Recall:** {test_metrics.get('macro_recall', test_metrics.get('recall', 0)):.4f}
- **Generalization Gap:** {gap:+.2f}%

"""

    # Add CV results if available
    if cv_info:
        cv_metrics = cv_info.get('cv_metrics', {})
        card += f"""### Cross-Validation
- **Mean F1:** {cv_metrics.get('mean_f1', 0):.4f} ± {cv_metrics.get('std_f1', 0):.4f}
- **Mean Accuracy:** {cv_metrics.get('mean_accuracy', 0):.4f} ± {cv_metrics.get('std_accuracy', 0):.4f}
- **Folds:** {cv_info.get('n_splits', 'N/A')}

"""

    # Add configuration
    if config:
        card += f"""## Configuration

```json
{json.dumps(config, indent=2)}
```

"""

    # Add feature information
    if 'feature_importance' in model_info:
        importance = model_info['feature_importance']
        top_features = sorted(importance.items(), key=lambda x: x[1], reverse=True)[:10]
        card += f"""## Top 10 Features

"""
        for i, (feature, score) in enumerate(top_features, 1):
            card += f"{i}. **{feature}**: {score:.4f}\n"
        card += "\n"

    # Add training details
    train_time = model_info.get('training_time_sec', 0)
    card += f"""## Training Details
- **Training Time:** {train_time:.2f}s
- **Model Size:** {model_info.get('model_size_mb', 'N/A')} MB
- **Framework:** {model_info.get('framework', 'Unknown')}

## Usage

```python
import joblib

# Load model
model = joblib.load('model.pkl')

# Make predictions
predictions = model.predict(X_test)
```

"""

    return card


def export_to_onnx(model, model_name, model_path, feature_names):
    """Export model to ONNX format (boosting models only)."""
    try:
        # Check if model type supports ONNX
        onnx_compatible = ['xgboost', 'lightgbm', 'catboost']
        if not any(m in model_name.lower() for m in onnx_compatible):
            return False, "Model type not compatible with ONNX"

        # Try to import ONNX libraries
        try:
            from skl2onnx import convert_sklearn
            from skl2onnx.common.data_types import FloatTensorType
            import onnx
        except ImportError:
            return False, "ONNX libraries not installed (skl2onnx, onnx)"

        # Load the model
        loaded_model = joblib.load(model_path)

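        # Determine number of features: prefer the exported feature names, then
        # the fitted estimator's own record; refuse to guess if neither exists
        n_features = len(feature_names) if feature_names else getattr(loaded_model, 'n_features_in_', None)
        if not n_features:
            return False, "Cannot determine number of input features"

        # Define input type
        initial_type = [('float_input', FloatTensorType([None, n_features]))]

        # Convert to ONNX. skl2onnx only converts third-party boosting estimators
        # once their converters are registered (e.g. from onnxmltools); if they
        # are not, this call raises and the failure is reported by the handler below
        onnx_model = convert_sklearn(loaded_model, initial_types=initial_type)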

        # Save ONNX model
        onnx_path = model_path.parent / 'model.onnx'
        with open(onnx_path, 'wb') as f:
            f.write(onnx_model.SerializeToString())

        # Get file size
        size_mb = onnx_path.stat().st_size / 1e6

        return True, f"ONNX export successful ({size_mb:.2f} MB)"

    except Exception as e:
        return False, f"ONNX export failed: {str(e)}"


def create_manifest(export_dir, models_exported, export_info):
    """Create manifest.json with export metadata."""
    all_results = {**TRAINING_RESULTS, **ENSEMBLE_RESULTS}

    # Find best model
    best_model = max(all_results, key=lambda x: all_results[x]['metrics'].get('macro_f1', 0))
    best_f1 = all_results[best_model]['metrics'].get('macro_f1', 0)

    # Collect model formats
    formats = {}
    for model_name in models_exported:
        model_formats = ['pkl']
        onnx_path = export_dir / 'models' / model_name / 'model.onnx'
        if onnx_path.exists():
            model_formats.append('onnx')
        formats[model_name] = model_formats

    manifest = {
        'export_timestamp': datetime.now().isoformat(),
        'symbol': globals().get('SYMBOL', 'N/A'),
        'horizon': globals().get('TRAINING_HORIZON', 'N/A'),
        'models_exported': models_exported,
        'best_model': best_model,
        'best_val_f1': best_f1,  # taken from the validation metrics, not the test set
        'export_format': export_format,
        'formats': formats,
        'data_stats': export_info.get('data_stats', {}),
        'export_options': {
            'include_onnx': include_onnx,
            'include_predictions': include_predictions,
            'include_visualizations': include_visualizations,
            'include_model_card': include_model_card
        }
    }

    manifest_path = export_dir / 'manifest.json'
    with open(manifest_path, 'w') as f:
        json.dump(manifest, f, indent=2)

    return manifest_path


def create_readme(export_dir, models_exported):
    """Create README.md with setup and usage instructions."""
    readme = f"""# ML Model Export Package

**Export Date:** {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}
**Symbol:** {globals().get('SYMBOL', 'N/A')}
**Horizon:** {globals().get('TRAINING_HORIZON', 'N/A')} bars

## Package Contents

This export package contains:

- **Models:** {len(models_exported)} trained model(s)
- **Predictions:** Validation and test set predictions
- **Metrics:** Training, validation, and test performance metrics
- **Visualizations:** Confusion matrices, feature importance, learning curves
- **Model Cards:** Detailed model documentation and performance analysis
- **Data Info:** Feature names, label mappings, data statistics

## Models Included

"""

    for model_name in models_exported:
        readme += f"- `{model_name}`\n"

    readme += """

## Directory Structure

```
├── models/              # Trained models (PKL, ONNX)
├── predictions/         # Model predictions (CSV)
├── metrics/             # Performance metrics (JSON)
├── visualizations/      # Plots and charts (PNG)
├── model_cards/         # Model documentation (MD)
├── data/                # Feature info and stats
├── manifest.json        # Export metadata
└── README.md            # This file
```

## Quick Start

### Load a Model

```python
import joblib

# Load model
model = joblib.load('models/xgboost/model.pkl')

# Make predictions
predictions = model.predict(X_test)
```

### Load ONNX Model (Production)

```python
import onnxruntime as ort

# Create inference session
session = ort.InferenceSession('models/xgboost/model.onnx')

# Run inference
input_name = session.get_inputs()[0].name
predictions = session.run(None, {input_name: X_test.astype('float32')})[0]
```

### Load Predictions

```python
import pandas as pd

# Load test predictions
test_preds = pd.read_csv('predictions/xgboost/test_predictions.csv')
print(test_preds.head())
```

## Model Cards

Each model has a detailed model card in `model_cards/` with:
- Performance metrics (validation, test, CV)
- Configuration parameters
- Feature importance
- Training details
- Usage examples

## Performance Summary

See `metrics/test_metrics.json` for detailed performance metrics across all models.

## Support

For questions or issues:
1. Review model cards for specific model details
2. Check manifest.json for export metadata
3. Consult feature documentation in data/

---

Generated by ML Model Factory
"""

    readme_path = export_dir / 'README.md'
    with open(readme_path, 'w') as f:
        f.write(readme)

    return readme_path


def export_predictions(model_name, model_info, export_dir):
    """Export validation and test predictions to CSV."""
    pred_dir = export_dir / 'predictions' / model_name
    pred_dir.mkdir(parents=True, exist_ok=True)

    # Export validation predictions if available
    if 'val_predictions' in model_info:
        val_preds = model_info['val_predictions']
        val_df = pd.DataFrame({
            'index': range(len(val_preds['actual'])),
            'actual': val_preds['actual'],
            'predicted': val_preds['predicted']
        })
        if 'confidence' in val_preds:
            val_df['confidence'] = val_preds['confidence']
        val_df['correct'] = val_df['actual'] == val_df['predicted']

        val_path = pred_dir / 'val_predictions.csv'
        val_df.to_csv(val_path, index=False)

    # Export test predictions if available
    test_info = TEST_RESULTS.get(model_name, {})
    if 'predictions' in test_info:
        test_preds = test_info['predictions']
        test_df = pd.DataFrame({
            'index': range(len(test_preds['actual'])),
            'actual': test_preds['actual'],
            'predicted': test_preds['predicted']
        })
        if 'confidence' in test_preds:
            test_df['confidence'] = test_preds['confidence']
        test_df['correct'] = test_df['actual'] == test_df['predicted']

        test_path = pred_dir / 'test_predictions.csv'
        test_df.to_csv(test_path, index=False)

    # Create predictions summary
    summary = {
        'model_name': model_name,
        'val_samples': len(val_preds['actual']) if 'val_predictions' in model_info else 0,
        'test_samples': len(test_preds['actual']) if 'predictions' in test_info else 0,
        'val_accuracy': (val_df['correct'].sum() / len(val_df)) if 'val_predictions' in model_info else None,
        'test_accuracy': (test_df['correct'].sum() / len(test_df)) if 'predictions' in test_info else None
    }

    summary_path = pred_dir / 'predictions_summary.json'
    with open(summary_path, 'w') as f:
        json.dump(summary, f, indent=2)


# ============================================================================
# MAIN EXPORT LOGIC
# ============================================================================

if export_model:
    print("=" * 80)
    print("MODEL EXPORT PACKAGE")
    print("=" * 80)

    # Get models to export
    models_to_export = get_models_to_export()

    if not models_to_export:
        print("\n⚠ No models to export. Check your selection criteria.")
    else:
        print(f"\n📦 Exporting {len(models_to_export)} model(s): {', '.join(models_to_export)}")

        # Create export directory
        timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
        symbol = SYMBOL if 'SYMBOL' in dir() else 'UNKNOWN'
        horizon = TRAINING_HORIZON if 'TRAINING_HORIZON' in dir() else 'XX'
        export_name = f"{timestamp}_{symbol}_H{horizon}"

        export_dir = RESULTS_DIR / 'exports' / export_name
        export_dir.mkdir(parents=True, exist_ok=True)

        print(f"\n📁 Export directory: {export_dir}")

        # Track export statistics
        export_stats = {
            'models_exported': 0,
            'onnx_exports': 0,
            'predictions_exported': 0,
            'visualizations_exported': 0,
            'model_cards_generated': 0,
            'errors': []
        }

        export_info = {
            'data_stats': {
                'train_samples': TRAINING_RESULTS.get(models_to_export[0], {}).get('train_samples', 0),
                'val_samples': TRAINING_RESULTS.get(models_to_export[0], {}).get('val_samples', 0),
                'test_samples': TEST_RESULTS.get(models_to_export[0], {}).get('test_samples', 0),
                'n_features': TRAINING_RESULTS.get(models_to_export[0], {}).get('n_features', 0)
            }
        }

        # Export each model
        for model_name in models_to_export:
            print(f"\n📊 Processing: {model_name}")

            all_results = {**TRAINING_RESULTS, **ENSEMBLE_RESULTS}
            model_info = all_results.get(model_name, {})

            if not model_info:
                print(f"  ⚠ No training results found for {model_name}")
                export_stats['errors'].append(f"{model_name}: No training results")
                continue

            run_id = model_info.get('run_id')
            if not run_id:
                print(f"  ⚠ No run_id found for {model_name}")
                export_stats['errors'].append(f"{model_name}: No run_id")
                continue

            # Create model directory
            model_dir = export_dir / 'models' / model_name
            model_dir.mkdir(parents=True, exist_ok=True)

            # Find and copy model file
            source_dir = EXPERIMENTS_DIR / run_id
            model_file = source_dir / 'model.pkl'

            if not model_file.exists():
                print(f"  ⚠ Model file not found: {model_file}")
                export_stats['errors'].append(f"{model_name}: Model file not found")
                continue

            # Copy model
            dest_model = model_dir / 'model.pkl'
            shutil.copy2(model_file, dest_model)
            model_size = dest_model.stat().st_size / 1e6
            print(f"  ✓ Model copied ({model_size:.2f} MB)")
            export_stats['models_exported'] += 1

            # Export to ONNX if requested
            if include_onnx:
                feature_names = model_info.get('feature_names', [])
                success, message = export_to_onnx(model_info, model_name, dest_model, feature_names)
                if success:
                    print(f"  ✓ ONNX: {message}")
                    export_stats['onnx_exports'] += 1
                else:
                    print(f"  ⚠ ONNX: {message}")

            # Save configuration
            config = model_info.get('config', {})
            if config:
                config_path = model_dir / 'config.json'
                with open(config_path, 'w') as f:
                    json.dump(config, f, indent=2)
                print(f"  ✓ Configuration saved")

            # Export predictions
            if include_predictions:
                try:
                    export_predictions(model_name, model_info, export_dir)
                    print(f"  ✓ Predictions exported")
                    export_stats['predictions_exported'] += 1
                except Exception as e:
                    print(f"  ⚠ Predictions export failed: {e}")

            # Generate model card
            if include_model_card:
                try:
                    card_dir = export_dir / 'model_cards'
                    card_dir.mkdir(parents=True, exist_ok=True)

                    test_info = TEST_RESULTS.get(model_name, {})
                    cv_info = CV_RESULTS.get(model_name, {})

                    card_content = generate_model_card(model_name, model_info, test_info, cv_info)
                    card_path = card_dir / f"{model_name}_card.md"

                    with open(card_path, 'w') as f:
                        f.write(card_content)

                    print(f"  ✓ Model card generated")
                    export_stats['model_cards_generated'] += 1
                except Exception as e:
                    print(f"  ⚠ Model card generation failed: {e}")

        # Copy visualizations
        if include_visualizations:
            print(f"\n🎨 Copying visualizations...")
            viz_dir = export_dir / 'visualizations'
            viz_dir.mkdir(parents=True, exist_ok=True)

            # Copy from experiments directory
            for model_name in models_to_export:
                model_info = {**TRAINING_RESULTS, **ENSEMBLE_RESULTS}.get(model_name, {})
                run_id = model_info.get('run_id')
                if run_id:
                    source_viz = EXPERIMENTS_DIR / run_id / 'visualizations'
                    if source_viz.exists():
                        dest_viz = viz_dir / model_name
                        shutil.copytree(source_viz, dest_viz, dirs_exist_ok=True)
                        viz_count = len(list(dest_viz.rglob('*.png')))
                        export_stats['visualizations_exported'] += viz_count

            if export_stats['visualizations_exported'] > 0:
                print(f"  ✓ {export_stats['visualizations_exported']} visualizations copied")

        # Export metrics
        print(f"\n📈 Exporting metrics...")
        metrics_dir = export_dir / 'metrics'
        metrics_dir.mkdir(parents=True, exist_ok=True)

        # Training metrics (base models and ensembles)
        combined_results = {**TRAINING_RESULTS, **ENSEMBLE_RESULTS}
        training_metrics = {m: combined_results[m]['metrics'] for m in models_to_export if m in combined_results}
        with open(metrics_dir / 'training_metrics.json', 'w') as f:
            json.dump(training_metrics, f, indent=2)

        # Test metrics
        test_metrics = {m: TEST_RESULTS[m]['metrics'] for m in models_to_export if m in TEST_RESULTS}
        if test_metrics:
            with open(metrics_dir / 'test_metrics.json', 'w') as f:
                json.dump(test_metrics, f, indent=2)

        # CV results
        cv_metrics = {m: CV_RESULTS[m] for m in models_to_export if m in CV_RESULTS}
        if cv_metrics:
            with open(metrics_dir / 'cv_results.json', 'w') as f:
                json.dump(cv_metrics, f, indent=2)

        print(f"  ✓ Metrics exported")

        # Export data info
        print(f"\n📊 Exporting data information...")
        data_dir = export_dir / 'data'
        data_dir.mkdir(parents=True, exist_ok=True)

        # Feature names
        if models_to_export:
            first_model = models_to_export[0]
            model_info = {**TRAINING_RESULTS, **ENSEMBLE_RESULTS}.get(first_model, {})
            feature_names = model_info.get('feature_names', [])

            if feature_names:
                with open(data_dir / 'feature_names.txt', 'w') as f:
                    f.write('\n'.join(feature_names))

        # Label mapping
        label_mapping = {-1: 'SHORT', 0: 'NEUTRAL', 1: 'LONG'}
        with open(data_dir / 'label_mapping.json', 'w') as f:
            json.dump(label_mapping, f, indent=2)

        # Data stats
        with open(data_dir / 'data_stats.json', 'w') as f:
            json.dump(export_info['data_stats'], f, indent=2)

        print(f"  ✓ Data info exported")

        # Create manifest
        print(f"\n📋 Creating manifest...")
        manifest_path = create_manifest(export_dir, models_to_export, export_info)
        print(f"  ✓ Manifest created: {manifest_path.name}")

        # Create README
        print(f"\n📝 Creating README...")
        readme_path = create_readme(export_dir, models_to_export)
        print(f"  ✓ README created: {readme_path.name}")

        # Calculate total size
        total_size = sum(f.stat().st_size for f in export_dir.rglob('*') if f.is_file())
        total_size_mb = total_size / 1e6

        # Create ZIP archive
        zip_path = None
        if create_zip_archive:
            print(f"\n📦 Creating ZIP archive...")
            zip_base = export_dir.parent / export_name
            zip_path = Path(shutil.make_archive(str(zip_base), 'zip', export_dir))
            zip_size_mb = zip_path.stat().st_size / 1e6
            print(f"  ✓ Archive created: {zip_path.name} ({zip_size_mb:.1f} MB)")

        # Print summary
        print("\n" + "=" * 80)
        print("EXPORT SUMMARY")
        print("=" * 80)
        print(f"\n📁 Export Path: {export_dir}")
        print(f"\n📊 Models Exported: {export_stats['models_exported']}")
        for model_name in models_to_export:
            model_info = {**TRAINING_RESULTS, **ENSEMBLE_RESULTS}.get(model_name, {})
            formats = ['PKL']
            if (export_dir / 'models' / model_name / 'model.onnx').exists():
                formats.append('ONNX')
            print(f"  ✓ {model_name} ({', '.join(formats)})")

        print(f"\n📦 Package Contents:")
        print(f"  ✓ Models: {export_stats['models_exported']}")
        if export_stats['onnx_exports'] > 0:
            print(f"  ✓ ONNX exports: {export_stats['onnx_exports']}")
        if include_predictions:
            print(f"  ✓ Predictions: Val + Test")
        print(f"  ✓ Metrics: Training, Test, CV")
        if export_stats['visualizations_exported'] > 0:
            print(f"  ✓ Visualizations: {export_stats['visualizations_exported']} plots")
        if export_stats['model_cards_generated'] > 0:
            print(f"  ✓ Model Cards: {export_stats['model_cards_generated']}")
        print(f"  ✓ Data Info: Features, labels, stats")
        print(f"  ✓ README: Setup and usage guide")

        print(f"\n💾 Total Size: {total_size_mb:.1f} MB", end='')
        if zip_path:
            zip_size_mb = zip_path.stat().st_size / 1e6
            print(f" (compressed: {zip_size_mb:.1f} MB)")
        else:
            print()

        if export_stats['errors']:
            print(f"\n⚠ Errors ({len(export_stats['errors'])}):")
            for error in export_stats['errors']:
                print(f"  - {error}")

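        print(f"\n✅ Next Steps:")
        # Build the list first so the numbering stays consecutive when ONNX is absent
        next_steps = [
            "Extract ZIP to deployment environment",
            "Review model cards for performance details",
        ]
        if export_stats['onnx_exports'] > 0:
            next_steps.append("Use ONNX models for production inference")
        next_steps.append("Check README for usage examples")
        for step_num, step in enumerate(next_steps, 1):
            print(f"{step_num}. {step}")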

        # Colab download helper
        if IS_COLAB and create_zip_archive and zip_path:
            print(f"\n" + "=" * 80)
            print("DOWNLOAD TO LOCAL")
            print("=" * 80)
            download_export = False  #@param {type: "boolean"}

            if download_export:
                try:
                    from google.colab import files
                    files.download(str(zip_path))
                    print(f"\n✓ Download started: {zip_path.name}")
                except Exception as e:
                    print(f"\n⚠ Download failed: {e}")
                    print(f"Manual download from: {zip_path}")

        print("\n" + "=" * 80)

else:
    print("Model export skipped. Enable 'export_model' checkbox above to export.")
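
After an export, the package can be sanity-checked against its own `manifest.json`. The manifest lists every selected model, including any that the export loop skipped because of a missing `run_id` or model file, so comparing it with the files on disk is a cheap way to catch an incomplete package. The sketch below is an illustration under stated assumptions: `check_export_package` is a hypothetical helper, it relies only on the `models_exported` and `formats` keys and the `models/<name>/model.<format>` layout written by the export cell, and the package it inspects is a toy one built in a temporary directory.

```python
import json
import tempfile
from pathlib import Path


def check_export_package(export_dir):
    """Return the model files named in manifest.json that are missing on disk."""
    export_dir = Path(export_dir)
    manifest = json.loads((export_dir / "manifest.json").read_text())
    missing = []
    for model_name in manifest["models_exported"]:
        for fmt in manifest["formats"].get(model_name, ["pkl"]):
            model_file = export_dir / "models" / model_name / f"model.{fmt}"
            if not model_file.exists():
                missing.append(model_file.relative_to(export_dir).as_posix())
    return missing


# Toy package built in a temporary directory, purely for demonstration
with tempfile.TemporaryDirectory() as tmp:
    package = Path(tmp)
    (package / "models" / "xgboost").mkdir(parents=True)
    (package / "models" / "xgboost" / "model.pkl").write_bytes(b"placeholder")
    (package / "manifest.json").write_text(json.dumps({
        "models_exported": ["xgboost"],
        "formats": {"xgboost": ["pkl", "onnx"]},
    }))
    missing_files = check_export_package(package)

print(missing_files)
```

An empty list means every file the manifest promises is present. Pointing the helper at a real export directory (or an extracted ZIP) works the same way.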

---
# Quick Reference

## Command Line Usage

```bash
# Train single model
python scripts/train_model.py --model xgboost --horizon 20

# Train neural model
python scripts/train_model.py --model lstm --horizon 20 --seq-len 60

# Run cross-validation
python scripts/run_cv.py --models xgboost,lightgbm --horizons 20 --n-splits 5

# Train ensemble
python scripts/train_model.py --model voting --horizon 20

# List all available models
python scripts/train_model.py --list-models
```

## Model Families

| Family | Models | Best For |
|--------|--------|----------|
| Boosting | XGBoost, LightGBM, CatBoost | Fast, accurate, tabular data |
| Classical | Random Forest, Logistic Regression, SVM | Baselines, interpretability |
| Neural | LSTM, GRU, TCN, Transformer | Sequential patterns, temporal dependencies |
| Ensemble | Voting, Stacking, Blending | Combined predictions, robustness |