# ML Model Factory - Unified Pipeline

**Single notebook for the entire ML pipeline.** Configure everything in Section 1, then run sequentially.

## Pipeline Phases
1. **Configuration** - All settings in one place
2. **Environment Setup** - Auto-detects Colab vs Local
3. **Phase 1: Data Pipeline** - Clean → Features → Labels → Splits → Scale
4. **Phase 2: Model Training** - Train any model type
5. **Phase 3: Cross-Validation** - Robust evaluation (optional)
6. **Phase 4: Ensemble** - Combine models (optional)
7. **Results & Export** - Summary and model export

---

# 1. MASTER CONFIGURATION

**Configure ALL settings here. No need to modify any other cells.**
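
Several of these settings depend on one another: the three split ratios should sum to 1, and the training horizon has to be one of the label horizons. A minimal sketch of a sanity check follows, assuming a hypothetical helper `validate_config` that is not part of the notebook or the repository:

```python
import math

def validate_config(horizons: str, training_horizon: int,
                    train_ratio: float, val_ratio: float, test_ratio: float) -> list:
    """Hypothetical helper: parse the horizon string and check the settings agree."""
    # Skip empty entries so a trailing comma does not raise
    horizon_list = [int(h) for h in horizons.split(",") if h.strip()]
    if not horizon_list:
        raise ValueError("HORIZONS must contain at least one integer")
    if any(h <= 0 for h in horizon_list):
        raise ValueError(f"Horizons must be positive, got {horizon_list}")
    if training_horizon not in horizon_list:
        raise ValueError(f"TRAINING_HORIZON={training_horizon} is not in {horizon_list}")
    total = train_ratio + val_ratio + test_ratio
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        raise ValueError(f"Split ratios sum to {total}, expected 1.0")
    return horizon_list

# Values mirror the defaults in the form below
print(validate_config("5,10,15,20", 20, 0.70, 0.15, 0.15))
```

Calling such a check at the end of the configuration cell would surface a mismatch before any data is processed.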

In [None]:
#@title 1.1 Master Configuration Panel { display-mode: "form" }
#@markdown ## Data Configuration
#@markdown ---

#@markdown ### Contract Selection
SYMBOL = "SI"  #@param ["SI", "MES", "MGC", "ES", "GC", "NQ", "CL", "HG", "ZB", "ZN"]
#@markdown Select ONE contract. Each contract is trained in complete isolation.
#@markdown - **SI** = Silver, **MES** = Micro E-mini S&P, **MGC** = Micro Gold
#@markdown - **ES** = E-mini S&P, **GC** = Gold, **NQ** = E-mini Nasdaq
#@markdown - **CL** = Crude Oil, **HG** = Copper, **ZB/ZN** = Bonds

#@markdown ### Data Source (Google Drive path relative to My Drive)
DRIVE_DATA_PATH = "research/data/raw"  #@param {type: "string"}

#@markdown ### Date Range
#@markdown **Auto-detected from your parquet/CSV file.** No manual selection needed.
#@markdown The pipeline will read the actual date range from your data.

#@markdown ---
#@markdown ## Pipeline Configuration

#@markdown ### Label Horizons (bars)
HORIZONS = "5,10,15,20"  #@param {type: "string"}
#@markdown Comma-separated prediction horizons

#@markdown ### Train/Val/Test Split Ratios
TRAIN_RATIO = 0.70  #@param {type: "number"}
VAL_RATIO = 0.15  #@param {type: "number"}
TEST_RATIO = 0.15  #@param {type: "number"}

#@markdown ### Leakage Prevention
PURGE_BARS = 60  #@param {type: "integer"}
#@markdown Bars to purge around train/val boundary (3x max horizon)
EMBARGO_BARS = 1440  #@param {type: "integer"}
#@markdown Embargo period after validation (~5 days at 5-min)

#@markdown ---
#@markdown ## Model Training Configuration

#@markdown ### Training Horizon
TRAINING_HORIZON = 20  #@param [5, 10, 15, 20]
#@markdown Which horizon to train models on

#@markdown ### Model Selection
TRAIN_XGBOOST = True  #@param {type: "boolean"}
TRAIN_LIGHTGBM = True  #@param {type: "boolean"}
TRAIN_CATBOOST = True  #@param {type: "boolean"}
TRAIN_RANDOM_FOREST = False  #@param {type: "boolean"}
TRAIN_LOGISTIC = False  #@param {type: "boolean"}
TRAIN_SVM = False  #@param {type: "boolean"}
TRAIN_LSTM = False  #@param {type: "boolean"}
TRAIN_GRU = False  #@param {type: "boolean"}
TRAIN_TCN = False  #@param {type: "boolean"}

#@markdown ### Neural Network Settings
SEQUENCE_LENGTH = 60  #@param {type: "slider", min: 30, max: 120, step: 10}
BATCH_SIZE = 256  #@param [64, 128, 256, 512, 1024]
MAX_EPOCHS = 50  #@param {type: "integer"}
EARLY_STOPPING_PATIENCE = 10  #@param {type: "integer"}

#@markdown ### Boosting Settings
N_ESTIMATORS = 500  #@param {type: "integer"}
BOOSTING_EARLY_STOPPING = 50  #@param {type: "integer"}

#@markdown ---
#@markdown ## Optional Phases

#@markdown ### Cross-Validation
RUN_CROSS_VALIDATION = False  #@param {type: "boolean"}
CV_N_SPLITS = 5  #@param {type: "integer"}
CV_TUNE_HYPERPARAMS = False  #@param {type: "boolean"}
CV_N_TRIALS = 20  #@param {type: "integer"}

#@markdown ### Ensemble Training
TRAIN_ENSEMBLE = False  #@param {type: "boolean"}
ENSEMBLE_TYPE = "voting"  #@param ["voting", "stacking", "blending"]
ENSEMBLE_META_LEARNER = "logistic"  #@param ["logistic", "random_forest", "xgboost"]

#@markdown ---
#@markdown ## Execution Options

#@markdown ### What to Run
RUN_DATA_PIPELINE = True  #@param {type: "boolean"}
#@markdown Run Phase 1 data pipeline
RUN_MODEL_TRAINING = True  #@param {type: "boolean"}
#@markdown Run Phase 2 model training

#@markdown ### Memory Management
SAFE_MODE = False  #@param {type: "boolean"}
#@markdown Enable for low-memory environments (caps batch size at 64, boosting estimators at 300, and sequence length at 30)

# ============================================================
# BUILD CONFIGURATION (DO NOT MODIFY BELOW)
# ============================================================

import os
from datetime import datetime

# Parse horizons (skip empty entries, e.g. from a trailing comma)
HORIZON_LIST = [int(h) for h in HORIZONS.split(',') if h.strip()]

# Build model list
MODELS_TO_TRAIN = []
if TRAIN_XGBOOST: MODELS_TO_TRAIN.append('xgboost')
if TRAIN_LIGHTGBM: MODELS_TO_TRAIN.append('lightgbm')
if TRAIN_CATBOOST: MODELS_TO_TRAIN.append('catboost')
if TRAIN_RANDOM_FOREST: MODELS_TO_TRAIN.append('random_forest')
if TRAIN_LOGISTIC: MODELS_TO_TRAIN.append('logistic')
if TRAIN_SVM: MODELS_TO_TRAIN.append('svm')
if TRAIN_LSTM: MODELS_TO_TRAIN.append('lstm')
if TRAIN_GRU: MODELS_TO_TRAIN.append('gru')
if TRAIN_TCN: MODELS_TO_TRAIN.append('tcn')

# Date range will be auto-detected from data file
DATA_START = None  # Auto-detected
DATA_END = None    # Auto-detected

# Safe mode adjustments
if SAFE_MODE:
    BATCH_SIZE = min(BATCH_SIZE, 64)
    N_ESTIMATORS = min(N_ESTIMATORS, 300)
    SEQUENCE_LENGTH = min(SEQUENCE_LENGTH, 30)

# Print configuration summary
print("=" * 70)
print(" ML PIPELINE CONFIGURATION")
print("=" * 70)
print(f"\n  Contract:        {SYMBOL}")
print(f"  Date Range:      Auto-detect from data file")
print(f"  Horizons:        {HORIZON_LIST}")
print(f"  Split Ratios:    {TRAIN_RATIO}/{VAL_RATIO}/{TEST_RATIO}")
print(f"  Training Horizon: H{TRAINING_HORIZON}")
print(f"  Models:          {MODELS_TO_TRAIN if MODELS_TO_TRAIN else 'None selected'}")
print(f"\n  Run Pipeline:    {RUN_DATA_PIPELINE}")
print(f"  Run Training:    {RUN_MODEL_TRAINING}")
print(f"  Cross-Validation: {RUN_CROSS_VALIDATION}")
print(f"  Ensemble:        {TRAIN_ENSEMBLE}")
print(f"  Safe Mode:       {SAFE_MODE}")
print("=" * 70)
print("\nConfiguration complete! Run the next cells sequentially.")

---
# 2. ENVIRONMENT SETUP

Auto-detects Colab vs Local environment and sets up paths.
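
The setup cell decides between Colab and local by checking whether `/content` exists. A possible alternative, sketched here under the assumption that the `google.colab` package is only importable inside Colab, is to look for that package instead. `detect_colab` and `resolve_project_root` are hypothetical names, not functions from the repository:

```python
import importlib.util
from pathlib import Path

def detect_colab() -> bool:
    """Return True when the google.colab package can be found."""
    try:
        return importlib.util.find_spec("google.colab") is not None
    except (ImportError, ValueError):
        # Raised when the parent `google` package is missing or malformed
        return False

def resolve_project_root(is_colab: bool, colab_root: str = "/content/research") -> Path:
    """Hypothetical helper: repo path in Colab, current directory otherwise."""
    return Path(colab_root) if is_colab else Path.cwd()

is_colab = detect_colab()
print(f"Colab: {is_colab}, project root: {resolve_project_root(is_colab)}")
```

Falling back to the current working directory avoids a machine-specific path, at the cost of requiring the notebook to be launched from the repository root.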

In [None]:
#@title 2.1 Environment Detection & Setup { display-mode: "form" }

import os
import sys
import gc
from pathlib import Path

# ============================================================
# ENVIRONMENT DETECTION
# ============================================================
IS_COLAB = os.path.exists('/content')

print("=" * 70)
print(" ENVIRONMENT SETUP")
print("=" * 70)

if IS_COLAB:
    print("\n[Environment] Google Colab detected")
    
    # Mount Google Drive
    from google.colab import drive
    drive.mount('/content/drive')
    
    # Clone/update repository
    REPO_PATH = Path('/content/research')
    if not REPO_PATH.exists():
        print("\n[Setup] Cloning repository...")
        !git clone https://github.com/Snehpatel101/research.git /content/research
    else:
        print("\n[Setup] Updating repository...")
        !cd /content/research && git pull --quiet
    
    # Set paths
    PROJECT_ROOT = REPO_PATH
    DRIVE_ROOT = Path('/content/drive/MyDrive')
    RAW_DATA_DIR = DRIVE_ROOT / DRIVE_DATA_PATH
    RESULTS_DIR = DRIVE_ROOT / 'research/experiments'
    
    os.chdir(PROJECT_ROOT)
    
else:
    print("\n[Environment] Local environment detected")
    
    # Local checkout location (author's machine); adjust when running elsewhere
    PROJECT_ROOT = Path('/Users/sneh/research')
    DRIVE_ROOT = None
    RAW_DATA_DIR = PROJECT_ROOT / 'data/raw'
    RESULTS_DIR = PROJECT_ROOT / 'experiments'
    
    os.chdir(PROJECT_ROOT)

# Add to Python path
if str(PROJECT_ROOT) not in sys.path:
    sys.path.insert(0, str(PROJECT_ROOT))

# Create output directories
SPLITS_DIR = PROJECT_ROOT / 'data/splits/scaled'
EXPERIMENTS_DIR = RESULTS_DIR / 'runs'
EXPERIMENTS_DIR.mkdir(parents=True, exist_ok=True)

print(f"\n  Project Root:  {PROJECT_ROOT}")
print(f"  Raw Data:      {RAW_DATA_DIR}")
print(f"  Splits:        {SPLITS_DIR}")
print(f"  Experiments:   {EXPERIMENTS_DIR}")

In [None]:
#@title 2.2 Install Dependencies { display-mode: "form" }

if IS_COLAB:
    print("[Dependencies] Installing packages...")
    !pip install -q xgboost lightgbm catboost optuna ta pywavelets scikit-learn pandas numpy matplotlib tqdm pyarrow numba
    print("[Dependencies] Installation complete!")
else:
    print("[Dependencies] Local environment - assuming packages installed.")
    print("  If needed: pip install xgboost lightgbm catboost optuna ta pywavelets pyarrow torch psutil")

# Verify imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from tqdm.auto import tqdm
import warnings
warnings.filterwarnings('ignore')

print(f"\n  pandas: {pd.__version__}")
print(f"  numpy: {np.__version__}")

In [None]:
#@title 2.3 GPU Detection { display-mode: "form" }

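# torch may be missing in a local environment; fall back to CPU-only settings
try:
    import torch
    GPU_AVAILABLE = torch.cuda.is_available()
except ImportError:
    torch = None
    GPU_AVAILABLE = False

GPU_NAME = None
GPU_MEMORY = 0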

print("=" * 70)
print(" HARDWARE DETECTION")
print("=" * 70)

if GPU_AVAILABLE:
    props = torch.cuda.get_device_properties(0)
    GPU_NAME = props.name
    GPU_MEMORY = props.total_memory / (1024**3)
    
    print(f"\n  GPU: {GPU_NAME}")
    print(f"  Memory: {GPU_MEMORY:.1f} GB")
    print(f"  Compute: {props.major}.{props.minor}")
    
    # Adjust batch size based on GPU memory
    if GPU_MEMORY >= 40:
        RECOMMENDED_BATCH = 1024
    elif GPU_MEMORY >= 15:
        RECOMMENDED_BATCH = 512
    else:
        RECOMMENDED_BATCH = 256
    
    print(f"  Recommended batch: {RECOMMENDED_BATCH}")
else:
    print("\n  GPU: Not available (using CPU)")
    print("  Tip: Runtime -> Change runtime type -> GPU")
    RECOMMENDED_BATCH = 128

# Check for neural models without GPU
NEURAL_MODELS = {'lstm', 'gru', 'tcn'}
selected_neural = set(MODELS_TO_TRAIN) & NEURAL_MODELS
if selected_neural and not GPU_AVAILABLE:
    print(f"\n  [WARNING] Neural models selected but no GPU: {selected_neural}")
    print("  Training will be slow on CPU.")

In [None]:
#@title 2.4 Memory Utilities { display-mode: "form" }

import psutil
import gc

def print_memory_status(label: str = "Current"):
    """Print current RAM and GPU memory usage."""
    print(f"\n--- Memory: {label} ---")
    
    # RAM
    ram = psutil.virtual_memory()
    print(f"RAM: {ram.used/1e9:.1f}GB / {ram.total/1e9:.1f}GB ({ram.percent}%)")
    
    # GPU
    if GPU_AVAILABLE:
        allocated = torch.cuda.memory_allocated() / 1e9
        reserved = torch.cuda.memory_reserved() / 1e9
        print(f"GPU: {allocated:.2f}GB allocated, {reserved:.2f}GB reserved")

def clear_memory():
    """Clear RAM and GPU memory."""
    gc.collect()
    if GPU_AVAILABLE:
        torch.cuda.empty_cache()
        torch.cuda.synchronize()
    print("Memory cleared.")

print("Memory utilities loaded.")
print_memory_status("Initial")

---
# 3. PHASE 1: DATA PIPELINE

Processes raw OHLCV data into training-ready datasets.

**Pipeline stages:**
1. Load raw 1-minute data
2. Clean and resample to 5-minute bars
3. Generate 150+ technical features
4. Apply triple-barrier labeling
5. Create train/val/test splits with purge/embargo
6. Scale features (train-only fit)
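
Stages 5 and 6 are where leakage is usually introduced, so a small illustration may help. The sketch below assumes one plausible scheme: rows are time-ordered, `purge_bars` rows are dropped from the end of the training span, and `embargo_bars` rows are skipped between validation and test. The actual `src.phase1` implementation may differ, and `chronological_split` and `fit_standardizer` are hypothetical names:

```python
import numpy as np

def chronological_split(n_rows, train_ratio, val_ratio, purge_bars, embargo_bars):
    """Hypothetical helper: time-ordered train/val/test row indices.

    Assumed scheme: drop `purge_bars` rows from the end of train and skip
    `embargo_bars` rows between the end of val and the start of test.
    """
    train_end = round(n_rows * train_ratio)
    val_end = round(n_rows * (train_ratio + val_ratio))
    train_idx = np.arange(0, max(train_end - purge_bars, 0))
    val_idx = np.arange(train_end, val_end)
    test_idx = np.arange(min(val_end + embargo_bars, n_rows), n_rows)
    return train_idx, val_idx, test_idx

def fit_standardizer(train_features):
    """Compute per-column mean and std from the training rows only."""
    mean = train_features.mean(axis=0)
    std = train_features.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # avoid division by zero for constant columns
    return mean, std

# Toy data: 10,000 bars with 3 random features
rng = np.random.default_rng(0)
features = rng.normal(size=(10_000, 3))

train_idx, val_idx, test_idx = chronological_split(
    10_000, 0.70, 0.15, purge_bars=60, embargo_bars=1440
)

# Fit on train only, then apply the same statistics to every split
mean, std = fit_standardizer(features[train_idx])
scaled = (features - mean) / std

print(len(train_idx), len(val_idx), len(test_idx))
```

With a toy series this short, the 1440-bar embargo consumes most of the test span, which is a reminder that the embargo setting only makes sense relative to the size of the dataset.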

In [None]:
#@title 3.1 Verify Raw Data & Detect Date Range { display-mode: "form" }

import pandas as pd
from pathlib import Path

print("=" * 70)
print(" RAW DATA VERIFICATION")
print("=" * 70)
print(f"\nLooking for {SYMBOL} data in: {RAW_DATA_DIR}")

# Find data file
RAW_DATA_FILE = None
patterns = [
    f"{SYMBOL}_1m.parquet", f"{SYMBOL}_1m.csv",
    f"{SYMBOL}.parquet", f"{SYMBOL}.csv",
    f"{SYMBOL}_1min.parquet", f"{SYMBOL}_1min.csv",
]

for pattern in patterns:
    path = RAW_DATA_DIR / pattern
    if path.exists():
        RAW_DATA_FILE = path
        break

if RAW_DATA_FILE:
    size_mb = RAW_DATA_FILE.stat().st_size / 1e6
    print(f"\n  Found: {RAW_DATA_FILE.name} ({size_mb:.1f} MB)")
    
    # Load and validate
    if RAW_DATA_FILE.suffix == '.parquet':
        df_raw = pd.read_parquet(RAW_DATA_FILE)
    else:
        df_raw = pd.read_csv(RAW_DATA_FILE)
    
    print(f"  Rows: {len(df_raw):,}")
    print(f"  Columns: {list(df_raw.columns)}")
    
    # Validate OHLCV columns
    required = {'open', 'high', 'low', 'close', 'volume'}
    found = {c.lower() for c in df_raw.columns}
    if required.issubset(found):
        print("  OHLCV columns: OK")
    else:
        missing = required - found
        print(f"  [ERROR] Missing columns: {missing}")
    
    # ============================================================
    # AUTO-DETECT DATE RANGE FROM DATA
    # ============================================================
    date_col = None
    for c in df_raw.columns:
        if 'date' in c.lower() or 'time' in c.lower():
            date_col = c
            break
    
    if date_col:
        df_raw[date_col] = pd.to_datetime(df_raw[date_col])
        
        # Store globally for pipeline use
        DATA_START = df_raw[date_col].min()
        DATA_END = df_raw[date_col].max()
        DATA_START_YEAR = DATA_START.year
        DATA_END_YEAR = DATA_END.year
        
        print(f"\n  [AUTO-DETECTED DATE RANGE]")
        print(f"  Start: {DATA_START} ({DATA_START_YEAR})")
        print(f"  End:   {DATA_END} ({DATA_END_YEAR})")
        print(f"  Span:  {(DATA_END - DATA_START).days} days ({DATA_END_YEAR - DATA_START_YEAR + 1} years)")
    else:
        print("  [WARNING] No datetime column found - using index")
        DATA_START = None
        DATA_END = None
        DATA_START_YEAR = 2019
        DATA_END_YEAR = 2024
    
    del df_raw
    gc.collect()
    
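    # Only report success when the OHLCV check above passed
    if required.issubset(found):
        print("\n  Data verified and ready for processing!")
    else:
        print("\n  [ERROR] Raw data is missing required OHLCV columns; fix the file before running the pipeline.")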
else:
    print(f"\n  [ERROR] No data file found for {SYMBOL}!")
    print(f"  Expected location: {RAW_DATA_DIR}")
    print(f"  Expected files: {SYMBOL}_1m.csv or {SYMBOL}_1m.parquet")
    print(f"\n  Available files in directory:")
    if RAW_DATA_DIR.exists():
        for f in RAW_DATA_DIR.iterdir():
            if f.suffix in ['.csv', '.parquet']:
                print(f"    - {f.name}")
    else:
        print(f"    Directory does not exist!")
    
    RAW_DATA_FILE = None
    DATA_START = None
    DATA_END = None

In [None]:
#@title 3.2 Run Data Pipeline { display-mode: "form" }

if not RUN_DATA_PIPELINE:
    print("[Skipped] Data pipeline disabled in configuration.")
    print("Set RUN_DATA_PIPELINE = True in Section 1 to enable.")
elif RAW_DATA_FILE is None:
    print("[Error] No raw data file found. Cannot run pipeline.")
else:
    import time
    from datetime import datetime
    
    print("=" * 70)
    print(" PHASE 1: DATA PIPELINE")
    print("=" * 70)
    print(f"\n  Symbol: {SYMBOL}")
    
    # Use auto-detected date range
    if DATA_START is not None and DATA_END is not None:
        print(f"  Date Range: {DATA_START.strftime('%Y-%m-%d')} to {DATA_END.strftime('%Y-%m-%d')} (auto-detected)")
        start_date_str = DATA_START.strftime('%Y-%m-%d')
        end_date_str = DATA_END.strftime('%Y-%m-%d')
    else:
        print(f"  Date Range: Full dataset (no filter)")
        start_date_str = None
        end_date_str = None
    
    print(f"  Horizons: {HORIZON_LIST}")
    
    start_time = time.time()
    
    try:
        from src.phase1.pipeline_config import PipelineConfig
        from src.pipeline.runner import PipelineRunner
        
        # Configure pipeline with auto-detected dates
        config = PipelineConfig(
            symbols=[SYMBOL],
            project_root=PROJECT_ROOT,
            label_horizons=HORIZON_LIST,
            train_ratio=TRAIN_RATIO,
            val_ratio=VAL_RATIO,
            test_ratio=TEST_RATIO,
            purge_bars=PURGE_BARS,
            embargo_bars=EMBARGO_BARS,
            start_date=start_date_str,
            end_date=end_date_str,
            allow_batch_symbols=False,  # Single-contract architecture
        )
        
        # Run pipeline
        runner = PipelineRunner(config)
        success = runner.run()
        
        elapsed = time.time() - start_time
        
        if success:
            print(f"\n  Pipeline completed in {elapsed/60:.1f} minutes")
            
            # Verify output
            if (SPLITS_DIR / 'train_scaled.parquet').exists():
                for split in ['train', 'val', 'test']:
                    df = pd.read_parquet(SPLITS_DIR / f'{split}_scaled.parquet')
                    print(f"  {split}: {len(df):,} samples")
                    del df
                gc.collect()
                print("\n  Data ready for training!")
        else:
            print("\n  [ERROR] Pipeline failed. Check logs above.")
        
        del runner, config
        clear_memory()
        
    except Exception as e:
        print(f"\n  [ERROR] Pipeline failed: {e}")
        import traceback
        traceback.print_exc()

In [None]:
#@title 3.3 Verify Processed Data { display-mode: "form" }

print("=" * 70)
print(" PROCESSED DATA VERIFICATION")
print("=" * 70)

# Check for pre-processed data (local) or pipeline output (Colab)
if not IS_COLAB:
    # Local: check pre-processed data
    local_splits = PROJECT_ROOT / 'data/splits/final_correct/scaled'
    if (local_splits / 'train_scaled.parquet').exists():
        SPLITS_DIR = local_splits
        print(f"\nUsing pre-processed data: {SPLITS_DIR}")

if (SPLITS_DIR / 'train_scaled.parquet').exists():
    # Load metadata without keeping DataFrames
    train_df = pd.read_parquet(SPLITS_DIR / 'train_scaled.parquet')
    
    FEATURE_COLS = [c for c in train_df.columns 
                   if not c.startswith(('label_', 'sample_weight', 'quality_', 'datetime', 'symbol'))]
    LABEL_COLS = [c for c in train_df.columns if c.startswith('label_')]
    TRAIN_LEN = len(train_df)
    
    # Label distribution
    label_dists = {}
    for col in LABEL_COLS:
        label_dists[col] = train_df[col].value_counts().sort_index().to_dict()
    
    del train_df
    
    # Get val/test sizes
    val_df = pd.read_parquet(SPLITS_DIR / 'val_scaled.parquet')
    VAL_LEN = len(val_df)
    del val_df
    
    test_df = pd.read_parquet(SPLITS_DIR / 'test_scaled.parquet')
    TEST_LEN = len(test_df)
    del test_df
    
    gc.collect()
    
    print(f"\nDataset Summary:")
    print(f"  Train: {TRAIN_LEN:,} samples")
    print(f"  Val:   {VAL_LEN:,} samples")
    print(f"  Test:  {TEST_LEN:,} samples")
    print(f"  Total: {TRAIN_LEN + VAL_LEN + TEST_LEN:,} samples")
    print(f"\n  Features: {len(FEATURE_COLS)}")
    print(f"  Labels: {LABEL_COLS}")
    
    print(f"\nLabel Distribution (train):")
    for col, dist in label_dists.items():
        total = sum(dist.values())
        long_pct = dist.get(1, 0) / total * 100
        neutral_pct = dist.get(0, 0) / total * 100
        short_pct = dist.get(-1, 0) / total * 100
        print(f"  {col}: Long={long_pct:.1f}% | Neutral={neutral_pct:.1f}% | Short={short_pct:.1f}%")
    
    DATA_READY = True
    print("\n  Data verified and ready for training!")
else:
    print("\n[ERROR] Processed data not found!")
    print(f"  Expected: {SPLITS_DIR}/train_scaled.parquet")
    print("  Run Section 3.2 to process raw data.")
    DATA_READY = False

---
# 4. PHASE 2: MODEL TRAINING

Train selected models on the processed data.

In [None]:
#@title 4.1 Train Models { display-mode: "form" }

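# Keep results from an earlier run when training is skipped this time,
# so later cells can always reference TRAINING_RESULTS
TRAINING_RESULTS = globals().get('TRAINING_RESULTS', {})

if not RUN_MODEL_TRAINING:
    print("[Skipped] Model training disabled in configuration.")
elif not globals().get('DATA_READY', False):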
    print("[Error] Data not ready. Run Section 3 first.")
elif not MODELS_TO_TRAIN:
    print("[Error] No models selected. Enable models in Section 1.")
else:
    import time
    import json
    
    print("=" * 70)
    print(" PHASE 2: MODEL TRAINING")
    print("=" * 70)
    print(f"\n  Models: {MODELS_TO_TRAIN}")
    print(f"  Horizon: H{TRAINING_HORIZON}")
    
    TRAINING_RESULTS = {}
    
    try:
        from src.models import ModelRegistry, Trainer, TrainerConfig
        from src.phase1.stages.datasets.container import TimeSeriesDataContainer
        
        # Load data container
        print("\nLoading data...")
        container = TimeSeriesDataContainer.from_parquet_dir(
            path=SPLITS_DIR,
            horizon=TRAINING_HORIZON
        )
        print(f"  Train: {container.splits['train'].n_samples:,}")
        print(f"  Val: {container.splits['val'].n_samples:,}")
        
        # Train each model
        for i, model_name in enumerate(MODELS_TO_TRAIN, 1):
            print(f"\n{'='*60}")
            print(f" [{i}/{len(MODELS_TO_TRAIN)}] Training: {model_name.upper()}")
            print("=" * 60)
            
            clear_memory()
            start_time = time.time()
            
            # Configure model
            if model_name in ['lstm', 'gru', 'tcn']:
                config = TrainerConfig(
                    model_name=model_name,
                    horizon=TRAINING_HORIZON,
                    sequence_length=SEQUENCE_LENGTH,
                    batch_size=BATCH_SIZE,
                    max_epochs=MAX_EPOCHS,
                    early_stopping_patience=EARLY_STOPPING_PATIENCE,
                    output_dir=EXPERIMENTS_DIR,
                    device="cuda" if GPU_AVAILABLE else "cpu",
                )
            elif model_name == 'catboost':
                config = TrainerConfig(
                    model_name=model_name,
                    horizon=TRAINING_HORIZON,
                    output_dir=EXPERIMENTS_DIR,
                    model_config={
                        "iterations": N_ESTIMATORS,
                        "early_stopping_rounds": BOOSTING_EARLY_STOPPING,
                        "use_gpu": False,
                        "task_type": "CPU",
                        "verbose": False,
                    },
                )
            else:
                config = TrainerConfig(
                    model_name=model_name,
                    horizon=TRAINING_HORIZON,
                    output_dir=EXPERIMENTS_DIR,
                    model_config={
                        "n_estimators": N_ESTIMATORS,
                        "early_stopping_rounds": BOOSTING_EARLY_STOPPING,
                    } if model_name in ['xgboost', 'lightgbm'] else None,
                )
            
            # Train
            trainer = Trainer(config)
            results = trainer.run(container)
            elapsed = time.time() - start_time
            
            # Store results
            metrics = results.get('evaluation_metrics', {})
            TRAINING_RESULTS[model_name] = {
                'metrics': metrics,
                'time': elapsed,
                'run_id': results.get('run_id', 'unknown'),
            }
            
            print(f"\n  Accuracy: {metrics.get('accuracy', 0):.2%}")
            print(f"  Macro F1: {metrics.get('macro_f1', 0):.4f}")
            print(f"  Time: {elapsed:.1f}s")
            
            del trainer, config
            clear_memory()
        
        # Save results
        results_file = EXPERIMENTS_DIR / 'training_results.json'
        with open(results_file, 'w') as f:
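            # Convert NumPy scalars/arrays to plain Python values; stringify anything else
            json.dump(TRAINING_RESULTS, f, indent=2,
                      default=lambda o: o.tolist() if hasattr(o, 'tolist') else str(o))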
        print(f"\nResults saved to: {results_file}")
        
        del container
        clear_memory()
        
    except Exception as e:
        print(f"\n[ERROR] Training failed: {e}")
        import traceback
        traceback.print_exc()
        clear_memory()

In [None]:
#@title 4.2 Compare Models { display-mode: "form" }

if globals().get('TRAINING_RESULTS'):
    print("=" * 70)
    print(" MODEL COMPARISON")
    print("=" * 70)
    
    # Build comparison table
    rows = []
    for model, data in TRAINING_RESULTS.items():
        metrics = data.get('metrics', {})
        rows.append({
            'Model': model,
            'Accuracy': metrics.get('accuracy', 0),
            'Macro F1': metrics.get('macro_f1', 0),
            'Weighted F1': metrics.get('weighted_f1', 0),
            'Time (s)': data.get('time', 0),
        })
    
    comparison_df = pd.DataFrame(rows)
    comparison_df = comparison_df.sort_values('Macro F1', ascending=False)
    
    print("\n")
    print(comparison_df.to_string(index=False))
    
    # Best model
    best_model = comparison_df.iloc[0]['Model']
    best_f1 = comparison_df.iloc[0]['Macro F1']
    print(f"\n  Best Model: {best_model} (F1: {best_f1:.4f})")
    
    # Visualization
    if len(TRAINING_RESULTS) > 1:
        fig, axes = plt.subplots(1, 2, figsize=(12, 4))
        
        # Accuracy comparison
        sorted_df = comparison_df.sort_values('Accuracy', ascending=True)
        axes[0].barh(sorted_df['Model'], sorted_df['Accuracy'], color='steelblue')
        axes[0].set_xlabel('Accuracy')
        axes[0].set_title('Model Accuracy')
        axes[0].set_xlim(0, 1)
        
        # Training time
        sorted_df = comparison_df.sort_values('Time (s)', ascending=True)
        axes[1].barh(sorted_df['Model'], sorted_df['Time (s)'], color='coral')
        axes[1].set_xlabel('Training Time (seconds)')
        axes[1].set_title('Training Time')
        
        plt.tight_layout()
        plt.show()
else:
    print("No training results available.")
    print("Run Section 4.1 to train models.")

---
# 5. PHASE 3: CROSS-VALIDATION (Optional)

Runs purged K-fold cross-validation on the training split for the best model from Phase 2 (falling back to XGBoost when no model has been trained).
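
To make the idea of purging concrete, here is a minimal sketch of how fold indices could be generated. It assumes contiguous validation folds, with `purge_bars` rows removed from training before each validation fold and `embargo_bars` rows removed after it. `purged_kfold_indices` is a hypothetical function; the repository's `PurgedKFold` may behave differently:

```python
import numpy as np

def purged_kfold_indices(n_samples, n_splits, purge_bars, embargo_bars):
    """Yield (train_idx, val_idx) pairs with a gap around each validation fold."""
    fold_bounds = np.linspace(0, n_samples, n_splits + 1, dtype=int)
    all_idx = np.arange(n_samples)
    for start, stop in zip(fold_bounds[:-1], fold_bounds[1:]):
        val_idx = all_idx[start:stop]
        # Keep rows far enough before the fold, or far enough after it
        keep = (all_idx < start - purge_bars) | (all_idx >= stop + embargo_bars)
        yield all_idx[keep], val_idx

folds = list(purged_kfold_indices(n_samples=1000, n_splits=5, purge_bars=10, embargo_bars=20))
for fold_number, (train_idx, val_idx) in enumerate(folds, start=1):
    print(f"fold {fold_number}: train={len(train_idx)}, val={len(val_idx)}")
```

The gap on both sides matters because labels look forward in time: a training row just before the validation fold has a label computed from bars inside that fold.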

In [None]:
#@title 5.1 Run Cross-Validation { display-mode: "form" }

if not RUN_CROSS_VALIDATION:
    print("[Skipped] Cross-validation disabled in configuration.")
    print("Set RUN_CROSS_VALIDATION = True in Section 1 to enable.")
else:
    print("=" * 70)
    print(" PHASE 3: CROSS-VALIDATION")
    print("=" * 70)
    
    try:
        from src.cross_validation import PurgedKFold, PurgedKFoldConfig
        from src.models import ModelRegistry  # used below; 4.1 may have been skipped
        from src.phase1.stages.datasets.container import TimeSeriesDataContainer
        from sklearn.metrics import f1_score
        
        # Load data
        container = TimeSeriesDataContainer.from_parquet_dir(
            path=SPLITS_DIR,
            horizon=TRAINING_HORIZON
        )
        
        X, y, _ = container.get_sklearn_arrays('train')
        print(f"\nData: {X.shape[0]:,} samples, {X.shape[1]} features")
        
        # Configure CV
        cv_config = PurgedKFoldConfig(
            n_splits=CV_N_SPLITS,
            purge_bars=PURGE_BARS,
            embargo_bars=EMBARGO_BARS,
        )
        cv = PurgedKFold(cv_config)
        
        print(f"CV: {CV_N_SPLITS} folds, purge={PURGE_BARS}, embargo={EMBARGO_BARS}")
        
        # Run CV for the best trained model by macro F1; fall back to XGBoost if none was trained
        trained = globals().get('TRAINING_RESULTS') or {}
        if trained:
            best_model = max(trained, key=lambda m: trained[m]['metrics'].get('macro_f1', 0))
        else:
            best_model = 'xgboost'
        
        print(f"\nRunning CV for: {best_model}")
        
        fold_scores = []
        for fold_idx, (train_idx, val_idx) in enumerate(tqdm(cv.split(X, y), total=CV_N_SPLITS, desc="CV Folds")):
            X_train, X_val = X[train_idx], X[val_idx]
            y_train, y_val = y[train_idx], y[val_idx]
            
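            # CatBoost is configured with "iterations" in Section 4.1; the other boosters use "n_estimators"
            n_trees_key = 'iterations' if best_model == 'catboost' else 'n_estimators'
            model = ModelRegistry.create(best_model, config={
                n_trees_key: N_ESTIMATORS,
                'early_stopping_rounds': BOOSTING_EARLY_STOPPING,
            })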
            model.fit(X_train, y_train, X_val, y_val)
            
            predictions = model.predict(X_val)
            f1 = f1_score(y_val, predictions.class_predictions, average='macro')
            fold_scores.append(f1)
            
            del model
            clear_memory()
        
        print(f"\nCV Results for {best_model}:")
        print(f"  Mean F1: {np.mean(fold_scores):.4f} (+/- {np.std(fold_scores):.4f})")
        print(f"  Fold scores: {[f'{s:.4f}' for s in fold_scores]}")
        
        del container, X, y
        clear_memory()
        
    except Exception as e:
        print(f"\n[ERROR] Cross-validation failed: {e}")
        import traceback
        traceback.print_exc()

---
# 6. PHASE 4: ENSEMBLE (Optional)

Combines the models trained in Phase 2 into a voting, stacking, or blending ensemble and compares it with the best single model.
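
As an illustration of the simplest option, the sketch below shows soft voting: class probabilities from each base model are averaged and the most probable class is chosen. It assumes three classes encoded as -1, 0, and 1 (short, neutral, long) and uses made-up probabilities. `soft_vote` is a hypothetical function, not the repository's ensemble implementation:

```python
import numpy as np

CLASSES = np.array([-1, 0, 1])  # short, neutral, long

def soft_vote(prob_list, weights=None):
    """Average per-model class probabilities and pick the most probable class.

    prob_list: list of arrays, each of shape (n_samples, n_classes)
    weights:   optional per-model weights, one per entry in prob_list
    """
    stacked = np.stack(prob_list)  # (n_models, n_samples, n_classes)
    avg = np.average(stacked, axis=0, weights=weights)
    return CLASSES[avg.argmax(axis=1)], avg

# Made-up probabilities from two base models for two samples
model_a = np.array([[0.6, 0.3, 0.1], [0.2, 0.3, 0.5]])
model_b = np.array([[0.1, 0.6, 0.3], [0.1, 0.2, 0.7]])

labels, avg_probs = soft_vote([model_a, model_b])
print(labels)
print(avg_probs)
```

Stacking and blending replace the plain average with a meta-learner trained on the base models' predictions, which is why the meta-learner setting only matters for those two types.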

In [None]:
#@title 6.1 Train Ensemble { display-mode: "form" }

if not TRAIN_ENSEMBLE:
    print("[Skipped] Ensemble training disabled in configuration.")
    print("Set TRAIN_ENSEMBLE = True in Section 1 to enable.")
elif len(globals().get('TRAINING_RESULTS') or {}) < 2:
    print("[Error] Need at least 2 trained models for ensemble.")
    print(f"Currently trained: {list(globals().get('TRAINING_RESULTS') or {})}")
else:
    print("=" * 70)
    print(f" PHASE 4: {ENSEMBLE_TYPE.upper()} ENSEMBLE")
    print("=" * 70)
    
    base_models = list(TRAINING_RESULTS.keys())
    print(f"\n  Base models: {base_models}")
    print(f"  Meta-learner: {ENSEMBLE_META_LEARNER}")
    
    try:
        from src.models import Trainer, TrainerConfig  # repeated from 4.1 so this cell runs on its own
        from src.phase1.stages.datasets.container import TimeSeriesDataContainer
        
        # Load data
        container = TimeSeriesDataContainer.from_parquet_dir(
            path=SPLITS_DIR,
            horizon=TRAINING_HORIZON
        )
        
        # Configure ensemble
        config = TrainerConfig(
            model_name=ENSEMBLE_TYPE,
            horizon=TRAINING_HORIZON,
            output_dir=EXPERIMENTS_DIR,
            model_config={
                "base_model_names": base_models,
                "meta_learner": ENSEMBLE_META_LEARNER,
            }
        )
        
        # Train ensemble
        trainer = Trainer(config)
        results = trainer.run(container)
        
        metrics = results.get('evaluation_metrics', {})
        print(f"\nEnsemble Results:")
        print(f"  Accuracy: {metrics.get('accuracy', 0):.2%}")
        print(f"  Macro F1: {metrics.get('macro_f1', 0):.4f}")
        
        # Compare to best single model
        best_single = max(TRAINING_RESULTS.values(), key=lambda x: x['metrics'].get('macro_f1', 0))
        best_f1 = best_single['metrics'].get('macro_f1', 0)
        ensemble_f1 = metrics.get('macro_f1', 0)
        
        improvement = (ensemble_f1 - best_f1) / best_f1 * 100 if best_f1 > 0 else 0
        print(f"\n  Improvement over best single model: {improvement:+.1f}%")
        
        del trainer, config, container
        clear_memory()
        
    except Exception as e:
        print(f"\n[ERROR] Ensemble training failed: {e}")
        import traceback
        traceback.print_exc()

---
# 7. RESULTS & EXPORT

Summary of all results and export options.

In [None]:
#@title 7.1 Final Summary { display-mode: "form" }

print("=" * 70)
print(" PIPELINE SUMMARY")
print("=" * 70)

print(f"\n Configuration:")
print(f"   Symbol: {SYMBOL}")

# Show auto-detected date range
if 'DATA_START' in dir() and DATA_START is not None:
    print(f"   Date Range: {DATA_START.strftime('%Y-%m-%d')} to {DATA_END.strftime('%Y-%m-%d')}")
    print(f"   Years: {DATA_START_YEAR} - {DATA_END_YEAR}")
else:
    print(f"   Date Range: Not detected (run Section 3.1)")

print(f"   Training Horizon: H{TRAINING_HORIZON}")

if 'TRAIN_LEN' in dir():
    print(f"\n Data:")
    print(f"   Train: {TRAIN_LEN:,} samples")
    print(f"   Val: {VAL_LEN:,} samples")
    print(f"   Test: {TEST_LEN:,} samples")

if 'TRAINING_RESULTS' in dir() and TRAINING_RESULTS:
    print(f"\n Model Results:")
    for model, data in sorted(TRAINING_RESULTS.items(), 
                              key=lambda x: x[1]['metrics'].get('macro_f1', 0), 
                              reverse=True):
        metrics = data['metrics']
        print(f"   {model}: Acc={metrics.get('accuracy', 0):.2%}, F1={metrics.get('macro_f1', 0):.4f}")
    
    best = max(TRAINING_RESULTS, key=lambda x: TRAINING_RESULTS[x]['metrics'].get('macro_f1', 0))
    print(f"\n Best Model: {best}")

print(f"\n Saved Artifacts:")
print(f"   Data: {SPLITS_DIR}")
print(f"   Models: {EXPERIMENTS_DIR}")

print("\n" + "=" * 70)
print(" PIPELINE COMPLETE")
print("=" * 70)

In [None]:
#@title 7.2 Export Best Model { display-mode: "form" }

export_model = False  #@param {type: "boolean"}
export_format = "pickle"  #@param ["pickle", "joblib", "onnx"]
# NOTE: export_format is not applied yet; the run directory is copied as-is.

if not export_model:
    print("Model export skipped. Enable checkbox above to export.")
elif not globals().get('TRAINING_RESULTS'):
    print("No training results available. Run Section 4.1 first.")
else:
    import shutil

    # Find best model by macro F1
    best_model_name = max(TRAINING_RESULTS, key=lambda x: TRAINING_RESULTS[x]['metrics'].get('macro_f1', 0))
    best_run_id = TRAINING_RESULTS[best_model_name].get('run_id', 'unknown')

    print(f"Exporting best model: {best_model_name}")
    print(f"Run ID: {best_run_id}")

    # Copy the whole run directory (model files and metadata) into the export folder
    model_dir = EXPERIMENTS_DIR / best_run_id
    if model_dir.exists():
        export_dir = RESULTS_DIR / 'exports'
        export_dir.mkdir(parents=True, exist_ok=True)

        export_path = export_dir / best_run_id
        shutil.copytree(model_dir, export_path, dirs_exist_ok=True)

        print(f"\nExported to: {export_path}")
    else:
        print(f"Model directory not found: {model_dir}")

---
# Quick Reference

## Command Line Usage

```bash
# Train single model
python scripts/train_model.py --model xgboost --horizon 20

# Train neural model
python scripts/train_model.py --model lstm --horizon 20 --seq-len 60

# Run cross-validation
python scripts/run_cv.py --models xgboost,lightgbm --horizons 20 --n-splits 5

# Train ensemble
python scripts/train_model.py --model voting --horizon 20

# List all available models
python scripts/train_model.py --list-models
```

## Model Families

| Family | Models | Best For |
|--------|--------|----------|
| Boosting | XGBoost, LightGBM, CatBoost | Fast, accurate, tabular data |
| Classical | Random Forest, Logistic, SVM | Baselines, interpretability |
| Neural | LSTM, GRU, TCN | Sequential patterns, temporal dependencies |
| Ensemble | Voting, Stacking, Blending | Combined predictions, robustness |