# ML Model Factory - Complete Pipeline & Training

This notebook runs the **complete ML pipeline** from raw data to trained models.

## What This Notebook Does

1. **Setup** - Mount Drive (for results), clone GitHub repo (for code & data)
2. **Phase 1** - Data pipeline (clean -> features -> labels -> splits)
3. **Phase 2** - Model training (single or multiple models)
4. **Phase 3** - Cross-validation (optional)
5. **Phase 4** - Ensemble training (optional)

## Data Flow

- **Data Source:** `/content/research/` (cloned from GitHub)
- **Results Saved:** `/content/drive/MyDrive/research/` (Google Drive for persistence)

## Memory Monitoring

This notebook includes memory monitoring utilities to help diagnose kernel crashes:

- **Cell 1.4**: Memory monitoring functions (`print_memory_status`, `clear_gpu_memory`)
- **Cell 1.5**: Clear memory utility (run when RAM/GPU usage is high)
- **Data/Training cells**: Include before/after memory checks

---

## 1. Environment Setup

In [None]:
#@title 1.1 Mount Google Drive & Clone Repository { display-mode: "form" }
#@markdown Run this cell to mount your Google Drive and set up the project.

import os
import sys
from pathlib import Path

# Mount Google Drive (for saving results only)
from google.colab import drive
drive.mount('/content/drive')

# Clone or pull repository
if not Path('/content/research').exists():
    print("Cloning repository...")
    !git clone https://github.com/Snehpatel101/research.git /content/research
else:
    print("Pulling latest changes...")
    !cd /content/research && git pull

# Change to project directory
os.chdir('/content/research')

# Create Drive directories for saving results
for d in ["experiments/runs", "results"]:
    Path('/content/drive/MyDrive/research', d).mkdir(parents=True, exist_ok=True)

print("\n" + "=" * 60)
print(" PATH CONFIGURATION")
print("=" * 60)
print(f"\nProject directory: {os.getcwd()}")
print(f"Data source: /content/research (from GitHub)")
print(f"Results saved to: /content/drive/MyDrive/research (Google Drive)")
print("=" * 60)

In [None]:
#@title 1.2.1 Install Dependencies { display-mode: "form" }
#@markdown Installs all required packages for the ML pipeline.

import sys

# Add project to Python path
sys.path.insert(0, '/content/research')

# Install required packages
!pip install xgboost lightgbm catboost optuna ta pywavelets scikit-learn pandas numpy -q

# Verify PyTorch with CUDA
import torch
if torch.cuda.is_available():
    print(f"PyTorch: {torch.__version__} with CUDA {torch.version.cuda}")
else:
    print(f"PyTorch: {torch.__version__} (CPU only)")

print(f"\nProject path added: /content/research")
print("Dependencies installed!")

In [None]:
#@title 1.2.2 Crash Recovery Helper { display-mode: "form" }
#@markdown **Run this cell after kernel restart to recover session state.**
#@markdown 
#@markdown This cell:
#@markdown - Detects if this is a fresh start or recovery
#@markdown - Loads cached results from disk (survives kernel crashes)
#@markdown - Restores essential variables (HORIZON, MODELS_TO_TRAIN, etc.)
#@markdown - Prints recovery status

import sys
import os
from pathlib import Path
import json

# Ensure project path is set
if '/content/research' not in sys.path:
    sys.path.insert(0, '/content/research')

print("=" * 60)
print(" CRASH RECOVERY HELPER")
print("=" * 60)

# Define cache paths
CACHE_DIR = Path('/content/drive/MyDrive/research/experiments')
TRAINING_CACHE = CACHE_DIR / '.training_results_cache.json'
SESSION_CACHE = CACHE_DIR / '.session_state_cache.json'
CHECKPOINT_CACHE = CACHE_DIR / '.training_checkpoint.json'

# Detection: Is this a fresh start or recovery?
is_recovery = False
recovery_sources = []

# Check for training results cache
if TRAINING_CACHE.exists():
    is_recovery = True
    recovery_sources.append("training_results")
    
# Check for session state cache
if SESSION_CACHE.exists():
    is_recovery = True
    recovery_sources.append("session_state")

# Check for training checkpoint
if CHECKPOINT_CACHE.exists():
    is_recovery = True
    recovery_sources.append("training_checkpoint")

if is_recovery:
    print(f"\n[RECOVERY MODE] Found cached data from previous session")
    print(f"  Sources: {', '.join(recovery_sources)}")
else:
    print("\n[FRESH START] No previous session data found")

# Initialize core variables with defaults
HORIZON = 20
MODELS_TO_TRAIN = ['xgboost']
SEQ_LEN = 60
TRAINING_RESULTS = {}

# Load session state if available
if SESSION_CACHE.exists():
    try:
        with open(SESSION_CACHE) as f:
            session_state = json.load(f)
        
        HORIZON = session_state.get('horizon', 20)
        MODELS_TO_TRAIN = session_state.get('models_to_train', ['xgboost'])
        SEQ_LEN = session_state.get('seq_len', 60)
        
        print(f"\n[RESTORED] Session variables:")
        print(f"  HORIZON = {HORIZON}")
        print(f"  MODELS_TO_TRAIN = {MODELS_TO_TRAIN}")
        print(f"  SEQ_LEN = {SEQ_LEN}")
    except Exception as e:
        print(f"\n[WARNING] Could not load session state: {e}")

# Load training results if available
if TRAINING_CACHE.exists():
    try:
        with open(TRAINING_CACHE) as f:
            TRAINING_RESULTS = json.load(f)
        
        print(f"\n[RESTORED] Training results ({len(TRAINING_RESULTS)} models):")
        for model, data in TRAINING_RESULTS.items():
            metrics = data.get('metrics', {})
            acc = metrics.get('accuracy', 0)
            f1 = metrics.get('macro_f1', 0)
            print(f"  - {model}: Acc={acc:.2%}, F1={f1:.4f}")
    except Exception as e:
        print(f"\n[WARNING] Could not load training results: {e}")

# Check for incomplete training (checkpoint)
if CHECKPOINT_CACHE.exists():
    try:
        with open(CHECKPOINT_CACHE) as f:
            checkpoint = json.load(f)
        
        completed = checkpoint.get('completed_models', [])
        pending = checkpoint.get('pending_models', [])
        
        if pending:
            print(f"\n[CHECKPOINT] Incomplete training detected:")
            print(f"  Completed: {completed}")
            print(f"  Pending: {pending}")
            print(f"\n  Resume training in Section 4.2 - it will skip completed models.")
    except Exception as e:
        print(f"\n[WARNING] Could not load checkpoint: {e}")

# Detect GPU
try:
    import torch
    GPU_AVAILABLE = torch.cuda.is_available()
    if GPU_AVAILABLE:
        props = torch.cuda.get_device_properties(0)
        GPU_NAME = props.name
        GPU_MEMORY = props.total_memory / (1024**3)
        
        if GPU_MEMORY >= 40:
            RECOMMENDED_BATCH_SIZE = 1024
            MIXED_PRECISION = True
        elif GPU_MEMORY >= 15:
            RECOMMENDED_BATCH_SIZE = 512
            MIXED_PRECISION = True
        else:
            RECOMMENDED_BATCH_SIZE = 256
            MIXED_PRECISION = props.major >= 7
        
        print(f"\n[GPU] {GPU_NAME} ({GPU_MEMORY:.1f} GB)")
    else:
        RECOMMENDED_BATCH_SIZE = 256
        MIXED_PRECISION = False
        print(f"\n[GPU] Not available - using CPU")
except Exception:
    GPU_AVAILABLE = False
    RECOMMENDED_BATCH_SIZE = 256
    MIXED_PRECISION = False
    print(f"\n[GPU] Detection failed - using CPU")

print("\n" + "=" * 60)
print(" RECOVERY STATUS")
print("=" * 60)
if TRAINING_RESULTS:
    print(f"\n  Ready to continue! You can:")
    print(f"    - Skip to Section 4.3 to compare existing results")
    print(f"    - Run Section 4.2 to train remaining models (completed will be skipped)")
    print(f"    - Run Section 4.4 to evaluate on test set")
else:
    print(f"\n  No previous results. Start fresh from Section 3 or 4.")
print("=" * 60)
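
The recovery cell above only reads the cache files; the code that writes them is not shown in this section. As a minimal sketch of the writing side, assuming the same JSON keys that the loader reads (`horizon`, `models_to_train`, `seq_len`), the state could be written to a temporary file and then renamed over the target so that a crash mid-write does not leave a truncated file. The function name `save_session_state` is hypothetical, and whether the rename is atomic on a mounted Drive folder is not something the notebook confirms.

```python
import json
import os
import tempfile
from pathlib import Path


def save_session_state(cache_path, horizon, models_to_train, seq_len):
    """Write session variables to a JSON cache (hypothetical helper).

    Uses the keys that the recovery cell reads back:
    'horizon', 'models_to_train', 'seq_len'.
    """
    cache_path = Path(cache_path)
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    state = {
        "horizon": horizon,
        "models_to_train": list(models_to_train),
        "seq_len": seq_len,
    }
    # Write to a temporary file in the same directory, then rename it over
    # the target so readers never see a partially written file.
    fd, tmp_name = tempfile.mkstemp(dir=cache_path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
        os.replace(tmp_name, cache_path)
    except BaseException:
        # Remove the temporary file if anything went wrong before the rename.
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise
    return state
```

In the notebook this might be called as `save_session_state(SESSION_CACHE, HORIZON, MODELS_TO_TRAIN, SEQ_LEN)` after the training configuration cell has run.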

In [None]:
#@title 1.3 Detect Hardware & Configure { display-mode: "form" }
#@markdown Detects GPU and configures optimal settings.

import sys
if '/content/research' not in sys.path:
    sys.path.insert(0, '/content/research')

import torch
import platform

print("=" * 60)
print(" HARDWARE DETECTION")
print("=" * 60)

# System info
print(f"\nSystem: {platform.system()} {platform.release()}")
print(f"Python: {sys.version.split()[0]}")

# GPU detection
GPU_AVAILABLE = torch.cuda.is_available()
GPU_NAME = None
GPU_MEMORY = 0
RECOMMENDED_BATCH_SIZE = 256
MIXED_PRECISION = False

if GPU_AVAILABLE:
    props = torch.cuda.get_device_properties(0)
    GPU_NAME = props.name
    GPU_MEMORY = props.total_memory / (1024**3)
    
    print(f"\nGPU: {GPU_NAME}")
    print(f"Memory: {GPU_MEMORY:.1f} GB")
    print(f"Compute Capability: {props.major}.{props.minor}")
    
    if GPU_MEMORY >= 40:  # A100
        RECOMMENDED_BATCH_SIZE = 1024
        MIXED_PRECISION = True
    elif GPU_MEMORY >= 15:  # T4/V100
        RECOMMENDED_BATCH_SIZE = 512
        MIXED_PRECISION = True
    else:
        RECOMMENDED_BATCH_SIZE = 256
        MIXED_PRECISION = props.major >= 7
    
    print(f"\nRecommended batch size: {RECOMMENDED_BATCH_SIZE}")
    print(f"Mixed precision: {'Enabled' if MIXED_PRECISION else 'Disabled'}")
else:
    print("\nNo GPU detected - will use CPU")
    print("Tip: Runtime -> Change runtime type -> GPU")

# Verify model registry
print("\n" + "=" * 60)
print(" AVAILABLE MODELS")
print("=" * 60)

try:
    from src.models import ModelRegistry
    models = ModelRegistry.list_models()
    for family, model_list in models.items():
        print(f"\n{family.upper()}:")
        for m in model_list:
            gpu_req = "GPU" if m in ['lstm', 'gru', 'tcn'] else "CPU"
            print(f"  - {m} ({gpu_req})")
except Exception as e:
    print(f"Error loading models: {e}")

print("\n" + "=" * 60)
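
The batch-size and mixed-precision tiers above appear in several cells of this notebook (1.2.2, 1.3, and 4.0.1). One way to keep them consistent would be a single helper; the sketch below reproduces the same thresholds as a pure function. The name `recommend_training_config` is an assumption for illustration and is not part of the project code.

```python
def recommend_training_config(gpu_memory_gb, compute_major):
    """Return (batch_size, mixed_precision) using the tiers from the cell above.

    gpu_memory_gb: total GPU memory in GiB (total_memory / 1024**3)
    compute_major: major CUDA compute capability (props.major)
    """
    if gpu_memory_gb >= 40:
        # Largest tier (labelled A100 in the cell above)
        return 1024, True
    if gpu_memory_gb >= 15:
        # Middle tier (labelled T4/V100 in the cell above)
        return 512, True
    # Smaller GPUs: enable mixed precision only for compute capability 7+
    return 256, compute_major >= 7


print(recommend_training_config(40.0, 8))   # (1024, True)
print(recommend_training_config(11.0, 6))   # (256, False)
```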

In [None]:
#@title 1.4 Memory Monitoring Utilities { display-mode: "form" }
#@markdown Defines functions for monitoring RAM and GPU memory usage.
#@markdown Run this cell to enable memory diagnostics throughout the notebook.

import sys
if '/content/research' not in sys.path:
    sys.path.insert(0, '/content/research')

try:
    import psutil
    import torch
    
    def print_memory_status(label: str = "Current"):
        """Print current memory usage for RAM and GPU.
        
        Args:
            label: A descriptive label for the memory snapshot (e.g., "Before loading")
        """
        print(f"\n--- Memory Status: {label} ---")
        
        # RAM usage
        try:
            ram = psutil.virtual_memory()
            ram_used_gb = ram.used / 1e9
            ram_total_gb = ram.total / 1e9
            ram_available_gb = ram.available / 1e9
            print(f"RAM: {ram_used_gb:.1f}GB used / {ram_total_gb:.1f}GB total ({ram.percent}%)")
            print(f"     {ram_available_gb:.1f}GB available")
            
            # Warning threshold
            if ram.percent > 85:
                print("     [WARNING] RAM usage above 85% - risk of OOM!")
        except Exception as e:
            print(f"RAM: Error reading - {e}")
        
        # GPU usage
        try:
            if torch.cuda.is_available():
                allocated = torch.cuda.memory_allocated() / 1e9
                reserved = torch.cuda.memory_reserved() / 1e9
                total = torch.cuda.get_device_properties(0).total_memory / 1e9
                free = total - reserved
                print(f"GPU: {allocated:.2f}GB allocated, {reserved:.2f}GB reserved / {total:.1f}GB total")
                print(f"     {free:.2f}GB free")
                
                # Warning threshold
                if reserved / total > 0.85:
                    print("     [WARNING] GPU memory above 85% - risk of OOM!")
            else:
                print("GPU: Not available")
        except Exception as e:
            print(f"GPU: Error reading - {e}")
        
        print("-" * 40)

    def clear_gpu_memory():
        """Clear GPU memory cache and run garbage collection."""
        import gc
        
        print("\nClearing memory...")
        
        # Python garbage collection
        gc_collected = gc.collect()
        print(f"  Python GC: collected {gc_collected} objects")
        
        # GPU cache
        if torch.cuda.is_available():
            before_reserved = torch.cuda.memory_reserved() / 1e9
            torch.cuda.empty_cache()
            after_reserved = torch.cuda.memory_reserved() / 1e9
            freed = before_reserved - after_reserved
            print(f"  GPU cache: cleared {freed:.2f}GB")
            torch.cuda.synchronize()
        else:
            print("  GPU: Not available")
        
        print("  Done!")

    def get_memory_dict():
        """Return memory stats as a dictionary for logging."""
        stats = {}
        
        try:
            ram = psutil.virtual_memory()
            stats['ram_used_gb'] = ram.used / 1e9
            stats['ram_total_gb'] = ram.total / 1e9
            stats['ram_percent'] = ram.percent
        except:
            pass
        
        try:
            if torch.cuda.is_available():
                stats['gpu_allocated_gb'] = torch.cuda.memory_allocated() / 1e9
                stats['gpu_reserved_gb'] = torch.cuda.memory_reserved() / 1e9
                stats['gpu_total_gb'] = torch.cuda.get_device_properties(0).total_memory / 1e9
        except:
            pass
        
        return stats

    # Test the functions
    print("Memory monitoring utilities loaded successfully!")
    print_memory_status("Initial State")
    
    MEMORY_UTILS_LOADED = True

except ImportError as e:
    print(f"Warning: Could not load memory utilities - {e}")
    print("Install with: !pip install psutil")
    MEMORY_UTILS_LOADED = False
except Exception as e:
    print(f"Error loading memory utilities: {e}")
    MEMORY_UTILS_LOADED = False
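
`get_memory_dict` returns a plain dictionary, so two snapshots can be compared to see how much a step consumed. The following is a minimal sketch of that comparison; `memory_delta` and `format_delta` are hypothetical helpers, and the snapshot values are made up for illustration rather than measured.

```python
def memory_delta(before, after):
    """Return after - before for every key present in both snapshots."""
    return {key: after[key] - before[key] for key in before if key in after}


def format_delta(label, delta):
    """Format a delta dictionary as a single log line with signed values."""
    parts = [f"{key}: {value:+.2f}" for key, value in sorted(delta.items())]
    return f"[{label}] " + ", ".join(parts)


# Illustrative snapshots shaped like the output of get_memory_dict()
before = {"ram_used_gb": 4.0, "gpu_allocated_gb": 0.5}
after = {"ram_used_gb": 6.5, "gpu_allocated_gb": 0.5, "gpu_reserved_gb": 1.0}

delta = memory_delta(before, after)
print(format_delta("load", delta))
# prints: [load] gpu_allocated_gb: +0.00, ram_used_gb: +2.50
```

Keys that exist in only one snapshot (for example GPU stats that appear after a GPU is first used) are skipped rather than reported as a change.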

In [None]:
#@title 1.5 Clear Memory (Run When Needed) { display-mode: "form" }
#@markdown **Run this cell to free up RAM and GPU memory.**
#@markdown 
#@markdown Use this cell when:
#@markdown - Memory usage is high (above 80%)
#@markdown - Before training a large model
#@markdown - After encountering OOM errors (then restart kernel)
#@markdown - Between training different model types

import sys
if '/content/research' not in sys.path:
    sys.path.insert(0, '/content/research')

import gc

print("=" * 60)
print(" CLEARING MEMORY")
print("=" * 60)

# Show before state
try:
    if 'print_memory_status' in dir():
        print_memory_status("Before Cleanup")
except:
    pass

# 1. Delete large DataFrames if they exist
large_vars_deleted = []
for var_name in ['TRAIN_DF', 'VAL_DF', 'TEST_DF', 'train_df', 'val_df', 'test_df', 
                 'container', 'trainer', 'model', 'X_train', 'X_val', 'X_test',
                 'y_train', 'y_val', 'y_test', 'predictions']:
    if var_name in dir():
        try:
            exec(f"del {var_name}")
            large_vars_deleted.append(var_name)
        except:
            pass

if large_vars_deleted:
    print(f"\nDeleted variables: {', '.join(large_vars_deleted)}")
else:
    print("\nNo large variables to delete")

# 2. Run Python garbage collection
gc_collected = gc.collect()
print(f"Python GC: collected {gc_collected} objects")

# 3. Clear GPU memory
try:
    import torch
    if torch.cuda.is_available():
        before_reserved = torch.cuda.memory_reserved() / 1e9
        torch.cuda.empty_cache()
        torch.cuda.synchronize()
        after_reserved = torch.cuda.memory_reserved() / 1e9
        freed = before_reserved - after_reserved
        print(f"GPU cache: cleared {freed:.2f}GB")
    else:
        print("GPU: Not available")
except ImportError:
    print("GPU: PyTorch not installed")
except Exception as e:
    print(f"GPU: Error - {e}")

# 4. Second pass GC (catches circular references)
gc_collected_2 = gc.collect()
if gc_collected_2 > 0:
    print(f"Python GC (2nd pass): collected {gc_collected_2} more objects")

# Show after state
try:
    if 'print_memory_status' in dir():
        print_memory_status("After Cleanup")
except:
    pass

print("\n" + "=" * 60)
print(" CLEANUP COMPLETE")
print("=" * 60)
print("\nNote: If memory is still high, consider:")
print("  1. Runtime -> Restart runtime (preserves Drive data)")
print("  2. After restart, run cells 1.1-1.4 to restore environment")
print("  3. Run cell 4.0.1 to recover training results from cache")
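
The cleanup cell removes variables with `exec(f"del {var_name}")`. An alternative that avoids `exec` is to delete the names directly from the namespace dictionary. This is a sketch of that approach, not the notebook's own code; `drop_variables` is a hypothetical name.

```python
import gc


def drop_variables(namespace, names):
    """Remove the given names from a namespace dict and return those removed."""
    removed = [name for name in names if name in namespace]
    for name in removed:
        del namespace[name]
    # Collect objects that are now unreachable, including reference cycles.
    gc.collect()
    return removed


# Demonstration on an ordinary dict standing in for globals()
namespace = {"TRAIN_DF": [1, 2, 3], "keep_me": 42}
print(drop_variables(namespace, ["TRAIN_DF", "model"]))  # ['TRAIN_DF']
print(namespace)  # {'keep_me': 42}
```

In a notebook cell the call would be `drop_variables(globals(), [...])`. Deleting a name only frees the underlying object when no other reference to it remains, for example inside a trainer or container object.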

---

## 3. Phase 1: Data Pipeline

In [None]:
#@title 3.2 Run Data Pipeline OR Use Existing Data { display-mode: "form" }
#@markdown Choose whether to run the full pipeline or use existing processed data.

import sys
if '/content/research' not in sys.path:
    sys.path.insert(0, '/content/research')

data_source = "Use existing processed data"  #@param ["Run full pipeline (requires raw data)", "Use existing processed data"]

from pathlib import Path
import time
import gc

# Memory check before loading (if utilities available)
try:
    if 'print_memory_status' in dir():
        print_memory_status("Before Data Loading")
except:
    pass

# CORRECT: Data from GitHub clone (not Google Drive)
splits_dir = Path('/content/research/data/splits/scaled')
train_file = splits_dir / "train_scaled.parquet"

if data_source == "Use existing processed data":
    if train_file.exists():
        import pandas as pd
        
        # Load one at a time and print info to save memory
        print("Found existing processed data!")
        print(f"  Location: {splits_dir}")
        
        # Read only the Parquet footer metadata to get row counts,
        # so no column data is loaded into memory
        import pyarrow.parquet as pq
        for split_name in ["train", "val", "test"]:
            split_file = splits_dir / f"{split_name}_scaled.parquet"
            if split_file.exists():
                n_rows = pq.read_metadata(str(split_file)).num_rows
                print(f"  {split_name.capitalize()}: {n_rows:,} samples")
        
        # Force garbage collection
        gc.collect()
        
        print("\nData verified - proceeding to model training!")
        print("(DataFrames not held in memory - will be loaded by TimeSeriesDataContainer)")
    else:
        print("ERROR: Processed data not found!")
        print(f"  Expected: {splits_dir}/")
        print("\nMake sure the GitHub repo contains processed data files:")
        print("  - train_scaled.parquet")
        print("  - val_scaled.parquet")
        print("  - test_scaled.parquet")
else:
    raw_dir = Path('/content/research/data/raw')
    raw_files = list(raw_dir.glob("*.parquet")) + list(raw_dir.glob("*.csv")) if raw_dir.exists() else []
    
    if not raw_files:
        print("ERROR: No raw data files found!")
        print(f"  Expected: {raw_dir}/MES_1m.parquet or .csv")
    else:
        print("Running Phase 1 Data Pipeline...")
        print("=" * 60)
        start_time = time.time()
        
        try:
            from src.phase1.pipeline_config import PipelineConfig
            from src.pipeline.runner import PipelineRunner
            
            config = PipelineConfig(
                symbols=SYMBOLS,
                project_root=Path('/content/research'),
                label_horizons=HORIZONS,
                train_ratio=TRAIN_RATIO,
                val_ratio=VAL_RATIO,
                test_ratio=TEST_RATIO,
            )
            
            runner = PipelineRunner(config)
            success = runner.run()
            
            # Clean up pipeline artifacts
            del runner, config
            gc.collect()
            
            elapsed = time.time() - start_time
            print("\n" + "=" * 60)
            if success:
                print(f"Pipeline completed in {elapsed/60:.1f} minutes!")
            else:
                print("Pipeline failed. Check errors above.")
        except Exception as e:
            print(f"\nError: {e}")
            import traceback
            traceback.print_exc()

# Memory check after loading
try:
    if 'print_memory_status' in dir():
        print_memory_status("After Data Loading")
except:
    pass
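
The pipeline configuration above takes `train_ratio`, `val_ratio`, and `test_ratio`, but how `PipelineRunner` turns them into splits is not shown here. As a rough sketch only, assuming a plain chronological split into three contiguous blocks with no purging or embargo between them, the ratios could map to row boundaries like this. The function name and the example ratios are assumptions.

```python
def chronological_split_bounds(n_rows, train_ratio, val_ratio, test_ratio):
    """Return (start, end) row ranges for train, val, and test, in time order.

    Assumes rows are already sorted by time and that the three ratios
    sum to 1. The test block takes whatever rows remain after train and val.
    """
    if abs(train_ratio + val_ratio + test_ratio - 1.0) > 1e-9:
        raise ValueError("train, val, and test ratios must sum to 1.0")
    train_end = round(n_rows * train_ratio)
    val_end = train_end + round(n_rows * val_ratio)
    return (0, train_end), (train_end, val_end), (val_end, n_rows)


train, val, test = chronological_split_bounds(1000, 0.70, 0.15, 0.15)
print(train, val, test)  # (0, 700) (700, 850) (850, 1000)
```

Keeping the blocks contiguous and ordered means validation and test rows always come later in time than training rows, which matters for time-series labels that look ahead by a fixed horizon.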

In [None]:
#@title 3.3 Verify Processed Data { display-mode: "form" }
#@markdown Loads and displays dataset info WITHOUT keeping DataFrames in memory.

import sys
if '/content/research' not in sys.path:
    sys.path.insert(0, '/content/research')

import pandas as pd
from pathlib import Path
import gc

# Memory check before loading
try:
    if 'print_memory_status' in dir():
        print_memory_status("Before Loading DataFrames")
except:
    pass

# CORRECT: Data from GitHub clone (not Google Drive)
splits_dir = Path('/content/research/data/splits/scaled')

print("Verifying processed datasets...")
print("=" * 60)

try:
    # Load train to get column info (then delete)
    train_df = pd.read_parquet(splits_dir / "train_scaled.parquet")
    
    # Extract column info before deleting
    feature_cols = [c for c in train_df.columns if not c.startswith(('label_', 'sample_weight', 'quality_score', 'datetime', 'symbol'))]
    label_cols = [c for c in train_df.columns if c.startswith('label_')]
    train_len = len(train_df)
    
    # Get label distribution before deleting
    label_distributions = {}
    for col in label_cols:
        label_distributions[col] = train_df[col].value_counts().sort_index().to_dict()
    
    # Delete train_df to free memory
    del train_df
    gc.collect()
    
    # Get row counts from val/test without holding in memory
    val_df = pd.read_parquet(splits_dir / "val_scaled.parquet")
    val_len = len(val_df)
    del val_df
    gc.collect()
    
    test_df = pd.read_parquet(splits_dir / "test_scaled.parquet")
    test_len = len(test_df)
    del test_df
    gc.collect()
    
    # Print summary
    print(f"\nDataset sizes:")
    print(f"  Train: {train_len:,} samples")
    print(f"  Val:   {val_len:,} samples")
    print(f"  Test:  {test_len:,} samples")
    print(f"  Total: {train_len + val_len + test_len:,} samples")
    
    print(f"\nFeatures: {len(feature_cols)}")
    print(f"Labels: {label_cols}")
    
    print(f"\nLabel distribution (train):")
    for col, dist in label_distributions.items():
        print(f"  {col}: Long={dist.get(1, 0):,} | Neutral={dist.get(0, 0):,} | Short={dist.get(-1, 0):,}")
    
    # Store metadata for downstream cells (NOT the DataFrames)
    FEATURE_COLS = feature_cols
    TRAIN_LEN = train_len
    VAL_LEN = val_len
    TEST_LEN = test_len
    
    print("\nData verified! (DataFrames freed from memory)")
    print("TimeSeriesDataContainer will load data on-demand during training.")
    
    # Memory check after cleanup
    try:
        if 'print_memory_status' in dir():
            print_memory_status("After Cleanup")
    except:
        pass
    
except FileNotFoundError:
    print("Processed data not found. Run Section 3.2 first.")
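
The feature list in the cell above is built by excluding columns whose names start with certain prefixes. The sketch below isolates that rule so its behaviour can be checked on a small list. The column names are invented for illustration; only the prefixes come from the cell.

```python
# Prefixes treated as non-feature columns in the verification cell
NON_FEATURE_PREFIXES = ('label_', 'sample_weight', 'quality_score', 'datetime', 'symbol')


def split_columns(columns):
    """Split column names into (feature_cols, label_cols) by prefix."""
    feature_cols = [c for c in columns if not c.startswith(NON_FEATURE_PREFIXES)]
    label_cols = [c for c in columns if c.startswith('label_')]
    return feature_cols, label_cols


# Hypothetical column names
columns = ['datetime', 'symbol', 'rsi_14', 'label_h20',
           'sample_weight_h20', 'symbol_volatility']
features, labels = split_columns(columns)
print(features)  # ['rsi_14']
print(labels)    # ['label_h20']
```

Because the match is on prefixes, a genuine feature whose name happens to begin with one of them (such as the invented `symbol_volatility`) would also be excluded, so it may be worth checking the feature count against expectations.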

---

## 4. Phase 2: Model Training

In [None]:
#@title 4.1 Training Mode Selection { display-mode: "form" }
#@markdown Choose your training mode and models.

# FIRST: Set defaults for variables that may not exist (handles kernel restart)
import sys
if '/content/research' not in sys.path:
    sys.path.insert(0, '/content/research')

import gc

# Check GPU availability (in case cell 1.3 wasn't run)
if 'GPU_AVAILABLE' not in dir():
    try:
        import torch
        GPU_AVAILABLE = torch.cuda.is_available()
        if GPU_AVAILABLE:
            props = torch.cuda.get_device_properties(0)
            GPU_NAME = props.name
            GPU_MEMORY = props.total_memory / (1024**3)
            print(f"[Auto-detected] GPU: {GPU_NAME} ({GPU_MEMORY:.1f} GB)")
        else:
            GPU_MEMORY = 0
            print("[Auto-detected] No GPU - will use CPU")
    except Exception:
        GPU_AVAILABLE = False
        GPU_MEMORY = 0
        print("[Auto-detected] PyTorch not available - will use CPU")

# GPU Memory estimation utilities
def estimate_gpu_memory_needed(model_name, n_samples, n_features, seq_len, batch_size):
    """Estimate GPU memory needed for training in GB.
    
    Args:
        model_name: Name of the model (lstm, gru, tcn)
        n_samples: Number of training samples
        n_features: Number of input features
        seq_len: Sequence length for temporal models
        batch_size: Training batch size
        
    Returns:
        Estimated GPU memory needed in GB
    """
    if model_name in ['lstm', 'gru']:
        # LSTM/GRU: hidden_size=128, 2 layers, bidirectional possible
        hidden_size = 128
        num_layers = 2
        # Parameters: 4 * hidden_size * (input_size + hidden_size + 1) * num_layers
        params_mb = (4 * hidden_size * (n_features + hidden_size + 1) * num_layers * 4) / 1e6
        # Activations per batch: batch * seq_len * hidden_size * 4 (float32)
        activations_mb = batch_size * seq_len * hidden_size * 4 / 1e6
        # Total: params + activations * 3 (forward + backward + optimizer states)
        return (params_mb + activations_mb * 3) / 1024  # Convert to GB
    elif model_name == 'tcn':
        # TCN: more parameters due to dilated convolutions
        hidden_size = 256  # widest channel count, e.g. channels [64, 128, 256]
        # Rough estimate: ~5M float32 parameters at 4 bytes each (~20 MB)
        params_mb = 5e6 * 4 / 1e6
        activations_mb = batch_size * seq_len * hidden_size * 4 / 1e6
        return (params_mb + activations_mb * 3) / 1024  # Convert to GB
    return 0

def get_safe_batch_size(model_name, n_samples, n_features, seq_len, available_gb, base_batch_size=256):
    """Calculate safe batch size to avoid OOM.
    
    Args:
        model_name: Name of the model
        n_samples: Number of training samples
        n_features: Number of input features
        seq_len: Sequence length
        available_gb: Available GPU memory in GB
        base_batch_size: Starting batch size to try
        
    Returns:
        Safe batch size that should fit in GPU memory
    """
    if model_name not in ['lstm', 'gru', 'tcn']:
        return base_batch_size
    
    # Reserve 2GB for PyTorch overhead and other allocations
    usable_gb = max(1, available_gb - 2)
    
    # Start with base batch size and reduce if needed
    batch_size = base_batch_size
    while batch_size >= 16:
        estimated = estimate_gpu_memory_needed(model_name, n_samples, n_features, seq_len, batch_size)
        if estimated < usable_gb:
            return batch_size
        batch_size = batch_size // 2
    
    return 16  # Minimum batch size

# Store utilities for other cells
GPU_MEMORY_UTILS = {
    'estimate_gpu_memory_needed': estimate_gpu_memory_needed,
    'get_safe_batch_size': get_safe_batch_size,
}

training_mode = "Single Model"  #@param ["Single Model", "Multi-Model (Sequential)"]

#@markdown ---
#@markdown ### Single Model Options
single_model = "xgboost"  #@param ["xgboost", "lightgbm", "catboost", "random_forest", "logistic", "svm", "lstm", "gru", "tcn"]

#@markdown ---
#@markdown ### Multi-Model Options
train_boosting = True  #@param {type: "boolean"}
#@markdown XGBoost, LightGBM, CatBoost
train_classical = False  #@param {type: "boolean"}
#@markdown Random Forest, Logistic, SVM
train_neural = False  #@param {type: "boolean"}
#@markdown LSTM, GRU, TCN (requires GPU)

#@markdown ---
#@markdown ### Training Parameters
horizon = 20  #@param [5, 10, 15, 20]
sequence_length = 60  #@param {type: "slider", min: 30, max: 120, step: 10}

# Build model list
if training_mode == "Single Model":
    MODELS_TO_TRAIN = [single_model]
else:
    MODELS_TO_TRAIN = []
    if train_boosting:
        MODELS_TO_TRAIN.extend(['xgboost', 'lightgbm', 'catboost'])
    if train_classical:
        MODELS_TO_TRAIN.extend(['random_forest', 'logistic', 'svm'])
    if train_neural and GPU_AVAILABLE:
        MODELS_TO_TRAIN.extend(['lstm', 'gru', 'tcn'])
    elif train_neural and not GPU_AVAILABLE:
        print("WARNING: Neural models skipped (no GPU)")

HORIZON = horizon
SEQ_LEN = sequence_length

print(f"Training Mode: {training_mode}")
print(f"Models to train: {MODELS_TO_TRAIN}")
print(f"Horizon: H{HORIZON}")
if any(m in ['lstm', 'gru', 'tcn'] for m in MODELS_TO_TRAIN):
    print(f"Sequence length: {SEQ_LEN}")
    if GPU_AVAILABLE:
        print(f"GPU Memory: {GPU_MEMORY:.1f} GB")
        if GPU_MEMORY < 10:
            print("  [NOTE] Low GPU memory - batch size will be auto-reduced for neural models")
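
To get a feel for the numbers `estimate_gpu_memory_needed` produces, here is a worked example of its LSTM/GRU branch. The formula is copied from the cell above into a standalone function so the snippet runs on its own; the inputs (100 features, sequence length 60, batch size 256) are assumed values, not the project's actual dimensions.

```python
def lstm_memory_estimate_gb(n_features, seq_len, batch_size,
                            hidden_size=128, num_layers=2):
    """Standalone copy of the LSTM/GRU branch of estimate_gpu_memory_needed."""
    # Parameters: 4 gates * hidden * (input + hidden + bias) * layers * 4 bytes
    params_mb = 4 * hidden_size * (n_features + hidden_size + 1) * num_layers * 4 / 1e6
    # Activations for one batch in float32
    activations_mb = batch_size * seq_len * hidden_size * 4 / 1e6
    # Activations counted three times (forward, backward, optimizer states)
    return (params_mb + activations_mb * 3) / 1024


estimate = lstm_memory_estimate_gb(n_features=100, seq_len=60, batch_size=256)
print(f"{estimate:.4f} GB")  # prints 0.0240 GB
```

With these inputs the parameters account for about 0.94 MB and the activations for about 7.86 MB per batch, so the total is roughly 0.024 GB. Since `get_safe_batch_size` always treats at least 1 GB as usable, an estimate of this size would leave the base batch size unchanged.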

In [None]:
#@title 4.0.1 Quick Recover (After Kernel Crash) { display-mode: "form" }
#@markdown **One-click recovery after kernel crash or restart.**
#@markdown 
#@markdown This cell restores:
#@markdown - All training results from previous session
#@markdown - Session variables (HORIZON, MODELS_TO_TRAIN, etc.)
#@markdown - GPU configuration
#@markdown - Checkpoint state for incomplete training

import sys
import os
from pathlib import Path
import json

# Ensure project path
if '/content/research' not in sys.path:
    sys.path.insert(0, '/content/research')
os.chdir('/content/research')

print("=" * 60)
print(" QUICK RECOVERY")
print("=" * 60)

# Define cache paths
CACHE_DIR = Path('/content/drive/MyDrive/research/experiments')
TRAINING_CACHE = CACHE_DIR / '.training_results_cache.json'
SESSION_CACHE = CACHE_DIR / '.session_state_cache.json'
CHECKPOINT_CACHE = CACHE_DIR / '.training_checkpoint.json'

recovery_success = True
recovered_items = []

# 1. Recover TRAINING_RESULTS
TRAINING_RESULTS = {}
if TRAINING_CACHE.exists():
    try:
        with open(TRAINING_CACHE) as f:
            TRAINING_RESULTS = json.load(f)
        recovered_items.append(f"TRAINING_RESULTS ({len(TRAINING_RESULTS)} models)")
    except Exception as e:
        print(f"[WARNING] Could not load training results: {e}")
        recovery_success = False
else:
    print("[INFO] No training results cache found")

# 2. Recover session state
HORIZON = 20
MODELS_TO_TRAIN = ['xgboost']
SEQ_LEN = 60

if SESSION_CACHE.exists():
    try:
        with open(SESSION_CACHE) as f:
            session = json.load(f)
        HORIZON = session.get('horizon', 20)
        MODELS_TO_TRAIN = session.get('models_to_train', ['xgboost'])
        SEQ_LEN = session.get('seq_len', 60)
        recovered_items.append(f"Session (H{HORIZON}, {len(MODELS_TO_TRAIN)} models)")
    except Exception as e:
        print(f"[WARNING] Could not load session state: {e}")
        recovery_success = False
else:
    print("[INFO] No session cache found - using defaults")

# 3. Detect GPU
try:
    import torch
    GPU_AVAILABLE = torch.cuda.is_available()
    if GPU_AVAILABLE:
        props = torch.cuda.get_device_properties(0)
        GPU_NAME = props.name
        GPU_MEMORY = props.total_memory / (1024**3)
        
        if GPU_MEMORY >= 40:
            RECOMMENDED_BATCH_SIZE = 1024
            MIXED_PRECISION = True
        elif GPU_MEMORY >= 15:
            RECOMMENDED_BATCH_SIZE = 512
            MIXED_PRECISION = True
        else:
            RECOMMENDED_BATCH_SIZE = 256
            MIXED_PRECISION = props.major >= 7
        
        recovered_items.append(f"GPU ({GPU_NAME})")
    else:
        RECOMMENDED_BATCH_SIZE = 256
        MIXED_PRECISION = False
        recovered_items.append("CPU mode")
except Exception as e:
    GPU_AVAILABLE = False
    RECOMMENDED_BATCH_SIZE = 256
    MIXED_PRECISION = False
    print(f"[WARNING] GPU detection failed: {e}")

# 4. Check for incomplete training
incomplete_training = None
if CHECKPOINT_CACHE.exists():
    try:
        with open(CHECKPOINT_CACHE) as f:
            checkpoint = json.load(f)
        pending = checkpoint.get('pending_models', [])
        if pending:
            incomplete_training = checkpoint
            recovered_items.append(f"Checkpoint ({len(pending)} pending)")
    except Exception as e:
        print(f"[WARNING] Could not load checkpoint: {e}")

# Summary
print(f"\n[RECOVERED] {len(recovered_items)} items:")
for item in recovered_items:
    print(f"  - {item}")

print(f"\n" + "-" * 60)
print("RESTORED VARIABLES:")
print("-" * 60)
print(f"  HORIZON = {HORIZON}")
print(f"  MODELS_TO_TRAIN = {MODELS_TO_TRAIN}")
print(f"  SEQ_LEN = {SEQ_LEN}")
print(f"  GPU_AVAILABLE = {GPU_AVAILABLE}")
print(f"  RECOMMENDED_BATCH_SIZE = {RECOMMENDED_BATCH_SIZE}")
print(f"  MIXED_PRECISION = {MIXED_PRECISION}")

if TRAINING_RESULTS:
    print(f"\n" + "-" * 60)
    print("TRAINING RESULTS:")
    print("-" * 60)
    for model, data in TRAINING_RESULTS.items():
        if not model.endswith('_FAILED'):
            metrics = data.get('metrics', {})
            acc = metrics.get('accuracy', 0)
            f1 = metrics.get('macro_f1', 0)
            print(f"  {model}: Accuracy={acc:.2%}, Macro F1={f1:.4f}")

if incomplete_training:
    print(f"\n" + "-" * 60)
    print("INCOMPLETE TRAINING DETECTED:")
    print("-" * 60)
    print(f"  Completed: {incomplete_training.get('completed_models', [])}")
    print(f"  Pending: {incomplete_training.get('pending_models', [])}")
    print(f"\n  Run Section 4.2 to resume training!")

print(f"\n" + "=" * 60)
print(" NEXT STEPS")
print("=" * 60)
if TRAINING_RESULTS:
    print("  - Section 4.3: Compare results")
    print("  - Section 4.4: Evaluate on test set")
    print("  - Section 4.2: Train more models (completed will be skipped)")
else:
    print("  - Section 4.1: Configure training")
    print("  - Section 4.2: Start training")
print("=" * 60)
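
The recovery cell above reads several JSON cache files and has to tolerate missing or corrupt ones. A minimal sketch of that pattern as a reusable helper follows; the names `load_json_cache` and `pending_models` are hypothetical and not part of the repository.

```python
import json
from pathlib import Path


def load_json_cache(path, default=None):
    """Return the parsed JSON at `path`, or `default` if it is missing or unreadable."""
    path = Path(path)
    if not path.exists():
        return default
    try:
        with open(path) as f:
            return json.load(f)
    except (OSError, ValueError) as e:
        # json.JSONDecodeError is a subclass of ValueError
        print(f"[WARNING] Could not read {path.name}: {e}")
        return default


def pending_models(checkpoint):
    """Return the list of pending models from a checkpoint dict (empty if absent)."""
    if not isinstance(checkpoint, dict):
        return []
    return list(checkpoint.get('pending_models', []))
```

With helpers like these, the checkpoint check reduces to `pending_models(load_json_cache(CHECKPOINT_CACHE))`, and a corrupt file produces a warning instead of being silently ignored.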

In [None]:
#@title 4.0.2 Clear Cache (Force Retrain) { display-mode: "form" }
#@markdown **Clear cached results to retrain models from scratch.**
#@markdown 
#@markdown Use this when you want to:
#@markdown - Retrain models with different parameters
#@markdown - Start fresh after changing data
#@markdown - Clear failed training attempts

from pathlib import Path
import json

clear_training_results = False  #@param {type: "boolean"}
#@markdown Clear all trained model results

clear_session_state = False  #@param {type: "boolean"}
#@markdown Clear session variables (HORIZON, MODELS_TO_TRAIN, etc.)

clear_checkpoints = True  #@param {type: "boolean"}
#@markdown Clear incomplete training checkpoints

print("=" * 60)
print(" CACHE MANAGEMENT")
print("=" * 60)

CACHE_DIR = Path('/content/drive/MyDrive/research/experiments')
TRAINING_CACHE = CACHE_DIR / '.training_results_cache.json'
SESSION_CACHE = CACHE_DIR / '.session_state_cache.json'
CHECKPOINT_CACHE = CACHE_DIR / '.training_checkpoint.json'

cleared = []

if clear_training_results:
    if TRAINING_CACHE.exists():
        # Back up before clearing (replaces any earlier backup)
        backup_path = CACHE_DIR / '.training_results_cache.backup.json'
        TRAINING_CACHE.replace(backup_path)
        cleared.append(f"Training results (backed up to {backup_path.name})")
    else:
        print("  No training results cache to clear")

    # Also clear the in-memory copy, even when no cache file exists;
    # otherwise Section 4.2 would write the old results back to disk
    if 'TRAINING_RESULTS' in dir():
        TRAINING_RESULTS = {}

if clear_session_state:
    if SESSION_CACHE.exists():
        SESSION_CACHE.unlink()
        cleared.append("Session state")
    else:
        print("  No session state cache to clear")

if clear_checkpoints:
    if CHECKPOINT_CACHE.exists():
        CHECKPOINT_CACHE.unlink()
        cleared.append("Training checkpoint")
    else:
        print("  No checkpoint to clear")

if cleared:
    print(f"\n[CLEARED] {len(cleared)} cache(s):")
    for item in cleared:
        print(f"  - {item}")
    print("\nYou can now run Section 4.2 to retrain all models.")
else:
    print("\nNo caches were cleared.")
    print("Enable the checkboxes above to clear specific caches.")

# Show current cache status
print(f"\n" + "-" * 60)
print("CURRENT CACHE STATUS:")
print("-" * 60)

if TRAINING_CACHE.exists():
    try:
        with open(TRAINING_CACHE) as f:
            results = json.load(f)
        print(f"  Training results: {len(results)} model(s)")
    except Exception:
        print(f"  Training results: exists (unreadable)")
else:
    print(f"  Training results: empty")

if SESSION_CACHE.exists():
    print(f"  Session state: saved")
else:
    print(f"  Session state: empty")

if CHECKPOINT_CACHE.exists():
    try:
        with open(CHECKPOINT_CACHE) as f:
            cp = json.load(f)
        pending = len(cp.get('pending_models', []))
        print(f"  Checkpoint: {pending} model(s) pending")
    except Exception:
        print(f"  Checkpoint: exists (unreadable)")
else:
    print(f"  Checkpoint: none")

print("=" * 60)
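
The cell above keeps a single backup file, so clearing twice replaces the earlier backup. If older backups should survive, one option is a timestamped backup name. The following is a sketch under that assumption; `backup_and_clear` is a hypothetical helper, not part of the repository.

```python
import time
from pathlib import Path


def backup_and_clear(cache_path, keep_backup=True):
    """Move `cache_path` aside under a timestamped name, or delete it.

    Returns the backup path, or None if nothing was backed up.
    """
    cache_path = Path(cache_path)
    if not cache_path.exists():
        return None
    if not keep_backup:
        cache_path.unlink()
        return None

    stamp = time.strftime('%Y%m%d_%H%M%S')
    base = f"{cache_path.stem}.backup_{stamp}"
    backup_path = cache_path.with_name(base + cache_path.suffix)

    # Avoid clobbering a backup created within the same second
    counter = 1
    while backup_path.exists():
        backup_path = cache_path.with_name(f"{base}_{counter}{cache_path.suffix}")
        counter += 1

    cache_path.replace(backup_path)
    return backup_path
```

For example, `backup_and_clear(TRAINING_CACHE)` would move `.training_results_cache.json` to a name of the form `.training_results_cache.backup_<timestamp>.json` in the same directory.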

In [None]:
#@title 4.2 Train Models { display-mode: "form" }
#@markdown Execute model training based on your selections.

import sys
if '/content/research' not in sys.path:
    sys.path.insert(0, '/content/research')

import time
import gc
from pathlib import Path
import json

print("=" * 60)
print(" MODEL TRAINING")
print("=" * 60)

# Use defaults if not configured (handles kernel restart or skipped cells)
HORIZON = HORIZON if 'HORIZON' in dir() else 20
MODELS_TO_TRAIN = MODELS_TO_TRAIN if 'MODELS_TO_TRAIN' in dir() else ['xgboost']
SEQ_LEN = SEQ_LEN if 'SEQ_LEN' in dir() else 60
RECOMMENDED_BATCH_SIZE = RECOMMENDED_BATCH_SIZE if 'RECOMMENDED_BATCH_SIZE' in dir() else 256
GPU_AVAILABLE = GPU_AVAILABLE if 'GPU_AVAILABLE' in dir() else False
MIXED_PRECISION = MIXED_PRECISION if 'MIXED_PRECISION' in dir() else False
GPU_MEMORY = GPU_MEMORY if 'GPU_MEMORY' in dir() else 0

# Initialize TRAINING_RESULTS if not exists
if 'TRAINING_RESULTS' not in dir():
    TRAINING_RESULTS = {}

# GPU memory utilities (define locally if not available from 4.1)
if 'GPU_MEMORY_UTILS' not in dir():
    def estimate_gpu_memory_needed(model_name, n_samples, n_features, seq_len, batch_size):
        """Estimate GPU memory needed for training in GB."""
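        if model_name in ['lstm', 'gru']:
            # Rough heuristic: hidden_size and num_layers are fixed here, not read from the model config
            hidden_size = 128
            num_layers = 2
            # LSTM-style count: 4 gates x hidden x (input + hidden + bias) x layers x 4 bytes (float32)
            params_mb = (4 * hidden_size * (n_features + hidden_size + 1) * num_layers * 4) / 1e6
            # One float32 activation per (batch, timestep, hidden unit)
            activations_mb = batch_size * seq_len * hidden_size * 4 / 1e6
            # Activations are tripled as a safety margin before converting MB to GB
            return (params_mb + activations_mb * 3) / 1024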
        elif model_name == 'tcn':
            hidden_size = 256
            params_mb = 5
            activations_mb = batch_size * seq_len * hidden_size * 4 / 1e6
            return (params_mb + activations_mb * 3) / 1024
        return 0
    
    def get_safe_batch_size(model_name, n_samples, n_features, seq_len, available_gb, base_batch_size=256):
        """Calculate safe batch size to avoid OOM."""
        if model_name not in ['lstm', 'gru', 'tcn']:
            return base_batch_size
        usable_gb = max(1, available_gb - 2)
        batch_size = base_batch_size
        while batch_size >= 16:
            estimated = estimate_gpu_memory_needed(model_name, n_samples, n_features, seq_len, batch_size)
            if estimated < usable_gb:
                return batch_size
            batch_size = batch_size // 2
        return 16
else:
    estimate_gpu_memory_needed = GPU_MEMORY_UTILS['estimate_gpu_memory_needed']
    get_safe_batch_size = GPU_MEMORY_UTILS['get_safe_batch_size']

def clear_gpu_memory():
    """Clear GPU memory cache and run garbage collection."""
    gc.collect()
    try:
        import torch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
            torch.cuda.synchronize()
    except Exception:
        pass

print(f"\nConfiguration (using defaults if not set):")
print(f"  HORIZON: {HORIZON}")
print(f"  MODELS_TO_TRAIN: {MODELS_TO_TRAIN}")
print(f"  SEQ_LEN: {SEQ_LEN}")
print(f"  GPU_AVAILABLE: {GPU_AVAILABLE}")
if GPU_AVAILABLE:
    print(f"  GPU_MEMORY: {GPU_MEMORY:.1f} GB")

# Memory check before training
try:
    if 'print_memory_status' in dir():
        print_memory_status("Before Training")
except:
    pass

try:
    from src.models import ModelRegistry, Trainer, TrainerConfig
    from src.phase1.stages.datasets.container import TimeSeriesDataContainer
    
    # Load data from the GitHub clone (not Google Drive)
    print(f"\nLoading data for horizon H{HORIZON}...")
    container = TimeSeriesDataContainer.from_parquet_dir(
        path=Path('/content/research/data/splits/scaled'),
        horizon=HORIZON
    )
    n_samples = container.splits['train'].n_samples
    n_features = container.n_features
    print(f"  Train samples: {n_samples:,}")
    print(f"  Val samples: {container.splits['val'].n_samples:,}")
    print(f"  Features: {n_features}")
    
    # Memory check after loading container
    try:
        if 'print_memory_status' in dir():
            print_memory_status("After Loading Container")
    except:
        pass
    
    for i, model_name in enumerate(MODELS_TO_TRAIN, 1):
        print(f"\n{'='*60}")
        print(f" [{i}/{len(MODELS_TO_TRAIN)}] Training: {model_name.upper()}")
        print("=" * 60)
        
        # Memory check before each model
        try:
            if 'print_memory_status' in dir():
                print_memory_status(f"Before {model_name}")
        except:
            pass
        
        # Clear GPU memory before each model
        clear_gpu_memory()
        
        start_time = time.time()
        
        # Configure - save results to Google Drive
        if model_name in ['lstm', 'gru', 'tcn']:
            # Calculate safe batch size for neural models
            if GPU_AVAILABLE and GPU_MEMORY > 0:
                safe_batch_size = get_safe_batch_size(
                    model_name, n_samples, n_features, SEQ_LEN, GPU_MEMORY, RECOMMENDED_BATCH_SIZE
                )
                estimated_mem = estimate_gpu_memory_needed(
                    model_name, n_samples, n_features, SEQ_LEN, safe_batch_size
                )
                print(f"  GPU Memory Check:")
                print(f"    Available: {GPU_MEMORY:.1f} GB")
                print(f"    Estimated needed: {estimated_mem:.2f} GB")
                print(f"    Batch size: {safe_batch_size} (requested: {RECOMMENDED_BATCH_SIZE})")
                
                if safe_batch_size < RECOMMENDED_BATCH_SIZE:
                    print(f"    [WARNING] Reduced batch size due to limited GPU memory")
            else:
                safe_batch_size = min(128, RECOMMENDED_BATCH_SIZE)  # Conservative for CPU
            
            config = TrainerConfig(
                model_name=model_name,
                horizon=HORIZON,
                sequence_length=SEQ_LEN,
                batch_size=safe_batch_size,
                max_epochs=50,
                early_stopping_patience=10,
                output_dir=Path('/content/drive/MyDrive/research/experiments/runs'),
                device="cuda" if GPU_AVAILABLE else "cpu",
                mixed_precision=MIXED_PRECISION,
            )
            
            # Try training with OOM recovery
            trainer = Trainer(config)
            try:
                results = trainer.run(container)
            except RuntimeError as e:
                # Retry only on genuine out-of-memory errors; other CUDA errors are re-raised
                if "out of memory" in str(e).lower():
                    print(f"\n  [OOM] GPU out of memory! Retrying with smaller batch size...")
                    clear_gpu_memory()
                    
                    # Memory check after OOM
                    try:
                        if 'print_memory_status' in dir():
                            print_memory_status("After OOM Cleanup")
                    except:
                        pass
                    
                    # Retry with halved batch size
                    retry_batch_size = max(16, safe_batch_size // 2)
                    print(f"  [OOM] Reducing batch size: {safe_batch_size} -> {retry_batch_size}")
                    
                    config = TrainerConfig(
                        model_name=model_name,
                        horizon=HORIZON,
                        sequence_length=SEQ_LEN,
                        batch_size=retry_batch_size,
                        max_epochs=50,
                        early_stopping_patience=10,
                        output_dir=Path('/content/drive/MyDrive/research/experiments/runs'),
                        device="cuda" if GPU_AVAILABLE else "cpu",
                        mixed_precision=MIXED_PRECISION,
                    )
                    trainer = Trainer(config)
                    results = trainer.run(container)
                else:
                    raise  # Re-raise if not an OOM error
        else:
            config = TrainerConfig(
                model_name=model_name,
                horizon=HORIZON,
                output_dir=Path('/content/drive/MyDrive/research/experiments/runs'),
            )
            trainer = Trainer(config)
            results = trainer.run(container)
        
        elapsed = time.time() - start_time
        
        TRAINING_RESULTS[model_name] = {
            'metrics': results.get('evaluation_metrics', {}),
            'time': elapsed,
            'run_id': results.get('run_id', 'unknown'),
        }
        
        metrics = results.get('evaluation_metrics', {})
        print(f"\n  Results:")
        print(f"    Accuracy: {metrics.get('accuracy', 0):.2%}")
        print(f"    Macro F1: {metrics.get('macro_f1', 0):.4f}")
        print(f"    Time: {elapsed:.1f}s")
        
        # Clean up trainer to free memory
        del trainer, config
        
        # Save cache after each model (in case of kernel restart)
        results_cache = Path('/content/drive/MyDrive/research/experiments/.training_results_cache.json')
        results_cache.parent.mkdir(parents=True, exist_ok=True)
        with open(results_cache, 'w') as f:
            json.dump(TRAINING_RESULTS, f, indent=2)
        
        # Clear GPU memory after each model
        clear_gpu_memory()
        
        # Memory check after each model
        try:
            if 'print_memory_status' in dir():
                print_memory_status(f"After {model_name}")
        except:
            pass
        
except Exception as e:
    print(f"\nError during training: {e}")
    import traceback
    traceback.print_exc()
    
    # Memory check on error (helps diagnose OOM)
    try:
        if 'print_memory_status' in dir():
            print_memory_status("At Error")
    except:
        pass
    
    # Clear GPU memory on error
    clear_gpu_memory()

# Final save
if TRAINING_RESULTS:
    results_cache = Path('/content/drive/MyDrive/research/experiments/.training_results_cache.json')
    results_cache.parent.mkdir(parents=True, exist_ok=True)
    with open(results_cache, 'w') as f:
        json.dump(TRAINING_RESULTS, f, indent=2)
    print(f"\nResults cached to: {results_cache}")

# Final GPU cleanup
clear_gpu_memory()

# Memory check after training
try:
    if 'print_memory_status' in dir():
        print_memory_status("After All Training")
except:
    pass

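n_with_results = sum(1 for m in MODELS_TO_TRAIN if m in TRAINING_RESULTS)
print("\n" + "=" * 60)
print(f" TRAINING FINISHED ({n_with_results}/{len(MODELS_TO_TRAIN)} models have results)")
print("=" * 60)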
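
The training cell rewrites the results cache after every model so that a kernel restart loses at most one run. Two things can still go wrong: metrics may contain values that `json.dump` cannot serialize (NumPy scalars, for instance), and an interrupted write can leave a truncated file. A sketch of a more defensive save follows; `to_jsonable` and `save_results_atomic` are hypothetical names, and whether `os.replace` is truly atomic on a mounted Google Drive is an assumption that has not been verified here.

```python
import json
import os
import tempfile
from pathlib import Path


def to_jsonable(value):
    """Fallback for json.dump: unwrap NumPy-style scalars/arrays, else stringify."""
    if hasattr(value, 'tolist'):  # NumPy scalars and arrays expose tolist()
        return value.tolist()
    return str(value)


def save_results_atomic(results, cache_path):
    """Write `results` to a temp file in the target directory, then swap it into place."""
    cache_path = Path(cache_path)
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=cache_path.parent, suffix='.tmp')
    try:
        with os.fdopen(fd, 'w') as f:
            json.dump(results, f, indent=2, default=to_jsonable)
        # The old cache stays intact until the new file is complete
        os.replace(tmp_name, cache_path)
    except BaseException:
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise
```

In the training loop this would stand in for the `open(...)`/`json.dump(...)` pair, for example `save_results_atomic(TRAINING_RESULTS, results_cache)`.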

In [None]:
#@title 4.3 Compare Results { display-mode: "form" }
#@markdown Display comparison of all trained models.

import pandas as pd
import matplotlib.pyplot as plt
from pathlib import Path
import json

# Initialize TRAINING_RESULTS if not exists
if 'TRAINING_RESULTS' not in dir():
    TRAINING_RESULTS = {}

# Try to load from cache if empty
if not TRAINING_RESULTS:
    print("No results in memory. Checking disk cache...")
    results_cache = Path('/content/drive/MyDrive/research/experiments/.training_results_cache.json')
    if results_cache.exists():
        try:
            with open(results_cache) as f:
                TRAINING_RESULTS = json.load(f)
            print(f"Loaded {len(TRAINING_RESULTS)} model(s) from cache.\n")
        except Exception as e:
            print(f"Could not load cache: {e}")

if TRAINING_RESULTS:
    print("Model Comparison")
    print("=" * 60)
    
    rows = []
    for model, data in TRAINING_RESULTS.items():
        metrics = data.get('metrics', {})
        rows.append({
            'Model': model,
            'Accuracy': metrics.get('accuracy', 0),
            'Macro F1': metrics.get('macro_f1', 0),
            'Weighted F1': metrics.get('weighted_f1', 0),
            'Time (s)': data.get('time', 0),
        })
    
    comparison_df = pd.DataFrame(rows)
    comparison_df = comparison_df.sort_values('Macro F1', ascending=False)
    print(comparison_df.to_string(index=False))
    
    if len(TRAINING_RESULTS) > 1:
        fig, axes = plt.subplots(1, 2, figsize=(12, 4))
        
        comparison_df_sorted = comparison_df.sort_values('Accuracy', ascending=True)
        axes[0].barh(comparison_df_sorted['Model'], comparison_df_sorted['Accuracy'])
        axes[0].set_xlabel('Accuracy')
        axes[0].set_title('Model Accuracy Comparison')
        axes[0].set_xlim(0, 1)
        
        comparison_df_sorted = comparison_df.sort_values('Time (s)', ascending=True)
        axes[1].barh(comparison_df_sorted['Model'], comparison_df_sorted['Time (s)'])
        axes[1].set_xlabel('Training Time (seconds)')
        axes[1].set_title('Training Time Comparison')
        
        plt.tight_layout()
        plt.show()
    
    # comparison_df is sorted by Macro F1 (descending), so the first row is the best
    best_model = comparison_df.iloc[0]['Model']
    print(f"\nBest model (by Macro F1): {best_model}")
else:
    print("No training results available.")
    print("\nOptions:")
    print("  1. Run Section 4.2 to train models")
    print("  2. Run Section 4.0 to check for cached results")
    print("  3. Check Google Drive for previous runs: /content/drive/MyDrive/research/experiments/runs/")
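
The comparison above ranks models by Macro F1. The same ranking can be sketched without pandas, skipping entries that have no usable metrics. The `_FAILED` suffix convention is taken from the recovery cell in Section 4.0; the helper name `rank_models` and the tie-break on training time are assumptions made for illustration.

```python
def rank_models(training_results, metric='macro_f1'):
    """Return [(model, score, seconds)] sorted best-first, ignoring failed entries."""
    rows = []
    for name, data in training_results.items():
        if name.endswith('_FAILED') or not isinstance(data, dict):
            continue
        metrics = data.get('metrics') or {}
        if metric not in metrics:
            continue  # no score recorded for this model
        rows.append((name, float(metrics[metric]), float(data.get('time', 0.0))))
    # Highest score first; ties go to the model that trained faster
    rows.sort(key=lambda r: (-r[1], r[2]))
    return rows


example = {
    'xgboost': {'metrics': {'macro_f1': 0.41}, 'time': 30},
    'lstm': {'metrics': {'macro_f1': 0.41}, 'time': 300},
    'gru_FAILED': {'error': 'out of memory'},
    'lightgbm': {'metrics': {'macro_f1': 0.45}, 'time': 20},
}
for name, score, seconds in rank_models(example):
    print(f"{name:<10} {score:.4f} {seconds:>7.1f}s")
```

The scores in `example` are made-up values that only demonstrate the ordering, not results from this project.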

In [None]:
#@title 4.4 Evaluate on Test Set { display-mode: "form" }
#@markdown Evaluate the best model on the held-out test set.

import sys
if '/content/research' not in sys.path:
    sys.path.insert(0, '/content/research')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, f1_score
from pathlib import Path
import json
import gc

# Initialize and recover TRAINING_RESULTS if needed
if 'TRAINING_RESULTS' not in dir():
    TRAINING_RESULTS = {}

# Try to load from cache if empty
if not TRAINING_RESULTS:
    print("No results in memory. Checking disk cache...")
    results_cache = Path('/content/drive/MyDrive/research/experiments/.training_results_cache.json')
    if results_cache.exists():
        try:
            with open(results_cache) as f:
                TRAINING_RESULTS = json.load(f)
            print(f"Loaded {len(TRAINING_RESULTS)} model(s) from cache.\n")
        except Exception as e:
            print(f"Could not load cache: {e}")

# Use default HORIZON if not set
HORIZON = HORIZON if 'HORIZON' in dir() else 20

if not TRAINING_RESULTS:
    print("No trained models found.")
    print("\nOptions:")
    print("  1. Run Section 4.2 to train models")
    print("  2. Run Section 4.0 to recover from cache")
    print("  3. Manually load a model from disk (see code below)")
    print("\n# Manual model loading example:")
    print("# from src.models import ModelRegistry")
    print("# model = ModelRegistry.create('xgboost')")
    print("# model.load(Path('/content/drive/MyDrive/research/experiments/runs/<run_id>/checkpoints/best_model'))")
else:
    # Find best model
    best_model_name = max(TRAINING_RESULTS, key=lambda m: TRAINING_RESULTS[m].get('metrics', {}).get('macro_f1', 0))
    best_run_id = TRAINING_RESULTS[best_model_name].get('run_id', 'unknown')
    
    print("=" * 60)
    print(f" TEST SET EVALUATION: {best_model_name.upper()}")
    print("=" * 60)
    print(f"Using horizon: H{HORIZON}")
    print(f"Run ID: {best_run_id}")
    
    try:
        # Load the container, then keep only the test split
        from src.phase1.stages.datasets.container import TimeSeriesDataContainer
        container = TimeSeriesDataContainer.from_parquet_dir(
            path=Path('/content/research/data/splits/scaled'),
            horizon=HORIZON
        )
        X_test, y_test, _ = container.get_sklearn_arrays('test')
        
        # Free container after extracting test data
        del container
        gc.collect()
        
        # Load model and predict
        from src.models import ModelRegistry
        model = ModelRegistry.create(best_model_name)
        model_path = Path(f'/content/drive/MyDrive/research/experiments/runs/{best_run_id}/checkpoints/best_model')
        
        if not model_path.exists():
            print(f"\nModel checkpoint not found at: {model_path}")
            print("Searching for alternative checkpoints...")
            experiments_dir = Path('/content/drive/MyDrive/research/experiments/runs')
            if experiments_dir.exists():
                for run_dir in sorted(experiments_dir.iterdir(), reverse=True):
                    alt_path = run_dir / 'checkpoints' / 'best_model'
                    if alt_path.exists():
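                        print(f"Found: {alt_path}")
                        print("  [WARNING] Using the first checkpoint found (runs sorted by name, descending); "
                              f"verify that it belongs to {best_model_name}")
                        model_path = alt_path
                        break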
        
        model.load(model_path)
        
        predictions = model.predict(X_test)
        y_pred = predictions.class_predictions
        
        # Free model after prediction
        del model
        gc.collect()
        
        # Calculate metrics
        accuracy = accuracy_score(y_test, y_pred)
        macro_f1 = f1_score(y_test, y_pred, average='macro')
        
        print(f"\nTest Set Results:")
        print(f"  Samples: {len(y_test):,}")
        print(f"  Accuracy: {accuracy:.2%}")
        print(f"  Macro F1: {macro_f1:.4f}")
        
        # Classification report
        print(f"\nClassification Report:")
        class_names = ['Short', 'Neutral', 'Long']
        print(classification_report(y_test, y_pred, target_names=class_names))
        
        # Confusion matrix
        cm = confusion_matrix(y_test, y_pred)
        fig, ax = plt.subplots(figsize=(8, 6))
        sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
                    xticklabels=class_names, yticklabels=class_names, ax=ax)
        ax.set_xlabel('Predicted')
        ax.set_ylabel('Actual')
        ax.set_title(f'Test Set Confusion Matrix - {best_model_name.upper()}')
        plt.tight_layout()
        plt.show()
        
        # Close figure to free memory
        plt.close(fig)
        gc.collect()
        
        print(f"\nModel loaded from: {model_path}")
        
    except FileNotFoundError as e:
        print(f"\nFile not found: {e}")
        print("Make sure the data and model files exist.")
    except Exception as e:
        print(f"\nError during evaluation: {e}")
        import traceback
        traceback.print_exc()
    finally:
        # Ensure cleanup happens even on error
        gc.collect()
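
To make the reported numbers easier to interpret, here is a small worked example of how accuracy and Macro F1 follow from a confusion matrix laid out as in the heatmap above (rows are actual classes, columns are predicted classes). The matrix values are invented for illustration, and `per_class_f1` and `summarize` are hypothetical helpers rather than functions from scikit-learn or the repository.

```python
def per_class_f1(cm):
    """F1 per class from a square confusion matrix (rows=actual, cols=predicted)."""
    n = len(cm)
    scores = []
    for k in range(n):
        tp = cm[k][k]
        predicted_k = sum(cm[i][k] for i in range(n))  # column sum = TP + FP
        actual_k = sum(cm[k])                          # row sum = TP + FN
        denom = predicted_k + actual_k                 # = 2*TP + FP + FN
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


def summarize(cm):
    """Accuracy and unweighted (macro) mean of the per-class F1 scores."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[k][k] for k in range(len(cm)))
    f1s = per_class_f1(cm)
    return {
        'accuracy': correct / total if total else 0.0,
        'macro_f1': sum(f1s) / len(f1s),
        'per_class_f1': f1s,
    }


# Illustrative matrix for the classes Short, Neutral, Long
cm = [
    [5, 1, 0],
    [2, 3, 1],
    [0, 1, 7],
]
summary = summarize(cm)
print(f"Accuracy: {summary['accuracy']:.2%}")
print(f"Macro F1: {summary['macro_f1']:.4f}")
```

Because Macro F1 averages the classes without weighting, a weak minority class (Neutral in this made-up matrix) pulls it below accuracy.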

---

## 5. Phase 3: Cross-Validation (Optional)

Run cross-validation for robust model evaluation. (Coming soon)
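
Since this phase is not implemented yet, the following is only a sketch of one common approach for time-ordered data: walk-forward (expanding-window) splitting. It is not the notebook's planned implementation. The function name `walk_forward_splits` and its `gap` parameter are hypothetical; `gap` is meant to be at least the label horizon (for example `HORIZON`), on the assumption that labels look that many rows ahead.

```python
def walk_forward_splits(n_samples, n_folds=5, gap=0):
    """Yield (train_range, val_range) pairs where validation always follows training.

    `gap` rows between the two ranges are dropped so that forward-looking
    labels at the end of the training window cannot overlap validation rows.
    """
    fold_size = n_samples // (n_folds + 1)
    if fold_size < 1:
        raise ValueError("Not enough samples for the requested number of folds")
    for k in range(1, n_folds + 1):
        val_start = k * fold_size
        # The last fold absorbs any remainder rows
        val_end = n_samples if k == n_folds else val_start + fold_size
        train_end = val_start - gap
        if train_end < 1:
            continue  # gap leaves no training data for this fold
        yield range(0, train_end), range(val_start, val_end)


for train_idx, val_idx in walk_forward_splits(n_samples=100, n_folds=4, gap=5):
    print(f"train [0, {train_idx.stop})  val [{val_idx.start}, {val_idx.stop})")
```

Each fold trains on everything before the validation window minus the gap, so later folds see more history than earlier ones.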

---

## 6. Phase 4: Ensemble Training (Optional)
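
The batch-size helpers used in Sections 4.2 and 6.1 can be exercised on their own to see when a reduction actually triggers. The two functions below are condensed copies of the ones defined in those cells; the inputs (100 features, a 15 GB GPU, the listed sequence lengths) are illustrative assumptions, not values from this project.

```python
def estimate_gpu_memory_needed(model_name, n_samples, n_features, seq_len, batch_size):
    """Estimate GPU memory needed for training in GB (same heuristic as the notebook)."""
    if model_name in ['lstm', 'gru']:
        hidden_size, num_layers = 128, 2
        params_mb = (4 * hidden_size * (n_features + hidden_size + 1) * num_layers * 4) / 1e6
        activations_mb = batch_size * seq_len * hidden_size * 4 / 1e6
        return (params_mb + activations_mb * 3) / 1024
    elif model_name == 'tcn':
        hidden_size, params_mb = 256, 5
        activations_mb = batch_size * seq_len * hidden_size * 4 / 1e6
        return (params_mb + activations_mb * 3) / 1024
    return 0


def get_safe_batch_size(model_name, n_samples, n_features, seq_len, available_gb, base_batch_size=256):
    """Halve the batch size until the estimate fits in the usable memory."""
    if model_name not in ['lstm', 'gru', 'tcn']:
        return base_batch_size
    usable_gb = max(1, available_gb - 2)  # keep 2 GB in reserve
    batch_size = base_batch_size
    while batch_size >= 16:
        if estimate_gpu_memory_needed(model_name, n_samples, n_features, seq_len, batch_size) < usable_gb:
            return batch_size
        batch_size //= 2
    return 16


# Illustrative inputs: (model, sequence length, requested batch size)
for model_name, seq_len, batch in [('lstm', 60, 512), ('tcn', 60, 512), ('lstm', 100_000, 512)]:
    est = estimate_gpu_memory_needed(model_name, 100_000, 100, seq_len, batch)
    safe = get_safe_batch_size(model_name, 100_000, 100, seq_len, available_gb=15, base_batch_size=batch)
    print(f"{model_name:<5} seq_len={seq_len:<7} estimate={est:.3f} GB  safe_batch={safe}")
```

With these constants the estimate for a 60-step sequence at batch size 512 stays well under 1 GB, so the halving loop only engages for much longer sequences or larger batches. The OOM retry in the training loop remains the practical safeguard.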

In [None]:
#@title 6.1 Train Ensemble { display-mode: "form" }
#@markdown Combine multiple models into an ensemble for improved predictions.

import sys
if '/content/research' not in sys.path:
    sys.path.insert(0, '/content/research')

import gc

train_ensemble = False  #@param {type: "boolean"}
ensemble_type = "blending"  #@param ["voting", "stacking", "blending"]
base_models = "xgboost,lightgbm,catboost"  #@param {type: "string"}
meta_learner = "logistic"  #@param ["logistic", "random_forest", "xgboost"]

# GPU memory utilities (redefined here so this cell can run on its own)
def estimate_gpu_memory_needed(model_name, n_samples, n_features, seq_len, batch_size):
    """Estimate GPU memory needed for training in GB."""
    if model_name in ['lstm', 'gru']:
        hidden_size = 128
        num_layers = 2
        params_mb = (4 * hidden_size * (n_features + hidden_size + 1) * num_layers * 4) / 1e6
        activations_mb = batch_size * seq_len * hidden_size * 4 / 1e6
        return (params_mb + activations_mb * 3) / 1024
    elif model_name == 'tcn':
        hidden_size = 256
        params_mb = 5
        activations_mb = batch_size * seq_len * hidden_size * 4 / 1e6
        return (params_mb + activations_mb * 3) / 1024
    return 0

def get_safe_batch_size(model_name, n_samples, n_features, seq_len, available_gb, base_batch_size=256):
    """Calculate safe batch size to avoid OOM."""
    if model_name not in ['lstm', 'gru', 'tcn']:
        return base_batch_size
    usable_gb = max(1, available_gb - 2)
    batch_size = base_batch_size
    while batch_size >= 16:
        estimated = estimate_gpu_memory_needed(model_name, n_samples, n_features, seq_len, batch_size)
        if estimated < usable_gb:
            return batch_size
        batch_size = batch_size // 2
    return 16

def clear_gpu_memory():
    """Clear GPU memory cache and run garbage collection."""
    gc.collect()
    try:
        import torch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
            torch.cuda.synchronize()
    except Exception:
        pass

if train_ensemble:
    import time
    from pathlib import Path
    
    # Parse base models
    base_model_list = [m.strip() for m in base_models.split(',')]
    
    print("=" * 60)
    print(f" {ensemble_type.upper()} ENSEMBLE TRAINING")
    print("=" * 60)
    print(f"Base models: {', '.join(base_model_list)}")
    print(f"Meta-learner: {meta_learner}")
    
    # Check for neural models and GPU memory
    neural_models = [m for m in base_model_list if m in ['lstm', 'gru', 'tcn']]
    if neural_models:
        import torch
        if torch.cuda.is_available():
            gpu_mem = torch.cuda.get_device_properties(0).total_memory / (1024**3)
            print(f"GPU Memory: {gpu_mem:.1f} GB")
            if gpu_mem < 10:
                print(f"  [NOTE] Low GPU memory - batch sizes will be auto-reduced")
        else:
            print("  [WARNING] No GPU - neural models will use CPU (slow)")
    print()
    
    # Store results for comparison
    base_model_results = {}
    ensemble_start_time = time.time()
    
    try:
        from src.models import ModelRegistry, Trainer, TrainerConfig
        from src.phase1.stages.datasets.container import TimeSeriesDataContainer
        
        # Get GPU info
        import torch
        GPU_AVAILABLE = torch.cuda.is_available()
        GPU_MEMORY = torch.cuda.get_device_properties(0).total_memory / (1024**3) if GPU_AVAILABLE else 0
        
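        # Fall back to the default horizon if earlier cells were skipped (e.g. after a kernel restart)
        HORIZON = HORIZON if 'HORIZON' in dir() else 20

        # Load data from GitHub clone
        print("Loading data...")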
        container = TimeSeriesDataContainer.from_parquet_dir(
            path=Path('/content/research/data/splits/scaled'),
            horizon=HORIZON
        )
        n_samples = container.splits['train'].n_samples
        n_features = container.n_features
        print(f"  Samples: train={n_samples:,}, "
              f"val={container.splits['val'].n_samples:,}")
        print(f"  Features: {n_features}")
        print()
        
        # Train each base model with progress
        print("-" * 60)
        print(" Training Base Models")
        print("-" * 60)
        
        for i, model_name in enumerate(base_model_list, 1):
            print(f"[{i}/{len(base_model_list)}] Training {model_name}...", end=" ", flush=True)
            
            # Clear GPU memory before each model
            clear_gpu_memory()
            
            model_start = time.time()
            
            try:
                # Configure base model
                if model_name in ['lstm', 'gru', 'tcn']:
                    # Calculate safe batch size
                    seq_len = SEQ_LEN if 'SEQ_LEN' in dir() else 60
                    base_batch = RECOMMENDED_BATCH_SIZE if 'RECOMMENDED_BATCH_SIZE' in dir() else 256
                    safe_batch_size = get_safe_batch_size(
                        model_name, n_samples, n_features, seq_len, GPU_MEMORY, base_batch
                    )
                    
                    if safe_batch_size < base_batch:
                        print(f"(batch={safe_batch_size}) ", end="", flush=True)
                    
                    config = TrainerConfig(
                        model_name=model_name,
                        horizon=HORIZON,
                        sequence_length=seq_len,
                        batch_size=safe_batch_size,
                        max_epochs=50,
                        early_stopping_patience=10,
                        output_dir=Path('/content/drive/MyDrive/research/experiments/runs'),
                        device="cuda" if GPU_AVAILABLE else "cpu",
                        mixed_precision=MIXED_PRECISION if 'MIXED_PRECISION' in dir() else False,
                    )
                    
                    # Try training with OOM recovery
                    trainer = Trainer(config)
                    try:
                        results = trainer.run(container)
                    except RuntimeError as e:
                        # Retry only on genuine out-of-memory errors; other CUDA errors are re-raised
                        if "out of memory" in str(e).lower():
                            print(f"OOM, retrying...", end=" ", flush=True)
                            clear_gpu_memory()
                            
                            # Retry with halved batch size
                            retry_batch_size = max(16, safe_batch_size // 2)
                            config = TrainerConfig(
                                model_name=model_name,
                                horizon=HORIZON,
                                sequence_length=seq_len,
                                batch_size=retry_batch_size,
                                max_epochs=50,
                                early_stopping_patience=10,
                                output_dir=Path('/content/drive/MyDrive/research/experiments/runs'),
                                device="cuda" if GPU_AVAILABLE else "cpu",
                                mixed_precision=MIXED_PRECISION if 'MIXED_PRECISION' in dir() else False,
                            )
                            trainer = Trainer(config)
                            results = trainer.run(container)
                        else:
                            raise
                else:
                    config = TrainerConfig(
                        model_name=model_name,
                        horizon=HORIZON,
                        output_dir=Path('/content/drive/MyDrive/research/experiments/runs'),
                    )
                    trainer = Trainer(config)
                    results = trainer.run(container)
                
                model_elapsed = time.time() - model_start
                metrics = results.get('evaluation_metrics', {})
                
                base_model_results[model_name] = {
                    'accuracy': metrics.get('accuracy', 0),
                    'macro_f1': metrics.get('macro_f1', 0),
                    'weighted_f1': metrics.get('weighted_f1', 0),
                    'time': model_elapsed,
                    'run_id': results.get('run_id', 'unknown'),
                }
                
                print(f"done ({model_elapsed:.1f}s) - Acc: {metrics.get('accuracy', 0):.1%}")
                
                # Clear GPU memory after each model
                clear_gpu_memory()
                
            except Exception as e:
                print(f"FAILED: {e}")
                base_model_results[model_name] = {
                    'accuracy': 0, 'macro_f1': 0, 'weighted_f1': 0,
                    'time': time.time() - model_start, 'error': str(e)
                }
                clear_gpu_memory()
        
        print()
        
        # Clear GPU memory before ensemble training
        clear_gpu_memory()
        
        # Train ensemble (meta-learner)
        print("-" * 60)
        print(" Training Ensemble Meta-Learner")
        print("-" * 60)
        print(f"Training {ensemble_type} with {meta_learner} meta-learner...", end=" ", flush=True)
        
        meta_start = time.time()
        
        # Configure ensemble
        ensemble_config = TrainerConfig(
            model_name=ensemble_type,
            horizon=HORIZON,
            output_dir=Path('/content/drive/MyDrive/research/experiments/runs'),
            model_config={
                "base_model_names": base_model_list,
                "meta_learner": meta_learner,
            }
        )
        
        ensemble_trainer = Trainer(ensemble_config)
        ensemble_results = ensemble_trainer.run(container)
        
        meta_elapsed = time.time() - meta_start
        ensemble_metrics = ensemble_results.get('evaluation_metrics', {})
        
        print(f"done ({meta_elapsed:.1f}s)")
        
        # Store ensemble results
        ensemble_accuracy = ensemble_metrics.get('accuracy', 0)
        ensemble_macro_f1 = ensemble_metrics.get('macro_f1', 0)
        ensemble_weighted_f1 = ensemble_metrics.get('weighted_f1', 0)
        
        total_elapsed = time.time() - ensemble_start_time
        
        # Display results
        print()
        print("=" * 60)
        print(" ENSEMBLE RESULTS")
        print("=" * 60)
        print(f"Accuracy:     {ensemble_accuracy:.2%}")
        print(f"Macro F1:     {ensemble_macro_f1:.4f}")
        print(f"Weighted F1:  {ensemble_weighted_f1:.4f}")
        print()
        
        # Comparison table
        print("-" * 60)
        print(" Comparison: Ensemble vs Base Models")
        print("-" * 60)
        print(f"{'Model':<15} {'Accuracy':>10} {'Macro F1':>10} {'Time':>10}")
        print("-" * 45)
        
        # Find best accuracy
        all_accuracies = {k: v['accuracy'] for k, v in base_model_results.items()}
        all_accuracies[ensemble_type.upper()] = ensemble_accuracy
        best_model = max(all_accuracies, key=all_accuracies.get)
        
        # Print base model results
        for model_name, data in base_model_results.items():
            # Flag failures by the recorded 'error' key, so a genuine 0.00 score is not shown as an error
            acc_str = f"{data['accuracy']:.2%}" if 'error' not in data else "ERROR"
            f1_str = f"{data['macro_f1']:.4f}" if 'error' not in data else "N/A"
            time_str = f"{data['time']:.1f}s"
            marker = " <-- Best!" if model_name == best_model else ""
            print(f"{model_name:<15} {acc_str:>10} {f1_str:>10} {time_str:>10}{marker}")
        
        # Print ensemble result (highlighted)
        marker = " <-- Best!" if ensemble_type.upper() == best_model else ""
        print(f"{ensemble_type.upper():<15} {ensemble_accuracy:>10.2%} {ensemble_macro_f1:>10.4f} {meta_elapsed:>9.1f}s{marker}")
        print("-" * 45)
        
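        # Calculate improvement over the best successful base model
        # (default=0.0 avoids a ValueError when every base model failed)
        best_base_acc = max((v['accuracy'] for v in base_model_results.values() if v['accuracy'] > 0), default=0.0)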
        improvement = ensemble_accuracy - best_base_acc
        
        print()
        if improvement > 0:
            print(f"Ensemble improvement: +{improvement:.2%} over best base model")
        elif improvement < 0:
            print(f"Ensemble underperformed by: {abs(improvement):.2%}")
        else:
            print("Ensemble matched best base model performance")
        
        print(f"Total training time: {total_elapsed:.1f}s")
        print()
        print("=" * 60)
        
        # Store in TRAINING_RESULTS for later comparison
        if 'TRAINING_RESULTS' not in globals():
            TRAINING_RESULTS = {}
        
        # Add base models to training results
        for model_name, data in base_model_results.items():
            if 'error' not in data:
                TRAINING_RESULTS[model_name] = {
                    'metrics': {
                        'accuracy': data['accuracy'],
                        'macro_f1': data['macro_f1'],
                        'weighted_f1': data['weighted_f1'],
                    },
                    'time': data['time'],
                    'run_id': data.get('run_id', 'unknown'),
                }
        
        # Add ensemble to training results
        TRAINING_RESULTS[f"{ensemble_type}_ensemble"] = {
            'metrics': {
                'accuracy': ensemble_accuracy,
                'macro_f1': ensemble_macro_f1,
                'weighted_f1': ensemble_weighted_f1,
            },
            'time': total_elapsed,
            'run_id': ensemble_results.get('run_id', 'unknown'),
            'base_models': base_model_list,
            'meta_learner': meta_learner,
        }
        
        print(f"Results stored in TRAINING_RESULTS['{ensemble_type}_ensemble']")
        
        # Final GPU cleanup
        clear_gpu_memory()
        
    except ImportError as e:
        print(f"\nImport Error: {e}")
        print("Make sure all required modules are available.")
        print("Try running: !pip install xgboost lightgbm catboost scikit-learn")
        clear_gpu_memory()
        
    except FileNotFoundError as e:
        print(f"\nData Error: {e}")
        print("Processed data not found. Run Section 3.2 first to prepare data.")
        clear_gpu_memory()
        
    except Exception as e:
        print(f"\nUnexpected Error: {e}")
        import traceback
        traceback.print_exc()
        print()
        print("Troubleshooting tips:")
        print("  1. Verify data exists: !ls /content/research/data/splits/scaled/")
        print("  2. Check model registry: from src.models import ModelRegistry; print(ModelRegistry.list_all())")
        print("  3. Try training base models individually first (Section 4.2)")
        clear_gpu_memory()
        
else:
    print("Ensemble training skipped.")
    print("Enable 'train_ensemble' checkbox above to run.")
    print()
    print("Available ensemble types:")
    print("  - voting: Weighted average of base model predictions")
    print("  - stacking: Train meta-learner on out-of-fold predictions")
    print("  - blending: Train meta-learner on holdout set predictions")
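
The `voting` option above is described as a weighted average of base model predictions. The repository's own implementation is not shown in this notebook, so the following is only a minimal, dependency-free sketch of that idea. The function names, model names, weights, and probabilities are all hypothetical and chosen for illustration.

```python
from typing import Dict, List, Sequence


def weighted_soft_vote(
    probas: Dict[str, Sequence[Sequence[float]]],
    weights: Dict[str, float],
) -> List[List[float]]:
    """Blend per-model class probabilities with a weighted average.

    `probas` maps a (hypothetical) model name to rows of class probabilities;
    `weights` maps the same names to non-negative weights.
    """
    total = sum(weights[name] for name in probas)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    names = list(probas)
    n_rows = len(probas[names[0]])
    n_classes = len(probas[names[0]][0])

    blended = [[0.0] * n_classes for _ in range(n_rows)]
    for name in names:
        w = weights[name] / total  # normalise so each blended row still sums to 1
        for i, row in enumerate(probas[name]):
            for j, p in enumerate(row):
                blended[i][j] += w * p
    return blended


def predict_labels(blended: List[List[float]]) -> List[int]:
    """Pick the class index with the highest blended probability per row."""
    return [max(range(len(row)), key=row.__getitem__) for row in blended]


# Illustrative numbers only: two models, two samples, three classes
example_probas = {
    "model_a": [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]],
    "model_b": [[0.2, 0.2, 0.6], [0.1, 0.2, 0.7]],
}
example_weights = {"model_a": 3.0, "model_b": 1.0}

blended = weighted_soft_vote(example_probas, example_weights)
labels = predict_labels(blended)
print(labels)
```

Stacking and blending differ from this in that the combination rule is learned by a meta-learner rather than fixed up front, which is why they need out-of-fold or holdout predictions to train on.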

---

## 7. Save Results & Next Steps

In [None]:
#@title 7.1 Summary & Saved Artifacts { display-mode: "form" }
#@markdown Display summary and location of all saved files.

from pathlib import Path

print("=" * 60)
print(" SESSION SUMMARY")
print("=" * 60)

# Data summary
print("\n DATA (from GitHub clone):")
splits_dir = Path('/content/research/data/splits/scaled')
if splits_dir.exists():
    for f in splits_dir.glob("*.parquet"):
        size_mb = f.stat().st_size / 1e6
        print(f"  {f.name}: {size_mb:.1f} MB")

# Training results
print("\n TRAINED MODELS (saved to Google Drive):")
experiments_dir = Path('/content/drive/MyDrive/research/experiments/runs')
if experiments_dir.exists():
    # Keep only run directories, then show the last five by name
    runs = sorted(p for p in experiments_dir.iterdir() if p.is_dir())
    for run_dir in runs[-5:]:
        print(f"  {run_dir.name}")

# Next steps
print("\n NEXT STEPS:")
print("  1. Review model metrics in Google Drive: experiments/runs/")
print("  2. Try different model configurations")
print("  3. Run cross-validation for robust evaluation")
print("  4. Train an ensemble and compare it against the best base model")
print("  5. Export best model for production")

print("\n" + "=" * 60)
print(" Data loaded from: /content/research")
print(" Results saved to: /content/drive/MyDrive/research")
print("=" * 60)
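
Before exporting a model, it can help to rank what was trained in this session. The sketch below assumes entries shaped like the ones the ensemble cell stores in `TRAINING_RESULTS` (a `metrics` dict with `accuracy`, `macro_f1`, and `weighted_f1`, plus `time` and `run_id`). The helper name `pick_best_run` and the sample values are hypothetical.

```python
from typing import Any, Dict, Optional, Tuple


def pick_best_run(
    training_results: Dict[str, Dict[str, Any]],
    metric: str = "macro_f1",
) -> Optional[Tuple[str, Dict[str, Any]]]:
    """Return (name, entry) with the highest `metric`, or None if nothing qualifies."""
    # Skip entries that do not report the requested metric
    scored = {
        name: entry
        for name, entry in training_results.items()
        if metric in entry.get("metrics", {})
    }
    if not scored:
        return None
    best = max(scored, key=lambda name: scored[name]["metrics"][metric])
    return best, scored[best]


# Illustrative values only; real entries come from the training cells
sample_results = {
    "xgboost": {
        "metrics": {"accuracy": 0.52, "macro_f1": 0.48, "weighted_f1": 0.51},
        "time": 12.0,
        "run_id": "run_a",  # hypothetical run id
    },
    "voting_ensemble": {
        "metrics": {"accuracy": 0.54, "macro_f1": 0.50, "weighted_f1": 0.53},
        "time": 40.0,
        "run_id": "run_b",  # hypothetical run id
    },
}

best_name, best_entry = pick_best_run(sample_results)
print(f"Best by macro_f1: {best_name} (run_id={best_entry['run_id']})")
```

In the notebook itself, the call would be `pick_best_run(TRAINING_RESULTS)`. Ranking by `macro_f1` rather than accuracy is a design choice that weights every class equally, so pass `metric="accuracy"` if that matches your evaluation criterion better.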

---

## Appendix: Quick Commands

```bash
# Train single model
!python scripts/train_model.py --model xgboost --horizon 20

# Train neural model
!python scripts/train_model.py --model lstm --horizon 20 --seq-len 60

# Run cross-validation
!python scripts/run_cv.py --models xgboost,lightgbm --horizons 20 --n-splits 5

# Train ensemble
!python scripts/train_model.py --model voting --horizon 20

# List all models
!python scripts/train_model.py --list-models
```