# ML Model Factory - Complete Pipeline & Training

This notebook runs the **complete ML pipeline** from raw data to trained models.

## What This Notebook Does
1. **Setup** - Mount Drive, install dependencies, detect GPU
2. **Phase 1** - Data pipeline (clean → features → labels → splits)
3. **Phase 2** - Model training (single or multiple models)
4. **Phase 3** - Cross-validation (optional)
5. **Phase 4** - Ensemble training (optional)

## Quick Start
1. Upload your data to `My Drive/research/data/raw/`
2. Run cells in order (or use Runtime → Run all)
3. Choose your training options in Section 4
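
Before uploading, it can help to confirm that a file carries the columns the pipeline expects. The sketch below is illustrative only: the column names come from the expected-format message printed in Section 2.1, while `missing_ohlcv_columns` is a hypothetical helper that is not part of the repository, and the case-insensitive comparison is an assumption about how strict the pipeline is.

```python
# Hypothetical helper (not part of the repository): report which of the
# expected OHLCV columns are absent from a file's header.
EXPECTED_COLUMNS = ("datetime", "open", "high", "low", "close", "volume")


def missing_ohlcv_columns(columns):
    """Return the expected columns that are missing (case-insensitive match)."""
    present = {str(c).strip().lower() for c in columns}
    return [c for c in EXPECTED_COLUMNS if c not in present]
```

Calling `missing_ohlcv_columns(df.columns)` on a loaded DataFrame returns an empty list when every expected column is present.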

---

## 1. Environment Setup
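
Section 1.3 picks a batch size and a mixed-precision flag from the GPU's memory and compute capability. The sketch below restates that heuristic as a pure function so it can be checked without a GPU. The function name `recommend_settings` and the default tier thresholds are assumptions made for illustration; the thresholds are set slightly below nominal card sizes because the driver typically reports a little less memory than the marketing figure.

```python
def recommend_settings(memory_gib, compute_major, tiers=((38, 1024), (14, 512))):
    """Return (batch_size, mixed_precision) for a GPU.

    tiers is a sequence of (minimum GiB, batch size) pairs, largest first.
    These defaults are illustrative assumptions, not tuned values.
    """
    for min_gib, batch_size in tiers:
        if memory_gib >= min_gib:
            return batch_size, True
    # Smaller cards: keep the default batch size and enable mixed precision
    # only when the compute capability is 7.x or newer, as in Section 1.3.
    return 256, compute_major >= 7
```

Passing the tiers as an argument keeps the memory thresholds easy to adjust for a different GPU lineup.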

In [None]:
#@title 1.1 Mount Google Drive & Clone Repository { display-mode: "form" }
#@markdown Run this cell to mount your Google Drive and set up the project.

import os
import sys
from pathlib import Path

# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Project paths
DRIVE_PROJECT = '/content/drive/MyDrive/research'
LOCAL_PROJECT = '/content/research'

# Clone or pull repository
if not Path(LOCAL_PROJECT).exists():
    print("Cloning repository...")
    !git clone https://github.com/Snehpatel101/research.git {LOCAL_PROJECT}
else:
    print("Pulling latest changes...")
    !cd {LOCAL_PROJECT} && git pull

# Change to project directory
os.chdir(LOCAL_PROJECT)
sys.path.insert(0, LOCAL_PROJECT)

print(f"\nProject directory: {os.getcwd()}")
print(f"Drive project: {DRIVE_PROJECT}")

In [None]:
#@title 1.2 Install Dependencies { display-mode: "form" }
#@markdown Installs all required packages for the ML pipeline.

print("Installing dependencies...")

# Install the package
!pip install -e . -q

# Additional packages for Colab
!pip install xgboost lightgbm catboost optuna -q

# Verify PyTorch with CUDA
import torch
if torch.cuda.is_available():
    print(f"PyTorch: {torch.__version__} with CUDA {torch.version.cuda}")
else:
    print(f"PyTorch: {torch.__version__} (CPU only)")

print("\nDependencies installed!")

In [None]:
#@title 1.3 Detect Hardware & Configure { display-mode: "form" }
#@markdown Detects GPU and configures optimal settings.

import torch
import platform

print("=" * 60)
print(" HARDWARE DETECTION")
print("=" * 60)

# System info
print(f"\nSystem: {platform.system()} {platform.release()}")
print(f"Python: {sys.version.split()[0]}")

# GPU detection
GPU_AVAILABLE = torch.cuda.is_available()
GPU_NAME = None
GPU_MEMORY = 0
RECOMMENDED_BATCH_SIZE = 256
MIXED_PRECISION = False

if GPU_AVAILABLE:
    props = torch.cuda.get_device_properties(0)
    GPU_NAME = props.name
    GPU_MEMORY = props.total_memory / (1024**3)
    
    print(f"\nGPU: {GPU_NAME}")
    print(f"Memory: {GPU_MEMORY:.1f} GB")
    print(f"Compute Capability: {props.major}.{props.minor}")
    
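    # Set batch size based on memory. Reported memory is slightly below the
    # nominal size (a 40 GB A100 reports ~39.6 GiB, a 16 GB T4 ~14.7 GiB),
    # so the thresholds sit just under the nominal values.
    if GPU_MEMORY >= 38:  # A100
        RECOMMENDED_BATCH_SIZE = 1024
        MIXED_PRECISION = True
    elif GPU_MEMORY >= 14:  # T4/V100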
        RECOMMENDED_BATCH_SIZE = 512
        MIXED_PRECISION = True
    else:
        RECOMMENDED_BATCH_SIZE = 256
        MIXED_PRECISION = props.major >= 7
    
    print(f"\nRecommended batch size: {RECOMMENDED_BATCH_SIZE}")
    print(f"Mixed precision: {'Enabled' if MIXED_PRECISION else 'Disabled'}")
else:
    print("\nNo GPU detected - will use CPU")
    print("Tip: Runtime -> Change runtime type -> GPU")

# Verify model registry
print("\n" + "=" * 60)
print(" AVAILABLE MODELS")
print("=" * 60)

try:
    from src.models import ModelRegistry
    models = ModelRegistry.list_models()
    for family, model_list in models.items():
        print(f"\n{family.upper()}:")
        for m in model_list:
            gpu_req = "GPU" if m in ['lstm', 'gru', 'tcn'] else "CPU"
            print(f"  - {m} ({gpu_req})")
except Exception as e:
    print(f"Error loading models: {e}")

print("\n" + "=" * 60)

---
## 2. Data Verification
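
Section 2.1 derives each symbol from the file name by taking the text before the first underscore and upper-casing it. The sketch below isolates that rule so the naming convention is easy to check; `symbol_from_filename` is a hypothetical name used here for illustration.

```python
from pathlib import Path


def symbol_from_filename(filename):
    """Mirror the rule in Section 2.1: text before the first underscore, upper-cased."""
    return Path(filename).stem.split("_")[0].upper()
```

Under this rule `MES_1m.parquet` maps to `MES`, while a file without an underscore, such as `prices.csv`, maps to `PRICES`, so files should follow the `SYMBOL_timeframe` pattern.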

In [None]:
#@title 2.1 Check Raw Data Files { display-mode: "form" }
#@markdown Verifies your OHLCV data files exist in Google Drive.

import pandas as pd
from pathlib import Path

RAW_DATA_DIR = Path(DRIVE_PROJECT) / "data" / "raw"
RAW_DATA_DIR.mkdir(parents=True, exist_ok=True)

print(f"Looking for data in: {RAW_DATA_DIR}")
print("=" * 60)

# Find all data files
parquet_files = list(RAW_DATA_DIR.glob("*.parquet"))
csv_files = list(RAW_DATA_DIR.glob("*.csv"))
all_files = parquet_files + csv_files

AVAILABLE_SYMBOLS = []

if all_files:
    print("\nFound data files:")
    for f in all_files:
        try:
            if f.suffix == '.parquet':
                df = pd.read_parquet(f)
            else:
                df = pd.read_csv(f)
            
            # Extract symbol from filename
            symbol = f.stem.split('_')[0].upper()
            AVAILABLE_SYMBOLS.append(symbol)
            
            size_mb = f.stat().st_size / 1e6
            print(f"\n  {f.name}")
            print(f"    Symbol: {symbol}")
            print(f"    Rows: {len(df):,}")
            print(f"    Size: {size_mb:.1f} MB")
            print(f"    Columns: {list(df.columns)}")
            # Report the date range whether datetime is a column or the index
            if 'datetime' in df.columns:
                dates = pd.to_datetime(df['datetime'])
                print(f"    Date range: {dates.min()} to {dates.max()}")
            elif df.index.name == 'datetime':
                dates = pd.to_datetime(df.index)
                print(f"    Date range: {dates.min()} to {dates.max()}")
        except Exception as e:
            print(f"  {f.name}: Error - {e}")
    
    AVAILABLE_SYMBOLS = list(set(AVAILABLE_SYMBOLS))
    print(f"\nAvailable symbols: {AVAILABLE_SYMBOLS}")
else:
    print("\nNo data files found!")
    print("\nPlease upload your OHLCV data to:")
    print(f"  {RAW_DATA_DIR}/")
    print("\nExpected format:")
    print("  - MES_1m.parquet or MES_1m.csv")
    print("  - Columns: datetime, open, high, low, close, volume")

In [None]:
#@title 2.2 Create Directory Structure { display-mode: "form" }
#@markdown Creates all required directories for the pipeline.

from pathlib import Path

directories = [
    "data/raw",
    "data/clean",
    "data/features",
    "data/labels",
    "data/final",
    "data/splits/scaled",
    "data/stacking",
    "config/ga_results",
    "runs",
    "results",
    "experiments/runs",
]

print("Creating directories...")
for d in directories:
    path = Path(DRIVE_PROJECT) / d
    path.mkdir(parents=True, exist_ok=True)
    print(f"  {d}/")

print("\nDirectory structure ready!")

---
## 3. Phase 1: Data Pipeline
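
The train/val ratios chosen in Section 3.1 determine where the data is cut. The sketch below shows one way the cut points could be computed, assuming a plain chronological split with no gap between the parts. That is an assumption: the notebook does not show how `PipelineRunner` splits the data, and the real pipeline may add purge or embargo gaps. `chronological_split` is a hypothetical helper.

```python
def chronological_split(n_rows, train_ratio, val_ratio):
    """Return (train, val, test) index ranges in time order.

    Assumes rows are already sorted by time; the remainder after the
    train and validation parts becomes the test set.
    """
    if not 0 < train_ratio + val_ratio < 1:
        raise ValueError("train_ratio + val_ratio must be between 0 and 1")
    train_end = round(n_rows * train_ratio)
    val_end = train_end + round(n_rows * val_ratio)
    return range(0, train_end), range(train_end, val_end), range(val_end, n_rows)
```

With 40,000 rows and the default 0.70/0.15 ratios this yields 28,000 training rows and 6,000 rows each for validation and test.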

In [None]:
#@title 3.1 Configure Pipeline { display-mode: "form" }
#@markdown Configure the data processing pipeline.

#@markdown ### Symbol Selection
symbols = "MES"  #@param {type: "string"}
#@markdown Comma-separated symbols (e.g., "MES,MGC")

#@markdown ### Timeframe
target_timeframe = "5min"  #@param ["5min", "10min", "15min", "30min", "1H"]

#@markdown ### Label Horizons
horizons = "5,10,15,20"  #@param {type: "string"}
#@markdown Comma-separated horizons (bars ahead)

#@markdown ### Train/Val/Test Split
train_ratio = 0.70  #@param {type: "slider", min: 0.5, max: 0.8, step: 0.05}
val_ratio = 0.15  #@param {type: "slider", min: 0.1, max: 0.25, step: 0.05}

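# Parse inputs (empty entries such as a trailing comma are ignored)
SYMBOLS = [s.strip().upper() for s in symbols.split(',') if s.strip()]
HORIZONS = [int(h.strip()) for h in horizons.split(',') if h.strip()]
TRAIN_RATIO = train_ratio
VAL_RATIO = val_ratio
TEST_RATIO = round(1.0 - train_ratio - val_ratio, 2)

# The sliders allow train + val >= 1.0, which would leave no test data
if TEST_RATIO <= 0:
    raise ValueError(
        f"train_ratio + val_ratio must be < 1.0, got {train_ratio} + {val_ratio}"
    )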

print("Pipeline Configuration:")
print(f"  Symbols: {SYMBOLS}")
print(f"  Timeframe: {target_timeframe}")
print(f"  Horizons: {HORIZONS}")
print(f"  Train/Val/Test: {TRAIN_RATIO}/{VAL_RATIO}/{TEST_RATIO}")

In [None]:
#@title 3.2 Run Data Pipeline { display-mode: "form" }
#@markdown Executes the complete Phase 1 data pipeline.
#@markdown This will take ~10-15 minutes for 40k samples.

skip_if_exists = True  #@param {type: "boolean"}
#@markdown Skip pipeline if processed data already exists

from pathlib import Path
import time

# Check if data already processed
splits_dir = Path(DRIVE_PROJECT) / "data" / "splits" / "scaled"
train_file = splits_dir / "train_scaled.parquet"

if skip_if_exists and train_file.exists():
    print("Processed data already exists!")
    print(f"  Location: {splits_dir}")
    
    import pandas as pd
    train_df = pd.read_parquet(train_file)
    print(f"  Train samples: {len(train_df):,}")
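    # Exclude the same non-feature columns as Section 3.3 so the counts agree
    meta_prefixes = ('label_', 'sample_weight', 'quality_score', 'datetime', 'symbol')
    print(f"  Features: {len([c for c in train_df.columns if not c.startswith(meta_prefixes)])}")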
    print("\nSkipping pipeline. Uncheck 'skip_if_exists' to rerun.")
else:
    print("Running Phase 1 Data Pipeline...")
    print("=" * 60)
    
    start_time = time.time()
    
    try:
        from src.phase1.pipeline_config import PipelineConfig
        from src.pipeline.runner import PipelineRunner
        
        config = PipelineConfig(
            symbols=SYMBOLS,
            project_root=Path(DRIVE_PROJECT),
            target_timeframe=target_timeframe,
            label_horizons=HORIZONS,
            train_ratio=TRAIN_RATIO,
            val_ratio=VAL_RATIO,
            test_ratio=TEST_RATIO,
        )
        
        runner = PipelineRunner(config)
        success = runner.run()
        
        elapsed = time.time() - start_time
        
        print("\n" + "=" * 60)
        if success:
            print(f"Pipeline completed in {elapsed/60:.1f} minutes!")
        else:
            print("Pipeline failed. Check errors above.")
            
    except Exception as e:
        print(f"\nError: {e}")
        import traceback
        traceback.print_exc()

In [None]:
#@title 3.3 Verify Processed Data { display-mode: "form" }
#@markdown Loads and displays the processed datasets.

import pandas as pd
from pathlib import Path

splits_dir = Path(DRIVE_PROJECT) / "data" / "splits" / "scaled"

print("Loading processed datasets...")
print("=" * 60)

try:
    train_df = pd.read_parquet(splits_dir / "train_scaled.parquet")
    val_df = pd.read_parquet(splits_dir / "val_scaled.parquet")
    test_df = pd.read_parquet(splits_dir / "test_scaled.parquet")
    
    print(f"\nDataset sizes:")
    print(f"  Train: {len(train_df):,} samples")
    print(f"  Val:   {len(val_df):,} samples")
    print(f"  Test:  {len(test_df):,} samples")
    print(f"  Total: {len(train_df) + len(val_df) + len(test_df):,} samples")
    
    # Count features
    feature_cols = [c for c in train_df.columns if not c.startswith(('label_', 'sample_weight', 'quality_score', 'datetime', 'symbol'))]
    label_cols = [c for c in train_df.columns if c.startswith('label_')]
    
    print(f"\nFeatures: {len(feature_cols)}")
    print(f"Labels: {label_cols}")
    
    # Label distribution
    print(f"\nLabel distribution (train):")
    for col in label_cols:
        dist = train_df[col].value_counts().sort_index()
        print(f"  {col}: Long={dist.get(1, 0):,} | Neutral={dist.get(0, 0):,} | Short={dist.get(-1, 0):,}")
    
    # Store for later use
    TRAIN_DF = train_df
    VAL_DF = val_df
    TEST_DF = test_df
    FEATURE_COLS = feature_cols
    
    print("\nData ready for model training!")
    
except FileNotFoundError:
    print("Processed data not found. Run Section 3.2 first.")

---
## 4. Phase 2: Model Training

Choose between **Single Model** and **Multi-Model** training.
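
Section 4 reports accuracy and macro F1, and ranks models by macro F1. With three classes (Long = 1, Neutral = 0, Short = -1, as shown in Section 3.3), accuracy can look good when one class dominates, whereas macro F1 weights every class equally. The sketch below is a plain-Python illustration of the metric, not the repository's implementation, which is not shown in this notebook. Averaging over a fixed class list, including classes that never occur, is an assumption.

```python
def macro_f1(y_true, y_pred, classes=(-1, 0, 1)):
    """Unweighted mean of per-class F1 scores.

    A class with no true positives, false positives, or false negatives
    contributes 0.0 (an assumption; libraries differ on this edge case).
    """
    scores = []
    for cls in classes:
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn  # F1 = 2TP / (2TP + FP + FN)
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

For `y_true = [1, 1, 0, -1]` and `y_pred = [1, 0, 0, -1]` the per-class F1 scores are 1, 2/3, and 2/3, giving a macro F1 of 7/9 against an accuracy of 3/4.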

In [None]:
#@title 4.1 Training Mode Selection { display-mode: "form" }
#@markdown Choose your training mode and models.

training_mode = "Single Model"  #@param ["Single Model", "Multi-Model (Sequential)", "Multi-Model (Compare All)"]

#@markdown ---
#@markdown ### Single Model Options
single_model = "xgboost"  #@param ["xgboost", "lightgbm", "catboost", "random_forest", "logistic", "svm", "lstm", "gru", "tcn"]

#@markdown ---
#@markdown ### Multi-Model Options
train_boosting = True  #@param {type: "boolean"}
#@markdown XGBoost, LightGBM, CatBoost
train_classical = False  #@param {type: "boolean"}
#@markdown Random Forest, Logistic, SVM
train_neural = False  #@param {type: "boolean"}
#@markdown LSTM, GRU, TCN (requires GPU)

#@markdown ---
#@markdown ### Training Parameters
horizon = 20  #@param [5, 10, 15, 20]
sequence_length = 60  #@param {type: "slider", min: 30, max: 120, step: 10}
#@markdown Sequence length for neural models

# Build model list. Both multi-model modes select from the family toggles above.
if training_mode == "Single Model":
    MODELS_TO_TRAIN = [single_model]
else:
    MODELS_TO_TRAIN = []
    if train_boosting:
        MODELS_TO_TRAIN.extend(['xgboost', 'lightgbm', 'catboost'])
    if train_classical:
        MODELS_TO_TRAIN.extend(['random_forest', 'logistic', 'svm'])
    if train_neural and GPU_AVAILABLE:
        MODELS_TO_TRAIN.extend(['lstm', 'gru', 'tcn'])
    elif train_neural:
        print("WARNING: Neural models skipped (no GPU)")
    if not MODELS_TO_TRAIN:
        print("WARNING: No models selected. Enable at least one model family.")

HORIZON = horizon
SEQ_LEN = sequence_length

print(f"Training Mode: {training_mode}")
print(f"Models to train: {MODELS_TO_TRAIN}")
print(f"Horizon: H{HORIZON}")
if any(m in ['lstm', 'gru', 'tcn'] for m in MODELS_TO_TRAIN):
    print(f"Sequence length: {SEQ_LEN}")

In [None]:
#@title 4.2 Train Models { display-mode: "form" }
#@markdown Execute model training based on your selections.

import time
from pathlib import Path

print("=" * 60)
print(" MODEL TRAINING")
print("=" * 60)

# Results storage
TRAINING_RESULTS = {}

try:
    from src.models import ModelRegistry, Trainer, TrainerConfig
    from src.phase1.stages.datasets.container import TimeSeriesDataContainer
    
    # Load data container
    print(f"\nLoading data for horizon H{HORIZON}...")
    container = TimeSeriesDataContainer.from_parquet_dir(
        path=Path(DRIVE_PROJECT) / "data" / "splits" / "scaled",
        horizon=HORIZON
    )
    print(f"  Train samples: {container.splits['train'].n_samples:,}")
    print(f"  Val samples: {container.splits['val'].n_samples:,}")
    print(f"  Features: {container.n_features}")
    
    # Train each model
    for i, model_name in enumerate(MODELS_TO_TRAIN, 1):
        print(f"\n{'='*60}")
        print(f" [{i}/{len(MODELS_TO_TRAIN)}] Training: {model_name.upper()}")
        print("=" * 60)
        
        start_time = time.time()
        
        # Neural models need sequence, batch, and device settings;
        # the other models use the TrainerConfig defaults
        if model_name in ['lstm', 'gru', 'tcn']:
            config = TrainerConfig(
                model_name=model_name,
                horizon=HORIZON,
                sequence_length=SEQ_LEN,
                batch_size=RECOMMENDED_BATCH_SIZE,
                max_epochs=50,
                early_stopping_patience=10,
                output_dir=Path(DRIVE_PROJECT) / "experiments" / "runs",
                device="cuda" if GPU_AVAILABLE else "cpu",
                mixed_precision=MIXED_PRECISION,
            )
        else:
            config = TrainerConfig(
                model_name=model_name,
                horizon=HORIZON,
                output_dir=Path(DRIVE_PROJECT) / "experiments" / "runs",
            )
        
        # Train
        trainer = Trainer(config)
        results = trainer.run(container)
        
        elapsed = time.time() - start_time
        
        # Store results
        TRAINING_RESULTS[model_name] = {
            'metrics': results.get('evaluation_metrics', {}),
            'time': elapsed,
            'run_id': results.get('run_id', 'unknown'),
        }
        
        # Display results
        metrics = results.get('evaluation_metrics', {})
        print(f"\n  Results:")
        print(f"    Accuracy: {metrics.get('accuracy', 0):.2%}")
        print(f"    Macro F1: {metrics.get('macro_f1', 0):.4f}")
        print(f"    Time: {elapsed:.1f}s")
        
except Exception as e:
    print(f"\nError during training: {e}")
    import traceback
    traceback.print_exc()

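# Report how many models finished; an error above stops the remaining models
print("\n" + "=" * 60)
print(f" TRAINING FINISHED: {len(TRAINING_RESULTS)}/{len(MODELS_TO_TRAIN)} models trained")
print("=" * 60)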

In [None]:
#@title 4.3 Compare Results { display-mode: "form" }
#@markdown Display comparison of all trained models.

import pandas as pd
import matplotlib.pyplot as plt

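# Use globals().get so the cell prints a hint instead of raising NameError
# when Section 4.2 has not been run yet
if globals().get('TRAINING_RESULTS'):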
    print("Model Comparison")
    print("=" * 60)
    
    # Build comparison table
    rows = []
    for model, data in TRAINING_RESULTS.items():
        metrics = data['metrics']
        rows.append({
            'Model': model,
            'Accuracy': metrics.get('accuracy', 0),
            'Macro F1': metrics.get('macro_f1', 0),
            'Weighted F1': metrics.get('weighted_f1', 0),
            'Time (s)': data['time'],
        })
    
    comparison_df = pd.DataFrame(rows)
    comparison_df = comparison_df.sort_values('Macro F1', ascending=False)
    
    # Display table
    print(comparison_df.to_string(index=False))
    
    # Plot if multiple models
    if len(TRAINING_RESULTS) > 1:
        fig, axes = plt.subplots(1, 2, figsize=(12, 4))
        
        # Accuracy bar chart
        comparison_df_sorted = comparison_df.sort_values('Accuracy', ascending=True)
        axes[0].barh(comparison_df_sorted['Model'], comparison_df_sorted['Accuracy'])
        axes[0].set_xlabel('Accuracy')
        axes[0].set_title('Model Accuracy Comparison')
        axes[0].set_xlim(0, 1)
        
        # Training time bar chart
        comparison_df_sorted = comparison_df.sort_values('Time (s)', ascending=True)
        axes[1].barh(comparison_df_sorted['Model'], comparison_df_sorted['Time (s)'])
        axes[1].set_xlabel('Training Time (seconds)')
        axes[1].set_title('Training Time Comparison')
        
        plt.tight_layout()
        plt.show()
    
    # Best model (the table is sorted by Macro F1, so the first row wins)
    best_model = comparison_df.iloc[0]['Model']
    print(f"\nBest model by Macro F1: {best_model}")
else:
    print("No training results yet. Run Section 4.2 first.")

---
## 5. Phase 3: Cross-Validation (Optional)
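
Purged k-fold differs from ordinary k-fold in that training rows close to the test fold are dropped, because labels that look several bars ahead would otherwise overlap with test data. The sketch below is a generic illustration of that idea. `scripts/run_cv.py` is not shown in this notebook, so its exact purge or embargo rules may differ, and `purged_kfold` with its `purge` parameter is a hypothetical helper.

```python
def purged_kfold(n_samples, n_splits, purge):
    """Yield (train_indices, test_indices) for contiguous, time-ordered folds.

    Training rows within `purge` bars on either side of the test fold are
    removed. A purge at least as large as the label horizon is a common
    choice, stated here as an assumption rather than the repo's rule.
    """
    fold_size = n_samples // n_splits
    for k in range(n_splits):
        start = k * fold_size
        # The last fold absorbs any remainder rows
        stop = n_samples if k == n_splits - 1 else start + fold_size
        test = list(range(start, stop))
        train = [i for i in range(n_samples)
                 if i < start - purge or i >= stop + purge]
        yield train, test
```

With 10 samples, 5 folds, and a purge of 1, the third fold tests on rows 4 and 5 and trains on rows 0 to 2 and 7 to 9, leaving rows 3 and 6 out as the purged buffer.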

In [None]:
#@title 5.1 Run Cross-Validation { display-mode: "form" }
#@markdown Run purged k-fold cross-validation for robust evaluation.

run_cv = False  #@param {type: "boolean"}
cv_model = "xgboost"  #@param ["xgboost", "lightgbm", "catboost", "random_forest"]
n_splits = 5  #@param {type: "slider", min: 3, max: 10, step: 1}

if run_cv:
    print(f"Running {n_splits}-fold cross-validation for {cv_model}...")
    print("=" * 60)
    
    !python scripts/run_cv.py \
        --models {cv_model} \
        --horizons {HORIZON} \
        --n-splits {n_splits} \
        --data-dir {DRIVE_PROJECT}/data/splits/scaled \
        --output-dir {DRIVE_PROJECT}/results/cv
else:
    print("Cross-validation skipped. Enable 'run_cv' to run.")

---
## 6. Phase 4: Ensemble Training (Optional)
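
The simplest of the ensemble types offered below is voting. The sketch shows hard (majority) voting over per-model class predictions. It is an illustration under assumptions: the repository's `voting` ensemble is not shown here and may use soft voting over probabilities instead, and `hard_vote` is a hypothetical helper.

```python
from collections import Counter


def hard_vote(predictions):
    """Combine {model_name: [labels]} into one majority label per sample.

    Ties go to the label that appears first among the votes, which means
    the model listed first wins a tie.
    """
    per_model = list(predictions.values())
    if len({len(p) for p in per_model}) != 1:
        raise ValueError("all models must predict the same number of samples")
    # zip(*per_model) yields one tuple of votes per sample
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*per_model)]
```

An odd number of base models, as in the default `xgboost,lightgbm,catboost`, reduces ties, though a three-way split is still possible with three classes.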

In [None]:
#@title 6.1 Train Ensemble { display-mode: "form" }
#@markdown Combine multiple models into an ensemble.

train_ensemble = False  #@param {type: "boolean"}
ensemble_type = "voting"  #@param ["voting", "stacking", "blending"]
base_models = "xgboost,lightgbm,catboost"  #@param {type: "string"}

if train_ensemble:
    print(f"Training {ensemble_type} ensemble...")
    print(f"Base models: {base_models}")
    print("=" * 60)
    
    try:
        from pathlib import Path  # imported here so the cell runs on its own
        from src.models import ModelRegistry, Trainer, TrainerConfig
        from src.phase1.stages.datasets.container import TimeSeriesDataContainer
        
        container = TimeSeriesDataContainer.from_parquet_dir(
            path=Path(DRIVE_PROJECT) / "data" / "splits" / "scaled",
            horizon=HORIZON
        )
        
        config = TrainerConfig(
            model_name=ensemble_type,
            horizon=HORIZON,
            output_dir=Path(DRIVE_PROJECT) / "experiments" / "runs",
            model_config={
                "base_model_names": [m.strip() for m in base_models.split(',')],
            }
        )
        
        trainer = Trainer(config)
        results = trainer.run(container)
        
        metrics = results.get('evaluation_metrics', {})
        print(f"\nEnsemble Results:")
        print(f"  Accuracy: {metrics.get('accuracy', 0):.2%}")
        print(f"  Macro F1: {metrics.get('macro_f1', 0):.4f}")
        
    except Exception as e:
        print(f"Error: {e}")
        import traceback
        traceback.print_exc()
else:
    print("Ensemble training skipped. Enable 'train_ensemble' to run.")

---
## 7. Save Results & Next Steps
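
The `TRAINING_RESULTS` dictionary built in Section 4.2 lives only in memory, so it is lost when the runtime disconnects. The sketch below shows one way to persist it to Drive as JSON. `save_training_summary` and the file name `training_summary.json` are hypothetical and not part of the repository.

```python
import json
from pathlib import Path


def save_training_summary(results, path):
    """Write a {model_name: {...}} results dict to a JSON file and return its path."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # default=str stops the dump from failing on values json cannot encode
    # (for example Path objects); such values are stored as strings.
    path.write_text(json.dumps(results, indent=2, default=str))
    return path
```

In this notebook it could be called as `save_training_summary(TRAINING_RESULTS, Path(DRIVE_PROJECT) / "results" / "training_summary.json")`.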

In [None]:
#@title 7.1 Summary & Saved Artifacts { display-mode: "form" }
#@markdown Display summary and location of all saved files.

from pathlib import Path
import json

print("=" * 60)
print(" SESSION SUMMARY")
print("=" * 60)

# Data summary
print("\n DATA:")
splits_dir = Path(DRIVE_PROJECT) / "data" / "splits" / "scaled"
if splits_dir.exists():
    for f in splits_dir.glob("*.parquet"):
        size_mb = f.stat().st_size / 1e6
        print(f"  {f.name}: {size_mb:.1f} MB")

# Training results
print("\n TRAINED MODELS:")
experiments_dir = Path(DRIVE_PROJECT) / "experiments" / "runs"
if experiments_dir.exists():
    # Filter to directories first so stray files don't count toward the last 5
    run_dirs = sorted(d for d in experiments_dir.iterdir() if d.is_dir())
    for run_dir in run_dirs[-5:]:  # Last 5 runs (by name)
        print(f"  {run_dir.name}")

# Next steps
print("\n NEXT STEPS:")
print("  1. Review model metrics in experiments/runs/")
print("  2. Try different model configurations")
print("  3. Run cross-validation for robust evaluation")
print("  4. Train ensemble for best performance")
print("  5. Export best model for production")

print("\n" + "=" * 60)
print(f" Results saved to: {DRIVE_PROJECT}")
print("=" * 60)

In [None]:
#@title 7.2 Quick Check: Inspect Latest Run { display-mode: "form" }
#@markdown Lists the files saved by the most recent training run (latest by directory name).

from pathlib import Path

experiments_dir = Path(DRIVE_PROJECT) / "experiments" / "runs"

if experiments_dir.exists():
    runs = sorted([d for d in experiments_dir.iterdir() if d.is_dir()])
    
    if runs:
        latest_run = runs[-1]
        print(f"Latest run: {latest_run.name}")
        
        # List contents
        for item in latest_run.rglob("*"):
            if item.is_file():
                rel_path = item.relative_to(latest_run)
                size = item.stat().st_size / 1024
                print(f"  {rel_path}: {size:.1f} KB")
    else:
        print("No training runs found.")
else:
    print("Experiments directory not found.")

---
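## Appendix: Quick Commands

Run these in a notebook cell from the project directory. The leading `!` is notebook shell syntax; drop it when running in a terminal.
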
```bash
# Train single model
!python scripts/train_model.py --model xgboost --horizon 20

# Train neural model
!python scripts/train_model.py --model lstm --horizon 20 --seq-len 60

# Run cross-validation
!python scripts/run_cv.py --models xgboost,lightgbm --horizons 20 --n-splits 5

# Train ensemble
!python scripts/train_model.py --model voting --horizon 20

# List all models
!python scripts/train_model.py --list-models
```