# ML Model Factory - Quickstart Guide

This notebook provides a quick introduction to training ML models on OHLCV time series data.

**What you'll learn:**
1. Setting up the environment (Colab/local)
2. Loading and preparing data
3. Training an XGBoost model
4. Evaluating results and visualizing predictions

**Requirements:**
- Google Colab with GPU runtime (T4 or better) OR local machine with GPU
- ~2GB RAM for sample data

## Step 1: Environment Setup

Run this cell first to set up the environment. It will:
- Install required packages
- Detect GPU/CPU
- Configure display settings

In [None]:
# Clone repository (uncomment if running in fresh Colab)
# !git clone https://github.com/YOUR_USERNAME/Research.git
# %cd Research

# Install dependencies
!pip install -q xgboost lightgbm catboost scikit-learn pandas numpy matplotlib tqdm pyarrow

# Add project to path
import sys
sys.path.insert(0, '.')

In [None]:
# Setup notebook environment
from src.utils.notebook import setup_notebook, display_metrics, plot_confusion_matrix

env = setup_notebook()
print(f"\nGPU Available: {env['gpu_available']}")
if env['gpu_available']:
    print(f"GPU Name: {env['gpu_name']}")
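
`setup_notebook` is a project helper whose internals are not shown in this notebook. As a rough sketch of what GPU detection can look like, the hypothetical `detect_gpu` function below asks `nvidia-smi` for the device name and falls back to CPU when the tool is missing or fails. The function name and the returned keys are assumptions chosen to mirror the `env` dictionary above, not the helper's actual implementation.

```python
import shutil
import subprocess

def detect_gpu():
    """Hypothetical stand-in for the GPU check: query nvidia-smi if present."""
    info = {"gpu_available": False, "gpu_name": None}
    if shutil.which("nvidia-smi") is None:
        return info  # No NVIDIA driver tools on PATH: treat as CPU-only
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
            capture_output=True, text=True, timeout=10, check=True,
        )
    except (subprocess.SubprocessError, OSError):
        return info  # Tool exists but failed (e.g. no device attached)
    names = [line.strip() for line in result.stdout.splitlines() if line.strip()]
    if names:
        info["gpu_available"] = True
        info["gpu_name"] = names[0]
    return info

gpu_info = detect_gpu()
print(f"GPU Available: {gpu_info['gpu_available']}")
```

On a CPU-only machine this reports `False` and the rest of the notebook still runs, only more slowly.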

## Step 2: Load Sample Data

We'll generate synthetic OHLCV data for this quickstart. In production, you'd load your own data.

In [None]:
from src.utils.notebook import download_sample_data
import pandas as pd

# Generate sample data
sample_paths = download_sample_data(output_dir="data/sample", symbols=["SAMPLE"])

# Load and inspect
df = pd.read_parquet(sample_paths["SAMPLE"])
print(f"\nData shape: {df.shape}")
print(f"Date range: {df['datetime'].min()} to {df['datetime'].max()}")
df.head()
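
`download_sample_data` belongs to the repository and its generator is not shown here. If you want to experiment without it, the sketch below produces synthetic OHLCV bars from a geometric random walk. The function name, its parameters, the 5-minute bar frequency, and the start date are all illustrative assumptions; only the column names follow the ones used in this notebook.

```python
import numpy as np
import pandas as pd

def make_synthetic_ohlcv(n_bars=1000, start_price=100.0, vol=0.001, seed=0):
    """Hypothetical generator: random-walk closes with consistent OHLC bars."""
    rng = np.random.default_rng(seed)
    # Close prices follow a geometric random walk
    close = start_price * np.exp(np.cumsum(rng.normal(0.0, vol, n_bars)))
    # Each bar opens at the previous close (first bar opens at start_price)
    open_ = np.concatenate([[start_price], close[:-1]])
    # Wicks extend beyond the open/close range by a non-negative random amount
    wick_up = np.abs(rng.normal(0.0, vol, n_bars)) * close
    wick_down = np.abs(rng.normal(0.0, vol, n_bars)) * close
    high = np.maximum(open_, close) + wick_up
    low = np.minimum(open_, close) - wick_down
    volume = rng.integers(100, 10_000, n_bars)
    return pd.DataFrame({
        "datetime": pd.date_range("2024-01-01", periods=n_bars, freq="5min"),
        "open": open_, "high": high, "low": low, "close": close,
        "volume": volume,
    })

demo_df = make_synthetic_ohlcv()
print(demo_df.head())
```

Because highs and lows are built from the open/close range plus a non-negative wick, every bar satisfies `low <= open, close <= high`, which the ATR and labeling steps rely on.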

## Step 3: Feature Engineering

Generate technical indicators from the OHLCV data. This is a simplified version - the full pipeline has 150+ features.

In [None]:
import numpy as np

def compute_features(df):
    """Compute technical features for model training."""
    df = df.copy()
    
    # Returns
    df['log_return'] = np.log(df['close'] / df['close'].shift(1))
    for p in [5, 10, 20]:
        df[f'return_{p}'] = df['close'].pct_change(p)
    
    # Moving averages
    for p in [10, 20, 50]:
        df[f'sma_{p}'] = df['close'].rolling(p).mean()
        df[f'close_to_sma_{p}'] = df['close'] / df[f'sma_{p}'] - 1
    
    # RSI
    delta = df['close'].diff()
    gain = delta.where(delta > 0, 0).rolling(14).mean()
    loss = (-delta.where(delta < 0, 0)).rolling(14).mean()
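    # Divide directly: a zero average loss gives rs = inf, so RSI saturates at 100.
    # (A completely flat window gives 0/0 = NaN and is removed by dropna() below.)
    rs = gain / loss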
    df['rsi'] = 100 - (100 / (1 + rs))
    
    # ATR
    tr = pd.concat([
        df['high'] - df['low'],
        abs(df['high'] - df['close'].shift(1)),
        abs(df['low'] - df['close'].shift(1))
    ], axis=1).max(axis=1)
    df['atr_14'] = tr.rolling(14).mean()
    df['atr_pct'] = df['atr_14'] / df['close']
    
    # Bollinger Bands
    sma20 = df['close'].rolling(20).mean()
    std20 = df['close'].rolling(20).std()
    df['bb_position'] = (df['close'] - (sma20 - 2*std20)) / (4*std20)
    
    # Volume features
    df['volume_sma'] = df['volume'].rolling(20).mean()
    df['volume_ratio'] = df['volume'] / df['volume_sma']
    
    # Volatility
    df['volatility_20'] = df['log_return'].rolling(20).std()
    
    return df.dropna()

df_features = compute_features(df)
print(f"Features computed: {len(df_features.columns) - len(df.columns)} features")
print(f"Samples after feature computation: {len(df_features):,}")
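
Every feature in `compute_features` is built from `shift`, `diff`, and trailing `rolling` windows, so the value at a given row should depend only on that row and earlier ones. One way to check this is to recompute the features on a truncated copy of the data and compare the last row. The sketch below uses a small stand-in feature function and random prices (both assumptions) so that it runs on its own; the same check can be pointed at `compute_features`.

```python
import numpy as np
import pandas as pd

def demo_features(df):
    """Stand-in for compute_features: two trailing-window features."""
    out = pd.DataFrame(index=df.index)
    out["close_to_sma_10"] = df["close"] / df["close"].rolling(10).mean() - 1
    out["volatility_20"] = np.log(df["close"] / df["close"].shift(1)).rolling(20).std()
    return out

def uses_only_past(feature_fn, df, cutoff):
    """True if features at row `cutoff` are unchanged when later rows are removed."""
    full = feature_fn(df).iloc[cutoff]
    truncated = feature_fn(df.iloc[: cutoff + 1]).iloc[-1]
    return bool(np.allclose(full.to_numpy(), truncated.to_numpy(), equal_nan=True))

rng = np.random.default_rng(0)
prices = pd.DataFrame({"close": 100 * np.exp(np.cumsum(rng.normal(0, 0.01, 300)))})
print(uses_only_past(demo_features, prices, cutoff=150))
```

A feature that used a centered window or a negative `shift` would fail this check, because removing the later rows would change its value.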

## Step 4: Create Labels

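Create trading labels using a simplified triple-barrier approach. For each bar, the barriers are set relative to the entry close and the 14-period ATR:
- **Long (+1)**: A later high reaches the upper barrier, `close + k_up * ATR` (1.5 ATR by default)
- **Short (-1)**: A later low reaches the lower barrier, `close - k_down * ATR` (1.0 ATR by default)
- **Neutral (0)**: Neither barrier is hit within the `horizon` (20 bars by default)

If a single bar touches both barriers, the upper barrier is checked first and the entry is labeled Long.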

In [None]:
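def create_labels(df, horizon=20, k_up=1.5, k_down=1.0):
    """Create simplified triple-barrier labels (+1 long, -1 short, 0 neutral)."""
    df = df.copy()

    # Pull columns into NumPy arrays once; per-element .iloc in a loop is slow
    close = df['close'].to_numpy()
    high = df['high'].to_numpy()
    low = df['low'].to_numpy()
    atr = df['atr_14'].to_numpy()

    labels = np.zeros(len(df))
    n_labeled = max(len(df) - horizon, 0)

    for i in range(n_labeled):
        upper = close[i] + k_up * atr[i]
        lower = close[i] - k_down * atr[i]

        # Scan the next `horizon` bars for the first barrier touch
        for j in range(i + 1, i + horizon + 1):
            # The upper barrier is checked first, so a bar that touches
            # both barriers is labeled Long
            if high[j] >= upper:
                labels[i] = 1  # Long
                break
            if low[j] <= lower:
                labels[i] = -1  # Short
                break

    df['label'] = labels
    # The last `horizon` rows have no complete look-ahead window; drop them
    # instead of leaving them labeled Neutral by default
    return df.iloc[:n_labeled]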

df_labeled = create_labels(df_features, horizon=20)

# Check distribution
label_dist = df_labeled['label'].value_counts().sort_index()
print("Label Distribution:")
print(f"  Short (-1): {label_dist.get(-1, 0):,} ({label_dist.get(-1, 0)/len(df_labeled)*100:.1f}%)")
print(f"  Neutral (0): {label_dist.get(0, 0):,} ({label_dist.get(0, 0)/len(df_labeled)*100:.1f}%)")
print(f"  Long (+1): {label_dist.get(1, 0):,} ({label_dist.get(1, 0)/len(df_labeled)*100:.1f}%)")
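
To make the barrier logic concrete, the sketch below labels a single entry bar from hand-written future highs and lows. The helper name `label_one` and the toy numbers are illustrative assumptions; the rule itself mirrors the inner loop of `create_labels`, including the default `k_up=1.5` and `k_down=1.0`.

```python
def label_one(entry, atr, future_highs, future_lows, k_up=1.5, k_down=1.0):
    """Label one entry: +1 if the upper barrier is hit first, -1 if the lower, else 0."""
    upper = entry + k_up * atr
    lower = entry - k_down * atr
    for high, low in zip(future_highs, future_lows):
        if high >= upper:   # checked first, so ties within a bar go Long
            return 1
        if low <= lower:
            return -1
    return 0  # neither barrier touched within the horizon

# Entry at 100 with ATR 2 -> upper barrier 103, lower barrier 98
print(label_one(100, 2, [101, 102, 103.5], [99.5, 99, 101]))   # 1
print(label_one(100, 2, [101, 100.5, 100], [99, 97.5, 97]))    # -1
print(label_one(100, 2, [101, 102, 102.5], [99, 98.5, 99]))    # 0
```

Note that the barriers are asymmetric by default: the price has to rise 1.5 ATR for a Long label but fall only 1.0 ATR for a Short label, which tends to skew the label distribution toward Short.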

## Step 5: Train/Test Split

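Split the data chronologically into 70/15/15 train/validation/test sets. A purge gap of 60 rows is dropped on each side of every split boundary, so labels that look ahead 20 bars and features that look back up to 50 bars cannot leak information from one split into the next.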

In [None]:
# Define feature columns
exclude_cols = ['datetime', 'symbol', 'open', 'high', 'low', 'close', 'volume', 'label']
feature_cols = [c for c in df_labeled.columns if c not in exclude_cols]

print(f"Features: {len(feature_cols)}")
print(f"Sample features: {feature_cols[:10]}")

# Split data (70/15/15 with purge)
n = len(df_labeled)
train_end = int(n * 0.70)
val_end = int(n * 0.85)
purge = 60  # Purge period

train_df = df_labeled.iloc[:train_end - purge]
val_df = df_labeled.iloc[train_end + purge:val_end - purge]
test_df = df_labeled.iloc[val_end + purge:]

# Prepare arrays
X_train = train_df[feature_cols].values
y_train = train_df['label'].values.astype(int) + 1  # Convert to 0,1,2

X_val = val_df[feature_cols].values
y_val = val_df['label'].values.astype(int) + 1

X_test = test_df[feature_cols].values
y_test = test_df['label'].values.astype(int) + 1

print(f"\nSplit sizes:")
print(f"  Train: {len(X_train):,}")
print(f"  Val: {len(X_val):,}")
print(f"  Test: {len(X_test):,}")
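
The slicing arithmetic above can be wrapped in a small helper that also fails loudly when the purge gaps leave a split empty. This is a sketch: the name `purged_split_bounds` and its parameters are assumptions, while the default fractions and the 60-row purge match the cell above.

```python
def purged_split_bounds(n, train_frac=0.70, val_end_frac=0.85, purge=60):
    """Return (start, stop) row bounds for train/val/test with purge gaps."""
    train_end = int(n * train_frac)
    val_end = int(n * val_end_frac)
    bounds = {
        "train": (0, train_end - purge),
        "val": (train_end + purge, val_end - purge),
        "test": (val_end + purge, n),
    }
    for name, (start, stop) in bounds.items():
        if stop <= start:
            raise ValueError(f"{name} split is empty; reduce purge or add data")
    return bounds

bounds = purged_split_bounds(10_000)
print(bounds)
```

Each pair can be passed straight to `iloc`, for example `df_labeled.iloc[slice(*bounds["val"])]`. Because the purge is removed on both sides of a boundary, adjacent splits end up separated by `2 * purge` rows.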

## Step 6: Train XGBoost Model

Train an XGBoost classifier using the Model Factory.

In [None]:
from src.models import ModelRegistry
from tqdm.auto import tqdm
import time

# List available models
print("Available Models:")
for family, models in ModelRegistry.list_models().items():
    print(f"  {family}: {', '.join(models)}")

In [None]:
# Create XGBoost model with custom config
model_config = {
    'n_estimators': 200,
    'max_depth': 6,
    'learning_rate': 0.1,
    'early_stopping_rounds': 20,
    'eval_metric': 'mlogloss',
}

model = ModelRegistry.create('xgboost', config=model_config)
print(f"Created model: {model}")
print(f"Model family: {model.model_family}")
print(f"Requires scaling: {model.requires_scaling}")
print(f"Requires sequences: {model.requires_sequences}")

In [None]:
# Train the model
print("Training XGBoost model...")
start_time = time.time()

training_metrics = model.fit(
    X_train=X_train,
    y_train=y_train,
    X_val=X_val,
    y_val=y_val,
)

training_time = time.time() - start_time
print(f"\nTraining completed in {training_time:.1f}s")
print(f"Best epoch: {training_metrics.best_epoch}")
print(f"Train F1: {training_metrics.train_f1:.4f}")
print(f"Val F1: {training_metrics.val_f1:.4f}")
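
The `early_stopping_rounds` setting and the reported best epoch are easier to interpret with the stopping rule written out. The sketch below applies a patience rule to a list of validation losses: training stops once the loss has not improved for `patience` consecutive rounds, and the round with the lowest loss is reported. It is a generic illustration with made-up loss values, not the Model Factory's or XGBoost's internal code.

```python
def early_stop(val_losses, patience=20):
    """Return (best_round, stopped_round) under a simple patience rule."""
    best_round, best_loss = 0, float("inf")
    for rnd, loss in enumerate(val_losses):
        if loss < best_loss:
            best_round, best_loss = rnd, loss
        elif rnd - best_round >= patience:
            return best_round, rnd  # no improvement for `patience` rounds
    return best_round, len(val_losses) - 1  # ran out of rounds first

# Made-up validation losses: improve for 4 rounds, then drift upward
losses = [1.00, 0.90, 0.85, 0.83, 0.84, 0.86, 0.85, 0.87]
print(early_stop(losses, patience=3))  # (3, 6)
```

With `n_estimators=200` and a patience of 20, a best epoch well below 200 suggests the validation loss plateaued early, while a best epoch near the limit suggests the model might still benefit from more rounds.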

## Step 7: Evaluate on Test Set

In [None]:
from sklearn.metrics import classification_report, accuracy_score

# Get predictions
predictions = model.predict(X_test)

# Evaluation metrics
y_pred = predictions.class_predictions
y_proba = predictions.class_probabilities
confidence = predictions.confidence

print("Test Set Evaluation:")
print(f"  Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"  Mean Confidence: {confidence.mean():.4f}")
print()
print("Classification Report:")
print(classification_report(
    y_test, y_pred,
    labels=[0, 1, 2],  # keep all three rows even if a class is absent from the test set
    target_names=['Short', 'Neutral', 'Long'],
))

In [None]:
# Plot confusion matrix
plot_confusion_matrix(
    y_test, y_pred,
    labels=['Short', 'Neutral', 'Long'],
    title='XGBoost - Test Set Confusion Matrix'
)
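
`plot_confusion_matrix` is a repository helper, but the counts it visualizes can be computed directly. The sketch below builds the matrix with NumPy and derives per-class recall from it. It assumes the 0 = Short, 1 = Neutral, 2 = Long encoding used in this notebook and runs on toy labels rather than the model's predictions.

```python
import numpy as np

def confusion_counts(y_true, y_pred, n_classes=3):
    """Rows are true classes, columns are predicted classes."""
    matrix = np.zeros((n_classes, n_classes), dtype=int)
    # np.add.at accumulates correctly when the same (true, pred) pair repeats
    np.add.at(matrix, (np.asarray(y_true), np.asarray(y_pred)), 1)
    return matrix

# Toy labels using the 0=Short, 1=Neutral, 2=Long encoding
toy_true = [0, 0, 1, 1, 2, 2, 2]
toy_pred = [0, 1, 1, 1, 2, 0, 2]
cm = confusion_counts(toy_true, toy_pred)
print(cm)
# Per-class recall: correct predictions divided by true-class totals
print(cm.diagonal() / cm.sum(axis=1))
```

Off-diagonal cells in the corners (Short predicted as Long, or the reverse) are the costliest mistakes for a trading signal, so they deserve more attention than confusion with Neutral.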

## Step 8: Feature Importance

In [None]:
import matplotlib.pyplot as plt

# Get feature importance
importance = model.get_feature_importance()

if importance:
    # Sort by importance
    sorted_importance = dict(sorted(importance.items(), key=lambda x: x[1], reverse=True)[:15])
    
    # Plot
    fig, ax = plt.subplots(figsize=(10, 6))
    features = list(sorted_importance.keys())
    values = list(sorted_importance.values())
    
    ax.barh(features[::-1], values[::-1], color='steelblue')
    ax.set_xlabel('Importance')
    ax.set_title('Top 15 Feature Importances')
    plt.tight_layout()
    plt.show()
else:
    print("Feature importance not available")

## Step 9: Save Model

In [None]:
from pathlib import Path

# Save model
save_path = Path('models/xgboost_quickstart')
save_path.mkdir(parents=True, exist_ok=True)

model.save(save_path)
print(f"Model saved to {save_path}")

# List saved files
for f in save_path.iterdir():
    print(f"  {f.name}")

## Step 10: Load and Verify Model

In [None]:
# Create new model instance and load weights
loaded_model = ModelRegistry.create('xgboost')
loaded_model.load(save_path)

# Verify predictions match
loaded_predictions = loaded_model.predict(X_test[:100])
original_predictions = model.predict(X_test[:100])

predictions_match = np.allclose(
    loaded_predictions.class_probabilities,
    original_predictions.class_probabilities
)

print(f"Predictions match after load: {predictions_match}")

## Summary

In this quickstart, you learned:

1. **Environment Setup**: How to configure the notebook environment for GPU training
2. **Data Preparation**: Loading OHLCV data and computing technical features
3. **Labeling**: Creating trading labels using the triple-barrier method
4. **Model Training**: Training an XGBoost model using the Model Factory
5. **Evaluation**: Computing metrics and visualizing results
6. **Persistence**: Saving and loading trained models

**Next Steps:**
- Try `02_train_all_models.ipynb` to compare different model types
- Use `03_cross_validation.ipynb` for proper cross-validation
- Load your own OHLCV data for real trading predictions