# Opening Returns Prediction with IntradayMomentumLight

This notebook demonstrates how to use `IntradayMomentumLight` (which inherits from `CTALight`) to predict opening period returns using:

1. **Short-term daily momentum features** (1d, 5d, 10d, 20d) - properly lagged to avoid lookahead bias
2. **HAR-style realized volatility features** for 1-day ahead forecasts
3. **Opening range volatility measures** - first 60 minutes of the session

**Target**: Returns during the opening period (first 60 minutes of session)

**Model**: LightGBM regressor (via CTALight base class) with built-in `.fit()`, `.fit_with_grid_search()`, and `.evaluate()` methods

## Setup and Imports

In [None]:
from __future__ import annotations

import sys
from pathlib import Path
from datetime import time, timedelta
from typing import Dict

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Add parent directories to path
project_root = Path.cwd().parent.parent.parent
sys.path.insert(0, str(project_root))

from CTAFlow.models.intraday_momentum import IntradayMomentumLight
from CTAFlow.config import INTRADAY_DATA_PATH

# Set plotting style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print("✓ Imports successful")
print(f"✓ Data path: {INTRADAY_DATA_PATH}")

## Helper Functions

In [None]:
def calculate_opening_returns(
    intraday_df: pd.DataFrame,
    session_open: time = time(8, 30),
    opening_window: timedelta = timedelta(minutes=60),
    price_col: str = "Close",
) -> pd.Series:
    """Calculate daily returns during the opening period.
    
    Parameters
    ----------
    intraday_df : pd.DataFrame
        Intraday OHLCV data with DatetimeIndex
    session_open : time
        Session start time (default 8:30 AM)
    opening_window : timedelta
        Length of opening period (default 60 minutes)
    price_col : str
        Price column name
        
    Returns
    -------
    pd.Series
        Daily opening period returns (log returns from open to open+window)
    """
    if not isinstance(intraday_df.index, pd.DatetimeIndex):
        # Work on a copy so the caller's DataFrame is not modified
        intraday_df = intraday_df.copy()
        intraday_df.index = pd.to_datetime(intraday_df.index)
    
    # Create working dataframe
    work_df = pd.DataFrame({'price': intraday_df[price_col]})
    work_df['date'] = work_df.index.normalize()
    
    # Calculate session times
    session_open_offset = pd.Timedelta(hours=session_open.hour, minutes=session_open.minute)
    work_df['session_start'] = work_df['date'] + session_open_offset
    work_df['session_end'] = work_df['session_start'] + opening_window
    
    # Restrict to bars inside the opening window [session_start, session_end)
    window_mask = (work_df.index >= work_df['session_start']) & (work_df.index < work_df['session_end'])
    window_prices = work_df[window_mask].groupby('date')['price']
    
    # First and last price inside the window for each day
    opening_data = window_prices.first()
    closing_data = window_prices.last()
    
    # Calculate log returns; drop days without a valid return so the target has no NaNs
    opening_returns = np.log(closing_data / opening_data).dropna()
    
    return opening_returns


def prepare_daily_data(intraday_df: pd.DataFrame) -> pd.DataFrame:
    """Create daily OHLC data from intraday bars."""
    if not isinstance(intraday_df.index, pd.DatetimeIndex):
        # Work on a copy so the caller's DataFrame is not modified
        intraday_df = intraday_df.copy()
        intraday_df.index = pd.to_datetime(intraday_df.index)
    
    daily = intraday_df.resample('1D').agg({
        'Open': 'first',
        'High': 'max',
        'Low': 'min',
        'Close': 'last',
        'Volume': 'sum'
    }).dropna()
    
    return daily

print("✓ Helper functions defined")
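
As a quick sanity check of the opening-return definition, the sketch below builds two days of synthetic 15-minute bars (made-up prices, not market data) and computes the log return between the first and last bar inside the 08:30 to 09:30 window by hand. It follows the same idea as `calculate_opening_returns` but is written independently; the names `bars`, `window`, and `opening_ret` are illustrative only.

```python
import numpy as np
import pandas as pd

# Two synthetic sessions of 15-minute bars starting at 08:30 (made-up prices).
idx = pd.DatetimeIndex([
    "2024-01-02 08:30", "2024-01-02 08:45", "2024-01-02 09:00",
    "2024-01-02 09:15", "2024-01-02 09:30",
    "2024-01-03 08:30", "2024-01-03 08:45", "2024-01-03 09:00",
    "2024-01-03 09:15", "2024-01-03 09:30",
])
bars = pd.DataFrame(
    {"Close": [100.0, 100.5, 101.0, 102.0, 103.0,
               200.0, 199.0, 198.0, 196.0, 190.0]},
    index=idx,
)

# Keep bars with 08:30 <= time < 09:30, i.e. the first 60 minutes.
minutes = bars.index.hour * 60 + bars.index.minute
window = bars[(minutes >= 8 * 60 + 30) & (minutes < 9 * 60 + 30)]

# Log return from the first to the last bar of the window, per calendar day.
grouped = window.groupby(window.index.normalize())["Close"]
opening_ret = np.log(grouped.last() / grouped.first())
print(opening_ret)
```

The 09:30 bar is excluded because the window is half-open, which matches the `< session_end` comparison in the helper.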

## Load Data

Load intraday data for a single ticker from CSV, using the same file path structure as `gather_tickers()`.

In [None]:
ticker = "CL"  # Crude Oil futures

# Load intraday data from CSV
csv_path = INTRADAY_DATA_PATH / f"{ticker}_intraday.csv"
print(f"Loading {ticker} from {csv_path}")

intraday_data = pd.read_csv(csv_path, parse_dates=['timestamp'])
intraday_data.set_index('timestamp', inplace=True)
intraday_data.sort_index(inplace=True)

print(f"\n✓ Loaded {len(intraday_data):,} bars")
print(f"✓ Date range: {intraday_data.index[0].date()} to {intraday_data.index[-1].date()}")
print("\nData preview:")
intraday_data.head()

## Initialize IntradayMomentumLight Model

`IntradayMomentumLight` inherits from `CTALight`, which provides built-in LightGBM functionality:
- `.fit()` - Train the model with early stopping
- `.fit_with_grid_search()` - Hyperparameter tuning with cross-validation
- `.evaluate()` - Compute metrics on test set
- `.predict()` - Generate predictions
- `.get_feature_importance()` - Extract feature importances

In [None]:
model = IntradayMomentumLight(
    intraday_data=intraday_data,
    session_open=time(8, 30),
    session_end=time(15, 0),
    closing_length=timedelta(minutes=60),
    tz="America/Chicago",
    price_col="Close"
)

print("✓ Model initialized (inherits LightGBM functionality from CTALight)")

## Prepare Daily Data

In [None]:
daily_df = prepare_daily_data(intraday_data)
print(f"✓ Daily data: {len(daily_df)} days")
print("\nDaily OHLCV preview:")
daily_df.tail()

## Build Feature Set

We'll use three types of features:
1. **Daily Momentum** - Short-term price trends (1d, 5d, 10d, 20d)
2. **HAR Volatility** - Multi-horizon realized volatility (1d, 5d, 22d)
3. **Opening Range Volatility** - First 60 minutes volatility measures
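
For readers unfamiliar with the HAR setup, here is a minimal sketch of how such features could be built: daily realized variance is the sum of squared intraday log returns, and the HAR regressors are its rolling means over 1, 5, and 22 days, shifted by one day so each row only uses past sessions. This is an illustration on synthetic data under those assumptions; the function name `har_sketch` and the `rv_*` column names are hypothetical, and `har_volatility_features` may be implemented differently.

```python
import numpy as np
import pandas as pd


def har_sketch(prices: pd.Series, horizons=(1, 5, 22)) -> pd.DataFrame:
    """Hypothetical HAR-style features: lagged rolling means of daily realized variance."""
    day = prices.index.normalize()
    # Intraday log returns computed within each day, so overnight gaps are excluded.
    log_ret = np.log(prices).groupby(day).diff()
    # Daily realized variance = sum of squared intraday log returns.
    rv = (log_ret ** 2).groupby(day).sum()
    # Average over each horizon, then shift one day so row t only uses days before t.
    return pd.DataFrame({f"rv_{h}d": rv.rolling(h).mean().shift(1) for h in horizons})


# Synthetic random-walk prices: 30 business days with 8 bars each (not market data).
rng = np.random.default_rng(0)
days = pd.bdate_range("2024-01-01", periods=30)
stamps = pd.DatetimeIndex(
    [d + pd.Timedelta(hours=8, minutes=30 + 15 * k) for d in days for k in range(8)]
)
prices = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 1e-3, len(stamps)))), index=stamps)

feats = har_sketch(prices)
print(feats.dropna().head())
```

Because of the one-day shift, the 22-day column only becomes available on the 23rd session, which is why the combined feature matrix loses its first rows after `.dropna()`.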

In [None]:
# Initialize training_data
model.training_data = pd.DataFrame(index=daily_df.index)

# 1. Daily momentum features (lagged by 1 day)
print("Adding daily momentum features...")
momentum_feats = model.add_daily_momentum_features(
    daily_df,
    lookbacks=(1, 5, 10, 20)
)
print(f"  Features: {list(momentum_feats.columns)}")

# 2. HAR volatility features
print("\nAdding HAR volatility features...")
har_feats = model.har_volatility_features(
    intraday_df=intraday_data,
    horizons=(1, 5, 22)
)
print(f"  Features: {list(har_feats.columns)}")

# 3. Opening range volatility
print("\nAdding opening range volatility...")
opening_vol = model.opening_range_volatility(
    intraday_df=intraday_data,
    period_length=timedelta(minutes=60)
)
print(f"  Features: {list(opening_vol.columns)}")

# Combine all features
model.training_data = pd.concat([momentum_feats, har_feats, opening_vol], axis=1).dropna()
print(f"\n✓ Combined features shape: {model.training_data.shape}")
print(f"✓ Feature names tracked: {model.feature_names}")

In [None]:
# Preview the feature matrix
print("Feature matrix preview:")
model.training_data.head(10)
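
The opening-range window used above (60 minutes) coincides with the target window, so it is worth confirming that the opening-range columns are lagged before training; this notebook only states that explicitly for the momentum features. The sketch below shows, on made-up numbers, how a same-day feature can be shifted by one session so that each row only sees the previous day's value. The column name `or_vol` is hypothetical, and no claim is made about what `opening_range_volatility` does internally.

```python
import pandas as pd

days = pd.bdate_range("2024-01-01", periods=5)
# Hypothetical same-day opening-range volatility (made-up numbers).
same_day = pd.DataFrame({"or_vol": [0.010, 0.012, 0.008, 0.015, 0.011]}, index=days)

# Shift by one row so the value on day t comes from day t-1.
lagged = same_day.shift(1).add_suffix("_lag1")

# Side-by-side view: the lagged column trails the original by one session.
check = pd.concat([same_day, lagged], axis=1)
print(check)
```

If the library already lags these columns, an extra shift would double-lag them, so inspecting a few rows against the raw intraday data is the safer first step.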

## Calculate Target Variable

The target is the log return over the opening period (the first 60 minutes of the session).

In [None]:
target = calculate_opening_returns(intraday_data)

print(f"✓ Target shape: {len(target)}")
print(f"✓ Target mean: {target.mean():.6f}")
print(f"✓ Target std: {target.std():.6f}")
print(f"✓ Target min: {target.min():.6f}")
print(f"✓ Target max: {target.max():.6f}")

# Plot target distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Histogram
axes[0].hist(target, bins=50, alpha=0.7, edgecolor='black')
axes[0].axvline(target.mean(), color='red', linestyle='--', label=f'Mean: {target.mean():.4f}')
axes[0].set_xlabel('Opening Returns (log)')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Distribution of Opening Period Returns')
axes[0].legend()

# Time series
axes[1].plot(target.index, target, alpha=0.6, linewidth=0.8)
axes[1].axhline(0, color='black', linestyle='-', linewidth=0.5)
axes[1].set_xlabel('Date')
axes[1].set_ylabel('Opening Returns (log)')
axes[1].set_title('Opening Period Returns Over Time')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## Align Features and Target

In [None]:
common_index = model.training_data.index.intersection(target.index)
X = model.training_data.loc[common_index]
y = target.loc[common_index]

print(f"✓ Final dataset: {len(X)} samples, {X.shape[1]} features")
print(f"✓ Date range: {X.index[0].date()} to {X.index[-1].date()}")

## Train/Test Split (Temporal)

A chronological 80/20 split keeps the test set strictly after the training data, which avoids lookahead bias.

In [None]:
test_size = 0.2
split_idx = int(len(X) * (1 - test_size))

X_train, X_test = X.iloc[:split_idx], X.iloc[split_idx:]
y_train, y_test = y.iloc[:split_idx], y.iloc[split_idx:]

print(f"Train: {len(X_train)} samples ({X_train.index[0].date()} to {X_train.index[-1].date()})")
print(f"Test:  {len(X_test)} samples ({X_test.index[0].date()} to {X_test.index[-1].date()})")

# Create validation set for early stopping (from training set)
val_size = int(len(X_train) * 0.2)
X_val = X_train.iloc[-val_size:]
y_val = y_train.iloc[-val_size:]
X_train_fit = X_train.iloc[:-val_size]
y_train_fit = y_train.iloc[:-val_size]

print("\nFor training:")
print(f"  Training: {len(X_train_fit)} samples")
print(f"  Validation: {len(X_val)} samples (for early stopping)")
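
The slicing above can be wrapped in a small reusable helper. The sketch below is one way to do it, assuming rows are already sorted by date; the function name `chrono_split` is hypothetical and not part of `CTALight`. It uses the same arithmetic as the cell above and adds a guard against empty blocks, since `iloc[-0:]` would silently return the whole frame.

```python
import numpy as np
import pandas as pd


def chrono_split(X: pd.DataFrame, y: pd.Series, test_frac=0.2, val_frac=0.2):
    """Split rows in time order into train, validation, and test blocks."""
    split_idx = int(len(X) * (1 - test_frac))  # boundary between train+val and test
    n_val = int(split_idx * val_frac)          # validation rows taken from the end of train
    n_train = split_idx - n_val
    if n_train < 1 or n_val < 1 or split_idx >= len(X):
        raise ValueError("Not enough rows for a three-way split")
    parts = [slice(0, n_train), slice(n_train, split_idx), slice(split_idx, None)]
    return [(X.iloc[s], y.iloc[s]) for s in parts]


# Synthetic demo data: 100 business days with a single dummy feature.
idx = pd.bdate_range("2024-01-01", periods=100)
X_demo = pd.DataFrame({"f": np.arange(100.0)}, index=idx)
y_demo = pd.Series(np.arange(100.0), index=idx)

(X_tr, y_tr), (X_va, y_va), (X_te, y_te) = chrono_split(X_demo, y_demo)
print(len(X_tr), len(X_va), len(X_te))
```

Each block ends before the next one starts, so the validation and test sets never contain dates that precede the data used for fitting.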

## Train LightGBM Model

Train the model with the built-in `.fit()` method inherited from the `CTALight` base class.

In [None]:
print("Training LightGBM model (via CTALight.fit())...\n")

model.fit(
    X_train_fit,
    y_train_fit,
    eval_set=(X_val, y_val),
    early_stopping_rounds=50,
    num_boost_round=1000
)

print("\n✓ Model trained successfully!")

## Evaluate on Test Set

In [None]:
test_metrics = model.evaluate(X_test, y_test)

print("Test Set Metrics:")
print("=" * 50)
print(f"MSE:                    {test_metrics['mse']:.6f}")
print(f"RMSE:                   {test_metrics['rmse']:.6f}")
print(f"MAE:                    {test_metrics['mae']:.6f}")
print(f"R²:                     {test_metrics['r2']:.4f}")
print(f"Directional Accuracy:   {test_metrics['directional_accuracy']:.2%}")
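
The metric keys above come from `.evaluate()`, whose exact definitions are not shown in this notebook. As a reference point, the sketch below computes the same five quantities with NumPy under common textbook definitions. Directional accuracy is taken here to be the share of observations where prediction and actual have the same sign; that is an assumption and may differ from the library's convention, for example in how zeros are treated. The function name `regression_metrics` is hypothetical.

```python
import numpy as np


def regression_metrics(y_true, y_pred) -> dict:
    """Textbook regression metrics; a sketch, not CTALight's implementation."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mse = float(np.mean(err ** 2))
    ss_res = float(np.sum(err ** 2))                       # residual sum of squares
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))  # total sum of squares
    return {
        "mse": mse,
        "rmse": mse ** 0.5,
        "mae": float(np.mean(np.abs(err))),
        "r2": 1.0 - ss_res / ss_tot,
        # Share of observations where the predicted and actual signs agree.
        "directional_accuracy": float(np.mean(np.sign(y_true) == np.sign(y_pred))),
    }


# Made-up returns and predictions, for illustration only.
demo = regression_metrics([0.01, -0.02, 0.03, -0.01], [0.02, -0.01, -0.01, -0.02])
print(demo)
```

In this toy example three of the four signs agree, yet R² is negative: directional accuracy and R² measure different things, so both are worth reporting for return forecasts.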

## Predictions vs Actuals

In [None]:
# Generate predictions
y_pred = model.predict(X_test)

fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# 1. Scatter plot
axes[0, 0].scatter(y_test, y_pred, alpha=0.5, s=20)
axes[0, 0].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 
                'r--', linewidth=2, label='Perfect Prediction')
axes[0, 0].set_xlabel('Actual Returns')
axes[0, 0].set_ylabel('Predicted Returns')
axes[0, 0].set_title(f'Predictions vs Actuals (R² = {test_metrics["r2"]:.4f})')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# 2. Time series
axes[0, 1].plot(y_test.index, y_test.values, label='Actual', alpha=0.7, linewidth=1)
axes[0, 1].plot(y_test.index, y_pred, label='Predicted', alpha=0.7, linewidth=1)
axes[0, 1].set_xlabel('Date')
axes[0, 1].set_ylabel('Returns')
axes[0, 1].set_title('Actual vs Predicted Returns Over Time')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)

# 3. Residuals
residuals = y_test - y_pred
axes[1, 0].scatter(y_pred, residuals, alpha=0.5, s=20)
axes[1, 0].axhline(0, color='red', linestyle='--', linewidth=2)
axes[1, 0].set_xlabel('Predicted Returns')
axes[1, 0].set_ylabel('Residuals')
axes[1, 0].set_title('Residual Plot')
axes[1, 0].grid(True, alpha=0.3)

# 4. Residuals distribution
axes[1, 1].hist(residuals, bins=50, alpha=0.7, edgecolor='black')
axes[1, 1].axvline(0, color='red', linestyle='--', linewidth=2, label='Zero')
axes[1, 1].axvline(residuals.mean(), color='blue', linestyle='--', linewidth=2, 
                   label=f'Mean: {residuals.mean():.4f}')
axes[1, 1].set_xlabel('Residuals')
axes[1, 1].set_ylabel('Frequency')
axes[1, 1].set_title('Distribution of Residuals')
axes[1, 1].legend()

plt.tight_layout()
plt.show()

## Feature Importance Analysis

In [None]:
# Get top features
top_features = model.get_feature_importance(importance_type='gain', top_n=10)

print("Top 10 Features by Gain:")
print("=" * 50)
for feat, importance in top_features.items():
    print(f"{feat:25s}: {importance:.1f}")

# Plot feature importance
fig, ax = plt.subplots(figsize=(10, 6))
features = list(top_features.keys())
importances = list(top_features.values())

ax.barh(features, importances, color='steelblue', alpha=0.8)
ax.invert_yaxis()  # Show the first (most important) feature at the top
ax.set_xlabel('Importance (Gain)')
ax.set_title('Top 10 Feature Importances')
ax.grid(True, alpha=0.3, axis='x')
plt.tight_layout()
plt.show()

## Grid Search for Hyperparameter Tuning

This section uses the `.fit_with_grid_search()` method to tune hyperparameters.

In [None]:
print("Running grid search for hyperparameter tuning...\n")

# Reinitialize model for grid search
model_gs = IntradayMomentumLight(
    intraday_data=intraday_data,
    session_open=time(8, 30),
    session_end=time(15, 0),
    closing_length=timedelta(minutes=60),  # Same settings as the baseline model
    tz="America/Chicago",
    price_col="Close"
)

# Rebuild features (same as before)
model_gs.training_data = pd.DataFrame(index=daily_df.index)
momentum_feats_gs = model_gs.add_daily_momentum_features(daily_df, lookbacks=(1, 5, 10, 20))
har_feats_gs = model_gs.har_volatility_features(intraday_df=intraday_data, horizons=(1, 5, 22))
opening_vol_gs = model_gs.opening_range_volatility(intraday_df=intraday_data, period_length=timedelta(minutes=60))
model_gs.training_data = pd.concat([momentum_feats_gs, har_feats_gs, opening_vol_gs], axis=1).dropna()

# Use same target
common_index_gs = model_gs.training_data.index.intersection(target.index)
X_gs = model_gs.training_data.loc[common_index_gs]
y_gs = target.loc[common_index_gs]

# Split (80/20 for grid search)
split_idx = int(len(X_gs) * 0.8)
X_train_gs, X_test_gs = X_gs.iloc[:split_idx], X_gs.iloc[split_idx:]
y_train_gs, y_test_gs = y_gs.iloc[:split_idx], y_gs.iloc[split_idx:]
val_size = int(len(X_train_gs) * 0.2)
X_val_gs = X_train_gs.iloc[-val_size:]
y_val_gs = y_train_gs.iloc[-val_size:]
X_train_fit_gs = X_train_gs.iloc[:-val_size]
y_train_fit_gs = y_train_gs.iloc[:-val_size]

# Define parameter grid
param_grid = {
    'num_leaves': [31, 63],
    'learning_rate': [0.03, 0.07],
    'feature_fraction': [0.7, 0.9]
}

print(f"Parameter grid: {param_grid}")
print(f"Total combinations: {np.prod([len(v) for v in param_grid.values()])}\n")

# Run grid search
grid_results = model_gs.fit_with_grid_search(
    X_train_fit_gs,
    y_train_fit_gs,
    param_grid=param_grid,
    eval_set=(X_val_gs, y_val_gs),
    cv_folds=3,
    scoring='neg_mean_squared_error',
    verbose=True
)

print("\n✓ Grid search complete!")
print(f"\nBest parameters: {grid_results['best_params']}")
print(f"Best CV score: {grid_results['best_score']:.6f}")
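
How `cv_folds=3` partitions the data is not shown in this notebook. For time-ordered samples, a common choice is expanding-window (forward-chaining) validation, where each fold trains on an initial block and validates on the block that follows it. The sketch below generates such index splits with NumPy; the function `expanding_folds` is hypothetical and is not part of `CTALight`.

```python
import numpy as np


def expanding_folds(n_samples: int, n_folds: int = 3):
    """Yield (train_idx, val_idx) pairs where validation always follows training."""
    if n_folds < 1 or n_samples < n_folds + 1:
        raise ValueError("Need at least n_folds + 1 samples")
    block = n_samples // (n_folds + 1)
    for k in range(1, n_folds + 1):
        train_end = block * k
        # The final fold absorbs any remainder rows.
        val_end = n_samples if k == n_folds else block * (k + 1)
        yield np.arange(train_end), np.arange(train_end, val_end)


# Ten samples, three folds: the training block grows while validation moves forward.
for train_idx, val_idx in expanding_folds(10, n_folds=3):
    print(len(train_idx), val_idx.tolist())
```

If the library's cross-validation shuffles rows instead, the CV score can look better than the out-of-sample result, so checking which scheme is used is worthwhile before comparing scores.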

In [None]:
# Evaluate grid search model on test set
gs_metrics = model_gs.evaluate(X_test_gs, y_test_gs)

print("Grid Search Model - Test Set Metrics:")
print("=" * 50)
print(f"MSE:                    {gs_metrics['mse']:.6f}")
print(f"RMSE:                   {gs_metrics['rmse']:.6f}")
print(f"MAE:                    {gs_metrics['mae']:.6f}")
print(f"R²:                     {gs_metrics['r2']:.4f}")
print(f"Directional Accuracy:   {gs_metrics['directional_accuracy']:.2%}")

# Compare with baseline model
print("\nComparison with Baseline:")
print("=" * 50)
print(f"Baseline R²:      {test_metrics['r2']:.4f}")
print(f"Grid Search R²:   {gs_metrics['r2']:.4f}")
print(f"Improvement:      {(gs_metrics['r2'] - test_metrics['r2']):.4f}")

## Summary

This notebook demonstrated:

1. ✓ Loading single ticker intraday data
2. ✓ Building features with `IntradayMomentumLight` methods:
   - Daily momentum (lagged)
   - HAR volatility
   - Opening range volatility
3. ✓ Training LightGBM model using `.fit()` from `CTALight`
4. ✓ Evaluating model performance with built-in metrics
5. ✓ Visualizing predictions and feature importance
6. ✓ Hyperparameter tuning with `.fit_with_grid_search()`

**Key Takeaways:**
- `IntradayMomentumLight` inherits LightGBM functionality from `CTALight`
- Features are tracked automatically via the `._add_feature()` method
- Proper lagging prevents lookahead bias
- How predictable opening returns are from momentum and volatility features should be read off the test-set metrics (R², directional accuracy) rather than assumed