# ML Signal Discovery

This notebook uses machine learning to discover trading signals within confirmed market regimes.

**Philosophy:**
- Regime detection is rule-based (from notebook 02)
- ML finds *when* to trade within each regime
- The model learns patterns that precede profitable moves

**Approach:**
1. Load data with auto-generated regime labels
2. Create target labels based on future price movement
3. Train ML to predict good entry points
4. Evaluate signal quality

## 1. Setup

In [None]:
import sys
import pandas as pd
import numpy as np
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.express as px
from pathlib import Path
from datetime import datetime
import json
import joblib
import warnings
warnings.filterwarnings('ignore')

from sklearn.model_selection import train_test_split, cross_val_score, TimeSeriesSplit
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import classification_report, confusion_matrix, precision_recall_curve
from sklearn.preprocessing import StandardScaler
import xgboost as xgb

# Add parent to path
sys.path.insert(0, str(Path('..').resolve()))
from src.config import FEATURES_DIR, MODELS_DIR, LABELS_DIR

print("Setup complete!")

## 2. Load Labeled Data

In [None]:
SYMBOL = 'BTCUSDT'
INTERVAL = '1h'

# Try to load labeled features first, fall back to regular features
labeled_path = FEATURES_DIR / f'{SYMBOL}_{INTERVAL}_features_labeled.parquet'
regular_path = FEATURES_DIR / f'{SYMBOL}_{INTERVAL}_features.parquet'

if labeled_path.exists():
    print(f"Loading labeled features from: {labeled_path}")
    df = pd.read_parquet(labeled_path)
    has_regime = 'regime' in df.columns
else:
    print(f"Labeled features not found. Loading regular features from: {regular_path}")
    print("Note: Run notebook 02 to generate regime labels first!")
    df = pd.read_parquet(regular_path)
    has_regime = False

df['open_time'] = pd.to_datetime(df['open_time'])
df = df.sort_values('open_time').reset_index(drop=True)

print(f"\nLoaded {len(df):,} rows")
print(f"Date range: {df['open_time'].min()} to {df['open_time'].max()}")
print(f"Has regime labels: {has_regime}")

if has_regime:
    print(f"\nRegime distribution:")
    regime_map = {0: 'Ranging', 1: 'Trending Up', 2: 'Trending Down'}
    for r, name in regime_map.items():
        count = (df['regime'] == r).sum()
        print(f"  {name}: {count:,} ({count/len(df)*100:.1f}%)")

## 3. Create Target Labels

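We create forward-looking labels from actual price movement over the next N bars (`LOOKAHEAD_BARS`), measured from the close of the entry bar:
- **Good Long Entry**: The high reaches at least +X% (`PROFIT_TARGET_PCT`) and the low never falls Y% (`STOP_LOSS_PCT`) or more below the entry close
- **Good Short Entry**: The low reaches at least -X% and the high never rises Y% or more above the entry close
- **No Trade**: Neither condition met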
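
As a concrete illustration of the rule, the sketch below labels a single entry bar from a short list of future highs and lows. The function name `label_entry` and the toy prices are made up for this example; the thresholds mirror the defaults used later in this notebook.

```python
def label_entry(close, future_highs, future_lows, profit_pct=1.5, stop_pct=1.0):
    """Return 1 (long), 2 (short) or 0 (no trade) for one entry bar."""
    up = (max(future_highs) - close) / close * 100    # largest move up, in %
    down = (close - min(future_lows)) / close * 100   # largest move down, in %
    if up >= profit_pct and down < stop_pct:
        return 1
    if down >= profit_pct and up < stop_pct:
        return 2
    return 0

# Entry at 100: the high reaches 101.6 (+1.6%), the low only dips to 99.5 (-0.5%)
print(label_entry(100.0, [100.8, 101.6], [99.5, 100.2]))  # 1
# Entry at 100: the low reaches 98.2 (-1.8%), the high only reaches 100.4 (+0.4%)
print(label_entry(100.0, [100.4, 100.2], [99.0, 98.2]))   # 2
# Both a +2% high and a -1.5% low occur, so neither entry qualifies
print(label_entry(100.0, [102.0], [98.5]))                # 0
```

Note that the rule only looks at the extremes of the window, so it does not know whether the profit target or the stop level was reached first.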

In [None]:
# Target parameters
LOOKAHEAD_BARS = 24  # How far ahead to look (24 hours for 1h data)
PROFIT_TARGET_PCT = 1.5  # Minimum % gain to be considered "good entry"
STOP_LOSS_PCT = 1.0  # Maximum % loss before exit (for risk-adjusted signals)

print(f"Target Configuration:")
print(f"  Lookahead: {LOOKAHEAD_BARS} bars")
print(f"  Profit Target: {PROFIT_TARGET_PCT}%")
print(f"  Stop Loss: {STOP_LOSS_PCT}%")

In [None]:
def create_signal_targets(df, lookahead=24, profit_pct=1.5, stop_pct=1.0):
    """
    Create target labels based on future price movement.
    
    Returns:
        DataFrame with added columns:
        - future_high / future_low: Highest high / lowest low over the next `lookahead` bars
        - future_return_up / future_return_down: Max favorable move (%) for a long / short
        - future_drawdown_long / future_drawdown_short: Max adverse move (%) for a long / short
        - signal_long: 1 if good long entry, 0 otherwise
        - signal_short: 1 if good short entry, 0 otherwise
        - signal: Combined (0=none, 1=long, 2=short)
    """
    df = df.copy()
    n = len(df)
    
    # Calculate future highs/lows
    future_high = np.full(n, np.nan)
    future_low = np.full(n, np.nan)
    
    for i in range(n - lookahead):
        future_slice = df.iloc[i+1:i+lookahead+1]
        future_high[i] = future_slice['high'].max()
        future_low[i] = future_slice['low'].min()
    
    df['future_high'] = future_high
    df['future_low'] = future_low
    
    # Calculate potential returns
    df['future_return_up'] = (df['future_high'] - df['close']) / df['close'] * 100
    df['future_return_down'] = (df['close'] - df['future_low']) / df['close'] * 100
    df['future_drawdown_long'] = (df['close'] - df['future_low']) / df['close'] * 100
    df['future_drawdown_short'] = (df['future_high'] - df['close']) / df['close'] * 100
    
    # Define good entries
    # Long: price reaches +profit_pct and the worst dip in the window stays
    # under stop_pct (the order of the two events is not checked)
    df['signal_long'] = (
        (df['future_return_up'] >= profit_pct) & 
        (df['future_drawdown_long'] < stop_pct)
    ).astype(int)
    
    # Short: price reaches -profit_pct and the worst rise in the window stays
    # under stop_pct (the order of the two events is not checked)
    df['signal_short'] = (
        (df['future_return_down'] >= profit_pct) & 
        (df['future_drawdown_short'] < stop_pct)
    ).astype(int)
    
    # Combined signal. Long and short can both be set only when
    # profit_pct < stop_pct; in that case short wins by default.
    df['signal'] = 0  # No trade
    df.loc[df['signal_long'] == 1, 'signal'] = 1  # Long
    df.loc[df['signal_short'] == 1, 'signal'] = 2  # Short
    # If both are set and regime labels exist, let the trend decide
    if 'regime' in df.columns:
        both = (df['signal_long'] == 1) & (df['signal_short'] == 1)
        df.loc[both & (df['regime'] == 1), 'signal'] = 1  # Uptrend -> prefer long
        df.loc[both & (df['regime'] == 2), 'signal'] = 2  # Downtrend -> prefer short
    
    return df

# Create targets
df = create_signal_targets(df, LOOKAHEAD_BARS, PROFIT_TARGET_PCT, STOP_LOSS_PCT)

print(f"\nSignal Distribution:")
signal_map = {0: 'No Trade', 1: 'Long', 2: 'Short'}
for s, name in signal_map.items():
    count = (df['signal'] == s).sum()
    print(f"  {name}: {count:,} ({count/len(df)*100:.1f}%)")

# Check by regime if available
if has_regime:
    print("\nSignals by Regime:")
    for r, rname in {0: 'Ranging', 1: 'Trending Up', 2: 'Trending Down'}.items():
        regime_df = df[df['regime'] == r]
        if len(regime_df) > 0:
            long_pct = (regime_df['signal'] == 1).mean() * 100
            short_pct = (regime_df['signal'] == 2).mean() * 100
            print(f"  {rname}: Long={long_pct:.1f}%, Short={short_pct:.1f}%")
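
The explicit loop in `create_signal_targets` can be slow on long histories. A possible vectorized alternative, sketched here under the assumption that `high` and `low` contain no missing values, uses a rolling window shifted back by the lookahead. The helper name `future_extremes` and the toy series are made up for illustration.

```python
import numpy as np
import pandas as pd

def future_extremes(high, low, lookahead):
    """Max high / min low over the next `lookahead` bars, excluding the current bar."""
    # rolling(lookahead) at position j covers bars j-lookahead+1 .. j;
    # shifting by -lookahead moves that value to bar i = j - lookahead,
    # so bar i sees bars i+1 .. i+lookahead.
    future_high = high.rolling(lookahead).max().shift(-lookahead)
    future_low = low.rolling(lookahead).min().shift(-lookahead)
    return future_high, future_low

# Toy data (made up) to compare against the explicit loop
rng = np.random.default_rng(0)
close = pd.Series(100 + rng.normal(0, 1, 50).cumsum())
high, low = close + 0.5, close - 0.5
fh, fl = future_extremes(high, low, lookahead=5)

expected_high = [high.iloc[i + 1:i + 6].max() for i in range(45)]
expected_low = [low.iloc[i + 1:i + 6].min() for i in range(45)]
print(np.allclose(fh.iloc[:45], expected_high), np.allclose(fl.iloc[:45], expected_low))
print(fh.iloc[45:].isna().all())  # the last `lookahead` bars have no complete window
```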

## 4. Feature Selection

In [None]:
# Columns to exclude from features
exclude_cols = [
    # Metadata
    'open_time', 'close_time', 'timestamp',
    # Raw OHLCV (we keep derived features)
    'open', 'high', 'low', 'close', 'volume',
    'quote_volume', 'trades', 'taker_buy_base', 'taker_buy_quote', 'ignore',
    # Target columns (would be cheating!)
    'future_high', 'future_low', 'future_return_up', 'future_return_down',
    'future_drawdown_long', 'future_drawdown_short',
    'signal_long', 'signal_short', 'signal',
    # Regime (use as filter, not feature)
    'regime', 'regime_name', 'raw_regime'
]

# Get feature columns
feature_cols = [col for col in df.columns if col not in exclude_cols]
print(f"Total features: {len(feature_cols)}")

# Group features by type
ma_features = [f for f in feature_cols if f.startswith('ma') and '_' in f]
spread_features = [f for f in feature_cols if 'spread' in f]
slope_features = [f for f in feature_cols if 'slope' in f]
other_features = [f for f in feature_cols if f not in ma_features + spread_features + slope_features]

print(f"\nFeature breakdown:")
print(f"  MA values: {len(ma_features)}")
print(f"  Spread features: {len(spread_features)}")
print(f"  Slope features: {len(slope_features)}")
print(f"  Other: {len(other_features)}")

## 5. Prepare Training Data

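We split the data chronologically (no shuffling) so the model is never trained on bars that come after the test period. Note that each label looks ahead `LOOKAHEAD_BARS` bars, so training rows right before the split boundary have labels computed from test-period prices unless they are dropped.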
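
For walk-forward evaluation with more than one fold, scikit-learn's `TimeSeriesSplit` accepts a `gap` argument that leaves a buffer between each training and test block. The sketch below uses a placeholder array of 1,000 rows (an assumption for illustration) and sets the gap to the label horizon.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

LOOKAHEAD_BARS = 24          # same horizon as the labels
n_samples = 1000             # placeholder size for illustration
X_demo = np.zeros((n_samples, 1))

tscv = TimeSeriesSplit(n_splits=5, gap=LOOKAHEAD_BARS)
folds = list(tscv.split(X_demo))
for train_idx, test_idx in folds:
    # The last training row sits more than LOOKAHEAD_BARS rows before the
    # first test row, so its label window ends before the test block starts.
    print(train_idx[-1], test_idx[0], test_idx[0] - train_idx[-1])
```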

In [None]:
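# Remove rows with NaN in features or target.
# 'signal' itself is never NaN (comparisons against NaN yield 0), so also
# require the future_* columns: this drops the last LOOKAHEAD_BARS rows, whose
# look-ahead window is incomplete and which would otherwise count as "No Trade".
target_cols = ['future_high', 'future_low', 'signal']
df_clean = df.dropna(subset=feature_cols + target_cols).copy()
print(f"Clean samples: {len(df_clean):,} (removed {len(df) - len(df_clean):,} with NaN)")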

# Prepare X and y
X = df_clean[feature_cols].values
y = df_clean['signal'].values

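# Time-series split (80% train, 20% test - no shuffle!)
split_idx = int(len(X) * 0.8)
# Labels look LOOKAHEAD_BARS ahead, so drop the last LOOKAHEAD_BARS training
# rows: their labels would be computed from test-period prices.
train_end = max(split_idx - LOOKAHEAD_BARS, 0)
X_train, X_test = X[:train_end], X[split_idx:]
y_train, y_test = y[:train_end], y[split_idx:]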

print(f"\nTrain set: {len(X_train):,} samples")
print(f"Test set: {len(X_test):,} samples")

print(f"\nTrain distribution:")
for s, name in signal_map.items():
    count = (y_train == s).sum()
    print(f"  {name}: {count:,} ({count/len(y_train)*100:.1f}%)")

print(f"\nTest distribution:")
for s, name in signal_map.items():
    count = (y_test == s).sum()
    print(f"  {name}: {count:,} ({count/len(y_test)*100:.1f}%)")

In [None]:
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Features scaled.")

## 6. Train Signal Detection Model

In [None]:
# Train XGBoost
print("Training XGBoost classifier...")

model = xgb.XGBClassifier(
    n_estimators=200,
    max_depth=6,
    learning_rate=0.1,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42,
    eval_metric='mlogloss',
    n_jobs=-1
)

model.fit(X_train_scaled, y_train)
print("Training complete!")
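
If the signal distribution printed in section 3 turns out to be heavily skewed toward "No Trade", one option is to pass per-sample weights to `fit`. The sketch below uses scikit-learn's `compute_sample_weight` on a made-up label vector; whether reweighting actually helps here is an open question to check on the test set.

```python
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

# Toy label vector (made up): 8 no-trade, 1 long, 1 short
y_demo = np.array([0] * 8 + [1, 2])

# 'balanced' gives each class the same total weight:
# weight = n_samples / (n_classes * class_count)
weights = compute_sample_weight(class_weight='balanced', y=y_demo)
print(weights.round(3))

# In this notebook the weights could be passed as:
# model.fit(X_train_scaled, y_train,
#           sample_weight=compute_sample_weight('balanced', y_train))
```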

In [None]:
# Evaluate
y_pred = model.predict(X_test_scaled)
y_proba = model.predict_proba(X_test_scaled)

print("Classification Report:")
print("="*60)
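# Pass labels explicitly so the report still works if a class is absent from the test set
print(classification_report(
    y_test, y_pred,
    labels=[0, 1, 2],
    target_names=['No Trade', 'Long', 'Short'],
    zero_division=0
))

print("\nConfusion Matrix:")
cm = confusion_matrix(y_test, y_pred, labels=[0, 1, 2])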
print(cm)

In [None]:
# Visualize confusion matrix
fig = go.Figure(data=go.Heatmap(
    z=cm,
    x=['No Trade', 'Long', 'Short'],
    y=['No Trade', 'Long', 'Short'],
    text=cm,
    texttemplate='%{text}',
    colorscale='Blues'
))

fig.update_layout(
    title='Confusion Matrix - Signal Detection',
    xaxis_title='Predicted',
    yaxis_title='Actual',
    width=500,
    height=500
)
fig.show()

## 7. Feature Importance

In [None]:
# Get feature importances
importance_df = pd.DataFrame({
    'feature': feature_cols,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)

print("Top 20 Most Important Features:")
print(importance_df.head(20).to_string(index=False))

In [None]:
# Visualize top features
top_n = 25
top_features = importance_df.head(top_n)

fig = go.Figure(go.Bar(
    x=top_features['importance'],
    y=top_features['feature'],
    orientation='h'
))

fig.update_layout(
    title=f'Top {top_n} Feature Importances',
    xaxis_title='Importance',
    yaxis_title='Feature',
    height=700,
    yaxis={'categoryorder': 'total ascending'}
)
fig.show()

## 8. Signal Quality Analysis

Analyze how well the model's signals would have performed.

In [None]:
# Add predictions to test data
df_test = df_clean.iloc[split_idx:].copy()
df_test['pred_signal'] = y_pred
df_test['pred_prob_long'] = y_proba[:, 1]
df_test['pred_prob_short'] = y_proba[:, 2]

# Analyze predicted signals
print("Predicted Signal Analysis:")
print("="*60)

for signal_type, signal_name in [(1, 'Long'), (2, 'Short')]:
    pred_signals = df_test[df_test['pred_signal'] == signal_type]
    if len(pred_signals) > 0:
        actual_good = (pred_signals['signal'] == signal_type).mean() * 100
        if signal_type == 1:
            avg_return = pred_signals['future_return_up'].mean()
        else:
            avg_return = pred_signals['future_return_down'].mean()
        
        print(f"\n{signal_name} Signals:")
        print(f"  Predicted: {len(pred_signals):,}")
        print(f"  Precision (actual good entry): {actual_good:.1f}%")
        print(f"  Avg potential return: {avg_return:.2f}%")

In [None]:
# High-confidence signals only
CONFIDENCE_THRESHOLD = 0.6

print(f"\nHigh-Confidence Signals (prob > {CONFIDENCE_THRESHOLD}):")
print("="*60)

# Long signals
high_conf_long = df_test[df_test['pred_prob_long'] > CONFIDENCE_THRESHOLD]
if len(high_conf_long) > 0:
    accuracy = (high_conf_long['signal'] == 1).mean() * 100
    avg_return = high_conf_long['future_return_up'].mean()
    print(f"\nLong Signals:")
    print(f"  Count: {len(high_conf_long):,}")
    print(f"  Precision: {accuracy:.1f}%")
    print(f"  Avg potential return: {avg_return:.2f}%")

# Short signals
high_conf_short = df_test[df_test['pred_prob_short'] > CONFIDENCE_THRESHOLD]
if len(high_conf_short) > 0:
    accuracy = (high_conf_short['signal'] == 2).mean() * 100
    avg_return = high_conf_short['future_return_down'].mean()
    print(f"\nShort Signals:")
    print(f"  Count: {len(high_conf_short):,}")
    print(f"  Precision: {accuracy:.1f}%")
    print(f"  Avg potential return: {avg_return:.2f}%")
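
A single threshold hides the trade-off between signal count and signal quality. The sketch below sweeps several thresholds and reports how many signals fire and what share of them were good entries. The helper `precision_by_threshold` and the toy numbers are made up for illustration.

```python
import numpy as np

def precision_by_threshold(prob, is_good, thresholds):
    """For each threshold, return (threshold, n_signals, precision)."""
    prob = np.asarray(prob, dtype=float)
    is_good = np.asarray(is_good, dtype=bool)
    rows = []
    for t in thresholds:
        fired = prob > t
        n = int(fired.sum())
        # Precision is undefined when no signal fires at this threshold
        precision = float(is_good[fired].mean()) if n else float('nan')
        rows.append((t, n, precision))
    return rows

# Toy probabilities and outcomes (made up)
prob_long = [0.2, 0.55, 0.65, 0.7, 0.9, 0.45]
good_long = [0, 0, 1, 0, 1, 1]
sweep = precision_by_threshold(prob_long, good_long, [0.5, 0.6, 0.8])
for t, n, p in sweep:
    print(f"threshold={t:.1f}  signals={n}  precision={p:.2f}")
```

In this notebook the same helper could be called with `df_test['pred_prob_long']` and `df_test['signal'] == 1` (or the short equivalents).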

## 9. Visualize Signals on Chart

In [None]:
# Plot last portion of test data with signals
n_display = 500
df_plot = df_test.tail(n_display).copy()

fig = make_subplots(
    rows=2, cols=1,
    shared_xaxes=True,
    vertical_spacing=0.05,
    row_heights=[0.7, 0.3],
    subplot_titles=('Price & Signals', 'Signal Confidence')
)

# Candlestick
fig.add_trace(go.Candlestick(
    x=df_plot['open_time'],
    open=df_plot['open'], high=df_plot['high'],
    low=df_plot['low'], close=df_plot['close'],
    name='Price'
), row=1, col=1)

# Long signals
long_signals = df_plot[df_plot['pred_prob_long'] > CONFIDENCE_THRESHOLD]
fig.add_trace(go.Scatter(
    x=long_signals['open_time'],
    y=long_signals['low'] * 0.998,
    mode='markers',
    marker=dict(symbol='triangle-up', size=12, color='green'),
    name=f'Long Signal (p>{CONFIDENCE_THRESHOLD})'
), row=1, col=1)

# Short signals
short_signals = df_plot[df_plot['pred_prob_short'] > CONFIDENCE_THRESHOLD]
fig.add_trace(go.Scatter(
    x=short_signals['open_time'],
    y=short_signals['high'] * 1.002,
    mode='markers',
    marker=dict(symbol='triangle-down', size=12, color='red'),
    name=f'Short Signal (p>{CONFIDENCE_THRESHOLD})'
), row=1, col=1)

# Confidence plot
fig.add_trace(go.Scatter(
    x=df_plot['open_time'],
    y=df_plot['pred_prob_long'],
    mode='lines',
    name='Long Prob',
    line=dict(color='green', width=1)
), row=2, col=1)

fig.add_trace(go.Scatter(
    x=df_plot['open_time'],
    y=df_plot['pred_prob_short'],
    mode='lines',
    name='Short Prob',
    line=dict(color='red', width=1)
), row=2, col=1)

fig.add_hline(y=CONFIDENCE_THRESHOLD, line_dash='dash', line_color='white', row=2, col=1)

fig.update_layout(
    template='plotly_dark',
    height=800,
    xaxis_rangeslider_visible=False,
    showlegend=True
)

fig.update_yaxes(title_text='Price', row=1, col=1)
fig.update_yaxes(title_text='Probability', range=[0, 1], row=2, col=1)

fig.show()

## 10. Save Model

In [None]:
# Save model and scaler
MODELS_DIR.mkdir(parents=True, exist_ok=True)

model_path = MODELS_DIR / f'{SYMBOL}_{INTERVAL}_signal_model.joblib'
scaler_path = MODELS_DIR / f'{SYMBOL}_{INTERVAL}_signal_scaler.joblib'

joblib.dump(model, model_path)
joblib.dump(scaler, scaler_path)

print(f"Model saved to: {model_path}")
print(f"Scaler saved to: {scaler_path}")

# Save metadata
metadata = {
    'symbol': SYMBOL,
    'interval': INTERVAL,
    'created_at': datetime.now().isoformat(),
    'feature_cols': feature_cols,
    'target_params': {
        'lookahead_bars': LOOKAHEAD_BARS,
        'profit_target_pct': PROFIT_TARGET_PCT,
        'stop_loss_pct': STOP_LOSS_PCT
    },
    'train_samples': len(X_train),
    'test_samples': len(X_test)
}

metadata_path = MODELS_DIR / f'{SYMBOL}_{INTERVAL}_signal_metadata.json'
with open(metadata_path, 'w') as f:
    json.dump(metadata, f, indent=2)

print(f"Metadata saved to: {metadata_path}")

## 11. Quick Inference Function

Function to generate signals on new data.

In [None]:
def predict_signals(df, model, scaler, feature_cols, confidence_threshold=0.6):
    """
    Generate trading signals for new data.
    
    Returns DataFrame with signal columns added.
    """
    df = df.copy()
    
    # Prepare features
    X = df[feature_cols].values
    X_scaled = scaler.transform(X)
    
    # Predict
    proba = model.predict_proba(X_scaled)
    
    df['prob_no_trade'] = proba[:, 0]
    df['prob_long'] = proba[:, 1]
    df['prob_short'] = proba[:, 2]
    
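    # High-confidence signals. If both probabilities clear the threshold
    # (only possible when confidence_threshold < 0.5), take the larger one.
    df['signal'] = 0
    long_mask = (df['prob_long'] > confidence_threshold) & (df['prob_long'] >= df['prob_short'])
    short_mask = (df['prob_short'] > confidence_threshold) & (df['prob_short'] > df['prob_long'])
    df.loc[long_mask, 'signal'] = 1
    df.loc[short_mask, 'signal'] = 2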
    
    return df

print("Inference function defined.")
print("\nUsage:")
print("  df_with_signals = predict_signals(df, model, scaler, feature_cols)")
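
To use `predict_signals` in another session, the artifacts saved in section 10 have to be loaded first. The sketch below wraps that in a hypothetical helper, `load_signal_artifacts`, which follows the file names used when saving. The round-trip demo writes stand-in objects to a temporary directory so the block runs on its own.

```python
import json
import tempfile
from pathlib import Path

import joblib

def load_signal_artifacts(models_dir, symbol, interval):
    """Load the model, scaler and metadata using the file names from section 10."""
    models_dir = Path(models_dir)
    prefix = f'{symbol}_{interval}_signal'
    model = joblib.load(models_dir / f'{prefix}_model.joblib')
    scaler = joblib.load(models_dir / f'{prefix}_scaler.joblib')
    with open(models_dir / f'{prefix}_metadata.json') as f:
        metadata = json.load(f)
    return model, scaler, metadata

# Round-trip demo with stand-in objects (not a real model or scaler)
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    joblib.dump({'stand_in': 'model'}, tmp / 'BTCUSDT_1h_signal_model.joblib')
    joblib.dump({'stand_in': 'scaler'}, tmp / 'BTCUSDT_1h_signal_scaler.joblib')
    (tmp / 'BTCUSDT_1h_signal_metadata.json').write_text(
        json.dumps({'feature_cols': ['f1', 'f2']}))
    model_demo, scaler_demo, meta_demo = load_signal_artifacts(tmp, 'BTCUSDT', '1h')

print(meta_demo['feature_cols'])
```

With the real artifacts, `metadata['feature_cols']` supplies the `feature_cols` argument for `predict_signals`, which keeps the column order identical to training.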

---

## Summary

This notebook trained a signal detection model that:
1. Uses MAR indicator features to predict good entry points
2. Learns from labels derived from actual future price movement (not manual annotation)
3. Outputs probability scores for long/short/no-trade
4. Can be filtered by confidence threshold for higher-quality signals

**Next Steps:**
- Use **04_backtest.ipynb** to test the complete strategy
- Tune target parameters (lookahead, profit target, stop loss)
- Experiment with different confidence thresholds
- Add regime filtering (only long in uptrend, only short in downtrend)
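
The regime filtering mentioned in the last step could look like the sketch below. It assumes the regime codes used earlier in this notebook (1 = Trending Up, 2 = Trending Down) and a `signal` column as produced by `predict_signals`; the helper name `apply_regime_filter` is made up for illustration.

```python
import pandas as pd

def apply_regime_filter(df, signal_col='signal', regime_col='regime'):
    """Keep longs only in uptrends (regime 1) and shorts only in downtrends (regime 2)."""
    out = df.copy()
    long_ok = (out[signal_col] == 1) & (out[regime_col] == 1)
    short_ok = (out[signal_col] == 2) & (out[regime_col] == 2)
    out[signal_col] = 0           # default: no trade
    out.loc[long_ok, signal_col] = 1
    out.loc[short_ok, signal_col] = 2
    return out

# Toy frame (made up): one row per regime/signal combination of interest
demo = pd.DataFrame({
    'regime': [0, 1, 1, 2, 2],
    'signal': [1, 1, 2, 2, 1],
})
filtered = apply_regime_filter(demo)
print(filtered['signal'].tolist())  # [0, 1, 0, 2, 0]
```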