# Machine Learning Trading Strategies

This notebook demonstrates how to build ML-based trading strategies:

1. **Feature Engineering** - Create technical features from OHLCV data
2. **Logistic Regression** - Simple binary classification
3. **Random Forest** - Ensemble tree-based model
4. **Gradient Boosting** - Advanced ensemble method
5. **Strategy Comparison** - ML vs Traditional strategies

**Important**: We'll use walk-forward validation to avoid look-ahead bias!
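
As a rough sketch of what walk-forward validation means here: the model is fit on a rolling window of past bars and evaluated only on the bars that follow it. The helper `walk_forward_splits` and its parameters are illustrative names, not part of `simple_backtest` or scikit-learn; the window sizes simply mirror the `training_window=500` and `retrain_interval=50` used later in the notebook.

```python
from typing import Iterator, Tuple


def walk_forward_splits(
    n_samples: int, train_size: int, test_size: int
) -> Iterator[Tuple[range, range]]:
    """Yield (train, test) index ranges where every test bar follows its training window."""
    start = 0
    while start + train_size + test_size <= n_samples:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        yield train, test
        start += test_size  # roll the window forward by one test block


splits = list(walk_forward_splits(n_samples=700, train_size=500, test_size=50))
for train, test in splits:
    print(f"train [{train.start}, {train.stop})  ->  test [{test.start}, {test.stop})")
```

Because each test block starts exactly where its training window ends, no future bar can influence the model that predicts it.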

In [None]:
# Install the library (run this cell if using Colab or if you haven't installed the package)
!pip install --upgrade --no-cache-dir simple-backtest yfinance scikit-learn

In [None]:
import yfinance as yf
import pandas as pd
import numpy as np
from typing import Any, Dict, List
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

from simple_backtest import BacktestConfig, Backtest
from simple_backtest.strategy import Strategy, MovingAverageStrategy
from simple_backtest.visualization import plot_equity_curve

## Load and Prepare Data

In [None]:
# Download data
ticker = "AAPL"
data = yf.download(ticker, start="2018-01-01", end="2023-12-31", progress=False)

# Handle MultiIndex columns if present
if isinstance(data.columns, pd.MultiIndex):
    data.columns = data.columns.get_level_values(0)

data = data.dropna()

print(f"Data shape: {data.shape}")
print(f"Date range: {data.index[0]} to {data.index[-1]}")

## 1. Feature Engineering

Create technical features that ML models can learn from:
- Returns and momentum
- Moving averages
- Volatility
- RSI
- Volume indicators
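
One way to check that features like these are free of look-ahead bias is to recompute them on a truncated history and confirm that the last row does not change. The sketch below does this on synthetic prices; `toy_features` is a hypothetical stand-in for the notebook's feature function, reduced to two columns.

```python
import numpy as np
import pandas as pd


def toy_features(close: pd.Series) -> pd.DataFrame:
    """Two backward-looking features in the same style as the notebook's."""
    return pd.DataFrame({
        "returns_5": close.pct_change(5),
        "sma_10": close.rolling(10).mean() / close - 1,
    })


rng = np.random.default_rng(0)
close = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.01, 200))))

cutoff = 150
full = toy_features(close).iloc[cutoff]                      # computed with later bars present
truncated = toy_features(close.iloc[: cutoff + 1]).iloc[-1]  # computed with history only

# Backward-looking features must not change when future bars are removed
is_causal = np.allclose(full.values, truncated.values)
print(f"Features unchanged after truncation: {is_causal}")

# A forward-looking quantity (next-day return) fails the same check:
# it has a value on the full series but is NaN once the future is cut off
leaky_full = close.pct_change().shift(-1).iloc[cutoff]
leaky_truncated = close.iloc[: cutoff + 1].pct_change().shift(-1).iloc[-1]
print(f"Next-day return with future: {leaky_full:.4f}, without: {leaky_truncated}")
```

The next-day return is exactly what the target uses, which is why it belongs in the label and never in the feature matrix.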

In [None]:
def create_features(df: pd.DataFrame) -> pd.DataFrame:
    """Create technical features for ML models."""
    features = pd.DataFrame(index=df.index)
    
    # Price features
    features['returns'] = df['Close'].pct_change()
    features['returns_5'] = df['Close'].pct_change(5)
    features['returns_10'] = df['Close'].pct_change(10)
    features['returns_20'] = df['Close'].pct_change(20)
    
    # Moving averages
    features['sma_5'] = df['Close'].rolling(5).mean() / df['Close'] - 1
    features['sma_10'] = df['Close'].rolling(10).mean() / df['Close'] - 1
    features['sma_20'] = df['Close'].rolling(20).mean() / df['Close'] - 1
    features['sma_50'] = df['Close'].rolling(50).mean() / df['Close'] - 1
    
    # Volatility
    features['volatility_5'] = df['Close'].pct_change().rolling(5).std()
    features['volatility_20'] = df['Close'].pct_change().rolling(20).std()
    
    # RSI
    delta = df['Close'].diff()
    gain = (delta.where(delta > 0, 0)).rolling(window=14).mean()
    loss = (-delta.where(delta < 0, 0)).rolling(window=14).mean()
    rs = gain / loss
    features['rsi'] = 100 - (100 / (1 + rs))
    features['rsi'] = (features['rsi'] - 50) / 50  # Normalize to [-1, 1]
    
    # Volume features
    features['volume_ratio'] = df['Volume'] / df['Volume'].rolling(20).mean()
    features['volume_change'] = df['Volume'].pct_change()
    
    # Price position in range
    features['high_low_ratio'] = (df['Close'] - df['Low']) / (df['High'] - df['Low'])
    
    # Momentum (Close / Close.shift(n) - 1 equals pct_change(n), so these duplicate returns_5 and returns_10)
    features['momentum_5'] = df['Close'] / df['Close'].shift(5) - 1
    features['momentum_10'] = df['Close'] / df['Close'].shift(10) - 1
    
    # Drop NaN values, plus infinities (e.g. volume_change after a zero-volume bar)
    features = features.replace([np.inf, -np.inf], np.nan).dropna()
    
    return features

# Create features
features_df = create_features(data)

print(f"\nFeatures shape: {features_df.shape}")
print(f"\nFeatures:")
print(features_df.columns.tolist())
print(f"\nSample:")
features_df.head()

In [None]:
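# Create target: 1 if next day's return is positive, 0 otherwise.
# The last row has no next-day return; comparing NaN > 0 would silently label it 0,
# so drop that row before casting to int.
next_return = data['Close'].pct_change().shift(-1).loc[features_df.index]
features_df = features_df.loc[next_return.notna()].copy()
features_df['target'] = (next_return.loc[features_df.index] > 0).astype(int)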

print(f"Target distribution:")
print(features_df['target'].value_counts())
print(f"\nPositive days: {features_df['target'].mean()*100:.2f}%")

## 2. ML Strategy Base Class

Create a base class for ML strategies that handles:
- Training window management
- Feature calculation on the fly
- Model prediction
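
The training-window management amounts to a simple counter: train on the first call, then retrain once `retrain_interval` bars have passed. The sketch below replays that rule on its own to show which bars trigger a retrain; `retrain_bars` is an illustrative helper, and it assumes every training attempt succeeds.

```python
from typing import List


def retrain_bars(n_bars: int, retrain_interval: int) -> List[int]:
    """Return the bar indices at which a strategy following this rule would retrain."""
    bars = []
    is_trained = False
    days_since_training = 0
    for bar in range(n_bars):
        if not is_trained or days_since_training >= retrain_interval:
            bars.append(bar)
            is_trained = True  # assumes every training attempt succeeds
            days_since_training = 0
        days_since_training += 1
    return bars


schedule = retrain_bars(n_bars=120, retrain_interval=50)
print(schedule)  # [0, 50, 100]
```

A shorter interval tracks regime changes more closely but multiplies the training cost, which matters most for the ensemble models.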

In [None]:
class MLStrategy(Strategy):
    """Base class for ML-based trading strategies."""
    
    def __init__(self, model, training_window: int = 500, retrain_interval: int = 50,
                 shares: float = 10, name: str = None):
        super().__init__(name=name or f"ML_{model.__class__.__name__}")
        self.model = model
        self.scaler = StandardScaler()
        self.training_window = training_window
        self.retrain_interval = retrain_interval
        self.shares = shares
        self.days_since_training = 0
        self.is_trained = False
    
    def create_features_from_data(self, data: pd.DataFrame) -> pd.DataFrame:
        """Create features from OHLCV data."""
        return create_features(data)
    
    def train_model(self, data: pd.DataFrame):
        """Train the model on historical data."""
        # Create features
        features = self.create_features_from_data(data)
        
        if len(features) < 100:  # Need minimum data
            return False
        
        # Create target (next day return > 0). The final row has no next-day
        # return, so mask it out before the comparison turns NaN into 0.
        next_return = data['Close'].pct_change().shift(-1).loc[features.index]

        # Align features and target
        valid_idx = next_return.dropna().index
        features = features.loc[valid_idx]
        target = (next_return.loc[valid_idx] > 0).astype(int)
        
        if len(features) < 50:
            return False
        
        # Drop target column if exists
        if 'target' in features.columns:
            features = features.drop('target', axis=1)
        
        # Scale features
        X = self.scaler.fit_transform(features)
        y = target.values
        
        # Train model
        self.model.fit(X, y)
        self.feature_names = features.columns.tolist()
        self.is_trained = True
        
        return True
    
    def predict(self, data: pd.DataFrame, trade_history: List[Dict[str, Any]]) -> Dict[str, Any]:
        # Check if we need to train or retrain
        if not self.is_trained or self.days_since_training >= self.retrain_interval:
            # Train on the most recent `training_window` rows (or all rows if fewer exist)
            training_data = data.tail(self.training_window)
            self.train_model(training_data)
            self.days_since_training = 0
        
        self.days_since_training += 1
        
        if not self.is_trained:
            return self.hold()
        
        # Create features for current data
        try:
            features = self.create_features_from_data(data)
            
            if len(features) == 0:
                return self.hold()
            
            # Get latest features
            latest_features = features.iloc[-1:]
            
            # Drop target if exists
            if 'target' in latest_features.columns:
                latest_features = latest_features.drop('target', axis=1)
            
            # Ensure same features as training
            latest_features = latest_features[self.feature_names]
            
            # Scale features
            X = self.scaler.transform(latest_features)
            
            # Get prediction and probability
            prediction = self.model.predict(X)[0]
            
            # Get prediction probability if available
            if hasattr(self.model, 'predict_proba'):
                prob = self.model.predict_proba(X)[0][1]  # Probability of class 1
                confidence_threshold = 0.55  # Only trade if confident
                
                # Buy signal: predict up with confidence
                if prediction == 1 and prob > confidence_threshold and not self.has_position():
                    return self.buy(self.shares)
                
                # Sell signal: predict down with confidence
                if prediction == 0 and prob < (1 - confidence_threshold) and self.has_position():
                    return self.sell(self.shares)
            else:
                # For models without probability (like some linear models)
                if prediction == 1 and not self.has_position():
                    return self.buy(self.shares)
                elif prediction == 0 and self.has_position():
                    return self.sell(self.shares)
            
        except Exception:
            # Any failure in feature calculation or prediction falls through to hold.
            # This also hides genuine bugs, so log the exception while debugging.
            pass
        
        return self.hold()
    
    def reset_state(self):
        super().reset_state()
        self.days_since_training = 0
        self.is_trained = False

print("✓ MLStrategy base class created")

## 3. Logistic Regression Strategy

A simple linear model: fast to train, and its coefficients are easy to interpret.
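
To illustrate the interpretability point, the sketch below fits a logistic regression on synthetic data in which only the first of two features drives the label, then prints the learned coefficients. The data and feature names are made up for the example and have nothing to do with the AAPL features above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 2))                               # two synthetic features
y = (X[:, 0] + 0.1 * rng.normal(size=500) > 0).astype(int)  # label driven by feature 0 only

model = LogisticRegression(max_iter=1000)
model.fit(StandardScaler().fit_transform(X), y)

# On standardized inputs, coefficient magnitudes are directly comparable
for name, coef in zip(["informative", "noise"], model.coef_[0]):
    print(f"{name:<12} {coef:+.3f}")
```

The informative feature receives a large positive weight while the noise feature stays near zero, which is the kind of reading a tree ensemble does not offer as directly.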

In [None]:
# Create Logistic Regression strategy
lr_model = LogisticRegression(random_state=42, max_iter=1000)
lr_strategy = MLStrategy(
    model=lr_model,
    training_window=500,
    retrain_interval=50,
    shares=10,
    name="LogisticRegression"
)

print("✓ Logistic Regression strategy created")

## 4. Random Forest Strategy

An ensemble of decision trees that can capture non-linear relationships between features.
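
A small synthetic example makes the non-linearity point concrete. Here the label depends on the product of two features, a pattern no single linear boundary can separate. This is only a sketch on clean artificial data; real market features are far noisier, and the gap between models is usually much smaller.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(2000, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)  # label depends on the interaction of the two features

X_train, X_test = X[:1500], X[1500:]
y_train, y_test = y[:1500], y[1500:]

lr_acc = LogisticRegression().fit(X_train, y_train).score(X_test, y_test)
rf_acc = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42).fit(
    X_train, y_train
).score(X_test, y_test)

print(f"Logistic regression accuracy: {lr_acc:.2f}")
print(f"Random forest accuracy:       {rf_acc:.2f}")
```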

In [None]:
# Create Random Forest strategy
rf_model = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    random_state=42,
    n_jobs=-1
)
rf_strategy = MLStrategy(
    model=rf_model,
    training_window=500,
    retrain_interval=50,
    shares=10,
    name="RandomForest"
)

print("✓ Random Forest strategy created")

## 5. Gradient Boosting Strategy

A sequential ensemble: each new tree is fit to correct the errors of the trees before it.
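
The sequential behaviour can be seen in scikit-learn's `train_score_` attribute, which records the training loss after each tree is added. The sketch below uses synthetic data chosen only for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(600, 4))
y = (X[:, 0] + X[:, 1] ** 2 + 0.5 * rng.normal(size=600) > 1).astype(int)

gb = GradientBoostingClassifier(
    n_estimators=100, max_depth=3, learning_rate=0.1, random_state=42
)
gb.fit(X, y)

# train_score_[i] is the training loss after the (i + 1)-th tree has been added
for stage in (0, 9, 49, 99):
    print(f"after {stage + 1:>3} trees: training loss = {gb.train_score_[stage]:.4f}")
```

A falling training loss says nothing about out-of-sample performance; with noisy financial labels, later trees may mostly be fitting noise.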

In [None]:
# Create Gradient Boosting strategy
gb_model = GradientBoostingClassifier(
    n_estimators=100,
    max_depth=5,
    learning_rate=0.1,
    random_state=42
)
gb_strategy = MLStrategy(
    model=gb_model,
    training_window=500,
    retrain_interval=50,
    shares=10,
    name="GradientBoosting"
)

print("✓ Gradient Boosting strategy created")

## Compare ML Strategies with Traditional Strategy

In [None]:
# Configure backtest
config = BacktestConfig(
    initial_capital=10000.0,
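    # Need longer lookback for ML features. If predict() receives only this many rows,
    # 100 rows leave about 51 feature rows after the 50-day SMA warm-up, which is below
    # the 100-row minimum in train_model; increase this value in that case.
    lookback_period=100,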
    commission_type="percentage",
    commission_value=0.001,
    risk_free_rate=0.02
)

# Include traditional strategy for comparison
ma_strategy = MovingAverageStrategy(
    short_window=10,
    long_window=30,
    shares=10,
    name="MA_Traditional"
)

# All strategies
strategies = [
    lr_strategy,
    rf_strategy,
    gb_strategy,
    ma_strategy,
]

print("Testing strategies:")
for s in strategies:
    print(f"  - {s.get_name()}")

print("\n⏳ Running backtest... (ML strategies take longer due to training)")

In [None]:
# Run backtest (disable parallel execution for ML strategies to avoid issues)
config.parallel_execution = False

backtest = Backtest(data=data, config=config)
results = backtest.run(strategies)

print("\n✓ Backtest completed!")

In [None]:
# Compare results
comparison_df = results.compare()

print("\nML vs Traditional Strategy Comparison:")
print("=" * 120)
display_cols = ['total_return', 'cagr', 'sharpe_ratio', 'sortino_ratio', 'max_drawdown', 
                'total_trades', 'win_rate', 'profit_factor']
print(comparison_df[display_cols])

In [None]:
# Best strategies by different metrics
print("\n" + "="*70)
print("BEST STRATEGIES")
print("="*70)

for metric in ['sharpe_ratio', 'total_return', 'win_rate']:
    best = results.best_strategy(metric)  # Returns StrategyResult object
    value = best.metrics[metric]
    
    if 'return' in metric or 'rate' in metric or 'drawdown' in metric:
        print(f"\nBest by {metric}: {best.name} = {value*100:.2f}%")
    else:
        print(f"\nBest by {metric}: {best.name} = {value:.2f}")

In [None]:
# Visualize comparison
fig = results.plot_comparison()
fig.update_layout(title="ML vs Traditional Strategies - Equity Curves")
fig.show()

## Detailed Analysis: Best ML Strategy

In [None]:
# Find best ML strategy
ml_strategies = ['LogisticRegression', 'RandomForest', 'GradientBoosting']
ml_sharpe = {name: results[name]['metrics']['sharpe_ratio'] for name in ml_strategies}
best_ml = max(ml_sharpe, key=ml_sharpe.get)

print(f"Best ML Strategy: {best_ml}\n")

metrics = results[best_ml]['metrics']

print("=" * 80)
print(f"{best_ml.upper()} - DETAILED METRICS")
print("=" * 80)

print(f"\n📊 Returns:")
print(f"  Total Return:        {metrics['total_return']*100:>10.2f}%")
print(f"  CAGR:                {metrics['cagr']*100:>10.2f}%")

print(f"\n⚡ Risk-Adjusted Returns:")
print(f"  Sharpe Ratio:        {metrics['sharpe_ratio']:>10.2f}")
print(f"  Sortino Ratio:       {metrics['sortino_ratio']:>10.2f}")
print(f"  Calmar Ratio:        {metrics['calmar_ratio']:>10.2f}")

print(f"\n⚠️  Risk Metrics:")
print(f"  Max Drawdown:        {metrics['max_drawdown']*100:>10.2f}%")
print(f"  Volatility:          {metrics['volatility']*100:>10.2f}%")

print(f"\n📈 Trading Performance:")
print(f"  Total Trades:        {metrics['total_trades']:>10}")
print(f"  Win Rate:            {metrics['win_rate']*100:>10.2f}%")
print(f"  Profit Factor:       {metrics['profit_factor']:>10.2f}")
print(f"  Avg Win:             ${metrics['avg_win']:>10.2f}")
print(f"  Avg Loss:            ${metrics['avg_loss']:>10.2f}")

print(f"\n🎯 vs Benchmark:")
print(f"  Alpha:               {metrics['alpha']*100:>10.2f}%")
print(f"  Beta:                {metrics['beta']:>10.2f}")
print(f"  Information Ratio:   {metrics['information_ratio']:>10.2f}")

print("\n" + "="*80)

## Feature Importance (for Random Forest)

In [None]:
# Show feature importance for Random Forest if it's trained
if rf_strategy.is_trained and hasattr(rf_strategy.model, 'feature_importances_'):
    importance_df = pd.DataFrame({
        'feature': rf_strategy.feature_names,
        'importance': rf_strategy.model.feature_importances_
    }).sort_values('importance', ascending=False)
    
    print("\nRandom Forest - Top 10 Most Important Features:")
    print("=" * 50)
    for idx, row in importance_df.head(10).iterrows():
        print(f"  {row['feature']:.<30} {row['importance']:.4f}")
    print("=" * 50)

## Summary

In this notebook, we built ML-based trading strategies:

1. ✅ **Feature Engineering**: Created 16 technical features from OHLCV data
2. ✅ **Logistic Regression**: Simple linear classifier
3. ✅ **Random Forest**: Ensemble tree-based model
4. ✅ **Gradient Boosting**: Advanced sequential ensemble
5. ✅ **Strategy Comparison**: ML vs traditional strategies

### Key Insights:

**Advantages of ML Strategies:**
- Can capture complex non-linear relationships
- Automatically learn from multiple features
- Adapt to changing market conditions through retraining
- Can incorporate many indicators simultaneously

**Challenges:**
- Risk of overfitting (learning noise instead of signal)
- Require careful feature engineering
- Need sufficient training data
- Computationally expensive (especially retraining)
- Harder to interpret than rule-based strategies

### Best Practices for ML Trading:

1. **Walk-Forward Validation**: Train only on data available before each prediction, using rolling windows, so that no future information leaks into the model
2. **Regular Retraining**: Markets change - retrain periodically
3. **Feature Selection**: More features ≠ better performance
4. **Ensemble Methods**: Combine multiple models for robustness
5. **Probability Thresholds**: Only trade when model is confident
6. **Risk Management**: ML signals need stop-losses too
7. **Transaction Costs**: ML strategies often trade more - watch commissions
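
The probability-threshold practice can be sketched as a small decision function with a neutral band around 0.5. The name `signal_from_probability` is illustrative; the default threshold of 0.55 matches the value hard-coded in `MLStrategy.predict`.

```python
def signal_from_probability(prob_up: float, has_position: bool, threshold: float = 0.55) -> str:
    """Map P(up) to an action, staying out when the model is not confident either way."""
    if prob_up > threshold and not has_position:
        return "buy"
    if prob_up < 1 - threshold and has_position:
        return "sell"
    return "hold"


for prob_up, has_position in [(0.62, False), (0.52, False), (0.40, True), (0.48, True)]:
    action = signal_from_probability(prob_up, has_position)
    print(f"P(up)={prob_up:.2f}, in position={has_position}: {action}")
```

Widening the band reduces the number of trades, which also lowers the commission drag mentioned in the last item.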

### Performance Tips:

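- **Feature Scaling**: Standardize features for scale-sensitive models such as logistic regression (tree ensembles do not need it), and fit the scaler on training data only
- **Class Balance**: Check the up/down label balance and, if it is skewed, use class weights or adjust the decision threshold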
- **Cross-Validation**: Use time-series CV, not random CV
- **Hyperparameter Tuning**: Optimize model parameters
- **Avoid Overfitting**: Use regularization, limit tree depth
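
Several of these tips combine naturally in scikit-learn. The sketch below, on synthetic stand-in data, wraps the scaler and a regularized, class-weighted logistic regression in a pipeline and scores it with `TimeSeriesSplit`, so that each fold trains on the past and tests on the bars that follow. The specific values (`C=0.5`, five splits) are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
X = rng.normal(size=(600, 5))             # synthetic stand-in for the feature matrix
y = (rng.random(600) > 0.47).astype(int)  # mildly imbalanced random up/down labels

# The pipeline refits the scaler inside each fold, so test bars never influence the scaling
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(C=0.5, class_weight="balanced", max_iter=1000),
)

cv = TimeSeriesSplit(n_splits=5)          # each fold trains on the past, tests on what follows
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print(scores.round(3), f"mean={scores.mean():.3f}")
```

Since the labels here are random, accuracy should hover around 50%; a score far above that on real data is a reason to look for leakage before celebrating.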

### Next Steps:

- Add more advanced features (technical patterns, sentiment)
- Try deep learning models (LSTM, Transformers)
- Implement position sizing based on prediction confidence
- Add regime detection to switch between strategies
- Create ensemble strategies combining multiple ML models
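
As a starting point for confidence-based position sizing, one simple option is to scale the share count linearly with how far the predicted probability exceeds the trading threshold. The function name, the linear rule, and the defaults below are all assumptions made for illustration, not something the notebook or `simple_backtest` prescribes.

```python
def shares_for_confidence(prob_up: float, max_shares: int = 10, threshold: float = 0.55) -> int:
    """Scale position size linearly from 0 at the threshold to max_shares at P(up) = 1."""
    if prob_up <= threshold:
        return 0
    scale = (prob_up - threshold) / (1.0 - threshold)
    return max(1, round(scale * max_shares))


for p in (0.50, 0.60, 0.75, 0.95):
    print(f"P(up)={p:.2f} -> {shares_for_confidence(p)} shares")
```

Inside `MLStrategy.predict`, the result could replace the fixed `self.shares` in the buy branch, provided the model's probabilities are reasonably well calibrated.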