# Week 8: Machine Learning for Trading

---

## Table of Contents
1. Decision Trees
2. Random Forests
3. Gradient Boosting (XGBoost, LightGBM)
4. Feature Engineering for Alpha
5. Walk-Forward Validation
6. Ensemble Methods

---

## 1. Decision Trees

### How Decision Trees Work

A decision tree recursively partitions the data: at each node it chooses the feature and threshold whose split most reduces impurity in the resulting child nodes.

### Splitting Criteria

**For Regression (MSE):**
$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^2$$

**For Classification (Gini Impurity):**
$$\text{Gini} = 1 - \sum_{k=1}^{K} p_k^2$$

Where $p_k$ is the proportion of class $k$ in the node.

**For Classification (Entropy):**
$$\text{Entropy} = -\sum_{k=1}^{K} p_k \log_2(p_k)$$
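
As a quick numerical check of the two classification criteria, the sketch below evaluates both on a few class-proportion vectors. The helper names `gini` and `entropy` are illustrative and not part of any library.

```python
import numpy as np

def gini(p):
    """Gini impurity for a vector of class proportions p (assumed to sum to 1)."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    """Shannon entropy in bits; zero-probability classes contribute 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # drop empty classes, since p * log2(p) -> 0 as p -> 0
    return float(-np.sum(p * np.log2(p))) + 0.0  # "+ 0.0" turns -0.0 into 0.0

for p in ([1.0, 0.0], [0.7, 0.3], [0.5, 0.5]):
    print(p, round(gini(p), 4), round(entropy(p), 4))
```

Both measures are zero for a pure node and largest for a 50/50 node (0.5 for Gini and 1 bit for entropy with two classes).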

### Information Gain

$$\text{IG} = \text{Impurity}_{parent} - \sum_{j} \frac{n_j}{n} \text{Impurity}_{child_j}$$

The split that maximizes information gain is chosen.
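
To make the split search concrete, here is a minimal sketch that scans candidate thresholds on a single feature and keeps the one with the highest Gini-based information gain. The toy data and the helper names `gini` and `information_gain` are assumptions for illustration; real implementations such as scikit-learn search all features and use faster incremental updates.

```python
import numpy as np

def gini(labels):
    """Gini impurity of an integer label array (0/1 classes)."""
    if len(labels) == 0:
        return 0.0
    p = np.bincount(labels, minlength=2) / len(labels)
    return 1.0 - np.sum(p ** 2)

def information_gain(labels, left_mask):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(labels)
    left, right = labels[left_mask], labels[~left_mask]
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(labels) - weighted

# Toy data: one momentum-like feature (already sorted) and a binary direction label
feature = np.array([-0.03, -0.01, 0.00, 0.01, 0.02, 0.04])
labels = np.array([0, 0, 0, 1, 0, 1])

# Candidate thresholds: midpoints between consecutive feature values
thresholds = (feature[:-1] + feature[1:]) / 2
gains = [information_gain(labels, feature <= t) for t in thresholds]
best_threshold = thresholds[int(np.argmax(gains))]
print(f"best threshold: {best_threshold:.3f}, gain: {max(gains):.4f}")
```

Here the winning threshold is 0.005: it isolates the three left-most samples, all of class 0, into a pure child node.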

In [1]:
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

np.random.seed(42)

# Create trading features
n_samples = 1000
momentum_5d = np.random.normal(0, 0.02, n_samples)
momentum_20d = np.random.normal(0, 0.05, n_samples)
volatility = np.abs(np.random.normal(0.01, 0.005, n_samples))
volume_ratio = np.random.uniform(0.5, 2.0, n_samples)

# Target: next day return (with some signal)
next_return = (0.3 * momentum_5d + 0.2 * momentum_20d - 0.1 * volatility + 
               np.random.normal(0, 0.015, n_samples))

# Binary target for classification
direction = (next_return > 0).astype(int)

X = pd.DataFrame({
    'momentum_5d': momentum_5d,
    'momentum_20d': momentum_20d,
    'volatility': volatility,
    'volume_ratio': volume_ratio
})

# Chronological train-test split: shuffle=False keeps the test set after the training set
X_train, X_test, y_train, y_test = train_test_split(X, direction, test_size=0.2, shuffle=False)

# Fit decision tree
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)

print("Decision Tree for Direction Prediction")
print("="*50)
print(f"\nTraining accuracy: {tree.score(X_train, y_train):.2%}")
print(f"Test accuracy: {tree.score(X_test, y_test):.2%}")
print(f"\nFeature Importances:")
for name, imp in zip(X.columns, tree.feature_importances_):
    print(f"  {name}: {imp:.3f}")

Decision Tree for Direction Prediction

Training accuracy: 72.50%
Test accuracy: 69.00%

Feature Importances:
  momentum_5d: 0.327
  momentum_20d: 0.650
  volatility: 0.000
  volume_ratio: 0.024


### Decision Tree Limitations

1. **Overfitting**: Deep trees memorize training data
2. **High variance**: Small data changes → completely different tree
3. **Greedy**: Local optimal splits, not global optimum

**Solution**: Ensemble methods (Random Forests, Boosting)

---

## 2. Random Forests

### Bagging + Feature Randomization

Random Forest combines two ideas:

**1. Bootstrap Aggregating (Bagging)**:
- Train each tree on a random sample (with replacement)
- Reduces variance through averaging

**2. Feature Randomization**:
- Each split considers only $m$ random features
- Typically $m = \sqrt{p}$ for classification, $m = p/3$ for regression
- Decorrelates trees

### Prediction

**Classification**: Majority vote
$$\hat{y} = \text{mode}\{\hat{y}_1, \hat{y}_2, ..., \hat{y}_B\}$$

**Regression**: Average
$$\hat{y} = \frac{1}{B}\sum_{b=1}^{B}\hat{y}_b$$
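
A tiny numerical illustration of both aggregation rules, using made-up predictions from $B = 3$ trees (all values are hypothetical):

```python
import numpy as np

# Hypothetical class votes from B = 3 trees (rows) for 3 samples (columns)
tree_votes = np.array([[1, 0, 1],
                       [1, 1, 0],
                       [0, 0, 1]])
# Binary majority vote: class 1 wins where more than half the trees vote 1
majority = (tree_votes.mean(axis=0) > 0.5).astype(int)

# Hypothetical regression outputs from 3 trees (rows) for 2 samples (columns)
tree_preds = np.array([[0.01, -0.02],
                       [0.03, 0.00],
                       [-0.01, 0.02]])
average = tree_preds.mean(axis=0)

print(majority, average)
```

Note that scikit-learn's `RandomForestClassifier` averages the trees' predicted class probabilities rather than counting hard votes, so its output can differ slightly from a strict majority vote.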

### Out-of-Bag (OOB) Error

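Each bootstrap sample omits roughly 37% of the observations ($\approx 1/e$); these are that tree's out-of-bag samples.
Predicting each observation with only the trees that never saw it gives a validation estimate without a separate test set. For time-ordered data, OOB samples are interleaved in time with the training samples, so the OOB score can be optimistic.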
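
The ~37% figure comes from $(1 - 1/n)^n \to 1/e$, the probability that a given row is never drawn in $n$ draws with replacement. A short simulation, as a sketch with an arbitrary sample size, confirms it:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# One bootstrap sample: draw n row indices with replacement
bootstrap_idx = rng.integers(0, n, size=n)
in_bag = np.unique(bootstrap_idx)

# Rows that were never drawn are out-of-bag for this tree
oob_fraction = 1 - len(in_bag) / n
print(f"OOB fraction: {oob_fraction:.3f}  (theory: {np.exp(-1):.3f})")
```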

In [2]:
from sklearn.ensemble import RandomForestClassifier

# Fit Random Forest
rf = RandomForestClassifier(
    n_estimators=100,
    max_depth=5,
    max_features='sqrt',
    oob_score=True,
    random_state=42,
    n_jobs=-1
)
rf.fit(X_train, y_train)

print("Random Forest Results")
print("="*50)
print(f"\nNumber of trees: {rf.n_estimators}")
print(f"Max depth: {rf.max_depth}")
print(f"\nTraining accuracy: {rf.score(X_train, y_train):.2%}")
print(f"OOB accuracy: {rf.oob_score_:.2%}")
print(f"Test accuracy: {rf.score(X_test, y_test):.2%}")

print(f"\nFeature Importances (averaged over all trees):")
for name, imp in sorted(zip(X.columns, rf.feature_importances_), key=lambda x: -x[1]):
    bars = "█" * int(imp * 30)
    print(f"  {name:15s}: {imp:.3f} {bars}")

Random Forest Results

Number of trees: 100
Max depth: 5

Training accuracy: 78.12%
OOB accuracy: 70.50%
Test accuracy: 67.00%

Feature Importances (averaged over all trees):
  momentum_20d   : 0.530 ███████████████
  momentum_5d    : 0.272 ████████
  volume_ratio   : 0.100 ██
  volatility     : 0.099 ██


### Hyperparameters

| Parameter | Effect | Typical Range |
|-----------|--------|---------------|
| n_estimators | More trees = better, diminishing returns | 100-500 |
| max_depth | Deeper = more complex, risk of overfit | 3-10 |
| max_features | Lower = more diverse trees | sqrt(p), p/3 |
| min_samples_leaf | Higher = more regularization | 10-100 |

---

## 3. Gradient Boosting (XGBoost, LightGBM)

### Boosting vs Bagging

| Bagging (RF) | Boosting |
|--------------|----------|
| Parallel trees | Sequential trees |
| Reduces variance | Reduces bias |
| Equal weights | Weighted by error |

### Gradient Boosting Algorithm

1. Initialize with constant: $F_0(x) = \bar{y}$
2. For $m = 1$ to $M$:
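   - Compute pseudo-residuals, the negative gradient of the loss. For squared error these are (up to a constant factor) the ordinary residuals: $r_i = y_i - F_{m-1}(x_i)$
   - Fit tree $h_m(x)$ to the pseudo-residuals
   - Update: $F_m(x) = F_{m-1}(x) + \eta \cdot h_m(x)$, where $\eta$ is the learning rate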
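
The steps above can be sketched from scratch for squared-error regression, using scikit-learn's `DecisionTreeRegressor` as the base learner. The synthetic data, `eta`, `M`, and tree depth are arbitrary choices for illustration; this is not how `GradientBoostingClassifier` or XGBoost are implemented internally.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_demo = rng.uniform(-1, 1, size=(300, 1))
y_demo = np.sin(3 * X_demo[:, 0]) + rng.normal(0, 0.1, 300)

eta, M = 0.1, 50                          # learning rate and number of stages
F = np.full_like(y_demo, y_demo.mean())   # step 1: constant initial model
mse_history = [np.mean((y_demo - F) ** 2)]

for m in range(M):
    residuals = y_demo - F                # negative gradient of squared error
    h = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_demo, residuals)
    F = F + eta * h.predict(X_demo)       # shrunken additive update
    mse_history.append(np.mean((y_demo - F) ** 2))

print(f"Training MSE: {mse_history[0]:.4f} -> {mse_history[-1]:.4f}")
```

Training error falls at every stage, which is exactly why boosting needs a held-out set (or early stopping) to decide when to stop adding trees.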

### Loss Function

**Regression (Squared Error)**:
$$L = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$

**Classification (Log Loss)**:
$$L = -\sum_{i=1}^{n}[y_i\log(p_i) + (1-y_i)\log(1-p_i)]$$
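
To see how log loss punishes confident mistakes, the sketch below evaluates the formula on two hypothetical sets of predicted probabilities. The function name `log_loss_sum` and the clipping constant are assumptions for illustration.

```python
import numpy as np

def log_loss_sum(y, p, eps=1e-15):
    """Summed binary log loss; p is clipped to avoid log(0)."""
    p = np.clip(p, eps, 1 - eps)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

y_true = np.array([1, 0, 1])
confident_right = np.array([0.9, 0.1, 0.8])   # high probability on the true class
confident_wrong = np.array([0.1, 0.9, 0.2])   # high probability on the wrong class

loss_right = log_loss_sum(y_true, confident_right)
loss_wrong = log_loss_sum(y_true, confident_wrong)
print(f"right: {loss_right:.4f}, wrong: {loss_wrong:.4f}")
```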

### XGBoost Objective

$$\mathcal{L} = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K}\Omega(f_k)$$

Where regularization term:
$$\Omega(f) = \gamma T + \frac{1}{2}\lambda\sum_{j=1}^{T}w_j^2$$

- $T$ = number of leaves
- $w_j$ = leaf weights
- $\gamma$ = complexity penalty
- $\lambda$ = L2 regularization
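
As a worked instance of the penalty, the sketch below evaluates $\Omega(f)$ for one hypothetical tree with three leaves. The function name `tree_penalty` and the numbers are illustrative only.

```python
import numpy as np

def tree_penalty(leaf_weights, gamma, lam):
    """Omega(f) = gamma * T + 0.5 * lambda * sum of squared leaf weights."""
    w = np.asarray(leaf_weights, dtype=float)
    T = w.size  # number of leaves
    return gamma * T + 0.5 * lam * np.sum(w ** 2)

omega = tree_penalty([0.5, -0.3, 0.2], gamma=1.0, lam=2.0)
# T = 3 leaves, sum of squared weights = 0.38, so 1.0*3 + 0.5*2.0*0.38 = 3.38
print(omega)
```

A larger $\gamma$ makes each additional leaf more expensive (fewer splits), while a larger $\lambda$ shrinks the leaf weights toward zero.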

In [3]:
from sklearn.ensemble import GradientBoostingClassifier

# Gradient Boosting (sklearn version)
gb = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    subsample=0.8,
    random_state=42
)
gb.fit(X_train, y_train)

print("Gradient Boosting Results")
print("="*50)
print(f"\nLearning rate (η): {gb.learning_rate}")
print(f"Number of stages: {gb.n_estimators}")
print(f"\nTraining accuracy: {gb.score(X_train, y_train):.2%}")
print(f"Test accuracy: {gb.score(X_test, y_test):.2%}")

# Compare all models
print("\n" + "="*50)
print("Model Comparison (Test Accuracy):")
print(f"  Decision Tree:      {tree.score(X_test, y_test):.2%}")
print(f"  Random Forest:      {rf.score(X_test, y_test):.2%}")
print(f"  Gradient Boosting:  {gb.score(X_test, y_test):.2%}")

Gradient Boosting Results

Learning rate (η): 0.1
Number of stages: 100

Training accuracy: 84.88%
Test accuracy: 67.00%

Model Comparison (Test Accuracy):
  Decision Tree:      69.00%
  Random Forest:      67.00%
  Gradient Boosting:  67.00%


### XGBoost Key Parameters

| Parameter | Description | Typical Values |
|-----------|-------------|----------------|
| learning_rate (η) | Shrinkage per tree | 0.01-0.3 |
| n_estimators | Number of trees | 100-1000 |
| max_depth | Tree depth | 3-8 |
| subsample | Row sampling | 0.7-1.0 |
| colsample_bytree | Feature sampling | 0.7-1.0 |
| reg_lambda | L2 regularization | 0-10 |
| reg_alpha | L1 regularization | 0-10 |
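
As a starting point, the parameters in the table might be collected in a plain dictionary. The values below are illustrative mid-range picks, not tuned recommendations. The keys follow the names in the table, which match the scikit-learn-style wrapper `xgboost.XGBClassifier`; the `xgboost` package is not imported here and is assumed to be installed separately.

```python
# Illustrative starting values drawn from the "Typical Values" column
xgb_params = {
    "learning_rate": 0.05,
    "n_estimators": 300,
    "max_depth": 4,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "reg_lambda": 1.0,
    "reg_alpha": 0.0,
}

# Assumed usage, if xgboost is installed:
#   from xgboost import XGBClassifier
#   model = XGBClassifier(**xgb_params)
print(xgb_params)
```

A lower `learning_rate` generally needs a higher `n_estimators`, so these two are usually tuned together.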

---

## 4. Feature Engineering for Alpha

### Categories of Features

**1. Price-Based**
- Returns (1d, 5d, 20d, 60d)
- Moving averages (SMA, EMA)
- Price relative to MA

**2. Technical Indicators**
- RSI, MACD, Bollinger Bands
- ATR (volatility)
- Volume indicators (OBV)

**3. Cross-Sectional**
- Sector-relative momentum
- Rank within universe
- Z-score vs peers

**4. Fundamental (if available)**
- P/E, P/B ratios
- Earnings surprise
- Analyst revisions

### Feature Engineering Best Practices

1. **Normalize/Standardize**: Z-score or rank
2. **Handle outliers**: Winsorize at 1st/99th percentile
3. **Lag appropriately**: No look-ahead bias!
4. **Cross-sectional**: Compare within universe

In [4]:
def create_trading_features(prices, volume=None):
    """
    Create common trading features from price data.
    All features are lagged to avoid look-ahead bias.
    """
    df = pd.DataFrame()
    
    # Returns at different horizons
    df['ret_1d'] = prices.pct_change(1).shift(1)
    df['ret_5d'] = prices.pct_change(5).shift(1)
    df['ret_20d'] = prices.pct_change(20).shift(1)
    
    # Moving averages
    df['sma_20'] = prices.rolling(20).mean().shift(1)
    df['sma_50'] = prices.rolling(50).mean().shift(1)
    
    # Price relative to MA
    df['price_to_sma20'] = (prices.shift(1) / df['sma_20']) - 1
    
    # Volatility
    df['volatility_20d'] = prices.pct_change().rolling(20).std().shift(1)
    
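    # RSI (14-day, simple-moving-average variant rather than Wilder's smoothing)
    delta = prices.diff()
    gain = delta.where(delta > 0, 0).rolling(14).mean()     # average up-move
    loss = (-delta.where(delta < 0, 0)).rolling(14).mean()  # average down-move (positive)
    rs = gain / loss  # if loss == 0 and gain > 0, rs = inf and RSI = 100
    df['rsi'] = (100 - 100 / (1 + rs)).shift(1)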
    
    return df

# Example
np.random.seed(42)
prices = pd.Series(100 * np.cumprod(1 + np.random.normal(0.0005, 0.015, 500)))
features = create_trading_features(prices)

print("Trading Features Example")
print("="*50)
print(features.dropna().head(10).round(4))

Trading Features Example
    ret_1d  ret_5d  ret_20d   sma_20   sma_50  price_to_sma20  volatility_20d  \
50 -0.0259 -0.0210  -0.0739  90.4739  96.3760         -0.0479          0.0150   
51  0.0054 -0.0055  -0.0609  90.1931  96.0922         -0.0398          0.0151   
52 -0.0053 -0.0044  -0.0915  89.7590  95.8025         -0.0402          0.0132   
53 -0.0097 -0.0298  -0.1006  89.2820  95.4756         -0.0444          0.0132   
54  0.0097 -0.0260  -0.0777  88.9190  95.1177         -0.0312          0.0134   
55  0.0160  0.0159  -0.0749  88.5649  94.7936         -0.0118          0.0136   
56  0.0145  0.0252  -0.0445  88.3583  94.5010          0.0048          0.0138   
57 -0.0121  0.0181  -0.0594  88.0812  94.1370         -0.0042          0.0139   
58 -0.0041  0.0238  -0.0354  87.9207  93.7402         -0.0065          0.0125   
59  0.0055  0.0195  -0.0110  87.8721  93.3671         -0.0005          0.0118   

        rsi  
50  26.6530  
51  27.2763  
52  32.2030  
53  34.8084  
54  37.6318  

---

## 5. Walk-Forward Validation

### Why Not Standard Cross-Validation?

**Problem**: Standard k-fold mixes past and future data → look-ahead bias!

**Solution**: Walk-forward (rolling or expanding window)

### Walk-Forward Scheme

```
Expanding Window:
Period 1: [===Train===][Test]
Period 2: [====Train====][Test]
Period 3: [=====Train=====][Test]

Rolling Window:
Period 1: [===Train===][Test]
Period 2:    [===Train===][Test]
Period 3:       [===Train===][Test]
```
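
The two schemes differ only in where the training window starts. A minimal index generator, assuming fixed-size test blocks and a hypothetical helper name `walk_forward_splits` (not a library function), could look like this:

```python
def walk_forward_splits(n_samples, train_size, test_size, expanding=True):
    """Yield (train_range, test_range) pairs in chronological order.

    expanding=True  -> training window always starts at 0 and grows
    expanding=False -> training window keeps a fixed length and rolls forward
    """
    test_start = train_size
    while test_start + test_size <= n_samples:
        train_start = 0 if expanding else test_start - train_size
        yield range(train_start, test_start), range(test_start, test_start + test_size)
        test_start += test_size

for label, expanding in (("Expanding", True), ("Rolling", False)):
    print(label)
    for train, test in walk_forward_splits(10, train_size=4, test_size=2, expanding=expanding):
        print(f"  train {list(train)}  test {list(test)}")
```

An expanding window uses all history and suits stable relationships; a rolling window forgets old data, which may help when the market regime drifts.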

### Purging and Embargo

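**Purging**: Remove training samples whose label windows overlap the test period (e.g., a 5-day forward return computed near the boundary), since they leak test information into training

**Embargo**: Leave a gap between train and test (e.g., 5 days) so serially correlated observations near the boundary do not fall into both sets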
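
In scikit-learn, a simple gap of this kind can be requested through the `gap` argument of `TimeSeriesSplit`. The sketch below only checks the resulting indices; the sample count and the 5-observation gap are arbitrary, and a gap alone does not handle every purging case (for overlapping labels, the gap should be at least as long as the label horizon).

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

n_obs, gap = 1000, 5
tscv = TimeSeriesSplit(n_splits=5, test_size=100, gap=gap)

observed_gaps = []
for train_idx, test_idx in tscv.split(np.zeros((n_obs, 1))):
    # Number of skipped observations between the last train row and first test row
    observed_gaps.append(int(test_idx[0] - train_idx[-1] - 1))
    print(f"train 0..{train_idx[-1]}, test {test_idx[0]}..{test_idx[-1]}")
```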

In [5]:
from sklearn.base import clone
from sklearn.model_selection import TimeSeriesSplit

def walk_forward_validation(X, y, model, n_splits=5, test_size=50):
    """
    Perform walk-forward validation for time series.

    Uses an expanding training window (TimeSeriesSplit) and fits a fresh
    clone of `model` on each fold.
    """
    results = []

    tscv = TimeSeriesSplit(n_splits=n_splits, test_size=test_size)

    for fold, (train_idx, test_idx) in enumerate(tscv.split(X), 1):
        X_tr, X_te = X.iloc[train_idx], X.iloc[test_idx]
        y_tr, y_te = y.iloc[train_idx], y.iloc[test_idx]

        # Fit a fresh copy so folds do not share fitted state
        m = clone(model)
        m.fit(X_tr, y_tr)
        
        # Score
        train_score = m.score(X_tr, y_tr)
        test_score = m.score(X_te, y_te)
        
        results.append({
            'fold': fold,
            'train_size': len(train_idx),
            'test_start': test_idx[0],
            'test_end': test_idx[-1],
            'train_acc': train_score,
            'test_acc': test_score
        })
        
    return pd.DataFrame(results)

# Run walk-forward validation
y_series = pd.Series(direction)
wf_results = walk_forward_validation(X, y_series, rf, n_splits=5, test_size=100)

print("Walk-Forward Validation Results")
print("="*60)
print(wf_results.to_string(index=False))
print(f"\nMean Test Accuracy: {wf_results['test_acc'].mean():.2%}")
print(f"Std Test Accuracy: {wf_results['test_acc'].std():.2%}")

Walk-Forward Validation Results
 fold  train_size  test_start  test_end  train_acc  test_acc
    1         500         500       599   0.812000      0.75
    2         600         600       699   0.796667      0.66
    3         700         700       799   0.790000      0.71
    4         800         800       899   0.781250      0.68
    5         900         900       999   0.767778      0.67

Mean Test Accuracy: 69.40%
Std Test Accuracy: 3.65%


---

## 6. Ensemble Methods

### Types of Ensembles

**1. Averaging**: Simple mean of predictions
$$\hat{y} = \frac{1}{M}\sum_{m=1}^{M}\hat{y}_m$$

**2. Weighted Averaging**: 
$$\hat{y} = \sum_{m=1}^{M}w_m\hat{y}_m, \quad \sum w_m = 1$$

**3. Stacking**: Train meta-model on base model predictions
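
For the first two schemes, a one-sample example with hypothetical model outputs and weights shows the arithmetic:

```python
import numpy as np

# Hypothetical P(up) from three models for a single sample
probs = np.array([0.60, 0.70, 0.40])
weights = np.array([0.5, 0.3, 0.2])   # illustrative weights, must sum to 1

simple_avg = probs.mean()
weighted_avg = np.dot(weights, probs)  # 0.5*0.60 + 0.3*0.70 + 0.2*0.40 = 0.59

print(f"simple: {simple_avg:.4f}, weighted: {weighted_avg:.4f}")
```

In practice the weights would be chosen on validation data (for example, proportional to each model's walk-forward performance), never on the test set.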

### Why Ensembles Work

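**Variance Reduction**: If the $M$ models have uncorrelated errors with equal variance $\sigma^2$:

$$\text{Var}\left(\frac{1}{M}\sum_{m=1}^{M}\hat{y}_m\right) = \frac{\sigma^2}{M}$$

Variance decreases with more models. If the errors instead share a pairwise correlation $\rho$, the variance is $\rho\sigma^2 + \frac{1-\rho}{M}\sigma^2$, so adding models cannot push it below $\rho\sigma^2$.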
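
A small simulation can check the $\sigma^2/M$ result and show how correlated errors limit the benefit. The setup is an assumption for illustration: each model's error is built from a shared component plus an independent one, so that every pair of models has error correlation `rho`.

```python
import numpy as np

rng = np.random.default_rng(0)
n_draws, M, sigma = 200_000, 10, 1.0

def ensemble_error_variance(rho):
    """Variance of the averaged error of M models with pairwise correlation rho."""
    common = rng.normal(size=(n_draws, 1))   # component shared by all models
    idio = rng.normal(size=(n_draws, M))     # model-specific component
    errors = sigma * (np.sqrt(rho) * common + np.sqrt(1 - rho) * idio)
    return errors.mean(axis=1).var()

var_uncorr = ensemble_error_variance(0.0)  # theory: sigma^2 / M = 0.10
var_corr = ensemble_error_variance(0.5)    # theory: rho*sigma^2 + (1-rho)*sigma^2/M = 0.55

print(f"uncorrelated: {var_uncorr:.3f}, correlated (rho=0.5): {var_corr:.3f}")
```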

### Diversity is Key

Ensemble models should be **diverse**:
- Different algorithms (RF, XGBoost, Linear)
- Different features
- Different time windows

In [6]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import VotingClassifier

# Create diverse ensemble
model1 = RandomForestClassifier(n_estimators=50, max_depth=4, random_state=42)
model2 = GradientBoostingClassifier(n_estimators=50, max_depth=3, random_state=42)
model3 = LogisticRegression(random_state=42)

# Voting ensemble
ensemble = VotingClassifier(
    estimators=[
        ('rf', model1),
        ('gb', model2),
        ('lr', model3)
    ],
    voting='soft'  # Use predicted probabilities
)

# Fit all models
ensemble.fit(X_train, y_train)

print("Ensemble Model Comparison")
print("="*50)
print(f"\n{'Model':<25} | {'Train Acc':>10} | {'Test Acc':>10}")
print("-"*50)

for name, model in ensemble.named_estimators_.items():
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    print(f"{name:<25} | {train_acc:>10.2%} | {test_acc:>10.2%}")

print("-"*50)
print(f"{'ENSEMBLE':<25} | {ensemble.score(X_train, y_train):>10.2%} | {ensemble.score(X_test, y_test):>10.2%}")

Ensemble Model Comparison

Model                     |  Train Acc |   Test Acc
--------------------------------------------------
rf                        |     75.25% |     69.00%
gb                        |     78.62% |     68.50%
lr                        |     66.38% |     63.50%
--------------------------------------------------
ENSEMBLE                  |     77.12% |     68.50%


### Stacking Example

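- **Level 1**: Base models make predictions
- **Level 2**: Meta-model learns to combine them

Train the meta-model on out-of-fold predictions, so that every base-model prediction it sees comes from a model that never trained on that row. Otherwise the meta-model learns from overfit in-sample predictions.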

In [7]:
from sklearn.ensemble import StackingClassifier

# Stacking ensemble
stacking = StackingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(n_estimators=50, max_depth=4, random_state=42)),
        ('gb', GradientBoostingClassifier(n_estimators=50, max_depth=3, random_state=42))
    ],
    final_estimator=LogisticRegression(),
    cv=5
)

stacking.fit(X_train, y_train)

print("\nStacking Ensemble")
print("="*50)
print(f"Base models: Random Forest, Gradient Boosting")
print(f"Meta-model: Logistic Regression")
print(f"\nTest Accuracy: {stacking.score(X_test, y_test):.2%}")


Stacking Ensemble
Base models: Random Forest, Gradient Boosting
Meta-model: Logistic Regression

Test Accuracy: 67.50%
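
With an integer `cv`, `StackingClassifier` builds its out-of-fold predictions using stratified k-fold, which is not time-ordered, so some base-model predictions come from models trained on later data. One possible workaround, sketched here under assumptions (synthetic data, hypothetical variable names, two arbitrary base models), is to build the out-of-fold predictions manually with `TimeSeriesSplit` and fit the meta-model only on rows that received a prediction:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(0)
n = 600
X_demo = rng.normal(size=(n, 3))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=n) > 0).astype(int)

base_models = [
    RandomForestClassifier(n_estimators=25, max_depth=3, random_state=0),
    LogisticRegression(),
]

# Level 1: each row's prediction comes from models trained only on earlier rows
oof = np.full((n, len(base_models)), np.nan)
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X_demo):
    for j, model in enumerate(base_models):
        model.fit(X_demo[train_idx], y_demo[train_idx])
        oof[test_idx, j] = model.predict_proba(X_demo[test_idx])[:, 1]

# The earliest block is never a test fold, so it has no out-of-fold prediction
has_oof = ~np.isnan(oof).any(axis=1)

# Level 2: meta-model combines the base-model probabilities
meta_model = LogisticRegression().fit(oof[has_oof], y_demo[has_oof])
print(f"rows used for meta-model: {has_oof.sum()} of {n}")
print(f"meta-model weights: {meta_model.coef_.round(3)}")
```

The cost of respecting time order is that the first block of data contributes no meta-features, so the meta-model trains on fewer rows than with ordinary k-fold.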


---

## Summary: Week 8 Key Concepts

| Method | Key Idea | Pros | Cons |
|--------|----------|------|------|
| Decision Tree | Recursive splitting | Interpretable | Overfits |
| Random Forest | Bagging + feature randomization | Robust, parallel | Memory intensive |
| Gradient Boosting | Sequential error correction | Often best accuracy | Slower, overfits |
| Ensemble | Combine diverse models | Reduces variance | Complexity |

### Key Formulas

| Concept | Formula |
|---------|--------|
| Gini Impurity | $G = 1 - \sum p_k^2$ |
| Information Gain | $\text{IG} = \text{Impurity}_{parent} - \sum_j \frac{n_j}{n} \text{Impurity}_{child_j}$ |
| Boosting Update | $F_m = F_{m-1} + \eta \cdot h_m$ |
| Ensemble Variance (uncorrelated errors) | $\text{Var}(\bar{y}) = \sigma^2/M$ |

### Trading-Specific Tips

1. **Always use walk-forward validation** (never random k-fold)
2. **Lag all features** to avoid look-ahead bias
3. **Feature importance** helps with interpretability
4. **Ensemble diverse models** for robustness
5. **Tune for stability**, not just accuracy

---

*Next Week: Deep Learning Fundamentals*