# Week 4: Machine Learning Basics for Trading

---

## Table of Contents
1. Supervised vs Unsupervised Learning
2. Train-Test Split
3. Cross-Validation
4. Bias-Variance Tradeoff
5. Feature Engineering

---

In [1]:
# Standard imports and data loading
import numpy as np
import pandas as pd
import yfinance as yf
from datetime import datetime, timedelta

# Standard 5 equities for analysis
tickers = ['AAPL', 'MSFT', 'GOOGL', 'JPM', 'GS']

# Fetch 5 years of data
end_date = datetime.now()
start_date = end_date - timedelta(days=5*365)

print("📥 Downloading market data...")
data = yf.download(tickers, start=start_date, end=end_date, progress=False, auto_adjust=True)
prices = data['Close'].dropna()
returns = prices.pct_change().dropna()
print(f"✅ Loaded {len(prices)} days of data for {len(tickers)} tickers")
print(f"📅 Date range: {prices.index[0].strftime('%Y-%m-%d')} to {prices.index[-1].strftime('%Y-%m-%d')}")
print(prices.tail())

📥 Downloading market data...
✅ Loaded 1255 days of data for 5 tickers
📅 Date range: 2021-01-25 to 2026-01-22
Ticker            AAPL       GOOGL          GS         JPM        MSFT
Date                                                                  
2026-01-15  258.209991  332.779999  975.859985  309.260010  456.660004
2026-01-16  255.529999  330.000000  962.000000  312.470001  459.859985
2026-01-20  246.699997  322.000000  943.369995  302.739990  454.519989
2026-01-21  247.649994  328.380005  953.010010  302.040009  444.109985
2026-01-22  249.695007  331.475006  965.546692  306.709991  449.884491


## 1. Supervised vs Unsupervised Learning

### Supervised Learning

Learn from labeled data to predict outcomes.

**Goal**: Learn function $f$ such that $Y = f(X) + \epsilon$

**Types**:
- **Regression**: Predict continuous values (e.g., next day's return)
- **Classification**: Predict categories (e.g., up/down/neutral)

**Trading Applications**:
- Return prediction
- Direction forecasting
- Credit default prediction
- Trade execution optimization
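
As a minimal sketch of the classification case, the snippet below predicts return direction (up or down) from a single signal. The data is synthetic and the names `signal`, `next_return`, and `clf` are illustrative assumptions, not part of the course dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Hypothetical synthetic data: one momentum-like signal that weakly drives returns
n_obs = 1000
signal = rng.normal(0, 1, n_obs)
next_return = 0.3 * signal + rng.normal(0, 0.5, n_obs)

# Classification target: 1 if the return is positive, else 0
direction = (next_return > 0).astype(int)

# Chronological split: first 80% for training, last 20% for testing
split = int(n_obs * 0.8)
X_cls = signal.reshape(-1, 1)
clf = LogisticRegression()
clf.fit(X_cls[:split], direction[:split])

test_accuracy = accuracy_score(direction[split:], clf.predict(X_cls[split:]))
print(f"Test accuracy: {test_accuracy:.3f}")
print(f"Learned coefficient on signal: {clf.coef_[0, 0]:.3f}")
```

A regression model would instead be fit on `next_return` directly; the only change here is that the target is a label rather than a number.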

### Unsupervised Learning

Find patterns in data without labels.

**Types**:
- **Clustering**: Group similar assets
- **Dimensionality Reduction**: Find hidden factors
- **Anomaly Detection**: Identify unusual market events

**Trading Applications**:
- Regime detection
- Asset clustering for diversification
- Factor discovery
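
A rough sketch of asset clustering, assuming a hypothetical universe of six assets driven by two latent factors. No labels are supplied; `KMeans` groups assets purely by the similarity of their return histories.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical universe: 6 assets, the first 3 follow factor A, the last 3 factor B
n_days = 500
factor_a = rng.normal(0, 0.01, n_days)
factor_b = rng.normal(0, 0.01, n_days)
asset_returns = np.column_stack(
    [factor_a + rng.normal(0, 0.003, n_days) for _ in range(3)]
    + [factor_b + rng.normal(0, 0.003, n_days) for _ in range(3)]
)

# Each asset is one sample; its features are its daily returns
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(asset_returns.T)
print("Cluster label per asset:", labels)
```

The cluster numbers themselves are arbitrary; what matters is which assets share a label. Choosing `n_clusters=2` is an assumption that matches how the data was simulated.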

In [2]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.cluster import KMeans

np.random.seed(42)

# Create sample data: 3 features predicting returns
n_samples = 1000
momentum = np.random.normal(0, 1, n_samples)     # Momentum signal
value = np.random.normal(0, 1, n_samples)        # Value signal
volatility = np.abs(np.random.normal(0, 1, n_samples))  # Volatility

# True relationship (with noise)
true_returns = 0.3 * momentum + 0.2 * value - 0.1 * volatility + np.random.normal(0, 0.5, n_samples)

# Create DataFrame
X = pd.DataFrame({
    'momentum': momentum,
    'value': value,
    'volatility': volatility
})
y = true_returns

print("SUPERVISED LEARNING EXAMPLE")
print("="*50)
print(f"Features (X): {list(X.columns)}")
print(f"Target (y): Stock returns")
print(f"Samples: {n_samples}")
print(f"\nWe want to learn: Return = f(momentum, value, volatility)")

SUPERVISED LEARNING EXAMPLE
Features (X): ['momentum', 'value', 'volatility']
Target (y): Stock returns
Samples: 1000

We want to learn: Return = f(momentum, value, volatility)


---

## 2. Train-Test Split

### Why Split Data?

**Overfitting**: Model memorizes training data but fails on new data.

**Solution**: Hold out data to test generalization.

### Standard Split

$$\text{Data} = \text{Training Set } (70\text{–}80\%) + \text{Test Set } (20\text{–}30\%)$$

### Time Series Split (Critical for Finance!)

**WRONG**: Random split (future data in training set = look-ahead bias)

**RIGHT**: Chronological split

```
Timeline: ---|----Training----|----Test----|->
             Start           Split        End
```

In [3]:
from sklearn.model_selection import train_test_split, TimeSeriesSplit

# WRONG: Random split (introduces look-ahead bias)
X_train_wrong, X_test_wrong, y_train_wrong, y_test_wrong = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# RIGHT: Chronological split for time series
split_idx = int(len(X) * 0.8)
X_train = X.iloc[:split_idx]
X_test = X.iloc[split_idx:]
y_train = y[:split_idx]
y_test = y[split_idx:]

print("Train-Test Split for Time Series")
print("="*50)
print(f"\n❌ WRONG (Random Split):")
print(f"   Training indices include: {sorted(X_train_wrong.index[:5].tolist())}... (mixed!)")
print(f"   This causes look-ahead bias!")

print(f"\n✓ CORRECT (Chronological Split):")
print(f"   Training: indices 0 to {split_idx-1} ({len(X_train)} samples)")
print(f"   Testing: indices {split_idx} to {len(X)-1} ({len(X_test)} samples)")
print(f"   No future information leaks into training!")

Train-Test Split for Time Series

❌ WRONG (Random Split):
   Training indices include: [29, 535, 557, 695, 836]... (mixed!)
   This causes look-ahead bias!

✓ CORRECT (Chronological Split):
   Training: indices 0 to 799 (800 samples)
   Testing: indices 800 to 999 (200 samples)
   No future information leaks into training!


### Training and Evaluating

**Process**:
1. Train model on training data only
2. Make predictions on test data
3. Compare predictions to actual values
4. Calculate performance metrics

In [4]:
from sklearn.metrics import mean_squared_error, r2_score

# Train linear regression
model = LinearRegression()
model.fit(X_train, y_train)

# Predictions
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

# Calculate metrics
train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))
train_r2 = r2_score(y_train, y_train_pred)
test_r2 = r2_score(y_test, y_test_pred)

print("Model Performance")
print("="*50)
print(f"\nCoefficients learned:")
for name, coef in zip(X.columns, model.coef_):
    print(f"  {name}: {coef:.4f}")

print(f"\n           | Training | Test     |")
print(f"  RMSE     | {train_rmse:.4f}   | {test_rmse:.4f}   |")
print(f"  R²       | {train_r2:.4f}   | {test_r2:.4f}   |")

if test_r2 < train_r2 * 0.8:
    print("\n⚠️ Warning: Possible overfitting (test R² << train R²)")
else:
    print("\n✓ Model generalizes well")

Model Performance

Coefficients learned:
  momentum: 0.2991
  value: 0.1687
  volatility: -0.1961

           | Training | Test     |
  RMSE     | 0.5115   | 0.5150   |
  R²       | 0.3192   | 0.2319   |



---

## 3. Cross-Validation

### Why Cross-Validation?

Single train-test split is unreliable:
- What if test period was unusual?
- We "waste" data (only train on 80%)

### K-Fold Cross-Validation

Split data into $K$ folds, train $K$ times:

```
Fold 1: [Test] [Train] [Train] [Train] [Train]
Fold 2: [Train] [Test] [Train] [Train] [Train]
Fold 3: [Train] [Train] [Test] [Train] [Train]
...
```

**Final score** = Average of all folds
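
The fold-and-average procedure can be sketched with scikit-learn's `KFold`. The data below is synthetic and the names `X_kf`, `y_kf`, and `future_flags` are assumptions for illustration. The flag records whether a fold trains on rows that come after its test block, which is the reason plain K-fold is risky for time series.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)

# Hypothetical synthetic data: 200 rows, 3 features, linear signal plus noise
X_kf = rng.normal(0, 1, (200, 3))
y_kf = X_kf @ np.array([0.3, 0.2, -0.1]) + rng.normal(0, 0.5, 200)

kfold = KFold(n_splits=5, shuffle=False)
fold_scores, future_flags = [], []
for fold, (train_idx, test_idx) in enumerate(kfold.split(X_kf), 1):
    model_kf = LinearRegression().fit(X_kf[train_idx], y_kf[train_idx])
    fold_scores.append(model_kf.score(X_kf[test_idx], y_kf[test_idx]))
    # True when some training rows come after the test block in time
    future_flags.append(bool(train_idx.max() > test_idx.max()))
    print(f"Fold {fold}: test rows {test_idx[0]}-{test_idx[-1]}, "
          f"R² = {fold_scores[-1]:.3f}, trains on later rows: {future_flags[-1]}")

# Final score = average of all folds
mean_score = float(np.mean(fold_scores))
print(f"Mean R² over {len(fold_scores)} folds: {mean_score:.3f}")
```

Every fold except the last trains on rows that follow its test block, so on real price data this scheme would leak future information.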

### Time Series Cross-Validation (Walk-Forward)

For time series, use **expanding window**:

```
Fold 1: [Train    ] [Test]
Fold 2: [Train         ] [Test]
Fold 3: [Train              ] [Test]
Fold 4: [Train                   ] [Test]
```

This mimics real trading: train on history, test on future.

In [5]:
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Time Series Cross-Validation
tscv = TimeSeriesSplit(n_splits=5)

print("Time Series Cross-Validation (Walk-Forward)")
print("="*50)

cv_scores = []
for fold, (train_idx, test_idx) in enumerate(tscv.split(X), 1):
    # Split data
    X_tr, X_te = X.iloc[train_idx], X.iloc[test_idx]
    y_tr, y_te = y[train_idx], y[test_idx]
    
    # Train and evaluate
    model = LinearRegression()
    model.fit(X_tr, y_tr)
    score = model.score(X_te, y_te)
    cv_scores.append(score)
    
    print(f"Fold {fold}: Train[{train_idx[0]:3d}-{train_idx[-1]:3d}] → "
          f"Test[{test_idx[0]:3d}-{test_idx[-1]:3d}] | R² = {score:.4f}")

print(f"\nMean R²: {np.mean(cv_scores):.4f} (±{np.std(cv_scores):.4f})")
print("\n✓ This tests model across different market periods!")

Time Series Cross-Validation (Walk-Forward)
Fold 1: Train[  0-169] → Test[170-335] | R² = 0.2796
Fold 2: Train[  0-335] → Test[336-501] | R² = 0.2712
Fold 3: Train[  0-501] → Test[502-667] | R² = 0.3181
Fold 4: Train[  0-667] → Test[668-833] | R² = 0.3009
Fold 5: Train[  0-833] → Test[834-999] | R² = 0.2004

Mean R²: 0.2741 (±0.0403)

✓ This tests model across different market periods!


---

## 4. Bias-Variance Tradeoff

### The Fundamental Tradeoff

**Total Error** = Bias² + Variance + Irreducible Noise

$$E[(y - \hat{f}(x))^2] = \text{Bias}^2[\hat{f}(x)] + \text{Var}[\hat{f}(x)] + \sigma^2$$

### Definitions

**Bias**: Error from wrong assumptions (underfitting)
$$\text{Bias}[\hat{f}(x)] = E[\hat{f}(x)] - f(x)$$

**Variance**: Error from sensitivity to training data (overfitting)
$$\text{Var}[\hat{f}(x)] = E[(\hat{f}(x) - E[\hat{f}(x)])^2]$$
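
Both quantities can be estimated by simulation: refit the same model on many noisy training sets, then compare the average prediction with the true function (bias) and measure how much predictions scatter around that average (variance). The sketch below assumes a hypothetical true signal `true_f` and a fixed set of training inputs, so only the noise changes between repeats.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    """Hypothetical true signal, used only for this illustration."""
    return np.sin(2 * np.pi * x)

x_fixed = np.linspace(0, 1, 30)   # training inputs (fixed design)
x_grid = np.linspace(0, 1, 50)    # points where predictions are compared
noise_sd, n_repeats = 0.3, 500

def estimate_bias_variance(degree):
    """Refit a polynomial on many noisy training sets and decompose its error."""
    preds = np.empty((n_repeats, x_grid.size))
    for i in range(n_repeats):
        y_noisy = true_f(x_fixed) + rng.normal(0, noise_sd, x_fixed.size)
        coefs = np.polyfit(x_fixed, y_noisy, degree)
        preds[i] = np.polyval(coefs, x_grid)
    mean_pred = preds.mean(axis=0)                         # estimate of E[f_hat(x)]
    bias_sq = np.mean((mean_pred - true_f(x_grid)) ** 2)   # Bias² averaged over x
    variance = np.mean(preds.var(axis=0))                  # Var averaged over x
    return bias_sq, variance

results = {d: estimate_bias_variance(d) for d in (1, 3, 9)}
for d, (b2, var) in results.items():
    print(f"Degree {d}: bias² = {b2:.4f}, variance = {var:.4f}")
```

In this setup the straight line should show the largest bias and the smallest variance, with the ordering reversed for the degree-9 polynomial.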

### The Tradeoff

| Model Complexity | Bias | Variance | Typical Result |
|-----------------|------|----------|----------------|
| Too Simple | High | Low | Underfitting |
| Just Right | Balanced | Balanced | Good Generalization |
| Too Complex | Low | High | Overfitting |

In [6]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline

# Demonstrate bias-variance with polynomial regression
np.random.seed(42)

# True signal: simple linear relationship
X_simple = np.linspace(0, 1, 100).reshape(-1, 1)
y_true = 2 * X_simple.ravel() + np.random.normal(0, 0.3, 100)

# Split
X_tr, X_te = X_simple[:70], X_simple[70:]
y_tr, y_te = y_true[:70], y_true[70:]

print("Bias-Variance Tradeoff Demo")
print("="*50)
print("\nTrue relationship: y = 2x + noise")
print("\nModel Complexity Comparison:")
print("-"*50)

for degree in [1, 3, 15]:
    # Create polynomial features
    model = Pipeline([
        ('poly', PolynomialFeatures(degree=degree)),
        ('linear', LinearRegression())
    ])
    
    model.fit(X_tr, y_tr)
    
    train_err = mean_squared_error(y_tr, model.predict(X_tr))
    test_err = mean_squared_error(y_te, model.predict(X_te))
    
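    # Crude labelling heuristic: a train/test MSE gap under 0.1 counts as "Good";
    # otherwise a high training error suggests underfitting, a low one overfitting
    status = "Good" if abs(train_err - test_err) < 0.1 else \
             ("Underfit" if train_err > 0.1 else "Overfit")
    
    print(f"Degree {degree:2d}: Train MSE={train_err:.4f}, Test MSE={test_err:.4f} → {status}")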

Bias-Variance Tradeoff Demo

True relationship: y = 2x + noise

Model Complexity Comparison:
--------------------------------------------------
Degree  1: Train MSE=0.0717, Test MSE=0.0771 → Good
Degree  3: Train MSE=0.0658, Test MSE=0.1448 → Good
Degree 15: Train MSE=0.0558, Test MSE=895361478770.1882 → Overfit


### Regularization

Add penalty for complexity to prevent overfitting:

**Ridge (L2)**: $\text{Loss} = MSE + \lambda \sum \beta_j^2$

**Lasso (L1)**: $\text{Loss} = MSE + \lambda \sum |\beta_j|$

- Larger $\lambda$ = more regularization = simpler model
- Lasso can shrink coefficients to exactly zero (feature selection)
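
To illustrate the feature-selection point, here is a small sketch on synthetic data with ten candidate signals, of which only the first two matter. The data, the names `signals` and `fwd_returns`, and the penalty strengths are assumptions chosen for illustration; scikit-learn calls the penalty weight `alpha` rather than $\lambda$.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)

# Hypothetical data: 10 candidate signals, only the first two drive returns
n_obs, n_signals = 500, 10
signals = rng.normal(0, 1, (n_obs, n_signals))
fwd_returns = 0.3 * signals[:, 0] + 0.2 * signals[:, 1] + rng.normal(0, 0.5, n_obs)

ridge = Ridge(alpha=1.0).fit(signals, fwd_returns)
lasso = Lasso(alpha=0.1).fit(signals, fwd_returns)

# Count coefficients that are exactly zero under each penalty
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
print("Ridge coefficients:", ridge.coef_.round(3))
print("Lasso coefficients:", lasso.coef_.round(3))
print(f"Exactly-zero coefficients: Ridge = {n_zero_ridge}, Lasso = {n_zero_lasso}")
```

Ridge shrinks every coefficient a little but leaves them all nonzero, while Lasso sets some of the irrelevant ones to exactly zero.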

In [7]:
from sklearn.linear_model import Ridge, Lasso

# Compare regularization strengths
print("Effect of Regularization (Ridge)")
print("="*50)

# Reuse the three-feature train/test split from above (alpha is scikit-learn's name for λ)
for alpha in [0.01, 0.1, 1.0, 10.0]:
    model = Ridge(alpha=alpha)
    model.fit(X_train, y_train)
    
    train_r2 = model.score(X_train, y_train)
    test_r2 = model.score(X_test, y_test)
    
    print(f"λ = {alpha:5.2f}: Train R²={train_r2:.4f}, Test R²={test_r2:.4f}, "
          f"Coefs magnitude: {np.sum(model.coef_**2):.4f}")

Effect of Regularization (Ridge)
λ =  0.01: Train R²=0.3192, Test R²=0.2319, Coefs magnitude: 0.1564
λ =  0.10: Train R²=0.3192, Test R²=0.2319, Coefs magnitude: 0.1563
λ =  1.00: Train R²=0.3192, Test R²=0.2323, Coefs magnitude: 0.1558
λ = 10.00: Train R²=0.3192, Test R²=0.2354, Coefs magnitude: 0.1505


---

## 5. Feature Engineering

### Why Feature Engineering?

**"Garbage in, garbage out"**

Raw data is rarely suitable for ML. We must create meaningful features.

### Common Trading Features

**Price-Based**:
- Returns: $r_t = \frac{P_t - P_{t-1}}{P_{t-1}}$
- Log returns: $r_t = \ln(P_t / P_{t-1})$
- Moving averages: $MA_n = \frac{1}{n}\sum_{i=0}^{n-1} P_{t-i}$

**Momentum**:
- RSI: $RSI = 100 - \frac{100}{1 + RS}$ where $RS = \frac{\text{Avg Gain}}{\text{Avg Loss}}$
- MACD: $EMA_{12} - EMA_{26}$
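
One way to compute the two momentum formulas with pandas is sketched below. The 14-day RSI window and the use of simple rolling averages for gains and losses are assumptions (Wilder's original RSI uses a smoothed average), and `close` is a simulated price series.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical simulated closing prices
close = pd.Series(100 * np.cumprod(1 + rng.normal(0.0005, 0.015, 300)))

def rsi(price, window=14):
    """RSI = 100 - 100 / (1 + RS), with RS = average gain / average loss."""
    delta = price.diff()
    avg_gain = delta.clip(lower=0).rolling(window).mean()
    avg_loss = (-delta.clip(upper=0)).rolling(window).mean()
    rs = avg_gain / avg_loss
    return 100 - 100 / (1 + rs)

def macd(price, fast=12, slow=26):
    """MACD line: fast EMA minus slow EMA."""
    ema_fast = price.ewm(span=fast, adjust=False).mean()
    ema_slow = price.ewm(span=slow, adjust=False).mean()
    return ema_fast - ema_slow

rsi_14 = rsi(close)
macd_line = macd(close)
print(rsi_14.dropna().tail(3).round(2))
print(macd_line.tail(3).round(4))
```

Both features use only past and current prices at each date, so they can be fed to a model without introducing look-ahead bias.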

**Volatility**:
- Rolling std dev
- True Range
- Bollinger Band width

**Volume**:
- Volume moving average
- On-balance volume
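
The volatility and volume features can be sketched in the same style. The price and volume series are simulated, and the 20-day window and 2-standard-deviation bands are conventional choices assumed here. True Range is omitted because it needs high and low prices, which this toy series does not have.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 300
# Hypothetical simulated closing prices and daily volumes
close = pd.Series(100 * np.cumprod(1 + rng.normal(0.0005, 0.015, n)))
volume = pd.Series(rng.integers(1_000_000, 5_000_000, n)).astype(float)

window = 20
mid = close.rolling(window).mean()
sd = close.rolling(window).std()

# Bollinger Band width: (upper - lower) / middle, bands at +/- 2 standard deviations
bb_width = (4 * sd) / mid

# Volume relative to its own moving average
volume_ratio = volume / volume.rolling(window).mean()

# On-balance volume: add volume on up days, subtract it on down days
day_sign = np.sign(close.diff()).fillna(0)
obv = (day_sign * volume).cumsum()

features_df = pd.DataFrame({"bb_width": bb_width, "volume_ratio": volume_ratio, "obv": obv})
print(features_df.dropna().head().round(4))
```

The first 19 rows of the rolling features are `NaN` because a full 20-day window is not yet available.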

In [8]:
# Feature engineering example
np.random.seed(42)

# Simulate price data
n_days = 500
returns = np.random.normal(0.0005, 0.015, n_days)
prices = 100 * np.cumprod(1 + returns)

df = pd.DataFrame({
    'price': prices,
    'return': returns
})

# Create features
df['return_1d'] = df['price'].pct_change()           # 1-day return
df['return_5d'] = df['price'].pct_change(5)          # 5-day return (momentum)
df['ma_20'] = df['price'].rolling(20).mean()         # 20-day moving average
df['ma_50'] = df['price'].rolling(50).mean()         # 50-day moving average
df['ma_ratio'] = df['ma_20'] / df['ma_50']           # MA crossover signal
df['volatility_20'] = df['return_1d'].rolling(20).std()  # 20-day volatility

# Price relative to MA (mean reversion signal)
df['price_ma_ratio'] = df['price'] / df['ma_20']

print("Feature Engineering Example")
print("="*50)
print("\nCreated features from raw price data:")
print(df[['price', 'return_1d', 'return_5d', 'ma_ratio', 'volatility_20']].dropna().head(10).round(4))

Feature Engineering Example

Created features from raw price data:
      price  return_1d  return_5d  ma_ratio  volatility_20
49  86.1447    -0.0259    -0.0210    0.9388         0.0150
50  86.6065     0.0054    -0.0055    0.9386         0.0151
51  86.1496    -0.0053    -0.0044    0.9369         0.0132
52  85.3179    -0.0097    -0.0298    0.9351         0.0132
53  86.1434     0.0097    -0.0260    0.9348         0.0134
54  87.5186     0.0160     0.0159    0.9343         0.0136
55  88.7850     0.0145     0.0252    0.9350         0.0138
56  87.7117    -0.0121     0.0181    0.9357         0.0139
57  87.3487    -0.0041     0.0238    0.9379         0.0125
58  87.8265     0.0055     0.0195    0.9411         0.0118


### Feature Scaling

Many ML algorithms require scaled features:

**Standardization** (Z-score):
$$X_{scaled} = \frac{X - \mu}{\sigma}$$

**Min-Max Scaling**:
$$X_{scaled} = \frac{X - X_{min}}{X_{max} - X_{min}}$$

**Important**: Fit scaler on training data only, transform both train and test!
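
The same rule applies to min-max scaling. In the sketch below (synthetic data, illustrative names), the scaler learns the minimum and maximum from the training rows only, so scaled test values can legitimately fall outside $[0, 1]$.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
# Hypothetical drifting feature, so later values may leave the training range
feature = np.cumsum(rng.normal(0.05, 1.0, 500)).reshape(-1, 1)

split = 400
train_part, test_part = feature[:split], feature[split:]

scaler_mm = MinMaxScaler()
train_scaled = scaler_mm.fit_transform(train_part)  # learns min and max from train only
test_scaled = scaler_mm.transform(test_part)        # reuses the training min and max

print(f"Train range: [{train_scaled.min():.3f}, {train_scaled.max():.3f}]")
print(f"Test range:  [{test_scaled.min():.3f}, {test_scaled.max():.3f}]")
```

Fitting the scaler on the full series instead would let the test period's extremes influence the training features, which is a subtle form of look-ahead bias.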

In [9]:
from sklearn.preprocessing import StandardScaler

# Prepare features and target
features = ['return_5d', 'ma_ratio', 'volatility_20', 'price_ma_ratio']
df_clean = df.dropna().copy()

# Target: Next day's return
df_clean['target'] = df_clean['return_1d'].shift(-1)
df_clean = df_clean.dropna()

# Split chronologically
split_idx = int(len(df_clean) * 0.8)
train_df = df_clean.iloc[:split_idx]
test_df = df_clean.iloc[split_idx:]

# Fit scaler on TRAINING DATA ONLY
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(train_df[features])
X_test_scaled = scaler.transform(test_df[features])  # Use same params!

print("Feature Scaling")
print("="*50)
print("\n⚠️ Key: Fit scaler on training data, transform both!")
print("\nBefore scaling (training data):")
print(train_df[features].describe().loc[['mean', 'std']].round(4))

print("\nAfter scaling (training data):")
print(f"Mean: ~0, Std: ~1 for all features")
print(f"Actual - Mean: {X_train_scaled.mean(axis=0).round(4)}")
print(f"         Std:  {X_train_scaled.std(axis=0).round(4)}")

Feature Scaling

⚠️ Key: Fit scaler on training data, transform both!

Before scaling (training data):
      return_5d  ma_ratio  volatility_20  price_ma_ratio
mean     0.0058    1.0125         0.0142          1.0101
std      0.0293    0.0294         0.0026          0.0310

After scaling (training data):
Mean: ~0, Std: ~1 for all features
Actual - Mean: [0. 0. 0. 0.]
         Std:  [1. 1. 1. 1.]


---

## Summary: Week 4 Key Concepts

| Concept | Key Point |
|---------|----------|
| Supervised Learning | Predict from labeled data (regression/classification) |
| Train-Test Split | Chronological for time series (no look-ahead!) |
| Cross-Validation | Walk-forward for robust evaluation |
| Bias-Variance | Simple=high bias, Complex=high variance |
| Regularization | Ridge (L2), Lasso (L1) prevent overfitting |
| Feature Engineering | Create meaningful inputs from raw data |
| Feature Scaling | Fit on train, transform both |

---

*Next Week: Portfolio Theory*

## 🔴 PROS & CONS: THEORY

### ✅ PROS (Advantages)

| Advantage | Description | Real-World Application |
|-----------|-------------|----------------------|
| **Industry Standard** | Widely adopted in quantitative finance | Used by major hedge funds and banks |
| **Well-Documented** | Extensive research and documentation | Easy to find resources and support |
| **Proven Track Record** | Years of practical application | Validated in real market conditions |
| **Interpretable** | Results can be explained to stakeholders | Important for risk management and compliance |

### ❌ CONS (Limitations)

| Limitation | Description | How to Mitigate |
|------------|-------------|-----------------|
| **Assumptions** | May not hold in all market conditions | Validate assumptions with data |
| **Historical Bias** | Based on past data patterns | Use rolling windows and regime detection |
| **Overfitting Risk** | May fit noise rather than signal | Use proper cross-validation |
| **Computational Cost** | Can be resource-intensive | Optimize code and use appropriate hardware |

### 🎯 Real-World Usage

**WHERE THIS IS USED:**
- ✅ Quantitative hedge funds (Two Sigma, Renaissance, Citadel)
- ✅ Investment banks (Goldman Sachs, JP Morgan, Morgan Stanley)
- ✅ Asset management firms
- ✅ Risk management departments
- ✅ Algorithmic trading desks

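**NOT JUST THEORY:**
The techniques in this notebook (chronological splits, walk-forward validation, regularization, and train-only feature scaling) are standard practice among professional quantitative teams. The code cells themselves are teaching examples that run mostly on synthetic data.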