# Quant Interview: Machine Learning for Trading

## 🎯 Overview
Common ML interview questions for quant trading positions at Two Sigma, Citadel, D. E. Shaw, and other top firms.

## ⏱️ Time Allocation
| Section | Duration |
|---------|----------|
| Overfitting & Regularization | 30 min |
| Feature Engineering | 30 min |
| Model Selection | 30 min |
| Trading-Specific ML | 30 min |

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge, Lasso, LinearRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score
import yfinance as yf
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')

plt.style.use('seaborn-v0_8-whitegrid')

print("="*70)
print("QUANT INTERVIEW: MACHINE LEARNING FOR TRADING")
print("="*70)

QUANT INTERVIEW: MACHINE LEARNING FOR TRADING


## Question 1: Why does overfitting happen in trading more than in other ML applications?

**Expected Answer Points:**
1. Low signal-to-noise ratio in financial data
2. Limited data (only one history)
3. Non-stationarity (market regimes change)
4. Multiple testing problem (many strategies tested)
5. Look-ahead bias temptation
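
The multiple testing point can be made concrete with a small simulation. The sketch below is illustrative only: it assumes 200 hypothetical strategies whose daily returns are pure Gaussian noise (the function name, strategy count, and 1% daily volatility are all assumptions, not part of the original notebook) and shows that the best of them can still look skilled over one year.

```python
import numpy as np

def best_sharpe_of_random_strategies(n_strategies=200, n_days=252, seed=0):
    """Annualized Sharpe ratios of pure-noise strategies, plus the best one."""
    rng = np.random.default_rng(seed)
    # Each column is one strategy's daily returns: zero mean, 1% vol, no skill
    daily = rng.normal(loc=0.0, scale=0.01, size=(n_days, n_strategies))
    sharpes = daily.mean(axis=0) / daily.std(axis=0, ddof=1) * np.sqrt(252)
    return sharpes, sharpes.max()

sharpes, best = best_sharpe_of_random_strategies()
print(f"Median Sharpe of noise strategies: {np.median(sharpes):.2f}")
print(f"Best Sharpe among them:            {best:.2f}")
```

Picking the winner after the fact is selection bias: the more variants are backtested, the higher the bar a reported Sharpe has to clear.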

In [2]:
# Demonstrate overfitting with real data
print("\n" + "="*60)
print("QUESTION 1: Overfitting Demonstration")
print("="*60)

# Download data
ticker = 'AAPL'
data = yf.download(ticker, start='2019-01-01', end='2024-01-01', progress=False, auto_adjust=True)
# squeeze() keeps `returns` a Series even if yfinance returns a one-column DataFrame
returns = data['Close'].squeeze().pct_change().dropna()

# Create features (lagged returns)
df = pd.DataFrame()
for lag in range(1, 21):
    df[f'lag_{lag}'] = returns.shift(lag)
df['target'] = returns
df = df.dropna()

X = df.drop('target', axis=1)
y = df['target']

# Split data
split = int(len(X) * 0.7)
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

# Fit models with different complexity
results = []

# Simple model (few features)
simple_model = LinearRegression()
simple_model.fit(X_train[['lag_1', 'lag_2', 'lag_3']], y_train)
train_r2_simple = r2_score(y_train, simple_model.predict(X_train[['lag_1', 'lag_2', 'lag_3']]))
test_r2_simple = r2_score(y_test, simple_model.predict(X_test[['lag_1', 'lag_2', 'lag_3']]))
results.append(('Simple (3 features)', train_r2_simple, test_r2_simple))

# Complex model (all features)
complex_model = LinearRegression()
complex_model.fit(X_train, y_train)
train_r2_complex = r2_score(y_train, complex_model.predict(X_train))
test_r2_complex = r2_score(y_test, complex_model.predict(X_test))
results.append(('Complex (20 features)', train_r2_complex, test_r2_complex))

# Ridge regularized
ridge_model = Ridge(alpha=1.0)
ridge_model.fit(X_train, y_train)
train_r2_ridge = r2_score(y_train, ridge_model.predict(X_train))
test_r2_ridge = r2_score(y_test, ridge_model.predict(X_test))
results.append(('Ridge (regularized)', train_r2_ridge, test_r2_ridge))

print("\nModel Comparison:")
print("-" * 60)
print(f"{'Model':<25} {'Train R²':>12} {'Test R²':>12} {'Gap':>10}")
print("-" * 60)
for name, train_r2, test_r2 in results:
    gap = train_r2 - test_r2
    print(f"{name:<25} {train_r2:>12.4f} {test_r2:>12.4f} {gap:>10.4f}")

print(f"\n✅ Key Insight: Complex model overfits (large train-test gap)")
print(f"   Ridge regularization reduces overfitting")


QUESTION 1: Overfitting Demonstration

Model Comparison:
------------------------------------------------------------
Model                         Train R²      Test R²        Gap
------------------------------------------------------------
Simple (3 features)             0.0232      -0.0235     0.0467
Complex (20 features)           0.0790      -0.0478     0.1268
Ridge (regularized)             0.0509      -0.0087     0.0596

✅ Key Insight: Complex model overfits (large train-test gap)
   Ridge regularization reduces overfitting


## Question 2: Why can't you use k-fold cross-validation for time series?

**Expected Answer:**
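- Standard k-fold, shuffled or not, trains on observations that come after the validation fold
- This lets future information leak into training
- Financial data has temporal dependencies (volatility clustering, overlapping labels), so neighbouring samples are not independent
- Solution: Use walk-forward validation (TimeSeriesSplit), where every training window ends before its validation window begins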
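
The difference between the two splitters can be checked directly on their indices. This is a minimal sketch on synthetic data; `future_leak_folds` is a hypothetical helper, and using `gap=5` as a rough stand-in for an embargo between train and validation windows is an assumption. Putting the scaler inside a pipeline means it is refit on each training window.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, TimeSeriesSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

n = 100
X_demo = np.arange(n).reshape(-1, 1)  # row index doubles as a timestamp

def future_leak_folds(cv, X):
    """Count folds whose training set contains rows later than the validation start."""
    return int(sum(train.max() > test.min() for train, test in cv.split(X)))

kfold_leaks = future_leak_folds(KFold(n_splits=5), X_demo)  # no shuffling at all
tscv_leaks = future_leak_folds(TimeSeriesSplit(n_splits=5, gap=5), X_demo)
print(kfold_leaks, tscv_leaks)  # 4 0

# Scaler + model in one pipeline, scored with walk-forward splits
rng = np.random.default_rng(0)
X_feat = rng.normal(size=(n, 3))
y_demo = rng.normal(size=n)
pipe = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
scores = cross_val_score(pipe, X_feat, y_demo,
                         cv=TimeSeriesSplit(n_splits=5, gap=5), scoring="r2")
```

Even without shuffling, four of the five k-fold splits train on data from after the validation fold; only the last fold is chronologically clean.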

In [3]:
# Demonstrate time series CV
print("\n" + "="*60)
print("QUESTION 2: Time Series Cross-Validation")
print("="*60)

from sklearn.model_selection import KFold

# Compare k-fold vs time series split
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
tscv = TimeSeriesSplit(n_splits=5)

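model = Ridge(alpha=1.0)
scaler = StandardScaler()
# Caveat: fitting the scaler on the full sample lets every fold see
# out-of-fold means and variances; a Pipeline refit per fold avoids this.
X_scaled = scaler.fit_transform(X)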

kfold_scores = cross_val_score(model, X_scaled, y, cv=kfold, scoring='r2')
tscv_scores = cross_val_score(model, X_scaled, y, cv=tscv, scoring='r2')

print("\nCross-Validation Comparison:")
print("-" * 50)
print(f"K-Fold CV (WRONG for time series):")
print(f"  Mean R²: {kfold_scores.mean():.4f} ± {kfold_scores.std():.4f}")
print(f"\nTime Series CV (CORRECT):")
print(f"  Mean R²: {tscv_scores.mean():.4f} ± {tscv_scores.std():.4f}")

print(f"\n✅ Key Insight: K-fold gives overly optimistic results")
print(f"   Time series CV is more realistic (and usually shows lower R²)")


QUESTION 2: Time Series Cross-Validation

Cross-Validation Comparison:
--------------------------------------------------
K-Fold CV (WRONG for time series):
  Mean R²: -0.0154 ± 0.0352

Time Series CV (CORRECT):
  Mean R²: -0.0386 ± 0.0595

✅ Key Insight: K-fold gives overly optimistic results
   Time series CV is more realistic (and usually shows lower R²)


## Question 3: What's a good R² for predicting returns?

**Expected Answer:**
- For daily returns: 1-2% R² is VERY good
- If you see R² > 10%, be suspicious (likely overfitting or data leakage)
- Financial markets are efficient, signal is weak
- Focus on information coefficient (IC) instead
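
The link between IC, R², and Sharpe can be sketched with the fundamental law of active management, Sharpe ≈ IC × √breadth. The helpers below are hypothetical names introduced for illustration, and they assume one independent bet per trading day, no transaction costs, and a forecast calibrated by least squares (so that in-sample R² ≈ IC²).

```python
import numpy as np

def ic_to_sharpe(ic, breadth):
    """Approximate annualized Sharpe from IC and independent bets per year."""
    return ic * np.sqrt(breadth)

def required_ic(target_sharpe, breadth):
    """IC needed to reach a target Sharpe at a given breadth."""
    return target_sharpe / np.sqrt(breadth)

def r2_from_ic(ic):
    """In-sample R² of a least-squares-calibrated linear forecast."""
    return ic ** 2

print(f"IC needed for Sharpe 1.0 at breadth 252: {required_ic(1.0, 252):.3f}")
print(f"Sharpe implied by IC 0.10 at breadth 252: {ic_to_sharpe(0.10, 252):.2f}")
print(f"R² implied by IC 0.10: {r2_from_ic(0.10):.2%}")
```

Out of sample the identity R² = IC² no longer holds: a forecast with a slightly positive IC but the wrong scale or offset can still produce a negative R², which is one reason IC is the more informative metric.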

In [4]:
# Calculate realistic R² expectations
print("\n" + "="*60)
print("QUESTION 3: Realistic R² Expectations")
print("="*60)

# Calculate IC (Information Coefficient = correlation between predictions and actual)
predictions = ridge_model.predict(X_test)
ic = np.corrcoef(predictions, y_test)[0, 1]

# Calculate Sharpe from IC (approximation, fundamental law of active management)
# Sharpe ≈ IC × √(breadth)
breadth = 252  # Number of bets per year (daily trading)
implied_sharpe = ic * np.sqrt(breadth)

print(f"\nModel Performance:")
print(f"  R²: {test_r2_ridge:.4f} ({test_r2_ridge*100:.2f}%)")
print(f"  IC (correlation): {ic:.4f}")
print(f"  Implied Sharpe: {implied_sharpe:.2f}")

print(f"\nBenchmarks for Daily Return Prediction:")
print("-" * 40)
print(f"  R² > 10%  → Suspicious (check for bugs)")
print(f"  R² = 2-5% → Very good")
print(f"  R² = 1-2% → Good")
print(f"  R² < 1%   → Normal")

print(f"\n✅ Key Insight: Even 1% R² can be highly profitable")
print(f"   Sharpe 1.0 only requires IC ≈ 0.06 with daily trading")


QUESTION 3: Realistic R² Expectations

Model Performance:
  R²: -0.0087 (-0.87%)
  IC (correlation): 0.0074
  Implied Sharpe: 0.12

Benchmarks for Daily Return Prediction:
----------------------------------------
  R² > 10%  → Suspicious (check for bugs)
  R² = 2-5% → Very good
  R² = 1-2% → Good
  R² < 1%   → Normal

✅ Key Insight: Even 1% R² can be highly profitable
   Sharpe 1.0 only requires IC ≈ 0.06 with daily trading


## Question 4: Ridge vs Lasso - When to use each?

**Expected Answer:**
- **Ridge**: When all features might be relevant, shrinks coefficients
- **Lasso**: When you want feature selection, sets some coefficients to zero
- **Trading context**: Lasso for alpha factors (sparse selection), Ridge for risk models
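
How much Lasso selects depends entirely on `alpha` relative to the scale of the data. For scikit-learn's Lasso objective, every coefficient is zero once `alpha` reaches max|Xᵀy|/n computed on centered data. The sketch below uses synthetic return-sized data (the 0.02 scale, the weak signal in column 0, and the name `alpha_max` are assumptions for illustration) to show how small that threshold can be.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 1000, 20
X_syn = rng.normal(scale=0.02, size=(n, p))                   # return-sized features
y_syn = 0.05 * X_syn[:, 0] + rng.normal(scale=0.02, size=n)   # weak signal in column 0

# Smallest alpha at which the all-zero solution is optimal
Xc = X_syn - X_syn.mean(axis=0)
yc = y_syn - y_syn.mean()
alpha_max = np.abs(Xc.T @ yc).max() / n

n_above = np.count_nonzero(Lasso(alpha=alpha_max * 1.01).fit(X_syn, y_syn).coef_)
n_below = np.count_nonzero(Lasso(alpha=alpha_max * 0.5).fit(X_syn, y_syn).coef_)

print(f"alpha_max = {alpha_max:.2e}")
print(f"non-zero coefficients just above alpha_max: {n_above}")
print(f"non-zero coefficients at half of alpha_max: {n_below}")
```

On data of this scale the threshold sits well below 0.001, so a fixed `alpha=0.001` would select nothing. Standardizing the features or tuning `alpha` with time-ordered cross-validation (for example `LassoCV` with a `TimeSeriesSplit`) are common ways to get a meaningful sparse fit.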

In [5]:
# Compare Ridge vs Lasso coefficients
print("\n" + "="*60)
print("QUESTION 4: Ridge vs Lasso")
print("="*60)

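ridge = Ridge(alpha=1.0)
# alpha is on the scale of the data: with daily returns (~1e-2) and unscaled
# features, 0.001 is a heavy L1 penalty and can zero every coefficient.
lasso = Lasso(alpha=0.001)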

ridge.fit(X_train, y_train)
lasso.fit(X_train, y_train)

# Count non-zero coefficients
ridge_nonzero = np.sum(np.abs(ridge.coef_) > 1e-6)
lasso_nonzero = np.sum(np.abs(lasso.coef_) > 1e-6)

print(f"\nCoefficient Analysis:")
print("-" * 40)
print(f"Ridge: {ridge_nonzero}/20 non-zero coefficients")
print(f"Lasso: {lasso_nonzero}/20 non-zero coefficients")

print(f"\nTop 5 features by coefficient magnitude:")
print("\nRidge:")
ridge_coef = pd.Series(ridge.coef_, index=X.columns).abs().sort_values(ascending=False)
for feat, coef in ridge_coef.head().items():
    print(f"  {feat}: {coef:.6f}")

print("\nLasso:")
lasso_coef = pd.Series(lasso.coef_, index=X.columns).abs().sort_values(ascending=False)
for feat, coef in lasso_coef.head().items():
    print(f"  {feat}: {coef:.6f}")

print(f"\n✅ Key Insight: Lasso performs automatic feature selection")


QUESTION 4: Ridge vs Lasso

Coefficient Analysis:
----------------------------------------
Ridge: 20/20 non-zero coefficients
Lasso: 0/20 non-zero coefficients

Top 5 features by coefficient magnitude:

Ridge:
  lag_1: 0.036460
  lag_9: 0.035896
  lag_8: 0.035109
  lag_7: 0.026472
  lag_6: 0.022795

Lasso:
  lag_1: 0.000000
  lag_2: 0.000000
  lag_19: 0.000000
  lag_18: 0.000000
  lag_17: 0.000000

✅ Key Insight: Lasso performs automatic feature selection


## Question 5: How do you prevent look-ahead bias?

**Expected Answer:**
1. Always lag features by at least 1 period
2. Use point-in-time data (not revised data)
3. Split train/test by time, not randomly
4. Apply purging and embargo in CV
5. Be careful with normalization (fit on train only)
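
A leak is easiest to see when the feature window overlaps the target. The sketch below uses synthetic i.i.d. returns (not market data; the series length and 1% volatility are assumptions) and compares a rolling mean that includes today's return with one that is lagged by a day.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
r = pd.Series(rng.normal(scale=0.01, size=5000))  # synthetic i.i.d. daily returns

# WRONG: window covers t-4..t, so the feature contains the target r_t
leaky_ma = r.rolling(5).mean()

# CORRECT: window covers t-5..t-1, all known before time t
lagged_ma = r.shift(1).rolling(5).mean()

corr_leaky = leaky_ma.corr(r)
corr_lagged = lagged_ma.corr(r)

print(f"Unlagged 5-day mean vs r_t: r = {corr_leaky:.3f}")
print(f"Lagged 5-day mean vs r_t:   r = {corr_lagged:.3f}")
```

For i.i.d. returns the unlagged feature correlates with the target at roughly 1/√5 ≈ 0.45 because one fifth of the window is the target itself, while the lagged version stays near zero. A jump of that size from a single missing `shift(1)` is the typical signature of look-ahead bias.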

In [6]:
# Demonstrate look-ahead bias
print("\n" + "="*60)
print("QUESTION 5: Look-Ahead Bias Detection")
print("="*60)

# Create feature WITH look-ahead bias (using future data)
df_bias = pd.DataFrame()
df_bias['returns'] = returns

# WRONG: mean of the next five returns (t+1..t+5), which is unknown at time t.
# Note that this window excludes r_t itself, so its correlation with r_t stays
# small; the leak only becomes visible when the target overlaps the window.
df_bias['future_ma'] = returns.shift(-5).rolling(5).mean()  # Future data!

# CORRECT: Using past moving average (properly lagged)
df_bias['past_ma'] = returns.shift(1).rolling(5).mean()  # Past data only

df_bias = df_bias.dropna()

# Correlations
corr_bias = df_bias['future_ma'].corr(df_bias['returns'])
corr_correct = df_bias['past_ma'].corr(df_bias['returns'])

print(f"\nFeature-Target Correlations:")
print("-" * 40)
print(f"With look-ahead bias:    r = {corr_bias:.4f} (window excludes r_t)")
print(f"Without bias (correct):  r = {corr_correct:.4f} (realistic)")

print(f"\n⚠️ Red Flags for Look-Ahead Bias:")
print(f"  - Very high R² (>5% for daily returns)")
print(f"  - Features not lagged")
print(f"  - Train/test split not temporal")
print(f"  - Strategy 'knows' about future events")

print(f"\n✅ Prevention Checklist:")
print(f"  1. Lag all features by at least 1 period")
print(f"  2. Split data by DATE, not randomly")
print(f"  3. Fit scaler on TRAINING data only")
print(f"  4. Use point-in-time financial data")


QUESTION 5: Look-Ahead Bias Detection

Feature-Target Correlations:
----------------------------------------
With look-ahead bias:    r = -0.0418 (window excludes r_t)
Without bias (correct):  r = -0.0425 (realistic)

⚠️ Red Flags for Look-Ahead Bias:
  - Very high R² (>5% for daily returns)
  - Features not lagged
  - Train/test split not temporal
  - Strategy 'knows' about future events

✅ Prevention Checklist:
  1. Lag all features by at least 1 period
  2. Split data by DATE, not randomly
  3. Fit scaler on TRAINING data only
  4. Use point-in-time financial data


## 📚 Summary: Interview Cheat Sheet

| Question | Key Points |
|----------|------------|
| Why overfitting? | Low SNR, limited data, non-stationarity |
| K-fold for TS? | No! Use TimeSeriesSplit |
| Good R²? | 1-2% is good, >10% is suspicious |
| Ridge vs Lasso? | Ridge shrinks, Lasso selects |
| Look-ahead bias? | Lag features, temporal split, fit scaler on train |

In [7]:
print("\n" + "="*70)
print("✅ ML INTERVIEW PREPARATION COMPLETE")
print("="*70)


✅ ML INTERVIEW PREPARATION COMPLETE
