# Week 8: Instance-Based Methods - KNN & SVM Regression

## 🎯 Learning Objectives

By the end of this week, you will understand:
- **K-Nearest Neighbors (KNN)**: Non-parametric predictions
- **Distance Metrics**: Euclidean, Manhattan, Mahalanobis
- **SVM Regression (SVR)**: Epsilon-insensitive loss
- **Finance Applications**: Similar historical patterns, regime matching

---

## Why Instance-Based Methods?

Instead of learning explicit parameters, these methods:
- Store training data
- Make predictions based on similar examples
- Capture local structure in data
- Answer queries such as "Show me similar market conditions in the past"

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)
print("✅ Libraries loaded!")
print("📚 Week 8: Instance-Based Methods")

---

## Part 1: K-Nearest Neighbors

### The Algorithm

1. Store all training data
2. For new point $x$, find $k$ nearest neighbors
3. Prediction = average (regression) or majority vote (classification)

$$\hat{y} = \frac{1}{k}\sum_{i \in N_k(x)} y_i$$
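
The three steps above can be sketched directly in NumPy. This is a minimal illustration, not how scikit-learn implements it: the helper name `knn_regress` and the toy data are assumptions made for this example, and Euclidean distance is assumed as the metric.

```python
import numpy as np

def knn_regress(X_train, y_train, x_query, k=3):
    """Predict y for one query point as the mean target of its k nearest neighbors."""
    # Step 1 is implicit: the "model" is just the stored training data.
    # Step 2: Euclidean distance from the query to every training point.
    dists = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    # Indices of the k smallest distances, i.e. the neighborhood N_k(x).
    neighbors = np.argsort(dists)[:k]
    # Step 3: average the neighbors' targets.
    return y_train[neighbors].mean()

# Toy data (illustrative values only)
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [10.0]])
y_toy = np.array([0.0, 1.0, 2.0, 3.0, 10.0])

pred = knn_regress(X_toy, y_toy, np.array([1.2]), k=3)
print(pred)  # mean of the targets at x = 1, 2 and 0
```

The three closest points to 1.2 are at 1, 2 and 0, so the prediction is the mean of their targets; the outlier at 10 has no influence because it is never in the neighborhood.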

### 🤔 Simple Explanation

KNN is like asking: "What happened the last 5 times the market looked like this?" Then average those outcomes for your prediction.

### Key Hyperparameters

- **k**: Number of neighbors (bias-variance tradeoff)
- **metric**: Distance function
- **weights**: Uniform or distance-weighted
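
To make the `weights` option concrete, the sketch below compares uniform and distance-weighted averaging on a three-point toy set. The data and variable names are assumptions chosen so the arithmetic is easy to check by hand.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Toy data (illustrative values only)
X_toy = np.array([[0.0], [1.0], [4.0]])
y_toy = np.array([0.0, 1.0, 4.0])
x_query = np.array([[2.0]])  # distances to the three points: 2, 1, 2

uniform = KNeighborsRegressor(n_neighbors=3, weights='uniform').fit(X_toy, y_toy)
weighted = KNeighborsRegressor(n_neighbors=3, weights='distance').fit(X_toy, y_toy)

# Uniform: plain mean of all three targets.
pred_uniform = uniform.predict(x_query)[0]
# Distance: each target is weighted by 1 / distance to the query.
pred_weighted = weighted.predict(x_query)[0]

print(f"uniform:  {pred_uniform:.4f}")
print(f"distance: {pred_weighted:.4f}")
```

With uniform weights the prediction is (0 + 1 + 4) / 3. With distance weights the point at distance 1 counts twice as much as the two points at distance 2, giving (0.5·0 + 1·1 + 0.5·4) / 2 = 1.5.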

In [None]:
# Generate market regime data
n = 1000
vix = np.random.exponential(20, n)  # VIX-like
momentum = np.random.randn(n) * 10  # Momentum score

# Returns depend on regime
returns = np.where(
    vix < 15,
    0.001 + 0.0002 * momentum,  # Low vol: momentum works
    np.where(
        vix > 30,
        -0.002 - 0.0001 * vix,  # High vol: negative
        0.0001 * momentum  # Medium vol: slight momentum
    )
) + np.random.randn(n) * 0.01

X = np.column_stack([vix, momentum])
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X_scaled, returns, test_size=0.3, random_state=42)

# KNN with different k
print("KNN: Effect of k")
print("="*50)
for k in [1, 5, 10, 20, 50]:
    knn = KNeighborsRegressor(n_neighbors=k)
    scores = cross_val_score(knn, X_train, y_train, cv=5)
    print(f"k={k:2d}: CV R² = {scores.mean():.4f} ± {scores.std():.4f}")

In [None]:
# Visualize KNN predictions
knn = KNeighborsRegressor(n_neighbors=10)
knn.fit(X_train, y_train)

# Create grid for visualization
xx, yy = np.meshgrid(
    np.linspace(X_scaled[:, 0].min(), X_scaled[:, 0].max(), 50),
    np.linspace(X_scaled[:, 1].min(), X_scaled[:, 1].max(), 50)
)
grid = np.c_[xx.ravel(), yy.ravel()]
predictions = knn.predict(grid).reshape(xx.shape)

plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.contourf(xx, yy, predictions, levels=20, cmap='RdYlGn', alpha=0.8)
plt.colorbar(label='Predicted Return')
plt.scatter(X_train[:100, 0], X_train[:100, 1], c=y_train[:100], cmap='RdYlGn', edgecolors='k', s=30)
plt.xlabel('VIX (scaled)')
plt.ylabel('Momentum (scaled)')
plt.title('KNN Prediction Surface')

plt.subplot(1, 2, 2)
plt.scatter(y_test, knn.predict(X_test), alpha=0.5)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
plt.xlabel('Actual Return')
plt.ylabel('Predicted Return')
plt.title(f'KNN Predictions (R² = {knn.score(X_test, y_test):.3f})')
plt.tight_layout()
plt.show()

---

## Part 2: Distance Metrics

### Common Metrics

**Euclidean** (L2): $d(x, y) = \sqrt{\sum_i (x_i - y_i)^2}$

**Manhattan** (L1): $d(x, y) = \sum_i |x_i - y_i|$

**Mahalanobis**: $d(x, y) = \sqrt{(x-y)^T \Sigma^{-1} (x-y)}$

### 🤔 Simple Explanation

- **Euclidean**: Straight-line distance (as the crow flies)
- **Manhattan**: City-block distance (as a taxi drives on a street grid)
- **Mahalanobis**: Accounts for feature scales and correlations through the covariance matrix $\Sigma$
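
The three formulas can be checked by hand on a pair of 2D points. The vectors and the covariance matrix `Sigma` below are hypothetical values picked for this example only.

```python
import numpy as np

x = np.array([1.0, 2.0])
y = np.array([4.0, 6.0])
diff = x - y  # (-3, -4)

# Euclidean (L2): square root of the summed squared differences
euclidean = np.sqrt((diff ** 2).sum())

# Manhattan (L1): sum of absolute differences
manhattan = np.abs(diff).sum()

# Mahalanobis: hypothetical covariance with positively correlated features
Sigma = np.array([[4.0, 2.0],
                  [2.0, 4.0]])
# solve(Sigma, diff) computes Sigma^{-1} diff without forming the inverse
mahalanobis = np.sqrt(diff @ np.linalg.solve(Sigma, diff))

# With an identity covariance, Mahalanobis reduces to Euclidean
mahalanobis_identity = np.sqrt(diff @ np.linalg.solve(np.eye(2), diff))

print(f"Euclidean:   {euclidean:.4f}")
print(f"Manhattan:   {manhattan:.4f}")
print(f"Mahalanobis: {mahalanobis:.4f}")
```

Because the two features are assumed to move together, a difference that points along the correlated direction is "less surprising", and the Mahalanobis distance comes out smaller than the Euclidean one.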

In [None]:
# Compare distance metrics
metrics = ['euclidean', 'manhattan', 'chebyshev']

print("Distance Metric Comparison (k=10)")
print("="*50)
for metric in metrics:
    knn = KNeighborsRegressor(n_neighbors=10, metric=metric)
    knn.fit(X_train, y_train)
    score = knn.score(X_test, y_test)
    print(f"{metric:12}: Test R² = {score:.4f}")
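
Mahalanobis distance needs an inverse covariance matrix, so it cannot be passed as a bare string like the other metrics. The following is a sketch of one way to use it with scikit-learn, assuming the `metric_params={'VI': ...}` argument and the brute-force search; the synthetic data and the names `X_demo`, `y_demo`, and `knn_maha` are made up for this example and are independent of the notebook's variables.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)

# Hypothetical strongly correlated features
cov = np.array([[1.0, 0.8],
                [0.8, 1.0]])
X_demo = rng.multivariate_normal([0.0, 0.0], cov, size=300)
y_demo = X_demo[:, 0] - X_demo[:, 1] + rng.normal(0, 0.1, size=300)

X_tr, X_te = X_demo[:200], X_demo[200:]
y_tr, y_te = y_demo[:200], y_demo[200:]

# Inverse covariance estimated from the training features only
VI = np.linalg.inv(np.cov(X_tr, rowvar=False))

knn_maha = KNeighborsRegressor(
    n_neighbors=10,
    metric='mahalanobis',
    metric_params={'VI': VI},
    algorithm='brute',  # avoids tree-based search for this metric
)
knn_maha.fit(X_tr, y_tr)
score_maha = knn_maha.score(X_te, y_te)
print(f"mahalanobis : Test R² = {score_maha:.4f}")
```

Estimating `VI` on the training rows only keeps information from the test set out of the distance computation.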

---

## Part 3: Support Vector Regression (SVR)

### The Idea: ε-Insensitive Loss

SVR finds a function that:
- Deviates at most ε from actual targets
- Is as flat as possible

**Loss Function:**
$$L_\epsilon(y, f(x)) = \max(0, |y - f(x)| - \epsilon)$$

### 🤔 Simple Explanation

SVR creates a "tube" around the predictions. Points inside the tube have zero error. Only points outside the tube contribute to the loss.
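
The loss itself is one line of NumPy. The function name `epsilon_insensitive_loss` and the sample values are assumptions for this illustration; scikit-learn does not expose the loss under this name.

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Per-sample loss: zero inside the tube, linear in the excess outside it."""
    return np.maximum(0.0, np.abs(y_true - y_pred) - epsilon)

# Illustrative values: two predictions inside the tube, two outside
y_true = np.array([1.0, 1.0, 1.0, 1.0])
y_pred = np.array([1.05, 0.95, 1.5, 0.5])

losses = epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1)
print(losses)
```

The first two predictions miss by 0.05, which is inside the tube, so they cost nothing. The last two miss by 0.5 and are charged only the 0.4 by which they exceed ε.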

### Key Hyperparameters

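- **C**: Penalty strength for points outside the tube (larger C fits the training data more tightly)
- **epsilon**: Half-width of the tube (errors smaller than this are ignored)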
- **kernel**: RBF, linear, polynomial
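
One way to see what `epsilon` does is to count support vectors, since only points on or outside the tube become support vectors. The sketch below uses synthetic sine data; the values of `C`, the noise level, and the epsilon grid are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Synthetic 1D data: noisy sine wave (noise std 0.1)
X_demo = np.sort(rng.uniform(-3, 3, size=200))[:, np.newaxis]
y_demo = np.sin(X_demo.ravel()) + rng.normal(0, 0.1, size=200)

n_support = {}
for eps in [0.01, 0.1, 0.5]:
    svr_demo = SVR(kernel='rbf', C=10.0, epsilon=eps).fit(X_demo, y_demo)
    # support_ holds the indices of the training points used as support vectors
    n_support[eps] = len(svr_demo.support_)
    print(f"epsilon={eps:<5}: {n_support[eps]:3d} support vectors out of {len(X_demo)}")
```

A wider tube swallows more of the noise, so fewer points end up as support vectors and the fitted function becomes sparser and flatter.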

In [None]:
# SVR comparison
kernels = ['linear', 'rbf', 'poly']

print("SVR Kernel Comparison")
print("="*50)
for kernel in kernels:
    svr = SVR(kernel=kernel, C=1.0, epsilon=0.001)
    svr.fit(X_train, y_train)
    score = svr.score(X_test, y_test)
    print(f"{kernel:8} kernel: Test R² = {score:.4f}")

In [None]:
# Visualize SVR epsilon tube
# 1D example for clarity
X_1d = np.sort(np.random.randn(100))[:, np.newaxis]
y_1d = np.sin(X_1d.ravel()) + np.random.randn(100) * 0.2

svr = SVR(kernel='rbf', C=100, epsilon=0.1)
svr.fit(X_1d, y_1d)

X_plot = np.linspace(X_1d.min(), X_1d.max(), 200)[:, np.newaxis]
y_pred = svr.predict(X_plot)

plt.figure(figsize=(10, 4))
plt.scatter(X_1d, y_1d, alpha=0.5, label='Data')
plt.plot(X_plot, y_pred, 'r-', linewidth=2, label='SVR')
plt.fill_between(X_plot.ravel(), y_pred - svr.epsilon, y_pred + svr.epsilon, alpha=0.3, color='red', label='ε-tube')
plt.xlabel('Feature')
plt.ylabel('Target')
plt.title('SVR with ε-Insensitive Tube')
plt.legend()
plt.show()

---

## Part 4: Finance Application - Regime Matching

Use KNN to find similar historical market conditions and predict outcomes.

In [None]:
# Simulate historical market data
np.random.seed(42)
n_days = 2520  # 10 years

# Features: VIX, momentum, volume
vix_history = 20 + 15 * np.abs(np.random.randn(n_days)) * np.sin(np.linspace(0, 10*np.pi, n_days))
vix_history = np.clip(vix_history, 10, 80)
momentum_history = pd.Series(np.random.randn(n_days) * 10).rolling(20).mean().fillna(0).values
returns_history = np.random.randn(n_days) * 0.01 + 0.0002 - 0.0001 * vix_history / 20

# Forward returns (what we want to predict)
fwd_returns = pd.Series(returns_history).shift(-5).rolling(5).sum().fillna(0).values

X_hist = np.column_stack([vix_history, momentum_history])

# Train on first 8 years, test on last 2
train_size = 2016
y_train_h = fwd_returns[:train_size]
y_test_h = fwd_returns[train_size:]

# Fit the scaler on the training window only to avoid look-ahead bias
scaler = StandardScaler()
X_train_h = scaler.fit_transform(X_hist[:train_size])
X_test_h = scaler.transform(X_hist[train_size:])

# KNN regime matching
knn_regime = KNeighborsRegressor(n_neighbors=20, weights='distance')
knn_regime.fit(X_train_h, y_train_h)

print("Regime Matching Results")
print("="*50)
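# With weights='distance', each training point matches itself at distance 0,
# so the train R² is ~1 by construction; only the test R² is informative.
print(f"Train R²: {knn_regime.score(X_train_h, y_train_h):.3f}")
print(f"Test R²:  {knn_regime.score(X_test_h, y_test_h):.3f}")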

# Example: Current market conditions
current_vix = 25
current_momentum = 5
current = scaler.transform([[current_vix, current_momentum]])

# Find similar historical periods (indices refer to rows of the training set)
distances, indices = knn_regime.kneighbors(current, n_neighbors=5)
similar_returns = y_train_h[indices[0]]

print(f"\nCurrent conditions: VIX={current_vix}, Momentum={current_momentum}")
print(f"5-day predicted return: {knn_regime.predict(current)[0]:.4f}")
print(f"Similar historical 5-day returns: {np.round(similar_returns, 4)}")

---

## Interview Questions

### Conceptual
1. What are the advantages of KNN over parametric models?
2. When would you prefer SVR over linear regression?
3. How does the choice of distance metric affect KNN?

### Technical
1. What is the time complexity of KNN prediction?
2. How would you scale KNN to millions of data points?
3. Explain the role of epsilon in SVR.

### Finance-Specific
1. How would you use KNN to identify market regimes?
2. What features would you use to find "similar" market conditions?
3. What are the risks of regime-matching strategies?

---

## Key Takeaways

| Model | Type | Strengths | Weaknesses |
|-------|------|-----------|------------|
| KNN | Instance | Simple, local patterns | Slow prediction, curse of dimensionality |
| SVR | Kernel | Robust to outliers | Slow training, hyperparameter sensitive |