# Week 9: Unsupervised Learning - PCA, Clustering, HMM, Anomaly Detection

## 🎯 Learning Objectives

By the end of this week, you will understand:
- **PCA**: Dimensionality reduction for factors
- **Clustering**: Market regime identification
- **Hidden Markov Models (HMM)**: Regime switching
- **Anomaly Detection**: Unusual market conditions

---

## Why Unsupervised Learning in Finance?

- **No labels needed**: Markets don't come with "regime" labels
- **Pattern discovery**: Find structure in data
- **Risk management**: Detect unusual conditions
- **Feature engineering**: Create new factors

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans, DBSCAN
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import IsolationForest
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)
print("✅ Libraries loaded!")
print("📚 Week 9: Unsupervised Learning")

---

## Part 1: Principal Component Analysis (PCA)

### The Problem

Financial datasets often contain many features that are strongly correlated with one another. Can we reduce the number of dimensions while preserving most of the information?

### The Math

1. Center the data: $\tilde{X} = X - \bar{X}$
2. Compute covariance: $\Sigma = \frac{1}{n}\tilde{X}^T\tilde{X}$
3. Eigendecomposition: $\Sigma = V\Lambda V^T$
4. Project the centered data: $Z = \tilde{X}V_k$ (keep the top $k$ eigenvectors)
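
The four steps above can be sketched directly in NumPy. This is a minimal illustration on made-up toy data (the array `X` and the choice `k = 2` are assumptions, not part of the course dataset), using the same $1/n$ normalization as step 2:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (an assumption for illustration): 200 samples, 3 features,
# where the first two features share one latent driver
latent = rng.normal(size=(200, 1))
X = np.hstack([
    latent + 0.1 * rng.normal(size=(200, 1)),
    2.0 * latent + 0.1 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1)),
])

# 1. Center the data
X_centered = X - X.mean(axis=0)

# 2. Covariance matrix with 1/n normalization
n = X_centered.shape[0]
cov = X_centered.T @ X_centered / n

# 3. Eigendecomposition; eigh returns eigenvalues in ascending order,
#    so reorder them from largest to smallest
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Project the centered data onto the top k eigenvectors
k = 2
Z = X_centered @ eigvecs[:, :k]

explained_ratio = eigvals / eigvals.sum()
print("Explained variance ratio:", explained_ratio.round(3))
print("Covariance of projected data:\n", (Z.T @ Z / n).round(4))
```

The covariance of `Z` comes out diagonal, which is the sense in which principal components are uncorrelated. scikit-learn's `PCA` normalizes by $n-1$ rather than $n$; the explained variance ratios are the same either way.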

### 🤔 Simple Explanation

PCA finds the directions in your data with the most variation. Instead of 100 correlated features, you might get 5 uncorrelated "principal components" that explain 95% of the variance.

### Finance Application: Statistical Factor Model

PCA on stock returns reveals "statistical factors" - latent drivers of returns.

In [None]:
# Simulate stock returns with underlying factors
n_days = 252
n_stocks = 50

# True underlying factors
market_factor = np.random.randn(n_days) * 0.01
sector_factor = np.random.randn(n_days) * 0.008
momentum_factor = np.random.randn(n_days) * 0.006

# Stock loadings on factors
market_beta = np.random.uniform(0.5, 1.5, n_stocks)
sector_beta = np.random.randn(n_stocks) * 0.5
momentum_beta = np.random.randn(n_stocks) * 0.3

# Generate returns
returns = np.zeros((n_days, n_stocks))
for i in range(n_stocks):
    returns[:, i] = (
        market_beta[i] * market_factor +
        sector_beta[i] * sector_factor +
        momentum_beta[i] * momentum_factor +
        np.random.randn(n_days) * 0.02  # Idiosyncratic
    )

# Apply PCA
scaler = StandardScaler()
returns_scaled = scaler.fit_transform(returns)

pca = PCA()
pca.fit(returns_scaled)

# Explained variance
exp_var = pca.explained_variance_ratio_
cum_var = np.cumsum(exp_var)

print("PCA Results")
print("="*50)
print("Variance explained by top 5 PCs:")
for i in range(5):
    print(f"  PC{i+1}: {exp_var[i]:.1%} (cumulative: {cum_var[i]:.1%})")

plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.bar(range(1, 11), exp_var[:10])
plt.xlabel('Principal Component')
plt.ylabel('Variance Explained')
plt.title('Scree Plot')

plt.subplot(1, 2, 2)
plt.plot(range(1, 11), cum_var[:10], 'bo-')
plt.axhline(y=0.9, color='r', linestyle='--', label='90% threshold')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Variance')
plt.title('Cumulative Variance')
plt.legend()
plt.tight_layout()
plt.show()
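
Rather than reading the 90% threshold off the plot, scikit-learn can pick the number of components for a target variance fraction. The sketch below uses a small hypothetical return panel (`demo_returns`, with two planted factors) so that it runs on its own. Setting `svd_solver="full"` explicitly is a cautious choice here, not something the cell above requires.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Hypothetical return panel: 252 days x 20 stocks driven by 2 common factors
n_days, n_stocks = 252, 20
factors = rng.normal(scale=0.01, size=(n_days, 2))
loadings = rng.normal(size=(2, n_stocks))
demo_returns = factors @ loadings + rng.normal(scale=0.005, size=(n_days, n_stocks))

demo_scaled = StandardScaler().fit_transform(demo_returns)

# A float n_components in (0, 1) keeps the smallest number of components
# whose cumulative explained variance exceeds that fraction
pca_90 = PCA(n_components=0.90, svd_solver="full")
factor_returns = pca_90.fit_transform(demo_scaled)   # (n_days, k) factor time series
factor_loadings = pca_90.components_                 # (k, n_stocks) weights per stock

print(f"Components kept: {pca_90.n_components_}")
print(f"Cumulative variance: {pca_90.explained_variance_ratio_.sum():.1%}")
print(f"Factor returns shape: {factor_returns.shape}")
```

The columns of `factor_returns` are the statistical factor time series, and each row of `factor_loadings` shows how strongly every stock loads on that factor.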

---

## Part 2: Clustering - K-Means

### The Algorithm

1. Initialize k cluster centers
2. Assign points to nearest center
3. Update centers as mean of assigned points
4. Repeat until convergence
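
The loop above (often called Lloyd's algorithm) fits in a few lines of NumPy. This is a bare-bones sketch with random initialization on made-up data. The function name `kmeans_lloyd` and the two-blob dataset are illustrative assumptions, and scikit-learn's `KMeans` adds smarter initialization and multiple restarts on top of this.

```python
import numpy as np

def kmeans_lloyd(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    # 1. Initialize centers as k distinct data points chosen at random
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # 2. Assign each point to its nearest center (squared Euclidean distance)
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # 3. Move each center to the mean of its points (keep it if the cluster is empty)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # 4. Stop once the centers no longer move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Toy data (assumed for illustration): two well-separated blobs
rng = np.random.default_rng(42)
X_demo = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(100, 2)),
    rng.normal(loc=[5, 5], scale=0.5, size=(100, 2)),
])
centers, labels = kmeans_lloyd(X_demo, k=2)
print("Centers:\n", centers.round(2))
print("Cluster sizes:", np.bincount(labels, minlength=2))
```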

### Objective

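$$\min \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2$$

where $C_i$ is the set of points assigned to cluster $i$ and $\mu_i$ is the mean of those points.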
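
As a quick check, the objective can be computed by hand and compared with the `inertia_` attribute that scikit-learn reports after fitting. The three-blob dataset below is an assumption made purely for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(7)

# Toy data (assumed for illustration): three blobs in 2-D
X_demo = np.vstack([
    rng.normal(loc=center, scale=0.6, size=(80, 2))
    for center in ([0, 0], [4, 0], [2, 4])
])

km = KMeans(n_clusters=3, random_state=0, n_init=10).fit(X_demo)

# Sum over clusters of squared distances from each point to its cluster center
objective = sum(
    ((X_demo[km.labels_ == i] - km.cluster_centers_[i]) ** 2).sum()
    for i in range(km.n_clusters)
)

print(f"Manual objective: {objective:.4f}")
print(f"sklearn inertia_: {km.inertia_:.4f}")
```

The two numbers should agree up to floating-point error, which is why inertia is the quantity plotted in an elbow chart.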

### 🤔 Simple Explanation

K-Means groups similar data points together. Think of it as sorting market days into "buckets" based on their characteristics.

In [None]:
# Create market regime features
n_days = 1000
vix = np.random.exponential(20, n_days)
volume_ratio = np.random.lognormal(0, 0.5, n_days)
momentum = np.random.randn(n_days) * 10

X = np.column_stack([vix, volume_ratio, momentum])
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Find optimal k using elbow method
inertias = []
K_range = range(2, 10)
for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X_scaled)
    inertias.append(kmeans.inertia_)

plt.figure(figsize=(8, 4))
plt.plot(K_range, inertias, 'bo-')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method for Optimal k')
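# The features above are independent random draws, so no sharp elbow is guaranteed;
# the line marks k=4, the value used in the next cell
plt.axvline(x=4, color='r', linestyle='--', label='k = 4 (chosen)')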
plt.legend()
plt.show()

In [None]:
# Apply K-Means with k=4
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
clusters = kmeans.fit_predict(X_scaled)

# Analyze clusters
print("Market Regime Clusters")
print("="*60)

# K-Means label ids are arbitrary: inspect the cluster averages below before
# attaching names such as 'Normal', 'High Vol', 'Low Vol', or 'Trending'

for c in range(4):
    mask = clusters == c
    print(f"\nCluster {c} ({mask.sum()} days):")
    print(f"  Avg VIX: {vix[mask].mean():.1f}")
    print(f"  Avg Volume Ratio: {volume_ratio[mask].mean():.2f}")
    print(f"  Avg Momentum: {momentum[mask].mean():.1f}")

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

scatter = axes[0].scatter(vix, momentum, c=clusters, cmap='viridis', alpha=0.5)
axes[0].set_xlabel('VIX')
axes[0].set_ylabel('Momentum')
axes[0].set_title('Market Regimes')
plt.colorbar(scatter, ax=axes[0], label='Cluster')

cluster_counts = pd.Series(clusters).value_counts().sort_index()
axes[1].bar(cluster_counts.index, cluster_counts.values)
axes[1].set_xlabel('Cluster')
axes[1].set_ylabel('Days')
axes[1].set_title('Cluster Distribution')

plt.tight_layout()
plt.show()

---

## Part 3: Hidden Markov Models (HMM)

### The Concept

Markets switch between hidden "regimes" (bull, bear, sideways). We observe returns but not the regime.

### Model Components

- **Hidden states**: S = {Bull, Bear, Neutral}
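- **Transition matrix**: $P(s_t \mid s_{t-1})$, the probability of moving from one state to the next
- **Emission**: $P(r_t \mid s_t)$, the distribution of the return given the current state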
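
To make the three components concrete, the sketch below writes them as NumPy arrays for a two-state model with Gaussian emissions and runs the forward recursion, which gives the probability of each state given the returns seen so far. All parameter values and the short `obs` series are hypothetical, chosen only for illustration.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Normal density, evaluated for every state at once."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

def forward_filter(obs, start_prob, trans, means, stds):
    """Scaled forward recursion: filtered P(state_t | obs up to t) and log-likelihood."""
    n, k = len(obs), len(start_prob)
    filtered = np.zeros((n, k))
    log_lik = 0.0
    pred = start_prob                                      # prior over the first state
    for t in range(n):
        joint = pred * gaussian_pdf(obs[t], means, stds)   # weight prior by emission
        norm = joint.sum()                                 # P(obs_t | earlier obs)
        filtered[t] = joint / norm
        log_lik += np.log(norm)
        pred = filtered[t] @ trans                         # one-step-ahead state prediction
    return filtered, log_lik

# Hypothetical parameters for a two-state (Bull, Bear) model
start_prob = np.array([0.5, 0.5])
trans = np.array([[0.95, 0.05],    # row = state at t-1, column = state at t
                  [0.10, 0.90]])
means = np.array([0.001, -0.002])  # emission means per state
stds = np.array([0.010, 0.025])    # emission std devs per state

obs = np.array([0.002, 0.001, -0.040, 0.035, -0.050, 0.003])
filtered, log_lik = forward_filter(obs, start_prob, trans, means, stds)
print(filtered.round(3))
print(f"Log-likelihood: {log_lik:.2f}")
```

Each row of `filtered` sums to one. Large moves push probability toward the high-volatility state, and the transition matrix makes that belief persist for a while afterward.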

### 🤔 Simple Explanation

HMM assumes the market is always in one of several "modes" (regimes), but we can't directly see which one. We can only observe returns and infer the regime.
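
One standard way to infer the regime path is the Viterbi algorithm, which finds the single most likely sequence of hidden states. Below is a minimal log-space sketch for Gaussian emissions. The function name, the parameter values, and the observation series are assumptions for illustration, not values estimated from data.

```python
import numpy as np

def viterbi_gaussian(obs, start_prob, trans, means, stds):
    """Most likely state path for a Gaussian-emission HMM, computed in log space."""
    n, k = len(obs), len(start_prob)
    # Log emission density of every observation under every state: shape (n, k)
    log_emit = (-0.5 * ((obs[:, None] - means) / stds) ** 2
                - np.log(stds * np.sqrt(2 * np.pi)))
    log_trans = np.log(trans)
    # score[j]: best log-probability of any path that ends in state j
    score = np.log(start_prob) + log_emit[0]
    backptr = np.zeros((n, k), dtype=int)
    for t in range(1, n):
        cand = score[:, None] + log_trans      # cand[i, j]: come from i, move to j
        backptr[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_emit[t]
    # Trace the best path backwards from the best final state
    path = np.zeros(n, dtype=int)
    path[-1] = score.argmax()
    for t in range(n - 1, 0, -1):
        path[t - 1] = backptr[t, path[t]]
    return path

# Hypothetical two-state parameters: state 0 = calm (Bull), state 1 = volatile (Bear)
start_prob = np.array([0.5, 0.5])
trans = np.array([[0.95, 0.05],
                  [0.10, 0.90]])
means = np.array([0.001, -0.002])
stds = np.array([0.010, 0.025])

# Six quiet days, four violent days, two quiet days
obs = np.array([0.002, 0.001, -0.001, 0.003, 0.000, 0.002,
                -0.040, 0.035, -0.050, 0.045, 0.001, 0.002])
path = viterbi_gaussian(obs, start_prob, trans, means, stds)
print("Decoded regimes:", path)
```

Because leaving a state carries a cost in the transition matrix, the decoded path changes regime only when the returns give enough evidence to justify the switch.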

In [None]:
try:
    from hmmlearn import hmm
    
    # Simulate regime-switching returns
    n_days = 2000
    true_states = np.zeros(n_days, dtype=int)
    
    # Simple regime generation
    for i in range(1, n_days):
        if true_states[i-1] == 0:  # Bull
            true_states[i] = np.random.choice([0, 1], p=[0.95, 0.05])
        else:  # Bear
            true_states[i] = np.random.choice([0, 1], p=[0.1, 0.9])
    
    # Regime-dependent returns
    returns_hmm = np.where(
        true_states == 0,
        np.random.normal(0.001, 0.01, n_days),  # Bull
        np.random.normal(-0.002, 0.025, n_days)  # Bear
    )
    
    # Fit HMM
    model = hmm.GaussianHMM(n_components=2, covariance_type="full", n_iter=100, random_state=42)
    model.fit(returns_hmm.reshape(-1, 1))
    
    # Predict states
    predicted_states = model.predict(returns_hmm.reshape(-1, 1))
    
    print("Hidden Markov Model Results")
    print("="*50)
    print(f"\nLearned means: {model.means_.ravel()}")
    print(f"Learned std devs: {np.sqrt(model.covars_.ravel())}")
    print(f"\nTransition matrix:")
    print(model.transmat_.round(3))
    
    # Visualize
    fig, axes = plt.subplots(2, 1, figsize=(12, 6), sharex=True)
    
    axes[0].plot(np.cumsum(returns_hmm))
    axes[0].set_ylabel('Cumulative Return')
    axes[0].set_title('Returns')
    
    axes[1].fill_between(range(len(predicted_states)), 0, predicted_states, alpha=0.5, label='Predicted')
    axes[1].set_ylabel('Regime')
    axes[1].set_xlabel('Day')
    axes[1].set_title('Regime States')
    
    plt.tight_layout()
    plt.show()
    
except ImportError:
    print("⚠️ hmmlearn not installed. Install with: pip install hmmlearn")

---

## Part 4: Anomaly Detection

### Methods

1. **Isolation Forest**: Isolate anomalies using random trees
2. **DBSCAN**: Density-based clustering (outliers = no cluster)
3. **Statistical**: Z-score, Mahalanobis distance
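
The statistical methods in item 3 can be sketched in a few lines. The example below uses made-up correlated data with five planted outliers (both are assumptions for illustration) and compares per-feature z-scores with the Mahalanobis distance, which also accounts for correlation between features.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy data (assumed for illustration): 500 positively correlated 2-D points
# plus 5 planted outliers that break the correlation pattern
cov_true = np.array([[1.0, 0.8], [0.8, 1.0]])
X_normal = rng.multivariate_normal(mean=[0, 0], cov=cov_true, size=500)
X_outliers = np.array([[3.0, -3.0], [-3.0, 3.0], [4.0, -2.5], [-3.5, 3.5], [2.5, -4.0]])
X_demo = np.vstack([X_normal, X_outliers])

# Z-scores: flag a point if any single feature is more than 3 std devs from its mean
z = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)
z_flags = (np.abs(z) > 3).any(axis=1)

# Squared Mahalanobis distance: (x - mu)^T Sigma^{-1} (x - mu)
mu = X_demo.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X_demo, rowvar=False))
diff = X_demo - mu
d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)

# For 2-D Gaussian data d2 is roughly chi-square with 2 degrees of freedom;
# 13.8 is approximately its 99.9% quantile
maha_flags = d2 > 13.8

print(f"Z-score flags: {z_flags.sum()}, Mahalanobis flags: {maha_flags.sum()}")
print(f"Planted outliers caught by Mahalanobis: {maha_flags[-5:].sum()} of 5")
```

A point such as (3, -3) is not extreme in either feature alone, but it is very unusual given the positive correlation, so the Mahalanobis distance is the more sensitive test here.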

### Finance Applications

- Flash crash detection
- Unusual trading patterns
- Market manipulation

In [None]:
# Anomaly Detection with Isolation Forest
n_normal = 950
n_anomaly = 50

# Normal market conditions
normal_vix = np.random.exponential(15, n_normal)
normal_volume = np.random.lognormal(0, 0.3, n_normal)

# Anomalous conditions (flash crashes, unusual activity)
anomaly_vix = np.random.uniform(50, 80, n_anomaly)
anomaly_volume = np.random.uniform(3, 6, n_anomaly)

# Combine
X_all = np.vstack([
    np.column_stack([normal_vix, normal_volume]),
    np.column_stack([anomaly_vix, anomaly_volume])
])
labels_true = np.array([0]*n_normal + [1]*n_anomaly)

# Isolation Forest
iso_forest = IsolationForest(contamination=0.05, random_state=42)
predictions = iso_forest.fit_predict(X_all)
anomaly_pred = predictions == -1  # -1 indicates anomaly

print("Isolation Forest Results")
print("="*50)
print(f"True anomalies: {labels_true.sum()}")
print(f"Detected anomalies: {anomaly_pred.sum()}")
print(f"True positives: {(anomaly_pred & (labels_true==1)).sum()}")
print(f"False positives: {(anomaly_pred & (labels_true==0)).sum()}")

# Visualize
plt.figure(figsize=(10, 4))
plt.scatter(X_all[~anomaly_pred, 0], X_all[~anomaly_pred, 1], c='blue', alpha=0.5, label='Normal')
plt.scatter(X_all[anomaly_pred, 0], X_all[anomaly_pred, 1], c='red', s=100, marker='x', label='Anomaly')
plt.xlabel('VIX')
plt.ylabel('Volume Ratio')
plt.title('Anomaly Detection')
plt.legend()
plt.show()
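
DBSCAN, the second method listed above, can be used in a similar way: points that do not belong to any dense region receive the label `-1`. The sketch below runs it on a small made-up dataset. The values of `eps` and `min_samples` are illustrative guesses rather than tuned settings.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)

# Toy data (assumed for illustration): one dense cloud plus a few isolated points
dense = rng.normal(loc=[15.0, 1.0], scale=[3.0, 0.2], size=(300, 2))
isolated = np.array([[60.0, 4.0], [70.0, 5.0], [55.0, 5.5], [75.0, 3.5]])
X_demo = np.vstack([dense, isolated])

# DBSCAN is distance-based, so put both features on a comparable scale first
X_demo_scaled = StandardScaler().fit_transform(X_demo)

# eps and min_samples are illustrative guesses, not tuned values
db = DBSCAN(eps=0.5, min_samples=5).fit(X_demo_scaled)
noise_mask = db.labels_ == -1    # DBSCAN assigns the label -1 to noise points

n_clusters = len(set(db.labels_)) - (1 if noise_mask.any() else 0)
print(f"Clusters found: {n_clusters}")
print(f"Points labeled as noise: {noise_mask.sum()}")
print(f"Isolated points labeled as noise: {noise_mask[-4:].sum()} of 4")
```

Unlike Isolation Forest, DBSCAN does not take a contamination rate. The number of flagged points depends on `eps` and `min_samples`, so points on the fringe of the dense cloud may be flagged as well.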

---

## Interview Questions

### Conceptual
1. What's the difference between PCA and factor analysis?
2. How do you choose the number of clusters in K-Means?
3. When would you use HMM over K-Means for regime detection?

### Technical
1. Derive the PCA solution using SVD.
2. What are the limitations of K-Means for financial data?
3. Explain the Viterbi algorithm for HMM.

### Finance-Specific
1. How would you use PCA factors in a trading strategy?
2. What market events would you expect anomaly detection to catch?
3. How often should regime models be retrained?

---

## Key Takeaways

| Method | Use Case | Output |
|--------|----------|--------|
| PCA | Dimension reduction | Uncorrelated factors |
| K-Means | Regime clustering | Discrete labels |
| HMM | Regime switching | Probabilistic states |
| Isolation Forest | Anomaly detection | Outlier scores |