# 04: Regularization

**Module 1.2: Linear Systems & Least Squares**

## Learning Objectives

By the end of this notebook, you will:
1. Understand why regularization is needed
2. Derive Ridge regression from scratch
3. Understand LASSO for sparse solutions
4. Choose between Ridge, LASSO, and Elastic Net

## Resources
- ISLR, Chapter 6
- Cohen, *Practical Linear Algebra*, Chapter 9

In [None]:
import torch
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge, Lasso, ElasticNet, LassoCV

torch.manual_seed(42)
np.random.seed(42)
plt.rcParams['figure.figsize'] = (10, 6)

---
## 1. Why Regularization?

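Regularization addresses three problems:

1. **Singular $X^\top X$**: with more features ($n$) than samples ($m$), $X^\top X$ has rank at most $m < n$, so the normal equations have no unique solution
2. **Ill-conditioned $X^\top X$**: strongly correlated features make the solution highly sensitive to noise
3. **Overfitting**: the model fits noise in the training data instead of the underlying signal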
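
The second problem can be made concrete with a condition number. The sketch below is illustrative only: the two nearly identical features, the perturbation size `1e-4`, and the penalty `lam = 1.0` are assumptions chosen for the demonstration, not values from this notebook. It compares the conditioning of $X^\top X$ with and without a multiple of the identity added, which is the fix developed in Section 2.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 100                                   # number of samples (assumed)

# Two almost identical features: x2 is x1 plus a tiny perturbation.
x1 = rng.standard_normal(m)
x2 = x1 + 1e-4 * rng.standard_normal(m)
X_corr = np.column_stack([x1, x2])

gram = X_corr.T @ X_corr                  # the 2x2 matrix X'X
lam = 1.0                                 # illustrative penalty strength

cond_plain = np.linalg.cond(gram)
cond_ridge = np.linalg.cond(gram + lam * np.eye(2))

print(f"cond(X'X)      = {cond_plain:.2e}")
print(f"cond(X'X + λI) = {cond_ridge:.2e}")
```

A large condition number means small changes in $y$ can produce large changes in the fitted coefficients; the regularized matrix should be far better conditioned.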

In [None]:
# Demonstrate the underdetermined problem
m, n = 20, 100  # 20 samples, 100 features

X = torch.randn(m, n, dtype=torch.float64)
y = torch.randn(m, dtype=torch.float64)

print(f"Design matrix: {m} samples × {n} features")
print(f"Rank of X'X: {torch.linalg.matrix_rank(X.T @ X).item()} (need {n})")
print("\n✗ No unique least-squares solution: X'X is singular!")

---
## 2. Ridge Regression (L2)

**Objective:**
$$\hat{\beta}_{\text{ridge}} = \arg\min_{\beta} \|X\beta - y\|_2^2 + \lambda\|\beta\|_2^2$$

**Solution:**
$$\hat{\beta}_{\text{ridge}} = (X^\top X + \lambda I)^{-1} X^\top y$$
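
The closed form follows from setting the gradient of the objective to zero. A standard derivation, sketched here with $J(\beta)$ as shorthand for the objective above:

$$\begin{aligned}
J(\beta) &= (X\beta - y)^\top (X\beta - y) + \lambda\,\beta^\top \beta \\
\nabla_\beta J(\beta) &= 2X^\top (X\beta - y) + 2\lambda\beta = 0 \\
(X^\top X + \lambda I)\,\beta &= X^\top y
\end{aligned}$$

For $\lambda > 0$ the objective is strictly convex, so this stationary point is the unique minimizer.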

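Adding $\lambda I$ shifts every eigenvalue of $X^\top X$ up by $\lambda$. Because $X^\top X$ is positive semidefinite, all eigenvalues of $X^\top X + \lambda I$ are at least $\lambda$, so the matrix is invertible for any $\lambda > 0$.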
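
The eigenvalue shift can be checked numerically. This is a minimal sketch with assumed sizes (`m = 5`, `n = 8`) and an assumed penalty `lam = 0.5`; the names `X_small`, `eig_plain`, and `eig_ridge` are illustrative and not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 8                               # fewer samples than features, so X'X is singular
X_small = rng.standard_normal((m, n))
lam = 0.5                                 # illustrative penalty strength

gram = X_small.T @ X_small
# eigvalsh is for symmetric matrices and returns eigenvalues in ascending order.
eig_plain = np.linalg.eigvalsh(gram)
eig_ridge = np.linalg.eigvalsh(gram + lam * np.eye(n))

print("smallest eigenvalue of X'X:     ", eig_plain.min())
print("smallest eigenvalue of X'X + λI:", eig_ridge.min())
print("shifted by exactly λ:", np.allclose(eig_ridge, eig_plain + lam))
```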

In [None]:
def ridge_regression(X, y, lambd):
    """Solve ridge regression (no intercept) via the regularized normal equations."""
    n = X.shape[1]
    XtX_reg = X.T @ X + lambd * torch.eye(n, dtype=X.dtype)
    return torch.linalg.solve(XtX_reg, X.T @ y)

# Now we can solve!
beta_ridge = ridge_regression(X, y, lambd=1.0)
print(f"Ridge solution: ||β||₂ = {beta_ridge.norm().item():.4f}")
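
One way to sanity-check the closed form is to note that ridge is ordinary least squares on an augmented system: stacking $\sqrt{\lambda}\, I$ under $X$ and zeros under $y$ gives the same objective. The sketch below uses NumPy rather than torch to stay self-contained; the data sizes, the seed, and the names `X_aug`, `y_aug` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, lam = 20, 100, 1.0                  # same shape regime as above: m < n
X_demo = rng.standard_normal((m, n))
y_demo = rng.standard_normal(m)

# Closed form: (X'X + λI)^{-1} X'y
beta_closed = np.linalg.solve(X_demo.T @ X_demo + lam * np.eye(n), X_demo.T @ y_demo)

# Augmented least squares: ||X_aug b - y_aug||^2 = ||Xb - y||^2 + λ||b||^2
X_aug = np.vstack([X_demo, np.sqrt(lam) * np.eye(n)])
y_aug = np.concatenate([y_demo, np.zeros(n)])
beta_aug, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)

print("max abs difference:", np.abs(beta_closed - beta_aug).max())
```

The augmented matrix always has full column rank when $\lambda > 0$, which is another way to see why the ridge problem has a unique solution.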

---
## 3. LASSO (L1) - Sparse Solutions

**Objective:**
$$\hat{\beta}_{\text{lasso}} = \arg\min_{\beta} \|X\beta - y\|_2^2 + \lambda\|\beta\|_1$$

**Key difference:**
- Ridge: shrinks coefficients toward zero, but generally never makes them exactly zero
- LASSO: can set coefficients **exactly to zero**, which acts as feature selection
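
The difference is easiest to see in the special case of an orthonormal design ($X^\top X = I$), where both estimators have closed forms in terms of the least-squares coefficients: ridge rescales them by $1/(1+\lambda)$, while LASSO (with the objective above) soft-thresholds them at $\lambda/2$. The sketch below assumes that special case; the coefficient values in `beta_ols` are made up for illustration.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward zero by t, and set entries with |z| <= t exactly to zero."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

# Hypothetical least-squares coefficients under an orthonormal design.
beta_ols = np.array([3.0, -1.5, 0.4, -0.2, 0.05])
lam = 1.0

beta_ridge_orth = beta_ols / (1.0 + lam)            # proportional shrinkage
beta_lasso_orth = soft_threshold(beta_ols, lam / 2) # shift, then clip at zero

print("ridge nonzeros:", np.count_nonzero(beta_ridge_orth))
print("lasso nonzeros:", np.count_nonzero(beta_lasso_orth))
```

Ridge divides every coefficient by the same factor, so a nonzero input stays nonzero. LASSO subtracts a fixed amount, so any coefficient smaller in magnitude than the threshold is removed.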

In [None]:
# Compare sparsity
np.random.seed(42)
n_samples, n_features = 100, 50

X_np = np.random.randn(n_samples, n_features)
true_beta = np.zeros(n_features)
true_beta[:5] = np.random.randn(5) * 3
y_np = X_np @ true_beta + np.random.randn(n_samples) * 0.5

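# Note on scaling: sklearn's Ridge minimizes ||y - Xw||^2 + alpha*||w||^2, while
# Lasso minimizes (1/(2*n_samples))*||y - Xw||^2 + alpha*||w||_1, so the Lasso
# alpha is not on the same scale as λ in the formula above.
ridge = Ridge(alpha=1.0).fit(X_np, y_np)
lasso = Lasso(alpha=0.1).fit(X_np, y_np)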

print(f"True: {np.count_nonzero(true_beta)} nonzero coefficients")
print(f"Ridge: {np.sum(np.abs(ridge.coef_) > 1e-10)} nonzero")
print(f"LASSO: {np.sum(np.abs(lasso.coef_) > 1e-10)} nonzero")

---
## 4. Elastic Net - Best of Both

$$\hat{\beta} = \arg\min_{\beta} \|X\beta - y\|_2^2 + \lambda_1\|\beta\|_1 + \lambda_2\|\beta\|_2^2$$

| Method | Sparse? | Stable with correlated features? |
|--------|---------|----------------------------------|
| Ridge | No | Yes (shares weight across the group) |
| LASSO | Yes | No (tends to keep one feature per group) |
| Elastic Net | Yes | Yes |
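
The correlation column can be illustrated with an extreme case: two identical features. The sketch below is a from-scratch coordinate-descent solver for the objective written above, not scikit-learn's `ElasticNet` (which uses a different parameterization). The function name `elastic_net_cd`, the penalty values, and the simulated data are all assumptions for illustration.

```python
import numpy as np

def soft_threshold(z, t):
    """Scalar soft-thresholding operator."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def elastic_net_cd(X, y, lam1, lam2, n_sweeps=500):
    """Coordinate descent for ||Xb - y||^2 + lam1*||b||_1 + lam2*||b||_2^2."""
    n_features = X.shape[1]
    b = np.zeros(n_features)
    col_sq = (X ** 2).sum(axis=0)         # squared norm of each column
    for _ in range(n_sweeps):
        for j in range(n_features):
            # Residual with feature j's current contribution removed.
            r_j = y - X @ b + X[:, j] * b[j]
            z = X[:, j] @ r_j
            # Exact minimizer over b[j] with all other coordinates held fixed.
            b[j] = soft_threshold(z, lam1 / 2) / (col_sq[j] + lam2)
    return b

rng = np.random.default_rng(0)
x = rng.standard_normal(100)
X_dup = np.column_stack([x, x, rng.standard_normal(100)])  # columns 0 and 1 are identical
y_dup = 3.0 * x + 0.1 * rng.standard_normal(100)

b_lasso = elastic_net_cd(X_dup, y_dup, lam1=10.0, lam2=0.0)   # pure L1 penalty
b_enet = elastic_net_cd(X_dup, y_dup, lam1=10.0, lam2=10.0)   # L1 + L2 penalty

print("LASSO coefficients:      ", np.round(b_lasso, 3))
print("Elastic Net coefficients:", np.round(b_enet, 3))
```

With a pure L1 penalty the split between the duplicated columns is not unique, and this solver puts the weight on whichever column it updates first. Adding the L2 term makes the objective strictly convex, so the two identical columns receive equal coefficients.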

---
## 5. Genomics Applications

| Scenario | Method | Reason |
|----------|--------|--------|
| Polygenic traits (GWAS) | Ridge | Many small effects |
| Biomarker discovery | LASSO | Feature selection |
| scRNA-seq markers | Elastic Net | Genes in the same pathway are correlated |

In [None]:
# Biomarker discovery simulation
np.random.seed(42)
n_genes = 1000
true_markers = [42, 123, 256, 333, 444]

X_genes = np.random.randn(50, n_genes)
beta_genes = np.zeros(n_genes)
beta_genes[true_markers] = 2.0
y_response = X_genes @ beta_genes + np.random.randn(50) * 0.5

lasso_cv = LassoCV(cv=5).fit(X_genes, y_response)
discovered = np.where(np.abs(lasso_cv.coef_) > 1e-10)[0]

print(f"True markers: {true_markers}")
print(f"Discovered: {list(discovered)}")
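
Comparing the two printed lists by eye becomes impractical as the number of selected genes grows. A small helper such as the one below could summarize recovery as precision and recall. The function name `recovery_metrics` and the list `hypothetical_found` are illustrative assumptions; the indices in it are made up and are not the output of the cell above.

```python
def recovery_metrics(true_idx, found_idx):
    """Return (precision, recall) of a selected index set against the true set."""
    true_set, found_set = set(true_idx), set(found_idx)
    hits = true_set & found_set
    precision = len(hits) / len(found_set) if found_set else 0.0
    recall = len(hits) / len(true_set) if true_set else 0.0
    return precision, recall

true_markers_demo = [42, 123, 256, 333, 444]
hypothetical_found = [42, 123, 256, 700, 812]   # made-up indices, for illustration only

precision, recall = recovery_metrics(true_markers_demo, hypothetical_found)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

In the notebook, the same function could be called as `recovery_metrics(true_markers, discovered)`.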

---
## Summary

| Method | Penalty | Key Property |
|--------|---------|-------------|
| Ridge | $\lambda\|\beta\|_2^2$ | Shrinks, generally no exact zeros |
| LASSO | $\lambda\|\beta\|_1$ | Exact zeros (sparse) |
| Elastic Net | $\lambda_1\|\beta\|_1 + \lambda_2\|\beta\|_2^2$ | Sparse + stable |