# Module 11: Regularization

**Goal:** Understand L1 and L2 regularization, tune lambda via cross-validation, and prevent overfitting.

**Prerequisites:** Modules 3-4 (Linear/Logistic Regression), Module 10 (Feature Engineering)

**Expected Runtime:** ~20 minutes

**Outputs:**
- L1 vs L2 coefficient comparison
- Cross-validation lambda tuning
- Train vs test performance analysis

---

## Setup

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge, Lasso, ElasticNet, RidgeCV, LassoCV
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)
plt.rcParams['figure.figsize'] = (12, 5)

## Part 1: Generate Data with Known Sparsity

We'll create data where only some features actually matter.

In [None]:
# Generate data where only first 5 features matter
n_samples = 200
n_features = 20
n_informative = 5

# Random features
X = np.random.randn(n_samples, n_features)

# True coefficients (only first 5 are non-zero)
true_coefs = np.zeros(n_features)
true_coefs[:n_informative] = np.array([3, -2, 1.5, -1, 0.5])

# Generate target
y = X @ true_coefs + np.random.randn(n_samples) * 0.5

print(f"Dataset: {n_samples} samples, {n_features} features")
print(f"True informative features: {n_informative}")
print(f"\nTrue coefficients:")
for i, c in enumerate(true_coefs):
    if c != 0:
        print(f"  X{i+1}: {c:.2f}")

## Part 2: Split and Scale

In [None]:
# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# IMPORTANT: Scale features before regularization!
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"Train: {len(X_train)} samples")
print(f"Test: {len(X_test)} samples")
print("\n✅ Features scaled (mean=0, std=1)")

In [None]:
# DEMO: What happens WITHOUT scaling?
# Let's see Lasso fail when features are on different scales

# Create data with varying scales
X_varied = X.copy()
X_varied[:, 0] *= 1000  # First feature: range ~1000
X_varied[:, 1] *= 100   # Second feature: range ~100
X_varied[:, 2] *= 0.01  # Third feature: range ~0.01

# Lasso without scaling
lasso_unscaled = Lasso(alpha=1.0).fit(X_varied, y)

print("=== ⚠️ FAILURE MODE: Lasso WITHOUT Scaling ===")
print("\nTrue coefficients: X1=3, X2=-2, X3=1.5, X4=-1, X5=0.5")
print("\nLasso coefficients (UNSCALED features):")
for i in range(5):
    coef = lasso_unscaled.coef_[i]
    status = "ZEROED!" if abs(coef) < 0.001 else f"{coef:.4f}"
    print(f"  X{i+1}: {status}")

print("\n🚨 PROBLEM: Feature scale affects which coefficients get zeroed!")
print("   X1 (scale x1000) → needs only a tiny coefficient → barely penalized, kept")
print("   X3 (scale x0.01) → needs a huge coefficient → heavily penalized, zeroed")
print("\n💡 SOLUTION: Always scale features BEFORE regularization!")
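
One way to make scaling hard to forget is to bundle it with the model. The sketch below is a minimal illustration on freshly generated synthetic data; the variable names, the column scales, and `alpha=0.1` are assumptions for this example, not part of the module's dataset. It wraps `StandardScaler` and `Lasso` in a scikit-learn pipeline, so the penalty always sees standardized columns.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.standard_normal((200, 5))
y_demo = X_demo @ np.array([3.0, -2.0, 1.5, 0.0, 0.0]) + 0.5 * rng.standard_normal(200)

# Put the columns on very different scales, mirroring the failure demo
X_demo_varied = X_demo * np.array([1000.0, 100.0, 0.01, 1.0, 1.0])

# The pipeline standardizes inside fit(), so every column is penalized on equal footing
scaled_lasso = make_pipeline(StandardScaler(), Lasso(alpha=0.1))
scaled_lasso.fit(X_demo_varied, y_demo)

# Coefficients are expressed per one standard deviation of each feature
coefs_demo = scaled_lasso.named_steps["lasso"].coef_
print(np.round(coefs_demo, 2))
```

Because the scaler lives inside the pipeline, calling `predict` on new data reuses the training-set mean and standard deviation automatically.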

## Part 3: Compare L1 vs L2

In [None]:
# Fit models with same regularization strength
alpha = 0.1

ridge = Ridge(alpha=alpha)
lasso = Lasso(alpha=alpha)

ridge.fit(X_train_scaled, y_train)
lasso.fit(X_train_scaled, y_train)

# Compare coefficients
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

x_pos = np.arange(n_features)

# Ridge coefficients
ax1 = axes[0]
colors = ['#0ea5e9' if c > 0 else '#ef4444' for c in ridge.coef_]
ax1.bar(x_pos, ridge.coef_, color=colors, alpha=0.7)
ax1.axhline(y=0, color='gray', linestyle='--')
ax1.set_xticks(x_pos)
ax1.set_xticklabels([f'X{i+1}' for i in range(n_features)], rotation=45)
ax1.set_title(f'Ridge (L2) Coefficients (alpha={alpha})')
ax1.set_ylabel('Coefficient')

# Lasso coefficients
ax2 = axes[1]
colors = ['#22c55e' if c > 0 else '#ef4444' if c < 0 else '#94a3b8' for c in lasso.coef_]
ax2.bar(x_pos, lasso.coef_, color=colors, alpha=0.7)
ax2.axhline(y=0, color='gray', linestyle='--')
ax2.set_xticks(x_pos)
ax2.set_xticklabels([f'X{i+1}' for i in range(n_features)], rotation=45)
ax2.set_title(f'Lasso (L1) Coefficients (alpha={alpha})')
ax2.set_ylabel('Coefficient')

plt.tight_layout()
plt.show()

print("=== Coefficient Comparison ===")
print(f"Ridge: {(np.abs(ridge.coef_) > 0.01).sum()} non-zero coefficients")
print(f"Lasso: {(np.abs(lasso.coef_) > 0.01).sum()} non-zero coefficients")
print("\n💡 Lasso zeros out irrelevant features - automatic feature selection!")
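
Why does Lasso produce exact zeros while Ridge only shrinks? In the idealized single-coefficient case, both penalties have closed-form solutions: L1 subtracts a fixed amount and clips at zero (soft-thresholding), while L2 rescales by a constant factor. The sketch below illustrates that contrast. The helper names and the example values are hypothetical, and this is not how scikit-learn solves the general problem.

```python
import numpy as np

def soft_threshold(z, lam):
    """Minimizer of 0.5*(b - z)**2 + lam*|b|: shrink by lam, clip at zero."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def ridge_shrink(z, lam):
    """Minimizer of 0.5*(b - z)**2 + 0.5*lam*b**2: rescale by 1/(1 + lam)."""
    return z / (1.0 + lam)

# Illustrative unpenalized estimates: three real effects, two near-zero ones
ols_coefs = np.array([3.0, -2.0, 0.5, 0.05, -0.02])
lam = 0.1

l1_coefs = soft_threshold(ols_coefs, lam)
l2_coefs = ridge_shrink(ols_coefs, lam)
print("L1:", l1_coefs)
print("L2:", l2_coefs)
```

Any estimate smaller in magnitude than `lam` is set exactly to zero by the L1 rule, whereas the L2 rule never reaches zero for a non-zero input.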

## Part 4: Regularization Path

Let's see how coefficients change as we increase regularization.

In [None]:
# Test different alphas
alphas = np.logspace(-3, 2, 50)

ridge_coefs = []
lasso_coefs = []

for a in alphas:
    ridge = Ridge(alpha=a).fit(X_train_scaled, y_train)
    lasso = Lasso(alpha=a, max_iter=10000).fit(X_train_scaled, y_train)
    ridge_coefs.append(ridge.coef_)
    lasso_coefs.append(lasso.coef_)

ridge_coefs = np.array(ridge_coefs)
lasso_coefs = np.array(lasso_coefs)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Ridge path
ax1 = axes[0]
for i in range(n_features):
    ax1.plot(alphas, ridge_coefs[:, i], label=f'X{i+1}' if i < 5 else None)
ax1.set_xscale('log')
ax1.axhline(y=0, color='gray', linestyle='--', alpha=0.5)
ax1.set_xlabel('Alpha (log scale)')
ax1.set_ylabel('Coefficient')
ax1.set_title('Ridge (L2) Coefficient Path')
ax1.legend(loc='upper right')

# Lasso path
ax2 = axes[1]
for i in range(n_features):
    ax2.plot(alphas, lasso_coefs[:, i], label=f'X{i+1}' if i < 5 else None)
ax2.set_xscale('log')
ax2.axhline(y=0, color='gray', linestyle='--', alpha=0.5)
ax2.set_xlabel('Alpha (log scale)')
ax2.set_ylabel('Coefficient')
ax2.set_title('Lasso (L1) Coefficient Path')
ax2.legend(loc='upper right')

plt.tight_layout()
plt.show()

print("💡 Notice: Lasso coefficients hit exactly zero; Ridge just shrinks toward zero.")
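
The loop above refits a model for every alpha. As an alternative sketch, scikit-learn's `lasso_path` computes the whole Lasso path in one call. The data below is synthetic and the variable names are assumptions for illustration. Note that `lasso_path` does not fit an intercept, so the inputs are centered first.

```python
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(1)
X_path = rng.standard_normal((150, 8))
beta_path = np.array([3.0, -2.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0])
y_path = X_path @ beta_path + 0.5 * rng.standard_normal(150)

# lasso_path has no intercept term, so center features and target
X_path = X_path - X_path.mean(axis=0)
y_path = y_path - y_path.mean()

path_alphas, path_coefs, _ = lasso_path(X_path, y_path, alphas=np.logspace(-3, 1, 30))

# path_coefs has shape (n_features, n_alphas); count surviving coefficients per alpha
n_active = (path_coefs != 0).sum(axis=0)
for a, k in zip(path_alphas[::6], n_active[::6]):
    print(f"alpha={a:8.4f}  non-zero coefficients={k}")
```

Use the returned `path_alphas` rather than the input grid when plotting, since the function may reorder the alphas it was given.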

## Part 5: Finding Optimal Alpha via Cross-Validation

In [None]:
# Use built-in CV to find best alpha
alphas_to_try = np.logspace(-3, 2, 100)

ridge_cv = RidgeCV(alphas=alphas_to_try, cv=5)
lasso_cv = LassoCV(alphas=alphas_to_try, cv=5, max_iter=10000)

ridge_cv.fit(X_train_scaled, y_train)
lasso_cv.fit(X_train_scaled, y_train)

print("=== Cross-Validation Results ===")
print(f"\nRidge optimal alpha: {ridge_cv.alpha_:.4f}")
print(f"Lasso optimal alpha: {lasso_cv.alpha_:.4f}")

# Evaluate on test set
print("\n=== Test Set Performance ===")
print(f"Ridge Test R²: {r2_score(y_test, ridge_cv.predict(X_test_scaled)):.4f}")
print(f"Lasso Test R²: {r2_score(y_test, lasso_cv.predict(X_test_scaled)):.4f}")

print(f"\nLasso selected {(np.abs(lasso_cv.coef_) > 0.01).sum()} features out of {n_features}")
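
To make explicit what `LassoCV` automates, here is a hand-rolled sketch of 5-fold cross-validation over a small alpha grid. The synthetic data, the grid, and the variable names are assumptions for illustration. The scaler sits inside a pipeline, so each fold is standardized with its own training statistics.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X_cv = rng.standard_normal((140, 20))
beta_cv = np.zeros(20)
beta_cv[:5] = [3, -2, 1.5, -1, 0.5]
y_cv = X_cv @ beta_cv + 0.5 * rng.standard_normal(140)

folds = KFold(n_splits=5, shuffle=True, random_state=0)
candidate_alphas = np.logspace(-3, 1, 9)

mean_cv_mse = []
for a in candidate_alphas:
    model = make_pipeline(StandardScaler(), Lasso(alpha=a, max_iter=10000))
    # scikit-learn reports negated MSE so that higher scores are better
    scores = cross_val_score(model, X_cv, y_cv, cv=folds,
                             scoring="neg_mean_squared_error")
    mean_cv_mse.append(-scores.mean())

best_alpha = candidate_alphas[int(np.argmin(mean_cv_mse))]
print(f"Best alpha by 5-fold CV: {best_alpha:.4f}")
```

The test set plays no part in this selection; it is reserved for the final performance estimate.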

## Part 6: Train vs Test Error Curve

Visualize the bias-variance tradeoff.

In [None]:
train_errors_ridge = []
test_errors_ridge = []
train_errors_lasso = []
test_errors_lasso = []

for a in alphas:
    # Ridge
    ridge = Ridge(alpha=a).fit(X_train_scaled, y_train)
    train_errors_ridge.append(mean_squared_error(y_train, ridge.predict(X_train_scaled)))
    test_errors_ridge.append(mean_squared_error(y_test, ridge.predict(X_test_scaled)))
    
    # Lasso
    lasso = Lasso(alpha=a, max_iter=10000).fit(X_train_scaled, y_train)
    train_errors_lasso.append(mean_squared_error(y_train, lasso.predict(X_train_scaled)))
    test_errors_lasso.append(mean_squared_error(y_test, lasso.predict(X_test_scaled)))

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Ridge
ax1 = axes[0]
ax1.plot(alphas, train_errors_ridge, 'b-', label='Train MSE')
ax1.plot(alphas, test_errors_ridge, 'r-', label='Test MSE')
ax1.axvline(x=ridge_cv.alpha_, color='green', linestyle='--', label=f'CV optimal ({ridge_cv.alpha_:.3f})')
ax1.set_xscale('log')
ax1.set_xlabel('Alpha (log scale)')
ax1.set_ylabel('MSE')
ax1.set_title('Ridge: Train vs Test Error')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Lasso
ax2 = axes[1]
ax2.plot(alphas, train_errors_lasso, 'b-', label='Train MSE')
ax2.plot(alphas, test_errors_lasso, 'r-', label='Test MSE')
ax2.axvline(x=lasso_cv.alpha_, color='green', linestyle='--', label=f'CV optimal ({lasso_cv.alpha_:.3f})')
ax2.set_xscale('log')
ax2.set_xlabel('Alpha (log scale)')
ax2.set_ylabel('MSE')
ax2.set_title('Lasso: Train vs Test Error')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("💡 Insight:")
print("  • Left side (low alpha): Overfit - train error low, test error high")
print("  • Right side (high alpha): Underfit - both errors high")
print("  • Sweet spot: CV picks the alpha that minimizes validation error on the training folds,")
print("    which should land near the minimum of the test curve")

## Part 7: TODO - Compare Feature Recovery

In [None]:
# TODO: Compare how well each method recovers the true features

print("=== Feature Recovery Comparison ===")
print("\nTrue coefficients (first 5 are informative):")

results = pd.DataFrame({
    'Feature': [f'X{i+1}' for i in range(n_features)],
    'True': true_coefs,
    'Ridge': ridge_cv.coef_,
    'Lasso': lasso_cv.coef_
})

# Show first 10
print(results.head(10).to_string(index=False))

# Calculate recovery accuracy
true_nonzero = set(np.where(np.abs(true_coefs) > 0.01)[0])
lasso_nonzero = set(np.where(np.abs(lasso_cv.coef_) > 0.01)[0])

print(f"\nTrue informative features: {true_nonzero}")
print(f"Lasso selected features: {lasso_nonzero}")
print(f"Correctly identified: {true_nonzero & lasso_nonzero}")
print(f"False positives: {lasso_nonzero - true_nonzero}")
print(f"Missed: {true_nonzero - lasso_nonzero}")
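
The set comparison above can be condensed into two numbers: precision (what fraction of selected features are truly informative) and recall (what fraction of informative features were selected). The helper below is a hypothetical sketch with made-up example coefficients; it reuses the same 0.01 threshold as the cell above.

```python
import numpy as np

def support_recovery(true_coefs, est_coefs, tol=0.01):
    """Return (precision, recall) of the estimated non-zero set."""
    true_set = set(np.flatnonzero(np.abs(true_coefs) > tol))
    est_set = set(np.flatnonzero(np.abs(est_coefs) > tol))
    hits = len(true_set & est_set)
    precision = hits / len(est_set) if est_set else 0.0
    recall = hits / len(true_set) if true_set else 0.0
    return precision, recall

# Made-up example: one informative feature missed, one noise feature kept
true_demo = np.array([3.0, -2.0, 1.5, 0.0, 0.0, 0.0])
est_demo = np.array([2.9, -1.9, 0.0, 0.0, 0.05, 0.0])

precision, recall = support_recovery(true_demo, est_demo)
print(f"precision={precision:.2f}, recall={recall:.2f}")
# prints: precision=0.67, recall=0.67
```

With the notebook's variables, the call would be `support_recovery(true_coefs, lasso_cv.coef_)`.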

## Part 8: TODO - Elastic Net

Elastic Net combines the L1 and L2 penalties: it can zero out coefficients like Lasso while staying more stable under correlated features, like Ridge.

In [None]:
# TODO: Try Elastic Net with different l1_ratio
# l1_ratio=1 is pure Lasso, l1_ratio=0 is pure Ridge

from sklearn.linear_model import ElasticNetCV

# Test different L1/L2 mixes
l1_ratios = [0.1, 0.5, 0.9]

print("=== Elastic Net Comparison ===")
for ratio in l1_ratios:
    elastic = ElasticNetCV(l1_ratio=ratio, alphas=alphas_to_try, cv=5, max_iter=10000)
    elastic.fit(X_train_scaled, y_train)
    
    n_selected = (np.abs(elastic.coef_) > 0.01).sum()
    test_r2 = r2_score(y_test, elastic.predict(X_test_scaled))
    
    print(f"\nl1_ratio={ratio:.1f}: {n_selected} features, R²={test_r2:.4f}, alpha={elastic.alpha_:.4f}")
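
Instead of looping over `l1_ratio` by hand, `ElasticNetCV` also accepts a list of ratios and cross-validates over the ratio and alpha jointly. The sketch below uses synthetic data with assumed variable names. The features are drawn as standard normal, so the scaler is skipped here for brevity.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(3)
X_en = rng.standard_normal((140, 20))
beta_en = np.zeros(20)
beta_en[:5] = [3, -2, 1.5, -1, 0.5]
y_en = X_en @ beta_en + 0.5 * rng.standard_normal(140)

# Passing a list makes the estimator search every (l1_ratio, alpha) pair
enet = ElasticNetCV(
    l1_ratio=[0.1, 0.5, 0.9],
    alphas=np.logspace(-3, 1, 30),
    cv=5,
    max_iter=10000,
)
enet.fit(X_en, y_en)

n_kept = int((np.abs(enet.coef_) > 0.01).sum())
print(f"chosen l1_ratio={enet.l1_ratio_}, alpha={enet.alpha_:.4f}, features kept={n_kept}")
```

The selected mix is exposed as `l1_ratio_` and the selected strength as `alpha_`, both chosen using the training folds only.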

## Self-Check

Run the asserts below to verify your regularization models are correct.

In [None]:
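# SELF-CHECK: Verify your regularization models
# Note: `ridge` and `lasso` were last reassigned in the Part 6 loop,
# so they hold the fits for the largest alpha in `alphas`.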
assert hasattr(ridge, 'coef_'), "Ridge model should be fitted"
assert hasattr(lasso, 'coef_'), "Lasso model should be fitted"
n_lasso_zero = (np.abs(lasso.coef_) < 0.01).sum()
assert n_lasso_zero > 0, "Lasso should zero out at least some coefficients"
print(f"✅ Self-check passed! Lasso zeroed {n_lasso_zero}/{len(lasso.coef_)} coefficients")

## Part 9: Stakeholder Summary

### TODO: Write a 3-bullet summary (~100 words) for the PM

Template:
- **What regularization does:** Prevents overfitting by [penalty description]. This is important when [scenario].
- **Recommendation:** Use [L1/L2/ElasticNet] because [reason based on feature selection needs].
- **How we tuned it:** Cross-validation found optimal λ = ____, selecting ____ features with R² = ____.

### Your Summary:

*Write your explanation here...*

---

## Key Takeaways

1. **L1 (Lasso)** zeros out coefficients → automatic feature selection
2. **L2 (Ridge)** shrinks all coefficients → stable when features are correlated
3. **Always scale** before applying regularization
4. **Use cross-validation** to find optimal alpha
5. **Monitor train vs test** error to detect over/underfitting

### sklearn Parameter Gotcha
- Ridge/Lasso: `alpha` = λ (higher = more regularization)
- LogisticRegression: `C` = 1/λ (higher = LESS regularization)
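
The inverse relationship for `C` is easy to check empirically. The sketch below fits `LogisticRegression` at three values of `C` on synthetic data (the data and variable names are assumptions for illustration) and prints the coefficient norm, which should grow as `C` increases and the penalty weakens.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
X_clf = rng.standard_normal((300, 5))
logits = X_clf @ np.array([2.0, -1.0, 0.5, 0.0, 0.0])
# Adding logistic noise and thresholding at zero yields labels from a logistic model
y_clf = (logits + rng.logistic(size=300) > 0).astype(int)

coef_norms = {}
for C in [0.01, 1.0, 100.0]:
    clf = LogisticRegression(C=C, max_iter=1000).fit(X_clf, y_clf)
    coef_norms[C] = float(np.linalg.norm(clf.coef_))
    print(f"C={C:>6}: ||coef|| = {coef_norms[C]:.3f}")
```

A small `C` behaves like a large `alpha` in Ridge or Lasso: coefficients are pulled strongly toward zero.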

### Next Steps
- Explore the interactive playground
- Complete the quiz