# Module 08: Regularization (L1, L2, Elastic Net)

**Difficulty**: ⭐⭐ Intermediate  
**Estimated Time**: 75 minutes  
**Prerequisites**: 
- [Module 03: Linear Regression](03_linear_regression.ipynb)
- [Module 07: Cross-Validation and Hyperparameter Tuning](07_cross_validation_hyperparameter_tuning.ipynb)

## Learning Objectives

By the end of this notebook, you will be able to:
1. Understand overfitting vs underfitting and the bias-variance tradeoff
2. Apply Ridge regression (L2 regularization) to prevent overfitting
3. Use Lasso regression (L1 regularization) for feature selection
4. Implement Elastic Net (combination of L1 and L2)
5. Choose appropriate regularization strength (alpha parameter)
6. Visualize how regularization affects model coefficients

## 1. The Overfitting Problem

### What is Overfitting?

**Overfitting**: Model learns training data *too well*, including noise!

**Symptoms**:
- Perfect (or very high) training score
- Much lower test score
- Model memorizes data instead of learning patterns

### Underfitting vs Overfitting vs Just Right

```
UNDERFITTING          JUST RIGHT           OVERFITTING
Too Simple            Balanced             Too Complex
    |                    |                      |
  High Bias         Low Bias             Low Bias
  Low Variance      Low Variance         High Variance
    |                    |                      |
  Poor on both      Good on both         Great on train
  train and test    train and test       Poor on test
```

### Real-World Analogy

**Studying for an exam**:
- **Underfitting**: Only study one chapter (miss key concepts)
- **Just Right**: Understand principles (can handle variations)
- **Overfitting**: Memorize all practice problems word-for-word (can't handle new questions)

### The Solution: Regularization

**Regularization**: Add penalty for model complexity!

**Benefits**:
- Prevents overfitting
- Improves generalization
- Can perform automatic feature selection

In [None]:
# Setup
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

# Visualization settings
%matplotlib inline
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')

print("✓ Setup complete!")

In [None]:
# Demonstrate overfitting with polynomial features
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Generate synthetic data with noise
np.random.seed(42)
X_simple = np.linspace(0, 10, 50).reshape(-1, 1)
y_simple = 2 * X_simple.ravel() + 5 + np.random.normal(0, 2, 50)

# Split data
X_train_simple, X_test_simple, y_train_simple, y_test_simple = train_test_split(
    X_simple, y_simple, test_size=0.3, random_state=42
)

# Test different polynomial degrees
degrees = [1, 3, 15]
results = []

for degree in degrees:
    # Transform features
    poly = PolynomialFeatures(degree=degree)
    X_train_poly = poly.fit_transform(X_train_simple)
    X_test_poly = poly.transform(X_test_simple)
    
    # Train model
    model = LinearRegression()
    model.fit(X_train_poly, y_train_simple)
    
    # Evaluate
    train_score = model.score(X_train_poly, y_train_simple)
    test_score = model.score(X_test_poly, y_test_simple)
    
    results.append({
        'Degree': degree,
        'Train R²': train_score,
        'Test R²': test_score,
        'Gap': train_score - test_score
    })

results_df = pd.DataFrame(results)
print("Impact of Model Complexity:\n")
print(results_df.to_string(index=False))
print("\nObservations:")
print("- Degree 1: Matches the true relationship (the data was generated from a noisy line)")
print("- Degree 3: A little extra flexibility, usually with similar scores")
print("- Degree 15: Overfitting (highest train score, usually the largest train/test gap)")

In [None]:
# Visualize underfitting vs overfitting
fig, axes = plt.subplots(1, 3, figsize=(16, 4))

for idx, degree in enumerate(degrees):
    # Transform and fit
    poly = PolynomialFeatures(degree=degree)
    X_train_poly = poly.fit_transform(X_train_simple)
    model = LinearRegression()
    model.fit(X_train_poly, y_train_simple)
    
    # Generate smooth curve for visualization
    X_plot = np.linspace(0, 10, 200).reshape(-1, 1)
    X_plot_poly = poly.transform(X_plot)
    y_plot = model.predict(X_plot_poly)
    
    # Plot
    axes[idx].scatter(X_train_simple, y_train_simple, alpha=0.6, label='Training data')
    axes[idx].scatter(X_test_simple, y_test_simple, alpha=0.6, label='Test data')
    axes[idx].plot(X_plot, y_plot, 'r-', linewidth=2, label='Model fit')
    
    title = f"Degree {degree}\n"
    if degree == 1:
        title += "SIMPLE (matches the true line)"
    elif degree == 3:
        title += "MODERATE FLEXIBILITY"
    else:
        title += "OVERFITTING"
    
    axes[idx].set_title(title, fontsize=12, fontweight='bold')
    axes[idx].set_xlabel('X', fontsize=11)
    axes[idx].set_ylabel('y', fontsize=11)
    axes[idx].legend(fontsize=9)
    axes[idx].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Notice how the degree-15 model bends to follow individual training points")
print("and can make wild predictions between and beyond them!")
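
As a preview of the remedy developed in the next section, the sketch below refits the degree-15 polynomial with an L2 penalty. It regenerates the same kind of synthetic data so that it runs on its own; the pipeline layout, the helper `poly_model`, and `alpha=1.0` are illustrative assumptions rather than tuned choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Mirror the synthetic data above: a noisy straight line
rng = np.random.RandomState(42)
X_demo = np.linspace(0, 10, 50).reshape(-1, 1)
y_demo = 2 * X_demo.ravel() + 5 + rng.normal(0, 2, 50)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

def poly_model(regressor):
    # include_bias=False because each regressor fits its own intercept
    return make_pipeline(
        PolynomialFeatures(degree=15, include_bias=False),
        StandardScaler(),
        regressor,
    )

ols_poly = poly_model(LinearRegression()).fit(X_tr, y_tr)
ridge_poly = poly_model(Ridge(alpha=1.0)).fit(X_tr, y_tr)  # alpha is an arbitrary choice

for name, model in [("No penalty", ols_poly), ("Ridge", ridge_poly)]:
    coef_norm = np.linalg.norm(model[-1].coef_)
    print(f"{name:10s} | train R²: {model.score(X_tr, y_tr):.3f} "
          f"| test R²: {model.score(X_te, y_te):.3f} | coef norm: {coef_norm:.2f}")
```

The coefficient norm is the quantity to watch: the unpenalized fit relies on very large, mutually cancelling coefficients, while the penalized fit keeps them small.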

## 2. Ridge Regression (L2 Regularization)

### How It Works

**Standard Linear Regression**:
- Minimize: Sum of Squared Errors
- Problem: Can have very large coefficients

**Ridge Regression**:
- Minimize: Sum of Squared Errors + α × (Sum of Squared Coefficients)
- Penalty term: L2 norm = sum of squared coefficients

### Mathematical Formula

```
Loss = SSE + α × Σ(coefficient²)
         ↑             ↑
    Fit data    Regularization penalty
(SSE = sum of squared errors)
```
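
To make the formula concrete, here is a minimal sketch that solves the Ridge objective in closed form with NumPy and compares the result with scikit-learn's `Ridge`. The toy data and variable names are assumptions made for illustration. The data is centered first so that the intercept stays out of the penalty, which is also how `Ridge` treats it when `fit_intercept=True`.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=100)
alpha = 10.0

# Center X and y so the intercept drops out and is not penalized
Xc = X - X.mean(axis=0)
yc = y - y.mean()

# Closed-form minimizer of ||yc - Xc w||² + alpha * ||w||²
w_closed = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)

ridge = Ridge(alpha=alpha).fit(X, y)

print("closed form:", np.round(w_closed, 4))
print("sklearn    :", np.round(ridge.coef_, 4))
```

The `alpha * np.eye(...)` term is the whole effect of the penalty: it is added to the diagonal of `Xc.T @ Xc`, which also keeps the system well conditioned when features are correlated.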

### Effect of Alpha (α)

- **α = 0**: No regularization (standard linear regression)
- **α → small**: Slight regularization
- **α → large**: Strong regularization (coefficients shrink toward 0)
- **α → ∞**: All coefficients → 0 (underfitting)
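
The effect of α can be checked numerically. This sketch, on assumed toy data from `make_regression`, fits Ridge at several α values and prints the length of the coefficient vector, which should fall steadily as α grows.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=10, noise=15.0, random_state=0)
X = StandardScaler().fit_transform(X)

alphas_demo = [0.01, 1, 100, 10_000, 1_000_000]
coef_norms = [np.linalg.norm(Ridge(alpha=a).fit(X, y).coef_) for a in alphas_demo]

for a, norm in zip(alphas_demo, coef_norms):
    print(f"alpha = {a:>9}: ||coef|| = {norm:.4f}")
```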

### Key Properties

✓ Shrinks coefficients but doesn't make them exactly 0  
✓ Works well when all features are relevant  
✓ Stable with correlated features  
✓ Must scale features first!
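
Why scaling matters can be seen in a small experiment. The sketch below (toy data and the helper `fitted_predictions` are assumptions) expresses one feature in units that make its values 1000 times larger. Ordinary least squares compensates by shrinking that coefficient, so its predictions do not change. Ridge penalizes coefficient size directly, so the same change of units alters what it predicts.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.5, size=100)

# Same information, but feature 0 now has values 1000x larger (a change of units)
X_rescaled = X * np.array([1000.0, 1.0])

def fitted_predictions(model_cls, features, **kwargs):
    # Fit on the given features and predict on the same rows
    return model_cls(**kwargs).fit(features, y).predict(features)

ols_same = np.allclose(
    fitted_predictions(LinearRegression, X),
    fitted_predictions(LinearRegression, X_rescaled),
)
ridge_same = np.allclose(
    fitted_predictions(Ridge, X, alpha=100.0),
    fitted_predictions(Ridge, X_rescaled, alpha=100.0),
)
print("OLS predictions unchanged by rescaling:  ", ols_same)
print("Ridge predictions unchanged by rescaling:", ridge_same)
```

Standardizing every feature first puts all coefficients on a comparable scale, so the penalty treats them evenly.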

In [None]:
# Load California housing dataset
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

housing_df = pd.read_csv('data/sample/california_housing.csv')

X_housing = housing_df.drop('median_house_value', axis=1)
y_housing = housing_df['median_house_value']

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X_housing, y_housing, test_size=0.3, random_state=42
)

# Scale features (IMPORTANT for regularization!)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"Dataset shape: {X_housing.shape}")
print(f"Features: {list(X_housing.columns)}")

In [None]:
# Compare Linear Regression vs Ridge
# Standard Linear Regression
lr_model = LinearRegression()
lr_model.fit(X_train_scaled, y_train)
lr_train_score = lr_model.score(X_train_scaled, y_train)
lr_test_score = lr_model.score(X_test_scaled, y_test)

# Ridge Regression (α=10)
ridge_model = Ridge(alpha=10)
ridge_model.fit(X_train_scaled, y_train)
ridge_train_score = ridge_model.score(X_train_scaled, y_train)
ridge_test_score = ridge_model.score(X_test_scaled, y_test)

print("Linear Regression (No Regularization):")
print(f"  Train R²: {lr_train_score:.3f}")
print(f"  Test R²:  {lr_test_score:.3f}")
print(f"  Gap:      {lr_train_score - lr_test_score:.3f}\n")

print("Ridge Regression (α=10):")
print(f"  Train R²: {ridge_train_score:.3f}")
print(f"  Test R²:  {ridge_test_score:.3f}")
print(f"  Gap:      {ridge_train_score - ridge_test_score:.3f}")

print("\nRidge trades a little training fit for lower variance; compare the two gaps above to see whether it helped here.")

In [None]:
# Test different alpha values
alphas = [0.001, 0.01, 0.1, 1, 10, 100, 1000]
train_scores = []
test_scores = []

for alpha in alphas:
    model = Ridge(alpha=alpha)
    model.fit(X_train_scaled, y_train)
    train_scores.append(model.score(X_train_scaled, y_train))
    test_scores.append(model.score(X_test_scaled, y_test))

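# Find best alpha
# (shortcut for illustration: choosing alpha on the test set makes the reported
# test score optimistic; in practice, select alpha with cross-validation on the training data)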
best_idx = np.argmax(test_scores)
best_alpha = alphas[best_idx]

print(f"Best alpha: {best_alpha}")
print(f"Best test R²: {test_scores[best_idx]:.3f}")
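
Choosing α by test-set score reuses the test set for model selection. A sketch of the cross-validated alternative follows, using `GridSearchCV` over a scaling-plus-Ridge pipeline. The data here is a stand-in from `make_regression` so the example runs on its own (an assumption, not the housing CSV), and the variable names are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): 8 numeric features, like a small tabular dataset
X_cv, y_cv = make_regression(n_samples=300, n_features=8, noise=20.0, random_state=0)
X_cv_train, X_cv_test, y_cv_train, y_cv_test = train_test_split(
    X_cv, y_cv, test_size=0.3, random_state=42
)

candidate_alphas = [0.001, 0.01, 0.1, 1, 10, 100, 1000]

# The scaler is part of the pipeline, so it is refit on the training part of every fold
pipe = make_pipeline(StandardScaler(), Ridge())
search = GridSearchCV(pipe, {"ridge__alpha": candidate_alphas}, cv=5, scoring="r2")
search.fit(X_cv_train, y_cv_train)

chosen_alpha = search.best_params_["ridge__alpha"]
print(f"Alpha chosen by 5-fold CV: {chosen_alpha}")
print(f"Mean CV R² at that alpha:  {search.best_score_:.3f}")
# The test set is used once, after alpha has been fixed
print(f"Held-out test R²:          {search.score(X_cv_test, y_cv_test):.3f}")
```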

In [None]:
# Visualize alpha tuning
plt.figure(figsize=(10, 6))
plt.semilogx(alphas, train_scores, 'o-', linewidth=2, markersize=8, label='Train R²')
plt.semilogx(alphas, test_scores, 's-', linewidth=2, markersize=8, label='Test R²')
plt.scatter([best_alpha], [test_scores[best_idx]], s=300, c='green', 
           marker='*', label=f'Best α={best_alpha}', zorder=5)
plt.xlabel('Alpha (Regularization Strength)', fontsize=12)
plt.ylabel('R² Score', fontsize=12)
plt.title('Ridge Regression: Impact of Alpha\n(Low α = weak regularization, High α = strong regularization)', 
         fontsize=13, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("Observations:")
print("- α too small: Approaches standard linear regression")
print("- α too large: Underfitting (all coefficients shrink to ~0)")
print(f"- α = {best_alpha}: Best balance!")

In [None]:
# Compare coefficients: Linear vs Ridge
coef_comparison = pd.DataFrame({
    'Feature': X_housing.columns,
    'Linear Regression': lr_model.coef_,
    'Ridge (α=10)': ridge_model.coef_
})

print("Coefficient Comparison:\n")
print(coef_comparison.to_string(index=False))
print("\nNotice: Ridge coefficients are generally smaller in magnitude (shrunk toward 0)")

In [None]:
# Visualize coefficient shrinkage
fig, ax = plt.subplots(figsize=(10, 6))

x_pos = np.arange(len(X_housing.columns))
width = 0.35

ax.bar(x_pos - width/2, lr_model.coef_, width, label='Linear Regression', alpha=0.7)
ax.bar(x_pos + width/2, ridge_model.coef_, width, label='Ridge (α=10)', alpha=0.7)

ax.set_xlabel('Feature', fontsize=12)
ax.set_ylabel('Coefficient Value', fontsize=12)
ax.set_title('Coefficient Shrinkage with Ridge Regularization', fontsize=13, fontweight='bold')
ax.set_xticks(x_pos)
ax.set_xticklabels(X_housing.columns, rotation=45, ha='right')
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3, axis='y')
ax.axhline(y=0, color='black', linewidth=0.8)
plt.tight_layout()
plt.show()

## 3. Lasso Regression (L1 Regularization)

### How It Works

**Lasso Regression**:
- Minimize: Sum of Squared Errors + α × (Sum of Absolute Coefficients)
- Penalty term: L1 norm = sum of absolute values

### Mathematical Formula

```
Loss = SSE / (2n) + α × Σ|coefficient|
          ↑                  ↑
      Fit data     Regularization penalty
(n = number of samples; this is the scaling scikit-learn's Lasso uses)
```
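
The L1 penalty produces exact zeros through an operation called soft-thresholding. For a single standardized feature, the Lasso solution is the least-squares slope pulled toward zero by α and clipped at zero. The sketch below checks this against scikit-learn's `Lasso`, whose squared-error term is scaled by 1/(2n). The helper `soft_threshold` and the toy data are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

def soft_threshold(rho, alpha):
    """Shrink rho toward 0 by alpha; return 0 when |rho| <= alpha."""
    return np.sign(rho) * max(abs(rho) - alpha, 0.0)

rng = np.random.default_rng(0)
x = rng.normal(size=200)
x = (x - x.mean()) / x.std()           # standardized: mean 0, variance 1
y = 1.5 * x + rng.normal(size=200)

# Least-squares slope for a standardized feature
rho = x @ (y - y.mean()) / len(x)

for alpha in [0.1, 0.5, 5.0]:
    lasso = Lasso(alpha=alpha).fit(x.reshape(-1, 1), y)
    print(f"alpha={alpha}: soft-threshold={soft_threshold(rho, alpha):.4f}, "
          f"Lasso coef={lasso.coef_[0]:.4f}")
```

Once α exceeds the size of the unpenalized slope, the coefficient is exactly zero. A squared penalty never does this, because it only rescales the slope.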

### Ridge vs Lasso

**Ridge (L2)**:
- Penalty: coefficient²
- Effect: Shrinks coefficients smoothly
- Result: All coefficients remain non-zero (just smaller)

**Lasso (L1)**:
- Penalty: |coefficient|
- Effect: Can set coefficients to exactly 0
- Result: Automatic feature selection!

### When to Use Lasso

✓ When you suspect many features are irrelevant  
✓ Want automatic feature selection  
✓ Need interpretable model (fewer features)  
✓ Want sparse models (many zero coefficients)
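
A quick way to see the sparsity difference is to count coefficients that are exactly zero. This sketch assumes a toy problem from `make_regression` in which only 5 of 20 features drive the target; the α values are illustrative, not tuned.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

# Assumed toy setup: 20 features, only 5 of which influence the target
X, y, true_coef = make_regression(
    n_samples=200, n_features=20, n_informative=5,
    noise=10.0, coef=True, random_state=0,
)
X = StandardScaler().fit_transform(X)

ridge = Ridge(alpha=2.0).fit(X, y)
lasso = Lasso(alpha=2.0).fit(X, y)

print("Truly informative features:", int(np.sum(true_coef != 0)))
print("Ridge exact zeros:", int(np.sum(ridge.coef_ == 0)))
print("Lasso exact zeros:", int(np.sum(lasso.coef_ == 0)))
print("Feature indices kept by Lasso:", np.flatnonzero(lasso.coef_))
```

Comparing against `== 0` rather than a small threshold makes the distinction visible: Ridge coefficients may be tiny, but Lasso coefficients can be zero outright.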

In [None]:
# Load synthetic regression data (many features)
from sklearn.linear_model import Lasso

synthetic_df = pd.read_csv('data/sample/synthetic_regression.csv')

X_synthetic = synthetic_df.drop('target', axis=1)
y_synthetic = synthetic_df['target']

print(f"Dataset shape: {X_synthetic.shape}")
print(f"Number of features: {X_synthetic.shape[1]}")
print("\nMany features - some may be irrelevant!")

In [None]:
# Split and scale
X_train_syn, X_test_syn, y_train_syn, y_test_syn = train_test_split(
    X_synthetic, y_synthetic, test_size=0.3, random_state=42
)

scaler_syn = StandardScaler()
X_train_syn_scaled = scaler_syn.fit_transform(X_train_syn)
X_test_syn_scaled = scaler_syn.transform(X_test_syn)

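# Train three models
# Note: alpha=1 is not the same penalty strength for Ridge and Lasso in scikit-learn
# (Lasso divides the squared-error term by 2n, Ridge does not), so tune each separately
lr_syn = LinearRegression()
ridge_syn = Ridge(alpha=1)
lasso_syn = Lasso(alpha=1)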

lr_syn.fit(X_train_syn_scaled, y_train_syn)
ridge_syn.fit(X_train_syn_scaled, y_train_syn)
lasso_syn.fit(X_train_syn_scaled, y_train_syn)

# Count non-zero coefficients
lr_nonzero = np.sum(np.abs(lr_syn.coef_) > 0.001)
ridge_nonzero = np.sum(np.abs(ridge_syn.coef_) > 0.001)
lasso_nonzero = np.sum(np.abs(lasso_syn.coef_) > 0.001)

print("Comparison of Three Models:\n")
print(f"Linear Regression:")
print(f"  Test R²: {lr_syn.score(X_test_syn_scaled, y_test_syn):.3f}")
print(f"  Non-zero coefficients: {lr_nonzero}/{len(lr_syn.coef_)}\n")

print(f"Ridge (L2):")
print(f"  Test R²: {ridge_syn.score(X_test_syn_scaled, y_test_syn):.3f}")
print(f"  Non-zero coefficients: {ridge_nonzero}/{len(ridge_syn.coef_)}\n")

print(f"Lasso (L1):")
print(f"  Test R²: {lasso_syn.score(X_test_syn_scaled, y_test_syn):.3f}")
print(f"  Non-zero coefficients: {lasso_nonzero}/{len(lasso_syn.coef_)}")

print(f"\nLasso selected only {lasso_nonzero} features (automatic feature selection)!")

In [None]:
# Visualize coefficient values
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

models = [lr_syn, ridge_syn, lasso_syn]
titles = ['Linear Regression', 'Ridge (L2)', 'Lasso (L1)']

for idx, (model, title) in enumerate(zip(models, titles)):
    coef_vals = model.coef_
    axes[idx].bar(range(len(coef_vals)), coef_vals, alpha=0.7)
    axes[idx].set_xlabel('Feature Index', fontsize=11)
    axes[idx].set_ylabel('Coefficient Value', fontsize=11)
    axes[idx].set_title(title, fontsize=12, fontweight='bold')
    axes[idx].axhline(y=0, color='black', linewidth=0.8)
    axes[idx].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

print("Notice: Lasso sets many coefficients to exactly 0 (feature selection)!")
print("Ridge shrinks coefficients but keeps them all non-zero.")

In [None]:
# Demonstrate Lasso feature selection
# Train Lasso with different alphas
alphas_lasso = [0.01, 0.1, 1, 10, 100]
n_features_selected = []

for alpha in alphas_lasso:
    lasso = Lasso(alpha=alpha, max_iter=10000)
    lasso.fit(X_train_syn_scaled, y_train_syn)
    n_selected = np.sum(np.abs(lasso.coef_) > 0.001)
    n_features_selected.append(n_selected)
    
print("Lasso Feature Selection at Different Alpha Values:\n")
for alpha, n_features in zip(alphas_lasso, n_features_selected):
    print(f"α = {alpha:6.2f}: {n_features:3d} features selected (out of {X_synthetic.shape[1]})")

print("\nPattern: Higher α → More aggressive selection → Fewer features")

In [None]:
# Visualize feature selection path
plt.figure(figsize=(10, 6))
plt.semilogx(alphas_lasso, n_features_selected, 'o-', linewidth=2, markersize=10)
plt.xlabel('Alpha (Regularization Strength)', fontsize=12)
plt.ylabel('Number of Selected Features', fontsize=12)
plt.title('Lasso Feature Selection\n(Higher α = Fewer features)', 
         fontsize=13, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

## 4. Elastic Net (Combination of L1 and L2)

### Best of Both Worlds

**Elastic Net** = Ridge + Lasso

### Mathematical Formula

```
Loss = SSE / (2n) + α × (λ × Σ|coef| + ½ × (1-λ) × Σ(coef²))
          ↑               ↑                    ↑
      Fit data        L1 (Lasso)           L2 (Ridge)
```

**Two hyperparameters**:
- **α**: Overall regularization strength
- **l1_ratio** (λ): Balance between L1 and L2
  - λ = 0: Pure Ridge
  - λ = 1: Pure Lasso
  - λ = 0.5: Equal mix
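
To show how `alpha` and `l1_ratio` enter the loss, the sketch below writes out the objective in the form documented for scikit-learn's `ElasticNet` (squared error divided by 2n, and a factor of one half on the L2 term) and evaluates it at the fitted coefficients. The function `elastic_net_objective` and the toy data are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso
from sklearn.preprocessing import StandardScaler

def elastic_net_objective(X, y, coef, intercept, alpha, l1_ratio):
    """Elastic Net loss: scaled squared error plus weighted L1 and L2 penalties."""
    residual = y - (X @ coef + intercept)
    fit_term = residual @ residual / (2 * len(y))
    l1_term = alpha * l1_ratio * np.sum(np.abs(coef))
    l2_term = 0.5 * alpha * (1 - l1_ratio) * np.sum(coef ** 2)
    return fit_term + l1_term + l2_term

X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)
X = StandardScaler().fit_transform(X)
alpha, l1_ratio = 1.0, 0.5

enet = ElasticNet(alpha=alpha, l1_ratio=l1_ratio).fit(X, y)
fitted_obj = elastic_net_objective(X, y, enet.coef_, enet.intercept_, alpha, l1_ratio)
# Baseline: all coefficients zero, intercept at the target mean
null_obj = elastic_net_objective(X, y, np.zeros(X.shape[1]), y.mean(), alpha, l1_ratio)
print(f"Objective at fitted coefficients:   {fitted_obj:.2f}")
print(f"Objective at all-zero coefficients: {null_obj:.2f}")

# l1_ratio=1 removes the L2 term, so ElasticNet coincides with Lasso
enet_l1 = ElasticNet(alpha=alpha, l1_ratio=1.0).fit(X, y)
lasso = Lasso(alpha=alpha).fit(X, y)
print("ElasticNet(l1_ratio=1) matches Lasso:", np.allclose(enet_l1.coef_, lasso.coef_))
```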

### When to Use Elastic Net

✓ When features are correlated (Lasso might arbitrarily pick one)  
✓ Want feature selection but more stable than Lasso  
✓ Not sure whether to use Ridge or Lasso  
✓ Default safe choice for regularization

### Advantages

1. More stable than Lasso with correlated features
2. Can select groups of correlated features
3. Better than Ridge for feature selection
4. Flexible - can tune to behave like Ridge or Lasso
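
The stability claim can be illustrated with the extreme case of correlation: an exactly duplicated column. In this sketch (toy data, illustrative α values), Lasso puts the weight on one copy and leaves the other at essentially zero, while Elastic Net splits the weight between the two. Which copy Lasso keeps depends on the solver's update order, so treat that detail as incidental.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(0)
z = rng.normal(size=300)
z = (z - z.mean()) / z.std()
noise_feature = rng.normal(size=300)

# Columns 0 and 1 are exact copies; column 2 is unrelated to the target
X = np.column_stack([z, z, noise_feature])
y = 4.0 * z + rng.normal(scale=0.5, size=300)

lasso = Lasso(alpha=0.5).fit(X, y)
# Tight tolerance so the two duplicated coefficients converge to the same value
enet = ElasticNet(alpha=0.5, l1_ratio=0.5, tol=1e-10, max_iter=100_000).fit(X, y)

print("Lasso coefficients:      ", np.round(lasso.coef_, 3))
print("Elastic Net coefficients:", np.round(enet.coef_, 3))
```

With merely correlated (not identical) features the contrast is less stark, but the tendency is the same: the L2 part of the penalty favors sharing weight across a correlated group.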

In [None]:
# Elastic Net example
from sklearn.linear_model import ElasticNet

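# Test different l1_ratio values
# (l1_ratio=0.0 is an L2-only penalty; because ElasticNet divides the squared-error
# term by 2n, it is not numerically identical to Ridge with the same alpha)
l1_ratios = [0.0, 0.25, 0.5, 0.75, 1.0]
alpha_en = 1.0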

results_elastic = []

for l1_ratio in l1_ratios:
    model = ElasticNet(alpha=alpha_en, l1_ratio=l1_ratio, max_iter=10000)
    model.fit(X_train_syn_scaled, y_train_syn)
    
    test_score = model.score(X_test_syn_scaled, y_test_syn)
    n_selected = np.sum(np.abs(model.coef_) > 0.001)
    
    results_elastic.append({
        'l1_ratio': l1_ratio,
        'Type': 'Pure Ridge' if l1_ratio == 0 else 
                'Pure Lasso' if l1_ratio == 1 else 
                f'Mix ({int(l1_ratio*100)}% L1)',
        'Test R²': test_score,
        'Features Selected': n_selected
    })

results_elastic_df = pd.DataFrame(results_elastic)
print(f"Elastic Net Results (α={alpha_en}):\n")
print(results_elastic_df.to_string(index=False))
print("\nNotice the transition from Ridge (no feature selection) to Lasso (aggressive selection)")

In [None]:
# Visualize Elastic Net behavior
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Test R² vs l1_ratio
axes[0].plot(results_elastic_df['l1_ratio'], results_elastic_df['Test R²'], 
            'o-', linewidth=2, markersize=10)
axes[0].set_xlabel('l1_ratio (0=Ridge, 1=Lasso)', fontsize=12)
axes[0].set_ylabel('Test R²', fontsize=12)
axes[0].set_title('Elastic Net Performance', fontsize=13, fontweight='bold')
axes[0].grid(True, alpha=0.3)

# Plot 2: Features selected vs l1_ratio
axes[1].plot(results_elastic_df['l1_ratio'], results_elastic_df['Features Selected'], 
            's-', linewidth=2, markersize=10, color='orange')
axes[1].set_xlabel('l1_ratio (0=Ridge, 1=Lasso)', fontsize=12)
axes[1].set_ylabel('Number of Features Selected', fontsize=12)
axes[1].set_title('Feature Selection Behavior', fontsize=13, fontweight='bold')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Elastic Net gives you control over the tradeoff!")

## 5. Bias-Variance Tradeoff

### The Fundamental Tradeoff

**Total Error** = Bias² + Variance + Irreducible Error

### Definitions

**Bias**: 
- Error from overly simplistic assumptions
- High bias → Underfitting
- Model consistently wrong in same direction

**Variance**:
- Error from sensitivity to training data fluctuations
- High variance → Overfitting
- Predictions vary wildly with different training sets

**Irreducible Error**:
- Noise in data (can't be reduced)

### The Tradeoff

```
Simple Model          →          Complex Model
High Bias                         High Variance
Low Variance                      Low Bias
Underfitting                      Overfitting
       ↓                                ↓
    Add complexity              Add regularization
```

### Regularization's Role

**When it works well, regularization trades a small increase in bias for a larger decrease in variance**

- Small α: Low bias, high variance (may overfit)
- Large α: High bias, low variance (may underfit)
- Optimal α: Minimizes total error
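
Bias and variance cannot be read off a single fit, but they can be estimated by simulation when the true function is known. The sketch below assumes a known linear ground truth (`true_coef`), refits Ridge on many fresh training sets, and measures how far the average prediction sits from the truth (bias²) and how much predictions scatter across refits (variance). All names and settings are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
true_coef = np.array([3.0, -2.0, 1.5, 0.0, 0.0])   # assumed ground truth
X_eval = rng.normal(size=(50, 5))                    # fixed evaluation points
f_eval = X_eval @ true_coef                          # noise-free targets

def bias_variance(alpha, n_repeats=200, n_train=30, noise_sd=2.0):
    """Refit Ridge on fresh training sets and summarize its predictions at X_eval."""
    preds = np.empty((n_repeats, len(X_eval)))
    for r in range(n_repeats):
        X_tr = rng.normal(size=(n_train, 5))
        y_tr = X_tr @ true_coef + rng.normal(scale=noise_sd, size=n_train)
        # No intercept: the assumed ground truth passes through the origin
        model = Ridge(alpha=alpha, fit_intercept=False).fit(X_tr, y_tr)
        preds[r] = model.predict(X_eval)
    bias_sq = np.mean((preds.mean(axis=0) - f_eval) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias_sq, variance

for alpha in [0.01, 10.0, 1000.0]:
    b2, var = bias_variance(alpha)
    print(f"alpha={alpha:>7}: bias² = {b2:.3f}, variance = {var:.3f}, sum = {b2 + var:.3f}")
```

The `sum` column is the reducible part of the expected error; the irreducible noise (`noise_sd ** 2`) would be added on top for every α.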

In [None]:
# Demonstrate bias-variance tradeoff
from sklearn.model_selection import cross_val_score

# Test range of alpha values
alphas_bv = np.logspace(-3, 3, 20)
train_errors = []
test_errors = []

for alpha in alphas_bv:
    model = Ridge(alpha=alpha)
    
    # Training error
    model.fit(X_train_scaled, y_train)
    train_pred = model.predict(X_train_scaled)
    train_errors.append(mean_squared_error(y_train, train_pred))
    
    # Test error (approximates generalization error)
    test_pred = model.predict(X_test_scaled)
    test_errors.append(mean_squared_error(y_test, test_pred))

# Find optimal alpha
optimal_idx = np.argmin(test_errors)
optimal_alpha = alphas_bv[optimal_idx]

print(f"Optimal alpha: {optimal_alpha:.3f}")
print(f"Minimum test error: {test_errors[optimal_idx]:,.0f}")

In [None]:
# Visualize bias-variance tradeoff
plt.figure(figsize=(10, 6))
plt.semilogx(alphas_bv, train_errors, 'b-', linewidth=2, label='Training Error')
plt.semilogx(alphas_bv, test_errors, 'r-', linewidth=2, label='Test Error')
plt.scatter([optimal_alpha], [test_errors[optimal_idx]], s=300, c='green', 
           marker='*', label=f'Optimal α={optimal_alpha:.3f}', zorder=5)

# Annotate regions
plt.text(0.001, max(test_errors)*0.95, 'High Variance\n(Overfitting)', 
        fontsize=10, ha='left', bbox=dict(boxstyle='round', facecolor='yellow', alpha=0.3))
plt.text(100, max(test_errors)*0.95, 'High Bias\n(Underfitting)', 
        fontsize=10, ha='right', bbox=dict(boxstyle='round', facecolor='yellow', alpha=0.3))

plt.xlabel('Alpha (Regularization Strength)', fontsize=12)
plt.ylabel('Mean Squared Error', fontsize=12)
plt.title('Bias-Variance Tradeoff in Ridge Regression', fontsize=13, fontweight='bold')
plt.legend(fontsize=11, loc='upper center')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\nObservations:")
print("- Left (low α): Weak regularization; a large train/test gap here would signal high variance")
print("- Right (high α): Strong regularization; train and test errors rise together (high bias)")
print(f"- Optimal α={optimal_alpha:.3f}: Best balance!")

## Exercises

### Exercise 1: Ridge Regularization Path

Visualize how Ridge coefficients change with alpha:
1. Use California housing data
2. Train Ridge models with alphas from 0.001 to 1000 (log scale, 50 points)
3. For each alpha, store all coefficients
4. Create a plot showing coefficient values vs alpha
5. Each feature should be a separate line
6. Which features shrink fastest?

In [None]:
# Your code here


### Exercise 2: Lasso Feature Selection

Use Lasso for feature selection on synthetic data:
1. Train Lasso with alpha=1.0
2. Identify which features have non-zero coefficients
3. Retrain a simple LinearRegression using ONLY selected features
4. Compare performance:
   - Full model (all features)
   - Lasso (all features with regularization)
   - Selected model (only selected features)
5. Which approach works best?

In [None]:
# Your code here


### Exercise 3: Elastic Net Grid Search

Find optimal hyperparameters for Elastic Net:
1. Use GridSearchCV or RandomizedSearchCV
2. Search over:
   - alpha: [0.01, 0.1, 1, 10, 100]
   - l1_ratio: [0, 0.25, 0.5, 0.75, 1.0]
3. Use 5-fold cross-validation
4. Print best parameters and score
5. Create heatmap of alpha vs l1_ratio showing CV scores

Use California housing dataset.

In [None]:
# Your code here


### Exercise 4: Regularization Comparison

Comprehensive comparison of all regularization methods:
1. Load synthetic_regression dataset
2. Split into train/test (70/30)
3. Scale features
4. Train and evaluate:
   - Linear Regression (no regularization)
   - Ridge (tune alpha with cross-validation)
   - Lasso (tune alpha with cross-validation)
   - Elastic Net (tune both alpha and l1_ratio)
5. Create comparison table with:
   - Train R², Test R²
   - Number of features selected
   - Training time
6. Which model would you choose and why?

In [None]:
# Your code here


## Summary

### Key Concepts

1. **Overfitting vs Underfitting**:
   - **Overfitting**: Model too complex, learns noise, poor generalization
   - **Underfitting**: Model too simple, misses patterns
   - **Just Right**: Captures true patterns, generalizes well

2. **Ridge Regression (L2)**:
   - Penalty: Sum of squared coefficients
   - Effect: Shrinks coefficients toward 0 (but not to exactly 0)
   - Use when: All features potentially relevant
   - Hyperparameter: alpha (higher = stronger regularization)

3. **Lasso Regression (L1)**:
   - Penalty: Sum of absolute coefficients
   - Effect: Can set coefficients to exactly 0
   - Use when: Want automatic feature selection
   - Result: Sparse models (many zero coefficients)

4. **Elastic Net**:
   - Combination of L1 and L2 penalties
   - Two hyperparameters: alpha and l1_ratio
   - More stable than Lasso with correlated features
   - Default safe choice for regularization

5. **Bias-Variance Tradeoff**:
   - Total Error = Bias² + Variance + Noise
   - Simple models: High bias, low variance
   - Complex models: Low bias, high variance
   - Regularization: Trades slight bias increase for large variance decrease

6. **Best Practices**:
   - **Always scale features** before regularization
   - Use cross-validation to tune alpha
   - Try multiple regularization methods
   - Start with Elastic Net (flexible)
   - Monitor train/test gap to detect overfitting

### Choosing Regularization Method

| Scenario | Recommended Method |
|----------|-------------------|
| All features relevant | Ridge (L2) |
| Many irrelevant features | Lasso (L1) |
| Correlated features | Ridge or Elastic Net |
| Want feature selection | Lasso or Elastic Net |
| Not sure | Elastic Net (safe default) |

### What's Next?

In **Module 09: Support Vector Machines**, you'll learn:
- Maximum margin classification
- Linear and non-linear SVM
- The kernel trick (RBF, polynomial)
- Hyperparameters: C and gamma
- SVC vs SVR (classification vs regression)
- Decision boundary visualization

### Additional Resources

- [Ridge and Lasso - StatQuest](https://www.youtube.com/watch?v=Q81RR3yKn30)
- [Regularization Explained - Andrew Ng](https://www.coursera.org/lecture/machine-learning/regularization-and-bias-variance-4VDlf)
- [scikit-learn Regularization Guide](https://scikit-learn.org/stable/modules/linear_model.html#ridge-regression-and-classification)
- [Bias-Variance Tradeoff](https://www.youtube.com/watch?v=EuBBz3bI-aA)