# L2 Regularization (Ridge) - Complete Guide

## 🎯 What This Notebook Covers

In this comprehensive notebook, we explore **L2 Regularization** (also known as Ridge Regression or Weight Decay):

1. ✅ **The Overfitting Problem** - Why we need regularization
2. ✅ **Mathematical Foundation** - Complete derivations with intuition
3. ✅ **Multiple Intuitive Examples** - Weight shrinkage, geometry, analogies
4. ✅ **Implementation from Scratch** - Pure NumPy implementation
5. ✅ **Comprehensive Visualizations** - Visual learning at every step
6. ✅ **L2 vs L1 Comparison** - When to use which
7. ✅ **Practical Guidelines** - Hyperparameter tuning and best practices

### Why L2 Regularization?

**Key Property: WEIGHT SHRINKAGE** 🎯

L2 regularization makes all weights **small but non-zero**, preventing any single feature from dominating. This makes models:
- More stable (less sensitive to individual features)
- Better at generalization (smoother decision boundaries)
- Numerically stable (the penalty improves the conditioning of the matrices being solved)
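
The conditioning claim in the last bullet can be checked numerically. The snippet below is a minimal sketch, assuming a linear model where the matrix of interest is the Gram matrix `X.T @ X`; the nearly collinear toy features and the value `lam = 1.0` are illustrative assumptions, not values used elsewhere in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two nearly collinear features make X^T X close to singular
x1 = rng.normal(size=100)
x2 = x1 + 1e-4 * rng.normal(size=100)
X = np.column_stack([x1, x2])

gram = X.T @ X
lam = 1.0  # assumed regularization strength, for illustration only

# Adding lam * I lifts every eigenvalue by lam, which lowers the condition number
cond_plain = np.linalg.cond(gram)
cond_ridge = np.linalg.cond(gram + lam * np.eye(2))

print(f"cond(X^T X)         = {cond_plain:.3e}")
print(f"cond(X^T X + lam*I) = {cond_ridge:.3e}")
```

The second condition number should come out far smaller, which is why the regularized system is easier to solve accurately.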

Let's dive deep into the mathematics and intuition! 🚀

---

## 1. Setup and Imports

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.gridspec import GridSpec
from mpl_toolkits.mplot3d import Axes3D
import warnings
warnings.filterwarnings('ignore')

# Set style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
np.random.seed(42)

print("✅ Libraries imported successfully!")
print(f"NumPy version: {np.__version__}")

## 2. The Overfitting Problem

Before diving into L2 regularization, let's understand **why** we need regularization.

### The Problem: Large Weights Lead to Overfitting

When a model overfits:
- ✅ **Training accuracy**: Very high
- ❌ **Test accuracy**: Poor
- 🔴 **Weights become very large** to fit training noise

**L2 regularization** penalizes large weights, keeping them small and controlled.
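
To see the effect on actual fitted weights, here is a minimal sketch, assuming closed-form ridge regression on degree-9 polynomial features with inputs scaled to [-1, 1]. The helper `ridge_fit`, the toy data, and `lam=1.0` are hypothetical choices for illustration. For simplicity the intercept is penalized as well.

```python
import numpy as np

def ridge_fit(Phi, y, lam):
    """Closed-form ridge solution: solve (Phi^T Phi + lam * I) w = Phi^T y."""
    n_features = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(n_features), Phi.T @ y)

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 20)
y = 2 * x + 1 + rng.normal(scale=0.5, size=x.size)  # noisy line

# Polynomial features: columns are 1, x, x^2, ..., x^9
Phi = np.vander(x, 10, increasing=True)

w_unreg = np.linalg.lstsq(Phi, y, rcond=None)[0]  # ordinary least squares
w_ridge = ridge_fit(Phi, y, lam=1.0)              # L2-regularized fit

print(f"||w|| without L2: {np.linalg.norm(w_unreg):.2f}")
print(f"||w|| with L2:    {np.linalg.norm(w_ridge):.2f}")
```

The regularized weight vector always has the smaller norm here, because ridge shrinks every component of the least-squares solution.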

### Visualization 1: Weight Magnitude and Overfitting

In [None]:
# Generate polynomial regression example
np.random.seed(42)
X_demo = np.linspace(0, 10, 20)
y_demo = 2 * X_demo + 1 + np.random.randn(20) * 2

# Fit high-degree polynomial WITHOUT regularization
degree = 15
coef_no_reg = np.polyfit(X_demo, y_demo, degree)

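# Crude stand-in for the L2 effect: uniformly shrink the coefficients.
# NOTE: this is not a real ridge fit; a fitted ridge model shrinks each coefficient by a different amount.
coef_l2 = coef_no_reg * 0.1  # Shrink coefficients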

# Create visualization
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# Plot 1: Coefficients without regularization
axes[0, 0].bar(range(len(coef_no_reg)), coef_no_reg, color='red', alpha=0.7, edgecolor='black', linewidth=2)
axes[0, 0].axhline(y=0, color='k', linestyle='-', linewidth=1)
axes[0, 0].set_xlabel('Coefficient Index (0 = highest degree)', fontsize=12, fontweight='bold')  # np.polyfit returns highest power first
axes[0, 0].set_ylabel('Coefficient Value', fontsize=12, fontweight='bold')
axes[0, 0].set_title('❌ Without Regularization\nVery Large Coefficients!', fontsize=14, fontweight='bold', color='red')
axes[0, 0].grid(True, alpha=0.3, axis='y')
axes[0, 0].text(7, max(coef_no_reg)*0.7, f'Max: {max(abs(coef_no_reg)):.1f}\nHuge weights!', 
                fontsize=11, bbox=dict(boxstyle='round', facecolor='lightcoral', alpha=0.7))

# Plot 2: Coefficients with L2 regularization
axes[0, 1].bar(range(len(coef_l2)), coef_l2, color='green', alpha=0.7, edgecolor='black', linewidth=2)
axes[0, 1].axhline(y=0, color='k', linestyle='-', linewidth=1)
axes[0, 1].set_xlabel('Coefficient Index (0 = highest degree)', fontsize=12, fontweight='bold')  # np.polyfit returns highest power first
axes[0, 1].set_ylabel('Coefficient Value', fontsize=12, fontweight='bold')
axes[0, 1].set_title('✅ With L2 Regularization\nSmall, Controlled Coefficients', fontsize=14, fontweight='bold', color='green')
axes[0, 1].grid(True, alpha=0.3, axis='y')
axes[0, 1].text(7, max(coef_l2)*0.7, f'Max: {max(abs(coef_l2)):.1f}\nSmall weights!', 
                fontsize=11, bbox=dict(boxstyle='round', facecolor='lightgreen', alpha=0.7))

# Plot 3: Fitted curves
X_plot = np.linspace(0, 10, 200)
y_no_reg = np.polyval(coef_no_reg, X_plot)
y_l2 = np.polyval(coef_l2, X_plot)

axes[1, 0].scatter(X_demo, y_demo, c='black', s=100, alpha=0.6, edgecolors='black', linewidth=2, label='Training Data')
axes[1, 0].plot(X_plot, y_no_reg, 'r-', linewidth=3, label='Without L2 (Overfit)')
axes[1, 0].set_xlabel('X', fontsize=12, fontweight='bold')
axes[1, 0].set_ylabel('y', fontsize=12, fontweight='bold')
axes[1, 0].set_title('❌ Without L2: Wild Oscillations', fontsize=14, fontweight='bold', color='red')
axes[1, 0].legend(fontsize=11)
axes[1, 0].grid(True, alpha=0.3)
axes[1, 0].set_ylim(-10, 35)

axes[1, 1].scatter(X_demo, y_demo, c='black', s=100, alpha=0.6, edgecolors='black', linewidth=2, label='Training Data')
axes[1, 1].plot(X_plot, y_l2, 'g-', linewidth=3, label='With L2 (Smooth)')
axes[1, 1].set_xlabel('X', fontsize=12, fontweight='bold')
axes[1, 1].set_ylabel('y', fontsize=12, fontweight='bold')
axes[1, 1].set_title('✅ With L2: Smooth, Generalizable', fontsize=14, fontweight='bold', color='green')
axes[1, 1].legend(fontsize=11)
axes[1, 1].grid(True, alpha=0.3)
axes[1, 1].set_ylim(-10, 35)

plt.suptitle('Visualization 1: L2 Regularization Controls Weight Magnitude', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

print("\n📊 Key Observations:")
print("  • Without L2: Large coefficients → Wild oscillations → Overfitting")
print("  • With L2: Small coefficients → Smooth curve → Better generalization")
print(f"  • Weight reduction: {max(abs(coef_no_reg))/max(abs(coef_l2)):.1f}x smaller")

## 3. L2 Regularization: Mathematical Foundation

Let's build the mathematics from the ground up with complete derivations.

### 3.1 The Loss Function with L2 Penalty

**Original Loss Function** (e.g., Mean Squared Error):

$$
L_{\text{original}} = \frac{1}{m} \sum_{i=1}^{m} (y^{(i)} - \hat{y}^{(i)})^2
$$

**L2 Regularized Loss Function**:

$$
\boxed{L_{\text{L2}} = L_{\text{original}} + \frac{\lambda}{2m} \sum_{j=1}^{n} W_j^2}
$$

Where:
- $L_{\text{original}}$ = Original loss (MSE, cross-entropy, etc.)
- $\lambda$ = Regularization strength (hyperparameter)
- $m$ = Number of training examples
- $n$ = Number of weights
- $W_j^2$ = Square of weight $j$
- Factor of $\frac{1}{2}$ simplifies derivative

### 3.2 Why Squared Weights?

The squared term $W^2$ has special properties:
- It penalizes weights **quadratically** (larger weights penalized more)
- Its derivative is **proportional to W**: $\frac{dW^2}{dW} = 2W$
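- This creates **smooth shrinkage** (all weights get smaller, but none are driven exactly to zero)
- Also called **Weight Decay** because each gradient step multiplies the weights by a factor slightly below 1, so they decay geometrically when the data gradient is absent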
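
A quick numerical check of these properties, as a minimal sketch: the three sample weights and the step size `eps` are arbitrary illustrative choices.

```python
import numpy as np

weights = np.array([0.1, 1.0, 10.0])

# Quadratic penalty: a 10x larger weight costs 100x more
penalty = weights ** 2

# Analytic derivative of W^2 is 2W
analytic_grad = 2 * weights

# Central finite difference as an independent check of the derivative
eps = 1e-6
numeric_grad = ((weights + eps) ** 2 - (weights - eps) ** 2) / (2 * eps)

for w, p, g in zip(weights, penalty, numeric_grad):
    print(f"W = {w:5.1f}  penalty = {p:7.2f}  d(penalty)/dW = {g:6.2f}")
```

The derivative grows in step with the weight, so large weights are pushed back hard while small weights are barely touched.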

### 3.3 Forward Pass: Computing the Loss

**Step 1**: Compute original loss
$$
L_{\text{original}} = \frac{1}{m} \sum_{i=1}^{m} \text{Loss}(y^{(i)}, \hat{y}^{(i)})
$$

**Step 2**: Compute L2 penalty
$$
L_{\text{L2 penalty}} = \frac{\lambda}{2m} \sum_{j=1}^{n} W_j^2
$$

**Step 3**: Add them together
$$
L_{\text{total}} = L_{\text{original}} + L_{\text{L2 penalty}}
$$
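
The three steps above can be sketched in NumPy as follows. This is a minimal sketch assuming MSE as the original loss; the function name `l2_regularized_mse` and the toy numbers are hypothetical.

```python
import numpy as np

def l2_regularized_mse(y_true, y_pred, W, lam):
    """Return (total loss, original MSE, L2 penalty) for one forward pass."""
    m = y_true.shape[0]
    original_loss = np.mean((y_true - y_pred) ** 2)  # Step 1: original loss
    l2_penalty = (lam / (2 * m)) * np.sum(W ** 2)    # Step 2: L2 penalty
    total_loss = original_loss + l2_penalty          # Step 3: add them together
    return total_loss, original_loss, l2_penalty

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.5, 4.0])
W = np.array([3.0, -4.0])  # sum of squares = 25

total, original, penalty = l2_regularized_mse(y_true, y_pred, W, lam=0.8)
# By hand: MSE = (0.25 + 0 + 0.25 + 0) / 4 = 0.125
#          penalty = 0.8 / (2 * 4) * 25 = 2.5
print(total, original, penalty)
```

Note that the penalty depends only on the weights, not on the data, so it is the same for every mini-batch of a given size.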

### 3.4 Backward Pass: Computing Gradients

This is where L2 regularization differs from L1!

**Original Gradient** (without regularization):
$$
\frac{\partial L_{\text{original}}}{\partial W_j} = \text{(computed via backpropagation)}
$$

**L2 Penalty Gradient**:
$$
\frac{\partial}{\partial W_j} \left( \frac{\lambda}{2m} W_j^2 \right) = \frac{\lambda}{2m} \cdot 2W_j = \frac{\lambda}{m} \cdot W_j
$$

**Total Gradient** (with L2 regularization):
$$
\boxed{\frac{\partial L_{\text{L2}}}{\partial W_j} = \frac{\partial L_{\text{original}}}{\partial W_j} + \frac{\lambda}{m} \cdot W_j}
$$

**Parameter Update**:
$$
W_j := W_j - \alpha \left( \frac{\partial L_{\text{original}}}{\partial W_j} + \frac{\lambda}{m} \cdot W_j \right)
$$

**Rearranging**:
$$
W_j := W_j - \alpha \frac{\partial L_{\text{original}}}{\partial W_j} - \alpha \frac{\lambda}{m} W_j
$$

$$
W_j := W_j \left(1 - \alpha \frac{\lambda}{m}\right) - \alpha \frac{\partial L_{\text{original}}}{\partial W_j}
$$

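This shows **weight decay**: each iteration first multiplies the weights by $(1 - \alpha \frac{\lambda}{m})$, which lies strictly between 0 and 1 whenever $0 < \alpha \frac{\lambda}{m} < 1$, and then applies the usual gradient step!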
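
The two forms of the update are algebraically identical, which is easy to confirm numerically. A minimal sketch follows; the function names and the example values for `alpha`, `lam`, and `m` are hypothetical.

```python
import numpy as np

def l2_gradient_update(W, grad_original, alpha, lam, m):
    """One step on the L2-regularized loss: add (lam/m) * W to the gradient."""
    return W - alpha * (grad_original + (lam / m) * W)

def weight_decay_update(W, grad_original, alpha, lam, m):
    """The same step rearranged: shrink the weights, then take the plain step."""
    return W * (1 - alpha * lam / m) - alpha * grad_original

W = np.array([1.0, -2.0])
grad_original = np.array([0.5, 0.5])  # stand-in for the backpropagated gradient
alpha, lam, m = 0.1, 10.0, 100

W_a = l2_gradient_update(W, grad_original, alpha, lam, m)
W_b = weight_decay_update(W, grad_original, alpha, lam, m)
print(W_a, W_b)
```

Both calls return the same weights, so an implementation can use whichever form is more convenient.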

### 3.5 Why Does L2 NOT Create Sparsity?

**Key Insight**: The gradient is **proportional to W** (not constant like L1)

**For any weight** $W$:
$$
W := W \left(1 - \alpha \frac{\lambda}{m}\right) - \alpha \frac{\partial L_{\text{original}}}{\partial W}
$$

- Weight is multiplied by $(1 - \alpha \frac{\lambda}{m})$ each iteration
- Ignoring the data gradient, this is **exponential decay**: $W_t = W_0 \cdot (1 - \alpha \frac{\lambda}{m})^t$
- As $W \to 0$, the penalty gradient $\frac{\lambda}{m} W \to 0$ (gets weaker!)
- Weights get **very small** but generally **never exactly zero**

**Contrast with L1**:
$$
W := W - \alpha \left( \frac{\partial L_{\text{original}}}{\partial W} + \frac{\lambda}{m} \cdot \text{sign}(W) \right)
$$
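- Subtracts a **constant-size** step $\frac{\alpha \lambda}{m}$ toward zero each iteration (for $W \neq 0$)
- Penalty gradient keeps the same magnitude even as $W \to 0$
- Weights can reach **exactly zero** (in practice the update is clipped at zero so it does not overshoot)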
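
The contrast can be simulated directly. This is a minimal sketch, assuming the data gradient is zero so that only the penalty acts; the L1 step is clipped at zero (a soft-threshold style update), which is an assumption added here to avoid overshooting past zero.

```python
import numpy as np

alpha, lam, m = 0.1, 10.0, 100  # illustrative values only
shrink = alpha * lam / m        # 0.01 per step

w_l2 = 0.5
w_l1 = 0.5
for _ in range(100):
    # L2: multiply by a factor below 1 (step size proportional to the weight)
    w_l2 = w_l2 * (1 - shrink)
    # L1: move a fixed distance toward zero, clipping so the sign never flips
    w_l1 = np.sign(w_l1) * max(abs(w_l1) - shrink, 0.0)

print(f"L2 weight after 100 steps: {w_l2:.4f}")
print(f"L1 weight after 100 steps: {w_l1:.4f}")
```

The L2 weight is still positive after 100 steps, while the L1 weight has been driven all the way to zero.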

### Visualization 2: L2 Penalty Function

In [None]:
# Create L2 penalty visualization
x = np.linspace(-3, 3, 1000)
l2_penalty = x**2
l2_derivative = 2 * x
l1_penalty = np.abs(x)
l1_derivative = np.sign(x)

fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Plot 1: L2 penalty function
axes[0].plot(x, l2_penalty, linewidth=4, color='blue', label='W²')
axes[0].axhline(y=0, color='k', linestyle='-', linewidth=0.5, alpha=0.3)
axes[0].axvline(x=0, color='k', linestyle='-', linewidth=0.5, alpha=0.3)
axes[0].set_xlabel('W', fontsize=12, fontweight='bold')
axes[0].set_ylabel('W²', fontsize=12, fontweight='bold')
axes[0].set_title('Squared Penalty Function\n(L2 Penalty)', fontsize=13, fontweight='bold')
axes[0].legend(fontsize=11)
axes[0].grid(True, alpha=0.3)
axes[0].text(0, 7, 'Parabola\n(Smooth everywhere)', ha='center', fontsize=10,
             bbox=dict(boxstyle='round', facecolor='lightblue', alpha=0.7))
axes[0].set_ylim(0, 9)

# Plot 2: L2 derivative (proportional to W)
axes[1].plot(x, l2_derivative, linewidth=4, color='blue', label='2W (proportional)')
axes[1].axhline(y=0, color='k', linestyle='-', linewidth=0.5, alpha=0.3)
axes[1].axvline(x=0, color='k', linestyle='-', linewidth=0.5, alpha=0.3)
axes[1].set_xlabel('W', fontsize=12, fontweight='bold')
axes[1].set_ylabel('dW²/dW = 2W', fontsize=12, fontweight='bold')
axes[1].set_title('L2 Derivative\n(Proportional to W)', fontsize=13, fontweight='bold')
axes[1].legend(fontsize=11)
axes[1].grid(True, alpha=0.3)
axes[1].set_ylim(-7, 7)
axes[1].text(2, 5, 'Larger W →\nLarger penalty', fontsize=10, color='blue', fontweight='bold')
axes[1].text(-2, -5, 'As W→0,\npenalty→0', fontsize=10, color='red', fontweight='bold')

# Plot 3: L1 vs L2 comparison
axes[2].plot(x, l2_penalty, linewidth=4, color='blue', label='L2: W² (smooth)', alpha=0.7)
axes[2].plot(x, l1_penalty, linewidth=4, color='orange', label='L1: |W| (sharp)', linestyle='--', alpha=0.7)
axes[2].axhline(y=0, color='k', linestyle='-', linewidth=0.5, alpha=0.3)
axes[2].axvline(x=0, color='k', linestyle='-', linewidth=0.5, alpha=0.3)
axes[2].set_xlabel('W', fontsize=12, fontweight='bold')
axes[2].set_ylabel('Penalty', fontsize=12, fontweight='bold')
axes[2].set_title('L1 vs L2 Penalty Functions', fontsize=13, fontweight='bold')
axes[2].legend(fontsize=11)
axes[2].grid(True, alpha=0.3)
axes[2].set_ylim(0, 9)
axes[2].text(0, 7, 'L2: Smooth (differentiable)\nL1: Sharp corner at 0', ha='center', fontsize=10,
             bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.7))

plt.suptitle('Visualization 2: L2 Penalty Function and Derivative', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

print("\n🎯 Key Mathematical Insights:")
print("  • L2 uses squared penalty: W²")
print("  • Derivative of W² is 2W (proportional to W)")
print("  • Proportional penalty → weights shrink but never reach zero")
print("  • L1 uses |W|, derivative is sign(W) (constant)")
print("  • Constant penalty → weights can reach exactly zero")

### Visualization 3: Weight Decay Over Time

In [None]:
# Simulate weight decay over iterations
iterations = np.arange(0, 100)
alpha = 0.1  # Learning rate
lambda_val = 60.0  # Regularization strength (deliberately large so the decay is visible within 100 iterations)
m = 100  # Number of samples

# Initial weights
W0_large = 5.0
W0_medium = 2.0
W0_small = 0.5

# Decay factor
decay_factor = 1 - alpha * lambda_val / m

# Weight evolution (assuming no gradient from loss)
W_large = W0_large * (decay_factor ** iterations)
W_medium = W0_medium * (decay_factor ** iterations)
W_small = W0_small * (decay_factor ** iterations)

# L1 for comparison (constant subtraction)
l1_constant = alpha * lambda_val / m
W_l1_large = np.maximum(0, W0_large - l1_constant * iterations)

fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Plot 1: L2 weight decay
axes[0].plot(iterations, W_large, linewidth=3, label=f'W₀ = {W0_large}', color='red')
axes[0].plot(iterations, W_medium, linewidth=3, label=f'W₀ = {W0_medium}', color='orange')
axes[0].plot(iterations, W_small, linewidth=3, label=f'W₀ = {W0_small}', color='blue')
axes[0].axhline(y=0, color='k', linestyle='--', linewidth=2, alpha=0.5)
axes[0].set_xlabel('Iteration', fontsize=12, fontweight='bold')
axes[0].set_ylabel('Weight Value', fontsize=12, fontweight='bold')
axes[0].set_title('L2 Weight Decay\nExponential Decay (Never Reaches Zero)', fontsize=14, fontweight='bold')
axes[0].legend(fontsize=11)
axes[0].grid(True, alpha=0.3)
axes[0].text(50, 3, f'Decay factor:\n(1 - αλ/m) = {decay_factor:.4f}',
             fontsize=11, bbox=dict(boxstyle='round', facecolor='lightblue', alpha=0.7))
axes[0].text(70, 0.5, 'Approaches zero\nbut never reaches it', 
             fontsize=10, bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.7))

# Plot 2: L2 vs L1 comparison
axes[1].plot(iterations, W_large, linewidth=3, label='L2: Exponential decay', color='blue')
axes[1].plot(iterations, W_l1_large, linewidth=3, label='L1: Linear decay', color='orange', linestyle='--')
axes[1].axhline(y=0, color='k', linestyle='--', linewidth=2, alpha=0.5)
axes[1].set_xlabel('Iteration', fontsize=12, fontweight='bold')
axes[1].set_ylabel('Weight Value', fontsize=12, fontweight='bold')
axes[1].set_title('L2 vs L1 Weight Decay\n(Starting from W₀ = 5.0)', fontsize=14, fontweight='bold')
axes[1].legend(fontsize=11)
axes[1].grid(True, alpha=0.3)
axes[1].text(30, 3, 'L2: Smooth decay\n(proportional to W)', 
             fontsize=10, color='blue', fontweight='bold',
             bbox=dict(boxstyle='round', facecolor='lightblue', alpha=0.7))
axes[1].text(30, 1, 'L1: Linear decay\n(constant rate)', 
             fontsize=10, color='orange', fontweight='bold',
             bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.7))

plt.suptitle('Visualization 3: Weight Decay Dynamics', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

print("\n📊 Weight Decay Analysis:")
print(f"  • Decay factor per iteration: {decay_factor:.4f}")
print(f"  • After 100 iterations:")
print(f"    - Large weight (5.0) → {W_large[-1]:.4f}")
print(f"    - Medium weight (2.0) → {W_medium[-1]:.4f}")
print(f"    - Small weight (0.5) → {W_small[-1]:.4f}")
print("\n  • L2: Exponential decay (never reaches zero)")
print("  • L1: Linear decay (reaches zero in finite time)")

### Visualization 4: Gradient Flow with L2

In [None]:
# Create a flowchart-style visualization
fig, ax = plt.subplots(figsize=(14, 10))
ax.axis('off')

# Define box positions
boxes = [
    # Forward pass
    {'xy': (0.5, 0.9), 'text': 'Forward Pass\nCompute predictions', 'color': 'lightblue'},
    {'xy': (0.5, 0.75), 'text': 'Compute Original Loss\nL_original', 'color': 'lightblue'},
    {'xy': (0.5, 0.6), 'text': 'Compute L2 Penalty\n(λ/2m) * ΣW²', 'color': 'lightcoral'},
    {'xy': (0.5, 0.45), 'text': 'Total Loss\nL_total = L_original + L2_penalty', 'color': 'lightgreen'},
    
    # Backward pass
    {'xy': (0.5, 0.3), 'text': 'Backward Pass\nCompute gradients', 'color': 'lightyellow'},
    {'xy': (0.25, 0.15), 'text': 'Original Gradient\n∂L_original/∂W', 'color': 'lightblue'},
    {'xy': (0.75, 0.15), 'text': 'L2 Gradient\n(λ/m) * W', 'color': 'lightcoral'},
    {'xy': (0.5, 0.0), 'text': 'Total Gradient\n∂L_total/∂W = ∂L_original/∂W + (λ/m)*W', 'color': 'lightgreen'},
]

# Draw boxes
for box in boxes:
    ax.text(box['xy'][0], box['xy'][1], box['text'], 
            ha='center', va='center', fontsize=11, fontweight='bold',
            bbox=dict(boxstyle='round,pad=0.8', facecolor=box['color'], 
                     edgecolor='black', linewidth=2))

# Draw arrows
arrows = [
    ((0.5, 0.87), (0.5, 0.78)),
    ((0.5, 0.72), (0.5, 0.63)),
    ((0.5, 0.57), (0.5, 0.48)),
    ((0.5, 0.42), (0.5, 0.33)),
    ((0.5, 0.27), (0.25, 0.18)),
    ((0.5, 0.27), (0.75, 0.18)),
    ((0.25, 0.12), (0.5, 0.03)),
    ((0.75, 0.12), (0.5, 0.03)),
]

for start, end in arrows:
    ax.annotate('', xy=end, xytext=start,
                arrowprops=dict(arrowstyle='->', lw=3, color='black'))

# Add title
ax.text(0.5, 0.98, 'Visualization 4: Gradient Flow with L2 Regularization', 
        ha='center', va='top', fontsize=16, fontweight='bold')

# Add legend
ax.text(0.05, 0.5, 'Legend:\n• Blue: Standard operations\n• Red: L2-specific\n• Green: Combined result\n\nKey Difference from L1:\nL2 gradient = (λ/m)*W\n(proportional to W)',
        ha='left', va='center', fontsize=10,
        bbox=dict(boxstyle='round', facecolor='white', edgecolor='black', linewidth=2))

plt.tight_layout()
plt.show()

print("\n📊 Gradient Flow Summary:")
print("  1. Forward: Compute predictions and loss")
print("  2. Add L2 penalty to loss: (λ/2m) * ΣW²")
print("  3. Backward: Compute original gradients")
print("  4. Add L2 gradient: (λ/m) * W (proportional!)")
print("  5. Update: W := W(1 - αλ/m) - α∂L/∂W (weight decay!)")
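
The five steps in the summary above can be put together as a small training loop. This is a minimal sketch, assuming a linear model with MSE loss; the function `train_l2_gd`, the synthetic data, and the hyperparameter values are hypothetical and only illustrate the flow.

```python
import numpy as np

def train_l2_gd(X, y, lam, alpha=0.1, n_iters=500):
    """Gradient descent on MSE + (lam / 2m) * sum(W^2) for a linear model."""
    m, n = X.shape
    W = np.zeros(n)
    for _ in range(n_iters):
        y_hat = X @ W                                # 1. forward pass
        grad_original = (2 / m) * X.T @ (y_hat - y)  # 3. gradient of the MSE
        grad_l2 = (lam / m) * W                      # 4. gradient of the L2 penalty
        W = W - alpha * (grad_original + grad_l2)    # 5. update (weight decay)
    return W

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_W = np.array([2.0, -1.0, 0.5])
y = X @ true_W + rng.normal(scale=0.1, size=200)

W_plain = train_l2_gd(X, y, lam=0.0)
W_ridge = train_l2_gd(X, y, lam=50.0)

print("Without L2:", np.round(W_plain, 3))
print("With L2:   ", np.round(W_ridge, 3))
```

The regularized weights come out smaller in norm than the unregularized ones. Because the MSE here carries no factor of 1/2, the loop converges to the solution of $(X^T X + \frac{\lambda}{2} I) W = X^T y$.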