# Lab A.2: Bias-Variance Decomposition

**Module:** A - Statistical Learning Theory  
**Time:** 1.5 hours  
**Difficulty:** ⭐⭐⭐ (Intermediate-Advanced)

---

## Learning Objectives

By the end of this notebook, you will:
- [ ] Understand the three sources of prediction error (bias, variance, noise)
- [ ] Estimate bias and variance empirically by refitting on many resampled training sets
- [ ] Visualize the classic U-shaped curve of model complexity
- [ ] Apply bias-variance thinking to model selection
- [ ] Connect theory to practical underfitting/overfitting diagnosis

---

## Prerequisites

- Completed: Lab A.1 (VC Dimension)
- Knowledge of: Linear regression, polynomial fitting

---

## Real-World Context

You're a data scientist at a financial firm predicting stock returns. Your model fits training data perfectly (low training error) but fails miserably on new data (high test error). Your manager asks: "Is the model too simple or too complex?"

**Bias-variance decomposition** gives you the mathematical framework to answer this question precisely. It's the foundation for understanding:
- Why adding more features can hurt performance
- Why ensemble methods (Random Forests, Boosting) work
- When to collect more data vs. simplify your model

---

## ELI5: Bias vs Variance

> **Imagine you're playing darts at a dartboard...** 
>
> **High Bias, Low Variance**: You throw 10 darts. They all land in a tight cluster... but the cluster is in the corner, far from the bullseye! You're consistent, but consistently wrong. This is like a model that's too simple - it keeps making the same mistake.
>
> **Low Bias, High Variance**: You throw 10 darts. On average, they center around the bullseye... but they're scattered all over the board! You're unbiased on average, but wildly inconsistent. This is like a model that's too complex - it changes drastically with small changes to training data.
>
> **The Goal**: Low bias AND low variance - darts tightly clustered around the bullseye!
>
> **The Trade-off**: Making the cluster tighter (less variance) often pushes it away from center (more bias), and vice versa. Finding the sweet spot is the art of machine learning!
>
> **In AI terms:**
> - **Bias** = How far off is your model's average prediction from the truth?
> - **Variance** = How much does your model's prediction change across different training sets?
> - **Total Error = Bias² + Variance + Irreducible Noise**

---

## Part 1: Setting Up Our Environment

In [None]:
# Core imports
import numpy as np
import matplotlib.pyplot as plt
from typing import Tuple, List, Callable
import warnings
warnings.filterwarnings('ignore')

# Scikit-learn for regression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# ============================================================
# Key sklearn imports explained:
# ============================================================
# make_pipeline: Chains preprocessing steps with a model
#   Example: make_pipeline(PolynomialFeatures(2), LinearRegression())
#   Creates: PolynomialFeatures -> LinearRegression (auto-connected)
#
# Ridge: Linear regression with L2 regularization (shrinks weights)
#   alpha parameter controls regularization strength
#   Higher alpha = more regularization = simpler model
#
# cross_val_score: Evaluates model using k-fold cross-validation
#   Returns array of scores, one per fold
# ============================================================

# Set nice plotting defaults
plt.style.use('default')
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 12

# Seed for reproducibility
np.random.seed(42)

print("Environment ready for Bias-Variance exploration!")
print(f"NumPy version: {np.__version__}")

---

## Part 2: The Mathematical Decomposition

### The Fundamental Equation

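For an observation $y = f(x) + \epsilon$ with zero-mean noise $\epsilon$ of variance $\sigma^2$, and a model $\hat{f}$ fitted to a randomly drawn training set, the expected squared error at a point $x$ (averaged over training sets and noise) decomposes as: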

$$\mathbb{E}[(y - \hat{f}(x))^2] = \underbrace{(\mathbb{E}[\hat{f}(x)] - f(x))^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Irreducible Noise}}$$

Let's break this down:
- **Bias²**: Systematic error - how wrong is the model on average?
- **Variance**: Random error - how much does the model change with different training data?
- **Noise**: Inherent randomness we can never eliminate
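
One way to build trust in the identity is to check it numerically at a single point. The sketch below is only an illustration under assumed values: a toy target `f0`, a noise level `sigma`, and a deliberately biased estimator that shrinks the mean of `n` noisy observations by a hypothetical factor `shrink`. None of these names come from the lab itself.

```python
import numpy as np

rng = np.random.default_rng(0)

f0 = 1.0          # assumed true value f(x) at one fixed x
sigma = 0.5       # assumed noise standard deviation
n = 10            # observations per training set
shrink = 0.8      # hypothetical shrinkage factor, makes the estimator biased
n_sets = 200_000  # number of simulated training sets

# Each row is one training set of n noisy observations of f0
train = f0 + rng.normal(0.0, sigma, size=(n_sets, n))
f_hat = shrink * train.mean(axis=1)   # one prediction per training set

# Independent new observations y = f0 + eps, one per training set
y_new = f0 + rng.normal(0.0, sigma, size=n_sets)

mse = np.mean((y_new - f_hat) ** 2)   # left-hand side of the identity
bias_sq = (f_hat.mean() - f0) ** 2
variance = f_hat.var()
noise = sigma ** 2

print(f"Monte Carlo MSE      : {mse:.4f}")
print(f"Bias^2 + Var + Noise : {bias_sq + variance + noise:.4f}")
```

The two printed numbers should agree up to Monte Carlo error, which shrinks as `n_sets` grows.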

In [None]:
# Let's create a ground truth function with known noise

def true_function(x: np.ndarray) -> np.ndarray:
    """
    The "true" underlying function we're trying to learn.
    In real life, we never know this!
    """
    return np.sin(2 * x) + 0.5 * np.cos(4 * x)


def generate_data(n_samples: int, noise_std: float = 0.3, 
                  x_min: float = 0, x_max: float = 4,
                  seed: int = None) -> Tuple[np.ndarray, np.ndarray]:
    """
    Generate noisy samples from the true function.
    
    Args:
        n_samples: Number of samples to generate
        noise_std: Standard deviation of Gaussian noise
        x_min, x_max: Range of x values
        seed: Random seed for reproducibility
        
    Returns:
        X: Feature values shape (n_samples, 1)
        y: Target values shape (n_samples,)
    """
    if seed is not None:
        np.random.seed(seed)
    
    X = np.random.uniform(x_min, x_max, n_samples)
    y = true_function(X) + np.random.normal(0, noise_std, n_samples)
    
    # reshape(-1, 1) converts 1D array to 2D column vector
    # sklearn expects X to be 2D: (n_samples, n_features)
    # -1 means "infer this dimension", 1 means "1 column"
    return X.reshape(-1, 1), y


# Generate and visualize data
X_sample, y_sample = generate_data(100, noise_std=0.3, seed=42)

# Plot true function and noisy data
x_plot = np.linspace(0, 4, 200)
y_true = true_function(x_plot)

plt.figure(figsize=(12, 6))
plt.scatter(X_sample, y_sample, alpha=0.6, s=50, label='Noisy observations')
plt.plot(x_plot, y_true, 'r-', linewidth=3, label='True function f(x)')

# plt.fill_between() fills the area between two y-values
# Useful for showing confidence bands, error ranges, or uncertainty
# Args: x-values, lower y-bound, upper y-bound, styling options
plt.fill_between(x_plot, y_true - 0.3, y_true + 0.3, alpha=0.2, color='red', label='Noise band (±σ)')

plt.xlabel('x', fontsize=12)
plt.ylabel('y', fontsize=12)
plt.title('Our Learning Problem: Noisy Data from Unknown True Function', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print(f"Generated {len(X_sample)} samples with noise std = 0.3")
print(f"Irreducible error (variance of noise): {0.3**2:.4f}")

---

## Part 3: Fitting Models of Different Complexity

Let's fit polynomial models from degree 1 (simple line) to degree 15 (very wiggly).

In [None]:
def fit_polynomial(X: np.ndarray, y: np.ndarray, degree: int) -> object:
    """
    Fit a polynomial regression model.
    
    Args:
        X: Features shape (n_samples, 1)
        y: Target shape (n_samples,)
        degree: Polynomial degree
        
    Returns:
        Fitted sklearn Pipeline
    """
    model = make_pipeline(
        PolynomialFeatures(degree, include_bias=False),
        LinearRegression()
    )
    model.fit(X, y)
    return model


# Fit models of various degrees
degrees = [1, 3, 5, 10, 15]

fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.flatten()

# Plot true function in first subplot
axes[0].scatter(X_sample, y_sample, alpha=0.4, s=30)
axes[0].plot(x_plot, y_true, 'r-', linewidth=3)
axes[0].set_title('True Function', fontsize=12, fontweight='bold')
axes[0].set_xlabel('x')
axes[0].set_ylabel('y')
axes[0].grid(True, alpha=0.3)

# Fit and plot each model
for ax, degree in zip(axes[1:], degrees):
    model = fit_polynomial(X_sample, y_sample, degree)
    y_pred = model.predict(x_plot.reshape(-1, 1))
    
    ax.scatter(X_sample, y_sample, alpha=0.4, s=30)
    ax.plot(x_plot, y_true, 'r-', linewidth=2, alpha=0.5, label='True')
    ax.plot(x_plot, y_pred, 'b-', linewidth=2, label=f'Degree {degree}')
    
    # Calculate training MSE
    train_mse = np.mean((y_sample - model.predict(X_sample)) ** 2)
    
    ax.set_title(f'Degree {degree}\nTrain MSE: {train_mse:.4f}', fontsize=12, fontweight='bold')
    ax.set_xlabel('x')
    ax.set_ylabel('y')
    ax.legend(fontsize=9)
    ax.grid(True, alpha=0.3)
    ax.set_ylim(-2, 2.5)

plt.suptitle('Polynomial Fits: From Underfitting to Overfitting', fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

### What Just Happened?

- **Degree 1 (line)**: Underfits - too simple to capture the wave pattern (HIGH BIAS)
- **Degree 3-5**: Getting better - captures the main shape
- **Degree 10-15**: Overfits - follows the noise too closely (HIGH VARIANCE)

Notice how training MSE keeps decreasing, but that doesn't mean the model is better!
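
To see why training MSE alone is a poor guide, it can help to track a held-out error next to it. The following is a minimal sketch, not the lab's own pipeline: it uses NumPy's `Polynomial.fit` instead of the scikit-learn pipeline, and the sample sizes and seed are arbitrary assumptions.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(1)

def f(x):
    # Same shape as the lab's true_function, repeated so this block runs alone
    return np.sin(2 * x) + 0.5 * np.cos(4 * x)

# Small training set and a larger held-out set (sizes are assumptions)
x_train = rng.uniform(0, 4, 40)
y_train = f(x_train) + rng.normal(0, 0.3, 40)
x_val = rng.uniform(0, 4, 400)
y_val = f(x_val) + rng.normal(0, 0.3, 400)

train_mse, val_mse = [], []
for degree in range(1, 13):
    # Polynomial.fit rescales x internally, which keeps high degrees stable
    p = Polynomial.fit(x_train, y_train, degree)
    train_mse.append(np.mean((y_train - p(x_train)) ** 2))
    val_mse.append(np.mean((y_val - p(x_val)) ** 2))

for d, (tr, va) in enumerate(zip(train_mse, val_mse), start=1):
    print(f"degree {d:2d}: train {tr:.4f}  held-out {va:.4f}")
```

Because each degree contains the previous one as a special case, the least-squares training error cannot go up from one row to the next. The held-out column carries no such guarantee, which is exactly why it is the more informative one.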

---

## Part 4: Computing Bias and Variance via Bootstrap

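To measure bias and variance, we need to train the model on MANY different training sets. With real data we would have only one dataset and would approximate this with **bootstrap resampling** (redrawing it with replacement). Here we know the data-generating process, so the code draws a fresh training set from it on every iteration; the function names keep the word "bootstrap" for continuity.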
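
For reference, bootstrap resampling in the strict sense starts from one observed dataset and redraws it with replacement. The sketch below assumes a single synthetic dataset, an arbitrary polynomial degree, and `np.polyfit` as the fitting routine; it illustrates the resampling step and is not a replacement for the lab's function.

```python
import numpy as np

rng = np.random.default_rng(2)

# One observed dataset (in practice this is all we would have)
n = 100
x = rng.uniform(0, 4, n)
y = np.sin(2 * x) + 0.5 * np.cos(4 * x) + rng.normal(0, 0.3, n)

x_test = np.linspace(0.5, 3.5, 25)
n_boot = 200
degree = 5   # arbitrary choice for the illustration
preds = np.empty((n_boot, x_test.size))

for b in range(n_boot):
    # Bootstrap replicate: n row indices drawn WITH replacement
    idx = rng.integers(0, n, size=n)
    coefs = np.polyfit(x[idx], y[idx], degree)
    preds[b] = np.polyval(coefs, x_test)

# Spread of predictions across replicates approximates the variance term
boot_variance = preds.var(axis=0).mean()
print(f"Bootstrap estimate of variance (degree {degree}): {boot_variance:.4f}")
```

This gives a usable estimate of the variance term from one dataset. Estimating bias still needs a reference for the true function, which real data does not provide.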

In [None]:
def bootstrap_bias_variance(degree: int, 
                           n_samples: int = 100,
                           noise_std: float = 0.3,
                           n_bootstrap: int = 200,
                           n_test_points: int = 50) -> dict:
    """
    Estimate bias and variance by repeated simulation.
    
    We generate many fresh training sets from the known data-generating
    process (a Monte Carlo stand-in for bootstrap resampling), fit a model
    to each, and see how predictions vary.
    
    Args:
        degree: Polynomial degree
        n_samples: Samples per training set
        noise_std: Noise level
        n_bootstrap: Number of bootstrap samples
        n_test_points: Number of test points for evaluation
        
    Returns:
        Dictionary with bias², variance, noise, and total error
    """
    # Fixed test points
    X_test = np.linspace(0.5, 3.5, n_test_points).reshape(-1, 1)
    y_true_test = true_function(X_test.flatten())
    
    # np.zeros() creates an array filled with zeros
    # Shape (n_bootstrap, n_test_points) stores predictions from each model
    all_predictions = np.zeros((n_bootstrap, n_test_points))
    
    for i in range(n_bootstrap):
        # Generate new training data (simulating different training sets)
        X_train, y_train = generate_data(n_samples, noise_std, seed=i)
        
        # Fit model
        model = fit_polynomial(X_train, y_train, degree)
        
        # Predict on test points
        all_predictions[i] = model.predict(X_test)
    
    # Compute bias and variance
    # np.mean(..., axis=0) computes mean across first axis (rows)
    # Result: average prediction at each test point
    mean_prediction = np.mean(all_predictions, axis=0)
    
    # Bias² = E[(E[f̂] - f)²] = mean over test points of (mean_pred - true)²
    bias_squared = np.mean((mean_prediction - y_true_test) ** 2)
    
    # np.var(..., axis=0) computes variance across first axis
    # Variance = E[Var(f̂)] = mean over test points of variance of predictions
    variance = np.mean(np.var(all_predictions, axis=0))
    
    # Irreducible noise
    noise = noise_std ** 2
    
    # Total expected error
    total_error = bias_squared + variance + noise
    
    return {
        'degree': degree,
        'bias_squared': bias_squared,
        'variance': variance,
        'noise': noise,
        'total_error': total_error,
        'predictions': all_predictions,
        'X_test': X_test,
        'y_true_test': y_true_test,
        'mean_prediction': mean_prediction
    }


print("Bootstrap bias-variance function defined!")
print("This may take a moment to run...")

In [None]:
# Compute for various polynomial degrees
degrees_to_test = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 15]

results = []
for degree in degrees_to_test:
    result = bootstrap_bias_variance(degree, n_bootstrap=200)
    results.append(result)
    print(f"Degree {degree:2d}: Bias²={result['bias_squared']:.4f}, "
          f"Var={result['variance']:.4f}, Total={result['total_error']:.4f}")

print(f"\nIrreducible noise (σ²): {results[0]['noise']:.4f}")

In [None]:
# Create the classic bias-variance tradeoff plot
degrees = [r['degree'] for r in results]
biases = [r['bias_squared'] for r in results]
variances = [r['variance'] for r in results]
totals = [r['total_error'] for r in results]
noise = results[0]['noise']

plt.figure(figsize=(12, 7))

plt.plot(degrees, biases, 'b-o', linewidth=2, markersize=8, label='Bias²')
plt.plot(degrees, variances, 'r-o', linewidth=2, markersize=8, label='Variance')
plt.plot(degrees, totals, 'g-o', linewidth=3, markersize=8, label='Total Error')
plt.axhline(y=noise, color='gray', linestyle='--', linewidth=2, label=f'Irreducible Noise (σ²={noise:.2f})')

# np.argmin() returns the INDEX of the minimum value in an array
# Example: np.argmin([5, 2, 8, 1, 9]) returns 3 (index of value 1)
# Useful for finding which model/hyperparameter gave the best result
optimal_idx = np.argmin(totals)
optimal_degree = degrees[optimal_idx]
optimal_error = totals[optimal_idx]

plt.scatter([optimal_degree], [optimal_error], s=300, color='gold', edgecolors='black', 
           linewidths=2, zorder=5, marker='*', label=f'Optimal (degree {optimal_degree})')

# Add annotations
plt.annotate('Underfitting\n(High Bias)', xy=(2, 0.08), fontsize=11, 
            ha='center', color='blue', fontweight='bold')
plt.annotate('Overfitting\n(High Variance)', xy=(12, 0.15), fontsize=11, 
            ha='center', color='red', fontweight='bold')

plt.xlabel('Model Complexity (Polynomial Degree)', fontsize=12)
plt.ylabel('Error', fontsize=12)
plt.title('The Bias-Variance Tradeoff\nFinding the Sweet Spot', fontsize=14, fontweight='bold')
plt.legend(loc='upper right', fontsize=11)
plt.grid(True, alpha=0.3)
plt.xticks(degrees)
plt.ylim(0, max(totals) * 1.1)
plt.tight_layout()
plt.show()

print(f"\nOptimal model complexity: Degree {optimal_degree}")
print(f"Minimum total error: {optimal_error:.4f}")
print(f"Breakdown: Bias²={biases[optimal_idx]:.4f} + Var={variances[optimal_idx]:.4f} + Noise={noise:.4f}")

### The U-Shaped Curve

This is one of the most important plots in all of machine learning!

- **Left side (simple models)**: High bias, low variance → Underfitting
- **Right side (complex models)**: Low bias, high variance → Overfitting  
- **Bottom of U**: Optimal tradeoff!

**The key insight:** For a fixed amount of training data, you usually cannot drive both bias and variance to zero by tuning complexity alone. Reducing one typically increases the other, so the goal is the balance that minimizes total error.
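
The tradeoff can be worked out exactly for an estimator much simpler than the polynomials in this lab. The sketch below assumes a toy problem (estimating a mean `mu` from `n` noisy observations) and a hypothetical estimator `c * sample_mean`. Its bias² is `(c - 1)² mu²` and its variance is `c² sigma² / n`, so lowering `c` reduces variance while raising bias.

```python
# Estimator: c * (sample mean of n observations), with 0 <= c <= 1.
mu = 1.0      # assumed true mean
sigma = 2.0   # assumed noise standard deviation
n = 10        # assumed sample size

def bias_sq(c):
    # E[c * mean] = c * mu, so the bias is (c - 1) * mu
    return ((c - 1.0) * mu) ** 2

def variance(c):
    # Var(c * mean) = c^2 * sigma^2 / n
    return c ** 2 * sigma ** 2 / n

# Scan c on a grid and pick the value with the lowest bias^2 + variance
cs = [i / 100 for i in range(101)]
errors = [bias_sq(c) + variance(c) for c in cs]
best_c = cs[errors.index(min(errors))]

# Closed-form minimiser of (c - 1)^2 mu^2 + c^2 sigma^2 / n
c_star = mu ** 2 / (mu ** 2 + sigma ** 2 / n)

print(f"grid minimum at c = {best_c:.2f}, closed form c* = {c_star:.4f}")
```

The unbiased choice `c = 1` is not the error-minimizing one here: accepting a little bias buys a larger drop in variance.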

---

## Part 5: Visualizing Prediction Spread

Let's see what high bias and high variance actually look like in terms of predictions.

In [None]:
# Compare low complexity (high bias) vs high complexity (high variance)
result_low = bootstrap_bias_variance(degree=1, n_bootstrap=50)
result_optimal = bootstrap_bias_variance(degree=optimal_degree, n_bootstrap=50)
result_high = bootstrap_bias_variance(degree=15, n_bootstrap=50)

fig, axes = plt.subplots(1, 3, figsize=(16, 5))

for ax, result, title in zip(axes, 
                             [result_low, result_optimal, result_high],
                             ['High Bias (Degree 1)', f'Optimal (Degree {optimal_degree})', 'High Variance (Degree 15)']):
    
    # Plot all predictions (as thin lines)
    for pred in result['predictions']:
        ax.plot(result['X_test'], pred, 'b-', alpha=0.1, linewidth=1)
    
    # Plot mean prediction
    ax.plot(result['X_test'], result['mean_prediction'], 'b-', linewidth=3, label='Mean prediction')
    
    # Plot true function
    ax.plot(result['X_test'], result['y_true_test'], 'r-', linewidth=3, label='True function')
    
    ax.set_title(f'{title}\nBias²={result["bias_squared"]:.4f}, Var={result["variance"]:.4f}', 
                fontsize=12, fontweight='bold')
    ax.set_xlabel('x')
    ax.set_ylabel('y')
    ax.legend(fontsize=9)
    ax.grid(True, alpha=0.3)
    ax.set_ylim(-2, 3)

plt.suptitle('Prediction Spread Across Different Training Sets', fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

print("Light blue lines = Individual model predictions")
print("Dark blue line = Average prediction across all models")
print("Red line = True function")
print("\nNotice:")
print("  - Degree 1: Predictions are consistent but wrong (biased)")
print(f"  - Degree {optimal_degree}: Good balance - close to true and consistent")
print("  - Degree 15: Average is close to true, but huge spread (variance)")

---

## Part 6: The Dartboard Visualization

Let's make the ELI5 dartboard analogy concrete!

### Matplotlib Drawing Primitives

The dartboard visualization uses matplotlib's patch drawing capabilities:

```python
# plt.Circle() creates a circular patch object
circle = plt.Circle((x_center, y_center), radius, color='red')

# ax.add_patch() adds the patch to the plot
ax.add_patch(circle)
```

**Other useful patches:** `plt.Rectangle()`, `plt.Polygon()`, `plt.Arrow()`

These are useful for creating custom visualizations beyond standard plots!

In [None]:
def create_dartboard_plot():
    """
    Create a dartboard-style visualization of bias vs variance.
    """
    fig, axes = plt.subplots(2, 2, figsize=(12, 12))
    
    scenarios = [
        ('High Bias, Low Variance', 2.0, 0.2),
        ('Low Bias, High Variance', 0.3, 1.5),
        ('High Bias, High Variance', 2.0, 1.5),
        ('Low Bias, Low Variance (Goal!)', 0.2, 0.2),
    ]
    
    for ax, (title, bias, variance) in zip(axes.flatten(), scenarios):
        # Draw dartboard
        for r, color in zip([3, 2, 1, 0.3], ['#eeeeee', '#cccccc', '#aaaaaa', '#ff4444']):
            circle = plt.Circle((0, 0), r, color=color, zorder=0)
            ax.add_patch(circle)
        
        # Generate "dart throws" (predictions)
        n_throws = 30
        # Bias shifts the center, variance spreads the throws
        throws_x = np.random.normal(bias, np.sqrt(variance), n_throws)
        throws_y = np.random.normal(0, np.sqrt(variance), n_throws)
        
        # Plot throws
        ax.scatter(throws_x, throws_y, c='blue', s=100, edgecolors='black', 
                  linewidths=1, zorder=5, alpha=0.7)
        
        # Plot center of throws
        ax.scatter([np.mean(throws_x)], [np.mean(throws_y)], c='yellow', s=200,
                  edgecolors='black', linewidths=2, zorder=6, marker='*')
        
        ax.set_xlim(-4, 4)
        ax.set_ylim(-4, 4)
        ax.set_aspect('equal')
        ax.set_title(title, fontsize=12, fontweight='bold')
        ax.set_xlabel('x error')
        ax.set_ylabel('y error')
        ax.axhline(y=0, color='black', linewidth=0.5, linestyle='--')
        ax.axvline(x=0, color='black', linewidth=0.5, linestyle='--')
    
    plt.suptitle('The Dartboard Analogy for Bias-Variance\n(Blue dots = predictions, Star = average, Red center = target)', 
                fontsize=14, fontweight='bold', y=1.02)
    plt.tight_layout()
    plt.show()


create_dartboard_plot()

---

## Part 7: Practical Implications

### How to Diagnose Your Model

| Symptom | Diagnosis | Solution |
|---------|-----------|----------|
| High train error, High test error | **Underfitting** (High Bias) | Increase model complexity, add features |
| Low train error, High test error | **Overfitting** (High Variance) | Regularization, more data, simpler model |
| Low train error, Low test error | **Just Right!** | Celebrate! |
| High train error, Low test error | Something is wrong | Check the evaluation: data leakage, a test set that is too small or too easy, or a bug in the split |
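
The table can be turned into a small helper for quick triage. This is a sketch under stated assumptions: the function name `diagnose`, the `noise_floor` argument (an estimate of σ² that you would need to supply), and the `tol` multiplier that decides what counts as "high" are all hypothetical choices, not part of the lab.

```python
def diagnose(train_err, test_err, noise_floor, tol=1.5):
    """Map a (train, test) error pair to a row of the table above.

    noise_floor: estimate of the irreducible error (sigma^2), assumed known.
    tol: hypothetical multiplier deciding what counts as "high".
    """
    high_train = train_err > tol * noise_floor
    high_test = test_err > tol * noise_floor
    if high_train and high_test:
        return "underfitting (high bias)"
    if not high_train and high_test:
        return "overfitting (high variance)"
    if not high_train and not high_test:
        return "reasonable fit"
    return "suspicious: re-check the evaluation pipeline"

print(diagnose(0.45, 0.50, noise_floor=0.09))
print(diagnose(0.05, 0.60, noise_floor=0.09))
print(diagnose(0.08, 0.10, noise_floor=0.09))
```

In practice the noise floor is rarely known, so thresholds like these are a starting point for discussion rather than a rule.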

In [None]:
# Let's compute train and test errors to see this pattern

def compute_train_test_errors(degree: int, n_train: int = 100, 
                             noise_std: float = 0.3, n_trials: int = 50) -> Tuple[float, float]:
    """
    Compute average training and test errors across multiple trials.
    """
    train_errors = []
    test_errors = []
    
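    # Fixed test set; a dedicated seeded generator keeps the test noise identical
    # on every call, regardless of the global NumPy random state
    X_test = np.linspace(0.5, 3.5, 100).reshape(-1, 1)
    test_rng = np.random.default_rng(0)
    y_test = true_function(X_test.flatten()) + test_rng.normal(0, noise_std, 100)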
    
    for trial in range(n_trials):
        X_train, y_train = generate_data(n_train, noise_std, seed=trial)
        
        model = fit_polynomial(X_train, y_train, degree)
        
        train_pred = model.predict(X_train)
        test_pred = model.predict(X_test)
        
        train_errors.append(np.mean((y_train - train_pred) ** 2))
        test_errors.append(np.mean((y_test - test_pred) ** 2))
    
    return np.mean(train_errors), np.mean(test_errors)


# Compute for various degrees
degrees = list(range(1, 16))
train_errs = []
test_errs = []

print("Computing train/test errors...")
for d in degrees:
    tr, te = compute_train_test_errors(d)
    train_errs.append(tr)
    test_errs.append(te)

# Plot
plt.figure(figsize=(12, 6))
plt.plot(degrees, train_errs, 'b-o', linewidth=2, markersize=8, label='Training Error')
plt.plot(degrees, test_errs, 'r-o', linewidth=2, markersize=8, label='Test Error')

# Mark underfitting and overfitting regions
plt.fill_between([1, 3], 0, 0.5, alpha=0.1, color='blue', label='Underfitting zone')
plt.fill_between([10, 15], 0, 0.5, alpha=0.1, color='red', label='Overfitting zone')

optimal_idx = np.argmin(test_errs)
plt.scatter([degrees[optimal_idx]], [test_errs[optimal_idx]], s=200, color='gold', 
           edgecolors='black', linewidths=2, zorder=5, marker='*')

plt.xlabel('Model Complexity (Polynomial Degree)', fontsize=12)
plt.ylabel('Mean Squared Error', fontsize=12)
plt.title('Training vs Test Error\nThe Gap Reveals Overfitting', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.xticks(degrees)
plt.ylim(0, max(test_errs) * 1.1)
plt.tight_layout()
plt.show()

print(f"\nOptimal degree: {degrees[optimal_idx]}")
print(f"\nPattern to remember:")
print("  - Train error never increases with complexity (nested least-squares fits)")
print("  - Test error decreases, then INCREASES (overfitting!)")
print("  - The GAP between train and test reveals variance")

---

## Part 8: Regularization Reduces Variance

One of the most practical applications: regularization (like Ridge regression) reduces variance at the cost of slightly increased bias.
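
The mechanism behind this is easiest to see in the standard ridge closed form, where the penalty `alpha` is added to the diagonal of `XᵀX` before solving. The sketch below is written in plain NumPy on assumed synthetic data, with features and targets centred so that the intercept is left unpenalised; it is an illustration, not the internals of scikit-learn's `Ridge`.

```python
import numpy as np

rng = np.random.default_rng(3)

# Assumed toy data: 30 noisy samples of sin(3x) on [-1, 1]
x = rng.uniform(-1, 1, 30)
y = np.sin(3 * x) + rng.normal(0, 0.3, 30)

# Degree-8 polynomial features (no constant column), then centre everything
X = np.vander(x, 9, increasing=True)[:, 1:]
X_c = X - X.mean(axis=0)
y_c = y - y.mean()

def ridge_coefs(alpha):
    # Closed form: w = (X^T X + alpha * I)^(-1) X^T y
    d = X_c.shape[1]
    return np.linalg.solve(X_c.T @ X_c + alpha * np.eye(d), X_c.T @ y_c)

alphas = [0.001, 0.01, 0.1, 1.0, 10.0]
norms = [float(np.linalg.norm(ridge_coefs(a))) for a in alphas]

for a, nrm in zip(alphas, norms):
    print(f"alpha={a:6.3f}  ||w|| = {nrm:.3f}")
```

Larger `alpha` pulls the coefficient vector toward zero, so the fitted curve can react less to any individual training set. That is the variance reduction, and the same shrinkage is the source of the added bias.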

In [None]:
def fit_ridge_polynomial(X: np.ndarray, y: np.ndarray, degree: int, alpha: float) -> object:
    """
    Fit a regularized (Ridge) polynomial regression model.
    """
    model = make_pipeline(
        PolynomialFeatures(degree, include_bias=False),
        Ridge(alpha=alpha)
    )
    model.fit(X, y)
    return model


def bootstrap_regularized(degree: int, alpha: float, n_bootstrap: int = 100) -> dict:
    """
    Compute bias-variance for regularized model.
    """
    X_test = np.linspace(0.5, 3.5, 50).reshape(-1, 1)
    y_true_test = true_function(X_test.flatten())
    
    all_predictions = np.zeros((n_bootstrap, 50))
    
    for i in range(n_bootstrap):
        X_train, y_train = generate_data(100, noise_std=0.3, seed=i)
        model = fit_ridge_polynomial(X_train, y_train, degree, alpha)
        all_predictions[i] = model.predict(X_test)
    
    mean_prediction = np.mean(all_predictions, axis=0)
    bias_squared = np.mean((mean_prediction - y_true_test) ** 2)
    variance = np.mean(np.var(all_predictions, axis=0))
    
    return {
        'alpha': alpha,
        'bias_squared': bias_squared,
        'variance': variance,
        'total': bias_squared + variance + 0.3 ** 2  # noise variance for noise_std=0.3 used above
    }


# Compare unregularized vs regularized for high-degree polynomial
degree = 12
alphas = [0, 0.001, 0.01, 0.1, 1.0, 10.0]

print(f"Regularization effect on Degree {degree} polynomial:")
print("=" * 50)

reg_results = []
for alpha in alphas:
    result = bootstrap_regularized(degree, alpha)
    reg_results.append(result)
    print(f"α={alpha:6.3f}: Bias²={result['bias_squared']:.4f}, "
          f"Var={result['variance']:.4f}, Total={result['total']:.4f}")

In [None]:
# Visualize regularization effect
# A log axis cannot show α = 0, so the unregularized fit is plotted at 1e-4
alphas_log = [r['alpha'] if r['alpha'] > 0 else 1e-4 for r in reg_results]
biases = [r['bias_squared'] for r in reg_results]
variances = [r['variance'] for r in reg_results]
totals = [r['total'] for r in reg_results]

plt.figure(figsize=(12, 6))

# plt.semilogx() plots with logarithmic x-axis, linear y-axis
# Useful when x values span multiple orders of magnitude (like 0.001 to 10)
# Related functions:
#   plt.semilogy() - log y-axis, linear x-axis
#   plt.loglog() - both axes logarithmic
plt.semilogx(alphas_log, biases, 'b-o', linewidth=2, markersize=8, label='Bias²')
plt.semilogx(alphas_log, variances, 'r-o', linewidth=2, markersize=8, label='Variance')
plt.semilogx(alphas_log, totals, 'g-o', linewidth=3, markersize=8, label='Total Error')

plt.xlabel('Regularization Strength (α)', fontsize=12)
plt.ylabel('Error', fontsize=12)
plt.title(f'Regularization Trades Variance for Bias\n(Degree {degree} Polynomial)', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\nKey insight: As regularization increases...")
print("  - Variance DECREASES (model becomes more stable)")
print("  - Bias INCREASES (model becomes simpler)")
print("  - There's an optimal α that minimizes total error!")

---

## Try It Yourself

### Exercise 1: Different Noise Levels

How does the optimal model complexity change with noise level? Test with noise_std = 0.1, 0.3, and 0.6.

<details>
<summary>Hint</summary>
Higher noise makes it harder to distinguish signal from noise. You might expect simpler models to be optimal when noise is high.
</details>

In [None]:
# Exercise 1: Your code here
# Test different noise levels and find optimal degree for each

noise_levels = [0.1, 0.3, 0.6]

# For each noise level, compute bias-variance decomposition
# and find the optimal polynomial degree

# Your code here...

### Exercise 2: More Training Data

Does having more training data allow us to use more complex models? Test n_samples = 50, 200, and 500.

<details>
<summary>Hint</summary>
More data reduces variance, so a more complex model becomes affordable. This is one reason large datasets make high-capacity models such as deep networks practical.
</details>

In [None]:
# Exercise 2: Your code here

# bootstrap_bias_variance already accepts an n_samples argument,
# so call it with each dataset size and compare the optimal degree

# Your code here...

---

## Common Mistakes

### Mistake 1: Using Training Error to Select Models

```python
# WRONG:
# "Degree 15 has the lowest training error, so it's the best model!"

# RIGHT:
# Use validation/test error or cross-validation to select models.
# For nested least-squares models, training error NEVER increases with complexity!
```
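
A minimal sketch of the cross-validation route, written by hand in NumPy so that the fold logic is visible. The data, the fold count `k`, and the helper name `kfold_mse` are assumptions for illustration; in the lab's own setup `cross_val_score` from scikit-learn would play this role.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(4)

# Synthetic data with the same shape as the lab's problem (assumed values)
x = rng.uniform(0, 4, 120)
y = np.sin(2 * x) + 0.5 * np.cos(4 * x) + rng.normal(0, 0.3, 120)

k = 5
# Shuffle once and reuse the same folds for every degree (fair comparison)
folds = np.array_split(rng.permutation(len(x)), k)

def kfold_mse(degree):
    """Average held-out MSE over the k folds for one polynomial degree."""
    scores = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        p = Polynomial.fit(x[train], y[train], degree)
        scores.append(np.mean((y[val] - p(x[val])) ** 2))
    return float(np.mean(scores))

cv_scores = {d: kfold_mse(d) for d in range(1, 11)}
best_degree = min(cv_scores, key=cv_scores.get)
print(f"Degree with the lowest {k}-fold CV error: {best_degree}")
```

Every score here is computed on points the model did not see during fitting, so a high-degree model gets no credit for memorising noise.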

### Mistake 2: Confusing Bias with Dataset Bias

```python
# WRONG:
# "My model has bias because my training data is biased"

# RIGHT:
# Statistical bias (underfitting) ≠ Data bias (unfair representation)
# Same word, completely different concepts!
```

### Mistake 3: Thinking More Data Always Helps

```python
# WRONG:
# "I'll just collect more data to fix my underfitting problem"

# RIGHT:
# More data reduces VARIANCE, not BIAS.
# If your model is too simple (high bias), more data won't help much.
# You need a more expressive model.
```
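
This claim can be checked with a deliberately too-simple model. The sketch below assumes a straight-line fit to `sin(2x)` with noise level 0.3 and compares two training-set sizes; the helper name `linear_fit_stats` and all constants are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(5)
x_test = np.linspace(0.5, 3.5, 20)
f_test = np.sin(2 * x_test)

def linear_fit_stats(n, n_trials=300):
    """Bias^2 and variance of a straight-line fit to sin(2x) at sample size n."""
    preds = np.empty((n_trials, x_test.size))
    for t in range(n_trials):
        # Fresh training set of size n on every trial
        x = rng.uniform(0, 4, n)
        y = np.sin(2 * x) + rng.normal(0, 0.3, n)
        slope, intercept = np.polyfit(x, y, 1)
        preds[t] = slope * x_test + intercept
    bias_sq = np.mean((preds.mean(axis=0) - f_test) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias_sq, variance

for n in (30, 1000):
    b, v = linear_fit_stats(n)
    print(f"n={n:5d}: bias^2={b:.4f}  variance={v:.4f}")
```

The variance column should shrink sharply with the larger sample, while bias² stays large because a line cannot follow a sine wave no matter how many points it sees.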

---

## Checkpoint

You've learned:
- **The decomposition**: Total Error = Bias² + Variance + Irreducible Noise
- **Bias**: Systematic error from model assumptions (underfitting)
- **Variance**: Sensitivity to training data (overfitting)
- **The tradeoff**: Reducing one typically increases the other
- **Resampling estimation**: Empirically measuring bias and variance by refitting on many training sets
- **Regularization**: A knob to trade variance for bias
- **Practical diagnosis**: Train-test gap reveals variance

---

## Challenge (Optional)

### Ensemble Methods and Variance Reduction

Prove that averaging predictions from multiple models (bagging) reduces variance but not bias. Implement a simple bagging scheme and show this empirically.

```python
def bagging_predictor(X_train, y_train, X_test, n_models=10, degree=10):
    """
    Train multiple models on bootstrap samples and average predictions.
    """
    # Your implementation here
    pass
```

In [None]:
# Challenge: Your implementation here

---

## Further Reading

- [The Elements of Statistical Learning](https://hastie.su.domains/ElemStatLearn/) - Chapter 7 (Bias-Variance)
- [An Introduction to Statistical Learning](https://www.statlearning.com/) - Chapter 2.2
- [Understanding the Bias-Variance Tradeoff](http://scott.fortmann-roe.com/docs/BiasVariance.html) - Excellent visual explanation

---

## Cleanup

In [None]:
# Clear any large variables
import gc

# Close all matplotlib figures
plt.close('all')

# Garbage collection
gc.collect()

print("Cleanup complete!")
print("\nNext up: Lab A.3 - PAC Learning Bounds")