# Lab A.3: PAC Learning Bounds

**Module:** A - Statistical Learning Theory  
**Time:** 1.5 hours  
**Difficulty:** ⭐⭐⭐⭐ (Advanced)

---

## Learning Objectives

By the end of this notebook, you will:
- [ ] Understand the PAC (Probably Approximately Correct) learning framework
- [ ] Calculate sample complexity bounds for learning tasks
- [ ] Compare theoretical bounds to empirical performance
- [ ] Apply PAC learning to estimate required dataset sizes
- [ ] Understand why theoretical bounds are often loose

---

## Prerequisites

- Completed: Lab A.1 (VC Dimension)
- Completed: Lab A.2 (Bias-Variance)
- Knowledge of: Probability basics, hypothesis classes

---

## Real-World Context

You're leading an ML project and your manager asks: *"How much training data do we need to deploy this model with 99% confidence that it's at least 95% accurate?"*

This isn't just a theoretical question - it has real business implications:
- **Data collection costs**: Labeling 1 million examples might cost $500K
- **Time to deployment**: More data = longer project timeline
- **Risk management**: Under-trained models fail in production

**PAC learning theory** provides mathematical guarantees for answering these questions. While the bounds are often conservative, they give you worst-case guarantees that regulators and risk managers love.

---

## ELI5: What is PAC Learning?

> **Imagine you're learning to be a doctor...** 
>
> You study 100 patient cases (training data). After studying, you need to diagnose NEW patients you've never seen.
>
> PAC learning asks: *How many cases do you need to study to be **probably** (with 99% chance) **approximately correct** (95% accurate) on new patients?*
>
> - **Probably** (the P): We can't guarantee 100% success, but we can guarantee 99% chance of success
> - **Approximately** (the A): We might make a few mistakes (up to 5%), but we'll be mostly right
> - **Correct** (the C): On new patients we've never seen!
>
> **The magic:** PAC theory gives you a number of cases that is *guaranteed to be enough*. It is a safe upper bound, not an exact requirement: you can often succeed with fewer cases, but then the guarantee no longer applies.
>
> **In AI terms:**
> - ε (epsilon) = Maximum acceptable error rate (e.g., 0.05 = 5% errors allowed)
> - δ (delta) = Maximum acceptable failure probability (e.g., 0.01 = 1% chance we fail to learn)
> - m = Number of training examples needed = f(ε, δ, hypothesis complexity)

---

## Part 1: Setting Up Our Environment

In [None]:
# Core imports
import numpy as np
import matplotlib.pyplot as plt
from typing import Tuple, List, Callable, Optional
import warnings
warnings.filterwarnings('ignore')

# Scikit-learn
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

# Set nice plotting defaults
plt.style.use('default')
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 12

# Seed for reproducibility
np.random.seed(42)

print("Environment ready for PAC Learning exploration!")
print(f"NumPy version: {np.__version__}")

---

## Part 2: The PAC Learning Definition

### Formal Definition

A hypothesis class $\mathcal{H}$ is **PAC learnable** if there exists:
- An algorithm $A$
- A sample complexity function $m(\epsilon, \delta)$

Such that: For any distribution $D$ over labeled examples $(x, y)$, and any $\epsilon, \delta \in (0, 1)$:

When given $m \geq m(\epsilon, \delta)$ samples drawn i.i.d. from $D$, with probability at least $1 - \delta$, the algorithm outputs a hypothesis $h$ with:

$$\text{Error}(h) \leq \min_{h^* \in \mathcal{H}} \text{Error}(h^*) + \epsilon$$

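In other words: with high probability, we're within ε of the best possible hypothesis in our class. This is the *agnostic* form of the definition. In the *realizable* case, some $h^* \in \mathcal{H}$ has zero error and the guarantee reduces to $\text{Error}(h) \leq \epsilon$.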
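
To see where a concrete $m(\epsilon, \delta)$ can come from, here is a sketch of the standard union-bound argument for a finite class. It assumes the realizable case (some $h^* \in \mathcal{H}$ has zero error) and a learner that returns any hypothesis consistent with the sample. First, a single "bad" hypothesis is unlikely to survive $m$ samples:

$$\Pr\big[\text{a fixed } h \text{ with } \text{Error}(h) > \epsilon \text{ fits all } m \text{ samples}\big] \leq (1 - \epsilon)^m \leq e^{-\epsilon m}$$

Taking a union bound over the at most $|\mathcal{H}|$ bad hypotheses:

$$\Pr\big[\text{some } h \in \mathcal{H} \text{ with } \text{Error}(h) > \epsilon \text{ fits all } m \text{ samples}\big] \leq |\mathcal{H}|\, e^{-\epsilon m}$$

Requiring the right-hand side to be at most $\delta$ and solving for $m$:

$$|\mathcal{H}|\, e^{-\epsilon m} \leq \delta \iff m \geq \frac{1}{\epsilon}\left(\ln|\mathcal{H}| + \ln\frac{1}{\delta}\right)$$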

In [None]:
def pac_sample_complexity_basic(epsilon: float, delta: float, h_size: int = 100) -> int:
    """
    Basic sample complexity bound for finite hypothesis classes.
    
    For a hypothesis class with |H| hypotheses (realizable case, i.e.
    some hypothesis in H has zero error):
    m >= (1/ε) * (ln|H| + ln(1/δ))
    
    This is the simplest PAC bound - useful for finite hypothesis spaces.
    
    Args:
        epsilon: Accuracy parameter (maximum allowed error)
        delta: Confidence parameter (probability of failure)
        h_size: Number of hypotheses |H| (defaults to 100 for the demo below)
        
    Returns:
        Sample complexity (number of samples sufficient for the guarantee)
    """
    m = (1 / epsilon) * (np.log(h_size) + np.log(1 / delta))
    return int(np.ceil(m))


# Example calculation
print("Sample Complexity for Finite Hypothesis Class (|H|=100):")
print("=" * 55)

for eps in [0.1, 0.05, 0.01]:
    for delt in [0.1, 0.05, 0.01]:
        m = pac_sample_complexity_basic(eps, delt)
        print(f"ε={eps:.2f}, δ={delt:.2f}: Need {m:>6,} samples")
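
The finite-class bound can be sanity-checked by simulation. The sketch below is a toy setup of our own choosing, not part of the lab's pipeline: 101 threshold classifiers on $[0, 1]$, uniformly distributed inputs, and a target threshold that lies in the class (realizable case). For each trial it draws exactly the number of samples the bound asks for, picks the *worst* hypothesis that still fits the sample perfectly, and counts how often its true error exceeds ε. The guarantee says this should happen in at most a δ fraction of trials.

```python
import math
import numpy as np


def finite_class_bound(h_size: int, epsilon: float, delta: float) -> int:
    """Realizable-case bound: m >= (1/eps) * (ln|H| + ln(1/delta))."""
    return math.ceil((math.log(h_size) + math.log(1 / delta)) / epsilon)


def worst_consistent_error(x, y, thresholds, true_t):
    """True error of the worst hypothesis that still fits the sample perfectly."""
    # One row of predictions per threshold hypothesis h_t(x) = 1[x >= t]
    preds = (x[None, :] >= thresholds[:, None]).astype(int)
    consistent = (preds == y[None, :]).all(axis=1)
    # With x ~ Uniform[0, 1], h_t and the target disagree on a region of mass |t - true_t|
    true_errors = np.abs(thresholds - true_t)
    return true_errors[consistent].max()


rng = np.random.default_rng(0)
thresholds = np.linspace(0.0, 1.0, 101)  # |H| = 101 (toy hypothesis class)
true_t = thresholds[50]                  # target lies in the class, so it is always consistent
epsilon, delta = 0.1, 0.05
m = finite_class_bound(len(thresholds), epsilon, delta)

n_trials = 2000
failures = 0
for _ in range(n_trials):
    x = rng.random(m)
    y = (x >= true_t).astype(int)
    if worst_consistent_error(x, y, thresholds, true_t) > epsilon:
        failures += 1

failure_rate = failures / n_trials
print(f"m = {m}, observed failure rate = {failure_rate:.4f} (guarantee: <= {delta})")
```

The observed failure rate is typically far below δ, an early hint of how conservative these bounds are.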

---

## Part 3: VC Dimension-Based Sample Complexity

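For infinite hypothesis classes (like linear classifiers), we use the **VC dimension** from Lab A.1. In the realizable case, it is sufficient to have:

$$m \geq \frac{C}{\epsilon}\left(d \cdot \ln\frac{16}{\epsilon} + \ln\frac{2}{\delta}\right)$$

Up to constants, this is $O\left(\frac{1}{\epsilon}\left(d \log\frac{1}{\epsilon} + \log\frac{1}{\delta}\right)\right)$. The factors 16 and 2 inside the logarithms are one explicit choice of constants, matching the code below. In the agnostic (noisy-label) setting, the dependence on ε worsens to $1/\epsilon^2$.

where:
- $d$ = VC dimension of the hypothesis class
- $C$ = constant (typically 8-16 depending on the bound used)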
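
As a worked example, take a linear classifier on 20 features ($d = 21$) with $\epsilon = \delta = 0.05$, using the explicit constants implemented in `pac_sample_complexity_vc` ($C = 8$, with $16/\epsilon$ and $2/\delta$ inside the logarithms):

$$m \geq \frac{8}{0.05}\left(21 \ln\frac{16}{0.05} + \ln\frac{2}{0.05}\right) \approx 160 \times (21 \times 5.768 + 3.689) \approx 160 \times 124.82 \approx 19{,}972$$

Almost all of the total comes from the $d \ln(16/\epsilon)$ term. The confidence term $\ln(2/\delta)$ contributes only about 3%.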

In [None]:
def pac_sample_complexity_vc(vc_dim: int, epsilon: float, delta: float, C: float = 8.0) -> int:
    """
    Sample complexity bound based on VC dimension.
    
    This has the form of the upper bound in the fundamental theorem of
    PAC learning (realizable case) for infinite hypothesis classes.
    
    Args:
        vc_dim: VC dimension of the hypothesis class
        epsilon: Accuracy parameter (maximum allowed error)
        delta: Confidence parameter (probability of failure)
        C: Constant in the bound (typically 8-16)
        
    Returns:
        Sample complexity (number of samples sufficient for the guarantee)
    """
    # Same shape as the VC bound in "Understanding Machine Learning"
    # (Shalev-Shwartz & Ben-David); the explicit constants (C, 16, 2) are one
    # common choice, and other references use slightly different values.
    m = (C / epsilon) * (vc_dim * np.log(16 / epsilon) + np.log(2 / delta))
    return int(np.ceil(m))


def vc_dimension_linear(d: int) -> int:
    """VC dimension of linear classifiers in d dimensions = d + 1"""
    return d + 1


# Calculate sample complexity for various settings
print("Sample Complexity for Linear Classifiers (VC = d + 1):")
print("=" * 65)
print(f"{'Dimension':>10} | {'VC Dim':>8} | {'ε=0.05':>12} | {'ε=0.01':>12}")
print("-" * 65)

for d in [10, 50, 100, 500, 1000]:
    vc = vc_dimension_linear(d)
    m_5 = pac_sample_complexity_vc(vc, epsilon=0.05, delta=0.05)
    m_1 = pac_sample_complexity_vc(vc, epsilon=0.01, delta=0.05)
    print(f"{d:>10} | {vc:>8} | {m_5:>12,} | {m_1:>12,}")

print("\n(δ = 0.05 for 95% confidence)")

In [None]:
# Visualize how sample complexity scales
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Sample complexity vs VC dimension
vc_dims = np.arange(1, 101)
m_5 = [pac_sample_complexity_vc(vc, 0.05, 0.05) for vc in vc_dims]
m_1 = [pac_sample_complexity_vc(vc, 0.01, 0.05) for vc in vc_dims]

axes[0].plot(vc_dims, m_5, 'b-', linewidth=2, label='ε = 0.05')
axes[0].plot(vc_dims, m_1, 'r-', linewidth=2, label='ε = 0.01')
axes[0].set_xlabel('VC Dimension', fontsize=12)
axes[0].set_ylabel('Samples Needed', fontsize=12)
axes[0].set_title('Sample Complexity vs VC Dimension\n(δ = 0.05)', fontsize=12, fontweight='bold')
axes[0].legend(fontsize=11)
axes[0].grid(True, alpha=0.3)

# Plot 2: Sample complexity vs epsilon
epsilons = np.linspace(0.01, 0.2, 50)
vc = 50  # Fixed VC dimension
m_values = [pac_sample_complexity_vc(vc, eps, 0.05) for eps in epsilons]

axes[1].plot(epsilons, m_values, 'g-', linewidth=2)
axes[1].set_xlabel('Epsilon (ε) - Allowed Error', fontsize=12)
axes[1].set_ylabel('Samples Needed', fontsize=12)
axes[1].set_title(f'Sample Complexity vs ε\n(VC = {vc}, δ = 0.05)', fontsize=12, fontweight='bold')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Key insight: Sample complexity grows linearly with VC dimension")
print("but scales as (1/ε)·log(1/ε) in the allowed error.")
print("Demanding 1% error instead of 5% requires ~6x more data with this bound!")
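
In practice the question often runs the other way: the dataset size is fixed, and you want to know what error the bound can still guarantee. Since the bound is decreasing in ε, it can be inverted numerically. The sketch below re-implements the same formula so that it is self-contained; `vc_bound` and `epsilon_for_budget` are illustrative names, not functions defined elsewhere in this lab.

```python
import math
from typing import Optional


def vc_bound(vc_dim: int, epsilon: float, delta: float, C: float = 8.0) -> float:
    """Same formula as pac_sample_complexity_vc, without rounding."""
    return (C / epsilon) * (vc_dim * math.log(16 / epsilon) + math.log(2 / delta))


def epsilon_for_budget(m: int, vc_dim: int, delta: float,
                       C: float = 8.0, tol: float = 1e-6) -> Optional[float]:
    """Smallest epsilon the bound guarantees with m samples (None if vacuous)."""
    lo, hi = 1e-9, 1.0
    if vc_bound(vc_dim, hi, delta, C) > m:
        return None  # even epsilon = 1 needs more than m samples: no guarantee
    # Bisection: the bound is monotonically decreasing in epsilon
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if vc_bound(vc_dim, mid, delta, C) <= m:
            hi = mid
        else:
            lo = mid
    return hi


eps_20k = epsilon_for_budget(m=20_000, vc_dim=21, delta=0.05)
eps_100 = epsilon_for_budget(m=100, vc_dim=21, delta=0.05)
print(f"20,000 samples, VC=21: guaranteed error <= {eps_20k:.4f}")
print(f"100 samples, VC=21: {'no guarantee' if eps_100 is None else eps_100}")
```

A return value of `None` means the bound is vacuous at that sample size: it cannot promise anything better than trivial error, which says nothing about how well the model will actually perform.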

---

## Part 4: Interactive Sample Complexity Calculator

Let's build a practical tool for estimating data requirements.

In [None]:
class SampleComplexityCalculator:
    """
    A practical tool for estimating training data requirements.
    
    This class provides multiple bounds and helps compare
    theoretical requirements with practical rules of thumb.
    """
    
    def __init__(self, model_type: str, n_features: int):
        """
        Initialize calculator.
        
        Args:
            model_type: 'linear', 'polynomial', 'neural_network'
            n_features: Number of input features
        """
        self.model_type = model_type
        self.n_features = n_features
        self.vc_dim = self._estimate_vc_dimension()
    
    def _estimate_vc_dimension(self) -> int:
        """Estimate VC dimension based on model type."""
        if self.model_type == 'linear':
            return self.n_features + 1
        elif self.model_type == 'polynomial':
            # Degree 2 polynomial
            from math import comb
            return comb(self.n_features + 2, 2)
        elif self.model_type == 'neural_network':
            # Known bounds scale like O(W*L*log(W)), with W = weights and L = layers
            # Assume a small MLP: 2 hidden layers, 100 neurons each (biases ignored)
            n_params = (self.n_features * 100) + (100 * 100) + (100 * 1)
            return n_params * 2  # Crude heuristic stand-in, not a proven bound
        else:
            return self.n_features + 1
    
    def pac_bound(self, epsilon: float, delta: float) -> int:
        """PAC learning bound based on VC dimension."""
        return pac_sample_complexity_vc(self.vc_dim, epsilon, delta)
    
    def practical_rule(self) -> int:
        """Practical rule of thumb: 10-30x the number of parameters."""
        if self.model_type == 'linear':
            return 10 * (self.n_features + 1)
        elif self.model_type == 'neural_network':
            n_params = (self.n_features * 100) + (100 * 100) + (100 * 1)
            return 10 * n_params
        else:
            return 10 * self.vc_dim
    
    def summarize(self, epsilon: float = 0.05, delta: float = 0.05):
        """Print a summary of sample complexity estimates."""
        pac = self.pac_bound(epsilon, delta)
        practical = self.practical_rule()
        
        print(f"\nSample Complexity Analysis")
        print("=" * 50)
        print(f"Model Type: {self.model_type}")
        print(f"Input Features: {self.n_features}")
        print(f"Estimated VC Dimension: {self.vc_dim:,}")
        print(f"\nFor ε={epsilon} (max {epsilon*100:.1f}% error), δ={delta} ({(1-delta)*100:.0f}% confidence):")
        print(f"  PAC Theoretical Bound: {pac:>15,} samples")
        print(f"  Practical Rule of Thumb: {practical:>12,} samples")
        print(f"\nRecommendation: Start with {practical:,} samples,")
        print(f"                scale up if validation error > {epsilon*100:.1f}%")


# Example usage
print("Example 1: Spam classifier with 500 word features")
calc = SampleComplexityCalculator('linear', 500)
calc.summarize(epsilon=0.05, delta=0.05)

print("\n" + "="*60 + "\n")

print("Example 2: Image classifier (small MLP) with 784 pixels")
calc2 = SampleComplexityCalculator('neural_network', 784)
calc2.summarize(epsilon=0.05, delta=0.05)

---

## Part 5: Empirical Verification

Let's verify the PAC bounds empirically by training models with different amounts of data.

In [None]:
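def generate_classification_data(n_samples: int, n_features: int = 20, 
                                 noise: float = 0.1, seed: Optional[int] = None,
                                 weights_seed: int = 0) -> Tuple:
    """
    Generate synthetic classification data.
    
    Args:
        n_samples: Number of samples
        n_features: Number of features
        noise: Fraction of labels flipped at random
        seed: Random seed for the sample (features and label noise)
        weights_seed: Seed for the true decision boundary; kept separate so
            that train and test sets drawn with different `seed` values
            share the same boundary
        
    Returns:
        X, y, true_weights (for the linear decision boundary)
    """
    # Local generator: avoids reseeding NumPy's global random state
    rng = np.random.default_rng(seed)
    
    # Generate features
    X = rng.standard_normal((n_samples, n_features))
    
    # True linear decision boundary (identical across calls with the same weights_seed)
    true_weights = np.random.default_rng(weights_seed).standard_normal(n_features)
    true_weights = true_weights / np.linalg.norm(true_weights)
    
    # Label probabilities from a logistic model
    logits = X @ true_weights
    probs = 1 / (1 + np.exp(-3 * logits))  # Scale for sharper boundary
    
    # Sample labels, then flip a random fraction as extra noise
    y = (rng.random(n_samples) < probs).astype(int)
    flip_mask = rng.random(n_samples) < noise
    y[flip_mask] = 1 - y[flip_mask]
    
    return X, y, true_weights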


def run_learning_experiment(n_train: int, n_test: int = 5000, 
                           n_features: int = 20, n_trials: int = 50) -> Tuple:
    """
    Run learning experiment with given training set size.
    
    Returns:
        (mean_test_error, std_test_error, mean_train_error)
    """
    test_errors = []
    train_errors = []
    
    for trial in range(n_trials):
        # Generate data
        X_train, y_train, _ = generate_classification_data(n_train, n_features, 
                                                           noise=0.1, seed=trial)
        X_test, y_test, _ = generate_classification_data(n_test, n_features, 
                                                         noise=0.1, seed=1000+trial)
        
        # Train linear classifier
        clf = LogisticRegression(max_iter=1000, random_state=trial)
        clf.fit(X_train, y_train)
        
        # Compute errors
        train_error = 1 - clf.score(X_train, y_train)
        test_error = 1 - clf.score(X_test, y_test)
        
        train_errors.append(train_error)
        test_errors.append(test_error)
    
    return np.mean(test_errors), np.std(test_errors), np.mean(train_errors)


print("Experiment functions defined!")

In [None]:
# Run experiments with various training set sizes
n_features = 20
vc_dim = n_features + 1  # Linear classifier

# PAC bound says we need this many samples for ε=0.05, δ=0.05
pac_bound = pac_sample_complexity_vc(vc_dim, 0.05, 0.05)
print(f"PAC theoretical bound for ε=0.05: {pac_bound:,} samples")
print(f"Running experiments...\n")

# Test various training set sizes
train_sizes = [20, 50, 100, 200, 500, 1000, 2000, 5000]
results = []

for n_train in train_sizes:
    mean_err, std_err, train_err = run_learning_experiment(n_train, n_features=n_features, n_trials=30)
    results.append((n_train, mean_err, std_err, train_err))
    print(f"n_train={n_train:5d}: Test Error = {mean_err:.4f} ± {std_err:.4f}, Train Error = {train_err:.4f}")

In [None]:
# Visualize results vs PAC bound
train_sizes = [r[0] for r in results]
mean_errors = [r[1] for r in results]
std_errors = [r[2] for r in results]
train_errors = [r[3] for r in results]

plt.figure(figsize=(12, 6))

# Plot test error with error bars
plt.errorbar(train_sizes, mean_errors, yerr=std_errors, fmt='bo-', 
            linewidth=2, markersize=8, capsize=5, label='Test Error (empirical)')
plt.plot(train_sizes, train_errors, 'g^-', linewidth=2, markersize=8, label='Train Error')

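# Labels are noisy, so even the best linear classifier has nonzero error.
# Estimate that noise floor with the true boundary on a large fresh sample;
# the PAC guarantee concerns the excess error above this floor.
X_ref, y_ref, w_ref = generate_classification_data(20000, n_features, noise=0.1, seed=12345)
best_error = np.mean((X_ref @ w_ref > 0).astype(int) != y_ref)
target = best_error + 0.05

# Mark target: best-in-class error + ε
plt.axhline(y=target, color='red', linestyle='--', linewidth=2,
            label=f'Target = best-in-class ({best_error:.2f}) + ε (0.05)')

# Mark PAC bound (realizable-case form; the agnostic bound would be larger still)
plt.axvline(x=pac_bound, color='purple', linestyle=':', linewidth=2, label=f'PAC bound = {pac_bound:,}')

# Mark where we empirically achieve the target
achieved_at = None
for n, err in zip(train_sizes, mean_errors):
    if err <= target:
        achieved_at = n
        plt.axvline(x=n, color='green', linestyle='-.', linewidth=1, alpha=0.5)
        plt.annotate(f'Achieved at {n}', xy=(n, target + 0.01), fontsize=10, color='green')
        break

plt.xscale('log')
plt.xlabel('Training Set Size', fontsize=12)
plt.ylabel('Error Rate', fontsize=12)
plt.title(f'Empirical Learning Curve vs PAC Bound\n(d={n_features}, VC={vc_dim})', 
          fontsize=14, fontweight='bold')
plt.legend(fontsize=11, loc='upper right')
plt.grid(True, alpha=0.3)
plt.ylim(0, 0.5)
plt.tight_layout()
plt.show()

if achieved_at is not None:
    print(f"\nObservation: excess error ≤ 0.05 is reached at n={achieved_at:,}, "
          f"versus a PAC bound of {pac_bound:,} samples!")
else:
    print("\nObservation: the target was not reached at the training sizes tested.")
print("PAC bounds are worst-case guarantees - real data is often easier.")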

---

## Part 6: Why Are PAC Bounds Loose?

PAC bounds are often pessimistic by 1-2 orders of magnitude. Here's why:

1. **Worst-case over all distributions**: PAC bounds work for ANY data distribution. Real data has structure!
2. **Uniform convergence**: The bounds must hold simultaneously for every hypothesis in the class, even though training only ever returns one
3. **No distribution assumptions**: Making assumptions (e.g., margin, smoothness) gives tighter bounds
4. **Constant factors**: The constants in theoretical bounds aren't optimized

In [None]:
def compare_bounds_vs_empirical(n_features_list: List[int]) -> dict:
    """
    Compare PAC bounds to empirical sample requirements.
    
    For each feature dimension, find the smallest tested training size
    whose mean test error is within 5% of the best-in-class error.
    (Labels are noisy, so 5% absolute error is unreachable.)
    """
    results = {}
    
    for n_features in n_features_list:
        print(f"\nTesting d = {n_features}...")
        vc_dim = n_features + 1
        pac_bound = pac_sample_complexity_vc(vc_dim, 0.05, 0.05)
        
        # Noise floor: error of the true boundary on a large fresh sample
        X_ref, y_ref, w_ref = generate_classification_data(20000, n_features, noise=0.1, seed=12345)
        best_error = np.mean((X_ref @ w_ref > 0).astype(int) != y_ref)
        
        # Linear scan over increasing sizes; stop at the first that meets the target
        empirical_min = float('inf')  # Stays inf if the target is never met
        for n_train in [50, 100, 200, 500, 1000, 2000]:
            mean_err, _, _ = run_learning_experiment(n_train, n_features=n_features, n_trials=20)
            if mean_err - best_error <= 0.05:
                empirical_min = n_train
                break
        
        results[n_features] = {
            'vc_dim': vc_dim,
            'pac_bound': pac_bound,
            'empirical': empirical_min,
            'ratio': pac_bound / empirical_min if empirical_min < float('inf') else float('inf')
        }
    
    return results


# Run comparison
print("Comparing PAC bounds to empirical requirements:")
comparison = compare_bounds_vs_empirical([10, 20, 50])

print("\n" + "=" * 70)
print(f"{'Features':>10} | {'VC Dim':>8} | {'PAC Bound':>12} | {'Empirical':>10} | {'Ratio':>8}")
print("-" * 70)

for d, res in comparison.items():
    emp_str = f"{res['empirical']:,}" if res['empirical'] < float('inf') else ">2000"
    ratio_str = f"{res['ratio']:.0f}x" if res['ratio'] < float('inf') else "N/A"
    print(f"{d:>10} | {res['vc_dim']:>8} | {res['pac_bound']:>12,} | {emp_str:>10} | {ratio_str:>8}")

print("\nThe Ratio column shows how many times larger the PAC bound is than the")
print("smallest tested training size that met the target in these runs.")

---

## Part 7: Practical Guidelines

Given that PAC bounds are loose, here are practical rules for data requirements:

In [None]:
def practical_sample_estimate(n_parameters: int, 
                             task_difficulty: str = 'medium',
                             data_quality: str = 'clean') -> Tuple[int, int]:
    """
    Practical sample size estimation based on industry experience.
    
    Args:
        n_parameters: Number of model parameters
        task_difficulty: 'easy', 'medium', 'hard'
        data_quality: 'clean', 'noisy', 'very_noisy'
        
    Returns:
        (minimum_samples, recommended_samples)
    """
    # Base multiplier
    base = 10  # 10x parameters is common rule of thumb
    
    # Adjust for difficulty
    difficulty_mult = {'easy': 1.0, 'medium': 2.0, 'hard': 5.0}
    
    # Adjust for noise
    noise_mult = {'clean': 1.0, 'noisy': 2.0, 'very_noisy': 5.0}
    
    min_samples = int(n_parameters * base * difficulty_mult[task_difficulty])
    rec_samples = int(min_samples * noise_mult[data_quality])
    
    return min_samples, rec_samples


# Example calculations
print("Practical Sample Size Guidelines")
print("=" * 60)

scenarios = [
    ("Spam filter (linear, 1000 features)", 1000, 'easy', 'clean'),
    ("Sentiment analysis (BERT fine-tune, 110M params)", 110_000_000, 'medium', 'noisy'),
    ("Medical diagnosis (complex, 500 features)", 500, 'hard', 'clean'),
    ("Self-driving (huge, 50M params)", 50_000_000, 'hard', 'noisy'),
]

for name, params, diff, quality in scenarios:
    min_s, rec_s = practical_sample_estimate(params, diff, quality)
    print(f"\n{name}:")
    print(f"  Parameters: {params:,}")
    print(f"  Minimum samples: {min_s:,}")
    print(f"  Recommended: {rec_s:,}")

---

## Part 8: The Deep Learning Paradox

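Modern neural networks have billions of parameters but generalize well with "only" millions of examples. At these sizes classical PAC bounds become vacuous (they cannot rule out large error), yet the models generalize anyway. Nothing is strictly violated, because the bounds are sufficient conditions rather than necessary ones, but they clearly fail to explain what's going on.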

### Why Deep Learning "Breaks" Classical Theory

1. **Implicit Regularization**: SGD finds "flat" minima that generalize better
2. **Over-parameterization**: More parameters can actually help (double descent)
3. **Data Structure**: Real images/text have rich structure not captured by worst-case bounds
4. **Pre-training**: Transfer learning leverages massive pre-training data

In [None]:
# The Double Descent Phenomenon
print("The Double Descent Curve")
print("=" * 60)

# Create illustrative plot
fig, ax = plt.subplots(figsize=(12, 6))

# Model complexity (number of parameters)
complexity = np.linspace(0.1, 3, 1000)

# Classical U-curve
def classical_curve(x):
    return 0.5 * (x - 1)**2 + 0.1

# Double descent (with interpolation threshold at x=1)
# Schematic only: a decaying baseline plus a smooth bump at the threshold,
# so the curve is continuous instead of jumping at region boundaries
def double_descent(x):
    # Overall downward trend as capacity grows
    baseline = 0.3 * np.exp(-(x - 0.1)) + 0.05
    # Peak at interpolation threshold
    peak = 0.4 * np.exp(-((x - 1.0) / 0.12) ** 2)
    return baseline + peak

y_classical = [classical_curve(x) for x in complexity]
y_double = [double_descent(x) for x in complexity]

ax.plot(complexity, y_classical, 'b--', linewidth=2, label='Classical theory (bias-variance)')
ax.plot(complexity, y_double, 'r-', linewidth=3, label='Modern deep learning')

# Mark regions
ax.axvline(x=1.0, color='gray', linestyle=':', linewidth=1)
ax.annotate('Interpolation\nThreshold', xy=(1.0, 0.6), fontsize=10, ha='center')

ax.fill_between([0.1, 0.7], 0, 1, alpha=0.1, color='blue', label='Underfitting')
ax.fill_between([0.7, 1.3], 0, 1, alpha=0.1, color='orange', label='Classical overfitting')
ax.fill_between([1.3, 3], 0, 1, alpha=0.1, color='green', label='Over-parameterized (good!)')

ax.set_xlabel('Model Complexity (Parameters / Data)', fontsize=12)
ax.set_ylabel('Test Error', fontsize=12)
ax.set_title('Double Descent: Why Over-Parameterization Works\n(Counter-intuitive but true!)', 
            fontsize=14, fontweight='bold')
ax.legend(loc='upper right', fontsize=10)
ax.set_ylim(0, 0.8)
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nKey insight: Modern neural networks operate in the 'over-parameterized' regime")
print("where MORE parameters can lead to LOWER test error - the opposite of classical wisdom!")
print("\nThis is an active research area. Classical PAC bounds don't capture this phenomenon.")

---

## Try It Yourself

### Exercise 1: Sample Complexity for Your Task

You're building a model to predict customer churn with 50 features. Calculate:
1. The PAC bound for 95% accuracy with 99% confidence
2. A practical estimate using rule of thumb

<details>
<summary>Hint</summary>
For linear classifier: VC = 51. Then use pac_sample_complexity_vc with ε=0.05, δ=0.01.
</details>

In [None]:
# Exercise 1: Your code here
n_features = 50
epsilon = 0.05  # 95% accuracy = 5% error
delta = 0.01    # 99% confidence

# Calculate VC dimension
vc_dim = None  # Your answer

# Calculate PAC bound
pac_bound = None  # Your answer

# Practical estimate (10x parameters)
practical = None  # Your answer

# Print results
# print(f"VC dimension: {vc_dim}")
# print(f"PAC bound: {pac_bound:,}")
# print(f"Practical estimate: {practical:,}")

### Exercise 2: Empirical Verification on MNIST

Load a subset of MNIST and verify how test accuracy improves with training set size. Compare to PAC predictions.

<details>
<summary>Hint</summary>
MNIST has 784 features (28x28 pixels). Use sklearn.datasets.fetch_openml to load it.
</details>

In [None]:
# Exercise 2: Your code here
# from sklearn.datasets import fetch_openml

# Load MNIST
# mnist = fetch_openml('mnist_784', version=1, as_frame=False)
# X, y = mnist.data, mnist.target.astype(int)

# Test with various training sizes and plot learning curve
# Compare to PAC bound for d=784

---

## Common Mistakes

### Mistake 1: Taking PAC Bounds Literally

```python
# WRONG:
# "PAC says I need 100,000 samples, so I must wait to collect them all"

# RIGHT:
# PAC bounds are worst-case. Start training with what you have,
# use validation curves to know if you need more data.
```
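
One way to act on this advice is scikit-learn's `learning_curve`, which retrains a model on growing subsets and reports cross-validated scores. The sketch below uses a synthetic dataset from `make_classification` as a stand-in for real data; the dataset shape and the choice of `LogisticRegression` are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in for "the data you already have"
X, y = make_classification(n_samples=2000, n_features=20, n_informative=10,
                           random_state=0)

# Retrain on 10%..100% of each training fold, scoring with 5-fold cross-validation
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5,
    shuffle=True, random_state=0,
)

# Scores are accuracies; convert to mean validation error per training size
val_error = 1 - val_scores.mean(axis=1)
for n, err in zip(sizes, val_error):
    print(f"n_train={n:5d}: validation error = {err:.3f}")
```

If the validation error is still falling at the largest size, more data is likely to help. If it has flattened, extra data will buy little and a different model or better features may matter more.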

### Mistake 2: Ignoring Distribution Shift

```python
# WRONG:
# "I achieved 99% accuracy on test set, so PAC guarantees I'll do well in production"

# RIGHT:
# PAC assumes test data comes from same distribution as training.
# Real production data often drifts! Monitor continuously.
```

### Mistake 3: Using the Wrong VC Dimension

```python
# WRONG:
# For neural net with 1M parameters: VC = 1M

# RIGHT:
# Neural net VC bounds are complex and often loose.
# Effective VC is reduced by regularization, architecture, and training dynamics.
# Use practical guidelines instead.
```

---

## Checkpoint

You've learned:
- **PAC Framework**: Probably Approximately Correct learning with (ε, δ) guarantees
- **Sample Complexity**: How many samples needed for given accuracy/confidence
- **VC-based Bounds**: m ∝ (1/ε) × (VC × log(1/ε) + log(1/δ))
- **Bounds are Loose**: Typically 10-100x more conservative than practice
- **Deep Learning Paradox**: Over-parameterization helps (double descent)
- **Practical Guidelines**: 10-30x parameters as rule of thumb

---

## Challenge (Optional)

### PAC-Bayes Bounds

PAC-Bayes provides tighter bounds for neural networks by incorporating prior beliefs. Research and implement a simple PAC-Bayes bound calculation.

The PAC-Bayes bound is:
$$L(Q) \leq \hat{L}(Q) + \sqrt{\frac{KL(Q||P) + \log(2n/\delta)}{2n}}$$

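where Q is the posterior over hypotheses, P is the prior (chosen before seeing the data), $\hat{L}(Q)$ is the empirical loss, and n is the sample size. The bound holds with probability at least $1 - \delta$ over the draw of the sample.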
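
As a starting point, the right-hand side is easy to evaluate once a KL value is available. Computing $KL(Q||P)$ for an actual model is the part left to the challenge. The function name and the example numbers below are illustrative assumptions.

```python
import math


def pac_bayes_rhs(empirical_loss: float, kl: float, n: int, delta: float) -> float:
    """Evaluate L_hat(Q) + sqrt((KL(Q||P) + log(2n/delta)) / (2n))."""
    complexity = (kl + math.log(2 * n / delta)) / (2 * n)
    return empirical_loss + math.sqrt(complexity)


# Hypothetical inputs: 5% empirical loss, KL = 50 nats, 10,000 samples
pb_bound = pac_bayes_rhs(empirical_loss=0.05, kl=50.0, n=10_000, delta=0.05)
print(f"PAC-Bayes bound on true loss: {pb_bound:.4f}")
```

Note that the parameter count never appears: only the KL divergence between posterior and prior does, which is why this style of bound can stay non-vacuous for large networks.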

In [None]:
# Challenge: Implement PAC-Bayes bound here

---

## Further Reading

- [Understanding Machine Learning: From Theory to Algorithms](https://www.cs.huji.ac.il/~shais/UnderstandingMachineLearning/) - Chapters 2-6
- [Foundations of Machine Learning (Mohri et al.)](https://cs.nyu.edu/~mohri/mlbook/) - Rigorous treatment
- [Deep Double Descent (Nakkiran et al., 2019)](https://arxiv.org/abs/1912.02292) - Modern phenomena
- [PAC-Bayes Tutorial](https://www.cs.cmu.edu/~dpic/ml2013/pac-bayes-tut.pdf) - Tighter bounds

---

## Cleanup

In [None]:
# Clear any large variables
import gc

# Close all matplotlib figures
plt.close('all')

# Garbage collection
gc.collect()

print("Cleanup complete!")
print("\nCongratulations! You've completed Module A: Statistical Learning Theory!")
print("\nYou now understand:")
print("  - VC Dimension (Lab A.1)")
print("  - Bias-Variance Tradeoff (Lab A.2)")
print("  - PAC Learning Bounds (Lab A.3)")
print("\nThese theoretical foundations will help you make better")
print("architectural decisions and debug model issues systematically.")