# Lab A.1: VC Dimension Exploration

**Module:** A - Statistical Learning Theory  
**Time:** 1.5 hours  
**Difficulty:** ⭐⭐⭐⭐ (Advanced)

---

## Learning Objectives

By the end of this notebook, you will:
- [ ] Understand what it means to "shatter" a dataset
- [ ] Compute VC dimension for linear classifiers
- [ ] Visualize why 4 points cannot be shattered by lines in 2D
- [ ] Explore VC dimension for polynomial classifiers
- [ ] Connect VC dimension to generalization guarantees

---

## Prerequisites

- Completed: Module 1.4 (Math Foundations)
- Completed: Module 1.5 (Neural Networks basics)
- Knowledge of: Linear algebra, basic classification concepts

---

## Real-World Context

Imagine you're building a spam classifier. You have 1,000 training emails. How confident can you be that your model will work on the millions of emails it will see in production? This isn't just a philosophical question - it has a mathematical answer.

**VC dimension** is one of the most important theoretical tools for answering: *"Will my model generalize?"* It ties a model's capacity to the amount of training data needed before a generalization guarantee becomes meaningful.

---

## ELI5: What is VC Dimension?

> **Imagine you're playing a game with crayons...** 
>
> You put some dots on paper. Your friend colors each dot either red or blue, in any pattern they want. Your job is to draw ONE LINE that separates all red dots from all blue dots.
>
> With **2 dots**, you can always win! No matter how your friend colors them, you can draw a line between them.
>
> With **3 dots** (in a triangle), you can still always win! Even if one dot is a different color than the other two, you can find a line that works.
>
> But with **4 dots** in a square? Your friend can beat you! If they color opposite corners the same (like a checkerboard), NO straight line can separate them.
>
> **The VC dimension is the BIGGEST number of dots where you can ALWAYS win, no matter how your friend colors them.**
>
> For straight lines in 2D: VC dimension = 3
>
> **In AI terms:** A model with higher VC dimension can fit more complex patterns, but needs more training data to generalize reliably. It's the model's "expressiveness" measured mathematically!

---

## Part 1: Setting Up Our Environment

In [None]:
# Core imports
import numpy as np
import matplotlib.pyplot as plt
from itertools import product
from typing import List, Tuple, Optional
import warnings
warnings.filterwarnings('ignore')

# For linear separability checking
from sklearn.svm import SVC
from sklearn.linear_model import Perceptron

# Set nice plotting defaults
plt.style.use('default')
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 12

# Seed for reproducibility
np.random.seed(42)

print("Environment ready for VC dimension exploration!")
print(f"NumPy version: {np.__version__}")

---

## Part 2: Understanding Shattering

### What Does "Shattering" Mean?

A hypothesis class $\mathcal{H}$ **shatters** a set of points if it can perfectly classify those points under **every possible labeling**.

For $n$ points, there are $2^n$ possible binary labelings. If we can find a hypothesis in $\mathcal{H}$ that correctly classifies each of those $2^n$ labelings, we say $\mathcal{H}$ shatters those points.
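As a minimal sketch of this definition, the snippet below enumerates the labelings that a small, finite set of hypotheses can realize and compares them with all $2^n$ possibilities. The 1D threshold classifiers and the helper names (`achievable_labelings`, `is_shattered_by`) are illustrative assumptions, not part of this lab's code.

```python
from itertools import product

def achievable_labelings(points, hypotheses):
    """Set of label tuples that at least one hypothesis produces on the points."""
    return {tuple(h(x) for x in points) for h in hypotheses}

def is_shattered_by(points, hypotheses):
    """True if every one of the 2^n labelings is realized by some hypothesis."""
    all_labelings = set(product([0, 1], repeat=len(points)))
    return achievable_labelings(points, hypotheses) == all_labelings

# Toy hypothesis class (assumption): 1D thresholds h_t(x) = 1 if x >= t else 0
thresholds = [lambda x, t=t: int(x >= t) for t in (-1.0, 0.0, 1.0, 2.0, 3.0)]

print(is_shattered_by([0.5], thresholds))       # True
print(is_shattered_by([0.5, 1.5], thresholds))  # False: (1, 0) is never produced
```

A single point is shattered because both labels are reachable, while two points are not: no threshold of this form can label the left point 1 and the right point 0.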

In [None]:
def can_linearly_separate(points: np.ndarray, labels: np.ndarray) -> bool:
    """
    Check if a set of 2D points with given labels can be linearly separated.
    
    We use a hard-margin SVM (very high C) to check if perfect separation exists.
    
    Args:
        points: Array of shape (n, 2) with point coordinates
        labels: Array of shape (n,) with binary labels (0 or 1)
        
    Returns:
        True if linearly separable, False otherwise
    """
    # Edge case: all same label is trivially separable
    if len(np.unique(labels)) == 1:
        return True
    
    # Use hard-margin SVM (very high C = no slack allowed)
    clf = SVC(kernel='linear', C=1e10, max_iter=10000)
    
    try:
        clf.fit(points, labels)
        predictions = clf.predict(points)
        return np.all(predictions == labels)
    except Exception:
        return False


def check_if_shattered(points: np.ndarray) -> Tuple[bool, List[Tuple]]:
    """
    Check if a linear classifier can shatter the given points.
    
    Args:
        points: Array of shape (n, 2) with point coordinates
        
    Returns:
        (is_shattered, list of failed labelings)
    """
    n = len(points)
    all_labelings = list(product([0, 1], repeat=n))
    
    failed_labelings = []
    
    for labeling in all_labelings:
        labels = np.array(labeling)
        if not can_linearly_separate(points, labels):
            failed_labelings.append(labeling)
    
    is_shattered = len(failed_labelings) == 0
    return is_shattered, failed_labelings


print("Shattering functions defined!")

### Let's Shatter 2 Points

With 2 points, we have $2^2 = 4$ possible labelings. Can a line handle all of them?

In [None]:
# Two points
points_2 = np.array([[0, 0], [2, 2]])

# Check all 4 labelings
is_shattered, failed = check_if_shattered(points_2)

print(f"Can 2 points be shattered by a line? {is_shattered}")
print(f"Number of failed labelings: {len(failed)}")

# Visualize all 4 labelings
fig, axes = plt.subplots(1, 4, figsize=(16, 4))
labelings = list(product([0, 1], repeat=2))

for ax, labeling in zip(axes, labelings):
    colors = ['blue' if l == 0 else 'red' for l in labeling]
    ax.scatter(points_2[:, 0], points_2[:, 1], c=colors, s=200, edgecolors='black', linewidths=2)
    ax.set_title(f"Labels: {labeling}", fontsize=12)
    ax.set_xlim(-1, 3)
    ax.set_ylim(-1, 3)
    ax.grid(True, alpha=0.3)
    ax.set_aspect('equal')

plt.suptitle("All 4 Labelings of 2 Points - All Linearly Separable!", fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

### What Just Happened?

For 2 points, we can always draw a line that separates them, regardless of their colors! This means 2 points **can be shattered** by linear classifiers.
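One way to see why two distinct points are always separable is that the perpendicular bisector of the segment joining them works as a separator. The sketch below builds that line explicitly; `bisector_separator` and `score` are hypothetical helpers introduced only for illustration.

```python
def bisector_separator(p, q):
    """Return (w, b) such that w.x + b < 0 at p and > 0 at q (requires p != q)."""
    w = (q[0] - p[0], q[1] - p[1])                # normal vector pointing from p to q
    mid = ((p[0] + q[0]) / 2, (p[1] + q[1]) / 2)  # the line passes through the midpoint
    b = -(w[0] * mid[0] + w[1] * mid[1])
    return w, b

def score(w, b, x):
    """Signed value of the linear function at point x."""
    return w[0] * x[0] + w[1] * x[1] + b

p, q = (0, 0), (2, 2)
w, b = bisector_separator(p, q)
print(score(w, b, p), score(w, b, q))  # -4.0 4.0
```

Negating `w` and `b` gives the opposite labeling, and the two constant labelings are handled by any line that leaves both points on the same side, which covers all 4 cases.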

---

## Part 3: Shattering 3 Points (The Magic Number in 2D)

Now let's try 3 points. We have $2^3 = 8$ possible labelings.

In [None]:
# Three points in "general position" (not collinear)
points_3 = np.array([
    [0, 0],
    [2, 0],
    [1, 1.5]
])

# Check shattering
is_shattered, failed = check_if_shattered(points_3)

print(f"Can 3 points be shattered by a line? {is_shattered}")
print(f"Number of failed labelings: {len(failed)}")

In [None]:
def visualize_all_labelings_with_separator(points: np.ndarray, title: str = ""):
    """
    Visualize all possible labelings and show separating lines where possible.
    """
    n = len(points)
    labelings = list(product([0, 1], repeat=n))
    n_labelings = len(labelings)
    
    # Determine grid size
    cols = 4
    rows = (n_labelings + cols - 1) // cols
    
    fig, axes = plt.subplots(rows, cols, figsize=(4*cols, 4*rows))
    axes = axes.flatten() if hasattr(axes, 'flatten') else [axes]
    
    for idx, (ax, labeling) in enumerate(zip(axes, labelings)):
        labels = np.array(labeling)
        colors = ['blue' if l == 0 else 'red' for l in labels]
        
        # Plot points
        ax.scatter(points[:, 0], points[:, 1], c=colors, s=200, edgecolors='black', linewidths=2, zorder=5)
        
        # Try to find and plot separator
        separable = can_linearly_separate(points, labels)
        
        if separable and len(np.unique(labels)) > 1:
            # Fit SVM to get the decision boundary
            clf = SVC(kernel='linear', C=1e10)
            clf.fit(points, labels)
            
            # Create mesh for decision boundary
            x_min, x_max = points[:, 0].min() - 1, points[:, 0].max() + 1
            y_min, y_max = points[:, 1].min() - 1, points[:, 1].max() + 1
            
            xx = np.linspace(x_min, x_max, 100)
            
            # Decision boundary: w0*x + w1*y + b = 0  =>  y = -(w0*x + b)/w1
            w = clf.coef_[0]
            b = clf.intercept_[0]
            
            if abs(w[1]) > 1e-10:  # Not vertical
                yy = -(w[0] * xx + b) / w[1]
                mask = (yy >= y_min) & (yy <= y_max)
                ax.plot(xx[mask], yy[mask], 'g-', linewidth=2, label='Separator')
            else:  # Vertical line
                x_line = -b / w[0]
                ax.axvline(x=x_line, color='g', linewidth=2)
            
            ax.set_title(f"{labeling}\nSeparable", fontsize=10, color='green')
        else:
            status = "Trivial" if len(np.unique(labels)) == 1 else "NOT Separable"
            title_color = 'gray' if len(np.unique(labels)) == 1 else 'red'
            ax.set_title(f"{labeling}\n{status}", fontsize=10, color=title_color)
        
        # Set axis limits
        margin = 0.5
        ax.set_xlim(points[:, 0].min() - margin, points[:, 0].max() + margin)
        ax.set_ylim(points[:, 1].min() - margin, points[:, 1].max() + margin)
        ax.grid(True, alpha=0.3)
        ax.set_aspect('equal')
    
    # Hide unused subplots
    for ax in axes[n_labelings:]:
        ax.set_visible(False)
    
    if title:
        plt.suptitle(title, fontsize=14, fontweight='bold', y=1.02)
    
    plt.tight_layout()
    plt.show()


# Visualize all 8 labelings of 3 points
visualize_all_labelings_with_separator(points_3, "All 8 Labelings of 3 Points - All Separable!")

### The Key Insight

3 points in **general position** (not all on a line) CAN be shattered by linear classifiers in 2D. This means:

$$\text{VC}(\text{linear classifiers in 2D}) \geq 3$$
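The "general position" requirement matters. As a short derivation (the notation $f(x) = w^\top x + b$ for a linear classifier is introduced here for illustration), suppose three points are collinear with $x_2$ between $x_1$ and $x_3$, so that $x_2 = \lambda x_1 + (1 - \lambda) x_3$ for some $\lambda \in (0, 1)$. Because $f$ is affine:

$$f(x_2) = \lambda f(x_1) + (1 - \lambda) f(x_3)$$

If $f(x_1) > 0$ and $f(x_3) > 0$, then $f(x_2) > 0$ as well, so the labeling that puts the middle point in a different class from the two outer points cannot be realized. A collinear triple is never shattered; the lower bound only needs *some* triple that is.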

But can we shatter 4 points?

---

## Part 4: The XOR Problem - Why 4 Points Can't Be Shattered

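This section demonstrates the classic counterexample behind the result $\text{VC}(\text{linear classifiers in 2D}) = 3$. Establishing the equality requires that *no* set of 4 points in the plane can be shattered; the square with an XOR labeling is the standard illustration of why.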

In [None]:
# Four points in a square
points_4 = np.array([
    [0, 0],
    [1, 0],
    [0, 1],
    [1, 1]
])

# Check shattering
is_shattered, failed = check_if_shattered(points_4)

print(f"Can 4 points be shattered by a line? {is_shattered}")
print(f"Number of failed labelings: {len(failed)}")
print(f"\nFailed labelings (XOR-like patterns):")
for f in failed:
    print(f"  {f}")

In [None]:
# Let's visualize specifically the XOR problem
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# XOR pattern 1: labels (0, 1, 1, 0) - one diagonal shares a class
xor_pattern_1 = np.array([0, 1, 1, 0])  # Bottom-left (0,0) and top-right (1,1) are class 0
colors_1 = ['blue' if l == 0 else 'red' for l in xor_pattern_1]

axes[0].scatter(points_4[:, 0], points_4[:, 1], c=colors_1, s=400, edgecolors='black', linewidths=3)
for i, (x, y) in enumerate(points_4):
    axes[0].annotate(f'({x},{y})', (x, y), textcoords="offset points", xytext=(0,-25), ha='center', fontsize=10)
axes[0].set_title("XOR Pattern 1\nNo line can separate!", fontsize=14, color='red', fontweight='bold')
axes[0].set_xlim(-0.5, 1.5)
axes[0].set_ylim(-0.5, 1.5)
axes[0].grid(True, alpha=0.3)
axes[0].set_aspect('equal')

# XOR pattern 2: labels (1, 0, 0, 1) - the other diagonal
xor_pattern_2 = np.array([1, 0, 0, 1])  # Bottom-right (1,0) and top-left (0,1) are class 0
colors_2 = ['blue' if l == 0 else 'red' for l in xor_pattern_2]

axes[1].scatter(points_4[:, 0], points_4[:, 1], c=colors_2, s=400, edgecolors='black', linewidths=3)
for i, (x, y) in enumerate(points_4):
    axes[1].annotate(f'({x},{y})', (x, y), textcoords="offset points", xytext=(0,-25), ha='center', fontsize=10)
axes[1].set_title("XOR Pattern 2\nNo line can separate!", fontsize=14, color='red', fontweight='bold')
axes[1].set_xlim(-0.5, 1.5)
axes[1].set_ylim(-0.5, 1.5)
axes[1].grid(True, alpha=0.3)
axes[1].set_aspect('equal')

plt.suptitle("The XOR Problem: Why VC Dimension of Linear Classifiers in 2D = 3", fontsize=16, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

print("\nNo matter how you tilt or position a line, you CANNOT separate opposite corners!")
print("So this square cannot be shattered. The same is true of every 4-point set in 2D, which is why VC = 3.")

### What Just Happened?

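The XOR pattern (opposite corners sharing a label) is the classic example of a problem that is not linearly separable. Because at least one labeling of the square cannot be realized by any line, this set of 4 points **cannot be shattered**.

The same holds for every set of 4 points in the plane. Either one point lies in the convex hull of the other three, and then labeling it differently from the rest is impossible; or the points form a convex quadrilateral, and then giving the two diagonals opposite labels (the XOR pattern) is impossible.

Therefore:
- Some set of 3 points CAN be shattered
- No set of 4 points can be shattered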

$$\boxed{\text{VC}(\text{linear classifiers in 2D}) = 3}$$
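The XOR failure on the unit square can also be checked by hand. As a short derivation (the notation $f(x, y) = w_1 x + w_2 y + b$, with class 1 meaning $f > 0$ and class 0 meaning $f < 0$, is an assumption introduced here), the labeling $(0, 1, 1, 0)$ of the points $(0,0), (1,0), (0,1), (1,1)$ requires:

$$\begin{aligned} f(0,0) &= b < 0, & f(1,1) &= w_1 + w_2 + b < 0, \\ f(1,0) &= w_1 + b > 0, & f(0,1) &= w_2 + b > 0. \end{aligned}$$

Adding the two inequalities in the first row gives $w_1 + w_2 + 2b < 0$, while adding the two in the second row gives $w_1 + w_2 + 2b > 0$. No $(w_1, w_2, b)$ satisfies both, so no line realizes this labeling. The mirrored labeling $(1, 0, 0, 1)$ fails by the same argument with every sign flipped.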

---

## Part 5: The General Formula - VC Dimension in d Dimensions

### The Beautiful Result

For linear classifiers (hyperplanes) in $d$-dimensional space:

$$\text{VC}(\text{linear classifiers in } \mathbb{R}^d) = d + 1$$

This makes intuitive sense:
- In 2D: VC = 3 (the case explored above)
- In 3D: VC = 4 (a plane can shatter 4 points in general position)
- In 100D: VC = 101
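The lower-bound half of this formula ($\text{VC} \geq d + 1$) can be sketched constructively. The snippet below assumes one particular point set, the origin plus the $d$ standard basis vectors, and writes down a hyperplane for every $\pm 1$ labeling in closed form. The helper names are hypothetical, and the sketch does not check the upper bound (that no $d + 2$ points can be shattered).

```python
from itertools import product

def simplex_points(d):
    """Origin plus the d standard basis vectors: d + 1 points in R^d."""
    origin = tuple(0.0 for _ in range(d))
    basis = [tuple(1.0 if j == i else 0.0 for j in range(d)) for i in range(d)]
    return [origin] + basis

def hyperplane_for(labels):
    """Closed-form (w, b) whose value at each point equals its +/-1 label."""
    b = labels[0]                     # value at the origin is just b
    w = [y - b for y in labels[1:]]   # value at e_i is w_i + b = y_i
    return w, b

def shatters_simplex(d):
    """Check all 2^(d+1) labelings of the simplex points."""
    points = simplex_points(d)
    for labels in product([-1, 1], repeat=d + 1):
        w, b = hyperplane_for(labels)
        for x, y in zip(points, labels):
            value = sum(wi * xi for wi, xi in zip(w, x)) + b
            if value * y <= 0:        # wrong side of (or on) the hyperplane
                return False
    return True

print(all(shatters_simplex(d) for d in range(1, 8)))  # True
```

The construction works in any dimension because the $d + 1$ points are affinely independent, which is the precise meaning of "general position" here.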

In [None]:
def vc_dimension_linear_classifier(d: int) -> int:
    """
    VC dimension of linear classifiers in d-dimensional space.
    
    Args:
        d: Dimensionality of the input space
        
    Returns:
        VC dimension (d + 1)
    """
    return d + 1


# Tabulate the formula for a range of dimensions
print("VC Dimension of Linear Classifiers:")
print("=" * 40)
for d in [1, 2, 3, 10, 100, 1000]:
    vc = vc_dimension_linear_classifier(d)
    print(f"  {d:4d}D space: VC = {vc:5d}")

print("\nImplication: A linear model with 1000 features can 'memorize'")
print("any labeling of 1001 training points in general position perfectly!")

---

## Part 6: Polynomial Classifiers - Increasing VC Dimension

What if we use polynomial decision boundaries instead of lines? The VC dimension increases!

In [None]:
from sklearn.preprocessing import PolynomialFeatures

# ============================================================
# Understanding PolynomialFeatures
# ============================================================
# PolynomialFeatures transforms input features into polynomial terms.
# For example, with degree=2 and features [x1, x2]:
#   Input:  [x1, x2]
#   Output: [1, x1, x2, x1², x1*x2, x2²]
#
# This allows linear models to learn non-linear decision boundaries!
# The "kernel trick" in SVMs is related - it implicitly does this.
#
# Key parameters:
#   - degree: Maximum polynomial degree (e.g., 2 for quadratic)
#   - include_bias: Whether to include a column of 1s (intercept term)
#
# Example:
#   poly = PolynomialFeatures(degree=2)
#   X_poly = poly.fit_transform(X)  # Transform features
# ============================================================

def can_polynomial_separate(points: np.ndarray, labels: np.ndarray, degree: int) -> bool:
    """
    Check if points can be separated using a polynomial decision boundary.
    
    We achieve this by mapping to polynomial feature space and using linear SVM.
    """
    if len(np.unique(labels)) == 1:
        return True
    
    # Transform to polynomial features
    poly = PolynomialFeatures(degree=degree)
    points_poly = poly.fit_transform(points)
    
    # Linear SVM in polynomial space = polynomial SVM in original space
    clf = SVC(kernel='linear', C=1e10, max_iter=50000)
    
    try:
        clf.fit(points_poly, labels)
        predictions = clf.predict(points_poly)
        return np.all(predictions == labels)
    except Exception:
        return False


def check_polynomial_shattering(points: np.ndarray, degree: int) -> Tuple[bool, int, int]:
    """
    Check if polynomial classifier can shatter the given points.
    
    Returns:
        (is_shattered, successful_labelings, total_labelings)
    """
    n = len(points)
    all_labelings = list(product([0, 1], repeat=n))
    
    successful = 0
    for labeling in all_labelings:
        labels = np.array(labeling)
        if can_polynomial_separate(points, labels, degree):
            successful += 1
    
    return successful == len(all_labelings), successful, len(all_labelings)


print("Polynomial shattering functions defined!")

In [None]:
# Test on our 4 points with different polynomial degrees
print("Shattering 4 points with polynomial classifiers:")
print("=" * 50)

for degree in [1, 2, 3]:
    is_shattered, success, total = check_polynomial_shattering(points_4, degree)
    status = "YES" if is_shattered else "NO"
    print(f"  Degree {degree}: {success}/{total} labelings separable - Shattered? {status}")

print("\nDegree 2 polynomials CAN shatter 4 points (they handle XOR)!")
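To see why degree 2 is enough for XOR, it helps to write one separating polynomial down. The sketch below uses a hand-picked quadratic, $f(x, y) = x + y - 2xy - 0.5$, chosen here purely for illustration; it is not the boundary the SVM finds.

```python
def quadratic_score(x, y):
    """Hand-picked degree-2 polynomial (assumption); positive means class 1."""
    return x + y - 2 * x * y - 0.5

square = [(0, 0), (1, 0), (0, 1), (1, 1)]
xor_labels = [0, 1, 1, 0]

predictions = [int(quadratic_score(x, y) > 0) for x, y in square]
print(predictions)                # [0, 1, 1, 0]
print(predictions == xor_labels)  # True
```

The only non-linear ingredient is the cross term `x * y`, which is one of the extra columns that `PolynomialFeatures(degree=2)` adds to the input.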

### NumPy Array Manipulation for Visualization

The following cell uses some common NumPy patterns for creating decision boundary plots:

```python
# np.meshgrid() - Creates coordinate grids for plotting
xx, yy = np.meshgrid(np.linspace(-0.5, 1.5, 200), np.linspace(-0.5, 1.5, 200))
# xx and yy are 2D arrays of x and y coordinates

# .ravel() - Flattens an array to 1D (same as .flatten() but may share memory)
xx.ravel()  # [x1, x2, x3, ...] all x-coordinates in a row

# np.c_[] - Column-stacks arrays (combines as columns)
np.c_[xx.ravel(), yy.ravel()]  # Creates (N, 2) array of [x, y] pairs

# .reshape() - Changes array shape
Z.reshape(xx.shape)  # Reshapes 1D predictions back to 2D grid for plotting
```

These patterns are essential for visualizing classifier decision boundaries!

In [None]:
# Visualize polynomial decision boundary solving XOR
from sklearn.svm import SVC

# XOR labels
xor_labels = np.array([0, 1, 1, 0])

# Fit polynomial SVM (degree 2)
clf_poly = SVC(kernel='poly', degree=2, C=1e10)
clf_poly.fit(points_4, xor_labels)

# Create mesh for visualization
xx, yy = np.meshgrid(np.linspace(-0.5, 1.5, 200), np.linspace(-0.5, 1.5, 200))
Z = clf_poly.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.figure(figsize=(10, 8))
plt.contourf(xx, yy, Z, alpha=0.3, cmap='RdBu')
plt.contour(xx, yy, Z, colors='green', linewidths=2)

colors = ['blue' if l == 0 else 'red' for l in xor_labels]
plt.scatter(points_4[:, 0], points_4[:, 1], c=colors, s=400, edgecolors='black', linewidths=3, zorder=5)

for i, (x, y) in enumerate(points_4):
    plt.annotate(f'Class {xor_labels[i]}', (x, y), textcoords="offset points", xytext=(0, 20), 
                ha='center', fontsize=12, fontweight='bold')

plt.title("Polynomial (Degree 2) Classifier Solving XOR!\nHigher VC dimension = More expressive", 
          fontsize=14, fontweight='bold')
plt.xlabel("x")
plt.ylabel("y")
plt.grid(True, alpha=0.3)
plt.show()

print("The quadratic decision boundary separates the XOR pattern, which no single line can do!")

### VC Dimension of Polynomial Classifiers

For polynomial classifiers of degree $k$ in $d$ dimensions:

$$\text{VC} = \binom{d + k}{k}$$

This grows quickly with both dimension and polynomial degree!
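The binomial coefficient comes from counting monomials of total degree at most $k$ in $d$ variables. As a minimal sketch, the count can be checked by brute-force enumeration; `count_monomials` is a hypothetical helper, not part of the lab's code.

```python
from itertools import combinations_with_replacement
from math import comb

def count_monomials(d, k):
    """Count monomials of total degree <= k in d variables by enumeration."""
    total = 0
    for degree in range(k + 1):
        # A monomial of this degree is a multiset of `degree` variables out of d
        total += sum(1 for _ in combinations_with_replacement(range(d), degree))
    return total

for d, k in [(2, 1), (2, 2), (3, 2), (4, 3)]:
    print(d, k, count_monomials(d, k), comb(d + k, k))
```

For $d = 2$, $k = 2$ both columns give 6, matching the six terms $1, x_1, x_2, x_1^2, x_1 x_2, x_2^2$.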

In [None]:
from math import comb

# ============================================================
# Understanding math.comb (Binomial Coefficient)
# ============================================================
# math.comb(n, k) computes "n choose k" = C(n,k) = n! / (k! * (n-k)!)
# This counts the number of ways to choose k items from n items.
#
# Examples:
#   comb(5, 2) = 10  (ways to choose 2 items from 5)
#   comb(4, 2) = 6   (ways to choose 2 items from 4)
#
# In polynomial classifiers, comb(d+k, k) gives the number of
# monomials up to degree k, which equals the VC dimension.
# ============================================================

def vc_dimension_polynomial(d: int, k: int) -> int:
    """
    VC dimension of polynomial classifiers of degree k in d dimensions.
    
    This equals the number of monomials up to degree k,
    which is C(d+k, k) = (d+k)! / (d! * k!)
    
    Args:
        d: Input dimensionality
        k: Polynomial degree
        
    Returns:
        VC dimension
    """
    return comb(d + k, k)


print("VC Dimension of Polynomial Classifiers:")
print("=" * 50)
print(f"{'Dimension':>10} | {'Degree 1':>10} | {'Degree 2':>10} | {'Degree 3':>10}")
print("-" * 50)

for d in [2, 5, 10, 50, 100]:
    vc1 = vc_dimension_polynomial(d, 1)
    vc2 = vc_dimension_polynomial(d, 2)
    vc3 = vc_dimension_polynomial(d, 3)
    print(f"{d:>10} | {vc1:>10,} | {vc2:>10,} | {vc3:>10,}")

print("\nHigher VC dimension = More expressive but needs more data!")

---

## Part 7: Neural Networks and VC Dimension

What about neural networks? This is where things get interesting!

### ELI5: Neural Network VC Dimension

> **Imagine building with LEGO blocks...** 
>
> A neural network is like having a box of LEGO bricks. The more bricks (parameters) you have, the more complex structures you can build. A network with 1 million parameters can build almost anything - it's incredibly expressive.
>
> The VC dimension of neural networks roughly grows with the number of parameters. More parameters = higher VC dimension = can fit more complex patterns.
>
> **But here's the puzzle:** Modern networks have billions of parameters (huge VC dimension) but still generalize! Classical VC bounds are far too loose to explain this, and why it happens is one of the biggest open questions in deep learning theory.

In [None]:
def estimate_nn_vc_dimension(n_weights: int, n_layers: int) -> str:
    """
    Estimate the order of magnitude of a neural network's VC dimension.
    
    Uses the O(W * L * log(W)) upper bound for piecewise-linear
    (e.g. ReLU) networks, where W = number of weights, L = number of layers.
    Constant factors are dropped, so read the result as an order of magnitude.
    
    Note: This is a worst-case upper bound; the capacity a trained
    network actually uses is often much lower.
    
    Args:
        n_weights: Total number of trainable parameters
        n_layers: Number of layers
        
    Returns:
        String describing the VC dimension bound
    """
    import math
    
    # W * L * log(W) bound from Bartlett et al., constants omitted
    upper_bound = n_weights * n_layers * math.log(n_weights)
    
    return f"O({n_weights:,} × {n_layers} × log({n_weights:,})) ≈ {upper_bound:,.0f}"


print("VC Dimension Estimates for Neural Networks:")
print("=" * 60)

# Some example architectures
architectures = [
    ("Small MLP (784→100→10)", 784*100 + 100 + 100*10 + 10, 2),
    ("ResNet-18", 11_700_000, 18),
    ("GPT-2 Small (124M)", 124_000_000, 12),
    ("LLaMA-7B", 7_000_000_000, 32),
    ("GPT-4 (unconfirmed estimate)", 1_800_000_000_000, 120),
]

for name, weights, layers in architectures:
    vc_est = estimate_nn_vc_dimension(weights, layers)
    print(f"\n{name}:")
    print(f"  Parameters: {weights:,}")
    print(f"  VC bound: {vc_est}")

print("\n" + "=" * 60)
print("Key Insight: These bounds are HUGE, yet models generalize!")
print("Modern theory uses 'effective' capacity measures instead.")
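The parameter count for the small MLP above was written out by hand. As a minimal sketch, the same count can be derived from a list of layer sizes and fed into the $W \cdot L \cdot \log W$ expression. The helper names are hypothetical, and the result drops constant factors, so it is an order of magnitude at best.

```python
import math

def count_mlp_parameters(layer_sizes):
    """Weights plus biases of a fully connected network, e.g. [784, 100, 10]."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

def wl_log_w(n_weights, n_layers):
    """W * L * ln(W) with constant factors dropped (order of magnitude only)."""
    return n_weights * n_layers * math.log(n_weights)

sizes = [784, 100, 10]
W = count_mlp_parameters(sizes)
L = len(sizes) - 1  # number of weight layers
print(W)            # 79510
print(f"{wl_log_w(W, L):,.0f}")
```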

---

## Part 8: From VC Dimension to Generalization Bounds

The whole point of VC dimension is to bound generalization error! Here's the famous result:

### The Fundamental Theorem of Statistical Learning

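With probability at least $1 - \delta$ over the random draw of the training set, the following holds simultaneously for every hypothesis in the class: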

$$\text{Test Error} \leq \text{Training Error} + O\left(\sqrt{\frac{\text{VC} \cdot \log(n/\text{VC}) + \log(1/\delta)}{n}}\right)$$

where $n$ is the number of training samples.

In [None]:
import numpy as np

def generalization_bound(vc_dim: int, n_samples: int, delta: float = 0.05) -> float:
    """
    Compute the generalization gap bound based on VC dimension.
    
    This tells us: with probability (1 - delta), the difference between
    test error and training error is at most this value.
    
    Args:
        vc_dim: VC dimension of the hypothesis class
        n_samples: Number of training samples
        delta: Confidence parameter (default 0.05 for 95% confidence)
        
    Returns:
        Upper bound on generalization gap
    """
    if n_samples <= vc_dim:
        return 1.0  # Bound is trivial
    
    # Simplified VC bound
    gap = np.sqrt((vc_dim * np.log(2 * n_samples / vc_dim) + np.log(4 / delta)) / n_samples)
    return min(gap, 1.0)  # Cap at 1.0


# Visualize how the bound changes with training set size
vc_dims = [3, 10, 100, 1000]
n_samples_range = np.logspace(1, 6, 100).astype(int)

plt.figure(figsize=(12, 6))

for vc in vc_dims:
    bounds = [generalization_bound(vc, n) for n in n_samples_range]
    plt.plot(n_samples_range, bounds, label=f'VC = {vc}', linewidth=2)

plt.xscale('log')
plt.xlabel('Number of Training Samples', fontsize=12)
plt.ylabel('Generalization Gap Bound', fontsize=12)
plt.title('Generalization Bound Decreases with More Data\n(Higher VC = Need More Data)', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.ylim(0, 1)
plt.tight_layout()
plt.show()

print("Key Insight: Higher VC dimension requires more training data for same guarantee!")

In [None]:
# Practical implications
print("Practical Generalization Bounds:")
print("=" * 60)

scenarios = [
    ("Linear classifier, 100D, 10K samples", 101, 10_000),
    ("Polynomial (deg 2), 100D, 10K samples", 5151, 10_000),
    ("Linear classifier, 100D, 100K samples", 101, 100_000),
    ("Neural net (1K params, VC ~ 10K), 10K samples", 10_000, 10_000),
    ("Neural net (1K params, VC ~ 10K), 1M samples", 10_000, 1_000_000),
]

for name, vc, n in scenarios:
    bound = generalization_bound(vc, n)
    print(f"\n{name}:")
    print(f"  Generalization gap bound: {bound:.4f} ({bound*100:.1f}%)")
    if bound < 0.05:
        print(f"  Strong generalization expected!")
    elif bound < 0.2:
        print(f"  Moderate generalization expected.")
    else:
        print(f"  Weak guarantee - need more data or simpler model.")
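The bound can also be read in the other direction: given a target gap, how many samples make the bound small enough? The sketch below assumes the same simplified formula as `generalization_bound` (re-implemented with `math` so the block runs on its own) and searches for the smallest qualifying $n$. `vc_gap_bound` and `samples_needed` are hypothetical names, and the search assumes `0 < target_gap < 1`.

```python
import math

def vc_gap_bound(vc_dim, n_samples, delta=0.05):
    """Simplified VC bound on the generalization gap, capped at 1.0."""
    if n_samples <= vc_dim:
        return 1.0
    gap = math.sqrt((vc_dim * math.log(2 * n_samples / vc_dim)
                     + math.log(4 / delta)) / n_samples)
    return min(gap, 1.0)

def samples_needed(vc_dim, target_gap, delta=0.05):
    """Smallest n with vc_gap_bound(vc_dim, n) <= target_gap."""
    hi = vc_dim + 1
    while vc_gap_bound(vc_dim, hi, delta) > target_gap:
        hi *= 2                       # grow until the bound is small enough
    lo = hi // 2                      # the bound still exceeds the target here
    while hi - lo > 1:                # binary search between lo (fails) and hi (passes)
        mid = (lo + hi) // 2
        if vc_gap_bound(vc_dim, mid, delta) <= target_gap:
            hi = mid
        else:
            lo = mid
    return hi

n = samples_needed(3, 0.10)
print(n, round(vc_gap_bound(3, n), 4))
```

Because these are worst-case bounds, the resulting $n$ is a conservative figure rather than a prediction of how much data a real project needs.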

---

## Try It Yourself

### Exercise 1: Custom Point Configuration

Create 5 points in 2D and check if they can be shattered by linear classifiers. Explain your finding.

<details>
<summary>Hint</summary>
Since VC = 3 for 2D linear classifiers, 5 points definitely cannot be shattered. The question is: how many labelings fail?
</details>

In [None]:
# Exercise 1: Your code here
# Create 5 points and check shattering

points_5 = np.array([
    # Add your 5 points here
    # [x1, y1],
    # [x2, y2],
    # ...
])

# Uncomment and run:
# is_shattered, failed = check_if_shattered(points_5)
# print(f"Shattered: {is_shattered}")
# print(f"Failed labelings: {len(failed)} out of {2**5}")

### Exercise 2: Data Requirements Calculator

You're building a spam classifier with a 500-dimensional feature space (500 word features). Using a linear classifier:

1. What is the VC dimension?
2. How many training emails do you need for a generalization gap bound of 10%?

<details>
<summary>Hint</summary>
For linear classifiers in d dimensions, VC = d + 1. Then solve for n in the generalization bound formula.
</details>

In [None]:
# Exercise 2: Your code here

d = 500  # Feature dimensionality

# Q1: What is the VC dimension?
vc_dim = None  # Your answer

# Q2: How many samples for 10% generalization gap?
# Try different values and find where bound < 0.10
target_gap = 0.10
n_samples = None  # Your answer

# Verify:
# print(f"VC dimension: {vc_dim}")
# print(f"Required samples: {n_samples}")
# print(f"Achieved bound: {generalization_bound(vc_dim, n_samples):.4f}")

---

## Common Mistakes

### Mistake 1: Confusing VC Dimension with Model Accuracy

```python
# WRONG thinking:
# "Higher VC dimension = better model"

# RIGHT thinking:
# "Higher VC dimension = more expressive, but needs more data"
# A model that's TOO expressive will overfit with limited data!
```

**Why:** VC dimension measures capacity, not quality. A hypothesis class with infinite VC dimension can shatter arbitrarily large point sets, so VC theory gives it no generalization guarantee at any sample size.

### Mistake 2: Thinking VC Bounds Are Tight

```python
# WRONG:
# "Theory says I need 1 million samples, so I must collect 1 million"

# RIGHT:
# "Theory gives worst-case bounds. In practice, I often need far less."
# Modern neural nets routinely generalize far better than classical VC bounds predict!
```

**Why:** VC bounds are worst-case guarantees. Real data has structure that models exploit. Use theory for intuition, not precise predictions.

### Mistake 3: Ignoring Implicit Regularization

```python
# WRONG:
# "My neural net has 1B parameters, VC dimension is astronomical,
#  therefore it must overfit"

# RIGHT:
# "SGD, dropout, batch norm, and early stopping act as regularizers
#  (some implicit, some explicit) that are believed to keep effective
#  capacity far below what VC bounds assume"
```

**Why:** Modern deep learning exploits many regularization mechanisms that classical VC theory doesn't account for.

---

## Checkpoint

You've learned:
- **Shattering**: A hypothesis class shatters points if it can achieve all possible labelings
- **VC Dimension**: The size of the largest set of points that can be shattered (one such set is enough)
- **Linear classifiers in 2D have VC = 3** (no 4 points can be shattered; XOR on a square is the classic failure)
- **General formula**: Linear in d dimensions has VC = d + 1
- **Generalization bound**: Decreases with more data, increases with VC dimension
- **Neural network paradox**: Huge VC dimension but still generalize!

---

## Challenge (Optional)

### Advanced: Empirical VC Dimension Estimation

Implement a function that empirically estimates the VC dimension of a given classifier by testing shattering on randomly sampled point configurations.

```python
def estimate_vc_empirically(classifier_factory, d=2, max_points=10, n_trials=100):
    """
    Empirically estimate VC dimension by testing shattering.
    
    Args:
        classifier_factory: Function that returns a fresh classifier
        d: Dimensionality of points
        max_points: Maximum number of points to try
        n_trials: Number of random configurations to test per point count
        
    Returns:
        Estimated VC dimension (highest n where shattering succeeded)
    """
    # Your implementation here
    pass
```

In [None]:
# Challenge: Your implementation here

---

## Further Reading

- [Understanding Machine Learning: From Theory to Algorithms](https://www.cs.huji.ac.il/~shais/UnderstandingMachineLearning/) - Chapter 6 (VC Dimension)
- [CS229 Stanford Lecture Notes](https://cs229.stanford.edu/notes2022fall/main_notes.pdf) - Learning Theory section
- [Original Vapnik-Chervonenkis Paper (1971)](https://link.springer.com/article/10.1007/BF01037268) - Historical importance

---

## Cleanup

In [None]:
# Clear any large variables
import gc

# Close all matplotlib figures
plt.close('all')

# Garbage collection
gc.collect()

print("Cleanup complete!")
print("\nNext up: Lab A.2 - Bias-Variance Decomposition")