# Chapter 11 Toolkit: Dimensionality Reduction (Lecture Notes pp. 164–178)

## 📚 What You'll Learn

This notebook is your **complete, step-by-step guide** to dimensionality reduction. By the end, you'll understand:

1. **Why dimensionality reduction matters** - How to compress high-dimensional data while preserving structure
2. **Random Projections** - Fast, simple compression with theoretical guarantees
3. **SVD & PCA** - The workhorses of dimensionality reduction
4. **Practical Applications** - Image compression, anomaly detection, and more
5. **Theoretical Foundations** - Why these methods work (with probability bounds)

## 🗺️ Roadmap

### **Part 1: Random Projections** (Section 11.1)
- 🎯 **Goal**: Reduce dimensions quickly with distance preservation
- 📖 **Key Ideas**: Johnson-Lindenstrauss lemma, sub-Gaussian random matrices
- 🛠️ **Tools**: `random_projection_map()`, `jl_required_k()`

### **Part 2: SVD - The Mathematical Foundation** (Section 11.2)
- 🎯 **Goal**: Understand singular value decomposition deeply
- 📖 **Key Ideas**: Eigenvalues, singular values, power method
- 🛠️ **Tools**: `power_method_top_eigenvector()`, `svd_from_ata()`

### **Part 3: PCA - Coordinate Transformation** (Section 11.3)
- 🎯 **Goal**: Find the best low-dimensional representation
- 📖 **Key Ideas**: Centering data, principal components, variance maximization
- 🛠️ **Tools**: `pca_fit()`, `pca_transform()`, `pca_inverse_transform()`

### **Part 4: Applications** (Section 11.4)
- 🎯 **Goal**: Use dimensionality reduction in real problems
- 📖 **Key Ideas**: Rank-k approximation, reconstruction error, anomaly detection
- 🛠️ **Tools**: `rank_k_approximation()`, `pca_anomaly_detector_fit()`

### **Part 5: Theory & Guarantees** (Sections 11.5-11.6)
- 🎯 **Goal**: Understand when and why methods work
- 📖 **Key Ideas**: Matrix Bernstein, Weyl's theorem, excess risk bounds
- 🛠️ **Tools**: `matrix_bernstein_bound_thm11_15()`, `excess_risk_tail_bound_thm11_20()`

## 📋 Prerequisites

Before diving in, you should be comfortable with:
- **Linear Algebra**: Matrix multiplication, eigenvalues, eigenvectors, norms
- **Probability**: Expected value, variance, concentration inequalities
- **Python/NumPy**: Basic array operations

## 🚀 How to Use This Notebook

1. **Run cells top-to-bottom** on your first pass to see everything work
2. **Experiment** by changing parameters (k, eps, n, d, C, etc.)
3. **Reuse functions** for your own data - they are written for clarity, so prefer library routines such as `np.linalg.svd()` when speed or robustness matters
4. **Focus on understanding** - we prioritize clarity over speed

> 💡 **Pro Tip**: Each section builds on the previous one. If something doesn't make sense, go back and review the earlier sections!

---

Let's begin! 🎓

In [None]:
import numpy as np
import math
from dataclasses import dataclass
from typing import Callable, Tuple, Dict, Optional, List

import matplotlib.pyplot as plt


# Part 1: Random Projections 🎲

## 11.1 The Big Idea: Fast Dimensionality Reduction

**Problem**: You have data in $\mathbb{R}^d$ where $d$ is huge (say, $d = 10000$). Computing distances is slow!

**Solution**: Project to $\mathbb{R}^k$ where $k \ll d$ using a **random matrix**. If we choose $k$ correctly, distances are approximately preserved!

### Why Random Projections?

✅ **Fast**: Just matrix multiplication - no optimization needed  
✅ **Simple**: Use random Gaussian matrices  
✅ **Provable**: Strong theoretical guarantees  
✅ **Versatile**: Works for many data types  

---

## The Random Projection Map

Given $k$ random vectors $U_1, \ldots, U_k \in \mathbb{R}^d$ (drawn i.i.d.), we define:

$$
f(v) = \begin{bmatrix} U_1 \cdot v \\ U_2 \cdot v \\ \vdots \\ U_k \cdot v \end{bmatrix} \in \mathbb{R}^k
$$

In matrix form: if $R$ is a $(k \times d)$ matrix where row $i$ is $U_i^T$, then:
$$
f(v) = R v
$$
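
As a quick sanity check, here is a minimal sketch of the map in NumPy. The sizes `d = 6` and `k = 3` and the seed are arbitrary choices for illustration, not values from the lecture notes. It shows that the coordinate-wise definition and the matrix form give the same vector.

```python
import numpy as np

rng = np.random.default_rng(0)   # seed chosen only for reproducibility
d, k = 6, 3                      # illustrative sizes (assumption)

U = rng.normal(size=(k, d))      # rows are the random vectors U_1, ..., U_k
v = rng.normal(size=d)

# Coordinate-wise definition: f(v)_i = U_i . v
f_coordwise = np.array([U[i] @ v for i in range(k)])

# Matrix form: f(v) = R v, where row i of R is U_i^T
R = U
f_matrix = R @ v

print(f_coordwise.shape)                   # (3,)
print(np.allclose(f_coordwise, f_matrix))  # True
```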

---

## 🎯 Theorem 11.1: Single Vector Guarantee

**For one fixed vector $v$**, if each $U_i$ has **sub-Gaussian** components with variance $a^2$:

$$
\mathbb{P}\left( \left| \|f(v)\| - \sqrt{k} \cdot a \cdot \|v\| \right| \geq \varepsilon \sqrt{k} \cdot a \cdot \|v\| \right) \leq 2e^{-k\varepsilon^2/128}
$$

### What This Means (Intuition):

1. **Expected length**: $\mathbb{E}[\|f(v)\|] \approx \sqrt{k} \cdot a \cdot \|v\|$
2. **Concentration**: The length is tightly concentrated around this expected value
3. **Key insight**: As $k$ increases, the probability of large deviation drops **exponentially**!

### Practical Interpretation:

- To get $\|f(v)\| \approx \|v\|$, we scale by $\frac{1}{\sqrt{k} \cdot a}$
- For Gaussian $U_i \sim \mathcal{N}(0, 1)$, use $a = 1$
- Larger $k$ means better concentration (more reliable)
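
To see the concentration at work, the sketch below draws many random matrices for one fixed vector and looks at the ratio $\|f(v)\| / (\sqrt{k}\, a\, \|v\|)$. The helper name `scaled_norm_ratio`, the dimension `d = 200`, and the trial count are assumptions made for this illustration. The spread of the ratio should shrink as $k$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)
d, a = 200, 1.0                      # illustrative dimension and entry std (assumptions)
v = rng.normal(size=d)               # one fixed vector

def scaled_norm_ratio(k: int, trials: int = 100) -> np.ndarray:
    """Return ||R v|| / (sqrt(k) * a * ||v||) for `trials` independent draws of R."""
    ratios = np.empty(trials)
    for t in range(trials):
        R = rng.normal(scale=a, size=(k, d))   # fresh Gaussian projection matrix
        ratios[t] = np.linalg.norm(R @ v) / (np.sqrt(k) * a * np.linalg.norm(v))
    return ratios

# Standard deviation of the ratio for increasing target dimension k
spread = {k: float(np.std(scaled_norm_ratio(k))) for k in (10, 100, 1000)}
for k, s in spread.items():
    print(f"k = {k:4d}: std of ratio = {s:.4f}")
```

The ratios stay centred near 1, which is why dividing by $\sqrt{k} \cdot a$ approximately preserves lengths.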

---

## 🌟 Theorem 11.2: Johnson-Lindenstrauss (JL) Lemma

**For $n$ points** $x_1, \ldots, x_n \in \mathbb{R}^d$, if we choose:

$$
k > \frac{384 \ln(n)}{\varepsilon^2}
$$

Then with **high probability** (at least $1 - \frac{3}{2n}$), **ALL** pairwise distances are preserved:

$$
(1 - \varepsilon) \|x_i - x_j\| \leq \|f(x_i) - f(x_j)\| \leq (1 + \varepsilon) \|x_i - x_j\|
$$

(after appropriate scaling)

### What This Means (Intuition):

1. **Distance preservation**: All distances change by at most factor $(1 \pm \varepsilon)$
2. **Dimension reduction**: We go from $d$ dimensions to only $O(\log n / \varepsilon^2)$ dimensions!
3. **Logarithmic dependence**: $k$ grows only logarithmically with $n$ - amazing compression!

### Example:

- For $n = 1000$ points and $\varepsilon = 0.1$ (10% error):
  - We need $k > 384 \times \ln(1000) / 0.01 \approx 265{,}000$ dimensions, which is huge!
  
- For $n = 1000$ points and $\varepsilon = 0.3$ (30% error):
  - We need $k > 384 \times \ln(1000) / 0.09 \approx 29{,}500$ dimensions, still big but about nine times smaller!

- The catch: small $\varepsilon$ needs large $k$, but we get **universal** guarantees!
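
The arithmetic above is easy to reproduce. The helper `jl_bound` below is a hypothetical name used only for this sketch; it evaluates the right-hand side $384 \ln(n) / \varepsilon^2$ for a few illustrative values of $n$ and $\varepsilon$.

```python
import math

def jl_bound(n: int, eps: float) -> float:
    """Right-hand side of the Theorem 11.2 condition: 384 * ln(n) / eps^2."""
    return 384.0 * math.log(n) / eps**2

# Illustrative grid of dataset sizes and error tolerances (assumption)
for n in (1_000, 1_000_000):
    for eps in (0.1, 0.3):
        print(f"n = {n:>9,}  eps = {eps}:  k > {jl_bound(n, eps):,.0f}")
```

Going from a thousand to a million points only doubles the bound (since $\ln 10^6 = 2 \ln 10^3$), while shrinking $\varepsilon$ from 0.3 to 0.1 multiplies it by nine.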

---

## 🛠️ Implementation Details

Below are reusable functions for:

1. **Choosing $k$**: `jl_required_k(n, eps)` - compute the smallest integer $k$ that satisfies the Theorem 11.2 bound
2. **Creating random matrices**: `sample_subgaussian_matrix(d, k)` - draw random Gaussian matrices
3. **Projecting data**: `random_projection_map(X, R)` - apply the projection
4. **Measuring quality**: `relative_distance_errors(X, Y)` - check distance preservation

Let's see the code! 👇

In [None]:
# ============================================================================
# RANDOM PROJECTION UTILITIES
# ============================================================================

def random_projection_bound_thm11_1(k: int, eps: float) -> float:
    """
    Compute the tail bound from Theorem 11.1.
    
    For a SINGLE fixed vector v, this gives the probability that the 
    projected length deviates from the expected value by more than eps.
    
    Parameters:
    -----------
    k : int
        Target dimension (number of random projections)
    eps : float
        Relative error tolerance (0 < eps < 1)
        
    Returns:
    --------
    float
        Probability bound: P(|deviation| >= eps) <= 2 * exp(-k * eps^2 / 128)
        
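    Example:
    --------
    >>> random_projection_bound_thm11_1(k=100, eps=0.2)
    # Returns about 1.938: vacuous, since any probability is <= 1.
    # With eps=0.2 the bound drops below 1 only once k > 128*ln(2)/eps^2 (about 2,218).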
    """
    if k <= 0 or not (0 < eps < 1):
        raise ValueError("k>0 and eps in (0,1).")
    return float(2.0 * math.exp(-k * eps * eps / 128.0))


def jl_required_k(n_points: int, eps: float) -> int:
    """
    Compute the MINIMUM dimension k needed for Johnson-Lindenstrauss lemma.
    
    This ensures ALL pairwise distances among n_points are preserved 
    within factor (1 ± eps) with high probability.
    
    Parameters:
    -----------
    n_points : int
        Number of points in your dataset
    eps : float
        Relative error tolerance (0 < eps < 1)
        Smaller eps = better accuracy but needs larger k
        
    Returns:
    --------
    int
        Minimum dimension k > 384 * ln(n) / eps^2
        
    Intuition:
    ----------
    - k grows LOGARITHMICALLY with n (great for scaling!)
    - k grows as 1/eps^2 (quadratic cost for precision)
    - Rule of thumb: eps=0.3 gives k ≈ 4,267 * ln(n)
    
    Example:
    --------
    >>> jl_required_k(n_points=1000, eps=0.3)
    29474  # A genuine reduction only when the original dimension d exceeds 29474
    """
    if n_points <= 1 or not (0 < eps < 1):
        raise ValueError("n_points>1 and eps in (0,1).")
    return int(math.floor(384.0 * math.log(n_points) / (eps * eps)) + 1)


def jl_success_prob_lower(n_points: int) -> float:
    """
    Lower bound on success probability from JL Lemma (Theorem 11.2).
    
    When k is chosen via jl_required_k(), the probability that ALL 
    pairwise distances are preserved is at least 1 - 3/(2n).
    
    Parameters:
    -----------
    n_points : int
        Number of points
        
    Returns:
    --------
    float
        Success probability >= 1 - 3/(2n)
        
    Note: For large n, this is very close to 1 (high confidence!)
    """
    if n_points <= 1:
        raise ValueError("n_points>1.")
    return float(1.0 - 3.0 / (2.0 * n_points))


def sample_subgaussian_matrix(d: int, k: int, a: float = 1.0, 
                              rng: Optional[np.random.Generator] = None) -> np.ndarray:
    """
    Create a random projection matrix with Gaussian entries.
    
    Each entry ~ N(0, a^2). Gaussian random variables are sub-Gaussian,
    which satisfies the conditions for Theorems 11.1 and 11.2.
    
    Parameters:
    -----------
    d : int
        Original dimension (number of features)
    k : int
        Target dimension (number of projections)
    a : float, default=1.0
        Standard deviation of entries (variance = a^2)
    rng : numpy Generator, optional
        Random number generator for reproducibility
        
    Returns:
    --------
    ndarray of shape (k, d)
        Random projection matrix R
        Each row is one random projection vector
        
    Usage:
    ------
    >>> R = sample_subgaussian_matrix(d=1000, k=50)
    >>> projected = X @ R.T  # Project data X (n, 1000) to (n, 50)
    """
    rng = np.random.default_rng() if rng is None else rng
    return rng.normal(loc=0.0, scale=float(a), size=(k, d))


def random_projection_map(X: np.ndarray, R: np.ndarray, 
                         scale_by_sqrt_k: bool = True) -> np.ndarray:
    """
    Apply random projection to data matrix X.
    
    Computes Y = X @ R.T, optionally scaled by 1/sqrt(k) to preserve norms.
    
    Parameters:
    -----------
    X : ndarray of shape (n, d)
        Data matrix (n samples, d features)
    R : ndarray of shape (k, d)
        Random projection matrix
    scale_by_sqrt_k : bool, default=True
        If True, divide by sqrt(k) to preserve expected norms
        (matching the theory in Theorem 11.1)
        
    Returns:
    --------
    ndarray of shape (n, k)
        Projected data in k dimensions
        
    Example:
    --------
    >>> X = np.random.randn(100, 1000)  # 100 points in 1000-D
    >>> R = sample_subgaussian_matrix(d=1000, k=50)
    >>> Y = random_projection_map(X, R)
    >>> print(Y.shape)  # (100, 50)
    """
    X = np.asarray(X, dtype=float)
    R = np.asarray(R, dtype=float)
    Y = X @ R.T
    if scale_by_sqrt_k:
        Y = Y / math.sqrt(R.shape[0])
    return Y


def pairwise_distances(X: np.ndarray) -> np.ndarray:
    """
    Compute all pairwise Euclidean distances.
    
    Uses the formula: ||x_i - x_j||^2 = ||x_i||^2 - 2<x_i, x_j> + ||x_j||^2
    
    Parameters:
    -----------
    X : ndarray of shape (n, d)
        Data matrix
        
    Returns:
    --------
    ndarray of shape (n, n)
        Distance matrix where D[i,j] = ||x_i - x_j||
        
    Warning: O(n^2) memory and time! Use only for moderate n (< 5000).
    """
    X = np.asarray(X, dtype=float)
    G = X @ X.T  # Gram matrix (dot products)
    sq = np.maximum(np.diag(G)[:, None] - 2*G + np.diag(G)[None, :], 0.0)
    return np.sqrt(sq)


def relative_distance_errors(X: np.ndarray, Y: np.ndarray, 
                             eps_floor: float = 1e-12) -> np.ndarray:
    """
    Measure how well distances are preserved after projection.
    
    For each pair (i,j), compute: |d_Y(i,j) - d_X(i,j)| / d_X(i,j)
    
    Parameters:
    -----------
    X : ndarray of shape (n, d_original)
        Original data
    Y : ndarray of shape (n, d_projected)
        Projected data
    eps_floor : float, default=1e-12
        Minimum denominator to avoid division by zero
        
    Returns:
    --------
    ndarray of length (n choose 2)
        Relative errors for all pairs i < j
        
    Interpretation:
    ---------------
    - Values near 0 mean distances are well preserved
    - If all values <= eps, the JL guarantee holds for this draw of R
    - Plot histogram to visualize distribution
    
    Example:
    --------
    >>> errs = relative_distance_errors(X, Y)
    >>> print(f"Max error: {np.max(errs):.3f}")
    >>> print(f"90th percentile: {np.percentile(errs, 90):.3f}")
    """
    DX = pairwise_distances(X)
    DY = pairwise_distances(Y)
    n = DX.shape[0]
    iu = np.triu_indices(n, k=1)  # index pairs (i, j) with i < j, in row-major order
    dx, dy = DX[iu], DY[iu]
    return np.abs(dy - dx) / np.maximum(dx, eps_floor)

## 🧪 Experiment: Testing the JL Lemma

Let's check the theory on synthetic data! This demo will:

1. **Generate** synthetic high-dimensional data
2. **Choose** k using the JL formula
3. **Project** the data to k dimensions
4. **Measure** how well distances are preserved
5. **Visualize** the error distribution

### What to Expect:

✅ All distance errors should be **below eps** (our target), typically by a wide margin  
✅ The error distribution should be **concentrated** (not spread out)  
✅ The theoretical success probability should be **high** (close to 1)  

### Important Note:

The constant 384 is conservative: for the settings used here the formula gives a $k$ that is larger than $d$, so the demo illustrates distance preservation rather than actual compression.

For large n, computing all pairwise distances is **O(n²)**, which can be slow.
Keep n moderate (< 1000) for quick experiments.

Let's run it! 👇

In [None]:
# ============================================================================
# DEMO: Johnson-Lindenstrauss Lemma in Action
# ============================================================================

def demo_jl(n: int = 200, d: int = 800, eps: float = 0.25, 
           a: float = 1.0, seed: int = 0):
    """
    Complete demonstration of JL random projection.
    
    Steps:
    1. Generate n random points in d dimensions
    2. Compute k from JL formula
    3. Create random projection matrix
    4. Project data from d to k dimensions
    5. Measure distance preservation quality
    """
    rng = np.random.default_rng(seed)
    
    # Step 1: Generate synthetic data
    X = rng.normal(size=(n, d))
    print(f"📊 Generated {n} points in {d} dimensions")
    
    # Step 2: Choose target dimension
    k = jl_required_k(n, eps)
    print(f"🎯 JL formula says we need k = {k} dimensions for eps = {eps}")
    print(f"   Dimension ratio: d/k = {d}/{k} = {d/k:.2f}x")
    if k >= d:
        # The constant 384 is conservative, so k can exceed d for small datasets
        print("   Note: k >= d, so this run checks distance preservation, not compression")
    
    # Step 3: Create random projection matrix
    R = sample_subgaussian_matrix(d, k, a=a, rng=rng)
    print(f"🎲 Created random matrix R of shape {R.shape}")
    
    # Step 4: Project the data (total scaling 1/(sqrt(k) * a), as in Theorem 11.1)
    Y = random_projection_map(X, R, scale_by_sqrt_k=True) / a
    print(f"✅ Projected to {Y.shape}")
    
    # Step 5: Measure quality
    errs = relative_distance_errors(X, Y)
    
    return {
        "n": n, 
        "d": d, 
        "k": k, 
        "eps": eps, 
        "success_prob_lb": jl_success_prob_lower(n), 
        "errs": errs
    }

# Run the demo
print("=" * 70)
print("JOHNSON-LINDENSTRAUSS RANDOM PROJECTION DEMO")
print("=" * 70)
out = demo_jl(n=180, d=600, eps=0.30, seed=1)

print("\n📈 RESULTS:")
print("-" * 70)
print(f"Original dimension:     d = {out['d']}")
print(f"Projected dimension:    k = {out['k']}")
print(f"Number of points:       n = {out['n']}")
print(f"Target error:         eps = {out['eps']}")
print(f"Success probability: >= {out['success_prob_lb']:.6f}")
print(f"Fraction within eps:    {float(np.mean(out['errs'] <= out['eps'])):.4f}")
print(f"Max observed error:     {float(np.max(out['errs'])):.4f}")
print(f"Mean error:             {float(np.mean(out['errs'])):.4f}")
print(f"Median error:           {float(np.median(out['errs'])):.4f}")

# Visualize the error distribution
plt.figure(figsize=(10, 5))
plt.hist(out["errs"], bins=50, alpha=0.7, edgecolor='black')
plt.axvline(out["eps"], color='red', linestyle='--', linewidth=2, 
           label=f'Target eps = {out["eps"]}')
plt.xlabel("Relative Distance Error", fontsize=12)
plt.ylabel("Count", fontsize=12)
plt.title("Distribution of Distance Preservation Errors\n(All should be below the red line!)", 
         fontsize=13, fontweight='bold')
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

print("\n💡 Interpretation:")
print("   ✓ If all errors <= eps, the JL guarantee holds for this draw!")
print("   ✓ The histogram shows how tightly distances are preserved")
print("   ✓ Try different values of n, d, eps to explore the trade-offs")

# Part 2: Singular Value Decomposition (SVD) 🔍

## 11.2 Understanding SVD: The Foundation of Dimensionality Reduction

### What is SVD?

For any matrix $A$ (size $n \times m$), the **Singular Value Decomposition** is:

$$
A = U \Sigma V^T
$$

where:
- $U$ is $n \times n$ (left singular vectors) - orthonormal columns
- $\Sigma$ is $n \times m$ (diagonal singular values) - $\sigma_1 \geq \sigma_2 \geq \cdots \geq 0$
- $V$ is $m \times m$ (right singular vectors) - orthonormal columns

### 🎯 Geometric Intuition

Think of $A$ as a linear transformation:
1. $V^T$ rotates the input space (change of coordinates)
2. $\Sigma$ scales along new axes (stretching)
3. $U$ rotates the result (final orientation)

The **singular values** $\sigma_i$ tell us how much $A$ stretches along each direction!
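
The factorization and the "maximum stretch" reading of $\sigma_1$ can be checked numerically. This is a small sketch on an arbitrary $5 \times 3$ matrix (an assumption for illustration); it uses NumPy's thin SVD (`full_matrices=False`), so $U$ is $5 \times 3$ rather than the full $5 \times 5$ described above.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))          # small illustrative matrix (assumption)

# Thin SVD: U is (5, 3), s holds the singular values, Vt is V^T with shape (3, 3)
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# The three factors reproduce A
print(np.allclose(A, U @ np.diag(s) @ Vt))        # True

# Singular values come back sorted in descending order
print(bool(np.all(np.diff(s) <= 0)))              # True

# No unit vector is stretched by more than sigma_1
x = rng.normal(size=(3, 1000))
x /= np.linalg.norm(x, axis=0)                    # 1000 random unit vectors (columns)
max_stretch = float(np.linalg.norm(A @ x, axis=0).max())
print(max_stretch <= s[0] + 1e-12)                # True
```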

---

## Connection to Eigenvalues (Lemma 11.7)

Here's the KEY insight connecting SVD to eigenvalue problems:

### Finding the First Singular Vector

Define the **first singular vector** as:
$$
v_1 = \arg\max_{\|v\|=1} \|A v\|
$$

This is the direction where $A$ stretches the most! And the stretch amount is:
$$
\sigma_1 = \|A v_1\|
$$

### 💡 Lemma 11.7: The Connection

**Lemma 11.7** tells us that:
1. $v_1$ is the **top eigenvector** of $A^T A$ (with eigenvalue $\lambda_1$)
2. $\sigma_1 = \sqrt{\lambda_1}$ (singular value = square root of eigenvalue!)
3. The left singular vector is $u_1 = \frac{A v_1}{\sigma_1}$
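
The three claims of the lemma can be verified on a small example. The sketch below uses `np.linalg.eigh` for the eigen-decomposition of $A^T A$ and `np.linalg.svd` as the reference; the $8 \times 4$ test matrix is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(8, 4))          # illustrative test matrix (assumption)

# Eigen-decomposition of the symmetric matrix A^T A (eigh returns ascending eigenvalues)
eigvals, eigvecs = np.linalg.eigh(A.T @ A)
lam1, v1 = eigvals[-1], eigvecs[:, -1]

sigma1 = np.sqrt(lam1)               # claim 2: sigma_1 = sqrt(lambda_1)
u1 = A @ v1 / sigma1                 # claim 3: u_1 = A v_1 / sigma_1

# Reference SVD from NumPy
U, s, Vt = np.linalg.svd(A, full_matrices=False)

print(np.isclose(sigma1, s[0]))                    # True
print(np.isclose(abs(v1 @ Vt[0]), 1.0))            # True: v1 equals ±(first right singular vector)
print(np.isclose(np.linalg.norm(u1), 1.0))         # True: u1 has unit length
print(np.isclose(np.linalg.norm(A @ v1), sigma1))  # True: stretch along v1 is sigma_1
```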

### Why This Matters:

We can compute SVD by finding eigenvectors of $A^T A$ (or $AA^T$)!
- Eigenvalue problem: easier to understand
- Many algorithms available
- Power method is the simplest

---

## 🔄 The Power Method: Finding Top Eigenvectors

The **power method** is an iterative algorithm for finding the eigenvector that belongs to the largest-magnitude eigenvalue (the dominant eigenvector).

### Algorithm (for symmetric matrix $B$):

1. **Initialize**: Pick random vector $v^{(0)}$, normalize it
2. **Iterate**: 
   - $w^{(t)} = B v^{(t-1)}$ (apply the matrix)
   - $v^{(t)} = w^{(t)} / \|w^{(t)}\|$ (normalize)
3. **Converge**: $v^{(t)} \to v_1$ (top eigenvector)
4. **Eigenvalue**: $\lambda_1 \approx v_1^T B v_1$

### Why It Works:

Any vector can be written as $v = \sum c_i v_i$ (sum of eigenvectors).  
After $t$ iterations:
$$
B^t v = \sum c_i \lambda_i^t v_i = \lambda_1^t \left( c_1 v_1 + \sum_{i>1} c_i \left(\frac{\lambda_i}{\lambda_1}\right)^t v_i \right)
$$

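Provided $c_1 \neq 0$ and $|\lambda_i/\lambda_1| < 1$ for all $i > 1$ (a strictly dominant top eigenvalue), the remaining terms decay geometrically and only the $v_1$ direction survives after normalization! 🎯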
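
The ratio $\lambda_2/\lambda_1$ controls how fast this happens. The sketch below runs power iteration on two diagonal matrices, where the top eigenvector is simply $e_1$, and tracks the size of the leftover components. The function name `power_iteration_errors` and the chosen eigenvalues are assumptions for this illustration.

```python
import numpy as np

def power_iteration_errors(eigs, n_iter=30, seed=0):
    """Run power iteration on diag(eigs); return the norm of the non-top components per step."""
    B = np.diag(np.asarray(eigs, dtype=float))
    rng = np.random.default_rng(seed)
    v = rng.normal(size=len(eigs))
    v /= np.linalg.norm(v)
    errors = []
    for _ in range(n_iter):
        w = B @ v                      # apply the matrix
        v = w / np.linalg.norm(w)      # normalize
        # eigs[0] is the largest, so the top eigenvector is e_1;
        # everything outside the first coordinate is error
        errors.append(float(np.linalg.norm(v[1:])))
    return np.array(errors)

fast = power_iteration_errors([1.0, 0.5, 0.1])    # ratio lambda_2 / lambda_1 = 0.5
slow = power_iteration_errors([1.0, 0.95, 0.1])   # ratio lambda_2 / lambda_1 = 0.95

print(f"error after 30 steps, ratio 0.50: {fast[-1]:.2e}")
print(f"error after 30 steps, ratio 0.95: {slow[-1]:.2e}")
```

A small gap between the top two eigenvalues means slow convergence, which is worth keeping in mind when choosing the iteration budget.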

### Deflation: Getting Multiple Eigenvectors

To find $k$ eigenvectors:
1. Find $v_1$ using power method
2. **Deflate**: $B \leftarrow B - \lambda_1 v_1 v_1^T$ (remove the top component)
3. Repeat for $v_2, v_3, \ldots, v_k$
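
Here is a minimal sketch of what one deflation step does to the spectrum. It uses `np.linalg.eigh` for the reference eigenpairs instead of the power method, purely to keep the example short; the $5 \times 5$ test matrix is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(5, 5))
B = M @ M.T                          # symmetric positive semidefinite test matrix

# Reference eigenpairs (eigh returns eigenvalues in ascending order)
eigvals, eigvecs = np.linalg.eigh(B)
lam1, v1 = eigvals[-1], eigvecs[:, -1]
lam2 = eigvals[-2]

# Deflate: subtract the top component
B_deflated = B - lam1 * np.outer(v1, v1)

# v1 is still an eigenvector of the deflated matrix, but its eigenvalue is now 0
print(np.allclose(B_deflated @ v1, 0.0, atol=1e-10))          # True

# The largest eigenvalue of the deflated matrix is the old second eigenvalue
print(np.isclose(np.linalg.eigvalsh(B_deflated)[-1], lam2))   # True
```

Because the other eigenpairs are untouched, running the power method on the deflated matrix converges to $v_2$.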

---

## 🛠️ Implementation

Below we implement:
1. `power_method_top_eigenvector()` - Find the dominant eigenvector
2. `top_k_eigenvectors_deflation()` - Find top-k eigenvectors via deflation
3. `svd_from_ata()` - Compute SVD via eigendecomposition of $A^T A$

These are educational implementations. For production, use `np.linalg.svd()`.

Let's see the code! 👇

In [None]:
# ============================================================================
# SVD AND POWER METHOD UTILITIES
# ============================================================================

def power_method_top_eigenvector(
    B: np.ndarray,
    n_iter: int = 2000,
    tol: float = 1e-10,
    seed: int = 0,
) -> Tuple[np.ndarray, float, Dict[str, object]]:
    """
    Find the top eigenvector and eigenvalue of symmetric matrix B using power iteration.
    
    The power method is simple but powerful:
    - Start with random vector
    - Repeatedly multiply by B and normalize
    - Converges to dominant eigenvector!
    
    Parameters:
    -----------
    B : ndarray of shape (n, n)
        Symmetric matrix (will be symmetrized if slightly asymmetric)
    n_iter : int, default=2000
        Maximum number of iterations
    tol : float, default=1e-10
        Convergence tolerance (stops when ||v_new - v_old|| < tol)
    seed : int, default=0
        Random seed for initialization
        
    Returns:
    --------
    v : ndarray of shape (n,)
        Top eigenvector (unit norm)
    eigenvalue : float
        Corresponding eigenvalue (λ₁)
    info : dict
        Diagnostic information (iterations, eigenvalue estimate)
        
    Algorithm:
    ----------
    1. v ← random unit vector
    2. Repeat:
       - w ← B v         (apply matrix)
       - v ← w / ||w||   (normalize)
    3. Until convergence or max iterations
    4. λ ← v^T B v       (Rayleigh quotient)
    
    Example:
    --------
    >>> B = np.array([[3, 1], [1, 2]])
    >>> v, lam, info = power_method_top_eigenvector(B)
    >>> print(f"Top eigenvalue: {lam:.4f}")
    """
    B = np.asarray(B, dtype=float)
    if B.shape[0] != B.shape[1]:
        raise ValueError("B must be square.")
    
    # Ensure symmetry (allows for small numerical errors)
    if not np.allclose(B, B.T, atol=1e-8):
        B = 0.5*(B + B.T)

    # Initialize with random vector
    rng = np.random.default_rng(seed)
    v = rng.normal(size=B.shape[0])
    v /= np.linalg.norm(v)

    prev = None
    for it in range(n_iter):
        # Power iteration step
        w = B @ v
        normw = np.linalg.norm(w)
        
        if normw == 0:
            raise RuntimeError("Power method hit zero vector; B may be zero.")
        
        v_new = w / normw
        
        # Check convergence
        if prev is not None and np.linalg.norm(v_new - prev) < tol:
            v = v_new
            break
        
        prev = v_new
        v = v_new

    # Compute eigenvalue via Rayleigh quotient
    eig = float(v @ (B @ v))
    return v, eig, {"iters": it+1, "eig_est": eig}


def top_k_eigenvectors_deflation(B: np.ndarray, k: int, 
                                 seed: int = 0) -> Tuple[np.ndarray, np.ndarray]:
    """
    Compute top-k eigenvectors and eigenvalues using power method + deflation.
    
    Deflation Strategy:
    -------------------
    1. Find v₁ (top eigenvector) with eigenvalue λ₁
    2. Remove its contribution: B ← B - λ₁ v₁ v₁^T
    3. Find v₂ from the deflated matrix
    4. Repeat k times
    
    This works because after deflation, the next eigenvector becomes dominant!
    
    Parameters:
    -----------
    B : ndarray of shape (n, n)
        Symmetric matrix
    k : int
        Number of top eigenvectors to compute
    seed : int, default=0
        Random seed for power method initialization
        
    Returns:
    --------
    V : ndarray of shape (n, k)
        Matrix of eigenvectors (columns are v₁, v₂, ..., vₖ)
    vals : ndarray of shape (k,)
        Corresponding eigenvalues (λ₁, λ₂, ..., λₖ) in descending order
        
    Example:
    --------
    >>> B = np.random.randn(100, 100)
    >>> B = B @ B.T  # Make symmetric positive semidefinite
    >>> V, vals = top_k_eigenvectors_deflation(B, k=10)
    >>> print(vals)  # Should be in descending order
    """
    B = np.asarray(B, dtype=float)
    if not np.allclose(B, B.T, atol=1e-8):
        B = 0.5*(B + B.T)
    
    n = B.shape[0]
    V = np.zeros((n, k), dtype=float)
    vals = np.zeros(k, dtype=float)
    B_work = B.copy()  # Working copy for deflation

    for j in range(k):
        # Find next top eigenvector
        v, eig, _ = power_method_top_eigenvector(B_work, seed=seed+j)
        V[:, j] = v
        vals[j] = eig
        
        # Deflate: remove the component we just found.
        # v stays an eigenvector of the deflated matrix, but its eigenvalue drops to 0.
        B_work = B_work - eig * np.outer(v, v)
    
    return V, vals


def svd_from_ata(A: np.ndarray, k: Optional[int] = None, 
                seed: int = 0) -> Dict[str, np.ndarray]:
    """
    Compute SVD by finding eigenvectors of A^T A.
    
    This uses the connection from Lemma 11.7:
    - Right singular vectors V are eigenvectors of A^T A
    - Singular values σᵢ = sqrt(λᵢ) where λᵢ are eigenvalues of A^T A
    - Left singular vectors uᵢ = A vᵢ / σᵢ
    
    Parameters:
    -----------
    A : ndarray of shape (n, m)
        Input matrix (any shape, any rank)
    k : int, optional
        Number of singular vectors to compute (default: all)
    seed : int, default=0
        Random seed for power method
        
    Returns:
    --------
    dict with keys:
        'U' : ndarray of shape (n, k) - left singular vectors
        'S' : ndarray of shape (k,) - singular values (descending)
        'V' : ndarray of shape (m, k) - right singular vectors
        
    Note: A ≈ U @ diag(S) @ V.T
    
    Algorithm:
    ----------
    1. Form B = A^T A (size m × m)
    2. Find top-k eigenvectors V and eigenvalues λ of B
    3. Compute σᵢ = sqrt(λᵢ)
    4. Compute Uᵢ = A Vᵢ / σᵢ
    
    Example:
    --------
    >>> A = np.random.randn(100, 50)
    >>> svd_dict = svd_from_ata(A, k=10)
    >>> U, S, V = svd_dict['U'], svd_dict['S'], svd_dict['V']
    >>> A_approx = U @ np.diag(S) @ V.T
    >>> print(f"Reconstruction error: {np.linalg.norm(A - A_approx):.6f}")
    """
    A = np.asarray(A, dtype=float)
    B = A.T @ A  # Form A^T A (m × m matrix)
    m = B.shape[0]
    
    if k is None:
        k = m
    k = min(k, m)

    # Full decomposition of a small matrix (k == m <= 400): use numpy's eigh (fast, stable)
    # Otherwise (k < m, or m > 400): use our power method + deflation (educational)
    if k == m and m <= 400:
        eigvals, eigvecs = np.linalg.eigh(B)
        # Sort in descending order
        order = np.argsort(eigvals)[::-1]
        eigvals = eigvals[order]
        V = eigvecs[:, order]
    else:
        # Use our power method implementation
        V, eigvals = top_k_eigenvectors_deflation(B, k=k, seed=seed)

    # Compute singular values: σᵢ = sqrt(λᵢ)
    sigmas = np.sqrt(np.maximum(eigvals[:k], 0.0))  # Max for numerical safety
    V = V[:, :k]
    
    # Compute left singular vectors: uᵢ = A vᵢ / σᵢ
    U = np.zeros((A.shape[0], k), dtype=float)
    for i in range(k):
        if sigmas[i] > 1e-12:  # Avoid division by zero
            U[:, i] = (A @ V[:, i]) / sigmas[i]
        # If σᵢ ≈ 0, leave uᵢ as zero vector
    
    return {"U": U, "S": sigmas, "V": V}

## 🧪 Verification: Compare Our SVD with NumPy's SVD

Let's check if our implementation (via $A^T A$ eigendecomposition) matches NumPy's optimized SVD!

**What we're testing:**
- Do the top singular values match?

**Note:** Singular vectors are determined only up to sign ($v$ and $-v$ are both valid), so comparing them requires aligning signs first.
This demo therefore compares only the singular values.

In [None]:
# ============================================================================
# DEMO: Verify Our SVD Implementation
# ============================================================================

print("=" * 70)
print("SVD VERIFICATION: Our Implementation vs NumPy")
print("=" * 70)

# Generate test matrix
rng = np.random.default_rng(0)
A = rng.normal(size=(300, 40))
print(f"\n📊 Test matrix A: shape {A.shape}")

# Our implementation (via A^T A)
print("\n🔧 Computing SVD via eigendecomposition of A^T A...")
res = svd_from_ata(A, k=6, seed=0)

# NumPy's implementation
print("🔧 Computing SVD via NumPy (reference)...")
U_np, S_np, Vt_np = np.linalg.svd(A, full_matrices=False)

# Compare results
print("\n📊 COMPARISON:")
print("-" * 70)
print("Top-6 singular values:")
print(f"  Our method: {np.round(res['S'][:6], 6)}")
print(f"  NumPy:      {np.round(S_np[:6], 6)}")
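max_diff = float(np.max(np.abs(res['S'][:6] - S_np[:6])))
print(f"  Max difference: {max_diff:.2e}")

# Report success only if the values actually agree (1e-6 is an illustrative threshold)
if max_diff < 1e-6:
    print("\n✅ Result: Singular values match within numerical precision!")
    print("   (Small differences are due to floating-point arithmetic)")
else:
    print("\n⚠️ Result: Singular values differ; the power method may not have converged")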

# Visual comparison
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.plot(res['S'][:6], 'o-', label='Our method', markersize=8)
plt.plot(S_np[:6], 's--', label='NumPy', markersize=6, alpha=0.7)
plt.xlabel("Index", fontsize=11)
plt.ylabel("Singular Value", fontsize=11)
plt.title("Top-6 Singular Values", fontweight='bold')
plt.legend()
plt.grid(alpha=0.3)

plt.subplot(1, 2, 2)
plt.bar(range(6), np.abs(res['S'][:6] - S_np[:6]))
plt.xlabel("Index", fontsize=11)
plt.ylabel("Absolute Difference", fontsize=11)
plt.title("Difference Between Methods", fontweight='bold')
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

print("\n💡 Key Insight:")
print("   Our implementation via A^T A gives the same result as")
print("   NumPy's optimized SVD algorithm. Lemma 11.7 works! 🎉")

# Part 3: Principal Component Analysis (PCA) 📊

## 11.3 What is PCA? A Geometric View

### The Core Idea

**PCA finds the "best" way to view your data from fewer dimensions.**

Imagine you have data in 3D, but it lies mostly on a 2D plane. PCA finds that plane automatically!

### 🎯 Three Ways to Think About PCA:

1. **Variance Maximization**: Find directions with maximum spread
   - "Where is the data most spread out?"
   
2. **Dimensionality Reduction**: Project onto lower dimensions
   - "What's the best k-dimensional view of d-dimensional data?"
   
3. **Reconstruction**: Minimize error when compressing and decompressing
   - "How can we compress data with minimal information loss?"

All three perspectives give the **same answer**: use singular vectors of the centered data!

---

## The Mathematics: PCA = SVD of Centered Data

### Step 1: Center the Data

**Why center?** PCA finds directions of maximum variance *around the mean*.

Given data matrix $X$ (size $n \times d$), compute:
$$
\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i
$$

Then center each row:
$$
X_{\text{centered}} = X - \mathbf{1}\bar{x}^T
$$

where $\mathbf{1}$ is the column vector of all ones.
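
To see why centering matters, the sketch below builds a toy 2-D cloud that is spread along the x-axis but sits far from the origin (the data and the helper name `top_direction` are assumptions for illustration). Without centering, the top singular direction points at the offset; after centering, it follows the actual spread.

```python
import numpy as np

rng = np.random.default_rng(0)
# 2-D cloud: std 3.0 along x, std 0.3 along y, shifted to (0, 50)
X = rng.normal(size=(500, 2)) * np.array([3.0, 0.3]) + np.array([0.0, 50.0])

mean = X.mean(axis=0)
X_centered = X - mean                      # broadcasting subtracts the mean from every row

def top_direction(M):
    """First right singular vector of M."""
    return np.linalg.svd(M, full_matrices=False)[2][0]

v_raw = top_direction(X)                   # dominated by the offset (0, 50)
v_centered = top_direction(X_centered)     # follows the actual spread (x-axis)

print("uncentered:", np.round(np.abs(v_raw), 2))
print("centered:  ", np.round(np.abs(v_centered), 2))
```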

### Step 2: Compute SVD

$$
X_{\text{centered}} = U \Sigma V^T
$$

### Step 3: The PCA Transformation

The **principal components** are the columns of $V$ (right singular vectors).

To transform data to PCA coordinates:
$$
\text{PCA}(X) = X_{\text{centered}} \cdot V = U \Sigma
$$
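
The identity $X_{\text{centered}} V = U \Sigma$ is easy to confirm numerically. This sketch uses correlated toy data (an assumption for illustration) and also shows that the resulting PCA coordinates are uncorrelated.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4)) @ rng.normal(size=(4, 4))   # correlated toy data (assumption)

X_c = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(X_c, full_matrices=False)
V = Vt.T

scores_projection = X_c @ V          # project onto the principal directions
scores_from_svd = U * s              # same as U @ np.diag(s)

print(np.allclose(scores_projection, scores_from_svd))   # True

# The PCA coordinates are uncorrelated: their Gram matrix is diagonal (= Sigma^2)
gram = scores_projection.T @ scores_projection
print(np.allclose(gram, np.diag(s**2)))                  # True
```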

### 📝 Key Insight (Theorem 11.10)

The first $k$ columns of $V$ form the **best k-dimensional subspace** for your data!

"Best" means: minimizes reconstruction error = maximizes retained variance.

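The optimality claim can be illustrated numerically. The sketch below compares the reconstruction error of the first $k$ principal directions against a random $k$-dimensional subspace, and checks that the PCA error equals the sum of the discarded $\sigma_i^2$. The toy data, the choice `k = 3`, and the helper name `projection_error` are assumptions for this illustration; one random comparison is a sanity check, not a proof.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10)) * np.linspace(5.0, 0.5, 10)   # toy data with unequal spread
X_c = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(X_c, full_matrices=False)
k = 3
V_k = Vt[:k].T                                   # first k principal directions, shape (10, k)

def projection_error(basis):
    """Squared Frobenius error after projecting X_c onto the column span of `basis`."""
    return float(np.linalg.norm(X_c - X_c @ basis @ basis.T) ** 2)

err_pca = projection_error(V_k)

# Compare against a random k-dimensional subspace (orthonormalized with QR)
Q, _ = np.linalg.qr(rng.normal(size=(10, k)))
err_random = projection_error(Q)

print(np.isclose(err_pca, np.sum(s[k:] ** 2)))   # True: error = sum of discarded sigma_i^2
print(err_pca <= err_random)                     # True
```
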
---

## 🔄 The PCA Workflow

### Forward Transform (Compress):
1. Center: $X_c = X - \bar{x}$
2. Project: $Z = X_c V_k$ (use first $k$ principal components)
3. Result: $Z$ has $k$ columns instead of $d$ (compression!)

### Inverse Transform (Reconstruct):
1. Back-project: $X_c' = Z V_k^T$
2. Un-center: $X' = X_c' + \bar{x}$
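3. Result: $X_c'$ is the best rank-$k$ approximation of $X_c$, so $X'$ is the best approximation of $X$ by points in a $k$-dimensional affine subspace through the mean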

### The Trade-off:
- Larger $k$ → better reconstruction, less compression
- Smaller $k$ → more compression, higher error
- Choose $k$ based on explained variance!
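
One common way to act on this is to pick the smallest $k$ whose cumulative explained variance reaches a target. The sketch below is a minimal version of that rule; the toy data and the 95% threshold are assumptions for illustration, not values from the lecture notes.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data whose columns have very different spreads (assumption)
X = rng.normal(size=(300, 8)) * np.array([10, 6, 3, 1, 0.5, 0.5, 0.2, 0.1])
X_c = X - X.mean(axis=0)

s = np.linalg.svd(X_c, compute_uv=False)         # singular values only
explained = s**2 / np.sum(s**2)                  # fraction of variance per component
cumulative = np.cumsum(explained)

threshold = 0.95                                 # illustrative target (assumption)
# First index where the cumulative fraction reaches the threshold, converted to a count
k = int(np.searchsorted(cumulative, threshold) + 1)

print(np.round(cumulative, 3))
print(f"smallest k with >= {threshold:.0%} explained variance: {k}")
```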

---

## üõ†Ô∏è Implementation

Below we provide:
1. `center_data()` - Mean-center the data matrix
2. `pca_fit()` - Fit PCA model (compute components, mean, scores)
3. `pca_transform()` - Project new data to PCA space
4. `pca_inverse_transform()` - Reconstruct from PCA space

These match scikit-learn's API for easy integration!

Let's see the code! 👇

In [None]:
# ============================================================================
# PCA UTILITIES
# ============================================================================

def center_data(X: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
    """
    Center the data by subtracting the mean of each feature.
    
    Centering is ESSENTIAL for PCA because:
    - PCA finds directions of maximum variance
    - Variance is computed relative to the mean
    - Without centering, PCA would be dominated by the offset!
    
    Parameters:
    -----------
    X : ndarray of shape (n, d)
        Data matrix (n samples, d features)
        
    Returns:
    --------
    X_centered : ndarray of shape (n, d)
        Centered data (each column has mean ≈ 0)
    mean : ndarray of shape (d,)
        Mean vector (needed for inverse transform)
        
    Example:
    --------
    >>> X = np.array([[1, 2], [3, 4], [5, 6]])
    >>> X_c, mu = center_data(X)
    >>> print(mu)  # [3, 4]
    >>> print(X_c.mean(axis=0))  # [0, 0] (up to numerical error)
    """
    X = np.asarray(X, dtype=float)
    mu = np.mean(X, axis=0)  # Mean of each column, shape (d,) even when d == 1
    return X - mu, mu


def pca_fit(X: np.ndarray, k: int) -> Dict[str, np.ndarray]:
    """
    Fit PCA model to data X, keeping k components.
    
    This function:
    1. Centers the data
    2. Computes SVD of centered data
    3. Extracts first k principal components (columns of V)
    4. Computes PCA scores (projections)
    
    Parameters:
    -----------
    X : ndarray of shape (n, d)
        Training data
    k : int
        Number of principal components to keep (k <= d)
        
    Returns:
    --------
    dict with keys:
        'mean' : ndarray of shape (d,)
            Feature means (for centering new data)
        'components' : ndarray of shape (d, k)
            Principal component vectors (columns are PCs)
        'singular_values' : ndarray of shape (min(n,d),)
            All singular values from SVD
        'scores' : ndarray of shape (n, k)
            PCA coordinates of training data
            
    The 'components' matrix V_k satisfies:
        scores = (X - mean) @ components
        
    Example:
    --------
    >>> X = np.random.randn(100, 50)  # 100 samples, 50 features
    >>> pca = pca_fit(X, k=10)  # Keep top 10 components
    >>> print(pca['components'].shape)  # (50, 10)
    >>> print(pca['scores'].shape)      # (100, 10)
    """
    # Step 1: Center the data
    Xc, mu = center_data(X)
    
    # Step 2: Compute SVD
    # X_c = U @ diag(S) @ V^T
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt.T  # Transpose to get columns as singular vectors
    
    # Step 3: Extract first k components
    Vk = V[:, :k]
    
    # Step 4: Compute PCA scores (projections)
    # scores = X_c @ V_k = U @ diag(S) @ V^T @ V_k
    #        = U[:, :k] @ diag(S[:k])
    scores = Xc @ Vk  # (n, k)
    
    return {
        "mean": mu, 
        "components": Vk, 
        "singular_values": S, 
        "scores": scores
    }


def pca_transform(X: np.ndarray, pca: Dict[str, np.ndarray]) -> np.ndarray:
    """
    Project new data X into PCA space using fitted model.
    
    Use this to transform test data or new observations using
    the PCA model learned from training data.
    
    Parameters:
    -----------
    X : ndarray of shape (n, d)
        Data to transform (must have same d as training data)
    pca : dict
        Fitted PCA model from pca_fit()
        
    Returns:
    --------
    Z : ndarray of shape (n, k)
        PCA coordinates (compressed representation)
        
    Formula:
    --------
    Z = (X - mean) @ components
    
    Example:
    --------
    >>> pca = pca_fit(X_train, k=5)
    >>> Z_test = pca_transform(X_test, pca)
    >>> # Z_test has 5 columns instead of original d columns!
    """
    X = np.asarray(X, dtype=float)
    Xc = X - pca["mean"]  # Center using training mean
    return Xc @ pca["components"]


def pca_inverse_transform(Z: np.ndarray, pca: Dict[str, np.ndarray]) -> np.ndarray:
    """
    Reconstruct data from PCA coordinates.
    
    Goes back from k-dimensional PCA space to original d dimensions.
    Note: This is only an approximation if k < d!
    
    Parameters:
    -----------
    Z : ndarray of shape (n, k)
        PCA coordinates
    pca : dict
        Fitted PCA model from pca_fit()
        
    Returns:
    --------
    X_reconstructed : ndarray of shape (n, d)
        Reconstructed data in original space
        
    Formula:
    --------
    X_reconstructed = Z @ components^T + mean
    
    Note: If k < d and Z comes from the training data, Z @ components^T is the
    best rank-k approximation of the centered training data.
    
    Example:
    --------
    >>> pca = pca_fit(X_train, k=5)
    >>> Z = pca_transform(X_train, pca)
    >>> X_recon = pca_inverse_transform(Z, pca)
    >>> error = np.linalg.norm(X_train - X_recon, 'fro')
    >>> print(f"Reconstruction error: {error:.4f}")
    """
    Z = np.asarray(Z, dtype=float)
    # Back-project and add mean
    return Z @ pca["components"].T + pca["mean"]

# Part 4: Applications of Dimensionality Reduction 🚀

## 11.4 Rank-k Approximation: Optimal Compression

### The Big Picture

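Given matrix $A$ with SVD $A = \sum_{j=1}^{m} \sigma_j u_j v_j^T$, where $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_m \geq 0$ and $m$ is the number of singular values.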

We can approximate it using only the **first k terms**:
$$
A_k = \sum_{j=1}^{k} \sigma_j u_j v_j^T
$$

This is a **rank-k matrix** (has at most k non-zero singular values).

### 🌟 Why This is Optimal

**Eckart-Young Theorem**: $A_k$ is the **best** rank-k approximation to $A$ in both:
- Frobenius norm: $\|A - B\|_F = \sqrt{\sum_{i,j} (A_{ij} - B_{ij})^2}$
- Operator norm: $\|A - B\|_2$ = largest singular value of $(A-B)$

No other rank-k matrix is closer to $A$! 🏆
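The optimality claim can be probed numerically. The sketch below compares the truncated SVD with one arbitrary rank-$k$ competitor: the projection of $A$ onto a random $k$-dimensional subspace. The matrix sizes, seed, and choice of competitor are illustrative assumptions. A single comparison does not prove the theorem; it only shows the inequality in action.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(40, 20))
k = 5

# Truncated SVD: keep the k largest singular triplets
U, S, Vt = np.linalg.svd(A, full_matrices=False)
A_k = (U[:, :k] * S[:k]) @ Vt[:k, :]

# Competitor: project the rows of A onto a random k-dimensional subspace
Q, _ = np.linalg.qr(rng.normal(size=(20, k)))    # orthonormal basis, shape (20, k)
B = A @ Q @ Q.T                                  # rank at most k

err_svd_fro = np.linalg.norm(A - A_k, "fro")
err_rand_fro = np.linalg.norm(A - B, "fro")
err_svd_op = np.linalg.norm(A - A_k, 2)          # largest singular value of A - A_k
err_rand_op = np.linalg.norm(A - B, 2)

print(f"Frobenius: SVD {err_svd_fro:.3f} vs random subspace {err_rand_fro:.3f}")
print(f"Operator:  SVD {err_svd_op:.3f} vs random subspace {err_rand_op:.3f}")
```

In the operator norm the truncated SVD's error equals the first discarded singular value, `S[k]`.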

---

## 📏 Measuring Approximation Quality

### 1. Reconstruction Error (Frobenius Norm)

The error from using only $k$ terms:
$$
\|A - A_k\|_F = \sqrt{\sum_{j=k+1}^{m} \sigma_j^2}
$$

**Intuition**: The error is exactly the "energy" in the discarded components!

### 2. Explained Variance Ratio

What fraction of the data's "energy" is retained?
$$
\text{Explained Variance} = \frac{\sum_{j=1}^{k} \sigma_j^2}{\sum_{j=1}^{m} \sigma_j^2}
$$

**Interpretation**:
- 0.90 means "90% of variance explained" → good compression
- 0.50 means "50% of variance explained" → losing half the information
- 1.00 means "perfect reconstruction" → no compression
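The two quality measures are linked: the explained variance ratio and the squared relative Frobenius error always sum to 1. Here is a small worked example on a matrix whose singular values are known by construction. The diagonal matrix is an illustrative assumption chosen so the arithmetic can be checked by hand.

```python
import numpy as np

A = np.diag([3.0, 2.0, 1.0])                 # singular values are 3, 2, 1
S = np.linalg.svd(A, compute_uv=False)       # returned in descending order
k = 1

error = np.sqrt(np.sum(S[k:] ** 2))          # sqrt(2^2 + 1^2) = sqrt(5)
evr = np.sum(S[:k] ** 2) / np.sum(S ** 2)    # 9 / 14
relative_sq_error = error ** 2 / np.sum(S ** 2)   # 5 / 14

print(f"error = {error:.4f}, explained variance = {evr:.4f}")
print(f"explained variance + relative squared error = {evr + relative_sq_error:.4f}")
```

Keeping one component retains 9/14 of the energy and discards 5/14, so either number determines the other.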

### 📊 The Scree Plot

Plot singular values $\sigma_1, \sigma_2, \ldots$ in descending order.

Look for the **"elbow"** where values drop sharply:
- Before elbow: important components (keep these)
- After elbow: noise components (can discard)

---

## 🎨 Example Application: Image Compression

Images are just matrices! A grayscale image is size $(H \times W)$.

**Without compression**: Store $H \times W$ values  
**With rank-k SVD**: Store $U_k$ (size $H \times k$) + $\Sigma_k$ (size $k$) + $V_k$ (size $W \times k$)

**Storage**: $H \times W \rightarrow k(H + W + 1)$

If $k \ll \min(H, W)$, huge savings! 💾
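To make the storage formula concrete, this sketch counts stored values for a hypothetical 512 by 512 grayscale image at rank 20. The helper name `svd_storage` and the image size are assumptions chosen for illustration.

```python
def svd_storage(height, width, k):
    """Return (raw_values, compressed_values, ratio) for a rank-k SVD of an H x W image."""
    raw = height * width                     # one value per pixel
    compressed = k * (height + width + 1)    # U_k, V_k, and k singular values
    return raw, compressed, raw / compressed

raw, compressed, ratio = svd_storage(512, 512, 20)
print(raw, compressed, round(ratio, 1))      # 262144 20500 12.8
```

The rank-$k$ form only saves space when $k < HW / (H + W + 1)$, which is about 255 for this image size.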

---

## 🛠️ Implementation

Below we provide utilities for:
1. `rank_k_approximation()` - Compute $A_k$ from full SVD
2. `reconstruction_error_frobenius()` - Measure $\|A - A_k\|_F$
3. `reconstruction_error_from_singular_values()` - Compute error from $\sigma$ values
4. `explained_variance_ratio_from_singular_values()` - Compute variance explained

Plus a demo showing compression in action! 👇

In [None]:
# ============================================================================
# RANK-K APPROXIMATION AND COMPRESSION UTILITIES
# ============================================================================

def rank_k_approximation(X: np.ndarray, k: int) -> Tuple[np.ndarray, Dict[str, np.ndarray]]:
    """
    Compute the best rank-k approximation of matrix X using SVD.
    
    This uses the Eckart-Young theorem: the rank-k matrix that minimizes
    ||X - X_k|| is obtained by truncating the SVD.
    
    Parameters:
    -----------
    X : ndarray of shape (n, m)
        Input matrix (can be raw data or centered data)
    k : int
        Target rank (k <= min(n, m))
        
    Returns:
    --------
    X_k : ndarray of shape (n, m)
        Best rank-k approximation
    svd_info : dict
        Full SVD components: 'U', 'S', 'Vt'
        
    Algorithm:
    ----------
    1. Compute full SVD: X = U @ diag(S) @ Vt
    2. Keep first k components: X_k = U[:, :k] @ diag(S[:k]) @ Vt[:k, :]
    
    Example:
    --------
    >>> X = np.random.randn(100, 50)
    >>> X_10, info = rank_k_approximation(X, k=10)
    >>> print(f"Rank: {np.linalg.matrix_rank(X_10)}")  # Should be 10
    """
    X = np.asarray(X, dtype=float)
    
    # Compute the thin SVD (full_matrices=False keeps all min(n, m) singular values)
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    
    # Truncate to rank k
    Uk = U[:, :k]
    Sk = S[:k]
    Vtk = Vt[:k, :]
    
    # Reconstruct: X_k = U_k @ diag(S_k) @ Vt_k
    Xk = (Uk * Sk) @ Vtk  # Efficient: multiply U_k by S_k element-wise
    
    return Xk, {"U": U, "S": S, "Vt": Vt}


def reconstruction_error_frobenius(X: np.ndarray, Xk: np.ndarray) -> float:
    """
    Compute the Frobenius norm of the reconstruction error.
    
    Frobenius norm: ||A||_F = sqrt(sum of all squared entries)
    
    Parameters:
    -----------
    X : ndarray
        Original matrix
    Xk : ndarray
        Approximation
        
    Returns:
    --------
    error : float
        ||X - X_k||_F
        
    Interpretation:
    ---------------
    - Measures total squared error across all entries
    - Larger value = worse approximation
    - Compare to ||X||_F to get relative error
    
    Example:
    --------
    >>> error = reconstruction_error_frobenius(X, X_k)
    >>> relative = error / np.linalg.norm(X, 'fro')
    >>> print(f"Relative error: {relative:.2%}")
    """
    D = np.asarray(X, dtype=float) - np.asarray(Xk, dtype=float)
    return float(np.linalg.norm(D, ord='fro'))


def reconstruction_error_from_singular_values(S: np.ndarray, k: int) -> float:
    """
    Compute reconstruction error directly from singular values (more efficient!).
    
    Uses the formula: ||X - X_k||_F = sqrt(σ_{k+1}^2 + σ_{k+2}^2 + ...)
    
    This is MUCH faster than computing X_k and then ||X - X_k||!
    
    Parameters:
    -----------
    S : ndarray of shape (r,)
        Singular values in descending order
    k : int
        Number of components kept
        
    Returns:
    --------
    error : float
        Reconstruction error
        
    Example:
    --------
    >>> U, S, Vt = np.linalg.svd(X, full_matrices=False)
    >>> error_k5 = reconstruction_error_from_singular_values(S, k=5)
    >>> error_k10 = reconstruction_error_from_singular_values(S, k=10)
    >>> print(f"k=5: {error_k5:.4f}, k=10: {error_k10:.4f}")
    """
    S = np.asarray(S, dtype=float)
    tail = S[k:]  # Discarded singular values
    return float(np.sqrt(np.sum(tail * tail)))


def explained_variance_ratio_from_singular_values(S: np.ndarray, k: int) -> float:
    """
    Compute the fraction of variance explained by first k components.
    
    Formula: (σ₁² + σ₂² + ... + σₖ²) / (σ₁² + σ₂² + ... + σₘ²)
    
    Parameters:
    -----------
    S : ndarray of shape (r,)
        Singular values in descending order
    k : int
        Number of components
        
    Returns:
    --------
    ratio : float
        Explained variance ratio (between 0 and 1)
        
    Interpretation:
    ---------------
    - 1.0 = perfect (all variance explained)
    - 0.9 = excellent (90% variance retained)
    - 0.5 = poor (half the information lost)
    - 0.0 = useless (everything discarded)
    
    Example:
    --------
    >>> U, S, Vt = np.linalg.svd(X, full_matrices=False)
    >>> for k in [1, 5, 10, 20]:
    ...     evr = explained_variance_ratio_from_singular_values(S, k)
    ...     print(f"k={k:2d}: {evr:.2%}")
    """
    S = np.asarray(S, dtype=float)
    num = float(np.sum(S[:k] * S[:k]))  # Energy in first k components
    den = float(np.sum(S * S))          # Total energy
    
    if den == 0:
        return float("nan")  # Edge case: zero matrix
    
    return num / den


# ============================================================================
# DEMO: Rank-k Approximation on Synthetic Data
# ============================================================================

print("=" * 70)
print("RANK-K APPROXIMATION DEMO")
print("=" * 70)

# Generate low-rank data with noise
rng = np.random.default_rng(0)
n, d = 500, 60

# Create inherently low-rank data (rank ≈ 10) plus noise
X_lowrank = rng.normal(size=(n, 10)) @ rng.normal(size=(10, d))
X_noise = rng.normal(scale=0.5, size=(n, d))
X = X_lowrank + X_noise

print(f"\n📊 Generated data: {n} samples × {d} features")
print("   (Low-rank signal + noise)")

# Center the data (important for PCA interpretation)
Xc, mu = center_data(X)

# Compute rank-10 approximation
X10, svd_info = rank_k_approximation(Xc, k=10)

print("\n🔧 Computing rank-10 approximation...")
print(f"   Original matrix rank: {np.linalg.matrix_rank(Xc)}")
print(f"   Approximation rank: {np.linalg.matrix_rank(X10)}")

# Measure error (two methods)
error_direct = reconstruction_error_frobenius(Xc, X10)
error_sigma = reconstruction_error_from_singular_values(svd_info["S"], k=10)

print("\n📏 RECONSTRUCTION ERROR:")
print(f"   Direct computation:  {error_direct:.6f}")
print(f"   From singular values: {error_sigma:.6f}")
print(f"   Difference: {abs(error_direct - error_sigma):.2e} (should be ≈0)")

# Explained variance
evr = explained_variance_ratio_from_singular_values(svd_info["S"], k=10)
print("\n📊 EXPLAINED VARIANCE:")
print(f"   k=10 components explain {evr:.2%} of variance")

# Compare different k values
print("\n📈 EXPLAINED VARIANCE FOR DIFFERENT k:")
print("-" * 70)
for k_test in [1, 5, 10, 15, 20, 30]:
    evr_k = explained_variance_ratio_from_singular_values(svd_info["S"], k=k_test)
    print(f"   k={k_test:2d}: {evr_k:.4f} ({evr_k:.1%})")

# Scree plot: visualize singular values
plt.figure(figsize=(12, 4))

plt.subplot(1, 3, 1)
plt.plot(svd_info["S"], 'o-', markersize=4)
plt.xlabel("Component Index", fontsize=11)
plt.ylabel("Singular Value", fontsize=11)
plt.title("Scree Plot (All Singular Values)", fontweight='bold')
plt.grid(alpha=0.3)

plt.subplot(1, 3, 2)
plt.plot(svd_info["S"][:20], 'o-', markersize=6)
plt.axvline(10, color='red', linestyle='--', label='k=10')
plt.xlabel("Component Index", fontsize=11)
plt.ylabel("Singular Value", fontsize=11)
plt.title("Top-20 Components (Look for Elbow!)", fontweight='bold')
plt.legend()
plt.grid(alpha=0.3)

plt.subplot(1, 3, 3)
k_range = range(1, min(31, len(svd_info["S"])))
evr_curve = [explained_variance_ratio_from_singular_values(svd_info["S"], k) 
             for k in k_range]
plt.plot(k_range, evr_curve, 'o-', markersize=4)
plt.axhline(0.9, color='green', linestyle='--', alpha=0.5, label='90% threshold')
plt.axhline(0.95, color='orange', linestyle='--', alpha=0.5, label='95% threshold')
plt.xlabel("Number of Components (k)", fontsize=11)
plt.ylabel("Explained Variance Ratio", fontsize=11)
plt.title("Cumulative Explained Variance", fontweight='bold')
plt.legend()
plt.grid(alpha=0.3)

plt.tight_layout()
plt.show()

print("\n💡 How to choose k:")
print("   1. Look for 'elbow' in scree plot (where curve flattens)")
print("   2. Choose k to explain desired % of variance (e.g., 90%)")
print("   3. Balance compression vs. reconstruction quality")
print("   4. For this data: k=10 is a good choice!")

### Optional demo: digits compression (like the notes' MNIST example)

The notes mention MNIST; offline we can use scikit-learn's `load_digits` dataset (8×8 images).
If scikit-learn isn't available in your environment, you can skip this cell safely.


In [None]:
# Optional: digits dataset compression (offline, small)
try:
    from sklearn.datasets import load_digits
    digits = load_digits()
    A = digits.data.astype(float)  # (n,64)
    A_centered, mu = center_data(A)

    k = 10
    Ak, svd_info = rank_k_approximation(A_centered, k=k)
    recon = Ak + mu

    # Show 10 original vs reconstructed
    idx = np.arange(10)
    plt.figure(figsize=(8, 2))
    for i, j in enumerate(idx):
        plt.subplot(2, 10, i+1)
        plt.imshow(digits.images[j], cmap='gray')
        plt.axis('off')
        plt.subplot(2, 10, 10+i+1)
        plt.imshow(recon[j].reshape(8,8), cmap='gray')
        plt.axis('off')
    plt.suptitle("Top row: original | Bottom row: rank-10 reconstruction")
    plt.show()

    # Explained variance
    evr10 = explained_variance_ratio_from_singular_values(svd_info["S"], k=10)
    print("Explained variance ratio (k=10):", evr10)
except Exception as e:
    print("Skipped digits demo (likely sklearn missing):", repr(e))


## 11.4.3 Application: Anomaly Detection via Reconstruction Error 🔍

### The Core Idea

**Normal data** lies in a low-dimensional subspace. **Anomalies** don't!

If we compress "normal" data to $k$ dimensions and reconstruct it, the error is small.  
But anomalies have large reconstruction error because they don't fit the pattern!

### 🎯 The Algorithm

**Training Phase** (on normal data):
1. Fit PCA with $k$ components to capture "normal" patterns
2. Reconstruct training data and compute reconstruction errors
3. Set threshold $\tau$ = high quantile of training errors (e.g., 99th percentile)

**Testing Phase** (on new data):
1. Transform test point to PCA space and back
2. Compute reconstruction error $e = \|x - \hat{x}\|$
3. Flag as anomaly if $e > \tau$

### 📊 Intuition

Think of PCA as learning the "shape" of normal data:
- Normal points: fit the shape well → small reconstruction error
- Anomalies: don't fit the shape → large reconstruction error

### ⚙️ Hyperparameters

**k (number of components)**:
- Too small: can't capture normal patterns → false positives
- Too large: captures noise → false negatives
- Rule of thumb: choose k to explain 90-95% of variance

**q (threshold quantile)**:
- Typical: q = 0.95 or 0.99
- Higher q = fewer false alarms but may miss some anomalies
- Lower q = more sensitive but more false alarms
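The effect of $q$ can be seen with stand-in error values. In the sketch below, the "reconstruction errors" are simply norms of 10-dimensional Gaussian vectors. This is an illustrative assumption, not output from the PCA detector. The threshold is set on one sample and the false-alarm rate is measured on a fresh sample from the same distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in reconstruction errors for normal data (illustrative assumption:
# norms of 10-dimensional standard Gaussian residuals)
train_errors = np.linalg.norm(rng.normal(size=(5000, 10)), axis=1)
fresh_errors = np.linalg.norm(rng.normal(size=(5000, 10)), axis=1)

results = {}
for q in (0.90, 0.95, 0.99):
    tau = float(np.quantile(train_errors, q))            # threshold at quantile q
    false_alarm_rate = float(np.mean(fresh_errors > tau))
    results[q] = (tau, false_alarm_rate)
    print(f"q={q:.2f}  threshold={tau:.3f}  false-alarm rate={false_alarm_rate:.3f}")
```

On fresh normal data the false-alarm rate lands close to $1 - q$, which is why $q$ acts as a direct dial for how many normal points get flagged.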

### ✅ When This Works Well

✓ Normal data has low intrinsic dimension  
✓ Anomalies are different from normal patterns  
✓ You have clean training data (mostly normal examples)  

### ⚠️ Limitations

✗ If anomalies also lie in the low-D subspace (systematic drift)  
✗ If normal data is very high-dimensional with no structure  
✗ If training data contains many anomalies (contaminates the model)  
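The first limitation is easy to reproduce. The sketch below fits PCA by hand on synthetic data that lies near a 3-dimensional subspace, then scores two test points: one far from the mean but inside that subspace, and one close to the mean but orthogonal to it. The dimensions, noise level, seed, and helper name `recon_errors` are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 1000, 20, 3

# "Normal" data: points close to a k-dimensional subspace spanned by the columns of W
W, _ = np.linalg.qr(rng.normal(size=(d, k)))           # orthonormal basis, shape (d, k)
X_train = rng.normal(size=(n, k)) @ W.T + 0.05 * rng.normal(size=(n, d))

# Fit PCA by hand: training mean and top-k right singular vectors
mean = X_train.mean(axis=0)
_, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
V_k = Vt[:k].T                                          # shape (d, k)

def recon_errors(X):
    """Per-row distance between each point and its rank-k PCA reconstruction."""
    Xc = np.atleast_2d(X) - mean
    return np.linalg.norm(Xc - (Xc @ V_k) @ V_k.T, axis=1)

tau = np.quantile(recon_errors(X_train), 0.99)          # 99th-percentile threshold

# Point A: about 10 standard deviations from the mean, but inside the normal subspace
in_subspace = mean + 10.0 * W[:, 0]
# Point B: only 2 units from the mean, but orthogonal to the normal subspace
u = rng.normal(size=d)
u -= W @ (W.T @ u)                                      # strip the in-subspace component
off_subspace = mean + 2.0 * u / np.linalg.norm(u)

err_in = recon_errors(in_subspace)[0]
err_off = recon_errors(off_subspace)[0]
print(f"threshold={tau:.3f}  in-subspace error={err_in:.3f}  off-subspace error={err_off:.3f}")
print("flagged:", err_in > tau, err_off > tau)
```

Reconstruction error only measures distance from the learned subspace, so an extreme point that stays inside it passes unnoticed. Monitoring the PCA scores themselves is one common complement.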

---

## 🛠️ Implementation

Below we provide a complete anomaly detection pipeline:

1. `pca_anomaly_detector_fit()` - Train on normal data
2. `pca_anomaly_detector_predict()` - Detect anomalies in new data
3. Full demo with injected anomalies

Let's see it in action! 👇

In [None]:
# ============================================================================
# PCA-BASED ANOMALY DETECTION
# ============================================================================

def reconstruction_errors_per_row(X: np.ndarray, Xk: np.ndarray) -> np.ndarray:
    """
    Compute reconstruction error for each data point.
    
    For each row i: error[i] = ||X[i] - Xk[i]||
    
    Parameters:
    -----------
    X : ndarray of shape (n, d)
        Original data
    Xk : ndarray of shape (n, d)
        Reconstructed data
        
    Returns:
    --------
    errors : ndarray of shape (n,)
        Per-sample reconstruction errors
        
    Example:
    --------
    >>> errors = reconstruction_errors_per_row(X, X_reconstructed)
    >>> print(f"Mean error: {errors.mean():.4f}")
    >>> print(f"Max error: {errors.max():.4f}")
    """
    X = np.asarray(X, dtype=float)
    Xk = np.asarray(Xk, dtype=float)
    return np.linalg.norm(X - Xk, axis=1)


def anomaly_threshold_from_quantile(errors: np.ndarray, q: float = 0.99) -> float:
    """
    Choose anomaly threshold as a quantile of training errors.
    
    Parameters:
    -----------
    errors : ndarray
        Reconstruction errors from training data
    q : float, default=0.99
        Quantile level (0.95 = 95th percentile, 0.99 = 99th percentile)
        
    Returns:
    --------
    threshold : float
        Value such that (1-q) fraction of training errors exceed it
        
    Interpretation:
    ---------------
    - q=0.95: flag ~5% of training data as "unusual"
    - q=0.99: flag ~1% of training data as "unusual"
    - Higher q = stricter threshold = fewer false alarms
    
    Example:
    --------
    >>> threshold = anomaly_threshold_from_quantile(train_errors, q=0.99)
    >>> print(f"99th percentile: {threshold:.4f}")
    """
    return float(np.quantile(np.asarray(errors, dtype=float), q))


def pca_anomaly_detector_fit(X_train: np.ndarray, k: int, 
                             q: float = 0.99) -> Dict[str, object]:
    """
    Train PCA-based anomaly detector on normal data.
    
    Steps:
    1. Fit PCA with k components
    2. Compute reconstruction errors on training data
    3. Set threshold at q-th quantile of errors
    
    Parameters:
    -----------
    X_train : ndarray of shape (n, d)
        Training data (should be mostly normal examples)
    k : int
        Number of PCA components (controls compression)
    q : float, default=0.99
        Quantile for threshold (0.95-0.99 typical)
        
    Returns:
    --------
    detector : dict
        Trained model with keys:
        - 'pca': fitted PCA model
        - 'k': number of components
        - 'q': quantile used
        - 'threshold': anomaly threshold
        - 'train_errors': training reconstruction errors
        
    Example:
    --------
    >>> detector = pca_anomaly_detector_fit(X_train, k=10, q=0.99)
    >>> print(f"Threshold: {detector['threshold']:.4f}")
    """
    # Fit PCA
    pca = pca_fit(X_train, k=k)
    
    # Compute reconstruction errors on training data
    Z = pca_transform(X_train, pca)
    X_rec = pca_inverse_transform(Z, pca)
    errs = reconstruction_errors_per_row(X_train, X_rec)
    
    # Set threshold
    thr = anomaly_threshold_from_quantile(errs, q=q)
    
    return {
        "pca": pca, 
        "k": k, 
        "q": q, 
        "threshold": thr, 
        "train_errors": errs
    }


def pca_anomaly_detector_predict(det: Dict[str, object], 
                                 X: np.ndarray) -> Dict[str, np.ndarray]:
    """
    Detect anomalies in new data using trained detector.
    
    For each test point:
    1. Transform to PCA space
    2. Reconstruct
    3. Compute error
    4. Compare to threshold
    
    Parameters:
    -----------
    det : dict
        Trained detector from pca_anomaly_detector_fit()
    X : ndarray of shape (m, d)
        Test data
        
    Returns:
    --------
    predictions : dict
        - 'errors': reconstruction errors for each point
        - 'is_anomaly': boolean flags (True = anomaly)
        - 'threshold': the threshold used
        
    Example:
    --------
    >>> pred = pca_anomaly_detector_predict(detector, X_test)
    >>> n_anomalies = pred['is_anomaly'].sum()
    >>> print(f"Found {n_anomalies} anomalies out of {len(X_test)} points")
    """
    pca = det["pca"]
    
    # Reconstruct test data
    Z = pca_transform(X, pca)
    X_rec = pca_inverse_transform(Z, pca)
    
    # Compute errors
    errs = reconstruction_errors_per_row(X, X_rec)
    
    # Flag anomalies
    flags = errs > det["threshold"]
    
    return {
        "errors": errs, 
        "is_anomaly": flags, 
        "threshold": float(det["threshold"])
    }


# ============================================================================
# DEMO: Anomaly Detection with Synthetic Data
# ============================================================================

print("=" * 70)
print("PCA-BASED ANOMALY DETECTION DEMO")
print("=" * 70)

# Generate data
rng = np.random.default_rng(42)

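# Training data: normal examples
# Note: these samples are isotropic Gaussian, so they have no low-dimensional
# structure. The injected anomalies below are still caught because their
# norm is far larger than that of the normal samples.
print("\n🎲 Generating data...")
X_train = rng.normal(loc=0, scale=1.0, size=(800, 40))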
print(f"   Training set: {X_train.shape[0]} normal samples")

# Test data: mostly normal + some anomalies
X_test = rng.normal(loc=0, scale=1.0, size=(300, 40))

# Inject anomalies: add large noise to first 20 samples
n_anomalies = 20
X_test[:n_anomalies] += rng.normal(loc=0, scale=6.0, size=(n_anomalies, 40))
print(f"   Test set: {X_test.shape[0]} samples ({n_anomalies} injected anomalies)")

# Train detector
print("\n🔧 Training anomaly detector...")
print("   Using k=5 components, q=0.99 quantile")
det = pca_anomaly_detector_fit(X_train, k=5, q=0.99)
print(f"   Threshold set to: {det['threshold']:.4f}")
print(f"   (Training data errors: mean={det['train_errors'].mean():.4f}, "
      f"max={det['train_errors'].max():.4f})")

# Predict on test set
print("\n🔍 Detecting anomalies in test data...")
pred = pca_anomaly_detector_predict(det, X_test)

n_flagged = int(np.sum(pred["is_anomaly"]))
n_flagged_in_injected = int(np.sum(pred["is_anomaly"][:n_anomalies]))
n_flagged_in_normal = int(np.sum(pred["is_anomaly"][n_anomalies:]))

print("\n📊 RESULTS:")
print("-" * 70)
print(f"Total flagged as anomalies: {n_flagged} / {len(X_test)}")
print(f"  Among injected anomalies ({n_anomalies}): {n_flagged_in_injected} detected")
print(f"  Among normal samples ({len(X_test)-n_anomalies}): {n_flagged_in_normal} false alarms")
print(f"\nDetection rate: {n_flagged_in_injected/n_anomalies:.1%}")
print(f"False alarm rate: {n_flagged_in_normal/(len(X_test)-n_anomalies):.1%}")

# Visualize reconstruction errors
plt.figure(figsize=(12, 4))

# Left: histogram of errors
plt.subplot(1, 3, 1)
plt.hist(det["train_errors"], bins=40, alpha=0.7, label="Training (normal)", 
         color='blue', edgecolor='black')
plt.hist(pred["errors"][n_anomalies:], bins=40, alpha=0.5, label="Test (normal)", 
         color='green', edgecolor='black')
plt.hist(pred["errors"][:n_anomalies], bins=40, alpha=0.7, label="Test (anomalies)", 
         color='red', edgecolor='black')
plt.axvline(pred["threshold"], linestyle='--', color='black', linewidth=2, 
           label='Threshold')
plt.xlabel("Reconstruction Error", fontsize=11)
plt.ylabel("Count", fontsize=11)
plt.title("Distribution of Reconstruction Errors", fontweight='bold')
plt.legend()
plt.grid(alpha=0.3)

# Middle: scatter plot
plt.subplot(1, 3, 2)
idx_normal = np.arange(n_anomalies, len(X_test))
idx_anomaly = np.arange(n_anomalies)
plt.scatter(idx_normal, pred["errors"][idx_normal], alpha=0.5, s=20, 
           label='Normal', color='green')
plt.scatter(idx_anomaly, pred["errors"][:n_anomalies], alpha=0.7, s=40, 
           label='Injected anomalies', color='red', marker='^')
plt.axhline(pred["threshold"], linestyle='--', color='black', linewidth=2, 
           label='Threshold')
plt.xlabel("Sample Index", fontsize=11)
plt.ylabel("Reconstruction Error", fontsize=11)
plt.title("Per-Sample Errors (Test Set)", fontweight='bold')
plt.legend()
plt.grid(alpha=0.3)

# Right: ROC-style curve (vary threshold)
plt.subplot(1, 3, 3)
thresholds = np.linspace(det["train_errors"].min(), pred["errors"].max(), 100)
detection_rates = []
false_alarm_rates = []
for t in thresholds:
    detected = (pred["errors"][:n_anomalies] > t).sum() / n_anomalies
    false_alarms = (pred["errors"][n_anomalies:] > t).sum() / (len(X_test) - n_anomalies)
    detection_rates.append(detected)
    false_alarm_rates.append(false_alarms)

plt.plot(false_alarm_rates, detection_rates, 'b-', linewidth=2)
plt.plot([0, 1], [0, 1], 'k--', alpha=0.3, label='Random')
plt.xlabel("False Alarm Rate", fontsize=11)
plt.ylabel("Detection Rate", fontsize=11)
plt.title("Detection Performance Curve", fontweight='bold')
plt.grid(alpha=0.3)
plt.legend()

plt.tight_layout()
plt.show()

print("\n💡 Key Insights:")
print("   ✓ Anomalies have HIGHER reconstruction errors")
print("   ✓ Threshold controls the trade-off between detection and false alarms")
print("   ✓ k and q are hyperparameters to tune for your data")
print("   ✓ This method works when normal data has low intrinsic dimension!")

print("\n🎯 Try it yourself:")
print("   - Change k (number of components)")
print("   - Change q (threshold quantile)")
print("   - Inject different types of anomalies")
print("   - Use on real data!")

# Part 5: Theoretical Foundations 📖

## 11.5 Why PCA Works: Concentration of Sample Covariance

### The Setup

Given i.i.d. random vectors $X_1, \ldots, X_n \in \mathbb{R}^d$ with:
- Mean: $\mathbb{E}[X_i] = 0$ (centered)
- True covariance: $\Sigma = \mathbb{E}[X_i X_i^T]$
- Bounded norm: $\|X_i\|_2 \leq \sqrt{C}$ almost surely

The **empirical covariance** is:
$$
\hat{\Sigma} = \frac{1}{n} \sum_{i=1}^n X_i X_i^T
$$

### 🎯 Question: How close is $\hat{\Sigma}$ to $\Sigma$?

We need **matrix concentration** results!

---

## 🌟 Theorem 11.15: Matrix Bernstein Bound

**Theorem (Matrix Bernstein)**: Under the above conditions,
$$
\mathbb{P}\left( \|\hat{\Sigma} - \Sigma\|_2 > \varepsilon \right) \leq 2d \cdot \exp\left( -\frac{n\varepsilon^2}{2C(C + 2\varepsilon/3)} \right)
$$

### What This Means (Intuition):

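1. **Operator norm**: $\|\hat{\Sigma} - \Sigma\|_2$ is the largest eigenvalue (in absolute value) of the error matrix, i.e. the worst-case error in variance over all unit directions
2. **Exponential concentration**: Probability decreases exponentially with $n$!
3. **Dependence on d**: The factor $2d$ enters only logarithmically: the bound is at most $\delta$ once $n \geq \frac{2C(C + 2\varepsilon/3)}{\varepsilon^2} \log\frac{2d}{\delta}$. Note that $C$ itself usually grows with $d$, since $\operatorname{tr}(\Sigma) = \mathbb{E}\|X_i\|_2^2 \leq C$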
4. **Boundedness C**: Must have bounded samples (or sub-Gaussian tails)

### Practical Implications:

- For fixed $d$, as $n \to \infty$: $\hat{\Sigma} \to \Sigma$ (consistent estimator)
- For high dimensions ($d$ large): need more samples
- Tightness depends on $C$: the more spread out the data, the more samples the bound requires
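To get a feel for the numbers, the sketch below evaluates the right-hand side of the bound and inverts it for the sample size. The function names `bernstein_tail` and `bernstein_sample_size` are hypothetical helpers written for this illustration, independent of the notebook's own utilities. They assume the bound exactly as displayed above, with $\|X_i\|_2^2 \leq C$.

```python
import math

def bernstein_tail(n, d, C, eps):
    """Right-hand side of the Matrix Bernstein bound, capped at 1."""
    exponent = -n * eps**2 / (2.0 * C * (C + 2.0 * eps / 3.0))
    return min(1.0, 2.0 * d * math.exp(exponent))

def bernstein_sample_size(d, C, eps, delta):
    """Smallest n for which the bound is at most delta (solve 2d*exp(...) <= delta)."""
    n_real = 2.0 * C * (C + 2.0 * eps / 3.0) * math.log(2.0 * d / delta) / eps**2
    return math.ceil(n_real)

n_needed = bernstein_sample_size(d=50, C=1.0, eps=0.1, delta=0.05)
print(n_needed, bernstein_tail(n_needed, d=50, C=1.0, eps=0.1))

# Doubling d changes the requirement only through the logarithm
print(bernstein_sample_size(d=100, C=1.0, eps=0.1, delta=0.05))
```

With $C$ held fixed, doubling $d$ adds only a small logarithmic increment to the required $n$. In real data $C$ tends to grow with $d$ as well, so the total requirement grows faster than this isolated comparison suggests.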

---

## 🔧 Theorem 11.17: Weyl's Theorem (Eigenvalue Perturbation)

**Theorem (Weyl)**: If $\hat{\Sigma} = \Sigma + E$ (true + error), then:
$$
\max_i |\hat{\lambda}_i - \lambda_i| \leq \|E\|_2
$$

where $\lambda_i$ and $\hat{\lambda}_i$ are eigenvalues of $\Sigma$ and $\hat{\Sigma}$.

### What This Means (Intuition):

1. **Eigenvalue stability**: Eigenvalues can't change more than the error norm
2. **Worst-case bound**: The *maximum* eigenvalue error is bounded by operator norm
3. **Not average**: Individual eigenvalues might change less!

### Why It Matters:

- Combining Matrix Bernstein + Weyl: eigenvalues of $\hat{\Sigma}$ are close to those of $\Sigma$
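- This makes the variances (eigenvalues) that PCA estimates from finite samples reliable. Stability of the eigenvectors is a separate question that also depends on the gaps between eigenvalues; Section 11.6 avoids it by bounding the reconstruction risk directly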
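Weyl's inequality can be checked directly on simulated data. The sketch below draws samples from a Gaussian with a known diagonal covariance, an illustrative assumption, and compares the largest eigenvalue shift with the operator norm of the error matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 10, 200

# Known covariance with distinct eigenvalues, and zero-mean samples drawn from it
Sigma = np.diag(np.linspace(1.0, 5.0, d))
X = rng.normal(size=(n, d)) * np.sqrt(np.diag(Sigma))   # column j has variance Sigma[j, j]
Sigma_hat = X.T @ X / n                                  # empirical covariance (mean known to be 0)

lam = np.linalg.eigvalsh(Sigma)                          # eigenvalues in ascending order
lam_hat = np.linalg.eigvalsh(Sigma_hat)                  # same ordering, so index i matches

max_eig_shift = np.max(np.abs(lam_hat - lam))
error_norm = np.linalg.norm(Sigma_hat - Sigma, 2)        # operator norm of E

print(f"max eigenvalue shift = {max_eig_shift:.4f}  <=  ||E||_2 = {error_norm:.4f}")
```

The comparison pairs eigenvalues by rank (smallest with smallest, and so on), which is the pairing Weyl's theorem refers to.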

---

## 📊 11.6 Excess Risk: How Much Does a Finite Sample Hurt?

### The Risk Framework

Define:
- $\mathcal{P}_k$ = all rank-$k$ orthogonal projections
- Loss: $L(x, \Pi(x)) = \|x - \Pi(x)\|_2^2$ (reconstruction error)
- Risk: $R(\Pi) = \mathbb{E}_X[L(X, \Pi(X))]$

**Population optimum**: $\Pi_k^* = \arg\min_{\Pi \in \mathcal{P}_k} R(\Pi)$  
**Empirical optimum**: $\hat{\Pi}_k^* = \arg\min_{\Pi \in \mathcal{P}_k} \hat{R}(\Pi)$ (from sample)

**Excess risk**: 
$$
E_k = R(\hat{\Pi}_k^*) - R(\Pi_k^*)
$$

How much worse is our empirical solution compared to the ideal?

---

## 🌟 Lemma 11.19: Excess Risk from Covariance Error

**Lemma**: 
$$
E_k \leq \sqrt{2k} \cdot \|\Sigma - \hat{\Sigma}\|_2
$$

### What This Means:

- Excess risk is controlled by covariance estimation error
- Factor $\sqrt{k}$: more components = potentially more error accumulation
- But: if $\|\Sigma - \hat{\Sigma}\|_2$ is small (many samples), excess risk is small!
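When the true covariance is known, the excess risk can be computed exactly, because for zero-mean $X$ the risk of a projection $\Pi$ is $R(\Pi) = \operatorname{tr}(\Sigma) - \operatorname{tr}(\Pi\Sigma)$. The sketch below does this for a synthetic Gaussian and prints $\sqrt{2k}$ times the covariance error in both the operator norm (as in the lemma above) and the Frobenius norm. The data, seed, and helper names are illustrative assumptions. Only the Frobenius comparison is treated as a guaranteed inequality here, since that version follows directly from Cauchy-Schwarz: the difference of two rank-$k$ projections has Frobenius norm at most $\sqrt{2k}$.

```python
import numpy as np

def top_k_projection(S, k):
    """Orthogonal projection onto the span of the top-k eigenvectors of symmetric S."""
    _, vecs = np.linalg.eigh(S)               # eigenvalues come back in ascending order
    V_k = vecs[:, -k:]                        # last k columns belong to the largest eigenvalues
    return V_k @ V_k.T

def risk(P, Sigma):
    """R(P) = E||X - P X||^2 = trace(Sigma) - trace(P Sigma) for zero-mean X."""
    return np.trace(Sigma) - np.trace(P @ Sigma)

rng = np.random.default_rng(0)
d, n, k = 10, 200, 3
Sigma = np.diag(np.linspace(5.0, 1.0, d))
X = rng.normal(size=(n, d)) * np.sqrt(np.diag(Sigma))
Sigma_hat = X.T @ X / n

P_star = top_k_projection(Sigma, k)           # population optimum
P_hat = top_k_projection(Sigma_hat, k)        # empirical optimum
excess = risk(P_hat, Sigma) - risk(P_star, Sigma)

E = Sigma - Sigma_hat
operator_term = np.sqrt(2 * k) * np.linalg.norm(E, 2)
frobenius_term = np.sqrt(2 * k) * np.linalg.norm(E, "fro")

print(f"excess risk          = {excess:.4f}")
print(f"sqrt(2k) * ||E||_2   = {operator_term:.4f}")
print(f"sqrt(2k) * ||E||_F   = {frobenius_term:.4f}")
```

The excess risk is never negative, because the population optimum minimizes the true risk by definition.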

---

## 🎯 Theorem 11.20: Putting It All Together

**Theorem**: Combining Lemma 11.19 with Matrix Bernstein,
$$
\mathbb{P}(E_k > \varepsilon) \leq 2d \cdot \exp\left( -\frac{n\varepsilon^2}{4C(C + 2\varepsilon/3) \cdot k} \right)
$$

### What This Means:

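1. **Sample complexity**: The bound is at most $\delta$ once $n \geq \frac{4kC(C + 2\varepsilon/3)}{\varepsilon^2} \log\frac{2d}{\delta}$, so $n$ grows linearly in $k$ and only logarithmically in the explicit factor $d$ (although $C$ typically grows with $d$ too)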
2. **Trade-off**: More components ($k$ larger) requires more samples
3. **High confidence**: Exponential tail means high-probability guarantees

### Practical Take-Away:

✅ PCA is **provably good** when you have enough samples relative to $d$ and $k$  
✅ The theory matches practice: more samples → better performance  
✅ These bounds guide hyperparameter selection (choose $k$ based on $n$ and $d$)
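
To make the sample-size reading of Theorem 11.20 concrete, here is a minimal sketch that inverts the bound for $n$. The helper names `thm11_20_bound` and `required_n` and the failure probability `delta` are illustrative assumptions, not part of the lecture notes; the formula is simply the stated bound set equal to `delta` and solved for $n$.

```python
import math

def thm11_20_bound(d, n, eps, k, C):
    """Right-hand side of Theorem 11.20 as stated above."""
    denom = 4.0 * C * (C + 2.0 * eps / 3.0) * k
    return 2.0 * d * math.exp(-n * eps**2 / denom)

def required_n(d, k, eps, C, delta):
    """Smallest integer n for which the Theorem 11.20 bound is at most delta.

    Hypothetical helper: solves 2d * exp(-n * eps^2 / denom) <= delta for n.
    Assumes 0 < delta < 1.
    """
    denom = 4.0 * C * (C + 2.0 * eps / 3.0) * k
    return math.ceil(denom * math.log(2.0 * d / delta) / eps**2)

# Illustrative parameters (assumed, not taken from the notes)
n_needed = required_n(d=64, k=5, eps=0.5, C=4.0, delta=0.05)
print(n_needed, thm11_20_bound(64, n_needed, 0.5, 5, 4.0))
```

Doubling `k` doubles the required `n`, while doubling `d` only adds a `log(2)` term inside the logarithm, as long as `C` is held fixed.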

---

## 🛠️ Implementation

Below we provide utilities to:
1. Compute empirical covariance
2. Measure operator norms
3. Evaluate the theoretical bounds

These help you understand when your sample size is sufficient!

Let's see the code! 👇

In [None]:
# ============================================================================
# THEORETICAL BOUNDS UTILITIES
# ============================================================================

def empirical_covariance_centered(X: np.ndarray) -> np.ndarray:
    """
    Compute empirical covariance matrix from centered data.
    
    Formula: Σ_hat = (1/n) * X^T X
    
    Assumes X has mean 0 (each column sums to ≈ 0).
    
    Parameters:
    -----------
    X : ndarray of shape (n, d)
        Centered data matrix (rows are samples)
        
    Returns:
    --------
    Sigma_hat : ndarray of shape (d, d)
        Empirical covariance matrix
        
    Note: This is the sample covariance estimator used in PCA!
    
    Example:
    --------
    >>> X = np.random.randn(1000, 50)
    >>> X_centered = X - X.mean(axis=0)
    >>> Sigma_hat = empirical_covariance_centered(X_centered)
    >>> print(Sigma_hat.shape)  # (50, 50)
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    return (X.T @ X) / n


def operator_norm(M: np.ndarray) -> float:
    """
    Compute the operator (spectral) norm of a matrix.
    
    Operator norm: ||M||_2 = largest singular value of M
                          = sqrt(largest eigenvalue of M^T M)
    
    This is the "worst-case" stretching factor of the matrix.
    
    Parameters:
    -----------
    M : ndarray of shape (m, n)
        Any matrix
        
    Returns:
    --------
    norm : float
        ||M||_2
        
    Example:
    --------
    >>> M = np.random.randn(100, 100)
    >>> print(f"Operator norm: {operator_norm(M):.4f}")
    >>> print(f"Frobenius norm: {np.linalg.norm(M, 'fro'):.4f}")
    >>> # Operator norm is always ≤ Frobenius norm
    """
    M = np.asarray(M, dtype=float)
    return float(np.linalg.norm(M, ord=2))


def matrix_bernstein_bound_thm11_15(d: int, n: int, eps: float, C: float) -> float:
    """
    Compute the probability bound from Matrix Bernstein Theorem (Thm 11.15).
    
    Bound: P(||Σ_hat - Σ|| > eps) ≤ 2d * exp(-n*eps^2 / (2C(C + 2*eps/3)))
    
    Parameters:
    -----------
    d : int
        Dimension (number of features)
    n : int
        Sample size
    eps : float
        Error tolerance
    C : float
        Bound constant: ||X_i||^2 ≤ C almost surely
        
    Returns:
    --------
    bound : float
        Upper bound on the probability of large deviation
        
    Interpretation:
    ---------------
    - Small bound (close to 0): high confidence that ||Σ_hat - Σ|| ≤ eps
    - Large bound: not enough samples or eps too small
    - A returned value ≥ 1 is vacuous (it says nothing about the probability)
    - Want: n large enough so bound < 0.05 (95% confidence)
    
    Example:
    --------
    >>> # Check if 1000 samples sufficient for d=50, eps=0.1, C=25
    >>> prob_bound = matrix_bernstein_bound_thm11_15(d=50, n=1000, eps=0.1, C=25)
    >>> print(f"P(error > 0.1) ≤ {prob_bound:.6f}")
    """
    if d <= 0 or n <= 0 or eps <= 0 or C <= 0:
        raise ValueError("d, n, eps, C must be positive.")
    
    denom = 2.0 * C * (C + 2.0 * eps / 3.0)
    exponent = - n * eps * eps / denom
    return float(2.0 * d * math.exp(exponent))


def weyl_eigenvalue_deviation_bound(E: np.ndarray) -> float:
    """
    Apply Weyl's theorem to bound eigenvalue deviations.
    
    Weyl: max_i |λ_hat_i - λ_i| ≤ ||E||_2
    
    where E = Σ_hat - Σ is the error matrix.
    
    Parameters:
    -----------
    E : ndarray of shape (d, d)
        Error matrix (difference between empirical and true covariance)
        
    Returns:
    --------
    bound : float
        ||E||_2 - the worst-case eigenvalue deviation
        
    Interpretation:
    ---------------
    The i-th largest eigenvalue of Σ_hat is within this bound of the i-th largest eigenvalue of Σ.
    
    Example:
    --------
    >>> E = Sigma_hat - Sigma_true
    >>> max_eigen_error = weyl_eigenvalue_deviation_bound(E)
    >>> print(f"Eigenvalues differ by at most: {max_eigen_error:.6f}")
    """
    return operator_norm(E)


# ============================================================================
# DEMO: Verifying Matrix Bernstein Bound
# ============================================================================

print("=" * 70)
print("MATRIX BERNSTEIN THEOREM: COVARIANCE CONCENTRATION")
print("=" * 70)

# Generate data with known covariance
rng = np.random.default_rng(123)
d = 20
n = 4000

print(f"\n📊 Setup:")
print(f"   Dimension: d = {d}")
print(f"   Sample size: n = {n}")

# Generate standard normal data (Σ = I)
X = rng.normal(size=(n, d))
Xc, _ = center_data(X)

print(f"   True covariance: Σ = I (identity matrix)")

# Clip norms to ensure bounded condition
# We'll clip ||X_i|| to be ≤ sqrt(C)
# Note: clipping shrinks the longest rows, so the covariance of the clipped
# data is somewhat smaller than I; the error reported below includes that bias.
C_bound = 25.0  # Choose C = 25, so ||X_i|| ≤ 5
norms = np.linalg.norm(Xc, axis=1, keepdims=True)
Xc_clip = Xc / np.maximum(1.0, norms / math.sqrt(C_bound))
actual_max_norm_sq = np.max(np.linalg.norm(Xc_clip, axis=1)**2)
print(f"   Enforced bound: ||X_i||^2 ≤ C = {C_bound} (actual max: {actual_max_norm_sq:.2f})")

# Compute empirical covariance
Sigma_hat = empirical_covariance_centered(Xc_clip)
Sigma_true = np.eye(d)  # True covariance

# Compute error
E = Sigma_hat - Sigma_true
error_norm = operator_norm(E)

print(f"\n📏 EMPIRICAL RESULTS:")
print(f"   ||Σ_hat - Σ||_2 = {error_norm:.6f}")

# Apply Matrix Bernstein bound
eps_values = [0.05, 0.1, 0.15, 0.2]
print(f"\n🎯 MATRIX BERNSTEIN BOUNDS (Theorem 11.15):")
print("-" * 70)
print(f"{'eps':<10} {'Bound':<15} {'Actual ≤ eps?':<20}")
print("-" * 70)
for eps in eps_values:
    prob_bound = matrix_bernstein_bound_thm11_15(d, n, eps, C_bound)
    actual_within = "✓ Yes" if error_norm <= eps else "✗ No"
    print(f"{eps:<10.3f} {prob_bound:<15.6e} {actual_within:<20}")

# Weyl's theorem
print(f"\n🔧 WEYL'S THEOREM (Theorem 11.17):")
print("-" * 70)
eigen_bound = weyl_eigenvalue_deviation_bound(E)
print(f"   Max eigenvalue deviation bound: {eigen_bound:.6f}")

# Check actual eigenvalue deviations
eig_hat = np.linalg.eigvalsh(Sigma_hat)
eig_true = np.linalg.eigvalsh(Sigma_true)
actual_max_dev = np.max(np.abs(eig_hat - eig_true))
print(f"   Actual max eigenvalue deviation: {actual_max_dev:.6f}")
print(f"   Weyl's bound holds? {actual_max_dev <= eigen_bound + 1e-10}")

# Visualize
fig, axes = plt.subplots(1, 3, figsize=(14, 4))

# Left: Empirical vs true covariance (heatmap difference)
ax = axes[0]
im = ax.imshow(E, cmap='RdBu_r', vmin=-0.2, vmax=0.2)
ax.set_title("Error Matrix: Σ_hat - Σ", fontweight='bold')
ax.set_xlabel("Feature Index")
ax.set_ylabel("Feature Index")
plt.colorbar(im, ax=ax)

# Middle: Eigenvalue comparison
ax = axes[1]
idx = np.arange(d)
ax.plot(idx, eig_true, 'o-', label='True eigenvalues', markersize=6)
ax.plot(idx, eig_hat, 's--', label='Empirical eigenvalues', markersize=5, alpha=0.7)
ax.set_xlabel("Eigenvalue Index", fontsize=11)
ax.set_ylabel("Eigenvalue", fontsize=11)
ax.set_title("Eigenvalue Comparison", fontweight='bold')
ax.legend()
ax.grid(alpha=0.3)

# Right: Probability bounds
ax = axes[2]
eps_range = np.linspace(0.01, 0.5, 50)
bounds = [matrix_bernstein_bound_thm11_15(d, n, e, C_bound) for e in eps_range]
ax.semilogy(eps_range, bounds, 'b-', linewidth=2)
ax.axvline(error_norm, color='red', linestyle='--', linewidth=2, 
          label=f'Observed error: {error_norm:.4f}')
ax.axhline(0.05, color='green', linestyle='--', alpha=0.5, label='5% threshold')
ax.set_xlabel("Error Tolerance (ε)", fontsize=11)
ax.set_ylabel("Probability Bound (log scale)", fontsize=11)
ax.set_title("Matrix Bernstein Bound", fontweight='bold')
ax.legend()
ax.grid(alpha=0.3)

plt.tight_layout()
plt.show()

print("\n💡 Key Insights:")
print("   ✓ Empirical covariance concentrates around true covariance")
print("   ✓ Concentration is EXPONENTIAL in sample size n")
print("   ✓ Weyl's theorem guarantees eigenvalue stability")
print("   ✓ These bounds justify PCA on finite samples!")

print("\n🎯 Practical Implications:")
print(f"   • For d={d}, n={n}, we have good concentration")
print(f"   • Need n ≳ d*log(d) for reliable estimation")
print(f"   • Larger C (more spread-out data) needs more samples")
print(f"   • These are worst-case bounds - practice often better!")

## 11.6 Excess Risk Analysis: Quality of Empirical PCA

### The Problem

We want the **best** rank-$k$ projection for minimizing expected reconstruction error:
$$
\Pi_k^* = \arg\min_{\Pi \in \mathcal{P}_k} \mathbb{E}\left[ \|X - \Pi(X)\|_2^2 \right]
$$

But we only have a sample $X_1, \ldots, X_n$, so we compute:
$$
\hat{\Pi}_k^* = \arg\min_{\Pi \in \mathcal{P}_k} \frac{1}{n} \sum_{i=1}^n \|X_i - \Pi(X_i)\|_2^2
$$

**Question**: How much does using $\hat{\Pi}_k^*$ instead of $\Pi_k^*$ hurt us?

---

## 📊 Definition: Excess Risk

The **excess risk** measures the price of using finite samples:
$$
E_k = R(\hat{\Pi}_k^*) - R(\Pi_k^*)
$$

where $R(\Pi) = \mathbb{E}[\|X - \Pi(X)\|_2^2]$ is the true risk.

### Interpretation:

- $E_k = 0$: empirical solution is optimal (lucky!)
- $E_k$ small: finite sample doesn't hurt much (good!)
- $E_k$ large: need more samples (bad!)
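
When the population covariance is known, the excess risk can be computed exactly rather than bounded. The sketch below assumes mean-zero data, for which $R(\Pi) = \mathrm{tr}(\Sigma) - \mathrm{tr}(U^T \Sigma U)$ whenever $\Pi = UU^T$ with orthonormal columns in $U$. The diagonal $\Sigma$, the Gaussian sampling, and the helper names `risk` and `top_k_eigvecs` are illustrative assumptions.

```python
import numpy as np

def risk(Sigma, U):
    """True reconstruction risk tr(Sigma) - tr(U^T Sigma U) for mean-zero X
    with covariance Sigma and projection Pi = U U^T (U has orthonormal columns)."""
    return float(np.trace(Sigma) - np.trace(U.T @ Sigma @ U))

def top_k_eigvecs(S, k):
    """Orthonormal eigenvectors of symmetric S for its k largest eigenvalues."""
    _, vecs = np.linalg.eigh(S)      # eigh returns eigenvalues in ascending order
    return vecs[:, -k:]

rng = np.random.default_rng(0)
d, k, n = 10, 3, 500
Sigma = np.diag(np.linspace(5.0, 0.5, d))        # assumed population covariance
X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
Sigma_hat = X.T @ X / n                          # mean is known to be zero here

U_star = top_k_eigvecs(Sigma, k)                 # population optimum
U_hat = top_k_eigvecs(Sigma_hat, k)              # empirical optimum
excess = risk(Sigma, U_hat) - risk(Sigma, U_star)
cov_err = np.linalg.norm(Sigma - Sigma_hat, ord=2)
print(f"E_k = {excess:.4f}, ||Sigma - Sigma_hat||_2 = {cov_err:.4f}")
```

The excess risk is never negative, because `U_star` minimizes the true risk by construction.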

---

## 🌟 Lemma 11.19: Excess Risk Bound

**Lemma**: The excess risk is bounded by:
$$
E_k \leq \sqrt{2k} \cdot \|\Sigma - \hat{\Sigma}\|_2
$$

### What This Tells Us:

1. **Linear in $\sqrt{k}$**: More components → potentially more excess risk
2. **Controlled by covariance error**: If $\hat{\Sigma} \approx \Sigma$, then $E_k \approx 0$
3. **Sample size matters**: More samples → smaller $\|\Sigma - \hat{\Sigma}\|_2$ → smaller $E_k$

### Intuition:

The PCA projection depends on the eigenvectors of $\hat{\Sigma}$. If $\hat{\Sigma}$ is close to $\Sigma$, the top-$k$ subspace of $\hat{\Sigma}$ captures nearly as much of the true variance as the top-$k$ subspace of $\Sigma$, so the two projections have nearly the same risk!

---

## 🎯 Theorem 11.20: Probability Bound on Excess Risk

Combining Lemma 11.19 with Matrix Bernstein (Thm 11.15):

$$
\mathbb{P}(E_k > \varepsilon) \leq 2d \cdot \exp\left( -\frac{n\varepsilon^2}{4C(C + 2\varepsilon/3) \cdot k} \right)
$$

### What This Means:

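1. **Sample complexity**: To achieve $E_k \leq \varepsilon$ with probability at least $1 - \delta$, set the right-hand side equal to $\delta$ and solve for $n$:
   $$
   n \geq \frac{4C(C + 2\varepsilon/3) \cdot k \cdot \ln(2d/\delta)}{\varepsilon^2}
   $$

2. **Trade-offs**:
   - Larger $k$ (more components) → need MORE samples (linearly in $k$)
   - Larger $d$ (more features) → need MORE samples (logarithmically in $d$ for fixed $C$; in practice $C$ often grows with $d$ as well)
   - Smaller $\varepsilon$ (tighter bound) → need MORE samples (like $1/\varepsilon^2$)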

3. **Exponential concentration**: Probability of large excess risk drops exponentially with $n$!

### Practical Guidelines:

✅ **Rule of thumb**: Want $n \geq 10 \cdot d \cdot k$ for reliable PCA (a heuristic, not a consequence of Theorem 11.20)  
✅ **High dimensions**: If $d$ is huge, consider random projections first  
✅ **Many components**: Larger $k$ requires more data, so validate the choice on held-out data  

---

## 🛠️ Implementation

Below we provide utilities to compute and evaluate excess risk bounds:

1. `excess_risk_upper_from_cov_error()` - Apply Lemma 11.19
2. `excess_risk_tail_bound_thm11_20()` - Apply Theorem 11.20
3. Demo showing how bounds scale with parameters

Let's see the code! 👇

In [None]:
# ============================================================================
# EXCESS RISK BOUNDS
# ============================================================================

def excess_risk_upper_from_cov_error(k: int, cov_error_op_norm: float) -> float:
    """
    Apply Lemma 11.19 to bound excess risk from covariance error.
    
    Bound: E_k ≤ sqrt(2k) * ||Σ - Σ_hat||_2
    
    Parameters:
    -----------
    k : int
        Number of PCA components
    cov_error_op_norm : float
        Operator norm of covariance error: ||Σ - Σ_hat||_2
        
    Returns:
    --------
    bound : float
        Upper bound on excess risk
        
    Interpretation:
    ---------------
    This tells you: "Your empirical PCA is at most this much worse
    than the optimal PCA (in terms of reconstruction error)."
    
    Example:
    --------
    >>> error_norm = operator_norm(Sigma_hat - Sigma_true)
    >>> bound = excess_risk_upper_from_cov_error(k=10, cov_error_op_norm=error_norm)
    >>> print(f"Excess risk ≤ {bound:.6f}")
    """
    if k <= 0:
        raise ValueError("k must be positive.")
    return float(math.sqrt(2.0 * k) * cov_error_op_norm)


def excess_risk_tail_bound_thm11_20(d: int, n: int, eps: float, 
                                    k: int, C: float) -> float:
    """
    Apply Theorem 11.20 to bound probability of large excess risk.
    
    Bound: P(E_k > eps) ≤ 2d * exp(-n*eps^2 / (4*C*(C + 2*eps/3)*k))
    
    Parameters:
    -----------
    d : int
        Dimension
    n : int
        Sample size
    eps : float
        Excess risk tolerance
    k : int
        Number of PCA components
    C : float
        Bound constant: ||X_i||^2 ≤ C
        
    Returns:
    --------
    bound : float
        Upper bound on P(E_k > eps)
        
    Interpretation:
    ---------------
    Tells you the probability that your empirical PCA performs
    significantly worse than optimal PCA.
    
    Use case: Check if your sample size n is sufficient!
    
    Example:
    --------
    >>> # Is n=1000 enough for d=50, k=10, eps=0.1?
    >>> prob = excess_risk_tail_bound_thm11_20(d=50, n=1000, eps=0.1, k=10, C=25)
    >>> if prob < 0.05:
    ...     print("✓ 95% confident that excess risk < 0.1")
    ... else:
    ...     print("✗ Need more samples!")
    """
    if d <= 0 or n <= 0 or eps <= 0 or k <= 0 or C <= 0:
        raise ValueError("d, n, eps, k, C must be positive.")
    
    denom = 4.0 * C * (C + 2.0 * eps / 3.0) * k
    exponent = - n * eps * eps / denom
    return float(2.0 * d * math.exp(exponent))


# ============================================================================
# DEMO: Understanding Sample Complexity for PCA
# ============================================================================

print("=" * 70)
print("EXCESS RISK ANALYSIS: How Many Samples Do We Need?")
print("=" * 70)

# Fixed parameters
d = 64        # dimension
C_val = 25.0  # boundedness constant
eps_target = 0.1  # target excess risk

print(f"\n📊 Problem Setup:")
print(f"   Dimension: d = {d}")
print(f"   Boundedness: C = {C_val}")
print(f"   Target excess risk: ε = {eps_target}")

# Study how sample size requirement changes with k
k_values = [1, 5, 10, 20, 30, 50]
print(f"\n🎯 SAMPLE SIZE REQUIREMENTS:")
print("-" * 70)
print(f"{'k':<5} {'n (95% conf.)':<20} {'n (99% conf.)':<20}")
print("-" * 70)

for k_val in k_values:
    # Find n such that probability bound < 0.05 (95% confidence)
    # Solve: 2d * exp(-n*eps^2 / denom) < 0.05
    # n > (denom / eps^2) * log(2d / 0.05)
    denom_95 = 4.0 * C_val * (C_val + 2.0*eps_target/3.0) * k_val
    n_95 = int(np.ceil((denom_95 / (eps_target**2)) * math.log(2*d / 0.05)))
    
    denom_99 = denom_95
    n_99 = int(np.ceil((denom_99 / (eps_target**2)) * math.log(2*d / 0.01)))
    
    print(f"{k_val:<5} {n_95:<20} {n_99:<20}")

print("\n💡 Observation: More components (k) requires MORE samples!")

# Visualize: probability bound as function of n for different k
print("\n📈 Generating probability curves...")

n_range = np.logspace(2, 4, 50).astype(int)  # from 100 to 10,000
k_plot_values = [5, 10, 20, 40]

plt.figure(figsize=(12, 4))

# Left: Probability bound vs sample size
plt.subplot(1, 2, 1)
for k_val in k_plot_values:
    probs = [excess_risk_tail_bound_thm11_20(d, n, eps_target, k_val, C_val) 
             for n in n_range]
    plt.semilogy(n_range, probs, label=f'k={k_val}', linewidth=2)

plt.axhline(0.05, color='red', linestyle='--', alpha=0.5, label='5% threshold')
plt.axhline(0.01, color='orange', linestyle='--', alpha=0.5, label='1% threshold')
plt.xlabel("Sample Size (n)", fontsize=11)
plt.ylabel("P(Excess Risk > ε) [log scale]", fontsize=11)
plt.title(f"Probability Bound vs Sample Size\n(d={d}, ε={eps_target})", fontweight='bold')
plt.legend()
plt.grid(alpha=0.3)

# Right: Required sample size vs k
plt.subplot(1, 2, 2)
k_range_plot = range(1, 51)
n_required_95 = []
n_required_99 = []
for k_val in k_range_plot:
    denom = 4.0 * C_val * (C_val + 2.0*eps_target/3.0) * k_val
    n_95 = (denom / (eps_target**2)) * math.log(2*d / 0.05)
    n_99 = (denom / (eps_target**2)) * math.log(2*d / 0.01)
    n_required_95.append(n_95)
    n_required_99.append(n_99)

plt.plot(k_range_plot, n_required_95, 'o-', label='95% confidence', linewidth=2)
plt.plot(k_range_plot, n_required_99, 's-', label='99% confidence', linewidth=2)
plt.xlabel("Number of Components (k)", fontsize=11)
plt.ylabel("Required Sample Size (n)", fontsize=11)
plt.title(f"Sample Complexity vs k\n(d={d}, ε={eps_target})", fontweight='bold')
plt.legend()
plt.grid(alpha=0.3)

plt.tight_layout()
plt.show()

print("\n🎓 KEY TAKEAWAYS:")
print("-" * 70)
print("1. Sample complexity is LINEAR in k:")
print(f"   n ∝ k (more components need more data)")
print("\n2. Sample complexity is LOGARITHMIC in d for fixed C:")
print(f"   n ∝ log(2d/δ) (but C typically grows with d)")
print("\n3. Sample complexity is QUADRATIC in 1/ε:")
print(f"   n ∝ 1/ε² (tighter bounds need much more data)")
print("\n4. Rule of thumb (heuristic): n ≥ 10*d*k for reliable PCA")

print("\n🎯 PRACTICAL ADVICE:")
print("-" * 70)
print("✓ Check your sample size before running PCA")
print("✓ If n < 5*d, consider regularization or random projection")
print("✓ Cross-validation helps tune k empirically")
print("✓ These are WORST-CASE bounds - practice often better!")
print("✓ But they give you confidence when you have enough data!")

# 🎓 Chapter Summary & Quick Reference

## ✨ What We Learned

### 1. Random Projections (§11.1)
**Key Idea**: Project high-D data to low-D using random matrices  
**When to Use**: Fast compression with distance preservation  
**Pro**: Fast, simple, no training needed  
**Con**: Less compression than PCA, probabilistic guarantees

**Main Results**:
- **Theorem 11.1**: Single vector norm preservation
- **Johnson-Lindenstrauss**: All pairwise distances preserved with $k \sim \log(n)/\varepsilon^2$
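
As a quick numerical illustration of distance preservation, the sketch below projects a few points with a Gaussian matrix and compares pairwise distances before and after. It is self-contained and does not use the notebook's helpers; the sizes `n_points`, `d`, `k` and the helper `pairwise_distances` are illustrative assumptions, and `k` is not derived from the JL formula.

```python
import numpy as np

rng = np.random.default_rng(0)
n_points, d, k = 10, 1000, 200     # illustrative sizes (assumed)
X = rng.normal(size=(n_points, d))
R = rng.normal(size=(d, k))        # Gaussian entries: one sub-Gaussian choice
Y = X @ R / np.sqrt(k)             # scaling keeps squared norms unbiased

def pairwise_distances(A):
    """Euclidean distances between all pairs of rows (pairs i < j only)."""
    diff = A[:, None, :] - A[None, :, :]
    D = np.linalg.norm(diff, axis=-1)
    return D[np.triu_indices(len(A), k=1)]

# Ratio of projected to original distance for every pair; values near 1 are good.
ratios = pairwise_distances(Y) / pairwise_distances(X)
print(f"distance ratios: min={ratios.min():.3f}, max={ratios.max():.3f}")
```

Increasing `k` should tighten the spread of the ratios around 1, in line with the $1/\varepsilon^2$ dependence.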

---

### 2. SVD (§11.2)
**Key Idea**: Any matrix $A = U\Sigma V^T$ (rotation-scale-rotation)  
**When to Use**: Matrix approximation, data compression, latent structure  
**Connection**: Singular vectors are eigenvectors of $A^T A$ or $AA^T$

**Main Results**:
- **Lemma 11.7**: $v_1$ = top eigenvector of $A^T A$, $\sigma_1 = \sqrt{\lambda_1}$
- **Power Method**: Iterative algorithm to find top eigenvectors
- **Eckart-Young**: Rank-$k$ truncation is optimal approximation
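
These identities are easy to check numerically. The sketch below uses a random matrix (an assumed toy input) to verify the relation between $\sigma_1$, $v_1$ and $A^T A$, and the size of the rank-$k$ truncation error; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(30, 8))                      # toy matrix (assumed)
U, S, Vt = np.linalg.svd(A, full_matrices=False)  # singular values in descending order

# Lemma 11.7: sigma_1^2 is the largest eigenvalue of A^T A, with eigenvector v_1.
eigvals, eigvecs = np.linalg.eigh(A.T @ A)        # ascending eigenvalues
lambda_1, v_1 = eigvals[-1], eigvecs[:, -1]
sigma_ok = np.isclose(S[0], np.sqrt(lambda_1))
vector_ok = np.isclose(abs(v_1 @ Vt[0]), 1.0)     # equal up to sign

# Eckart-Young: the rank-k truncation error is given by the discarded singular values.
k = 3
A_k = (U[:, :k] * S[:k]) @ Vt[:k]
fro_ok = np.isclose(np.linalg.norm(A - A_k, 'fro')**2, np.sum(S[k:]**2))
op_ok = np.isclose(np.linalg.norm(A - A_k, 2), S[k])
print(sigma_ok, vector_ok, fro_ok, op_ok)
```

Each check should hold up to floating-point error for any matrix whose top singular value is not repeated.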

---

### 3. PCA (§11.3)
**Key Idea**: Project onto directions of maximum variance  
**When to Use**: Dimensionality reduction, feature extraction, visualization  
**How**: SVD of centered data, keep top-$k$ components

**Main Results**:
- **Theorem 11.10**: Top-$k$ PCs give best $k$-dimensional subspace
- **PCA = SVD**: PCA of centered $X$ is $XV = U\Sigma$
- **Explained Variance**: $\sum_{i=1}^k \sigma_i^2 / \sum_i \sigma_i^2$
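
The PCA/SVD correspondence can be verified directly with NumPy. This is a minimal sketch on assumed toy data, independent of the notebook's `pca_fit()`; it checks $XV = U\Sigma$, the link between singular values and covariance eigenvalues, and computes the explained variance ratio.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))   # toy correlated data (assumed)
Xc = X - X.mean(axis=0)                                   # center first

U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T                                        # PCA coordinates XV
scores_match = np.allclose(scores, U * S)                 # XV = U Sigma

# Eigenvalues of the empirical covariance (1/n) Xc^T Xc equal S^2 / n.
n = Xc.shape[0]
cov_eigs = np.linalg.eigvalsh(Xc.T @ Xc / n)[::-1]        # reorder to descending
eigs_match = np.allclose(cov_eigs, S**2 / n)

k = 2
explained = np.sum(S[:k]**2) / np.sum(S**2)
print(scores_match, eigs_match, f"explained variance (k={k}): {explained:.3f}")
```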

---

### 4. Applications (§11.4)
**Compression**: Images, signals → rank-$k$ approximation  
**Anomaly Detection**: Large reconstruction error → anomaly  
**Visualization**: Project to 2D or 3D for plotting

---

### 5. Theory (§11.5-11.6)
**Matrix Bernstein**: Sample covariance concentrates with $n \gtrsim d \log d$  
**Weyl's Theorem**: Eigenvalues stable under perturbations  
**Excess Risk**: Finite sample PCA close to optimal when $n \gtrsim C^2 k \log(d) / \varepsilon^2$

---

## 🗺️ Decision Tree: Which Method to Use?

```
Start
  │
  ├─ Need FAST compression with distance preservation?
  │  └─ YES → Use RANDOM PROJECTION
  │           • Choose k via JL lemma
  │           • Good for clustering/NN search
  │
  ├─ Need INTERPRETABLE principal directions?
  │  └─ YES → Use PCA
  │           • Centered data
  │           • Components have meaning
  │           • Good for exploration
  │
  ├─ Need BEST rank-k approximation?
  │  └─ YES → Use SVD/PCA
  │           • Eckart-Young optimality
  │           • Good for compression
  │
  ├─ Detecting ANOMALIES?
  │  └─ YES → Use PCA Reconstruction Error
  │           • Fit on normal data
  │           • Flag large errors
  │
  └─ Very HIGH dimension (d > 10,000)?
       └─ YES → Random Projection FIRST, then PCA
                • Two-stage compression
                • Saves computation
```

---

## ⚙️ Hyperparameter Selection Guide

### For PCA:

| Parameter | How to Choose | Typical Values |
|-----------|---------------|----------------|
| **k** (components) | Look at scree plot, aim for 90-95% explained variance | 5-50 depending on d |
| **Sample size n** | Need $n \gtrsim 10 \cdot d \cdot k$ | As large as possible |
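
The explained-variance rule for choosing $k$ can be automated. A minimal sketch, assuming singular values sorted in descending order (as returned by `np.linalg.svd`); the function name `choose_k_by_explained_variance` is hypothetical and not one of the notebook's utilities.

```python
import numpy as np

def choose_k_by_explained_variance(singular_values, target=0.90):
    """Smallest k whose cumulative explained variance ratio reaches `target`."""
    var = np.asarray(singular_values, dtype=float) ** 2
    cumulative = np.cumsum(var) / np.sum(var)
    k = int(np.searchsorted(cumulative, target) + 1)
    return min(k, len(var))   # guard against rounding in the last cumulative entry

# Variances 9, 4, 1 give cumulative ratios of about 0.64, 0.93, 1.00
print(choose_k_by_explained_variance([3.0, 2.0, 1.0], target=0.90))   # 2
```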

### For Random Projection:

| Parameter | How to Choose | Typical Values |
|-----------|---------------|----------------|
| **k** (dimension) | $k > 384\ln(n)/\varepsilon^2$ (JL formula) | Depends on $\varepsilon$ |
| **$\varepsilon$** (error) | Trade-off: smaller = more accurate but larger k | 0.1-0.3 |

### For Anomaly Detection:

| Parameter | How to Choose | Typical Values |
|-----------|---------------|----------------|
| **k** (components) | Capture normal pattern (90% variance) | 5-20 |
| **q** (quantile) | False positive rate: $1-q$ | 0.95-0.99 |

---

## ⚠️ Common Pitfalls & How to Avoid Them

### 1. **Forgetting to Center Data**
❌ Running PCA on non-centered data  
✅ Always use `center_data()` before PCA  
**Why**: PCA finds directions of variance around the mean

### 2. **Too Few Samples**
❌ $n < d$ or $n < 10dk$  
✅ Check sample complexity bounds first  
**Why**: Covariance estimate unreliable, excess risk large

### 3. **Choosing k Arbitrarily**
❌ "Let's use k=10 because it's a nice number"  
✅ Look at explained variance ratio, scree plot, cross-validation  
**Why**: Wrong k → either under-fitting or over-fitting

### 4. **Scaling Issues**
❌ Features on different scales (e.g., meters vs kilometers)  
✅ Standardize features first: $(x - \mu)/\sigma$  
**Why**: PCA is sensitive to scale
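
A small experiment shows the effect. The sketch below uses assumed toy data with two independent features whose units differ by a factor of 1000; `first_pc` is an illustrative helper, not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
# Two independent features measured in very different units (assumed toy data).
X = np.column_stack([rng.normal(scale=1000.0, size=n),   # large numeric scale
                     rng.normal(scale=1.0, size=n)])     # small numeric scale

def first_pc(X):
    """First principal direction of X (rows are samples) via SVD of the centered data."""
    Xc = X - X.mean(axis=0)
    return np.linalg.svd(Xc, full_matrices=False)[2][0]

Z = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize: (x - mu) / sigma per feature

print("PC1 raw:         ", np.round(np.abs(first_pc(X)), 3))
print("PC1 standardized:", np.round(np.abs(first_pc(Z)), 3))
```

On the raw data the first component is essentially the large-scale feature alone; after standardization both features enter with equal weight.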

### 5. **Interpreting Components Incorrectly**
❌ "PC1 is the most important feature"  
✅ "PC1 is the direction of maximum variance"  
**Why**: Components are linear combinations, not individual features

### 6. **Overfitting in Anomaly Detection**
❌ Using same data for training and threshold selection  
✅ Split: train PCA on clean set, tune threshold on validation  
**Why**: Avoid memorizing noise as "normal"
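
The train/validation split can be sketched in a few lines. This is a self-contained illustration on assumed synthetic data (points near a $k$-dimensional subspace), not the implementation behind `pca_anomaly_detector_fit()`; `sample_normal` and `recon_error` are hypothetical helpers.

```python
import numpy as np

rng = np.random.default_rng(3)
d, k, q = 10, 2, 0.99
basis = rng.normal(size=(k, d))      # assumed toy model: normal data lies near a k-dim subspace

def sample_normal(m):
    """Draw m 'normal' points: subspace signal plus small isotropic noise."""
    return rng.normal(size=(m, k)) @ basis + 0.1 * rng.normal(size=(m, d))

X_train, X_val = sample_normal(400), sample_normal(200)

# Fit PCA (mean and top-k directions) on the training split only.
mu = X_train.mean(axis=0)
V_k = np.linalg.svd(X_train - mu, full_matrices=False)[2][:k]

def recon_error(X):
    """Squared reconstruction error of each row after projecting onto the top-k subspace."""
    Xc = X - mu
    return np.sum((Xc - Xc @ V_k.T @ V_k) ** 2, axis=1)

# Tune the threshold on the held-out validation split, not on the training data.
threshold = np.quantile(recon_error(X_val), q)

# Test set: 5 normal points followed by 5 points far from the subspace.
X_test = np.vstack([sample_normal(5), 5.0 * rng.normal(size=(5, d))])
is_anomaly = recon_error(X_test) > threshold
print(is_anomaly)
```

Because the threshold comes from data the PCA never saw, it reflects the reconstruction error to expect on new normal points.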

### 7. **Ignoring Computational Cost**
❌ Full SVD on huge matrices  
✅ Use randomized SVD or iterative methods for large scale  
**Why**: Full SVD is $O(nd^2)$, too slow for big data
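
For reference, a basic randomized SVD fits in a few lines. The sketch below is a plain random range finder without power iterations, so it is only a starting point; the function `randomized_svd` and its `oversample` parameter are illustrative assumptions, not part of this notebook's toolkit.

```python
import numpy as np

def randomized_svd(A, k, oversample=10, seed=0):
    """Approximate top-k SVD of A via a random range finder (no power iterations)."""
    rng = np.random.default_rng(seed)
    n_cols = A.shape[1]
    ell = min(k + oversample, n_cols)
    Omega = rng.normal(size=(n_cols, ell))    # random test matrix
    Q, _ = np.linalg.qr(A @ Omega)            # orthonormal basis for the sampled range of A
    B = Q.T @ A                               # small (ell x n_cols) matrix
    U_small, S, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ U_small)[:, :k], S[:k], Vt[:k]

rng = np.random.default_rng(4)
A = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 80))   # exactly rank 5 (toy example)
U, S, Vt = randomized_svd(A, k=5)
S_exact = np.linalg.svd(A, compute_uv=False)[:5]
print(np.allclose(S, S_exact))
```

The expensive steps are the two products with `A`, and the only SVD is taken of the small matrix `B`. On matrices whose singular values decay slowly, accuracy usually benefits from a few power iterations, which this sketch omits.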

---

## 📚 Quick Function Reference

### Random Projection
```python
k = jl_required_k(n_points=1000, eps=0.2)
R = sample_subgaussian_matrix(d=500, k=k)
Y = random_projection_map(X, R, scale_by_sqrt_k=True)
errs = relative_distance_errors(X, Y)
```

### PCA
```python
pca = pca_fit(X_train, k=10)
Z_train = pca_transform(X_train, pca)
Z_test = pca_transform(X_test, pca)
X_recon = pca_inverse_transform(Z_test, pca)
```

### SVD & Approximation
```python
X_k, svd_info = rank_k_approximation(X, k=10)
error = reconstruction_error_frobenius(X, X_k)
evr = explained_variance_ratio_from_singular_values(svd_info['S'], k=10)
```

### Anomaly Detection
```python
detector = pca_anomaly_detector_fit(X_train, k=5, q=0.99)
pred = pca_anomaly_detector_predict(detector, X_test)
anomalies = X_test[pred['is_anomaly']]
```

### Theoretical Bounds
```python
prob = matrix_bernstein_bound_thm11_15(d=50, n=1000, eps=0.1, C=25)
excess_risk_bound = excess_risk_tail_bound_thm11_20(d=50, n=1000, 
                                                    eps=0.1, k=10, C=25)
```

---

## 🎯 Next Steps

**To deepen understanding**:
1. ✅ Run all cells with different parameters
2. ✅ Apply to your own dataset
3. ✅ Compare random projection vs PCA on same data
4. ✅ Implement a mini-project (e.g., image compression)

**Advanced topics to explore**:
- Kernel PCA (nonlinear dimensionality reduction)
- Sparse PCA (interpretable components)
- Randomized SVD (fast large-scale methods)
- t-SNE and UMAP (visualization methods)
- Autoencoders (neural network dimensionality reduction)

**Related chapters**:
- Chapter 10: High-dimensional statistics
- Chapter 8: Pattern recognition
- Chapter 5-6: Statistical estimation theory

---

## 🏆 Final Checklist

Before using dimensionality reduction in practice:

- [ ] Understand your data's intrinsic dimensionality
- [ ] Check sample size requirements ($n \gtrsim 10dk$)
- [ ] Center (and maybe standardize) your data
- [ ] Choose k using explained variance or cross-validation
- [ ] Validate results on held-out test set
- [ ] Interpret components carefully (linear combinations!)
- [ ] Check theoretical bounds for confidence
- [ ] Document your choices (k, preprocessing, etc.)

**Congratulations! You now have a complete toolkit for dimensionality reduction!** 🎉

---

*Created with care to make complex theory accessible. Questions? Review the detailed explanations above, run the demos, and experiment! 🚀*