# 6) High Dimensional Phenomena

### 6.1 Distance Concentration

**What is this?**
A surprising phenomenon: In high dimensions, distances between random points become very similar!

**The Curse of Dimensionality:**
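As dimension $d$ increases (for points whose coordinates are roughly independent):
- All points appear almost equally far apart
- Ratio $\frac{\text{max distance}}{\text{min distance}} \to 1$
- Intuition: A squared distance is a sum of $d$ per-coordinate terms, so its mean grows like $d$ while its spread grows only like $\sqrt{d}$; relative to the typical distance, the differences between pairs shrink like $1/\sqrt{d}$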

**Why it matters:**
- **Nearest neighbor methods degrade**: The "nearest" and "farthest" neighbors end up at almost the same distance, so the ranking carries little information
- **Clustering becomes hard**: All points seem nearly equally distant
- **Similarity search breaks down**: Dimension reduction is often needed first

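**Example** (distances from one reference point to the others; exact values depend on the sample):
- In 2D: Some points close, some far (ratio far above 1, often 10 or more when there are many points)
- In 1000D: All points at about the same distance (ratio ~1.1)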

**Takeaway:** High-dimensional data needs special treatment (PCA, manifold learning, etc.)
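
As a rough check of the example above, the sketch below measures the relative contrast $(\max - \min)/\min$ of the distances from one reference point, for several dimensions. The helper name `relative_contrast` and the chosen dimensions are illustrative assumptions; the data is standard Gaussian, as in the demo function of this section.

```python
import numpy as np

def relative_contrast(n=2000, d=2, seed=0):
    """(max - min) / min of the distances from the first point to the rest."""
    rng = np.random.default_rng(seed)
    X = rng.normal(0.0, 1.0, size=(n, d))
    # Distances from the reference point X[0] to every other point
    dist = np.linalg.norm(X[1:] - X[0], axis=1)
    return float((dist.max() - dist.min()) / dist.min())

# Hypothetical set of dimensions to compare
contrasts = {d: relative_contrast(d=d) for d in (2, 10, 100, 1000)}
for d, c in contrasts.items():
    print(f"d={d:>4}: relative contrast = {c:.3f}")
```

A contrast near 0 means the nearest and farthest points are almost equally far away, which is the concentration effect described above.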

In [24]:
import numpy as np

def distance_concentration_demo(n=2000, d=2, seed=0):
    """Summarize distances from the first of n Gaussian points in d dimensions."""
    rng = np.random.default_rng(seed)
    X = rng.normal(0, 1, size=(n, d))
    # distances from first point (drop the zero distance to itself)
    diffs = X - X[0]
    dist = np.linalg.norm(diffs, axis=1)[1:]
    return {
        "mean_dist": float(np.mean(dist)),
        "min_dist": float(np.min(dist)),
        "max_dist": float(np.max(dist)),
        "ratio_max_min": float(np.max(dist)/np.min(dist))
    }


# 7) Dimensionality Reduction & PCA

### 7.1 PCA using SVD

**What is this?**
Principal Component Analysis - finds the directions of maximum variance in your data.

**How it works:**
1. **Center the data**: Subtract mean from each feature
2. **Apply SVD**: $X_c = U \Sigma V^T$
3. **Principal components**: Columns of $V$ (or rows of $V^T$)
4. **Variance explained**: Proportional to squared singular values $\sigma_i^2$
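
The steps above can be sketched on a small synthetic dataset. The data and the `mixing` matrix are hypothetical, chosen only to give correlated features. The cross-check at the end relies on the fact that the variance along component $i$ is $\sigma_i^2/(n-1)$, which should match the eigenvalues of the sample covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical correlated data: 200 samples, 3 features
mixing = np.array([[3.0, 0.0, 0.0],
                   [1.0, 1.0, 0.0],
                   [0.0, 0.5, 0.2]])
X = rng.normal(size=(200, 3)) @ mixing

# Step 1: center each feature
Xc = X - X.mean(axis=0)

# Step 2: thin SVD, Xc = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Step 3: rows of Vt are the principal directions
# Step 4: variance along each direction and its share of the total
n = X.shape[0]
pc_variance = s**2 / (n - 1)
evr = s**2 / np.sum(s**2)

# Cross-check: eigenvalues of the sample covariance, sorted descending
eigvals = np.linalg.eigvalsh(np.cov(Xc, rowvar=False))[::-1]
print(np.allclose(pc_variance, eigvals))
print(evr)
```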

**Key outputs:**
- **mu**: Mean of original data (for centering)
- **Vt**: Principal component directions (rows are PCs)
- **s**: Singular values (related to variance)
- **evr**: Explained variance ratio per component
- **cum_evr**: Cumulative explained variance

**Transform:** Project data onto first $k$ components: $Z = X_c V_k$

**Inverse:** Reconstruct: $\hat{X} = Z V_k^T + \mu$

**Use cases:** Dimensionality reduction, visualization, noise removal, feature extraction
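
To illustrate the transform and inverse formulas, the sketch below projects onto the first $k$ components and reconstructs, for every $k$. The data, the feature scales, and the helper `project_and_reconstruct` are assumptions made for this example only.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical data: 4 features with very different variances
X = rng.normal(size=(100, 4)) * np.array([5.0, 2.0, 0.5, 0.1])

mu = X.mean(axis=0)
Xc = X - mu
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)

def project_and_reconstruct(k):
    """Return (Z, Xhat) using the first k principal components."""
    Vk = Vt[:k].T            # shape (d, k): first k PCs as columns
    Z = Xc @ Vk              # transform: Z = X_c V_k
    Xhat = Z @ Vk.T + mu     # inverse:   Xhat = Z V_k^T + mu
    return Z, Xhat

errors = []
for k in range(1, X.shape[1] + 1):
    Z, Xhat = project_and_reconstruct(k)
    errors.append(float(np.linalg.norm(X - Xhat)))
    print(f"k={k}: Z shape {Z.shape}, reconstruction error {errors[-1]:.4f}")
```

The error should shrink as $k$ grows and vanish (up to rounding) once $k$ equals the number of features.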

In [25]:
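def pca_fit(X, center=True):
    """Fit PCA via thin SVD; X has shape (n_samples, n_features)."""
    X = np.asarray(X, float)
    mu = X.mean(axis=0) if center else np.zeros(X.shape[1])
    Xc = X - mu
    # Rows of Vt are the principal directions, ordered by decreasing singular value
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    # explained variance ratio
    var = s**2
    total = var.sum()
    # Guard against 0/0 when the centered data is all zeros
    evr = var / total if total > 0 else np.zeros_like(var)
    return {"mu": mu, "U": U, "s": s, "Vt": Vt, "evr": evr, "cum_evr": np.cumsum(evr)}

def pca_transform(X, pca, k):
    """Project X onto the first k principal components."""
    X = np.asarray(X, float)
    Xc = X - pca["mu"]
    V = pca["Vt"][:k].T  # shape (n_features, k)
    Z = Xc @ V
    return Z

def pca_inverse_transform(Z, pca, k):
    """Map k-dimensional scores Z back to the original feature space."""
    Z = np.asarray(Z, float)
    V = pca["Vt"][:k].T
    Xhat = Z @ V.T + pca["mu"]
    return Xhat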


### 7.2 Choosing Number of Components (Keep X% Variance)

**What is this?**
Determines how many principal components to keep based on desired variance explained.

**Algorithm:**
Find the smallest $k$ such that cumulative variance ≥ target (e.g., 90%)

**Example:**
```
Components: [1,    2,    3,    4,    5]
Cum. Var:   [0.5, 0.7, 0.85, 0.92, 0.95]
Target = 0.90 → Choose k=4
```
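
The example above can be reproduced in a few lines. This is a minimal sketch; the helper name `smallest_k` is hypothetical, and the fallback of keeping every component when the target is never reached is an assumption.

```python
import numpy as np

def smallest_k(cum_evr, target=0.90):
    """Smallest k with cum_evr[k - 1] >= target; all components if never reached."""
    cum_evr = np.asarray(cum_evr, dtype=float)
    reached = cum_evr >= target
    if not reached.any():
        return len(cum_evr)
    # argmax on a boolean array returns the position of the first True
    return int(np.argmax(reached)) + 1

cum_evr = [0.5, 0.7, 0.85, 0.92, 0.95]
print(smallest_k(cum_evr, 0.90))  # 4
print(smallest_k(cum_evr, 0.99))  # 5 (target never reached, keep all)
```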

**Common targets:**
- **90%**: Good balance (retains most information, removes noise)
- **95%**: More conservative (less information loss)
- **99%**: Very conservative (mostly for compression)

**Trade-off:** More components = More information but higher dimension

In [26]:
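def choose_k(cum_evr, target=0.90):
    """Smallest k whose cumulative explained variance reaches target."""
    cum_evr = np.asarray(cum_evr, float)
    # searchsorted returns the first index i with cum_evr[i] >= target
    k = int(np.searchsorted(cum_evr, target) + 1)
    # Clip: rounding can leave cum_evr[-1] slightly below 1.0, and a target
    # that is never reached would otherwise give k = len(cum_evr) + 1
    return min(k, len(cum_evr))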


### Scree Plot for Explained Variance

**What is this?**
A visualization showing variance explained by each principal component, helping decide how many components to keep.

**How to read it:**
- X-axis: Component number
- Y-axis: Explained variance (or variance ratio)
- Look for "elbow" where curve flattens

**Elbow Method:**
- Sharp drop initially = Components capture real structure
- Flat tail = Components capture noise
- Choose k at the elbow (where diminishing returns start)

**Alternative: Cumulative plot:**
Shows running total of variance explained
- Find where curve reaches target (e.g., 90%, 95%)

**Example interpretation:**
```
Components 1-3: Steep drop (keep these)
Components 4+: Flat line (likely noise)
→ Choose k=3
```
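
One way to locate the elbow numerically is to look for the largest second difference of the variance ratios. The sketch below uses a hypothetical `evr` array shaped like the interpretation above (steep drop over components 1-3, flat afterwards); real scree curves are noisier, so treat the result as a hint rather than a rule.

```python
import numpy as np

# Hypothetical explained variance ratios: steep for 3 components, then flat
evr = np.array([0.45, 0.30, 0.15, 0.04, 0.03, 0.03])

# Second differences approximate curvature; entry i is centered on evr[i + 1]
curvature = np.abs(np.diff(evr, n=2))

# 0-based index of the component where the curve bends most sharply
bend_idx = int(np.argmax(curvature)) + 1

# Keep the components before the bend
k = bend_idx
print(f"sharpest bend at component {bend_idx + 1}, keep k={k}")
```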

In [None]:
def plot_scree(pca_result, max_components=None):
    """
    Create data for scree plot visualization.
    
    Parameters:
    -----------
    pca_result : dict
        Result from pca_fit function
    max_components : int, optional
        Maximum number of components to show
    
    Returns:
    --------
    component_nums : array
        Component numbers (1, 2, 3, ...)
    variance_ratios : array
        Explained variance ratio for each component
    cumulative_variance : array
        Cumulative explained variance
    """
    evr = pca_result["evr"]
    cum_evr = pca_result["cum_evr"]
    
    if max_components is not None:
        evr = evr[:max_components]
        cum_evr = cum_evr[:max_components]
    
    component_nums = np.arange(1, len(evr) + 1)
    
    return component_nums, evr, cum_evr

def find_elbow_point(variance_ratios, method='max_curvature'):
    """
    Automatically detect elbow point in scree plot.
    
    Parameters:
    -----------
    variance_ratios : array-like
        Explained variance ratios
    method : str, default='max_curvature'
        Method to find elbow ('max_curvature' or 'threshold')
    
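    Returns:
    --------
    elbow_idx : int
        0-based index of the last component before the elbow,
        so elbow_idx + 1 components are kept
    """
    var = np.asarray(variance_ratios, float)
    
    if method == 'max_curvature':
        # Find point with maximum curvature
        # Approximate second derivative
        if len(var) < 3:
            return 0
        
        # np.diff(var, n=2)[i] measures the bend at var[i + 1], so the
        # argmax i is the last component before the sharpest bend
        curvature = np.abs(np.diff(var, n=2))
        elbow_idx = int(np.argmax(curvature))
        return elbow_idx
    
    elif method == 'threshold':
        # Find where variance drops below threshold
        threshold = 0.05  # 5% variance
        below_threshold = np.where(var < threshold)[0]
        if len(below_threshold) > 0:
            # Step back one so the result is the last component at or above
            # the threshold, matching the 'max_curvature' convention
            return max(int(below_threshold[0]) - 1, 0)
        return len(var) - 1
    
    raise ValueError(f"Unknown method: {method!r}")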

### 7.3 Reconstruction Error / Anomaly Detection

**What is this?**
Uses PCA reconstruction error to detect outliers/anomalies in data.

**How it works:**
1. **Fit PCA** with $k$ components (keeping most variance)
2. **Project & reconstruct**: $\hat{X} = \text{PCA}^{-1}(\text{PCA}(X))$
3. **Compute error**: $\text{error}_i = \|X_i - \hat{X}_i\|$
4. **Find anomalies**: Largest errors are outliers
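
The four steps can be sketched end to end on synthetic data. Everything here is an assumption made for illustration: the data lies near a 2-D plane in 5-D space, and one row (index 17) is deliberately pushed off that plane so there is a known anomaly to recover.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 200, 5, 2

# Normal data: points near a k-dimensional plane plus small noise
basis = rng.normal(size=(k, d))
X = rng.normal(size=(n, k)) @ basis + 0.05 * rng.normal(size=(n, d))

# Plant one anomaly: move row 17 in a direction orthogonal to the plane
Q, _ = np.linalg.qr(basis.T)          # orthonormal basis of the plane, shape (d, k)
v = rng.normal(size=d)
off_plane = v - Q @ (Q.T @ v)         # remove the in-plane part of v
off_plane /= np.linalg.norm(off_plane)
planted = 17
X[planted] += 3.0 * off_plane

# Step 1: fit PCA with k components
mu = X.mean(axis=0)
Xc = X - mu
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
Vk = Vt[:k].T

# Step 2: project and reconstruct
Xhat = (Xc @ Vk) @ Vk.T + mu

# Step 3: per-sample reconstruction error
errors = np.linalg.norm(X - Xhat, axis=1)

# Step 4: largest errors first
top = np.argsort(errors)[::-1][:3]
print("top anomaly indices:", top)
print("their errors:", errors[top])
```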

**Why it works:**
- Normal data follows principal patterns (low reconstruction error)
- Anomalies don't fit main patterns (high reconstruction error)
- PCA captures "normal" structure, anomalies deviate

**Use cases:**
- Fraud detection
- Manufacturing defect detection
- Network intrusion detection
- Data quality checks

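**Tip:** Use fewer components (smaller $k$) for more sensitive anomaly detection; the residual then contains more of the data, so normal points also get larger errors and false alarms become more likely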

In [27]:
def reconstruction_error(X, Xhat):
    X = np.asarray(X, float)
    Xhat = np.asarray(Xhat, float)
    return np.linalg.norm(X - Xhat, axis=1)

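def top_anomalies(errors, k=10):
    errors = np.asarray(errors, float)  # allow plain lists as input
    # indices of the k largest errors, largest first
    idx = np.argsort(errors)[::-1][:k]
    return idx, errors[idx]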


### 7.4 PCA Standardization

**What is this?**
Scales features to have mean=0 and standard deviation=1 before PCA.

**Why standardize?**
- Features with large scales dominate PCA
- Example: Feature A in [0,1000], Feature B in [0,1]
  - Without standardization: PC1 ≈ Feature A direction
  - With standardization: Equal importance to both
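
The effect can be seen on two synthetic features that share one underlying signal but live on very different scales. The data below is hypothetical (feature A roughly on a 0-1000 scale, feature B roughly on a 0-1 scale), and `first_pc` is a helper introduced only for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
t = rng.normal(size=500)                              # shared latent signal
A = 500 + 200 * t + 50 * rng.normal(size=500)         # large-scale feature
B = 0.5 + 0.2 * t + 0.05 * rng.normal(size=500)       # small-scale feature
X = np.column_stack([A, B])

def first_pc(X):
    """First principal direction of X (unit vector, sign is arbitrary)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[0]

pc_raw = first_pc(X)

# Standardize each feature to mean 0, standard deviation 1
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
pc_std = first_pc(X_std)

print("PC1 loadings, raw:         ", np.abs(pc_raw))
print("PC1 loadings, standardized:", np.abs(pc_std))
```

On the raw data the first component is almost entirely feature A; after standardization both features load equally.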

**When to standardize:**
- ✓ Features have different units (meters, dollars, age)
- ✓ Features have very different scales
- ✗ Features already on same scale (e.g., all pixel values 0-255)
- ✗ Scale is meaningful for your problem

**Formula:** $X_{\text{std}} = \frac{X - \mu}{\sigma}$

**Best practice:** Fit standardization on training data, apply same transform to test data
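
A minimal sketch of that practice, on hypothetical train and test splits drawn from the same distribution: the mean and standard deviation are estimated from the training rows only and then reused unchanged on the test rows.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical splits: two features with different locations and scales
X_train = rng.normal(loc=[10.0, 200.0], scale=[2.0, 50.0], size=(300, 2))
X_test = rng.normal(loc=[10.0, 200.0], scale=[2.0, 50.0], size=(100, 2))

# Fit: statistics come from the training data only
mu = X_train.mean(axis=0)
sd = X_train.std(axis=0)

# Apply: the same mu and sd are used for both splits
X_train_std = (X_train - mu) / sd
X_test_std = (X_test - mu) / sd

print("train mean/std:", X_train_std.mean(axis=0), X_train_std.std(axis=0))
print("test  mean/std:", X_test_std.mean(axis=0), X_test_std.std(axis=0))
```

The training split comes out with mean 0 and standard deviation 1 by construction; the test split is only approximately so, which is expected.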

In [28]:
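def standardize_fit(X, eps=1e-12):
    X = np.asarray(X, float)
    mu = X.mean(axis=0)
    sd = X.std(axis=0, ddof=0)
    # (Near-)constant features: divide by 1 instead of a tiny number, so that
    # new data is not blown up by a factor of 1/eps
    sd = np.where(sd < eps, 1.0, sd)
    return mu, sd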

def standardize_apply(X, mu, sd):
    X = np.asarray(X, float)
    return (X - mu) / sd


### 7.5 Whitening Transformation

**What is this?**
Transforms data so features are uncorrelated AND have unit variance (stricter than standardization).

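**Formula:** $X_{\text{whitened}} = \sqrt{n-1}\, X_c V \Sigma^{-1}$

Where:
- $X_c$ = Centered data ($n$ samples as rows)
- $\Sigma$ = Diagonal matrix of singular values
- $V$ = PCA component directions (as columns)
- The factor $\sqrt{n-1}$ gives each whitened column unit sample variance; without it, $X_c V \Sigma^{-1} = U$ only has unit-length columns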

**Effect:**
- Removes correlation between features
- Makes all features have variance = 1
- Spheres the data distribution

**When to use:**
- Before ICA (Independent Component Analysis)
- Some neural networks benefit
- When you want uncorrelated features (whitening removes correlation, not every form of statistical dependence)

**Difference from standardization:**
- Standardization: Per-feature scaling (doesn't remove correlation)
- Whitening: Decorrelates AND scales
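
The difference can be checked numerically. The sketch below is an illustration on hypothetical correlated data; it assumes the PCA whitening form $\sqrt{n-1}\, X_c V \Sigma^{-1}$, where the $\sqrt{n-1}$ factor is what brings each column to unit sample variance.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical mixing matrix that makes the features correlated
A = np.array([[2.0, 0.0, 0.0],
              [1.5, 1.0, 0.0],
              [0.5, -0.7, 0.3]])
X = rng.normal(size=(500, 3)) @ A.T
n = X.shape[0]

# Standardization: per-feature scaling only
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
corr_std = np.corrcoef(X_std, rowvar=False)

# Whitening: rotate onto the principal directions, then rescale each one
Xc = X - X.mean(axis=0)
_, s, Vt = np.linalg.svd(Xc, full_matrices=False)
X_white = np.sqrt(n - 1) * (Xc @ Vt.T) / s
cov_white = np.cov(X_white, rowvar=False)

print("correlation after standardization:\n", np.round(corr_std, 3))
print("covariance after whitening:\n", np.round(cov_white, 3))
```

Standardization leaves the off-diagonal correlations in place, while the whitened covariance is the identity matrix up to rounding.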

In [None]:
def pca_whiten(X, pca=None, eps=1e-5):
    """
    Apply whitening transformation using PCA.
    
    Parameters:
    -----------
    X : array-like
        Data to whiten
    pca : dict, optional
        Pre-fitted PCA result. If None, fits PCA on X
    eps : float, default=1e-5
        Small constant to prevent division by zero
    
    Returns:
    --------
    X_whitened : array
        Whitened data
    pca_result : dict
        PCA parameters (for inverse transform)
    """
    X = np.asarray(X, float)
    
    # Fit PCA if not provided
    if pca is None:
        pca = pca_fit(X, center=True)
    
    # Center data
    X_centered = X - pca["mu"]
    
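    # Whiten: X_white = sqrt(n - 1) * X_c @ V @ Sigma^(-1)
    # Projecting onto V decorrelates; dividing by s / sqrt(n - 1) (the
    # per-component standard deviation) scales each column to unit variance
    V = pca["Vt"].T  # Principal components as columns
    s = pca["s"]
    n = pca["U"].shape[0]  # number of samples the PCA was fitted on
    scale = np.sqrt(max(n - 1, 1))
    
    # Prevent division by very small singular values
    s_inv = 1.0 / np.maximum(s, eps)
    
    # Broadcasting scales column j by s_inv[j], same as @ np.diag(s_inv)
    X_whitened = scale * (X_centered @ V) * s_inv
    
    return X_whitened, pca

def pca_unwhiten(X_whitened, pca):
    """
    Inverse whitening transformation.
    
    Parameters:
    -----------
    X_whitened : array-like
        Whitened data
    pca : dict
        PCA parameters from pca_whiten
    
    Returns:
    --------
    X : array
        Original-scale data
    """
    X_whitened = np.asarray(X_whitened, float)
    
    V = pca["Vt"].T
    s = pca["s"]
    n = pca["U"].shape[0]
    scale = np.sqrt(max(n - 1, 1))
    
    # Reverse transformation (exact only where s was not clipped by eps)
    X_centered = (X_whitened * s / scale) @ V.T
    X = X_centered + pca["mu"]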
    
    return X