# 🧮 Deep Mathematical Foundations of Machine Learning

## From Theory to Industrial Applications

---

**Author:** AI-Mastery-2026 Project  
**Purpose:** A comprehensive, white-box exploration of the mathematical pillars of Machine Learning.

### 🎯 Learning Objectives

By the end of this notebook, you will:

1. **Linear Algebra**: Understand vectors, matrices, eigenvalues, SVD, and their applications (PageRank, Netflix Recommendations)
2. **Calculus**: Master gradients, Jacobians, backpropagation, and the Transformer's self-attention mechanism
3. **Optimization**: Apply Lagrange multipliers, understand SVM's dual form, and learn ADMM for industrial scale
4. **Probability & Information Theory**: Grasp entropy, KL divergence, VAEs, and t-SNE
5. **Advanced Methods**: Implement Monte Carlo integration, importance sampling, and normalizing flows
6. **Network Analysis**: Build link prediction models (Facebook's supervised random walks)
7. **Bayesian Optimization**: Use Gaussian Processes and acquisition functions

---

### 📚 Prerequisites

- Basic Python programming
- Familiarity with NumPy arrays
- High school algebra and calculus concepts

---

In [None]:
# ============================================
# IMPORTS AND SETUP
# ============================================

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats, linalg
from typing import Tuple, List, Callable
import warnings

# Configuration
warnings.filterwarnings('ignore')
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('husl')
np.random.seed(42)
np.set_printoptions(precision=4, suppress=True)

print("✅ Setup complete! NumPy version:", np.__version__)

---

# Chapter 1: Linear Algebra - The Language of Data

---

Linear algebra provides the formal framework for representing and manipulating data. In ML, we rarely work with single numbers (scalars); instead, we work with **vectors** (points in space), **matrices** (transformations), and **tensors** (higher-dimensional arrays).

> **Key Insight**: A matrix is not just a table of numbers; it is a *function* that transforms one vector space into another.

## 1.1 Vector Spaces and the Dot Product

### Theory

In ML, a **vector** $\mathbf{x} \in \mathbb{R}^n$ represents an entity (an image, a user, a word). The **dot product** measures the relationship between two vectors:

$$\mathbf{x} \cdot \mathbf{y} = \mathbf{x}^T \mathbf{y} = \sum_{i=1}^n x_i y_i = \|\mathbf{x}\| \|\mathbf{y}\| \cos(\theta)$$

Where:
- $\|\mathbf{x}\| = \sqrt{\sum x_i^2}$ is the L2 norm (vector length)
- $\theta$ is the angle between the vectors

### 📐 Mathematical Example

**Given:** $\mathbf{x} = [1, 2, 3]$, $\mathbf{y} = [4, 5, 6]$

**Compute the dot product:**

$$\mathbf{x} \cdot \mathbf{y} = (1 \times 4) + (2 \times 5) + (3 \times 6) = 4 + 10 + 18 = 32$$

**Compute the norms:**

$$\|\mathbf{x}\| = \sqrt{1^2 + 2^2 + 3^2} = \sqrt{14} \approx 3.742$$

$$\|\mathbf{y}\| = \sqrt{4^2 + 5^2 + 6^2} = \sqrt{77} \approx 8.775$$

**Compute the angle:**

$$\cos(\theta) = \frac{32}{3.742 \times 8.775} = \frac{32}{32.83} \approx 0.9746$$

$$\theta = \arccos(0.9746) \approx 12.93^\circ$$

The vectors are nearly aligned (small angle)!

In [None]:
# ============================================
# 1.1 DOT PRODUCT IMPLEMENTATION
# ============================================

def dot_product_from_scratch(x: np.ndarray, y: np.ndarray) -> float:
    """
    Compute dot product manually.
    
    Args:
        x: First vector of shape (n,)
        y: Second vector of shape (n,)
    
    Returns:
        Scalar dot product value
    """
    assert len(x) == len(y), "Vectors must have same dimension"
    return sum(xi * yi for xi, yi in zip(x, y))


def compute_angle_between_vectors(x: np.ndarray, y: np.ndarray) -> float:
    """
    Compute angle (in degrees) between two vectors.
    """
    dot = np.dot(x, y)
    norm_x = np.linalg.norm(x)
    norm_y = np.linalg.norm(y)
    cos_theta = dot / (norm_x * norm_y)
    # Clamp to [-1, 1] to handle numerical errors
    cos_theta = np.clip(cos_theta, -1.0, 1.0)
    return np.degrees(np.arccos(cos_theta))


# Example from the mathematical derivation
x = np.array([1, 2, 3])
y = np.array([4, 5, 6])

print("="*50)
print("DOT PRODUCT EXAMPLE")
print("="*50)
print(f"x = {x}")
print(f"y = {y}")
print(f"\nManual dot product: {dot_product_from_scratch(x, y)}")
print(f"NumPy dot product:  {np.dot(x, y)}")
print(f"\n||x|| = {np.linalg.norm(x):.4f}")
print(f"||y|| = {np.linalg.norm(y):.4f}")
print(f"\nAngle between vectors: {compute_angle_between_vectors(x, y):.2f}°")

## 1.2 Cosine Similarity in NLP

### Theory

**Cosine similarity** is the dot product of two vectors after each has been **normalized** to unit length, so it depends only on their directions, not their magnitudes:

$$\text{sim}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \|\mathbf{b}\|}$$

**Interpretation**:
- $\text{sim} = 1$: Vectors point in the same direction (identical meaning)
- $\text{sim} = 0$: Vectors are perpendicular (unrelated)
- $\text{sim} = -1$: Vectors point in opposite directions (opposite meaning)
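
As a quick check of the statement above, here is a minimal sketch (the helper names `unit` and `cos_sim` are illustrative and not part of the notebook). It normalizes two vectors, confirms that the dot product of the unit vectors agrees with the general formula, and then evaluates the three reference cases.

```python
import numpy as np

def unit(v: np.ndarray) -> np.ndarray:
    """Scale v to unit L2 length (assumes v is not the zero vector)."""
    return v / np.linalg.norm(v)

def cos_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity via the general formula."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

# After normalization, the plain dot product is the cosine similarity
dot_of_units = float(np.dot(unit(a), unit(b)))
print(f"cos_sim(a, b)     = {cos_sim(a, b):.4f}")
print(f"unit(a) . unit(b) = {dot_of_units:.4f}")

# The three reference cases from the interpretation list
e1 = np.array([1.0, 0.0])
e2 = np.array([0.0, 1.0])
same = cos_sim(e1, 2 * e1)   # same direction, different length
perp = cos_sim(e1, e2)       # perpendicular
opp = cos_sim(e1, -e1)       # opposite direction
print(same, perp, opp)
```

Note that scaling a vector (as in `2 * e1`) leaves the similarity unchanged, which is why cosine similarity is often preferred over the raw dot product when vector lengths carry no meaning.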

### 🏭 Industrial Application: Word Embeddings

In NLP models like Word2Vec or BERT, words are converted to vectors (embeddings). Semantically similar words have high cosine similarity.

**Famous example**: $\text{king} - \text{man} + \text{woman} \approx \text{queen}$
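
The analogy can be illustrated with a toy sketch. The 2-D vectors below are invented for this illustration (axis 0 stands for "royalty", axis 1 for "gender") and are not real Word2Vec or BERT embeddings; the point is only to show the arithmetic followed by a nearest-neighbor lookup.

```python
import numpy as np

# Hypothetical 2-D embeddings, made up for illustration only
toy = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.2]),
}

def cos_sim(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Vector arithmetic for the analogy
target = toy["king"] - toy["man"] + toy["woman"]

# Search the remaining vocabulary, excluding the three query words
candidates = [w for w in toy if w not in {"king", "man", "woman"}]
best = max(candidates, key=lambda w: cos_sim(target, toy[w]))
print("king - man + woman is closest to:", best)
```

With real embeddings the match is only approximate, and excluding the query words from the candidate set is a common practical choice.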

### 📐 Mathematical Example

**Given embeddings:**
- "cat" → $[0.7, 0.5, 0.1]$
- "dog" → $[0.6, 0.6, 0.2]$
- "car" → $[0.1, 0.2, 0.9]$

**Compute similarity between "cat" and "dog":**

$$\text{sim}(\text{cat}, \text{dog}) = \frac{0.7 \times 0.6 + 0.5 \times 0.6 + 0.1 \times 0.2}{\sqrt{0.75} \times \sqrt{0.76}} = \frac{0.74}{0.755} \approx 0.98$$

High similarity! Both are animals.

In [None]:
# ============================================
# 1.2 COSINE SIMILARITY IMPLEMENTATION
# ============================================

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """
    Compute cosine similarity between two vectors.
    
    Args:
        a, b: Input vectors
    
    Returns:
        Similarity score in [-1, 1]
    """
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


# Simulated word embeddings
embeddings = {
    "cat": np.array([0.7, 0.5, 0.1]),
    "dog": np.array([0.6, 0.6, 0.2]),
    "kitten": np.array([0.65, 0.55, 0.12]),
    "car": np.array([0.1, 0.2, 0.9]),
    "truck": np.array([0.15, 0.18, 0.85]),
}

# Compute similarity matrix
words = list(embeddings.keys())
n = len(words)
sim_matrix = np.zeros((n, n))

for i, w1 in enumerate(words):
    for j, w2 in enumerate(words):
        sim_matrix[i, j] = cosine_similarity(embeddings[w1], embeddings[w2])

# Visualization
plt.figure(figsize=(8, 6))
sns.heatmap(sim_matrix, annot=True, fmt='.3f', 
            xticklabels=words, yticklabels=words,
            cmap='RdYlGn', center=0.5, vmin=0, vmax=1)
plt.title('Word Embedding Cosine Similarity Matrix', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print("\n💡 Insight: 'kitten' is the nearest neighbor of 'cat' (both felines).")
print("   'cat' and 'car' are the least similar pair (animal vs vehicle).")

## 1.3 Matrices as Linear Transformations

### Theory

A matrix $A \in \mathbb{R}^{m \times n}$ can be viewed as a **function** that transforms an $n$-dimensional vector space to an $m$-dimensional space:

$$\mathbf{y} = A\mathbf{x}$$

Common transformations:
- **Scaling**: Stretch or shrink vectors
- **Rotation**: Rotate vectors around origin
- **Projection**: Project onto lower-dimensional subspace
- **Shearing**: Skew the space
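
Of the transformations listed, projection is the only one that loses information. The sketch below builds the orthogonal projection matrix onto a line, $P = \mathbf{a}\mathbf{a}^T / (\mathbf{a}^T\mathbf{a})$, for an arbitrarily chosen direction `a` (an assumption made for this example), and checks two of its characteristic properties.

```python
import numpy as np

# Direction of the line we project onto (an arbitrary illustrative choice)
a = np.array([[2.0], [1.0]])     # column vector, shape (2, 1)

# Orthogonal projection matrix onto span{a}
P = (a @ a.T) / (a.T @ a)

v = np.array([3.0, 4.0])
v_proj = P @ v                   # component of v along a
residual = v - v_proj            # component of v orthogonal to a

print("P =\n", P)
print("projection of v:", v_proj)
print("residual . a   :", float(residual @ a.ravel()))

# Projecting twice changes nothing, and P has rank 1 (it flattens 2D onto a line)
print("P @ P == P:", np.allclose(P @ P, P))
print("rank(P):", np.linalg.matrix_rank(P))
```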

### 📐 Mathematical Example: 2D Rotation

The rotation matrix for angle $\theta$ is:

$$R_\theta = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}$$

**Rotate $\mathbf{v} = [1, 0]$ by 45°:**

$$R_{45^\circ} = \begin{bmatrix} 0.707 & -0.707 \\ 0.707 & 0.707 \end{bmatrix}$$

$$\mathbf{v}' = R_{45^\circ} \mathbf{v} = \begin{bmatrix} 0.707 \\ 0.707 \end{bmatrix}$$

In [None]:
# ============================================
# 1.3 LINEAR TRANSFORMATIONS VISUALIZATION
# ============================================

def rotation_matrix(theta_deg: float) -> np.ndarray:
    """Create 2D rotation matrix for given angle in degrees."""
    theta = np.radians(theta_deg)
    return np.array([
        [np.cos(theta), -np.sin(theta)],
        [np.sin(theta),  np.cos(theta)]
    ])


def scaling_matrix(sx: float, sy: float) -> np.ndarray:
    """Create 2D scaling matrix."""
    return np.array([[sx, 0], [0, sy]])


def shear_matrix(k: float) -> np.ndarray:
    """Create 2D shear matrix."""
    return np.array([[1, k], [0, 1]])


# Create a unit square
square = np.array([
    [0, 1, 1, 0, 0],  # x coordinates
    [0, 0, 1, 1, 0]   # y coordinates
])

# Define transformations
transformations = [
    ("Original", np.eye(2)),
    ("Rotation 45°", rotation_matrix(45)),
    ("Scale (2x, 0.5x)", scaling_matrix(2, 0.5)),
    ("Shear (k=0.5)", shear_matrix(0.5)),
]

# Visualization
fig, axes = plt.subplots(1, 4, figsize=(16, 4))

for ax, (name, T) in zip(axes, transformations):
    transformed = T @ square
    ax.fill(transformed[0], transformed[1], alpha=0.4, color='steelblue')
    ax.plot(transformed[0], transformed[1], 'b-', linewidth=2)
    ax.set_xlim(-1, 3)
    ax.set_ylim(-1, 2)
    ax.set_aspect('equal')
    ax.set_title(name, fontsize=12, fontweight='bold')
    ax.axhline(y=0, color='k', linewidth=0.5)
    ax.axvline(x=0, color='k', linewidth=0.5)
    ax.grid(True, alpha=0.3)

plt.suptitle('Matrix Transformations on a Unit Square', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print("\n🔑 Key Insight: Neural network layers are sequences of linear transformations")
print("   followed by non-linear activation functions!")

## 1.4 Convolution as Matrix Multiplication (Toeplitz Matrices)

### Theory

In **Convolutional Neural Networks (CNNs)**, the convolution operation can be reformulated as matrix multiplication using **Toeplitz matrices**. This allows:

1. Efficient GPU computation (GPUs are optimized for matrix operations)
2. Unified backpropagation with standard linear algebra

### The Process (im2col)

1. Flatten the input image patches into column vectors
2. Arrange them as columns of a matrix
3. Multiply by the flattened kernel
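
The three steps can be sketched for a 2D image as follows. This is a minimal illustration assuming stride 1, no padding, and a single channel; `im2col_2d` is a hypothetical helper name, and the kernel is applied without flipping (cross-correlation), as CNN libraries do.

```python
import numpy as np

def im2col_2d(image: np.ndarray, kh: int, kw: int) -> np.ndarray:
    """Stack every kh x kw patch (stride 1, no padding) as one column."""
    H, W = image.shape
    out_h, out_w = H - kh + 1, W - kw + 1
    cols = np.zeros((kh * kw, out_h * out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            cols[:, i * out_w + j] = patch.ravel()   # steps 1 and 2
    return cols

image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.array([[1.0, 0.0],
                   [0.0, -1.0]])

cols = im2col_2d(image, 2, 2)                  # shape (4, 9)
out = (kernel.ravel() @ cols).reshape(3, 3)    # step 3: one matrix product

# Reference: slide the kernel over the image directly
ref = np.array([[np.sum(image[i:i + 2, j:j + 2] * kernel) for j in range(3)]
                for i in range(3)])
print(out)
print("matches direct sliding window:", np.allclose(out, ref))
```

The patch matrix duplicates overlapping pixels, so im2col trades extra memory for the ability to use one large, highly optimized matrix multiplication.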

### 📐 Mathematical Example: 1D Convolution

**Input**: $x = [1, 2, 3, 4, 5]$  
**Kernel**: $k = [1, 0, -1]$ (edge detector)

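The Toeplitz matrix representation (each row is one input window in reverse order, so that $T \cdot k$ is the signal-processing convolution, with the kernel flipped):

$$T = \begin{bmatrix} 3 & 2 & 1 \\ 4 & 3 & 2 \\ 5 & 4 & 3 \end{bmatrix}$$

$$y = T \cdot k = \begin{bmatrix} 3-1 \\ 4-2 \\ 5-3 \end{bmatrix} = \begin{bmatrix} 2 \\ 2 \\ 2 \end{bmatrix}$$

CNN libraries compute cross-correlation instead (windows read left to right, no flip). For this antisymmetric kernel that only changes the sign: $[1-3,\ 2-4,\ 3-5] = [-2, -2, -2]$, which is what `conv1d_standard` in this section returns.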
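
The matrix $T$ from this example can be reproduced with `scipy.linalg.toeplitz`, which takes the first column and first row. The sketch below also compares the result against NumPy's `convolve` (kernel flipped) and `correlate` (kernel not flipped) to make the two conventions explicit.

```python
import numpy as np
from scipy.linalg import toeplitz

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
k = np.array([1.0, 0.0, -1.0])

# First column [3, 4, 5] and first row [3, 2, 1] reproduce T from the example
T = toeplitz(x[2:], x[2::-1])
y_toeplitz = T @ k

# Signal-processing convolution (kernel flipped) vs cross-correlation (no flip)
y_conv = np.convolve(x, k, mode="valid")
y_corr = np.correlate(x, k, mode="valid")

print("T =\n", T)
print("T @ k       :", y_toeplitz)
print("np.convolve :", y_conv)
print("np.correlate:", y_corr)
```

The two conventions differ only by a reversal of the kernel. Since a CNN learns its kernel weights, the choice does not affect what the network can represent.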

In [None]:
# ============================================
# 1.4 CONVOLUTION AS MATRIX MULTIPLICATION
# ============================================

def conv1d_standard(x: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """1D 'convolution' as used in CNNs: cross-correlation without kernel flip (valid mode)."""
    k_size = len(kernel)
    out_size = len(x) - k_size + 1
    result = np.zeros(out_size)
    for i in range(out_size):
        result[i] = np.sum(x[i:i+k_size] * kernel)
    return result


def conv1d_as_matmul(x: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """
    1D convolution implemented as matrix multiplication.
    Demonstrates the Toeplitz matrix approach.
    """
    k_size = len(kernel)
    out_size = len(x) - k_size + 1
    
    # Build the patch matrix (im2col): row i holds the input window x[i:i+k_size].
    # Its anti-diagonals are constant (Hankel); reversing the columns gives a Toeplitz matrix.
    patches = np.zeros((out_size, k_size))
    for i in range(out_size):
        patches[i] = x[i:i+k_size]
    
    # Matrix multiplication: patches @ kernel
    return patches @ kernel


# Example
x = np.array([1, 2, 3, 4, 5], dtype=float)
kernel = np.array([1, 0, -1], dtype=float)  # Edge detector

print("="*50)
print("CONVOLUTION AS MATRIX MULTIPLICATION")
print("="*50)
print(f"Input x = {x}")
print(f"Kernel  = {kernel}")

# Standard convolution
y_standard = conv1d_standard(x, kernel)
print(f"\nStandard convolution: {y_standard}")

# Matrix multiplication approach
y_matmul = conv1d_as_matmul(x, kernel)
print(f"Matrix multiplication: {y_matmul}")

# Show the patch matrix
out_size = len(x) - len(kernel) + 1
patches = np.zeros((out_size, len(kernel)))
for i in range(out_size):
    patches[i] = x[i:i+len(kernel)]

print(f"\nPatch matrix (im2col):")
print(patches)
print(f"\nComputation: patches @ kernel = result")

# Verify they match
assert np.allclose(y_standard, y_matmul), "Results should match!"
print("\n✅ Both methods produce identical results!")

## 1.5 Eigenvalues and Google's PageRank

### Theory

An **eigenvector** $\mathbf{v}$ of matrix $A$ is a vector that only gets scaled (not rotated) when $A$ is applied:

$$A\mathbf{v} = \lambda \mathbf{v}$$

Where $\lambda$ is the **eigenvalue** (the scaling factor).
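
A minimal numerical check of this definition, using a small symmetric matrix chosen for illustration. `np.linalg.eigh` is used because the matrix is symmetric; for general square matrices `np.linalg.eig` would be the corresponding call.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eigh handles symmetric matrices and returns eigenvalues in ascending order
eigvals, eigvecs = np.linalg.eigh(A)

# Eigenvectors are the COLUMNS of eigvecs, hence the transpose in the loop
for lam, v in zip(eigvals, eigvecs.T):
    print(f"lambda = {lam:.1f}   v = {v}   A v = {A @ v}")
    assert np.allclose(A @ v, lam * v)   # A only rescales v
```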

### 🏭 Industrial Application: Google PageRank

The web is modeled as a directed graph. PageRank finds the **stationary distribution** of a random surfer, which is the eigenvector of the transition matrix with eigenvalue $\lambda = 1$.

**Modified PageRank equation** (with damping $d = 0.85$):

$$\mathbf{R} = d \cdot M \cdot \mathbf{R} + \frac{1-d}{N} \mathbf{1}$$

### 📐 Mathematical Example: 4-Page Web

Consider a web with 4 pages:
- Page 1 → links to 2, 3
- Page 2 → links to 1, 3
- Page 3 → links to 1
- Page 4 → links to 1, 2, 3

Transition matrix $M$ (column-stochastic):

$$M = \begin{bmatrix} 0 & 0.5 & 1 & \tfrac{1}{3} \\ 0.5 & 0 & 0 & \tfrac{1}{3} \\ 0.5 & 0.5 & 0 & \tfrac{1}{3} \\ 0 & 0 & 0 & 0 \end{bmatrix}$$
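
To connect the damped equation with the eigenvector view, note that for a rank vector summing to 1 the update $d M \mathbf{R} + \frac{1-d}{N}\mathbf{1}$ equals $G\mathbf{R}$ with $G = dM + \frac{1-d}{N}\mathbf{1}\mathbf{1}^T$. The sketch below builds this matrix for the 4-page example and extracts the eigenvector with eigenvalue 1 directly, as an alternative to power iteration. The variable names are illustrative.

```python
import numpy as np

d, N = 0.85, 4
M = np.array([
    [0.0, 0.5, 1.0, 1/3],
    [0.5, 0.0, 0.0, 1/3],
    [0.5, 0.5, 0.0, 1/3],
    [0.0, 0.0, 0.0, 0.0],
])

# Follow a link with probability d, otherwise jump to a uniformly random page
G = d * M + (1 - d) / N * np.ones((N, N))

eigvals, eigvecs = np.linalg.eig(G)
idx = np.argmin(np.abs(eigvals - 1.0))   # eigenvalue closest to 1
r = np.real(eigvecs[:, idx])
r = r / r.sum()                          # rescale to a probability vector

for page, score in enumerate(r, start=1):   # 1-based labels, as in the example
    print(f"Page {page}: {score:.4f}")
print("G @ r == r:", np.allclose(G @ r, r))
```

Page 4 has no incoming links, so its score is exactly the teleportation share $(1-d)/N = 0.0375$.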

In [None]:
# ============================================
# 1.5 PAGERANK FROM SCRATCH
# ============================================

def pagerank(adj_matrix: np.ndarray, damping: float = 0.85, 
             max_iter: int = 100, tol: float = 1e-6) -> Tuple[np.ndarray, List[float]]:
    """
    Compute PageRank using power iteration.
    
    Args:
        adj_matrix: Adjacency matrix (adj[i,j]=1 if i links to j)
        damping: Damping factor (probability of following a link)
        max_iter: Maximum iterations
        tol: Convergence tolerance
    
    Returns:
        ranks: PageRank scores
        history: Convergence history
    """
    n = adj_matrix.shape[0]
    
    # Create column-stochastic transition matrix
    out_degree = adj_matrix.sum(axis=1)
    out_degree[out_degree == 0] = 1  # Avoid division by zero for dangling nodes (their column of M stays zero)
    M = (adj_matrix.T / out_degree).T
    M = M.T  # Column stochastic
    
    # Initialize uniform distribution
    r = np.ones(n) / n
    history = []
    
    # Power iteration
    for _ in range(max_iter):
        r_new = damping * M @ r + (1 - damping) / n
        diff = np.linalg.norm(r_new - r)
        history.append(diff)
        
        if diff < tol:
            break
        r = r_new
    
    return r_new / r_new.sum(), history  # Normalize


# Create adjacency matrix for our 4-page example
# adj[i,j] = 1 means page i links to page j
adj = np.array([
    [0, 1, 1, 0],  # Page 0 → 1, 2
    [1, 0, 1, 0],  # Page 1 → 0, 2
    [1, 0, 0, 0],  # Page 2 → 0
    [1, 1, 1, 0],  # Page 3 → 0, 1, 2
])

ranks, history = pagerank(adj)

print("="*50)
print("PAGERANK RESULTS")
print("="*50)
for i, rank in enumerate(ranks):
    print(f"Page {i}: {rank:.4f} ({rank*100:.1f}%)")

print(f"\n🏆 Most important page: Page {np.argmax(ranks)}")
print(f"   (Pages 0 and 2 each have three in-links, but Page 2 passes all of its rank to Page 0)")

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Bar chart of ranks
axes[0].bar(range(4), ranks, color='steelblue', edgecolor='black')
axes[0].set_xlabel('Page')
axes[0].set_ylabel('PageRank Score')
axes[0].set_title('PageRank Scores', fontweight='bold')
axes[0].set_xticks(range(4))

# Convergence plot
axes[1].semilogy(history, 'b-', linewidth=2)
axes[1].set_xlabel('Iteration')
axes[1].set_ylabel('Change (log scale)')
axes[1].set_title('Convergence of Power Iteration', fontweight='bold')
axes[1].axhline(y=1e-6, color='r', linestyle='--', label='Tolerance')
axes[1].legend()

plt.tight_layout()
plt.show()

## 1.6 SVD and Netflix Recommendation System

### Theory

**Singular Value Decomposition (SVD)** factorizes any matrix $R$ into:

$$R = U \Sigma V^T$$

Where:
- $U$: Left singular vectors (user features)
- $\Sigma$: Diagonal matrix of singular values (importance)
- $V^T$: Right singular vectors (item features)
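
A minimal sketch of the factorization with `np.linalg.svd`, using a small, fully observed rating matrix invented for illustration. It reconstructs the matrix exactly from all singular values and then forms a rank-2 approximation. Plain SVD needs every entry to be known, which is one reason recommender systems with missing ratings use the learned factorization described in the application that follows.

```python
import numpy as np

# Toy dense ratings (users x items), made up for this example
R = np.array([
    [5.0, 4.0, 1.0],
    [4.0, 5.0, 1.0],
    [1.0, 1.0, 5.0],
    [1.0, 2.0, 4.0],
])

U, s, Vt = np.linalg.svd(R, full_matrices=False)
print("singular values:", s)

# Exact reconstruction from all singular values
R_full = U @ np.diag(s) @ Vt

# Rank-k approximation: keep only the k largest singular values
k = 2
R_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Frobenius error equals the norm of the discarded singular values
err = np.linalg.norm(R - R_k)
print(f"rank-{k} error: {err:.4f}  (discarded singular values: {s[k:]})")
```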

### 🏭 Industrial Application: Netflix Prize

For recommendation systems, we approximate the sparse rating matrix:

$$R \approx P \times Q^T$$

Where:
- $P$ (users × k): User latent factors
- $Q$ (items × k): Item latent factors
- $k$: Number of latent features

The optimization objective (with regularization):

$$\min_{P, Q} \sum_{(u, i) \in \mathcal{K}} (r_{ui} - \mathbf{p}_u^T \mathbf{q}_i)^2 + \lambda (\|\mathbf{p}_u\|^2 + \|\mathbf{q}_i\|^2)$$

### Update Rules (SGD)

For each observed rating $(u, i, r_{ui})$:

$$e_{ui} = r_{ui} - \mathbf{p}_u^T \mathbf{q}_i$$

$$\mathbf{p}_u \leftarrow \mathbf{p}_u + \alpha (e_{ui} \mathbf{q}_i - \lambda \mathbf{p}_u)$$

$$\mathbf{q}_i \leftarrow \mathbf{q}_i + \alpha (e_{ui} \mathbf{p}_u - \lambda \mathbf{q}_i)$$

In [None]:
# ============================================
# 1.6 NETFLIX-STYLE RECOMMENDER (FunkSVD)
# ============================================

class MatrixFactorization:
    """
    Matrix Factorization for Collaborative Filtering.
    Implements the FunkSVD algorithm used in Netflix Prize.
    """
    
    def __init__(self, n_factors: int = 10, lr: float = 0.005, 
                 reg: float = 0.02, n_epochs: int = 100):
        self.n_factors = n_factors
        self.lr = lr
        self.reg = reg
        self.n_epochs = n_epochs
        
    def fit(self, R: np.ndarray) -> List[float]:
        """
        Train the model on rating matrix R.
        
        Args:
            R: Rating matrix (users × items), 0 = missing
            
        Returns:
            Training loss history
        """
        self.n_users, self.n_items = R.shape
        
        # Initialize latent factors with small random values
        self.P = np.random.normal(0, 0.1, (self.n_users, self.n_factors))
        self.Q = np.random.normal(0, 0.1, (self.n_items, self.n_factors))
        
        # Find observed ratings
        self.samples = [
            (u, i, R[u, i])
            for u in range(self.n_users)
            for i in range(self.n_items)
            if R[u, i] > 0
        ]
        
        history = []
        for epoch in range(self.n_epochs):
            np.random.shuffle(self.samples)
            
            for u, i, r in self.samples:
                # Prediction and error
                pred = self.P[u] @ self.Q[i]
                error = r - pred
                
                # Gradient descent updates
                P_u = self.P[u].copy()
                self.P[u] += self.lr * (error * self.Q[i] - self.reg * self.P[u])
                self.Q[i] += self.lr * (error * P_u - self.reg * self.Q[i])
            
            # Compute training loss
            loss = self._compute_loss(R)
            history.append(loss)
            
            if (epoch + 1) % 20 == 0:
                print(f"Epoch {epoch+1:3d}: Loss = {loss:.4f}")
        
        return history
    
    def _compute_loss(self, R: np.ndarray) -> float:
        """Compute regularized MSE loss."""
        loss = 0
        for u, i, r in self.samples:
            pred = self.P[u] @ self.Q[i]
            loss += (r - pred) ** 2
        # Add regularization
        loss += self.reg * (np.sum(self.P**2) + np.sum(self.Q**2))
        return loss / len(self.samples)
    
    def predict(self) -> np.ndarray:
        """Predict all ratings."""
        return self.P @ self.Q.T


# Create sample rating matrix (0 = missing)
R = np.array([
    [5, 3, 0, 1, 0],
    [4, 0, 0, 1, 0],
    [1, 1, 0, 5, 0],
    [0, 0, 5, 4, 0],
    [0, 1, 5, 4, 0],
], dtype=float)

print("="*50)
print("NETFLIX-STYLE RECOMMENDATION SYSTEM")
print("="*50)
print("Original Rating Matrix (0 = missing):")
print(R)

# Train model
model = MatrixFactorization(n_factors=3, lr=0.01, reg=0.01, n_epochs=100)
history = model.fit(R)

# Get predictions
predictions = model.predict()
print("\nPredicted Ratings:")
print(np.round(predictions, 2))

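# Highlight filled-in values
# Note: Item 4 has no observed ratings, so its factors are never updated and
# its predictions only reflect the random initialization.
print("\n🎬 Recommendations (previously missing ratings):")
for u in range(R.shape[0]):
    for i in range(R.shape[1]):
        if R[u, i] == 0:
            print(f"  User {u} → Item {i}: Predicted {predictions[u, i]:.2f}")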

---

# Chapter 2: Calculus - The Engine of Learning

---

If linear algebra is the skeleton of ML, calculus is the muscle that drives learning. The process of **training** a model is fundamentally about finding parameters that minimize a loss function. This is **optimization**, which in ML relies heavily on derivatives.

> **Key Insight**: The gradient always points toward the direction of steepest increase. To minimize, we move in the *opposite* direction.

## 2.1 Gradients and the Jacobian Matrix

### Theory

For a scalar function $f: \mathbb{R}^n \to \mathbb{R}$, the **gradient** is the vector of partial derivatives:

$$\nabla f = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix}$$

For a **vector-valued function** $\mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m$, we use the **Jacobian matrix**:

$$J = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}$$
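
The Jacobian can be approximated column by column with central differences. The sketch below is an illustration under assumptions: `numerical_jacobian` is a hypothetical helper, and the map $\mathbf{f}: \mathbb{R}^2 \to \mathbb{R}^3$ was chosen only because its analytic Jacobian is easy to write down for comparison.

```python
import numpy as np
from typing import Callable

def numerical_jacobian(func: Callable[[np.ndarray], np.ndarray],
                       x: np.ndarray, h: float = 1e-6) -> np.ndarray:
    """Approximate the m x n Jacobian of func at x with central differences."""
    x = np.asarray(x, dtype=float)
    m = len(func(x))
    J = np.zeros((m, len(x)))
    for j in range(len(x)):
        step = np.zeros_like(x)
        step[j] = h
        # Column j holds the partial derivatives with respect to x_j
        J[:, j] = (func(x + step) - func(x - step)) / (2 * h)
    return J

def func(x: np.ndarray) -> np.ndarray:
    return np.array([x[0] * x[1], np.sin(x[0]), x[1] ** 2])

def analytic_jacobian(x: np.ndarray) -> np.ndarray:
    return np.array([
        [x[1],         x[0]],
        [np.cos(x[0]), 0.0],
        [0.0,          2 * x[1]],
    ])

x0 = np.array([1.0, 2.0])
J_num = numerical_jacobian(func, x0)
J_ana = analytic_jacobian(x0)
print("numerical:\n", J_num)
print("analytic:\n", J_ana)
```

Row $i$ of the Jacobian is the gradient of output $f_i$, so for $m = 1$ the Jacobian reduces to the (transposed) gradient.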

### 📐 Mathematical Example

**Function**: $f(x, y) = x^2 + xy + y^2$

**Partial derivatives**:
- $\frac{\partial f}{\partial x} = 2x + y$
- $\frac{\partial f}{\partial y} = x + 2y$

**At point $(x, y) = (1, 2)$**:

$$\nabla f(1, 2) = \begin{bmatrix} 2(1) + 2 \\ 1 + 2(2) \end{bmatrix} = \begin{bmatrix} 4 \\ 5 \end{bmatrix}$$

The gradient magnitude: $\|\nabla f\| = \sqrt{16 + 25} = \sqrt{41} \approx 6.4$
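
To see why the gradient matters for learning, the sketch below runs plain gradient descent on the same function, starting from the point $(1, 2)$. The step size `lr = 0.1` and the 100 iterations are assumed, untuned values chosen for illustration.

```python
import numpy as np

def f(p: np.ndarray) -> float:
    x, y = p
    return x**2 + x*y + y**2

def grad_f(p: np.ndarray) -> np.ndarray:
    x, y = p
    return np.array([2*x + y, x + 2*y])

p = np.array([1.0, 2.0])   # start at the point from the example
lr = 0.1                   # step size (assumed value)
values = [f(p)]
for _ in range(100):
    p = p - lr * grad_f(p)     # step AGAINST the gradient
    values.append(f(p))

print(f"f at start: {values[0]:.4f}")
print(f"f after 100 steps: {values[-1]:.2e}")
print("final point:", p)
```

The function value decreases at every step and the iterate approaches the minimizer at the origin.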

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from typing import Callable, Tuple

# ============================================
# 2.1 GRADIENT COMPUTATION
# ============================================

def numerical_gradient(f: Callable, x: np.ndarray, h: float = 1e-5) -> np.ndarray:
    """
    Compute gradient numerically using central differences.
    
    Args:
        f: Scalar function
        x: Point to evaluate gradient
        h: Step size for finite differences
    
    Returns:
        Gradient vector
    """
    x = np.asarray(x, dtype=float)  # integer input would silently truncate the +/- h steps
    grad = np.zeros_like(x)
    for i in range(len(x)):
        x_plus = x.copy()
        x_minus = x.copy()
        x_plus[i] += h
        x_minus[i] -= h
        grad[i] = (f(x_plus) - f(x_minus)) / (2 * h)
    return grad


def analytical_gradient(x: np.ndarray) -> np.ndarray:
    """Analytical gradient of f(x,y) = x^2 + xy + y^2"""
    return np.array([2*x[0] + x[1], x[0] + 2*x[1]])


# Define our function
def f(x):
    return x[0]**2 + x[0]*x[1] + x[1]**2


# Test at point (1, 2)
point = np.array([1.0, 2.0])

print("="*50)
print("GRADIENT COMPUTATION")
print("="*50)
print(f"Function: f(x,y) = x² + xy + y²")
print(f"Point: ({point[0]}, {point[1]})")
print(f"f({point[0]}, {point[1]}) = {f(point)}")
print(f"\nNumerical gradient:  {numerical_gradient(f, point)}")
print(f"Analytical gradient: {analytical_gradient(point)}")
print(f"\nGradient magnitude: {np.linalg.norm(analytical_gradient(point)):.4f}")

In [None]:
# Visualization: Gradient field
x = np.linspace(-3, 3, 20)
y = np.linspace(-3, 3, 20)
X, Y = np.meshgrid(x, y)
Z = X**2 + X*Y + Y**2

# Gradient components
U = 2*X + Y  # df/dx
V = X + 2*Y  # df/dy

plt.figure(figsize=(10, 8))
plt.contour(X, Y, Z, levels=20, cmap='viridis', alpha=0.7)
plt.colorbar(label='f(x, y)')
plt.quiver(X, Y, U, V, color='red', alpha=0.6)
plt.scatter([1], [2], c='yellow', s=200, marker='*', edgecolors='black', zorder=5)
plt.annotate('Point (1,2)', (1.1, 2.2), fontsize=12, fontweight='bold')
plt.xlabel('x')
plt.ylabel('y')
plt.title('Gradient Field of f(x,y) = x² + xy + y²', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.axis('equal')
plt.show()

print("🔴 Red arrows show gradient direction (steepest ascent)")
print("📉 To minimize, move OPPOSITE to the gradient!")

## 2.2 Backpropagation via Chain Rule

### Theory

Neural networks are **composite functions**. If we have layers:

$$L = \text{Loss}(f_3(f_2(f_1(x))))$$

The **chain rule** lets us compute derivatives through the composition:

$$\frac{\partial L}{\partial x} = \frac{\partial L}{\partial f_3} \cdot \frac{\partial f_3}{\partial f_2} \cdot \frac{\partial f_2}{\partial f_1} \cdot \frac{\partial f_1}{\partial x}$$

### 📐 Mathematical Example: 2-Layer Network

**Forward pass**:
- Input: $x = 2$
- **Layer 1**: $h = wx = 3 \times 2 = 6$ (weight $w = 3$)
- **Activation**: $a = \text{ReLU}(h) = \max(0, 6) = 6$
- **Layer 2**: $\hat{y} = va = 0.5 \times 6 = 3$ (weight $v = 0.5$)
- **Loss**: $L = (\hat{y} - y)^2 = (3 - 4)^2 = 1$ (target $y = 4$)

**Backward pass**:

$$\frac{\partial L}{\partial \hat{y}} = 2(\hat{y} - y) = 2(3 - 4) = -2$$

$$\frac{\partial L}{\partial v} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial v} = -2 \times a = -2 \times 6 = -12$$

$$\frac{\partial L}{\partial a} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial a} = -2 \times v = -2 \times 0.5 = -1$$

$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial h} \cdot \frac{\partial h}{\partial w} = -1 \times 1 \times x = -1 \times 1 \times 2 = -2$$
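
A common way to validate hand-derived gradients is a finite-difference gradient check. The sketch below re-implements the forward pass of this example as a standalone `loss` function (a hypothetical helper, with $x = 2$ and $y = 4$ as defaults) and compares central differences against the values $-2$ and $-12$ derived above.

```python
def loss(w: float, v: float, x: float = 2.0, y: float = 4.0) -> float:
    """Forward pass of the 2-layer example, returning the squared error."""
    h = w * x
    a = max(0.0, h)      # ReLU
    y_hat = v * a
    return (y_hat - y) ** 2

w, v, eps = 3.0, 0.5, 1e-6

# Central differences: perturb one weight at a time
dL_dw = (loss(w + eps, v) - loss(w - eps, v)) / (2 * eps)
dL_dv = (loss(w, v + eps) - loss(w, v - eps)) / (2 * eps)

print(f"numerical dL/dw = {dL_dw:.6f}   (chain rule: -2)")
print(f"numerical dL/dv = {dL_dv:.6f}   (chain rule: -12)")
```

Gradient checks like this are too slow for training but are useful for catching mistakes in a backward pass.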

In [None]:
# ============================================
# 2.2 BACKPROPAGATION FROM SCRATCH
# ============================================

class SimpleNeuron:
    """A simple 2-layer network demonstrating backpropagation."""
    
    def __init__(self):
        self.w = 3.0  # Layer 1 weight
        self.v = 0.5  # Layer 2 weight
        
    def forward(self, x: float) -> Tuple[float, dict]:
        """Forward pass with caching for backprop."""
        h = self.w * x           # Linear layer 1
        a = max(0, h)            # ReLU activation
        y_hat = self.v * a       # Linear layer 2
        
        # Cache for backward pass
        cache = {'x': x, 'h': h, 'a': a, 'y_hat': y_hat}
        return y_hat, cache
    
    def backward(self, y_true: float, cache: dict) -> dict:
        """Backward pass computing all gradients."""
        x, h, a, y_hat = cache['x'], cache['h'], cache['a'], cache['y_hat']
        
        # Loss: L = (y_hat - y_true)^2
        dL_dy_hat = 2 * (y_hat - y_true)
        
        # Layer 2: y_hat = v * a
        dL_dv = dL_dy_hat * a
        dL_da = dL_dy_hat * self.v
        
        # ReLU: a = max(0, h)
        dL_dh = dL_da * (1 if h > 0 else 0)
        
        # Layer 1: h = w * x
        dL_dw = dL_dh * x
        dL_dx = dL_dh * self.w
        
        return {
            'dL/dy_hat': dL_dy_hat,
            'dL/dv': dL_dv,
            'dL/da': dL_da,
            'dL/dh': dL_dh,
            'dL/dw': dL_dw,
            'dL/dx': dL_dx,
        }


# Example from theory
net = SimpleNeuron()
x, y_true = 2.0, 4.0

print("="*60)
print("BACKPROPAGATION STEP-BY-STEP")
print("="*60)

# Forward pass
y_hat, cache = net.forward(x)
loss = (y_hat - y_true) ** 2

print("📥 FORWARD PASS:")
print(f"  Input x = {x}")
print(f"  h = w*x = {net.w}*{x} = {cache['h']}")
print(f"  a = ReLU(h) = {cache['a']}")
print(f"  ŷ = v*a = {net.v}*{cache['a']} = {y_hat}")
print(f"  Loss L = (ŷ - y)² = ({y_hat} - {y_true})² = {loss}")

# Backward pass
grads = net.backward(y_true, cache)

print("\n📤 BACKWARD PASS (Chain Rule):")
for name, value in grads.items():
    print(f"  {name} = {value}")

print("\n✅ These gradients tell us how to update w and v to reduce loss!")

## 2.3 Self-Attention Mechanism (Transformers)

### Theory

The **Self-Attention** mechanism is the heart of Transformers (GPT, BERT). It computes relationships between all positions in a sequence using:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Where:
- $Q$ (Query): "What am I looking for?"
- $K$ (Key): "What do I contain?"
- $V$ (Value): "What information do I provide?"
- $d_k$: Dimension of keys (for scaling)

### Why $\sqrt{d_k}$?

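If the components of a query and a key are independent with zero mean and unit variance, their dot product has variance $d_k$, so typical scores in $QK^T$ grow like $\sqrt{d_k}$. Large scores push softmax into saturation (outputs near 0 or 1), where its gradients become very small (**vanishing gradients**). Dividing by $\sqrt{d_k}$ brings the variance of the scores back to about 1.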
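
The scaling argument can be checked empirically. The sketch below assumes query and key components drawn independently from a standard normal distribution (an idealization, since trained models need not satisfy it) and compares the variance of the raw and scaled scores for several values of $d_k$.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the run is repeatable
n_samples = 5000

raw_var, scaled_var = {}, {}
for d_k in (4, 64, 256):
    q = rng.standard_normal((n_samples, d_k))
    k = rng.standard_normal((n_samples, d_k))
    scores = np.sum(q * k, axis=1)            # one q.k score per sample
    raw_var[d_k] = scores.var()
    scaled_var[d_k] = (scores / np.sqrt(d_k)).var()
    print(f"d_k={d_k:3d}  var(q.k)={raw_var[d_k]:7.1f}  "
          f"var(q.k / sqrt(d_k))={scaled_var[d_k]:.3f}")
```

The raw variance tracks $d_k$, while the scaled variance stays close to 1 regardless of dimension.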

### 📐 Mathematical Example: 3-Word Sentence

**Sentence**: "The cat sat"

Suppose each word has a 4-dimensional embedding:
- "The" → $[0.1, 0.2, 0.1, 0.3]$
- "cat" → $[0.5, 0.4, 0.6, 0.2]$
- "sat" → $[0.3, 0.5, 0.2, 0.4]$

We compute Q, K, V by multiplying with learned weight matrices.

In [None]:
# ============================================
# 2.3 SELF-ATTENTION FROM SCRATCH
# ============================================

def softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax."""
    exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)


def scaled_dot_product_attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
    """
    Compute scaled dot-product attention.
    
    Args:
        Q: Query matrix (seq_len, d_k)
        K: Key matrix (seq_len, d_k)
        V: Value matrix (seq_len, d_v)
    
    Returns:
        output: Attention output
        attention_weights: Attention weights matrix
    """
    d_k = Q.shape[-1]
    
    # Step 1: Compute attention scores
    scores = Q @ K.T  # (seq_len, seq_len)
    
    # Step 2: Scale by sqrt(d_k)
    scaled_scores = scores / np.sqrt(d_k)
    
    # Step 3: Apply softmax
    attention_weights = softmax(scaled_scores)
    
    # Step 4: Weighted sum of values
    output = attention_weights @ V
    
    return output, attention_weights


# Example: 3-word sentence with 4-dim embeddings
np.random.seed(42)

# Word embeddings (in practice, these come from an embedding layer)
X = np.array([
    [0.1, 0.2, 0.1, 0.3],  # "The"
    [0.5, 0.4, 0.6, 0.2],  # "cat"
    [0.3, 0.5, 0.2, 0.4],  # "sat"
])

# Weight matrices (in practice, these are learned)
d_model = 4
d_k = 3  # Query/Key dimension
d_v = 3  # Value dimension

W_Q = np.random.randn(d_model, d_k) * 0.5
W_K = np.random.randn(d_model, d_k) * 0.5
W_V = np.random.randn(d_model, d_v) * 0.5

# Compute Q, K, V
Q = X @ W_Q
K = X @ W_K
V = X @ W_V

print("="*60)
print("SELF-ATTENTION MECHANISM")
print("="*60)
print("Sentence: ['The', 'cat', 'sat']")
print(f"\nInput embeddings X (3 words × 4 dims):\n{X}")

# Compute attention
output, attn_weights = scaled_dot_product_attention(Q, K, V)

print(f"\n📊 Attention Weights (each row sums to 1):")
print(f"          The    cat    sat")
for i, word in enumerate(['The', 'cat', 'sat']):
    print(f"{word:>6}:  {attn_weights[i]}")

print(f"\n✅ Row sums (should be 1.0): {attn_weights.sum(axis=1)}")

In [None]:
# Visualization of attention weights
import seaborn as sns

words = ['The', 'cat', 'sat']

plt.figure(figsize=(6, 5))
sns.heatmap(attn_weights, annot=True, fmt='.3f', 
            xticklabels=words, yticklabels=words,
            cmap='Blues', cbar_kws={'label': 'Attention Weight'})
plt.xlabel('Key (attending to)')
plt.ylabel('Query (from)')
plt.title('Self-Attention Weights\n"How much does each word attend to others?"', fontweight='bold')
plt.tight_layout()
plt.show()

print("💡 Each word (Query) distributes its attention across all words (Keys).")
print("   The attention weights determine how much information flows from each position.")

---

# Chapter 3: Optimization - Finding the Best Solution

---

Once we have a model and a loss function, the goal is to find parameters that minimize the loss. This is **optimization**.

> **Key Insight**: In a convex problem, every local minimum is a global minimum. Non-convex problems (deep learning) can have many local minima and saddle points, yet in practice gradient-based optimizers often find solutions that work well.

## 3.1 Convex Optimization and Lagrange Multipliers

### Theory

**Convex optimization** is the "easy" case: any local minimum is a global minimum. Many classical ML algorithms (linear regression, SVM, logistic regression) are convex.

For **constrained optimization**, we use **Lagrange multipliers**. Convert:

$$\min_x f(x) \quad \text{subject to} \quad g(x) = 0$$

Into the **Lagrangian**:

$$\mathcal{L}(x, \lambda) = f(x) + \lambda g(x)$$

Then solve:
$$\nabla_x \mathcal{L} = 0 \quad \text{and} \quad \nabla_\lambda \mathcal{L} = 0$$

### 📐 Mathematical Example

**Minimize** $f(x, y) = x^2 + y^2$ **subject to** $x + y = 1$

**Lagrangian**:
$$\mathcal{L} = x^2 + y^2 + \lambda(x + y - 1)$$

**Solve**:
- $\frac{\partial \mathcal{L}}{\partial x} = 2x + \lambda = 0 \Rightarrow x = -\lambda/2$
- $\frac{\partial \mathcal{L}}{\partial y} = 2y + \lambda = 0 \Rightarrow y = -\lambda/2$
- $\frac{\partial \mathcal{L}}{\partial \lambda} = x + y - 1 = 0$

From equations 1 and 2: $x = y$. From equation 3: $2x = 1 \Rightarrow x = y = 0.5$

**Solution**: $(x^*, y^*) = (0.5, 0.5)$ with $f^* = 0.5$
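
The three stationarity conditions above are linear in $(x, y, \lambda)$, so they can also be solved as a single linear system. The snippet below is a minimal sketch of that check; the names `kkt` and `rhs` are chosen here for illustration and do not appear elsewhere in the notebook.

```python
import numpy as np

# Stationarity conditions of L = x^2 + y^2 + lam * (x + y - 1):
#   2x      + lam = 0
#        2y + lam = 0
#   x  + y        = 1
kkt = np.array([
    [2.0, 0.0, 1.0],
    [0.0, 2.0, 1.0],
    [1.0, 1.0, 0.0],
])
rhs = np.array([0.0, 0.0, 1.0])

x_opt, y_opt, lam_opt = np.linalg.solve(kkt, rhs)
f_opt = x_opt**2 + y_opt**2

print(f"x* = {x_opt:.4f}, y* = {y_opt:.4f}, lambda* = {lam_opt:.4f}")
print(f"f* = {f_opt:.4f}")
```

The solve recovers $x^* = y^* = 0.5$ and also gives the multiplier $\lambda^* = -1$. Its magnitude is the sensitivity of the optimum: for the constraint $x + y = c$ the optimal value is $c^2/2$, which changes at rate $1$ when $c = 1$.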

In [None]:
# ============================================
# 3.1 LAGRANGE MULTIPLIERS VISUALIZATION
# ============================================

from scipy.optimize import minimize

# Objective function
def objective(vars):
    x, y = vars
    return x**2 + y**2

# Constraint: x + y = 1 (equality, so we write it as x + y - 1 = 0)
constraint = {'type': 'eq', 'fun': lambda vars: vars[0] + vars[1] - 1}

# Solve
result = minimize(objective, [0, 0], constraints=constraint)

print("="*60)
print("LAGRANGE MULTIPLIERS EXAMPLE")
print("="*60)
print(f"Minimize: f(x,y) = x² + y²")
print(f"Subject to: x + y = 1")
print(f"\nSolution: x* = {result.x[0]:.4f}, y* = {result.x[1]:.4f}")
print(f"Optimal value: f* = {result.fun:.4f}")

# Visualization
fig, ax = plt.subplots(figsize=(8, 8))

# Contours of objective function
x = np.linspace(-1, 2, 100)
y = np.linspace(-1, 2, 100)
X, Y = np.meshgrid(x, y)
Z = X**2 + Y**2

contours = ax.contour(X, Y, Z, levels=20, cmap='viridis')
ax.clabel(contours, inline=True, fontsize=8)

# Constraint line
ax.plot(x, 1 - x, 'r-', linewidth=2, label='Constraint: x + y = 1')

# Optimal point
ax.scatter([0.5], [0.5], c='red', s=200, marker='*', 
           edgecolors='black', zorder=5, label=f'Optimal: (0.5, 0.5)')

ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_title('Constrained Optimization with Lagrange Multipliers', fontweight='bold')
ax.legend()
ax.set_xlim(-0.5, 1.5)
ax.set_ylim(-0.5, 1.5)
ax.set_aspect('equal')
ax.grid(True, alpha=0.3)
plt.show()

print("\n⭐ The optimal point is where the constraint line is TANGENT to a contour.")

## 3.2 SVM and the Dual Formulation

### Theory

**Support Vector Machines (SVM)** find the maximum-margin hyperplane. The **primal** problem:

$$\min_{w, b} \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i(w^T x_i + b) \geq 1$$

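Using Lagrange multipliers $\alpha_i \geq 0$ (one per training point), we get the **dual** problem:

$$\max_\alpha \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \quad \text{s.t.} \quad \alpha_i \geq 0, \quad \sum_i \alpha_i y_i = 0$$

The soft-margin variant additionally caps each multiplier at $\alpha_i \leq C$.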

**Key insight**: The dual form uses only **dot products** $K(x_i, x_j) = x_i^T x_j$, which can be replaced by any kernel function (the **kernel trick**).
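
To make the kernel trick concrete, the sketch below compares a degree-2 polynomial kernel evaluated in the original 2D space with an ordinary dot product in an explicit 3D feature space. The kernel choice and the helper names `poly_kernel` and `phi` are assumptions made for this illustration only.

```python
import numpy as np

def poly_kernel(a: np.ndarray, b: np.ndarray) -> float:
    """Homogeneous degree-2 polynomial kernel: K(a, b) = (a . b)^2."""
    return float(np.dot(a, b) ** 2)

def phi(v: np.ndarray) -> np.ndarray:
    """Explicit feature map for 2D inputs whose dot product equals poly_kernel."""
    x1, x2 = v
    return np.array([x1**2, np.sqrt(2.0) * x1 * x2, x2**2])

a = np.array([1.0, 2.0])
b = np.array([2.0, 3.0])

k_implicit = poly_kernel(a, b)               # computed in the original 2D space
k_explicit = float(np.dot(phi(a), phi(b)))   # computed in the 3D feature space

print(f"K(a, b)         = {k_implicit:.4f}")
print(f"phi(a) . phi(b) = {k_explicit:.4f}")
```

Both routes give 64. The kernel reaches that value without ever building $\phi$, which is what makes very high-dimensional (or infinite-dimensional) feature spaces affordable in the dual.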

### 📐 Mathematical Example

**Data** (simple 2D case):
- Class +1: $(1, 2), (2, 3)$
- Class -1: $(0, 0), (1, 0)$

The dual problem becomes a **Quadratic Programming** problem.

In [None]:
# ============================================
# 3.2 SVM FROM SCRATCH (SIMPLIFIED)
# ============================================

from scipy.optimize import minimize

class SimpleSVM:
    """Simplified SVM using scipy optimization."""
    
    def __init__(self, C: float = 1.0):
        self.C = C
        
    def fit(self, X: np.ndarray, y: np.ndarray):
        n_samples, n_features = X.shape
        
        # Compute Gram matrix K[i,j] = x_i · x_j
        K = X @ X.T
        
        # Dual objective (to maximize, so we negate for minimization)
        def dual_objective(alpha):
            return 0.5 * np.sum((alpha * y)[:, None] * (alpha * y)[None, :] * K) - np.sum(alpha)
        
        # Gradient of the negated dual: (K * y y^T) @ alpha - 1
        def dual_gradient(alpha):
            return (K * (y[:, None] * y[None, :])) @ alpha - 1
        
        # Constraints: 0 <= alpha <= C and sum(alpha * y) = 0
        constraints = [
            {'type': 'eq', 'fun': lambda a: np.dot(a, y)}
        ]
        bounds = [(0, self.C) for _ in range(n_samples)]
        
        # Solve
        result = minimize(
            dual_objective,
            np.zeros(n_samples),
            jac=dual_gradient,
            bounds=bounds,
            constraints=constraints,
            method='SLSQP'
        )
        
        self.alpha = result.x
        
        # Find support vectors (alpha > threshold)
        sv_mask = self.alpha > 1e-5
        self.support_vectors = X[sv_mask]
        self.support_labels = y[sv_mask]
        self.support_alphas = self.alpha[sv_mask]
        
        # Compute w = sum(alpha_i * y_i * x_i)
        self.w = np.sum((self.alpha * y)[:, None] * X, axis=0)
        
        # Compute b using support vectors
        self.b = np.mean(self.support_labels - self.support_vectors @ self.w)
        
        return self
    
    def predict(self, X: np.ndarray) -> np.ndarray:
        return np.sign(X @ self.w + self.b)


# Example data
X = np.array([
    [1, 2],   # +1
    [2, 3],   # +1  
    [0, 0],   # -1
    [1, 0],   # -1
], dtype=float)
y = np.array([1, 1, -1, -1], dtype=float)

# Train SVM
svm = SimpleSVM(C=100.0)
svm.fit(X, y)

print("="*60)
print("SUPPORT VECTOR MACHINE (DUAL FORM)")
print("="*60)
print(f"Data points: {len(X)}")
print(f"Lagrange multipliers (α): {svm.alpha}")
print(f"\nSupport vectors (α > 0):")
for sv, label, alpha in zip(svm.support_vectors, svm.support_labels, svm.support_alphas):
    print(f"  {sv} (y={int(label)}, α={alpha:.4f})")
print(f"\nWeight vector w: {svm.w}")
print(f"Bias b: {svm.b:.4f}")
print(f"\nDecision boundary: {svm.w[0]:.3f}x + {svm.w[1]:.3f}y + {svm.b:.3f} = 0")

In [None]:
# Visualization
plt.figure(figsize=(8, 6))

# Plot data points
plt.scatter(X[y == 1, 0], X[y == 1, 1], c='blue', s=100, label='Class +1', edgecolors='black')
plt.scatter(X[y == -1, 0], X[y == -1, 1], c='red', s=100, label='Class -1', edgecolors='black')

# Plot support vectors
plt.scatter(svm.support_vectors[:, 0], svm.support_vectors[:, 1], 
            s=200, facecolors='none', edgecolors='green', linewidths=2, label='Support Vectors')

# Plot decision boundary and margins
x_range = np.linspace(-0.5, 3, 100)
y_boundary = -(svm.w[0] * x_range + svm.b) / svm.w[1]
y_margin_pos = -(svm.w[0] * x_range + svm.b - 1) / svm.w[1]
y_margin_neg = -(svm.w[0] * x_range + svm.b + 1) / svm.w[1]

plt.plot(x_range, y_boundary, 'k-', linewidth=2, label='Decision Boundary')
plt.plot(x_range, y_margin_pos, 'k--', linewidth=1)
plt.plot(x_range, y_margin_neg, 'k--', linewidth=1)
plt.fill_between(x_range, y_margin_neg, y_margin_pos, alpha=0.1, color='yellow')

plt.xlabel('x₁')
plt.ylabel('x₂')
plt.title('SVM: Maximum Margin Classifier', fontweight='bold')
plt.legend(loc='upper left')
plt.xlim(-0.5, 3)
plt.ylim(-1, 4)
plt.grid(True, alpha=0.3)
plt.show()

print("💡 The margin (yellow area) is maximized.")
print("   Only support vectors determine the decision boundary!")

## 3.3 ADMM: Industrial-Scale Optimization (Uber Case)

### Theory

**Alternating Direction Method of Multipliers (ADMM)** solves problems of the form:

$$\min_{x, z} f(x) + g(z) \quad \text{s.t.} \quad Ax + Bz = c$$

In scaled form, with penalty parameter $\rho > 0$ and scaled dual variable $u$, the algorithm alternates between:

1. **x-update**: $x^{k+1} = \arg\min_x (f(x) + \frac{\rho}{2}\|Ax + Bz^k - c + u^k\|_2^2)$
2. **z-update**: $z^{k+1} = \arg\min_z (g(z) + \frac{\rho}{2}\|Ax^{k+1} + Bz - c + u^k\|_2^2)$
3. **Dual update**: $u^{k+1} = u^k + Ax^{k+1} + Bz^{k+1} - c$

### 🏭 Industrial Application: Uber

Uber uses ADMM for budget allocation across cities. Each city solves a local problem, while the global constraint ensures total budget is respected.

### 📐 Mathematical Example: LASSO Regression

$$\min_x \frac{1}{2}\|Ax - b\|_2^2 + \lambda\|x\|_1$$

We introduce $z = x$ and solve using ADMM.
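
With $f(x) = \frac{1}{2}\|Ax - b\|_2^2$, $g(z) = \lambda\|z\|_1$ and the constraint $x - z = 0$, the z-update splits into independent scalar problems $\min_z \frac{\lambda}{\rho}|z| + \frac{1}{2}(z - v)^2$ with $v = x + u$, whose minimizer is the soft-thresholding operator. The sketch below checks that closed form against a brute-force grid search; the helper `brute_force_prox` and the test values are illustrative assumptions.

```python
import numpy as np

def soft_threshold(v: float, t: float) -> float:
    """Closed-form minimizer of t*|z| + 0.5*(z - v)^2."""
    return float(np.sign(v) * max(abs(v) - t, 0.0))

def brute_force_prox(v: float, t: float) -> float:
    """Minimize the same objective by exhaustive search on a fine grid."""
    grid = np.linspace(-5.0, 5.0, 100001)  # grid spacing 1e-4
    objective = t * np.abs(grid) + 0.5 * (grid - v) ** 2
    return float(grid[np.argmin(objective)])

t = 0.5  # plays the role of lambda / rho
for v in [-2.0, -0.3, 0.0, 0.4, 1.7]:
    print(f"v = {v:+.1f}: closed form = {soft_threshold(v, t):+.4f}, "
          f"grid search = {brute_force_prox(v, t):+.4f}")
```

Inputs with $|v| \leq t$ are mapped exactly to zero, which is how the $\ell_1$ penalty produces sparse solutions.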

In [None]:
# ============================================
# 3.3 ADMM FOR LASSO
# ============================================

def soft_threshold(x: np.ndarray, threshold: float) -> np.ndarray:
    """Soft thresholding operator (proximal operator for L1)."""
    return np.sign(x) * np.maximum(np.abs(x) - threshold, 0)


def admm_lasso(A: np.ndarray, b: np.ndarray, lam: float, 
               rho: float = 1.0, max_iter: int = 100, tol: float = 1e-6):
    """
    Solve LASSO using ADMM.
    
    min (1/2)||Ax - b||² + λ||x||₁
    """
    n = A.shape[1]
    
    # Initialize
    x = np.zeros(n)
    z = np.zeros(n)
    u = np.zeros(n)
    
    # Precompute (A^T A + ρI)^{-1}, which is reused in every x-update
    AtA = A.T @ A
    Atb = A.T @ b
    L = AtA + rho * np.eye(n)
    L_inv = np.linalg.inv(L)
    
    history = []
    
    for k in range(max_iter):
        # x-update: solve (A^T A + ρI)x = A^T b + ρ(z - u)
        x = L_inv @ (Atb + rho * (z - u))
        
        # z-update: soft thresholding
        z_new = soft_threshold(x + u, lam / rho)
        
        # u-update (dual variable)
        u = u + x - z_new
        
        # Check convergence
        primal_residual = np.linalg.norm(x - z_new)
        history.append(primal_residual)
        
        z = z_new
        
        if primal_residual < tol:
            break
    
    return x, history


# Generate example data
np.random.seed(42)
n_samples, n_features = 50, 20
A = np.random.randn(n_samples, n_features)
true_x = np.zeros(n_features)
true_x[:5] = [3, -2, 1.5, -1, 0.5]  # Only 5 non-zero coefficients
b = A @ true_x + 0.1 * np.random.randn(n_samples)

# Solve LASSO
lam = 0.5
x_lasso, history = admm_lasso(A, b, lam, rho=1.0, max_iter=200)

print("="*60)
print("ADMM FOR LASSO REGRESSION")
print("="*60)
print(f"Problem: min (1/2)||Ax - b||² + {lam}||x||₁")
print(f"True sparse coefficients:   {true_x[:8]}...")
print(f"LASSO solution (rounded):   {np.round(x_lasso[:8], 3)}...")
print(f"\nConverged in {len(history)} iterations")
print(f"Non-zero coefficients: {np.sum(np.abs(x_lasso) > 0.01)} (true: 5)")

---

# Chapter 4: Probability & Information Theory

---

Dealing with **uncertainty** is fundamental to AI. Probability provides the language to describe uncertainty, and **information theory** provides measures of it.

> **Key Insight**: The real world is noisy and uncertain. Probabilistic models embrace this reality rather than fighting it.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

np.random.seed(42)
plt.style.use('seaborn-v0_8-whitegrid')

## 4.1 Entropy and KL Divergence

### Theory

**Entropy** $H(P)$ measures the uncertainty or randomness in a distribution:

$$H(P) = -\sum_x P(x) \log P(x)$$

Properties:
- Maximum for uniform distribution
- Zero for deterministic distribution (one outcome has probability 1)

**KL Divergence** $D_{KL}(P||Q)$ measures how different distribution $Q$ is from $P$:

$$D_{KL}(P||Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$$

Properties:
- Always non-negative ($\geq 0$)
- Zero if and only if $P = Q$
- **Not symmetric**: in general $D_{KL}(P||Q) \neq D_{KL}(Q||P)$
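
The asymmetry is easy to see numerically. The following sketch evaluates the divergence in both directions for a biased coin and a fair coin; the helper name `kl_bits` is chosen here for illustration.

```python
import numpy as np

def kl_bits(p: np.ndarray, q: np.ndarray) -> float:
    """D_KL(p || q) in bits; assumes q > 0 wherever p > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

p = np.array([0.25, 0.75])  # biased coin
q = np.array([0.5, 0.5])    # fair coin

kl_pq = kl_bits(p, q)
kl_qp = kl_bits(q, p)
print(f"D_KL(P||Q) = {kl_pq:.4f} bits")
print(f"D_KL(Q||P) = {kl_qp:.4f} bits")
```

The two directions give roughly 0.1887 and 0.2075 bits, so swapping the arguments changes the answer.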

### 📐 Mathematical Example

**Distributions**:
- $P = [0.25, 0.75]$ (biased coin)
- $Q = [0.5, 0.5]$ (fair coin)

**Entropy of P** (using $\log_2$, so the result is in bits):
$$H(P) = -0.25 \log_2(0.25) - 0.75 \log_2(0.75) = 0.811 \text{ bits}$$

**KL Divergence**:
$$D_{KL}(P||Q) = 0.25 \log_2\frac{0.25}{0.5} + 0.75 \log_2\frac{0.75}{0.5} = 0.189 \text{ bits}$$

In [None]:
# ============================================
# 4.1 ENTROPY AND KL DIVERGENCE
# ============================================

def entropy(p: np.ndarray) -> float:
    """Compute entropy in bits."""
    p = np.array(p)
    p = p[p > 0]  # Avoid log(0)
    return -np.sum(p * np.log2(p))


def kl_divergence(p: np.ndarray, q: np.ndarray) -> float:
    """Compute KL divergence D_KL(P||Q) in bits."""
    p, q = np.array(p), np.array(q)
    # Only consider where p > 0
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))


# Example distributions
P = np.array([0.25, 0.75])  # Biased coin
Q = np.array([0.5, 0.5])   # Fair coin
R = np.array([0.1, 0.9])   # Very biased

print("="*60)
print("ENTROPY AND KL DIVERGENCE")
print("="*60)

print(f"\n📊 Distributions:")
print(f"  P (biased):      {P}")
print(f"  Q (fair coin):   {Q}")
print(f"  R (very biased): {R}")

print(f"\n📐 Entropy (uncertainty):")
print(f"  H(P) = {entropy(P):.4f} bits")
print(f"  H(Q) = {entropy(Q):.4f} bits (maximum for 2 outcomes)")
print(f"  H(R) = {entropy(R):.4f} bits (low uncertainty)")

print(f"\n📐 KL Divergence (difference from Q):")
print(f"  D_KL(P||Q) = {kl_divergence(P, Q):.4f} bits")
print(f"  D_KL(R||Q) = {kl_divergence(R, Q):.4f} bits (R is more different from Q)")
print(f"  D_KL(Q||Q) = {kl_divergence(Q, Q):.4f} bits (same distribution = 0)")

In [None]:
# Visualization: Entropy vs probability
p_range = np.linspace(0.001, 0.999, 100)
entropy_values = [entropy([p, 1-p]) for p in p_range]

plt.figure(figsize=(10, 5))
plt.plot(p_range, entropy_values, 'b-', linewidth=2)
plt.axvline(x=0.5, color='r', linestyle='--', label='Maximum entropy (p=0.5)')
plt.scatter([0.25, 0.5, 0.1], [entropy([0.25, 0.75]), entropy([0.5, 0.5]), entropy([0.1, 0.9])],
            c=['green', 'red', 'orange'], s=100, zorder=5)
plt.annotate('P', (0.25, entropy([0.25, 0.75])+0.05), fontsize=12, fontweight='bold')
plt.annotate('Q', (0.5, entropy([0.5, 0.5])+0.05), fontsize=12, fontweight='bold')
plt.annotate('R', (0.1, entropy([0.1, 0.9])+0.05), fontsize=12, fontweight='bold')
plt.xlabel('Probability p (for outcome 1)', fontsize=12)
plt.ylabel('Entropy H(p) [bits]', fontsize=12)
plt.title('Binary Entropy Function', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

print("💡 Entropy is maximum when outcomes are equally likely (maximum uncertainty).")

## 4.2 Variational Autoencoders (VAE)

### Theory

**VAEs** are generative models that learn to map data to a latent space that follows a standard Gaussian $N(0, I)$.

The loss function is the **negative Evidence Lower Bound (ELBO)**:

$$\mathcal{L} = \underbrace{-\mathbb{E}_{q(z|x)}[\log p(x|z)]}_{\text{Reconstruction Loss}} + \underbrace{D_{KL}(q(z|x) || p(z))}_{\text{KL Regularization}}$$

### KL Between Two Gaussians (Closed Form)

For a diagonal Gaussian encoder $q(z|x) = N(\mu, \operatorname{diag}(\sigma^2))$ and prior $p(z) = N(0, I)$:

$$D_{KL}(q||p) = -\frac{1}{2} \sum_{j=1}^{d} (1 + \log(\sigma_j^2) - \mu_j^2 - \sigma_j^2)$$

### 📐 Mathematical Example

**Encoder output**: $\mu = 0.5$, $\sigma = 0.8$

$$D_{KL} = -\frac{1}{2}(1 + \log(0.64) - 0.25 - 0.64) = -\frac{1}{2}(1 - 0.446 - 0.25 - 0.64) = 0.168$$
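
As a sanity check on the closed form, we can integrate $q(z)\,[\log q(z) - \log p(z)]$ numerically for the same $\mu$ and $\sigma$. This is a minimal sketch; the grid range and resolution are arbitrary choices made here.

```python
import numpy as np

mu, sigma = 0.5, 0.8

# Closed form for D_KL( N(mu, sigma^2) || N(0, 1) )
kl_closed = -0.5 * (1 + np.log(sigma**2) - mu**2 - sigma**2)

# Numerical check: Riemann sum of q(z) * [log q(z) - log p(z)] on a fine grid
z = np.linspace(-10.0, 10.0, 200001)
dz = z[1] - z[0]
log_q = -0.5 * np.log(2 * np.pi * sigma**2) - (z - mu) ** 2 / (2 * sigma**2)
log_p = -0.5 * np.log(2 * np.pi) - z**2 / 2
kl_numeric = np.sum(np.exp(log_q) * (log_q - log_p)) * dz

print(f"closed form: {kl_closed:.4f}")
print(f"numerical:   {kl_numeric:.4f}")
```

Both values agree at about 0.168 nats. Note that this formula uses the natural logarithm, unlike the base-2 entropy examples earlier in the chapter.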

In [None]:
# ============================================
# 4.2 VAE KL DIVERGENCE COMPUTATION
# ============================================

def kl_divergence_gaussian(mu: np.ndarray, log_var: np.ndarray) -> float:
    """
    Compute KL divergence between N(mu, exp(log_var)) and N(0, 1).
    This is the closed-form solution used in VAEs.
    
    Args:
        mu: Mean of the encoder distribution
        log_var: Log variance of the encoder distribution
    
    Returns:
        KL divergence value
    """
    return -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var))


def reparameterize(mu: np.ndarray, log_var: np.ndarray) -> np.ndarray:
    """
    Reparameterization trick: z = mu + sigma * epsilon
    where epsilon ~ N(0, 1)
    """
    std = np.exp(0.5 * log_var)
    eps = np.random.randn(*mu.shape)
    return mu + std * eps


# Example: encoder outputs
mu = np.array([0.5, -0.3, 0.8])
sigma = np.array([0.8, 0.5, 1.2])
log_var = 2 * np.log(sigma)  # log(sigma^2)

print("="*60)
print("VAE KL DIVERGENCE (CLOSED FORM)")
print("="*60)
print(f"Encoder output:")
print(f"  μ = {mu}")
print(f"  σ = {sigma}")
print(f"  log(σ²) = {log_var}")

kl = kl_divergence_gaussian(mu, log_var)
print(f"\nD_KL(q(z|x) || p(z)) = {kl:.4f}")

# Show reparameterization trick
print(f"\n🎲 Reparameterization samples:")
for i in range(3):
    z = reparameterize(mu, log_var)
    print(f"  z_{i+1} = {z}")

print("\n💡 The reparameterization trick allows gradients to flow through sampling!")

In [None]:
# Visualization: Effect of KL divergence on latent space
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Different encoder distributions
configs = [
    (0, 1, 'Standard Normal N(0,1)'),
    (0.5, 0.8, 'Shifted: N(0.5, 0.64)'),
    (2, 0.3, 'Far from prior: N(2, 0.09)'),
]

x = np.linspace(-4, 5, 200)

for ax, (mu, sigma, title) in zip(axes, configs):
    # Prior p(z)
    prior = stats.norm.pdf(x, 0, 1)
    # Encoder q(z|x)
    encoder = stats.norm.pdf(x, mu, sigma)
    
    ax.fill_between(x, prior, alpha=0.3, label='Prior p(z) = N(0,1)')
    ax.fill_between(x, encoder, alpha=0.3, label=f'Encoder q(z|x)')
    ax.plot(x, prior, 'b-', linewidth=2)
    ax.plot(x, encoder, 'r-', linewidth=2)
    
    # Compute KL
    log_var = 2 * np.log(sigma)
    kl = kl_divergence_gaussian(np.array([mu]), np.array([log_var]))
    
    ax.set_title(f'{title}\nD_KL = {kl:.3f}', fontweight='bold')
    ax.set_xlabel('z')
    ax.legend(loc='upper right')
    ax.grid(True, alpha=0.3)

plt.suptitle('KL Divergence in VAE: Encoder vs Prior', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print("📈 Higher KL = encoder distribution is far from the prior.")
print("📉 VAE training minimizes KL to keep latent space regular.")

## 4.3 t-SNE Algorithm

### Theory

**t-SNE** (t-Distributed Stochastic Neighbor Embedding) visualizes high-dimensional data in 2D/3D.

**Key idea**: Convert distances to probabilities, then minimize the difference between high-dim and low-dim probability distributions.

**High-dimensional space**: Gaussian similarities
$$p_{j|i} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|x_i - x_k\|^2 / 2\sigma_i^2)}$$

**Low-dimensional space**: t-distribution (heavier tails to prevent crowding)
$$q_{ij} = \frac{(1 + \|y_i - y_j\|^2)^{-1}}{\sum_{k \neq l} (1 + \|y_k - y_l\|^2)^{-1}}$$

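**Loss**: KL divergence between the joint distributions $P$ and $Q$, where $p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}$ symmetrizes the conditional similarities over the $n$ data points:
$$C = D_{KL}(P || Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$$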
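
The quantities above can be computed directly for a handful of points. The sketch below is a simplified illustration, not the full algorithm: it uses one shared bandwidth `sigma` instead of the per-point $\sigma_i$ that t-SNE tunes to match a target perplexity, it symmetrizes the conditionals as $p_{ij} = (p_{j|i} + p_{i|j}) / 2n$ (the standard t-SNE construction), and it scores a random embedding rather than optimizing one. All helper names are chosen here.

```python
import numpy as np

def pairwise_sq_dists(X: np.ndarray) -> np.ndarray:
    """Squared Euclidean distance between every pair of rows."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sum(diff**2, axis=-1)

def conditional_p(X: np.ndarray, sigma: float = 1.0) -> np.ndarray:
    """p_{j|i} with one shared bandwidth (real t-SNE tunes sigma_i per point)."""
    logits = -pairwise_sq_dists(X) / (2 * sigma**2)
    np.fill_diagonal(logits, -np.inf)           # enforce p_{i|i} = 0
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)

def joint_q(Y: np.ndarray) -> np.ndarray:
    """Student-t similarities q_ij, normalized over all pairs k != l."""
    inv = 1.0 / (1.0 + pairwise_sq_dists(Y))
    np.fill_diagonal(inv, 0.0)
    return inv / inv.sum()

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 5))   # 6 toy points in 5 dimensions
Y = rng.normal(size=(6, 2))   # a random 2D embedding, not optimized

P_cond = conditional_p(X)
P = (P_cond + P_cond.T) / (2 * len(X))   # symmetrized joint p_ij
Q = joint_q(Y)

mask = P > 0
cost = np.sum(P[mask] * np.log(P[mask] / Q[mask]))
print(f"KL(P || Q) for the random embedding: {cost:.4f}")
```

t-SNE proper would now move the rows of `Y` by gradient descent to reduce this cost.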

In [None]:
# ============================================
# 4.3 t-SNE SIMPLIFIED DEMO
# ============================================

from sklearn.manifold import TSNE
from sklearn.datasets import make_blobs

# Generate high-dimensional clustered data
np.random.seed(42)
n_samples = 300
n_features = 50  # High dimensional
n_clusters = 4

X, y = make_blobs(n_samples=n_samples, n_features=n_features, 
                  centers=n_clusters, cluster_std=2.0, random_state=42)

print("="*60)
print("t-SNE DIMENSIONALITY REDUCTION")
print("="*60)
print(f"Original data: {X.shape[0]} samples × {X.shape[1]} dimensions")

# Apply t-SNE
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_tsne = tsne.fit_transform(X)

print(f"t-SNE output: {X_tsne.shape[0]} samples × {X_tsne.shape[1]} dimensions")

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Original (first 2 dimensions only)
scatter1 = axes[0].scatter(X[:, 0], X[:, 1], c=y, cmap='tab10', alpha=0.7)
axes[0].set_title('Original Data (First 2 of 50 dims)\nClusters not clearly separated', fontweight='bold')
axes[0].set_xlabel('Dimension 1')
axes[0].set_ylabel('Dimension 2')

# t-SNE
scatter2 = axes[1].scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='tab10', alpha=0.7)
axes[1].set_title('t-SNE Projection (2D)\nClusters clearly separated!', fontweight='bold')
axes[1].set_xlabel('t-SNE Dimension 1')
axes[1].set_ylabel('t-SNE Dimension 2')

plt.tight_layout()
plt.show()

print("\n💡 t-SNE reveals cluster structure hidden in high dimensions!")
print("   It uses KL divergence to preserve local neighborhood structure.")

---

# Chapter 5: Advanced Integration Methods

---

Many ML problems require computing integrals that have no closed-form solution. **Monte Carlo methods** provide numerical approximations.

## 5.1 Monte Carlo Integration and Importance Sampling

### Theory

**Monte Carlo Integration** approximates expectations:

$$\mathbb{E}_{x \sim p}[f(x)] = \int f(x) p(x) dx \approx \frac{1}{N} \sum_{i=1}^N f(x_i), \quad x_i \sim p$$

**Problem**: What if $p(x)$ is hard to sample from?

**Solution**: **Importance Sampling** - sample from an easier distribution $q(x)$:

$$\mathbb{E}_p[f(x)] = \mathbb{E}_q\left[f(x) \frac{p(x)}{q(x)}\right] \approx \frac{1}{N} \sum_{i=1}^N f(x_i) \frac{p(x_i)}{q(x_i)}, \quad x_i \sim q$$

The ratio $w_i = p(x_i)/q(x_i)$ is called the **importance weight**.
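
Importance sampling pays off most when the region that matters is rarely visited under $p$. As an illustration (the target, proposal, and threshold are assumptions chosen here), the sketch below estimates the tail probability $P(X > 3)$ for $X \sim N(0, 1)$, once by plain Monte Carlo and once with a proposal $q = N(3, 1)$ centered on the tail.

```python
import math
import numpy as np

def normal_pdf(x, mean=0.0, std=1.0):
    """Density of N(mean, std^2)."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

rng = np.random.default_rng(0)
n = 100_000
threshold = 3.0

# Plain Monte Carlo: sample from the target p = N(0, 1)
x_p = rng.normal(0.0, 1.0, size=n)
naive = np.mean(x_p > threshold)

# Importance sampling: sample from q = N(3, 1), reweight by w = p(x) / q(x)
x_q = rng.normal(threshold, 1.0, size=n)
weights = normal_pdf(x_q) / normal_pdf(x_q, mean=threshold)
is_estimate = np.mean((x_q > threshold) * weights)

exact = 0.5 * math.erfc(threshold / math.sqrt(2))  # P(X > 3) for X ~ N(0, 1)
print(f"exact:               {exact:.6f}")
print(f"plain Monte Carlo:   {naive:.6f}")
print(f"importance sampling: {is_estimate:.6f}")
```

About half of the proposal samples land beyond the threshold, compared with roughly one in a thousand under $p$, so the weighted estimate has much lower variance for the same $N$.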

### 📐 Mathematical Example

Estimate $I = \int_0^1 e^x dx$ (true value: $e - 1 \approx 1.718$)

Using uniform proposal $q(x) = 1$ on $[0, 1]$:

$$\hat{I} = \frac{1}{N} \sum_{i=1}^N e^{x_i}$$
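
The standard error of this estimator shrinks like $1/\sqrt{N}$, so 100 times more samples should cut the spread by about a factor of 10. The sketch below checks this empirically by repeating the estimate many times; the repeat count and sample sizes are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
true_value = np.e - 1
n_repeats = 200

def mc_estimates(n_samples: int) -> np.ndarray:
    """Return n_repeats independent Monte Carlo estimates of the integral."""
    x = rng.uniform(0.0, 1.0, size=(n_repeats, n_samples))
    return np.exp(x).mean(axis=1)

spread = {}
for n in [100, 10_000]:
    estimates = mc_estimates(n)
    spread[n] = estimates.std()
    print(f"N = {n:6d}: mean = {estimates.mean():.5f} "
          f"(true {true_value:.5f}), std = {spread[n]:.5f}")

ratio = spread[100] / spread[10_000]
print(f"std ratio = {ratio:.1f} (1/sqrt(N) scaling predicts 10)")
```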

In [None]:
# ============================================
# 5.1 MONTE CARLO AND IMPORTANCE SAMPLING
# ============================================

def monte_carlo_estimate(f, n_samples=10000):
    """Simple Monte Carlo: sample from Uniform[0,1]."""
    x = np.random.uniform(0, 1, n_samples)
    return np.mean(f(x))


def importance_sampling_estimate(f, p, q_sample, q_pdf, n_samples=10000):
    """
    Importance sampling estimate.
    
    Args:
        f: Function to integrate
        p: Target probability density
        q_sample: Function to sample from proposal
        q_pdf: Proposal probability density
    """
    x = q_sample(n_samples)
    weights = p(x) / q_pdf(x)
    return np.mean(f(x) * weights)


# Example: Integrate e^x from 0 to 1
f = lambda x: np.exp(x)
true_value = np.e - 1  # ≈ 1.718

print("="*60)
print("MONTE CARLO INTEGRATION")
print("="*60)
print(f"Integral: ∫₀¹ eˣ dx")
print(f"True value: {true_value:.6f}")

# Simple Monte Carlo
estimates = [monte_carlo_estimate(f, n) for n in [100, 1000, 10000, 100000]]
print(f"\n📊 Simple Monte Carlo estimates:")
for n, est in zip([100, 1000, 10000, 100000], estimates):
    print(f"  N={n:6d}: {est:.6f} (error: {abs(est - true_value):.6f})")

print("\n✅ Error decreases as 1/√N (central limit theorem)")

In [None]:
# Visualization: Convergence of Monte Carlo
np.random.seed(42)
n_max = 5000
samples = np.random.uniform(0, 1, n_max)
cumulative_mean = np.cumsum(np.exp(samples)) / (np.arange(1, n_max + 1))

plt.figure(figsize=(10, 5))
plt.plot(cumulative_mean, 'b-', alpha=0.7, linewidth=1)
plt.axhline(y=true_value, color='r', linestyle='--', linewidth=2, label=f'True value = {true_value:.4f}')
plt.fill_between(range(n_max), true_value - 0.1, true_value + 0.1, alpha=0.2, color='red')
plt.xlabel('Number of samples', fontsize=12)
plt.ylabel('Estimate', fontsize=12)
plt.title('Monte Carlo Convergence: ∫₀¹ eˣ dx', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)
plt.xlim(0, n_max)
plt.show()

print("💡 The estimate fluctuates but converges to the true value.")

## 5.2 Normalizing Flows

### Theory

**Normalizing Flows** transform a simple distribution (e.g., Gaussian) into a complex one through a series of invertible transformations.

Given $z \sim p_z(z)$ and an invertible function $x = f(z)$, the density of $x$ is:

$$p_x(x) = p_z(f^{-1}(x)) \left| \det \frac{\partial f^{-1}}{\partial x} \right| = p_z(z) \left| \det \frac{\partial f}{\partial z} \right|^{-1}$$

The key is choosing $f$ such that the **Jacobian determinant** is easy to compute.

### 📐 Mathematical Example: Affine Flow

Simplest flow: $x = az + b$ (scale and shift)

Jacobian determinant: $a$, with absolute value $|a|$ (invertibility requires $a \neq 0$)

$$p_x(x) = p_z\left(\frac{x-b}{a}\right) \cdot \frac{1}{|a|}$$
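
For a standard normal base distribution, this formula should reproduce the density of $N(b, a^2)$ exactly. A minimal check, with $a = 2$, $b = 3$ and the evaluation grid chosen here for illustration:

```python
import math
import numpy as np

def standard_normal_pdf(z):
    """Density of N(0, 1)."""
    return np.exp(-0.5 * z**2) / math.sqrt(2 * math.pi)

def normal_pdf(x, mean, std):
    """Density of N(mean, std^2)."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

a, b = 2.0, 3.0                     # flow x = a*z + b
x = np.linspace(-5.0, 11.0, 9)

# Change of variables: p_x(x) = p_z((x - b) / a) / |a|
p_x_flow = standard_normal_pdf((x - b) / a) / abs(a)

# Direct density of N(b, a^2) for comparison
p_x_direct = normal_pdf(x, mean=b, std=abs(a))

print("max abs difference:", np.max(np.abs(p_x_flow - p_x_direct)))
```

The $1/|a|$ factor is what keeps the transformed density normalized: stretching the axis by $|a|$ must lower the density by the same factor.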

In [None]:
# ============================================
# 5.2 NORMALIZING FLOWS DEMONSTRATION
# ============================================

class AffineFlow:
    """Simple affine flow: x = a*z + b"""
    
    def __init__(self, a: float, b: float):
        self.a = a
        self.b = b
    
    def forward(self, z: np.ndarray) -> np.ndarray:
        return self.a * z + self.b
    
    def inverse(self, x: np.ndarray) -> np.ndarray:
        return (x - self.b) / self.a
    
    def log_det_jacobian(self) -> float:
        return np.log(np.abs(self.a))


# Start with standard normal
np.random.seed(42)
z = np.random.randn(5000)

# Apply flow: x = 2z + 3 (scale by 2, shift by 3)
flow = AffineFlow(a=2.0, b=3.0)
x = flow.forward(z)

print("="*60)
print("NORMALIZING FLOWS")
print("="*60)
print(f"Base distribution: z ~ N(0, 1)")
print(f"Flow: x = 2z + 3")
print(f"Resulting distribution: x ~ N(3, 4)")
print(f"\nlog|det(J)| = log|2| = {flow.log_det_jacobian():.4f}")

# Verification
print(f"\n📊 Empirical statistics:")
print(f"  z: mean={np.mean(z):.3f}, std={np.std(z):.3f}")
print(f"  x: mean={np.mean(x):.3f}, std={np.std(x):.3f}")

In [None]:
# Visualization
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Base distribution
axes[0].hist(z, bins=50, density=True, alpha=0.7, color='blue')
x_plot = np.linspace(-4, 4, 100)
axes[0].plot(x_plot, stats.norm.pdf(x_plot), 'k-', linewidth=2)
axes[0].set_title('Base: z ~ N(0, 1)', fontweight='bold')
axes[0].set_xlabel('z')

# Flow transformation
axes[1].annotate('', xy=(0.9, 0.5), xytext=(0.1, 0.5),
                 arrowprops=dict(arrowstyle='->', lw=3, color='green'),
                 xycoords='axes fraction')
axes[1].text(0.5, 0.6, 'x = 2z + 3', fontsize=16, ha='center', transform=axes[1].transAxes, fontweight='bold')
axes[1].text(0.5, 0.4, 'Invertible!', fontsize=12, ha='center', transform=axes[1].transAxes)
axes[1].axis('off')
axes[1].set_title('Affine Flow', fontweight='bold')

# Transformed distribution
axes[2].hist(x, bins=50, density=True, alpha=0.7, color='orange')
x_plot = np.linspace(-3, 9, 100)
axes[2].plot(x_plot, stats.norm.pdf(x_plot, loc=3, scale=2), 'k-', linewidth=2)
axes[2].set_title('Result: x ~ N(3, 4)', fontweight='bold')
axes[2].set_xlabel('x')

plt.suptitle('Normalizing Flow: Transform Simple → Complex', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print("💡 By stacking many invertible layers, flows can model very complex distributions!")

---

# Chapter 6: Network Analysis

---

## 6.1 Supervised Random Walks (Facebook Link Prediction)

### Theory

**Link Prediction**: Given a social network, predict which new edges (friendships) will form.

**Supervised Random Walks** (Backstrom & Leskovec, 2011) learn edge weights based on features to bias a random walk toward future friends.

**Key idea**: Learn a function $\text{strength}(u, v) = f_w(\text{features}(u, v))$ that weights edges.

The optimization objective:

$$\min_w \|w\|^2 + \lambda \sum_{(d, l) \in D} h(p_l - p_d)$$

Where:
- $d$: A node that becomes a friend (destination)
- $l$: A node that does not (non-link)
- $p_d, p_l$: Random walk probabilities of reaching these nodes
- $h$: Hinge-like loss
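
To show how the pieces fit together, here is a forward-pass sketch only: it turns edge features into strengths, runs a random walk with restart from a source node, and evaluates the pairwise penalty for one (destination, non-link) pair. Everything specific is an assumption made for illustration: the logistic strength function, the restart probability of 0.15, the random edge features, the fixed weights `w`, and the plain hinge $h(t) = \max(0, t)$. The actual method learns `w` by differentiating through the walk, which is not shown.

```python
import numpy as np

def edge_strength(features: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Logistic edge strength sigmoid(w . features) for every node pair."""
    return 1.0 / (1.0 + np.exp(-(features @ w)))

def random_walk_with_restart(strength: np.ndarray, adj: np.ndarray, source: int,
                             restart: float = 0.15, n_iter: int = 200) -> np.ndarray:
    """Approximate stationary probabilities of a walk that restarts at `source`."""
    weighted = strength * adj                            # keep only existing edges
    transition = weighted / weighted.sum(axis=1, keepdims=True)
    e_s = np.zeros(adj.shape[0])
    e_s[source] = 1.0
    p = e_s.copy()
    for _ in range(n_iter):
        p = (1 - restart) * (p @ transition) + restart * e_s
    return p

# Toy graph with edges 0-1, 1-2, 1-3, 2-3
adj = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 1],
    [0, 1, 0, 1],
    [0, 1, 1, 0],
], dtype=float)

rng = np.random.default_rng(0)
features = rng.normal(size=(4, 4, 2))   # hypothetical: 2 features per node pair
w = np.array([1.0, -0.5])               # hypothetical weights (learned in the real method)

p = random_walk_with_restart(edge_strength(features, w), adj, source=0)
print("visiting probabilities from node 0:", np.round(p, 3))

# Penalty for one (destination d, non-link l) pair: positive only if l outranks d
d, l = 2, 3
loss = max(0.0, p[l] - p[d])
print(f"h(p_l - p_d) = {loss:.4f}")
```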

In [None]:
# ============================================
# 6.1 LINK PREDICTION DEMO
# ============================================

def compute_common_neighbors(adj: np.ndarray) -> np.ndarray:
    """Compute common neighbors score for all node pairs."""
    # CN(u,v) = |N(u) ∩ N(v)| = (A @ A)[u,v]
    return adj @ adj


def compute_adamic_adar(adj: np.ndarray) -> np.ndarray:
    """Compute Adamic-Adar score for all node pairs."""
    n = adj.shape[0]
    degrees = adj.sum(axis=1)
    scores = np.zeros((n, n))
    
    for u in range(n):
        for v in range(n):
            if u != v:
                # Find common neighbors
                common = np.where((adj[u] == 1) & (adj[v] == 1))[0]
                # Sum 1/log(degree) for each common neighbor
                for w in common:
                    if degrees[w] > 1:
                        scores[u, v] += 1 / np.log(degrees[w])
    
    return scores


# Create a sample social network
# 6 users: edges represent friendships
adj = np.array([
    [0, 1, 1, 0, 0, 0],  # User 0: friends with 1, 2
    [1, 0, 1, 1, 0, 0],  # User 1: friends with 0, 2, 3
    [1, 1, 0, 1, 1, 0],  # User 2: friends with 0, 1, 3, 4
    [0, 1, 1, 0, 1, 1],  # User 3: friends with 1, 2, 4, 5
    [0, 0, 1, 1, 0, 1],  # User 4: friends with 2, 3, 5
    [0, 0, 0, 1, 1, 0],  # User 5: friends with 3, 4
])

print("="*60)
print("LINK PREDICTION IN SOCIAL NETWORKS")
print("="*60)

# Compute scores
cn_scores = compute_common_neighbors(adj)
aa_scores = compute_adamic_adar(adj)

# Find potential links (pairs that aren't already connected)
print("\n📊 Potential new friendships (ranked by Common Neighbors):")
potential_links = []
for i in range(6):
    for j in range(i+1, 6):
        if adj[i, j] == 0:  # Not already friends
            potential_links.append((i, j, cn_scores[i, j], aa_scores[i, j]))

# Sort by common neighbors
potential_links.sort(key=lambda x: x[2], reverse=True)

print(f"{'Pair':<10} {'Common Neighbors':<18} {'Adamic-Adar':<12}")
print("-" * 40)
for i, j, cn, aa in potential_links[:5]:
    print(f"({i}, {j})     {int(cn):<18} {aa:.3f}")

In [None]:
# Visualization of the network
import matplotlib.patches as mpatches

# Node positions (manually arranged for visualization)
pos = {
    0: (0, 1),
    1: (1, 2),
    2: (2, 1),
    3: (3, 2),
    4: (4, 1),
    5: (4, 2.5)
}

plt.figure(figsize=(10, 6))

# Draw existing edges
for i in range(6):
    for j in range(i+1, 6):
        if adj[i, j] == 1:
            plt.plot([pos[i][0], pos[j][0]], [pos[i][1], pos[j][1]], 
                     'b-', linewidth=2, alpha=0.6)

# Draw predicted edge (highest score)
best_i, best_j = potential_links[0][0], potential_links[0][1]
plt.plot([pos[best_i][0], pos[best_j][0]], [pos[best_i][1], pos[best_j][1]], 
         'g--', linewidth=3, label='Predicted friendship')

# Draw nodes
for node, (x, y) in pos.items():
    plt.scatter(x, y, c='steelblue', s=800, zorder=5, edgecolors='black', linewidths=2)
    plt.annotate(f'User {node}', (x, y), fontsize=10, ha='center', va='center', color='white', fontweight='bold')

plt.title('Social Network with Predicted Link', fontsize=14, fontweight='bold')
plt.legend(loc='upper left')
plt.axis('off')
plt.tight_layout()
plt.show()

print(f"\n🎯 Most likely new friendship: Users {best_i} and {best_j}")
print(f"   (They have {int(potential_links[0][2])} common friends!)")

---

# Chapter 7: Bayesian Optimization

---

When the objective function is expensive to evaluate (hyperparameter tuning, drug discovery) and its gradients are unavailable, gradient-based methods are impractical. **Bayesian Optimization** fits a probabilistic surrogate model to the evaluations made so far and uses it to choose the next point to try.

## 7.1 Gaussian Processes

### Theory

A **Gaussian Process (GP)** is a distribution over functions:

$$f(x) \sim \mathcal{GP}(m(x), k(x, x'))$$

Where:
- $m(x)$: Mean function
- $k(x, x')$: Kernel (covariance) function

**Key property**: GP gives both a prediction AND uncertainty!

### Squared Exponential (RBF) Kernel

$$k(x, x') = \sigma^2 \exp\left(-\frac{\|x - x'\|^2}{2\ell^2}\right)$$

Parameters:
- $\sigma^2$: Signal variance (amplitude)
- $\ell$: Length scale (how quickly function varies)
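
The effect of the length scale is easiest to see by evaluating the kernel at a fixed distance. In the sketch below (scalar inputs and the three values of $\ell$ are chosen here for illustration), two points one unit apart are almost uncorrelated for a short length scale and almost perfectly correlated for a long one.

```python
import numpy as np

def rbf(x1: float, x2: float, length_scale: float, variance: float = 1.0) -> float:
    """Squared-exponential kernel for scalar inputs."""
    return float(variance * np.exp(-((x1 - x2) ** 2) / (2 * length_scale**2)))

distance = 1.0
for ell in [0.3, 1.0, 3.0]:
    k = rbf(0.0, distance, length_scale=ell)
    print(f"length scale {ell:>3}: k(0, 1) = {k:.4f}")
```

The three values come out at roughly 0.004, 0.61 and 0.95. A GP with small $\ell$ can therefore change quickly between nearby inputs, while a large $\ell$ forces smooth, slowly varying functions.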

In [None]:
# ============================================
# 7.1 GAUSSIAN PROCESS REGRESSION
# ============================================

def rbf_kernel(X1: np.ndarray, X2: np.ndarray, 
               length_scale: float = 1.0, variance: float = 1.0) -> np.ndarray:
    """
    Compute RBF (squared exponential) kernel matrix.
    """
    X1 = X1.reshape(-1, 1) if X1.ndim == 1 else X1
    X2 = X2.reshape(-1, 1) if X2.ndim == 1 else X2
    
    # Compute squared Euclidean distances
    sqdist = np.sum(X1**2, axis=1, keepdims=True) + \
             np.sum(X2**2, axis=1) - 2 * X1 @ X2.T
    
    return variance * np.exp(-0.5 * sqdist / (length_scale**2))


def gp_predict(X_train: np.ndarray, y_train: np.ndarray, 
               X_test: np.ndarray, length_scale: float = 1.0,
               noise: float = 1e-6):
    """
    GP prediction with posterior mean and variance.
    """
    K = rbf_kernel(X_train, X_train, length_scale) + noise * np.eye(len(X_train))
    K_star = rbf_kernel(X_train, X_test, length_scale)
    K_star_star = rbf_kernel(X_test, X_test, length_scale)
    
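    # Compute posterior by solving linear systems instead of forming K^{-1}
    # explicitly, which is numerically more stable
    mu = K_star.T @ np.linalg.solve(K, y_train)
    cov = K_star_star - K_star.T @ np.linalg.solve(K, K_star)

    # Round-off can push tiny variances slightly below zero; clip before sqrt
    return mu, np.sqrt(np.clip(np.diag(cov), 0.0, None))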


# Generate training data
np.random.seed(42)
X_train = np.array([1, 3, 5, 6, 7])
y_train = np.sin(X_train) + 0.1 * np.random.randn(len(X_train))

# Test points
X_test = np.linspace(0, 10, 100)

# GP prediction
mu, std = gp_predict(X_train, y_train, X_test, length_scale=1.0)

print("="*60)
print("GAUSSIAN PROCESS REGRESSION")
print("="*60)
print(f"Training points: {len(X_train)}")
print(f"Test points: {len(X_test)}")
print(f"\nGP provides:")
print(f"  - Mean prediction μ(x)")
print(f"  - Uncertainty σ(x) at each point!")

In [None]:
# Visualization
plt.figure(figsize=(12, 5))

# True function
plt.plot(X_test, np.sin(X_test), 'k--', label='True function: sin(x)', linewidth=2)

# GP prediction
plt.plot(X_test, mu, 'b-', label='GP Mean prediction', linewidth=2)
plt.fill_between(X_test, mu - 2*std, mu + 2*std, alpha=0.3, color='blue', label='95% confidence')

# Training points
plt.scatter(X_train, y_train, c='red', s=100, zorder=5, label='Training data', edgecolors='black')

plt.xlabel('x', fontsize=12)
plt.ylabel('f(x)', fontsize=12)
plt.title('Gaussian Process: Prediction with Uncertainty', fontsize=14, fontweight='bold')
plt.legend(loc='upper right')
plt.grid(True, alpha=0.3)
plt.show()

print("üí° Notice: Uncertainty is LOW near training points, HIGH far from them!")

## 7.2 Acquisition Functions

### Theory

**Acquisition functions** decide where to sample next by balancing:
- **Exploitation**: Sample where we expect good values
- **Exploration**: Sample where we're uncertain

### Expected Improvement (EI)

$$\alpha_{EI}(x) = (\mu(x) - f^* - \xi) \Phi(Z) + \sigma(x) \phi(Z)$$

Where:
- $Z = \frac{\mu(x) - f^* - \xi}{\sigma(x)}$
- $f^*$: Best (largest) value observed so far; this form of EI assumes maximization
- $\xi$: Exploration parameter (larger values favor exploration)
- $\Phi$, $\phi$: CDF and PDF of standard normal
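
To make the trade-off concrete, the sketch below evaluates the EI formula above for three hypothetical candidates: one slightly better than the incumbent with little uncertainty, one worse on average but very uncertain, and one confidently bad. The helper name `ei_max` and all numbers are illustrative assumptions, not results from this notebook.

```python
import numpy as np
from scipy.stats import norm

def ei_max(mu, sigma, f_best, xi=0.01):
    """Expected Improvement for maximization (hypothetical helper)."""
    sigma = np.maximum(sigma, 1e-9)      # guard against division by zero
    improvement = mu - f_best - xi
    z = improvement / sigma
    return improvement * norm.cdf(z) + sigma * norm.pdf(z)

f_best = 1.0
labels = ["A: safe, small gain", "B: uncertain", "C: confidently bad"]
mu = np.array([1.05, 0.80, 0.50])        # posterior means (made up)
sigma = np.array([0.05, 0.50, 0.01])     # posterior std devs (made up)

ei = ei_max(mu, sigma, f_best)
for label, m, s, e in zip(labels, mu, sigma, ei):
    print(f"{label:22s} mu={m:.2f} sigma={s:.2f} EI={e:.4f}")
```

Candidate B scores highest even though its mean is below $f^*$: the $\sigma(x)\phi(Z)$ term rewards uncertainty, while candidate C, which is both worse and well understood, gets an EI of essentially zero.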

In [None]:
# ============================================
# 7.2 BAYESIAN OPTIMIZATION WITH EI
# ============================================

def expected_improvement(mu: np.ndarray, std: np.ndarray, 
                         f_best: float, xi: float = 0.01) -> np.ndarray:
    """
    Compute Expected Improvement acquisition function.
    """
    std = np.maximum(std, 1e-9)  # Avoid division by zero
    Z = (mu - f_best - xi) / std
    ei = (mu - f_best - xi) * stats.norm.cdf(Z) + std * stats.norm.pdf(Z)
    return ei


# Objective function to optimize (expensive black-box function)
def objective(x):
    return -((x - 2)**2 * np.sin(3*x))


# Initial samples
X_train = np.array([0.5, 2.0, 4.5])
y_train = objective(X_train)
f_best = np.max(y_train)

# Test points
X_test = np.linspace(0, 5, 200)

# GP prediction
mu, std = gp_predict(X_train, y_train, X_test, length_scale=0.5)

# Expected Improvement
ei = expected_improvement(mu, std, f_best)

print("="*60)
print("BAYESIAN OPTIMIZATION")
print("="*60)
print(f"Current best: {f_best:.4f} at x = {X_train[np.argmax(y_train)]:.2f}")
print(f"Next point to sample: x = {X_test[np.argmax(ei)]:.4f}")
print(f"Expected Improvement at that point: {np.max(ei):.4f}")

In [None]:
# Visualization
fig, axes = plt.subplots(2, 1, figsize=(12, 8), sharex=True)

# Top: GP surrogate
ax1 = axes[0]
ax1.plot(X_test, objective(X_test), 'k--', label='True objective', linewidth=2)
ax1.plot(X_test, mu, 'b-', label='GP prediction', linewidth=2)
ax1.fill_between(X_test, mu - 2*std, mu + 2*std, alpha=0.3, color='blue')
ax1.scatter(X_train, y_train, c='red', s=100, zorder=5, label='Observations', edgecolors='black')
ax1.axhline(y=f_best, color='green', linestyle=':', label=f'Best so far = {f_best:.2f}')
ax1.set_ylabel('f(x)', fontsize=12)
ax1.set_title('Gaussian Process Surrogate Model', fontsize=14, fontweight='bold')
ax1.legend(loc='upper right')
ax1.grid(True, alpha=0.3)

# Bottom: Acquisition function
ax2 = axes[1]
ax2.fill_between(X_test, 0, ei, alpha=0.5, color='orange')
ax2.plot(X_test, ei, 'orange', linewidth=2)
next_x = X_test[np.argmax(ei)]
ax2.axvline(x=next_x, color='red', linestyle='--', linewidth=2, label=f'Next sample: x={next_x:.2f}')
ax2.set_xlabel('x', fontsize=12)
ax2.set_ylabel('Expected Improvement', fontsize=12)
ax2.set_title('Acquisition Function (Expected Improvement)', fontsize=14, fontweight='bold')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nüéØ Bayesian Optimization balances:")
print("   - Exploitation: Sample where Œº(x) is high")
print("   - Exploration: Sample where œÉ(x) is high (uncertain regions)")
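
The cells above perform a single step: fit the surrogate, score candidates with EI, pick the next point. Repeating that step gives the full optimization loop. The following is a minimal, self-contained sketch under stated assumptions: the compact helpers `rbf`, `gp_posterior`, and `ei` are re-implementations for this example only, and the grid search over `x_grid`, the 8-step budget, and the stopping tolerance are arbitrary illustrative choices.

```python
import numpy as np
from scipy.stats import norm

def rbf(a, b, length_scale=0.5):
    """RBF kernel matrix (unit variance) for 1-D inputs a (n,) and b (m,)."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length_scale**2)

def gp_posterior(x_obs, y_obs, x_grid, length_scale=0.5, jitter=1e-6):
    """Posterior mean and standard deviation of a zero-mean GP on x_grid."""
    K = rbf(x_obs, x_obs, length_scale) + jitter * np.eye(len(x_obs))
    K_s = rbf(x_obs, x_grid, length_scale)
    mu = K_s.T @ np.linalg.solve(K, y_obs)
    # Diagonal of K_ss - K_s^T K^{-1} K_s; the prior variance is 1
    var = 1.0 - np.sum(K_s * np.linalg.solve(K, K_s), axis=0)
    return mu, np.sqrt(np.clip(var, 0.0, None))

def ei(mu, std, f_best, xi=0.01):
    """Expected Improvement for maximization."""
    std = np.maximum(std, 1e-9)
    imp = mu - f_best - xi
    z = imp / std
    return imp * norm.cdf(z) + std * norm.pdf(z)

def objective(x):
    """Same toy objective as in the cell above."""
    return -((x - 2) ** 2 * np.sin(3 * x))

x_grid = np.linspace(0, 5, 200)
x_obs = np.array([0.5, 2.0, 4.5])
y_obs = objective(x_obs)

for step in range(8):
    mu, std = gp_posterior(x_obs, y_obs, x_grid)
    scores = ei(mu, std, y_obs.max())
    if scores.max() < 1e-8:              # nothing left worth sampling
        break
    x_next = x_grid[np.argmax(scores)]   # candidate with the highest EI
    y_next = objective(x_next)           # the "expensive" evaluation
    x_obs = np.append(x_obs, x_next)
    y_obs = np.append(y_obs, y_next)
    print(f"step {step + 1}: x = {x_next:.3f}, f(x) = {y_next:.3f}, "
          f"best so far = {y_obs.max():.3f}")
```

Maximizing EI over a fixed grid is only workable in one dimension; in higher dimensions the acquisition function is usually maximized with a numerical optimizer instead.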

---

# 📚 Conclusion

---

This notebook has taken you through the **deep mathematical foundations** that power modern machine learning:

1. **Linear Algebra**: Vectors, matrices, eigenvalues, SVD → PageRank, recommendations
2. **Calculus**: Gradients, chain rule, Jacobian → Backpropagation, Transformers
3. **Optimization**: Lagrange, convex, ADMM → SVMs, industrial-scale systems
4. **Probability**: Entropy, KL, Bayes → VAEs, generative models
5. **Advanced Methods**: Monte Carlo, flows → Sampling, density estimation
6. **Network Analysis**: Random walks → Link prediction
7. **Bayesian Optimization**: GPs, acquisition → Hyperparameter tuning

> **The Key Insight**: Mathematics is not just a prerequisite; it is the *language* in which ML discoveries are written. Mastering these foundations enables you to:
> - Debug models effectively
> - Design novel architectures
> - Understand cutting-edge research papers
> - Build production-grade ML systems

---

**Next Steps**: Apply these concepts to real datasets and implement more complex models!