# **CHAPTER 1: MATHEMATICAL FOUNDATIONS FOR AI**

*The Language of Intelligence*

## **Chapter Overview**

Before writing a single line of machine learning code, you must understand the mathematical machinery that powers modern AI. This chapter transforms abstract mathematics into practical tools. We will not just learn formulas—we will implement them in NumPy to build intuition through code.

**Estimated Time:** 40-50 hours (2-3 weeks)  
**Prerequisites:** High school algebra, basic Python syntax

---

## **1.0 Learning Objectives**

By the end of this chapter, you will be able to:
1. Manipulate vectors, matrices, and tensors as fluently as Python lists
2. Compute gradients by hand and verify them programmatically
3. Apply Bayes' theorem to update beliefs given evidence
4. Design hypothesis tests to validate model performance
5. Distinguish correlation from causation, and name the conditions a causal claim requires

---

## **1.1 Linear Algebra: The Language of Data**

### **Why Linear Algebra Matters**

Every grayscale image is a matrix of pixels. Every sentence becomes a sequence of embedding vectors. Every dataset can be stored as a tensor of features. Linear algebra provides the syntax for describing the high-dimensional spaces in which AI operates.

#### **1.1.1 Scalars, Vectors, and Matrices**

**Scalar:** A single number (0-dimensional). Temperature, age, price.
$$ s \in \mathbb{R} $$

**Vector:** An ordered list of numbers (1-dimensional array). Represents a point in space or a direction.
$$ \mathbf{v} = \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{bmatrix} \in \mathbb{R}^n $$

In AI, vectors represent:
- Feature vectors (house: [sqft, bedrooms, age])
- Word embeddings ("king" → [0.2, -0.5, ...])
- Model weights

**Matrix:** A 2-dimensional array of numbers.
$$ \mathbf{A} = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix} \in \mathbb{R}^{m \times n} $$

Matrices represent:
- Grayscale image data (height × width); color images add a channel axis and become 3-dimensional tensors
- Layers of neural networks (weight matrices)
- Tabular datasets (samples × features)

**Tensor:** Generalization to $n$ dimensions. A scalar is an order-0 tensor, a vector is an order-1 tensor, and a matrix is an order-2 tensor.
$$ \mathcal{T} \in \mathbb{R}^{d_1 \times d_2 \times \cdots \times d_n} $$

In deep learning, a batch of 32 RGB images of size 224×224 is a tensor of shape $(32, 224, 224, 3)$.

```python
import numpy as np

# Creating tensors
scalar = np.array(5)
vector = np.array([1, 2, 3])
matrix = np.array([[1, 2], [3, 4], [5, 6]])  # 3x2 matrix
tensor = np.random.rand(32, 224, 224, 3)     # Batch of images

print(f"Scalar shape: {scalar.shape}")       # ()
print(f"Vector shape: {vector.shape}")       # (3,)
print(f"Matrix shape: {matrix.shape}")       # (3, 2)
print(f"Tensor shape: {tensor.shape}")       # (32, 224, 224, 3)
```

#### **1.1.2 Vector Operations**

**Dot Product (Inner Product):** Measures alignment between vectors.
$$ \mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^n a_i b_i = \|\mathbf{a}\| \|\mathbf{b}\| \cos(\theta) $$

The dot product is the heart of neural networks—every layer computes weighted sums (dot products between inputs and weights).

```python
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# Dot product
dot_product = np.dot(a, b)  # 1*4 + 2*5 + 3*6 = 32

# Geometric interpretation
norm_a = np.linalg.norm(a)
norm_b = np.linalg.norm(b)
cos_theta = dot_product / (norm_a * norm_b)
angle = np.arccos(cos_theta)  # In radians
```

**Hadamard Product (Element-wise):**
$$ (\mathbf{a} \circ \mathbf{b})_i = a_i \cdot b_i $$

**Outer Product:** Creates a matrix from two vectors.
$$ (\mathbf{a} \otimes \mathbf{b})_{ij} = a_i \cdot b_j $$

```python
# Hadamard product
hadamard = a * b  # [4, 10, 18]

# Outer product
outer = np.outer(a, b)  
# [[ 4,  5,  6],
#  [ 8, 10, 12],
#  [12, 15, 18]]
```

#### **1.1.3 Vector Norms**

Norms measure magnitude. Critical for regularization (L1/L2) and optimization.

- **L1 Norm (Manhattan):** $\|\mathbf{x}\|_1 = \sum |x_i|$ — Encourages sparsity
- **L2 Norm (Euclidean):** $\|\mathbf{x}\|_2 = \sqrt{\sum x_i^2}$ — Default distance metric
- **L∞ Norm (Max):** $\|\mathbf{x}\|_\infty = \max |x_i|$

```python
l1 = np.linalg.norm(a, ord=1)   # Sum of absolute values
l2 = np.linalg.norm(a, ord=2)   # Euclidean length
linf = np.linalg.norm(a, ord=np.inf)  # Maximum absolute value
```

---

## **1.2 Matrix Operations: The Mechanics of Transformation**

Matrices transform spaces. When you multiply a vector by a matrix, you rotate, scale, or project it into a new space. This is literally what happens in every layer of a neural network.

#### **1.2.1 Matrix Multiplication**

For matrices $\mathbf{A} \in \mathbb{R}^{m \times n}$ and $\mathbf{B} \in \mathbb{R}^{n \times p}$:
$$ (\mathbf{A}\mathbf{B})_{ij} = \sum_{k=1}^n A_{ik} B_{kj} $$

**Key Rules:**
- Not commutative: $\mathbf{A}\mathbf{B} \neq \mathbf{B}\mathbf{A}$ (usually)
- Associative: $(\mathbf{A}\mathbf{B})\mathbf{C} = \mathbf{A}(\mathbf{B}\mathbf{C})$
- Transpose rule: $(\mathbf{A}\mathbf{B})^T = \mathbf{B}^T \mathbf{A}^T$

```python
A = np.array([[1, 2], [3, 4], [5, 6]])      # 3x2
B = np.array([[7, 8, 9], [10, 11, 12]])     # 2x3

# Matrix multiplication
C = np.matmul(A, B)  # or A @ B (Python 3.5+)
# Result: 3x3 matrix

# Note: B @ A would give 2x2 (different result)
```
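
The three key rules can be checked numerically. The sketch below uses small square matrices so that both products are defined; the names `P`, `Q`, and `R` are illustrative and do not come from the text.

```python
import numpy as np

# Illustrative 2x2 matrices; any square matrices of matching size would do
P = np.array([[1, 2], [3, 4]])
Q = np.array([[0, 1], [1, 0]])
R = np.array([[2, 0], [0, 3]])

# Not commutative: P @ Q swaps P's columns, Q @ P swaps P's rows
commutes = np.array_equal(P @ Q, Q @ P)

# Associative: grouping does not change the product
associative = np.array_equal((P @ Q) @ R, P @ (Q @ R))

# Transpose rule: the order of the factors reverses
transpose_rule = np.array_equal((P @ Q).T, Q.T @ P.T)

print(commutes, associative, transpose_rule)  # False True True
```

Integer matrices are used here so that the comparisons are exact; with floating-point data, `np.allclose` is the safer check.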

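**Geometric Interpretation:** Matrix multiplication is composition of linear transformations. If $\mathbf{A}$ rotates and $\mathbf{B}$ scales, then $\mathbf{A}\mathbf{B}\mathbf{x}$ scales $\mathbf{x}$ first and rotates the result, because the matrix nearest the vector acts first.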
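
To make the order of composition concrete, the sketch below assumes a 90-degree counter-clockwise rotation and a stretch along the x-axis, both chosen purely for illustration. In the product `rotate @ scale`, the right-hand matrix acts on the vector first.

```python
import numpy as np

# Assumed example transforms: rotate by 90 degrees, stretch x by a factor of 2
rotate = np.array([[0.0, -1.0],
                   [1.0,  0.0]])
scale = np.array([[2.0, 0.0],
                  [0.0, 1.0]])

v = np.array([1.0, 0.0])

# (rotate @ scale) @ v: v is stretched to (2, 0), then rotated to (0, 2)
scale_then_rotate = (rotate @ scale) @ v

# (scale @ rotate) @ v: v is rotated to (0, 1), then the stretch leaves it unchanged
rotate_then_scale = (scale @ rotate) @ v

print(scale_then_rotate)
print(rotate_then_scale)
```

The two results differ, which is the geometric face of non-commutativity.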

#### **1.2.2 Special Matrices**

**Identity Matrix ($\mathbf{I}$):** The "1" of matrices. $\mathbf{A}\mathbf{I} = \mathbf{A}$.
$$ \mathbf{I}_3 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} $$

**Diagonal Matrix:** Non-zero elements only on diagonal. Efficient storage ($O(n)$ vs $O(n^2)$).
```python
diag = np.diag([1, 2, 3])  # Creates diagonal matrix
extracted = np.diag(diag)  # Extracts diagonal: [1, 2, 3]
```

**Symmetric Matrix:** $\mathbf{A} = \mathbf{A}^T$. Common in covariance matrices and graph adjacency matrices.

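**Orthogonal Matrix:** $\mathbf{Q}^T \mathbf{Q} = \mathbf{I}$. Columns are orthonormal (perpendicular unit vectors). Multiplying by $\mathbf{Q}$ preserves lengths and angles, which is why orthogonal matrices appear in rotations, in QR and SVD factorizations, and in orthogonal weight initialization.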
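
The length- and angle-preserving property can be verified with a 2D rotation matrix, which is a standard example of an orthogonal matrix. The angle and the test vectors below are arbitrary choices for illustration.

```python
import numpy as np

theta = np.pi / 6  # assumed rotation angle
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

u = np.array([3.0, 4.0])
w = np.array([-1.0, 2.0])

# Q^T Q = I up to floating-point error
is_orthogonal = np.allclose(Q.T @ Q, np.eye(2))

# Lengths and dot products (and therefore angles) are unchanged
same_length = np.isclose(np.linalg.norm(Q @ u), np.linalg.norm(u))
same_dot = np.isclose((Q @ u) @ (Q @ w), u @ w)

print(is_orthogonal, same_length, same_dot)  # True True True
```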

#### **1.2.3 Matrix Transpose and Inverse**

**Transpose:** Flip over diagonal. $(\mathbf{A}^T)_{ij} = A_{ji}$.

**Inverse:** Matrix $\mathbf{A}^{-1}$ such that $\mathbf{A}\mathbf{A}^{-1} = \mathbf{I}$. Exists only if $\mathbf{A}$ is square and full rank (determinant ≠ 0).

In AI, we rarely compute inverses explicitly (forming $\mathbf{A}^{-1}$ costs $O(n^3)$ and is less numerically stable than solving the system directly), but understanding them is crucial for:
- Solving linear systems $\mathbf{A}\mathbf{x} = \mathbf{b}$
- Understanding covariance matrices
- Deriving normal equations for linear regression

```python
A_square = np.array([[4, 7], [2, 6]])

# Transpose
A_T = A_square.T

# Inverse
A_inv = np.linalg.inv(A_square)

# Verification
identity = A_square @ A_inv  # Should be close to I (with floating point error)
```
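
For the linear system $\mathbf{A}\mathbf{x} = \mathbf{b}$, the usual alternative to forming the inverse is `np.linalg.solve`. The sketch below compares the two routes; the right-hand side `b` is an assumed value for illustration.

```python
import numpy as np

A_square = np.array([[4.0, 7.0], [2.0, 6.0]])
b = np.array([1.0, 2.0])  # assumed right-hand side

# Preferred: factorize and solve without forming the inverse
x_solve = np.linalg.solve(A_square, b)

# Equivalent in exact arithmetic, but generally slower and less accurate
x_inv = np.linalg.inv(A_square) @ b

print(np.allclose(x_solve, x_inv))         # True
print(np.allclose(A_square @ x_solve, b))  # True
```

For a well-conditioned 2×2 matrix the two answers agree; the difference matters for large or nearly singular systems.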

**Pseudo-inverse (Moore-Penrose):** For non-square matrices, used in least squares solutions.
```python
A_pinv = np.linalg.pinv(A)  # Works for rectangular matrices
```
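
The link between the pseudo-inverse, the normal equations, and least squares can be sketched as follows. The target vector `y` is an assumed value; the equality of the three solutions relies on `A` having full column rank.

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])  # 3x2, full column rank
y = np.array([1.0, 2.0, 2.0])                        # assumed targets

# Least-squares solution via the pseudo-inverse
x_pinv = np.linalg.pinv(A) @ y

# Same solution from the normal equations: (A^T A) x = A^T y
x_normal = np.linalg.solve(A.T @ A, A.T @ y)

# Same solution from NumPy's dedicated least-squares routine
x_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)

print(np.allclose(x_pinv, x_normal), np.allclose(x_pinv, x_lstsq))  # True True
```

In practice `np.linalg.lstsq` is usually preferred over the normal equations, because forming $\mathbf{A}^T\mathbf{A}$ squares the condition number.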

#### **1.2.4 Eigenvalues and Eigenvectors**

For a square matrix $\mathbf{A}$, an eigenvector $\mathbf{v}$ is a non-zero vector that only gets scaled (not rotated) when multiplied by $\mathbf{A}$:
$$ \mathbf{A}\mathbf{v} = \lambda \mathbf{v} $$

Where $\lambda$ is the eigenvalue (scaling factor).

**Why this matters for AI:**
- **PCA (Dimensionality reduction):** Eigenvectors of covariance matrix = principal components
- **PageRank:** Dominant eigenvector of the web graph's link (transition) matrix
- **Stability analysis:** Eigenvalues of the Hessian matrix describe local curvature, which governs how quickly gradient-based optimization converges
- **Spectral clustering:** Uses eigenvectors of graph Laplacian

```python
A = np.array([[4, 2], [1, 3]])

# Compute eigenvalues and eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(A)

# eigenvectors[:, i] is the eigenvector for eigenvalues[i]
print(f"Eigenvalue: {eigenvalues[0]}")
print(f"Corresponding eigenvector: {eigenvectors[:, 0]}")

# Verify: A @ v should equal lambda * v
v = eigenvectors[:, 0]
lhs = A @ v
rhs = eigenvalues[0] * v
print(f"Verification (should be True): {np.allclose(lhs, rhs)}")
```

**Spectral Decomposition:** For symmetric matrices:
$$ \mathbf{A} = \mathbf{Q} \mathbf{\Lambda} \mathbf{Q}^T $$
Where $\mathbf{Q}$ is orthogonal matrix of eigenvectors, $\mathbf{\Lambda}$ is diagonal matrix of eigenvalues.
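
The decomposition can be checked with `np.linalg.eigh`, NumPy's eigen-solver for symmetric matrices. The matrix `S` below is an assumed example.

```python
import numpy as np

S = np.array([[2.0, 1.0],
              [1.0, 2.0]])  # assumed symmetric example

# eigh is specialized for symmetric (Hermitian) matrices:
# eigenvalues come back real and in ascending order
eigenvalues, Q = np.linalg.eigh(S)

Lambda = np.diag(eigenvalues)
reconstructed = Q @ Lambda @ Q.T

print(eigenvalues)                      # approximately [1. 3.]
print(np.allclose(reconstructed, S))    # True
print(np.allclose(Q.T @ Q, np.eye(2)))  # True: Q is orthogonal
```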

---

## **1.3 Calculus: The Optimization Engine**

Machine learning is optimization. We minimize loss functions using gradients. This section builds the machinery for understanding backpropagation.

#### **1.3.1 Derivatives and Gradients**

**Derivative:** Rate of change. Slope of the tangent line.
$$ f'(x) = \frac{df}{dx} = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h} $$

**Partial Derivative:** Derivative with respect to one variable, holding others constant.
$$ \frac{\partial f}{\partial x_i} $$

**Gradient:** Vector of partial derivatives. Points in direction of steepest ascent.
$$ \nabla f = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix} $$

In ML, we move **against** the gradient (gradient descent) to minimize loss.

```python
# Numerical gradient checking (finite differences)
def numerical_gradient(f, x, h=1e-5):
    """Compute gradient of f at x using central difference"""
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
    
    while not it.finished:
        idx = it.multi_index
        old_val = x[idx]
        
        x[idx] = old_val + h
        fx_plus = f(x)
        
        x[idx] = old_val - h
        fx_minus = f(x)
        
        grad[idx] = (fx_plus - fx_minus) / (2 * h)
        x[idx] = old_val
        it.iternext()
    
    return grad

# Example: f(x, y) = x^2 + y^2
def f(point):
    x, y = point
    return x**2 + y**2

point = np.array([3.0, 4.0])
grad = numerical_gradient(f, point)
print(f"Gradient at (3,4): {grad}")  # Should be close to [6, 8]
```
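
Moving against the gradient can be sketched as a short loop on the same function. The learning rate, the step count, and the hand-derived `grad_f` are illustrative choices, not values prescribed by the text.

```python
import numpy as np

def f(point):
    x, y = point
    return x**2 + y**2

def grad_f(point):
    # Analytic gradient of x^2 + y^2
    return 2 * point

point = np.array([3.0, 4.0])
learning_rate = 0.1  # assumed step size

for step in range(50):
    point = point - learning_rate * grad_f(point)  # step against the gradient

print(point, f(point))  # both close to zero, the minimum of f
```

With this step size each update multiplies the point by 0.8, so the iterates shrink geometrically toward the minimum at the origin.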

#### **1.3.2 The Chain Rule**

The chain rule is the mathematical foundation of **backpropagation**. If $z = f(y)$ and $y = g(x)$, then:
$$ \frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx} $$

For multiple variables (multivariate chain rule):
$$ \frac{\partial z}{\partial x} = \sum_i \frac{\partial z}{\partial y_i} \frac{\partial y_i}{\partial x} $$

**Example:** Neural network layer
$$ z = \text{ReLU}(\mathbf{w}^T \mathbf{x} + b) $$

To find $\frac{\partial z}{\partial \mathbf{w}}$, we apply chain rule through:
1. Linear transformation: $a = \mathbf{w}^T \mathbf{x} + b$
2. Activation: $z = \max(0, a)$

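$$ \frac{\partial z}{\partial \mathbf{w}} = \frac{\partial z}{\partial a} \cdot \frac{\partial a}{\partial \mathbf{w}} = \mathbb{I}(a > 0) \cdot \mathbf{x} $$

Where $\mathbb{I}$ is the indicator function (1 if true, 0 otherwise). The result has the same shape as $\mathbf{w}$. ReLU is not differentiable at $a = 0$; the convention used here assigns it a derivative of 0 at that point.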

```python
# Manual backpropagation example (variable names follow the equations above)
def relu(x):
    return np.maximum(0, x)

def relu_derivative(x):
    return (x > 0).astype(float)

# Forward pass
x = np.array([2.0, -1.0, 0.5])
w = np.array([0.1, 0.5, -0.2])
b = 0.1

a = np.dot(w, x) + b  # Pre-activation: about -0.3
z = relu(a)           # Output: 0.0, the unit is inactive for this input

# Backward pass (assuming upstream gradient dz = 1 for the final output)
dz = 1.0
da = dz * relu_derivative(a)  # Chain rule through ReLU: 0 because a < 0
dw = da * x                   # Gradient w.r.t. weights
db = da                       # Gradient w.r.t. bias

# Both gradients are zero here: no gradient flows through an inactive ReLU
print(f"Gradient w.r.t. w: {dw}")
print(f"Gradient w.r.t. b: {db}")
```
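
A hand-derived gradient is only trustworthy once it agrees with a numerical estimate. The sketch below checks the chain-rule result $\mathbb{I}(a > 0) \cdot \mathbf{x}$ against central differences. The helper `layer` and the bias value are assumptions; the bias is chosen so that the pre-activation is positive and the gradient is non-zero.

```python
import numpy as np

def layer(w, x, b):
    """z = ReLU(w . x + b) for a single neuron."""
    return np.maximum(0.0, np.dot(w, x) + b)

x = np.array([2.0, -1.0, 0.5])
w = np.array([0.1, 0.5, -0.2])
b = 1.0  # assumed bias, chosen so that w . x + b = 0.6 > 0

# Analytic gradient from the chain rule: indicator(a > 0) * x
a = np.dot(w, x) + b
analytic = float(a > 0) * x

# Central-difference estimate, one weight at a time
h = 1e-6
numeric = np.zeros_like(w)
for i in range(len(w)):
    step = np.zeros_like(w)
    step[i] = h
    numeric[i] = (layer(w + step, x, b) - layer(w - step, x, b)) / (2 * h)

print(np.allclose(analytic, numeric))  # True
```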

#### **1.3.3 Jacobian and Hessian**

**Jacobian Matrix:** For vector-valued function $\mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m$, the Jacobian is the matrix of all first-order partial derivatives.
$$ \mathbf{J}_{ij} = \frac{\partial f_i}{\partial x_j} $$

Used when layers output vectors rather than scalars.

**Hessian Matrix:** Matrix of second derivatives.
$$ \mathbf{H}_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j} $$

Used in:
- Second-order optimization methods (Newton's method)
- Determining convexity (positive definite = convex)
- Analyzing curvature of loss landscape (sharp vs flat minima)

```python
from scipy.optimize import approx_fprime

# Jacobian example
def vector_func(x):
    return np.array([
        x[0]**2 + x[1],
        x[0] * x[1]
    ])

x = np.array([1.0, 2.0])
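# Vector-valued functions require a recent SciPy (1.9 or later);
# older versions of approx_fprime accept scalar-valued functions only
jacobian = approx_fprime(x, vector_func, epsilon=1e-6)
print(f"Jacobian shape: {jacobian.shape}")  # (2, 2)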
```
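
The Hessian-based convexity test can be sketched with finite differences. The quadratic `f`, the step size `h`, and the helper `numerical_hessian` are all illustrative assumptions rather than a library API.

```python
import numpy as np

def f(p):
    x, y = p
    return x**2 + 2 * y**2 + x * y  # assumed convex quadratic

def numerical_hessian(func, p, h=1e-4):
    """Estimate the Hessian of func at p with central differences."""
    n = len(p)
    steps = np.eye(n) * h  # row i is a step of size h along axis i
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei, ej = steps[i], steps[j]
            H[i, j] = (func(p + ei + ej) - func(p + ei - ej)
                       - func(p - ei + ej) + func(p - ei - ej)) / (4 * h**2)
    return H

H = numerical_hessian(f, np.array([1.0, -1.0]))

# Symmetrize against round-off, then inspect the eigenvalues
eigenvalues = np.linalg.eigvalsh((H + H.T) / 2)

print(np.round(H, 3))           # approximately [[2, 1], [1, 4]]
print(np.all(eigenvalues > 0))  # True: positive definite, so f is convex
```

For a quadratic the Hessian is the same everywhere, so a single evaluation point is enough; for a general loss the check only describes local curvature.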

---

## **1.4 Probability Theory: Handling Uncertainty**

AI must reason under uncertainty. Probability provides the framework for:
- Quantifying prediction confidence
- Bayesian neural networks
- Generative models (diffusion, VAEs)
- Reinforcement learning (Markov Decision Processes)

#### **1.4.1 Fundamentals**

**Random Variable:** A variable whose value is subject to chance.
- Discrete (e.g., number of heads in 10 coin flips)
- Continuous (e.g., height of a person)

**Probability Mass Function (PMF):** For discrete variables. $P(X = x)$.

**Probability Density Function (PDF):** For continuous variables. $p(x)$ where $\int p(x) dx = 1$.

**Key Distributions for AI:**

1. **Bernoulli:** Binary outcome (coin flip). $P(X=1) = p$
2. **Categorical/Multinoulli:** K outcomes (class labels). Used in classification output
3. **Normal (Gaussian):** $p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$
   - Central to diffusion models, noise assumptions, weight initialization
4. **Uniform:** All outcomes equally likely. Used in random weight initialization and random sampling (dropout masks, by contrast, are Bernoulli)
5. **Beta/Dirichlet:** Distributions over probabilities. Used in Bayesian methods

```python
import numpy as np
from scipy import stats

# Sampling from distributions
normal_samples = np.random.normal(loc=0, scale=1, size=1000)  # Gaussian
uniform_samples = np.random.uniform(low=0, high=1, size=1000)
categorical_sample = np.random.choice([0, 1, 2], p=[0.2, 0.5, 0.3])

# Probability density
x = np.linspace(-3, 3, 100)
pdf = stats.norm.pdf(x, loc=0, scale=1)
```

#### **1.4.2 Expectation and Variance**

**Expectation (Mean):** Weighted average.
$$ \mathbb{E}[X] = \sum x P(X=x) \quad \text{(discrete)} $$
$$ \mathbb{E}[X] = \int x p(x) dx \quad \text{(continuous)} $$

**Variance:** Spread of distribution.
$$ \text{Var}(X) = \mathbb{E}[(X - \mu)^2] = \mathbb{E}[X^2] - (\mathbb{E}[X])^2 $$

**Covariance:** How two variables change together.
$$ \text{Cov}(X,Y) = \mathbb{E}[(X - \mu_X)(Y - \mu_Y)] $$

**Covariance Matrix:** For random vector $\mathbf{X}$, $\Sigma_{ij} = \text{Cov}(X_i, X_j)$. Symmetric, positive semi-definite. Diagonal elements are variances.

```python
# Empirical estimation from data
data = np.random.multivariate_normal(
    mean=[0, 0], 
    cov=[[1, 0.8], [0.8, 1]], 
    size=1000
)

mean = np.mean(data, axis=0)
cov_matrix = np.cov(data, rowvar=False)  # Columns are variables, rows are observations

print(f"Mean: {mean}")
print(f"Covariance matrix:\n{cov_matrix}")
```
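
The discrete formulas for expectation and variance can be evaluated exactly for a small distribution. The fair six-sided die below is an assumed example; it also confirms that the two variance expressions agree.

```python
import numpy as np

outcomes = np.arange(1, 7)   # faces of a fair six-sided die
probs = np.full(6, 1 / 6)    # PMF: each face equally likely

# E[X] = sum of x * P(X = x)
expectation = np.sum(outcomes * probs)

# Var(X) from the definition and from the shortcut E[X^2] - (E[X])^2
var_definition = np.sum((outcomes - expectation) ** 2 * probs)
var_shortcut = np.sum(outcomes**2 * probs) - expectation**2

print(expectation)                               # approximately 3.5
print(np.isclose(var_definition, var_shortcut))  # True
```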

#### **1.4.3 Bayes' Theorem**

One of the most important equations in probabilistic AI (Bayesian methods, spam filters, medical diagnosis):

$$ P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)} $$

Where:
- $P(A|B)$: Posterior (belief after seeing evidence)
- $P(B|A)$: Likelihood (probability of evidence given hypothesis)
- $P(A)$: Prior (initial belief)
- $P(B)$: Marginal likelihood (normalizing constant)

**Example:** Medical Test
- Disease prevalence: $P(\text{Disease}) = 0.01$ (1%)
- Test sensitivity: $P(\text{Positive}|\text{Disease}) = 0.99$
- False positive rate: $P(\text{Positive}|\text{No Disease}) = 0.05$

What's $P(\text{Disease}|\text{Positive})$?

```python
# Bayesian updating example
def bayesian_update(prior, likelihood, marginal):
    return (likelihood * prior) / marginal

prior_disease = 0.01
sensitivity = 0.99
false_positive = 0.05

# Marginal likelihood: P(Pos) = P(Pos|D)P(D) + P(Pos|~D)P(~D)
p_positive = (sensitivity * prior_disease + 
              false_positive * (1 - prior_disease))

posterior = bayesian_update(prior_disease, sensitivity, p_positive)
print(f"Probability of disease given positive test: {posterior:.4f}")
# Result: ~0.167 (16.7%) - surprisingly low due to low base rate!
```
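
Bayesian updating can be repeated: the posterior after one piece of evidence becomes the prior for the next. The sketch below applies a second positive test to the same patient, under the assumption that the two test results are conditionally independent given the disease status. The helper name `posterior_after_positive` is illustrative.

```python
def posterior_after_positive(prior, sensitivity, false_positive):
    """P(Disease | Positive) for one positive test result."""
    evidence = sensitivity * prior + false_positive * (1 - prior)
    return sensitivity * prior / evidence

belief = 0.01  # prevalence from the example
history = []
for test_number in range(2):
    # The previous posterior becomes the new prior
    belief = posterior_after_positive(belief, sensitivity=0.99, false_positive=0.05)
    history.append(belief)

print([round(p, 3) for p in history])  # [0.167, 0.798]
```

One positive result leaves the diagnosis unlikely, while a second independent positive raises it to roughly 80%.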

#### **1.4.4 Maximum Likelihood Estimation (MLE)**

How do we train models? We maximize the likelihood of observed data.

Given data $D$ and parameters $\theta$, find:
$$ \hat{\theta} = \arg\max_\theta P(D|\theta) $$

In practice, we equivalently minimize the negative log-likelihood (NLL), which turns a product of probabilities into a numerically stable sum:
$$ \mathcal{L}(\theta) = -\sum_i \log P(x_i|\theta) $$

This is why we use cross-entropy loss for classification—it's the negative log-likelihood of the categorical distribution.

```python
# MLE for Gaussian mean
data = np.random.normal(loc=5.0, scale=2.0, size=1000)

# Analytical MLE for mean is just the sample mean
mle_mean = np.mean(data)
mle_std = np.std(data, ddof=0)  # Population std (MLE uses N, not N-1)

print(f"True mean: 5.0, Estimated: {mle_mean:.4f}")
print(f"True std: 2.0, Estimated: {mle_std:.4f}")
```
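
The claim that cross-entropy is the negative log-likelihood of a categorical distribution can be checked directly. The predicted probabilities and labels below are made-up values for illustration.

```python
import numpy as np

# Assumed model outputs: predicted class probabilities for 3 samples, 3 classes
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4]])
labels = np.array([0, 1, 2])  # assumed true class indices

# Likelihood of each true label under the categorical distribution
p_true = probs[np.arange(len(labels)), labels]

# Negative log-likelihood, averaged over samples
nll = -np.mean(np.log(p_true))

# Cross-entropy against one-hot targets gives the same number
one_hot = np.eye(3)[labels]
cross_entropy = -np.mean(np.sum(one_hot * np.log(probs), axis=1))

print(np.isclose(nll, cross_entropy))  # True
```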

---

## **1.5 Statistics: Validating Knowledge**

Mathematics tells us what is true; statistics tells us what we can infer from finite, noisy data.

#### **1.5.1 Sampling and Estimation**

**Population vs Sample:** We rarely have population data; we infer from samples.

**Estimator Properties:**
- **Bias:** $\mathbb{E}[\hat{\theta}] - \theta$ (systematic error)
- **Variance:** $\text{Var}(\hat{\theta})$ (sensitivity to sample)
- **Consistency:** $\hat{\theta} \to \theta$ as $n \to \infty$

**Bias-Variance Tradeoff** (preview of ML concepts):
- High bias: Underfitting (too simple)
- High variance: Overfitting (too complex)
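
The bias of an estimator can be estimated by simulation. The sketch below compares the variance estimator that divides by $n$ with the one that divides by $n - 1$; the seed, sample size, and true variance are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility
true_var = 4.0                  # assumed population variance
n = 5                           # small samples make the bias visible

# 100,000 independent samples of size n
samples = rng.normal(loc=0.0, scale=np.sqrt(true_var), size=(100_000, n))

mle_estimates = samples.var(axis=1, ddof=0)       # divides by n
unbiased_estimates = samples.var(axis=1, ddof=1)  # divides by n - 1

print(mle_estimates.mean())       # close to true_var * (n - 1) / n = 3.2
print(unbiased_estimates.mean())  # close to true_var = 4.0
```

The divide-by-$n$ estimator is biased low by the factor $(n - 1)/n$, but it is still consistent because that factor tends to 1 as $n$ grows.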

#### **1.5.2 Hypothesis Testing**

**Null Hypothesis ($H_0$):** Default assumption (e.g., "Model A and B perform equally")

**Alternative Hypothesis ($H_1$):** The claim we seek evidence for (e.g., "Model A is better")

**p-value:** Probability of observing data at least as extreme as what we observed, assuming $H_0$ is true. By convention, if $p < \alpha$ (commonly $\alpha = 0.05$), we reject $H_0$. A p-value is not the probability that $H_0$ is true.

**T-test:** Compare means of two groups.
```python
from scipy import stats

# Model A accuracy scores across 10 folds
model_a = [0.85, 0.87, 0.84, 0.86, 0.88, 0.85, 0.86, 0.87, 0.85, 0.86]
# Model B accuracy scores
model_b = [0.82, 0.83, 0.81, 0.84, 0.82, 0.83, 0.81, 0.82, 0.83, 0.82]

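# Independent two-sample t-test. If both models were scored on the same folds,
# a paired test (stats.ttest_rel) is usually the better choice.
t_stat, p_value = stats.ttest_ind(model_a, model_b)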
print(f"T-statistic: {t_stat:.4f}, p-value: {p_value:.6f}")

if p_value < 0.05:
    print("Significant difference between models!")
else:
    print("No significant difference detected.")
```

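**Multiple Testing Problem:** If you test 20 true null hypotheses at $\alpha = 0.05$, you expect about 1 false positive by chance ($20 \times 0.05$). The Bonferroni correction controls this by testing each hypothesis at $\alpha_{adjusted} = \alpha / m$, where $m$ is the number of tests.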
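
A minimal sketch of the Bonferroni correction, using hypothetical p-values (they are not results from any experiment in this chapter) and writing `m` for the number of tests:

```python
import numpy as np

# Hypothetical p-values from 5 separate comparisons
p_values = np.array([0.003, 0.020, 0.045, 0.300, 0.008])
alpha = 0.05

m = len(p_values)
alpha_adjusted = alpha / m  # Bonferroni threshold: 0.01

significant_naive = p_values < alpha
significant_bonferroni = p_values < alpha_adjusted

print(significant_naive.sum())       # 4
print(significant_bonferroni.sum())  # 2
```

Bonferroni is conservative: it guards against any false positive across the family of tests, at the cost of reduced power.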

#### **1.5.3 Confidence Intervals**

A 95% confidence interval means: if we repeated the experiment many times, 95% of intervals would contain the true parameter.

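$$ \text{CI} = \bar{x} \pm t_{\alpha/2} \cdot \frac{s}{\sqrt{n}} $$

Here $\bar{x}$ is the sample mean, $s$ the sample standard deviation, $n$ the sample size, and $t_{\alpha/2}$ the critical value of the $t$-distribution with $n - 1$ degrees of freedom.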

```python
def confidence_interval(data, confidence=0.95):
    n = len(data)
    mean = np.mean(data)
    std_err = stats.sem(data)  # Standard error of mean
    margin = std_err * stats.t.ppf((1 + confidence) / 2, n - 1)
    return mean - margin, mean + margin

lower, upper = confidence_interval(model_a)
print(f"95% CI for Model A: [{lower:.4f}, {upper:.4f}]")
```

#### **1.5.4 Correlation vs. Causation**

**Correlation:** Statistical relationship ($\rho$ or $r$)
$$ r = \frac{\text{Cov}(X,Y)}{\sigma_X \sigma_Y} $$

Range: $[-1, 1]$. $r=1$ perfect positive linear relationship.

**Causation:** $X$ causes $Y$ requires:
1. Temporal precedence ($X$ before $Y$)
2. Covariation (correlation)
3. Elimination of confounds (controlled experiments, instrumental variables, or causal inference methods like do-calculus)

**Simpson's Paradox:** A trend that appears in every subgroup can disappear or reverse when the groups are pooled.

```python
# Correlation example
x = np.random.randn(100)
y = 2 * x + np.random.randn(100) * 0.5  # y depends on x with noise

correlation = np.corrcoef(x, y)[0, 1]
print(f"Correlation: {correlation:.4f}")

# Spurious correlation example (both depend on time)
time = np.arange(100)
ice_cream_sales = 10 + 0.5 * time + np.random.randn(100)
drowning_deaths = 5 + 0.3 * time + np.random.randn(100)

spurious_corr = np.corrcoef(ice_cream_sales, drowning_deaths)[0, 1]
print(f"Spurious correlation (ice cream vs drowning): {spurious_corr:.4f}")
# High correlation! But ice cream doesn't cause drowning: both series follow the
# shared `time` trend, which stands in for a confounder such as summer temperature
```
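
Simpson's paradox is easiest to believe after seeing it in numbers. The sketch below uses a small synthetic dataset, constructed for illustration, in which `y` falls with `x` inside each group while the pooled data show a positive correlation.

```python
import numpy as np

# Synthetic data: within each group, y falls as x rises
x_a = np.arange(1, 6, dtype=float)
y_a = 5.0 - 0.5 * x_a
x_b = np.arange(6, 11, dtype=float)
y_b = 12.0 - 0.5 * x_b

r_a = np.corrcoef(x_a, y_a)[0, 1]  # perfectly negative within group A
r_b = np.corrcoef(x_b, y_b)[0, 1]  # perfectly negative within group B

# Pooling ignores the offset between the groups and flips the sign
x_all = np.concatenate([x_a, x_b])
y_all = np.concatenate([y_a, y_b])
r_all = np.corrcoef(x_all, y_all)[0, 1]

print(r_a, r_b, r_all)
```

Group membership acts as the confounder here: group B has both larger `x` and larger `y`, which dominates the pooled trend.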

---

## **1.6 Workbook Labs**

Complete these labs to solidify understanding. Solutions should be committed to your GitHub portfolio.

### **Lab 1: Linear Algebra from Scratch**
Implement matrix multiplication, transpose, and inverse without using `np.linalg` (use only basic loops and NumPy indexing). Verify against NumPy implementations.

**Deliverable:** `linear_algebra_scratch.py` with functions `matmul_custom()`, `transpose_custom()`, `inverse_custom()`, and comparison tests.

### **Lab 2: Gradient Descent Visualizer**
Implement gradient descent on $f(x, y) = x^2 + 2y^2$. Track the path taken from initial point $(2, 3)$. Plot contours of the function and overlay the optimization path using Matplotlib.

**Deliverable:** Jupyter notebook with visualization and convergence analysis for different learning rates (0.01, 0.1, 0.5, 1.0).

### **Lab 3: Principal Component Analysis (PCA)**
Using only eigenvalue decomposition (no `sklearn`), implement PCA on the Iris dataset.
1. Center the data
2. Compute covariance matrix
3. Find eigenvectors/values
4. Project data to 2D
5. Plot with colors for species

**Deliverable:** `pca_from_scratch.py` + comparison plot with `sklearn.decomposition.PCA`.

### **Lab 4: Maximum Likelihood Estimation**
Given coin flip data (0=tail, 1=head), implement gradient ascent to find the MLE of the probability $p$ of heads. Compare with the analytical solution ($\hat{p} = \frac{\text{number of heads}}{n}$).

**Deliverable:** Notebook showing convergence of optimization and final probability estimate.

### **Lab 5: Statistical Significance in Model Selection**
You have accuracy scores from 5-fold cross-validation for three algorithms. Perform ANOVA test to check if any difference exists, then pairwise t-tests with Bonferroni correction to identify which differ significantly.

**Deliverable:** `model_comparison.py` with statistical report.

---

## **1.7 Common Pitfalls**

1. **Matrix Dimension Mismatch:** Always check shapes. $(m \times n) \cdot (n \times p) = (m \times p)$. Common error: forgetting to transpose.
   
2. **Confusing Row vs Column Vectors:** NumPy 1D arrays are neither (shape `(n,)`). Explicitly reshape to `(n,1)` or `(1,n)` for matrix operations.

3. **Numerical Instability:** Computing softmax as $\frac{e^x}{\sum e^x}$ overflows for large $x$. Solution: subtract $\max(x)$ before exponentiation.

4. **Sample vs Population Std:** Use `ddof=1` (Delta Degrees of Freedom) for sample standard deviation, `ddof=0` for population/MLE.

5. **p-hacking:** Running tests until you find significance. Always pre-register hypotheses or use correction methods.

6. **Assuming Causation:** "Users who click recommendations have higher LTV" doesn't mean "Make them click more" increases LTV (selection bias).
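
Pitfall 3 can be reproduced in a few lines. The sketch below contrasts a naive softmax with the max-subtraction fix; the function names and the input values are illustrative.

```python
import numpy as np

def softmax_naive(x):
    exps = np.exp(x)
    return exps / np.sum(exps)

def softmax_stable(x):
    # Subtracting the max leaves the result unchanged mathematically,
    # but keeps every exponent <= 0 so np.exp cannot overflow
    shifted = x - np.max(x)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

logits = np.array([1000.0, 1001.0, 1002.0])  # assumed large inputs

# Suppress the overflow warnings that the naive version triggers
with np.errstate(over="ignore", invalid="ignore"):
    naive = softmax_naive(logits)  # exp(1000) overflows to inf, inf / inf is nan

stable = softmax_stable(logits)

print(np.isnan(naive).any())  # True
print(stable, stable.sum())   # valid probabilities that sum to 1
```

The shift works because softmax is invariant to adding a constant to every input: the factor $e^{-\max(x)}$ cancels between numerator and denominator.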

---

## **1.8 Interview Questions**

**Q1:** Why is matrix multiplication not commutative? Give an ML example where order matters.
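*A: Generally $\mathbf{A}\mathbf{B} \neq \mathbf{B}\mathbf{A}$, because composing transformations depends on their order: rotating and then scaling non-uniformly differs from scaling and then rotating. In a neural network, layers are applied in a fixed order, $\mathbf{W}_2(\mathbf{W}_1\mathbf{x})$, and backpropagation multiplies the layer Jacobians in the reverse order (chain rule direction), so swapping factors changes the result or makes the shapes incompatible.*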

**Q2:** What is the geometric interpretation of eigenvalues/eigenvectors? Why are they important in PCA?
*A: Eigenvectors are directions invariant under linear transformation; eigenvalues are scaling factors along those directions. In PCA, eigenvectors of covariance matrix are principal directions of variance; eigenvalues indicate variance magnitude.*

**Q3:** Explain the chain rule and why it's crucial for neural networks.
*A: Chain rule allows computing derivatives of composite functions. Neural networks are compositions of layers $f_n(f_{n-1}(...f_1(x)))$. Backpropagation applies chain rule iteratively from output to input to compute gradients efficiently.*

**Q4:** What's the difference between L1 and L2 regularization mathematically? Why does L1 induce sparsity?
*A: L1 adds $\lambda \sum |w_i|$ to the loss; L2 adds $\lambda \sum w_i^2$. The L1 penalty has diamond-shaped level sets with corners on the axes, so the optimum often lands where some weights are exactly zero. The L2 penalty has circular level sets, shrinking weights in proportion to their size but rarely to exactly zero.*

**Q5:** A medical test is 99% accurate. You test positive. What's the probability you're sick?
*A: It depends on the base rate (Bayes' theorem). If the disease is rare (0.1% prevalence) and "99% accurate" means 99% sensitivity and 99% specificity, false positives outnumber true positives and the posterior probability is only about 9%. This illustrates the importance of the prior probability.*

---

## **1.9 Further Reading**

**Books:**
- *Mathematics for Machine Learning* (Deisenroth, Faisal, Ong) - Free PDF available
- *The Elements of Statistical Learning* (Hastie, Tibshirani, Friedman) - Chapter 2 (Overview of Supervised Learning)
- *Deep Learning* (Goodfellow, Bengio, Courville) - Part I (Applied Math)

**Courses:**
- Khan Academy: Linear Algebra, Calculus, Statistics
- MIT 18.06 (Linear Algebra) - Gilbert Strang
- Stanford CS229 (Andrew Ng) - Mathematical foundations of ML

**Interactive:**
- 3Blue1Brown "Essence of Linear Algebra" and "Essence of Calculus" (YouTube)
- Distill.pub: "Why Momentum Really Works" (optimization visualizations)

---

## **1.10 Checkpoint Project: Mathematical Toolkit Library**

Build a Python package `ai_math_toolkit` that implements:

**Module 1: `linalg_tools.py`**
- Custom matrix operations (no NumPy linalg)
- Singular Value Decomposition (SVD) using power iteration
- Function to check if matrix is positive definite

**Module 2: `autodiff.py`**
- Simple automatic differentiation engine using computational graphs
- Support for +, *, sin, exp, log operations
- Verify gradients against numerical differentiation

**Module 3: `probability_distributions.py`**
- Classes for Gaussian, Bernoulli, Categorical distributions
- Methods for PDF/PMF, sampling, MLE fitting
- KL-divergence calculation between distributions

**Module 4: `stats_tests.py`**
- Implement t-test from scratch
- Bootstrap confidence intervals
- Permutation test for independence

**Deliverables:**
- GitHub repository with tests (pytest)
- Documentation showing mathematical formulas implemented
- Jupyter notebook demonstrating usage on real dataset (e.g., comparing two marketing campaigns)

**Success Criteria:**
- All functions pass unit tests against scipy/numpy equivalents (within numerical tolerance)
- Code is vectorized (no Python loops for matrix ops)
- Documentation includes mathematical derivation in LaTeX

---

**End of Chapter 1**

*Proceed to Chapter 2 only after completing all labs and the Checkpoint Project. The mathematical intuition built here is the foundation upon which all neural network architectures rest.*

---
