# Module 2: Mathematical Foundations for Machine Learning

---

Machine Learning algorithms are built on mathematical principles. This module covers the core areas of mathematics you need to understand: **linear algebra**, **statistics**, **probability**, and the intuition behind **calculus** (gradients). We will use Python and NumPy to make every concept concrete.

You do not need to be a mathematician — the goal is to build intuition and working fluency.

---

## Table of Contents

1. [Linear Algebra Essentials](#1.-Linear-Algebra-Essentials)
2. [Statistics Refresher](#2.-Statistics-Refresher)
3. [Probability Basics](#3.-Probability-Basics)
4. [Calculus Intuition: Gradients and Optimization](#4.-Calculus-Intuition:-Gradients-and-Optimization)
5. [Exercises](#5.-Exercises)
6. [Summary and Further Reading](#6.-Summary-and-Further-Reading)

In [None]:
# Libraries used throughout this module
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

plt.style.use('seaborn-v0_8-whitegrid')
np.random.seed(42)

---

## 1. Linear Algebra Essentials

Linear algebra is the backbone of most ML algorithms. Data is represented as matrices, and transformations on that data are expressed as matrix operations.

### 1.1 Scalars, Vectors, and Matrices

| Object | Description | Example |
|--------|------------|--------|
| **Scalar** | A single number | Temperature: 25.3 |
| **Vector** | An ordered list of numbers (1D) | Feature vector: [5.1, 3.5, 1.4, 0.2] |
| **Matrix** | A 2D array of numbers (rows x columns) | A dataset of 100 samples with 4 features |

In [None]:
# Scalars, Vectors, Matrices in NumPy

# Scalar
scalar = 5.0
print(f"Scalar: {scalar}")
print(f"  Type: {type(scalar)}")

# Vector (1D array)
vector = np.array([5.1, 3.5, 1.4, 0.2])
print(f"\nVector: {vector}")
print(f"  Shape: {vector.shape}")
print(f"  Dimension: {vector.ndim}")

# Matrix (2D array)
matrix = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9]
])
print(f"\nMatrix:\n{matrix}")
print(f"  Shape: {matrix.shape}  (rows x columns)")
print(f"  Dimension: {matrix.ndim}")

### 1.2 Vector Operations

These operations appear constantly in ML — from computing distances between data points to calculating model predictions.

In [None]:
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

print("Vector a:", a)
print("Vector b:", b)
print()

# Element-wise operations
print("Addition (a + b):       ", a + b)
print("Subtraction (a - b):    ", a - b)
print("Element-wise multiply:  ", a * b)

# Dot product: sum of element-wise products
dot_product = np.dot(a, b)  # 1*4 + 2*5 + 3*6 = 32
print(f"\nDot product (a . b):    {dot_product}")
print(f"  Manual check: 1*4 + 2*5 + 3*6 = {1*4 + 2*5 + 3*6}")

# Vector magnitude (L2 norm)
magnitude_a = np.linalg.norm(a)
print(f"\nMagnitude of a (||a||): {magnitude_a:.4f}")
print(f"  Manual check: sqrt(1^2 + 2^2 + 3^2) = {np.sqrt(1+4+9):.4f}")

# Euclidean distance between two vectors
distance = np.linalg.norm(a - b)
print(f"\nEuclidean distance between a and b: {distance:.4f}")

In [None]:
# Visualizing vectors in 2D
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Vector addition
v1 = np.array([2, 1])
v2 = np.array([1, 3])
v_sum = v1 + v2

ax = axes[0]
ax.quiver(0, 0, v1[0], v1[1], angles='xy', scale_units='xy', scale=1,
          color='#2196F3', linewidth=2, label=f'a = {list(v1)}')
ax.quiver(0, 0, v2[0], v2[1], angles='xy', scale_units='xy', scale=1,
          color='#FF5722', linewidth=2, label=f'b = {list(v2)}')
ax.quiver(0, 0, v_sum[0], v_sum[1], angles='xy', scale_units='xy', scale=1,
          color='#4CAF50', linewidth=2, label=f'a + b = {list(v_sum)}')
# Parallelogram
ax.plot([v1[0], v_sum[0]], [v1[1], v_sum[1]], 'k--', alpha=0.3)
ax.plot([v2[0], v_sum[0]], [v2[1], v_sum[1]], 'k--', alpha=0.3)
ax.set_xlim(-0.5, 5)
ax.set_ylim(-0.5, 5)
ax.set_aspect('equal')
ax.grid(True, alpha=0.3)
ax.legend(fontsize=11)
ax.set_title('Vector Addition', fontsize=14, fontweight='bold')
ax.set_xlabel('x')
ax.set_ylabel('y')

# Dot product and angle
ax = axes[1]
v3 = np.array([3, 1])
v4 = np.array([1, 3])
cos_angle = np.dot(v3, v4) / (np.linalg.norm(v3) * np.linalg.norm(v4))
angle = np.arccos(cos_angle)

ax.quiver(0, 0, v3[0], v3[1], angles='xy', scale_units='xy', scale=1,
          color='#2196F3', linewidth=2, label=f'a = {list(v3)}')
ax.quiver(0, 0, v4[0], v4[1], angles='xy', scale_units='xy', scale=1,
          color='#FF5722', linewidth=2, label=f'b = {list(v4)}')

theta = np.linspace(np.arctan2(v3[1], v3[0]), np.arctan2(v4[1], v4[0]), 30)
ax.plot(0.8 * np.cos(theta), 0.8 * np.sin(theta), 'k-', linewidth=1.5)
mid_angle = (np.arctan2(v3[1], v3[0]) + np.arctan2(v4[1], v4[0])) / 2
ax.text(1.1 * np.cos(mid_angle), 1.1 * np.sin(mid_angle),
        f'{np.degrees(angle):.1f} deg', fontsize=11, ha='center')

ax.set_xlim(-0.5, 4)
ax.set_ylim(-0.5, 4)
ax.set_aspect('equal')
ax.grid(True, alpha=0.3)
ax.legend(fontsize=11)
ax.set_title(f'Dot Product and Angle\na . b = {np.dot(v3, v4)}', fontsize=14, fontweight='bold')
ax.set_xlabel('x')
ax.set_ylabel('y')

plt.tight_layout()
plt.show()

### 1.3 Matrix Operations

In ML, a dataset is typically stored as a matrix where rows represent samples and columns represent features.
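
As a small illustration of the rows-as-samples convention, the sketch below builds a hypothetical 3 x 2 feature matrix and multiplies it by a made-up weight vector, which yields one prediction per sample in a single matrix-vector product. The names `X_demo`, `w_demo`, and `b_demo` and their values are invented for this example and are not used elsewhere in the module.

```python
import numpy as np

# Hypothetical data: 3 samples (rows), 2 features (columns)
X_demo = np.array([[1.0, 2.0],
                   [3.0, 4.0],
                   [5.0, 6.0]])

# Made-up weights (one per feature) and bias for a linear model
w_demo = np.array([0.5, -1.0])
b_demo = 2.0

# One matrix-vector product gives a prediction for every sample
predictions = X_demo @ w_demo + b_demo

# The same result computed row by row with dot products
predictions_loop = np.array([np.dot(row, w_demo) + b_demo for row in X_demo])

print("Predictions (vectorized):", predictions)
print("Predictions (row by row):", predictions_loop)
```

Both approaches give the same numbers; the vectorized form is what NumPy-based ML code typically relies on.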

In [None]:
A = np.array([[1, 2], [3, 4], [5, 6]])
B = np.array([[7, 8], [9, 10]])

print("Matrix A (3x2):")
print(A)
print(f"\nMatrix B (2x2):")
print(B)

# Transpose: swap rows and columns
print(f"\nTranspose of A (2x3):")
print(A.T)

# Matrix multiplication: A (3x2) @ B (2x2) = result (3x2)
C = A @ B  # or np.dot(A, B)
print(f"\nA @ B (matrix multiplication):")
print(C)
print(f"  Shape: {C.shape}")

# Identity matrix
I = np.eye(3)
print(f"\nIdentity matrix (3x3):")
print(I)
print("Any matrix times the identity matrix returns itself (like multiplying by 1).")

In [None]:
# Matrix inverse and determinant
# This is directly relevant to Linear Regression (Module 4)

M = np.array([[2, 1], [5, 3]])
print("Matrix M:")
print(M)

# Inverse
M_inv = np.linalg.inv(M)
print(f"\nInverse of M:")
print(M_inv)

# Verify: M @ M_inv should equal the identity matrix
print(f"\nM @ M_inv (should be identity):")
print(np.round(M @ M_inv, decimals=10))

# Determinant
det = np.linalg.det(M)
print(f"\nDeterminant of M: {det:.4f}")
print("A square matrix is invertible if and only if its determinant is non-zero.")

### 1.4 Why Linear Algebra Matters in ML

- **Data representation**: A dataset with n samples and p features is an n x p matrix.
- **Linear Regression**: The closed-form solution involves matrix operations: w = (X^T X)^{-1} X^T y
- **PCA (dimensionality reduction)**: Uses eigenvalue decomposition of the covariance matrix.
- **Neural Networks**: Each fully connected layer computes a matrix multiplication (plus a bias), typically followed by a nonlinear activation.
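
The closed-form linear regression formula in the list above can be tried on synthetic data. The following is a minimal sketch, assuming made-up true weights and a small amount of Gaussian noise; the names `X_design`, `true_w`, and `w_hat` are chosen for this illustration only. It solves the system (X^T X) w = X^T y with `np.linalg.solve` rather than forming the inverse explicitly, which is usually the more numerically stable choice.

```python
import numpy as np

rng = np.random.default_rng(0)  # local generator; leaves the global seed untouched

# Synthetic data (assumption): 100 samples, 2 features, known true weights
n_samples = 100
X_raw = rng.normal(size=(n_samples, 2))
true_w = np.array([3.0, 1.5, -2.0])  # [intercept, weight 1, weight 2]

# Prepend a column of ones so the first weight acts as the intercept
X_design = np.column_stack([np.ones(n_samples), X_raw])
y = X_design @ true_w + rng.normal(scale=0.1, size=n_samples)

# Normal equation: solve (X^T X) w = X^T y for w
w_hat = np.linalg.solve(X_design.T @ X_design, X_design.T @ y)

print("True weights:     ", true_w)
print("Estimated weights:", np.round(w_hat, 3))
```

With low noise, the estimated weights should land close to the true ones.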

---

## 2. Statistics Refresher

Statistics helps us understand the properties of our data and make informed decisions about models.

### 2.1 Measures of Central Tendency and Spread

In [None]:
# Generate a sample dataset
np.random.seed(42)
data = np.random.normal(loc=50, scale=15, size=1000)  # mean=50, std=15

print("=" * 50)
print("DESCRIPTIVE STATISTICS")
print("=" * 50)
print(f"Mean:               {np.mean(data):.2f}")
print(f"Median:             {np.median(data):.2f}")
print(f"Mode (approx):      {stats.mode(np.round(data), keepdims=True).mode[0]:.2f}")
print(f"Variance:           {np.var(data):.2f}")
print(f"Standard Deviation: {np.std(data):.2f}")
print(f"Minimum:            {np.min(data):.2f}")
print(f"Maximum:            {np.max(data):.2f}")
print(f"Range:              {np.ptp(data):.2f}")
print(f"25th percentile:    {np.percentile(data, 25):.2f}")
print(f"75th percentile:    {np.percentile(data, 75):.2f}")
print(f"IQR:                {np.percentile(data, 75) - np.percentile(data, 25):.2f}")

In [None]:
# Visualize the distribution
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Histogram
ax = axes[0]
ax.hist(data, bins=40, color='#2196F3', edgecolor='white', alpha=0.7, density=True)
ax.axvline(np.mean(data), color='red', linestyle='--', linewidth=2, label=f'Mean: {np.mean(data):.1f}')
ax.axvline(np.median(data), color='green', linestyle='--', linewidth=2, label=f'Median: {np.median(data):.1f}')
ax.set_title('Histogram with Mean and Median', fontsize=13, fontweight='bold')
ax.set_xlabel('Value')
ax.set_ylabel('Density')
ax.legend(fontsize=10)

# Box plot
ax = axes[1]
bp = ax.boxplot(data, vert=True, patch_artist=True,
                boxprops=dict(facecolor='#2196F3', alpha=0.6),
                medianprops=dict(color='red', linewidth=2))
ax.set_title('Box Plot', fontsize=13, fontweight='bold')
ax.set_ylabel('Value')

# QQ plot (check if data follows normal distribution)
ax = axes[2]
stats.probplot(data, dist='norm', plot=ax)
ax.set_title('Q-Q Plot (Normality Check)', fontsize=13, fontweight='bold')
ax.get_lines()[0].set_markerfacecolor('#2196F3')
ax.get_lines()[0].set_alpha(0.5)

plt.tight_layout()
plt.show()

### 2.2 Common Distributions

Understanding distributions is essential for choosing the right models and interpreting results.

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Normal (Gaussian) Distribution
ax = axes[0, 0]
x = np.linspace(-4, 4, 200)
for mu, sigma in [(0, 1), (0, 0.5), (1, 1.5)]:
    y = stats.norm.pdf(x, mu, sigma)
    ax.plot(x, y, linewidth=2, label=f'mu={mu}, sigma={sigma}')
ax.set_title('Normal (Gaussian) Distribution', fontsize=13, fontweight='bold')
ax.legend(fontsize=9)
ax.set_xlabel('x')
ax.set_ylabel('Probability Density')

# Uniform Distribution
ax = axes[0, 1]
x_uni = np.linspace(-0.5, 1.5, 200)
ax.plot(x_uni, stats.uniform.pdf(x_uni), linewidth=2, color='#FF5722')
ax.fill_between(x_uni, stats.uniform.pdf(x_uni), alpha=0.3, color='#FF5722')
ax.set_title('Uniform Distribution (0, 1)', fontsize=13, fontweight='bold')
ax.set_xlabel('x')
ax.set_ylabel('Probability Density')

# Binomial Distribution
ax = axes[1, 0]
n_trials = 20
for p in [0.3, 0.5, 0.7]:
    x_binom = np.arange(0, n_trials + 1)
    y_binom = stats.binom.pmf(x_binom, n_trials, p)
    ax.bar(x_binom + (p - 0.5) * 0.25, y_binom, width=0.25, alpha=0.7, label=f'p={p}')
ax.set_title('Binomial Distribution (n=20)', fontsize=13, fontweight='bold')
ax.legend(fontsize=10)
ax.set_xlabel('Number of Successes')
ax.set_ylabel('Probability')

# Poisson Distribution
ax = axes[1, 1]
for lam in [2, 5, 10]:
    x_pois = np.arange(0, 25)
    y_pois = stats.poisson.pmf(x_pois, lam)
    ax.plot(x_pois, y_pois, 'o-', linewidth=1.5, markersize=5, label=f'lambda={lam}')
ax.set_title('Poisson Distribution', fontsize=13, fontweight='bold')
ax.legend(fontsize=10)
ax.set_xlabel('k')
ax.set_ylabel('Probability')

plt.tight_layout()
plt.show()

### 2.3 Correlation

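The Pearson correlation coefficient r measures the strength and direction of the linear relationship between two variables. It ranges from -1 (perfect negative correlation) to +1 (perfect positive correlation); a value near 0 indicates no linear relationship, although a nonlinear one may still exist.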
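
The coefficient returned by `np.corrcoef` is the Pearson correlation, which equals the covariance of the two variables divided by the product of their standard deviations. The sketch below computes it by hand on synthetic data (an assumption made for illustration) and compares the result with NumPy's built-in value.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data (assumption): y depends linearly on x plus noise
x = rng.normal(size=500)
y = 2.0 * x + rng.normal(scale=0.5, size=500)

# Pearson r = cov(x, y) / (std(x) * std(y))
x_centered = x - x.mean()
y_centered = y - y.mean()
covariance = np.mean(x_centered * y_centered)
r_manual = covariance / (x.std() * y.std())

# NumPy's built-in version for comparison
r_numpy = np.corrcoef(x, y)[0, 1]

print(f"Manual r: {r_manual:.4f}")
print(f"NumPy r:  {r_numpy:.4f}")
```

The two values should agree up to floating-point rounding, because the normalization factors cancel in the ratio.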

In [None]:
# Generate correlated data
np.random.seed(42)
n = 200

fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# Positive correlation
x1 = np.random.randn(n)
y1 = 0.8 * x1 + 0.3 * np.random.randn(n)
r1 = np.corrcoef(x1, y1)[0, 1]
axes[0].scatter(x1, y1, alpha=0.5, color='#2196F3', s=30, edgecolors='white', linewidth=0.3)
axes[0].set_title(f'Positive Correlation\nr = {r1:.2f}', fontsize=13, fontweight='bold')
z = np.polyfit(x1, y1, 1)
axes[0].plot(np.sort(x1), np.polyval(z, np.sort(x1)), 'r-', linewidth=2)

# No correlation
x2 = np.random.randn(n)
y2 = np.random.randn(n)
r2 = np.corrcoef(x2, y2)[0, 1]
axes[1].scatter(x2, y2, alpha=0.5, color='#FF9800', s=30, edgecolors='white', linewidth=0.3)
axes[1].set_title(f'No Correlation\nr = {r2:.2f}', fontsize=13, fontweight='bold')

# Negative correlation
x3 = np.random.randn(n)
y3 = -0.9 * x3 + 0.2 * np.random.randn(n)
r3 = np.corrcoef(x3, y3)[0, 1]
axes[2].scatter(x3, y3, alpha=0.5, color='#4CAF50', s=30, edgecolors='white', linewidth=0.3)
axes[2].set_title(f'Negative Correlation\nr = {r3:.2f}', fontsize=13, fontweight='bold')
z3 = np.polyfit(x3, y3, 1)
axes[2].plot(np.sort(x3), np.polyval(z3, np.sort(x3)), 'r-', linewidth=2)

for ax in axes:
    ax.set_xlabel('x')
    ax.set_ylabel('y')

plt.suptitle('Types of Correlation', fontsize=15, fontweight='bold')
plt.tight_layout()
plt.show()

---

## 3. Probability Basics

Probability provides the mathematical framework for reasoning about uncertainty — which is central to making predictions with imperfect data.

### 3.1 Core Concepts

| Concept | Formula | Meaning |
|---------|---------|--------|
| Probability of event A | P(A) = favorable outcomes / total outcomes (when all outcomes are equally likely) | How likely A is to occur |
| Joint Probability | P(A and B) | Probability that both A and B occur |
| Conditional Probability | P(A \| B) = P(A and B) / P(B) | Probability of A given that B has occurred |
| Bayes' Theorem | P(A \| B) = P(B \| A) * P(A) / P(B) | Update beliefs with new evidence |
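
The formulas in the table can be checked by brute-force enumeration on a small sample space. The sketch below uses two fair dice, an example chosen here for illustration, with event A = "the sum is 8" and event B = "the first die is even". Exact fractions are used so the definition of conditional probability and Bayes' Theorem can be compared without rounding.

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 equally likely outcomes of two fair dice
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event = favorable outcomes / total outcomes."""
    favorable = sum(1 for outcome in outcomes if event(outcome))
    return Fraction(favorable, len(outcomes))

def is_sum_eight(outcome):      # event A
    return outcome[0] + outcome[1] == 8

def is_first_even(outcome):     # event B
    return outcome[0] % 2 == 0

p_a = prob(is_sum_eight)
p_b = prob(is_first_even)
p_a_and_b = prob(lambda o: is_sum_eight(o) and is_first_even(o))

# Conditional probabilities from the definition
p_a_given_b = p_a_and_b / p_b
p_b_given_a = p_a_and_b / p_a

# Bayes' Theorem should give the same P(A | B)
p_a_given_b_bayes = p_b_given_a * p_a / p_b

print(f"P(A) = {p_a}, P(B) = {p_b}, P(A and B) = {p_a_and_b}")
print(f"P(A | B) from the definition: {p_a_given_b}")
print(f"P(A | B) from Bayes' Theorem: {p_a_given_b_bayes}")
```

Both routes give P(A | B) = 1/6: of the 18 outcomes where the first die is even, exactly 3 sum to 8.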

In [None]:
# Example: Medical test scenario using Bayes' Theorem
#
# A disease affects 1% of the population.
# A test has:
#   - 99% sensitivity (true positive rate): P(Test+ | Disease) = 0.99
#   - 95% specificity (true negative rate): P(Test- | No Disease) = 0.95
#
# Question: If someone tests positive, what is the probability they actually have the disease?

P_disease = 0.01
P_no_disease = 1 - P_disease
P_pos_given_disease = 0.99     # sensitivity
P_pos_given_no_disease = 0.05  # false positive rate (1 - specificity)

# Total probability of testing positive
P_pos = P_pos_given_disease * P_disease + P_pos_given_no_disease * P_no_disease

# Bayes' Theorem
P_disease_given_pos = (P_pos_given_disease * P_disease) / P_pos

print("BAYES' THEOREM — MEDICAL TEST EXAMPLE")
print("=" * 50)
print(f"Prior probability of disease:   P(D)          = {P_disease:.2%}")
print(f"Sensitivity (true positive):    P(+|D)        = {P_pos_given_disease:.2%}")
print(f"False positive rate:            P(+|no D)     = {P_pos_given_no_disease:.2%}")
print(f"Probability of positive test:   P(+)          = {P_pos:.4f}")
print(f"\nPosterior probability:          P(D|+)        = {P_disease_given_pos:.2%}")
print(f"\nInterpretation: Even with a positive test result, there is only a {P_disease_given_pos:.1%}")
print("chance of actually having the disease. This counterintuitive result arises")
print("because the disease is rare — most positive tests are false positives.")

In [None]:
# Visualize Bayes' Theorem with varying prior probabilities
priors = np.linspace(0.001, 0.5, 200)
posteriors = (P_pos_given_disease * priors) / \
             (P_pos_given_disease * priors + P_pos_given_no_disease * (1 - priors))

fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(priors * 100, posteriors * 100, linewidth=2.5, color='#2196F3')
ax.axhline(y=50, color='gray', linestyle='--', alpha=0.5, label='50% threshold')
ax.axvline(x=1, color='red', linestyle='--', alpha=0.5, label='Our scenario (1% prior)')
ax.scatter([1], [P_disease_given_pos * 100], color='red', s=100, zorder=5)
ax.annotate(f'{P_disease_given_pos:.1%}', xy=(1, P_disease_given_pos * 100),
            xytext=(5, P_disease_given_pos * 100 + 10), fontsize=12,
            arrowprops=dict(arrowstyle='->', color='red'))
ax.set_xlabel('Prior Probability of Disease (%)', fontsize=13)
ax.set_ylabel('Posterior Probability Given Positive Test (%)', fontsize=13)
ax.set_title("Bayes' Theorem: How Prior Probability Affects the Posterior",
             fontsize=14, fontweight='bold')
ax.legend(fontsize=11)
ax.set_xlim(0, 50)
ax.set_ylim(0, 100)
plt.tight_layout()
plt.show()

---

## 4. Calculus Intuition: Gradients and Optimization

Calculus is used in ML primarily for **optimization** — finding the parameters that minimize a cost function. You do not need to compute derivatives by hand, but you need to understand the intuition.

### Key Idea: Gradient Descent

Most ML models learn by minimizing a **loss function** (also called cost function or error). **Gradient descent** is the algorithm that does this:

1. Start with random parameters.
2. Compute the gradient (slope) of the loss function at the current point.
3. Move the parameters in the direction that reduces the loss (opposite to the gradient).
4. Repeat until convergence.
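
The four steps above can be sketched as a small reusable function. This is a minimal illustration, not a production optimizer: the function name `gradient_descent`, the tolerance `tol`, and the iteration cap `max_iters` are assumptions introduced here, and the starting point is fixed rather than random so the run is reproducible. "Convergence" is interpreted as the gradient magnitude dropping below `tol`.

```python
def gradient_descent(grad, x_start, learning_rate=0.1, tol=1e-6, max_iters=1000):
    """Minimize a 1D function given its derivative `grad`.

    Stops when the gradient magnitude falls below `tol`
    or after `max_iters` updates, whichever comes first.
    """
    x = x_start                                # Step 1: initial parameter
    for n_iters in range(max_iters):
        g = grad(x)                            # Step 2: gradient at the current point
        if abs(g) < tol:                       # Step 4: convergence check
            return x, n_iters
        x = x - learning_rate * g              # Step 3: step against the gradient
    return x, max_iters


# Example loss (assumption): f(x) = (x + 1)^2, with derivative 2 * (x + 1)
x_min, n_steps = gradient_descent(lambda x: 2 * (x + 1), x_start=5.0)
print(f"Converged to x = {x_min:.6f} after {n_steps} updates")
```

The returned `x_min` should be very close to -1, the minimizer of the example loss.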

In [None]:
# Visualize gradient descent on a simple 1D function
# Loss function: f(x) = x^2 (minimum at x=0)

def f(x):
    return x ** 2

def df(x):
    """Derivative: f'(x) = 2x"""
    return 2 * x

# Gradient descent
learning_rate = 0.2
x_current = 4.0  # starting point
history = [x_current]

for i in range(15):
    gradient = df(x_current)
    x_current = x_current - learning_rate * gradient
    history.append(x_current)

# Plot
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Left: Function and gradient descent path
x_range = np.linspace(-5, 5, 200)
ax = axes[0]
ax.plot(x_range, f(x_range), 'b-', linewidth=2, label='f(x) = x²')
ax.plot(history, [f(x) for x in history], 'ro-', markersize=8, linewidth=1.5, label='Gradient descent path')
ax.annotate('Start', xy=(history[0], f(history[0])), xytext=(history[0] + 0.5, f(history[0]) + 2),
            fontsize=11, arrowprops=dict(arrowstyle='->'))
ax.annotate('Minimum', xy=(history[-1], f(history[-1])), xytext=(history[-1] + 1, f(history[-1]) + 3),
            fontsize=11, arrowprops=dict(arrowstyle='->'))
ax.set_xlabel('x (parameter)', fontsize=13)
ax.set_ylabel('f(x) (loss)', fontsize=13)
ax.set_title('Gradient Descent on f(x) = x²', fontsize=14, fontweight='bold')
ax.legend(fontsize=11)

# Right: Convergence plot
ax = axes[1]
ax.plot(range(len(history)), [f(x) for x in history], 'o-', color='#FF5722', linewidth=2, markersize=6)
ax.set_xlabel('Iteration', fontsize=13)
ax.set_ylabel('Loss f(x)', fontsize=13)
ax.set_title('Convergence of Gradient Descent', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()

print("\nGradient Descent Trace:")
print(f"{'Step':>4}  {'x':>10}  {'f(x)':>10}  {'gradient':>10}")
print("-" * 40)
for i in range(min(8, len(history))):
    print(f"{i:>4}  {history[i]:>10.4f}  {f(history[i]):>10.4f}  {df(history[i]):>10.4f}")
print(f"  ...")
print(f"{len(history)-1:>4}  {history[-1]:>10.6f}  {f(history[-1]):>10.6f}  {df(history[-1]):>10.6f}")

In [None]:
# Effect of learning rate on gradient descent
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
learning_rates = [0.05, 0.3, 0.95]
titles = ['Too Small (lr=0.05)\nSlow convergence',
          'Good (lr=0.3)\nSmooth convergence',
          'Too Large (lr=0.95)\nOvershoots and oscillates']

for idx, (lr, title) in enumerate(zip(learning_rates, titles)):
    ax = axes[idx]
    x_range = np.linspace(-5, 5, 200)
    ax.plot(x_range, f(x_range), 'b-', linewidth=2, alpha=0.5)

    x_curr = 4.0
    hist = [x_curr]
    for _ in range(20):
        x_curr = x_curr - lr * df(x_curr)
        hist.append(x_curr)
        if abs(x_curr) > 10:
            break

    hist_clipped = [x for x in hist if abs(x) <= 5.5]
    ax.plot(hist_clipped, [f(x) for x in hist_clipped], 'ro-', markersize=6, linewidth=1.5)
    ax.set_title(title, fontsize=12, fontweight='bold')
    ax.set_xlabel('x')
    ax.set_ylabel('f(x)')
    ax.set_xlim(-5, 5)
    ax.set_ylim(-1, 25)

plt.suptitle('Effect of Learning Rate on Gradient Descent', fontsize=15, fontweight='bold')
plt.tight_layout()
plt.show()

print("\nKey takeaways:")
print("  - Learning rate too small: converges, but very slowly.")
print("  - Learning rate just right: smooth and efficient convergence.")
print("  - Learning rate too large: overshoots and oscillates around the minimum;")
print("    for f(x) = x^2, any learning rate above 1 diverges.")

### 4.1 Partial Derivatives and Gradients in Higher Dimensions

In practice, loss functions have many parameters (not just one). The **gradient** is a vector of partial derivatives — one for each parameter. Gradient descent follows the direction of steepest descent in the parameter space.
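
One way to build confidence in a hand-derived gradient is to compare it with a numerical approximation. The sketch below does this for the example loss f(x, y) = x^2 + 2y^2, approximating each partial derivative with a central difference. The helper names `f_demo`, `analytic_gradient`, and `numerical_gradient` and the step size `eps` are choices made for this illustration.

```python
import numpy as np

def f_demo(params):
    """Example loss: f(x, y) = x^2 + 2*y^2."""
    x, y = params
    return x**2 + 2 * y**2

def analytic_gradient(params):
    """Partial derivatives: df/dx = 2x, df/dy = 4y."""
    x, y = params
    return np.array([2 * x, 4 * y])

def numerical_gradient(func, params, eps=1e-6):
    """Approximate each partial derivative with a central difference."""
    params = np.asarray(params, dtype=float)
    grad = np.zeros_like(params)
    for i in range(params.size):
        step = np.zeros_like(params)
        step[i] = eps  # perturb one parameter at a time
        grad[i] = (func(params + step) - func(params - step)) / (2 * eps)
    return grad

point = np.array([4.0, 3.0])
print("Analytic gradient: ", analytic_gradient(point))
print("Numerical gradient:", numerical_gradient(f_demo, point))
```

At the point (4, 3) both gradients should be approximately [8, 12], one partial derivative per parameter.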

In [None]:
# 2D gradient descent visualization
# Loss function: f(x, y) = x^2 + 2*y^2 (minimum at origin)

def f_2d(x, y):
    return x**2 + 2*y**2

def grad_2d(x, y):
    return np.array([2*x, 4*y])

# Gradient descent in 2D
lr = 0.15
pos = np.array([4.0, 3.0])
path = [pos.copy()]

for _ in range(30):
    g = grad_2d(pos[0], pos[1])
    pos = pos - lr * g
    path.append(pos.copy())

path = np.array(path)

# Contour plot
fig, ax = plt.subplots(figsize=(10, 8))
X_grid = np.linspace(-5, 5, 200)
Y_grid = np.linspace(-4, 4, 200)
X_mesh, Y_mesh = np.meshgrid(X_grid, Y_grid)
Z = f_2d(X_mesh, Y_mesh)

contours = ax.contour(X_mesh, Y_mesh, Z, levels=20, cmap='viridis', alpha=0.7)
ax.clabel(contours, inline=True, fontsize=8)
ax.contourf(X_mesh, Y_mesh, Z, levels=20, cmap='viridis', alpha=0.3)

ax.plot(path[:, 0], path[:, 1], 'ro-', markersize=5, linewidth=1.5, label='Gradient descent path')
ax.plot(path[0, 0], path[0, 1], 'rs', markersize=12, label='Start')
ax.plot(path[-1, 0], path[-1, 1], 'r*', markersize=15, label='End (near minimum)')

ax.set_xlabel('x', fontsize=13)
ax.set_ylabel('y', fontsize=13)
ax.set_title('2D Gradient Descent on f(x,y) = x² + 2y²', fontsize=14, fontweight='bold')
ax.legend(fontsize=11)
plt.tight_layout()
plt.show()

---

## 5. Exercises

### Exercise 1: Linear Algebra Practice

In [None]:
# Exercise 1: Complete the following tasks

# 1a. Create two vectors: v1 = [3, 4, 5] and v2 = [1, 0, -1]
#     Compute their dot product, norms, and the Euclidean distance between them.

# Your code here:


# 1b. Create a 3x3 matrix M = [[1, 2, 3], [0, 1, 4], [5, 6, 0]]
#     Compute its transpose, determinant, and inverse.

# Your code here:


# 1c. Verify that M @ M_inv equals the identity matrix.

# Your code here:


### Exercise 2: Statistics on Real Data

In [None]:
# Exercise 2: Use the Iris dataset to practice statistics

from sklearn.datasets import load_iris

iris = load_iris()
iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)

# TODO: Compute mean, median, std, and variance for EACH feature
# Hint: use iris_df.describe() or compute each individually with np functions

# TODO: Create a correlation matrix and visualize it as a heatmap

# TODO: Which two features are most correlated? Which are least correlated?

# TODO: Plot histograms for all 4 features in a 2x2 subplot


### Exercise 3: Bayes' Theorem Application

In [None]:
# Exercise 3: Spam Filter using Bayes' Theorem
#
# Given:
#   - 30% of all emails are spam:                    P(spam) = 0.30
#   - The word "free" appears in 80% of spam emails: P("free" | spam) = 0.80
#   - The word "free" appears in 10% of non-spam:    P("free" | not spam) = 0.10
#
# Question: If an email contains the word "free", what is the probability it is spam?
#
# TODO: Compute P(spam | "free") using Bayes' Theorem

# Your code here:


<details>
<summary><b>Click here for the solution</b></summary>

```python
P_spam = 0.30
P_not_spam = 0.70
P_free_given_spam = 0.80
P_free_given_not_spam = 0.10

P_free = P_free_given_spam * P_spam + P_free_given_not_spam * P_not_spam
P_spam_given_free = (P_free_given_spam * P_spam) / P_free

print(f"P(spam | 'free') = {P_spam_given_free:.2%}")
# Answer: approximately 77.42%
```

</details>

### Exercise 4: Implement Gradient Descent

In [None]:
# Exercise 4: Implement gradient descent for f(x) = (x - 3)^2 + 5
# The minimum should be at x = 3, f(3) = 5

# TODO: Define the function and its derivative
# TODO: Implement gradient descent starting from x = -2
# TODO: Use learning_rate = 0.1 and run for 50 iterations
# TODO: Plot the function and the gradient descent path
# TODO: Print the final x value and f(x) — they should be close to 3 and 5

# Your code here:


---

## 6. Summary and Further Reading

### What We Covered

- **Linear Algebra**: Vectors, matrices, dot products, matrix multiplication, transpose, inverse — the building blocks of data representation in ML.
- **Statistics**: Measures of central tendency and spread, distributions (Normal, Uniform, Binomial, Poisson), correlation.
- **Probability**: Conditional probability, Bayes' Theorem, and its real-world applications.
- **Calculus**: The concept of gradients and how gradient descent is used to minimize loss functions.

### Recommended Reading

- Gilbert Strang, *Introduction to Linear Algebra* (or his free MIT OpenCourseWare lectures)
- 3Blue1Brown, *Essence of Linear Algebra* video series (YouTube)
- Khan Academy — Statistics and Probability (free online)
- 3Blue1Brown, *Essence of Calculus* video series (YouTube)

### Next Module

In **Module 3: Data Preprocessing and Feature Engineering**, we will learn how to clean messy data, handle missing values, encode categorical variables, and scale features — the critical step between raw data and model training.

---