# 1. Calculus Essentials: Derivatives & Gradients

In Machine Learning, we optimize a model by minimizing a **Loss Function** $J(w)$.
To do this, we need to know which direction to move the weights.

- **Derivative ($f'(x)$):** Measures the slope of a function of a single variable.
- **Gradient ($\nabla f(x)$):** A vector of partial derivatives for multiple variables. It points in the direction of steepest **ascent** (uphill).

$$\nabla f(x) = \left( \frac{\partial f}{\partial x_1}, \dots, \frac{\partial f}{\partial x_d} \right)^T$$
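
To make the definition concrete, here is a minimal sketch that approximates each partial derivative with central finite differences and compares the result with a hand-derived gradient. The test function `f_example`, the helper `numerical_gradient`, and the step size `h` are illustrative assumptions, not part of the original material.

```python
import numpy as np


def f_example(x):
    # Hypothetical test function: f(x) = x_1^2 + 3 * x_1 * x_2
    return x[0] ** 2 + 3 * x[0] * x[1]


def grad_f_example(x):
    # Hand-derived gradient: (2 * x_1 + 3 * x_2, 3 * x_1)
    return np.array([2 * x[0] + 3 * x[1], 3 * x[0]])


def numerical_gradient(func, x, h=1e-6):
    # Central differences: perturb one coordinate at a time
    grad = np.zeros_like(x, dtype=float)
    for i in range(x.size):
        step = np.zeros_like(x, dtype=float)
        step[i] = h
        grad[i] = (func(x + step) - func(x - step)) / (2 * h)
    return grad


x0 = np.array([1.0, 2.0])
print(f"Analytic gradient:  {grad_f_example(x0)}")
print(f"Numerical gradient: {numerical_gradient(f_example, x0)}")
```

The two vectors should agree up to floating-point error, which is a common sanity check for hand-written gradients.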

**The Golden Rule:** To minimize loss, we move in the **opposite** direction of the gradient:
$$w_{new} = w_{old} - \eta \nabla J(w_{old})$$
where $\eta$ is the **Learning Rate**.


In [None]:
import numpy as np
import matplotlib.pyplot as plt


# Example: Simple Derivative of f(x) = x^2
def f(x):
    return x**2


def df(x):
    return 2 * x


# Check the slope at x = 3
x_val = 3
print(f"At x={x_val}, the function value is {f(x_val)}")
print(f"The slope (gradient) is {df(x_val)}")
print(f"To minimize, we should move towards: {x_val - 0.1 * df(x_val)}")
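
A single step only moves part of the way to the minimum. The sketch below repeats the update rule on the same $f(x) = x^2$; the step count of 50 is an arbitrary assumption, and the learning rate of 0.1 matches the step used above.

```python
def df(x):
    # Derivative of f(x) = x^2
    return 2 * x


x = 3.0    # same starting point as the example above
eta = 0.1  # learning rate
history = [x]

# Repeat the update x <- x - eta * f'(x)
for _ in range(50):
    x = x - eta * df(x)
    history.append(x)

print(f"After 1 step:   x = {history[1]:.4f}")
print(f"After 50 steps: x = {history[-1]:.6f}")
```

For this particular function each step multiplies $x$ by $(1 - 2\eta)$, so the iterates shrink towards the minimum at $x = 0$ when $0 < \eta < 1$ and diverge when $\eta > 1$.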

# 2. Optimization Algorithms

We use **Gradient Descent** to train models. There are three main variants:

1.  **Batch Gradient Descent:** Uses **all** $n$ samples to calculate the gradient. Precise but slow.
    $$g_k = \frac{1}{n} \sum_{i=1}^{n} \nabla l(f_w(x_i), y_i)$$
2.  **Stochastic Gradient Descent (SGD):** Uses **one** random sample. Fast but noisy.
    $$g_k = \nabla l(f_w(x_{i_k}), y_{i_k})$$
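3.  **Mini-batch GD:** Uses a small random batch $B_k$ (e.g., 32 samples). Less noisy than SGD and cheaper per step than Batch GD.
    $$g_k = \frac{1}{|B_k|} \sum_{i \in B_k} \nabla l(f_w(x_i), y_i)$$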


In [None]:
# Setup Data (House Prices)
X = np.array(
    [
        [1, 2],
        [2, 4],
    ]
)  # Features: [Size, Rooms]
y = np.array([5, 10])  # Target: Price

# Hyperparameters
w = np.array([0.0, 1.0])  # Initialization
learning_rate = 0.1
n_samples = len(y)

# BATCH Gradient Descent Step
# Gradient of the half mean squared error: J(w) = (1 / (2n)) * sum((X @ w - y) ** 2)
predictions = X @ w
error = predictions - y
gradient = (X.T @ error) / n_samples
w_batch = w - learning_rate * gradient

print(f"Weights after 1 Batch step: {w_batch}")

In [None]:
# STOCHASTIC Gradient Descent Step
# Pick 1 random sample
idx = np.random.randint(0, n_samples)
x_sample = X[idx]
y_sample = y[idx]

# Calculate Gradient for just this sample
pred_sample = x_sample @ w
err_sample = pred_sample - y_sample
grad_sample = x_sample * err_sample  # No division by n!

w_sgd = w - learning_rate * grad_sample

print(f"Weights after 1 SGD step (Sample {idx}): {w_sgd}")
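
The third variant, Mini-batch GD, sits between the two steps above. Here is a minimal sketch of one mini-batch step for the same linear model. The synthetic data, the seed, and `batch_size = 4` are assumptions chosen for illustration, since the two-sample dataset above is too small to split into batches.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

# Hypothetical synthetic data: 8 samples, 2 features, generated from known weights
X = rng.normal(size=(8, 2))
true_w = np.array([1.0, 2.0])
y = X @ true_w

w = np.array([0.0, 1.0])  # Initialization
learning_rate = 0.1
batch_size = 4  # assumed; any value between 1 and the number of samples works

# MINI-BATCH Gradient Descent Step
# Pick a batch of distinct random samples
batch_idx = rng.choice(len(y), size=batch_size, replace=False)
X_batch, y_batch = X[batch_idx], y[batch_idx]

# Average the gradient over the batch only
err_batch = X_batch @ w - y_batch
grad_batch = (X_batch.T @ err_batch) / batch_size

w_minibatch = w - learning_rate * grad_batch

print(f"Batch indices: {batch_idx}")
print(f"Weights after 1 mini-batch step: {w_minibatch}")
```

Setting `batch_size` to 1 recovers the SGD step, and setting it to the full dataset size recovers the batch step.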

# 3. Convexity and The Hessian

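A function is **Convex** if the straight line between any two points on its graph never falls below the graph. Informally, it is shaped like a bowl.

- **Why it matters:** Every local minimum of a convex function is a global minimum, so gradient descent cannot get stuck in a worse local minimum. If the function is strictly convex, the minimizer is also unique.
- **The Test:** A twice-differentiable function is convex if and only if its **Hessian Matrix** (the matrix of second partial derivatives) is **Positive Semi-Definite (PSD)** at every point $x$.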

$$H_f(x) \succeq 0$$

Intuitively, the curvature is non-negative in every direction: the function never bends downward.


In [None]:
# Example: Check if a symmetric matrix Q is Positive Semi-Definite
# Q is the (constant) Hessian of the quadratic f(x) = 0.5 * x.T @ Q @ x
Q = np.array(
    [
        [3.0, 1.0],
        [1.0, 2.0],
    ]
)

# Calculate Eigenvalues (eigvalsh is meant for symmetric matrices and returns real values)
eigvals = np.linalg.eigvalsh(Q)

print(f"Eigenvalues: {eigvals}")

if np.all(eigvals >= 0):
    print("The matrix is Positive Semi-Definite -> The function is CONVEX (Safe!)")
else:
    print(
        "The matrix is not Positive Semi-Definite -> The function is NON-CONVEX (Danger of Local Minima!)"
    )
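
The same test can be applied to a loss function. The sketch below assumes the half mean squared error $J(w) = \frac{1}{2n} \|Xw - y\|^2$, whose Hessian is $\frac{1}{n} X^T X$, and reuses the house-price features from Section 2. The tolerance `tol` is an assumption to absorb floating-point round-off.

```python
import numpy as np

X = np.array([[1.0, 2.0], [2.0, 4.0]])  # same features as the house-price example
n_samples = X.shape[0]

# Hessian of J(w) = (1 / (2n)) * ||Xw - y||^2; it does not depend on w or y
H = (X.T @ X) / n_samples

# eigvalsh handles symmetric matrices and returns real eigenvalues in ascending order
eigvals = np.linalg.eigvalsh(H)
tol = 1e-10  # assumed tolerance for round-off

print(f"Hessian:\n{H}")
print(f"Eigenvalues: {eigvals}")
print(f"PSD (convex): {bool(np.all(eigvals >= -tol))}")
print(f"PD (strictly convex): {bool(np.all(eigvals > tol))}")
```

Because the second feature column is exactly twice the first, one eigenvalue is zero: the loss is convex but not strictly convex, so it has a whole line of minimizers rather than a single one.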

# 4. Constrained Optimization

Sometimes parameters $w$ must stay within a set $C$.
We use **Projected Gradient Descent**:

1. Take a normal gradient step.
2. **Project** the result back onto the valid set $C$, i.e., replace it with the closest point in $C$. For box constraints this is just clipping.

$$w_{k+1} = \text{proj}_C(w_k - \eta \nabla f(w_k))$$
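
Clipping is the projection for box constraints; other sets need a different projection. As a minimal sketch, assume $C$ is the L2 ball of a given radius (a choice made here for illustration). The helper name `project_l2_ball` is hypothetical.

```python
import numpy as np


def project_l2_ball(w, radius=1.0):
    # Euclidean projection onto {w : ||w||_2 <= radius}
    norm = np.linalg.norm(w)
    if norm <= radius:
        return w  # already inside the set, nothing to do
    return w * (radius / norm)  # rescale onto the boundary


w_outside = np.array([3.0, 4.0])  # norm 5, outside the unit ball
w_ball = project_l2_ball(w_outside)

print(f"Original:  {w_outside}")
print(f"Projected: {w_ball}, norm = {np.linalg.norm(w_ball):.2f}")
```

Unlike clipping, this projection keeps the direction of the vector and only shrinks its length.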


In [None]:
# Example: Weights must stay between -1 and 1
w_temp = np.array([1.5, -2.0, 0.5])  # Calculated weights (some are out of bounds)

# Projection Step
w_projected = np.clip(w_temp, -1, 1)

print(f"Original: {w_temp}")
print(f"Projected: {w_projected}")
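
The cell above shows only the projection. The sketch below combines both steps in a loop, assuming the illustrative objective $f(w) = \frac{1}{2} \|w - t\|^2$ with a hypothetical `target` vector $t$ that lies partly outside the box $[-1, 1]^3$. The learning rate and step count are also assumptions.

```python
import numpy as np

target = np.array([1.5, -2.0, 0.5])  # hypothetical unconstrained minimizer
w = np.zeros(3)
eta = 0.1  # assumed learning rate

for _ in range(200):
    grad = w - target  # gradient of f(w) = 0.5 * ||w - target||^2
    w = np.clip(w - eta * grad, -1.0, 1.0)  # gradient step, then projection

print(f"Constrained solution: {w}")
```

The coordinates whose unconstrained optimum lies outside the box end up on its boundary, while the third coordinate converges to its unconstrained value of 0.5.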