In [1]:
# =========================== DIFFERENTIATION CHEAT SHEET (NO FRAMEWORKS) ===========================
# Notation:
# - Scalars:      x ∈ R
# - Vectors:      x ∈ R^N 
# - Covectors:    w ∈ (R^N)* (dual space)
# - Matrices:     X ∈ R^{N×M}
# - Tensors:      Θ ∈ R^{N1×N2×…}  (multi-dimensional array)
# - Flatten:      vec(Θ) ∈ R^P with P = ∏_k Nk   (reshaping does not change derivatives; only layout of input; row-major convention by default (stack rows left to right) but any consistent flattening works)
#
# Note on shape
# - R^d            = a length-d 1D array (shape (d,)); no row/column orientation is encoded (in NumPy/JAX/PyTorch, vectors and covectors are both stored as 1D arrays). 
# - R^{1×d}        = row vector  (shape (1, d)).
# - R^{d×1}        = column vector (shape (d, 1)).
# - R^{m×n}        = matrix (2D array, shape (m, n)); rows = m, cols = n.
# - Tensors        = multi-dim arrays Θ ∈ R^{N₁×…×N_k} with shape (N₁,…,N_k).
# - Flattening     = vec(Θ) reshapes Θ to 1D (shape (∏ₗ Nₗ,)); reshaping doesn’t change derivative values, only layout.
# - Gradients      = match the primal’s shape (e.g., ∇_X F has the same shape as X). “Column vector” is a math convention, not a stored shape.
# - Broadcasting   = elementwise ops allow size 1 to expand along an axis (libraries follow NumPy broadcasting rules).
#
# 1) GRADIENTS (SCALAR OUTPUT)
# --------------------------------------------------------------------------------
# 1.1) f : R → R
#   Derivative: df/dx ∈ R.
#
# 1.2) F : R^n → R   (scalar-valued function on a vector)
#   Math:    ∇_x F(x) = [∂F/∂x₁, …, ∂F/∂xₙ]ᵀ ∈ R^{n×1} (column vector)
#   Code:    ∇_x F(x) = [∂F/∂x₁, …, ∂F/∂xₙ]  ∈ R^n (row/column orientation isn’t encoded)
#  
# 1.3) F : R^{N₁×…×N_k} → R   (scalar-valued function on a tensor)
#   Math/Code: (∇_X F)[i₁,…,i_k] = ∂F/∂X[i₁,…,i_k], so ∇_X F ∈ R^{N₁×…×N_k} (componentwise definition on input tensor X[i₁,…,i_k])
#   Flattening equivalence: if vec(X) ∈ R^P with P = ∏ₗ Nₗ, then
#       vec(∇_X F) = ∇_{vec(X)} F.  (Same entries; only reshaped.)
#
#  In each case, the gradients have the same shape as the primal input x or X. If x or X is passed in as a row/column vector or reshaped tensor, the gradient matches that shape.
#
# Example (applies to 1.2 and 1.3):
#   F(X) = ½‖X‖² = ½ Σ_{all indices} X[idx]²
#   ⇒ ∇_X F = X  (same shape as X).  Flattening yields ∇_{vec(X)} F = vec(X).
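#
# Sketch (assumes NumPy; the helper name numerical_grad is illustrative, not from the notes above):
# a finite-difference check that the gradient of F(X) = ½‖X‖² is X itself, has X's shape,
# and that flattening X only reshapes the gradient.

```python
import numpy as np

def F(X):
    """F(X) = 0.5 * sum of squared entries (works for any shape)."""
    return 0.5 * np.sum(X ** 2)

def numerical_grad(f, X, eps=1e-6):
    """Central finite differences, one entry at a time; result has X's shape."""
    G = np.zeros_like(X)
    for idx in np.ndindex(X.shape):
        E = np.zeros_like(X)
        E[idx] = eps
        G[idx] = (f(X + E) - f(X - E)) / (2 * eps)
    return G

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 3, 4))

G = numerical_grad(F, X)                      # tensor-shaped gradient
g_flat = numerical_grad(F, X.reshape(-1))     # gradient w.r.t. vec(X)

grad_matches = np.allclose(G, X, atol=1e-6)                       # analytic gradient is X
flatten_matches = np.allclose(G.reshape(-1), g_flat, atol=1e-6)   # vec(grad) == grad of vec
```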
#
# --------------------------------------------------------------------------------
# 2) JACOBIANS (VECTOR OUTPUT)
# --------------------------------------------------------------------------------
# 2.1) G : R^n → R^m  (vector → vector)
# Math:
#   J_x G(x) ∈ R^{m×n}, with entries J_x G(x)[i, j] = ∂G_i/∂x_j.
#   Row i is the gradient of the i-th output:  J_x G(x)[i, :] = (∇_x G_i(x))^T ∈ R^{1×n}.
# Code:
#   J_x G(x) is a 2D array of shape (m, n), with J_x G(x)[i, j] = ∂G_i/∂x_j.
#   J_x G(x)[i, :] = ∇_x G_i(x) is a 1D array of shape (n,).
#
# 2.2) G : R^{N1×…×Nk} → R^m  (tensor → vector)
# Math/Code:
# J_X G(X) ∈ R^{m × N1 × … × Nk}, with entries J_X G(X)_{p, i1,...,ik} = ∂G_p(X) / ∂X_{i1,...,ik},    for p = 1..m.
# Equivalently: the Jacobian is a stack of gradient tensors
#   ∇_X G_p(X) ∈ R^{N1×…×Nk}   (one gradient tensor per output p) and J_G is {∇_X G_p(X)}_p stacked along a new leading dimension p.
# 
# Flattened equivalence (convenient for some routines like linear algebra):
# Let P = ∏_ℓ Nℓ and vec(X) ∈ R^P be a flattening of X (e.g., row-major). Let lin(i_1,...,i_k) ∈ {0,1,...,P-1} be the linear index of X[i_1,...,i_k] under this flattening, so vec(X)[lin(i_1,...,i_k)] = X[i_1,...,i_k].
# Then the flattened Jacobian is J_X G(X) ∈ R^{m×P}, where J_X G(X)[p, lin(i1,...,ik)] = ∂G_p(X) / ∂X_{i1,...,ik} for p = 1..m.
# Row p is the gradient of the pth output: J_X G(X)[p, :] = vec(∇_X G_p(X)) ∈ R^P
#
# Example (X ∈ R^{N1×…×Nk}, A ∈ R^{m×N1×…×Nk}, and G_p(X) = ⟨A_p, X⟩ ∈ R, the inner product over all tensor indices):
# G_p(X) = ⟨A_p, X⟩ = Σ_{i1,...,ik} A_{p;i1,...,ik} X_{i1,...,ik}, so ∂G_p/∂X_{i1,...,ik} = A_{p;i1,...,ik}  ⇒  ∇_X G_p(X) = A_p. Hence J_X G(X) = A ∈ R^{m×N1×…×Nk}, and in flattened form J_X G(X) ∈ R^{m×P} with J_X G(X)[p, :] = vec(A_p) ∈ R^P.
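#
# Sketch (assumes NumPy; sizes m, N1, N2 are arbitrary choices for illustration):
# build the Jacobian of G_p(X) = ⟨A_p, X⟩ entry by entry with finite differences,
# then compare the stacked (m, N1, N2) form and the flattened (m, P) form against A.

```python
import numpy as np

rng = np.random.default_rng(0)
m, N1, N2 = 3, 2, 4
A = rng.normal(size=(m, N1, N2))
X = rng.normal(size=(N1, N2))

def G(X):
    """G_p(X) = <A_p, X>, contracting over all indices of X."""
    return np.einsum("pij,ij->p", A, X)

# Finite-difference Jacobian: perturb one input entry at a time.
eps = 1e-6
J = np.zeros((m, N1, N2))
for idx in np.ndindex(X.shape):
    E = np.zeros_like(X)
    E[idx] = eps
    J[(slice(None),) + idx] = (G(X + E) - G(X - E)) / (2 * eps)

# Flattened form: row p is vec(grad of G_p), row-major over the input axes.
J_flat = J.reshape(m, -1)
```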
#
# 2.3) G : (R^{N1×…×Nk} × R^{M1×…×Mr}) → R^m  (multi-arg: inputs X and params θ; just applying linearity)
# Math/Code:
#   J_G(X, θ) = [ J_X G(X, θ), J_θ G(X, θ) ] (returns two block matrices), where
#   J_X G(X, θ) ∈ R^{m × N1 × … × Nk},   with entries J_X G(X, θ)[p, i1,...,ik] = ∂G_p/∂X_{i1,...,ik},
#   J_θ G(X, θ) ∈ R^{m × M1 × … × Mr},   with entries J_θ G(X, θ)[p, j1,...,jr] = ∂G_p/∂θ_{j1,...,jr}.
#
# Flattened equivalence:
#   J_G(X, θ) = [ J_X G(X, θ)   J_θ G(X, θ) ] ∈ R^{m × (Px + Pθ)} (concatenated), where
#   J_X G(X, θ) ∈ R^{m×Px},  Px = ∏ Nℓ,   rows = vec(∇_X G_i(X, θ)) ∈ R^Px
#   J_θ G(X, θ) ∈ R^{m×Pθ},  Pθ = ∏ Mj,   rows = vec(∇_θ G_i(X, θ)) ∈ R^Pθ
#
# Note that all flattening is done manually by reshaping the output tensors
#
# --------------------------------------------------------------------------------
# 3) Pushforwards (JVPs) and Pullbacks (VJPs)
# --------------------------------------------------------------------------------
# Setup:
#   F : (X × Θ) → Y, with x ∈ X  (dim X = N), θ ∈ Θ  (dim Θ = P), y = F(x, θ) ∈ Y (dim Y = M).
#   Let {x^i} (i=1..N), {θ^j} (j=1..P), {y^k} (k=1..M) be coordinates on X, Θ, Y respectively.
#   Write J_x F and J_θ F for the partial Jacobians w.r.t. x and θ: J_x F[k, i] = ∂F^k/∂x^i(x,θ),   J_θ F[k, j] = ∂F^k/∂θ^j(x,θ).
#
# Pushforward (differential) at (x, θ):
#   dF_(x,θ) : T_(x,θ) [X × Θ] = T_x X × T_θ Θ → T_y Y.
# Given tangents v_x ∈ T_x X and v_θ ∈ T_θ Θ, the pushed-forward tangent is (denoting the restriction of dF_(x,θ) to each tangent space T_x X and T_θ Θ as dF_x and dF_θ respectively):
#   dF_(x,θ)[v_x, v_θ] = dF_(x,θ)[v_x, 0] + dF_(x,θ)[0, v_θ] = dF_x[v_x] + dF_θ[v_θ]  ∈ T_y Y.
#   In coordinates (summing over repeated indices): [dF_(x,θ)[v_x, v_θ]]^k = (∂F^k/∂x^i)(x,θ) v_x^i  + (∂F^k/∂θ^j)(x,θ) v_θ^j =  [J_x F(x,θ) · v_x]^k +  [J_θ F(x,θ) · v_θ]^k
#   (v_x and v_θ are arrays with components v_x^i and v_θ^j respectively, so they are treated as column vectors in these matrix products by the matrix multiplication conventions above)
# Thus,  dF_(x,θ)[v_x, v_θ] = J_x F(x,θ) · v_x  +  J_θ F(x,θ) · v_θ, i.e., the sum of two "Jacobian–vector products" (JVPs).
# 
# Pullback at (x, θ):
#   F^*_(x,θ) : T*_y Y → T*_(x,θ) [X × Θ] = T*_x X × T*_θ Θ.
# Given a cotangent w ∈ T*_y Y, the pulled-back covector is defined on tangents v_x ∈ T_x X and v_θ ∈ T_θ Θ by:
#   F^*_(x,θ)(w)[v_x, v_θ] = w · [ dF_(x,θ)[v_x, v_θ] ] = w[dF_x[v_x] + dF_θ[v_θ]] = w_k (∂F^k/∂x^i)(x,θ) v_x^i +  w_k (∂F^k/∂θ^j)(x,θ) v_θ^j = [J_x F(x,θ)^T · w]_i v_x^i + [J_θ F(x,θ)^T · w]_j v_θ^j
#   (w is an array with components w_k, so it is treated as a column vector in these matrix products by the matrix multiplication conventions above)
# Thus, F^*_(x,θ)(w) = (x̄, θ̄) ∈ T*_x X × T*_θ Θ, with x̄_i = [J_x F(x,θ)^T · w]_i and θ̄_j = [J_θ F(x,θ)^T · w]_j, i.e., a tuple of "vector–Jacobian products" (VJPs).
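#
# Sketch (assumes NumPy; the toy map F and its hand-written Jacobians are made up for illustration):
# the pushforward as a sum of two JVPs, the pullback as a tuple of VJPs, and a check of the
# defining identity w(dF[v_x, v_θ]) = x̄(v_x) + θ̄(v_θ).

```python
import numpy as np

def F(x, th):
    """Toy map R^2 x R^2 -> R^3."""
    return np.array([th[0] * x[0] * x[1],
                     np.sin(x[0]) + th[1],
                     th[0] * th[1] * x[1]])

def jacobians(x, th):
    """Hand-derived partial Jacobians J_x F (3x2) and J_theta F (3x2)."""
    Jx = np.array([[th[0] * x[1], th[0] * x[0]],
                   [np.cos(x[0]), 0.0],
                   [0.0,          th[0] * th[1]]])
    Jth = np.array([[x[0] * x[1],  0.0],
                    [0.0,          1.0],
                    [th[1] * x[1], th[0] * x[1]]])
    return Jx, Jth

def jvp(x, th, v_x, v_th):
    Jx, Jth = jacobians(x, th)
    return Jx @ v_x + Jth @ v_th          # pushed-forward tangent in R^3

def vjp(x, th, w):
    Jx, Jth = jacobians(x, th)
    return Jx.T @ w, Jth.T @ w            # (x_bar, th_bar)

x, th = np.array([0.3, -1.2]), np.array([2.0, 0.5])
v_x, v_th = np.array([1.0, 0.5]), np.array([-0.3, 2.0])
w = np.array([0.7, -1.0, 0.2])

pushed = jvp(x, th, v_x, v_th)
x_bar, th_bar = vjp(x, th, w)

lhs = w @ pushed                           # w applied to the pushed-forward tangent
rhs = x_bar @ v_x + th_bar @ v_th          # pulled-back covector applied to (v_x, v_th)

# Directional finite difference as an independent check of the JVP.
eps = 1e-6
fd = (F(x + eps * v_x, th + eps * v_th) - F(x - eps * v_x, th - eps * v_th)) / (2 * eps)
```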
#
# ---------------------------------------------------------------------------------
# ADDING A SCALAR READOUT φ : Y → R AND GRADIENTS VIA VJP (BACKPROP)
# ------------------------------------------------------------------------------
# Setup:
#   • y = F(x, θ) ∈ Y, with F : X × Θ → Y.
#   • φ : Y → R is any smooth scalar readout on Y.
#   • Define the scalar objective on X×Θ:  Φ(x, θ) := (φ ∘ F)(x, θ).
#
# Output covector seed at y
# -------------------------
#   • Let w := (dφ)_y ∈ T*_y Y be the differential (row-like covector) of φ at y.
#     In coordinates {y^k}, this is w_k = ∂φ/∂y^k (evaluated at y).
#
# Chain rule (pullback through F)
# -------------------------------
#   • Differential of Φ at (x, θ) is the pullback of w:
#       dΦ_(x,θ) = F^*_(x,θ)(w) ∈ T*_(x,θ)(X×Θ) = T*_x X × T*_θ Θ   (equivalently, dΦ_(x,θ) = dφ_y ∘ dF_(x,θ))
#     In blocks:
#       x̄ := J_x F(x,θ)^T · w   ∈ T*_x X,
#       θ̄ := J_θ F(x,θ)^T · w   ∈ T*_θ Θ,
#     so dΦ_(x,θ)[v_x, v_θ] = dΦ_x[v_x] + dΦ_θ[v_θ] = x̄(v_x) + θ̄(v_θ).
#
# Coordinate formulae (indices)
# -----------------------------
#   • Let J_x F[k,i] = ∂F^k/∂x^i and J_θ F[k,j] = ∂F^k/∂θ^j.
#     Then with w_k = ∂φ/∂y^k:
#       (∂Φ/∂x^i) = (J_x F)^T_{i k} w_k = ∑_k (∂F^k/∂x^i) (∂φ/∂y^k),
#       (∂Φ/∂θ^j) = (J_θ F)^T_{j k} w_k = ∑_k (∂F^k/∂θ^j) (∂φ/∂y^k).
#
# Vector gradients via metrics 
# ---------------------------------------
#   • If X, Θ, Y carry Riemannian metrics g_X, g_Θ, g_Y, the vector gradients are “sharps”:
#
#     ∇_x Φ = (d_x Φ)^# -> (∇_x Φ)^i = g_X^{i k} (d_x Φ)_k  = g_X^{i k} x̄_k -> ∇_x Φ = g_X^{-1} x̄,
#     ∇_θ Φ = (d_θ Φ)^# -> (∇_θ Φ)^j = g_Θ^{j l} (d_θ Φ)_l  = g_Θ^{j l} θ̄_l  -> ∇_θ Φ = g_Θ^{-1} θ̄
#
#   • In Euclidean spaces (identity metrics): 
#       ∇_x Φ = J_x F^T w,      ∇_θ Φ = J_θ F^T w.
#
#   Thus, to get gradients of the scalar objective Φ w.r.t. inputs/parameters, pull back the output covector w = dφ_y through F (VJP) to obtain (x̄, θ̄), then raise indices via the metric(s) to convert covectors to vectors.
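#
# Sketch (assumes NumPy and Euclidean metrics; F(x, W) = tanh(W x) and the readout
# φ(y) = ½‖y - t‖² are illustrative choices, not taken from the notes above):
# gradients of Φ = φ ∘ F obtained by pulling the seed w = dφ_y back through F,
# without ever forming a Jacobian, then checked against finite differences.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(3, 2))      # plays the role of theta
x = rng.normal(size=2)
t = rng.normal(size=3)           # fixed target used by the readout

def F(x, W):
    return np.tanh(W @ x)

def phi(y):
    return 0.5 * np.sum((y - t) ** 2)

def Phi(x, W):
    return phi(F(x, W))

# Seed covector at y, then pullback through F.
y = F(x, W)
w = y - t                        # w_k = d(phi)/d(y^k)
delta = w * (1.0 - y ** 2)       # pull w back through the elementwise tanh
x_bar = W.T @ delta              # J_x F^T w, same shape as x
W_bar = np.outer(delta, x)       # J_W F^T w, same shape as W

def fd_grad(f, A, eps=1e-6):
    """Central finite differences; result has A's shape."""
    G = np.zeros_like(A)
    for idx in np.ndindex(A.shape):
        E = np.zeros_like(A)
        E[idx] = eps
        G[idx] = (f(A + E) - f(A - E)) / (2 * eps)
    return G

x_fd = fd_grad(lambda a: Phi(a, W), x)
W_fd = fd_grad(lambda A: Phi(x, A), W)
```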
# --------------------------------------------------------------------------------
# 4) JVPs and VJPs for general multivariate vector functions
# --------------------------------------------------------------------------------
# Setup: y = F(X, θ) ∈ R^M, with inputs X and (optional) parameters θ (can be tensors).
# Goal
# - JVP (forward sensitivity):    given a tangent in input/param space, compute J_(X, θ) F(X, θ) · u (i.e., the pushforward dF_{X,θ}(u) = directional derivative of F at (X,θ) along u)
# - VJP (reverse sensitivity):    given a cotangent in output space, compute J_(X, θ) F(X, θ)^T · w. (i.e., the pullback F^*(w)_{X,θ} = w ∘ dF_{X,θ})
# These are the core primitives behind forward-mode and reverse-mode autodiff.
#
# ---------------------------------------
# 4.1) JVP — Jacobian–vector product
# ---------------------------------------
# Input: tangents (u_X, u_θ) matching the shapes of (X, θ).
# Output: JVP = J_X F(X,θ) · u_X  +  J_θ F(X,θ) · u_θ   ∈ R^M (the pushforward of (u_X, u_θ) through F)
#
# Cases:
# • F: R^N → R^M
#     - x, u_x ∈ R^N  → J_x F(x) ∈ R^{M×N} →  JVP = J_x F(x) · u_x ∈ R^M.
# • F: R^{N1×…×Nk} → R^M
#     - X, u_X ∈ R^{N1×…×Nk} → J_X F(X) ∈ R^{M×N1×…×Nk} → JVP = J_X F(X) · u_X ∈ R^M.
# • F: (X, θ) → R^M
#     - X, u_X ∈ R^{N1×…×Nk}; θ, u_θ ∈ R^{M1×…×Mr} →  J_X F(X, θ) ∈ R^{M×N1×…×Nk}, J_θ F(X, θ) ∈ R^{M×M1×…×Mr} →  JVP = J_X F(X,θ) · u_X  +  J_θ F(X,θ) · u_θ  ∈ R^M 
#     - If you only care about input sensitivity, set u_θ = 0 (and vice versa). 
#
# Intuition:
# - Push a small input/parameter change forward through F to get the corresponding first-order change in the output.
#
# ---------------------------------------
# 4.2) VJP — Vector–Jacobian product (backprop)
# ---------------------------------------
# Input: output-space cotangent w ∈ R^M (same shape as y = F(X, θ)).
# Output: a tuple of pullbacks (x̄, θ̄):
#     x̄   = J_X F(X,θ)^T · w       (same shape as X)
#     θ̄   = J_θ F(X,θ)^T · w       (same shape(s) as θ; one tensor per θ block)
#
# Cases:
# • F: R^N → R^M
#     - x ∈ R^N  → J_x F(x) ∈ R^{M×N} →  VJP = J_x F(x)^T · w ∈ R^N.
# • F: R^{N1×…×Nk} → R^M
#     - X ∈ R^{N1×…×Nk} → J_X F(X) ∈ R^{M×N1×…×Nk} → VJP = J_X F(X)^T · w ∈ R^{N1×…×Nk}.
# • F: (X, θ) → R^M
#     - X ∈ R^{N1×…×Nk}; θ ∈ R^{M1×…×Mr} →  J_X F(X, θ) ∈ R^{M×N1×…×Nk}, J_θ F(X, θ) ∈ R^{M×M1×…×Mr} →  VJP = [J_X F(X,θ)^T · w  ∈ R^{N1×…×Nk}, J_θ F(X,θ)^T · w ∈ R^{M1×…×Mr}] 
#
# Intuition:
# - Pull a covector w at the output back through F to obtain gradients with respect to inputs/parameters.
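#
# Sketch (assumes NumPy; J is a random array standing in for J_X F(X) at some point X):
# for tensor inputs, "J · u" contracts over all input axes and "J^T · w" contracts over the
# output axis. Both agree with the flattened matrix forms.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N1, N2 = 4, 2, 3
J = rng.normal(size=(M, N1, N2))   # stand-in for J_X F(X), shape (M, N1, N2)
u = rng.normal(size=(N1, N2))      # tangent, same shape as X
w = rng.normal(size=M)             # cotangent, same shape as y

jvp = np.tensordot(J, u, axes=2)   # contract both input axes  -> shape (M,)
vjp = np.tensordot(w, J, axes=1)   # contract the output axis  -> shape (N1, N2)

# Flattened equivalents: ordinary matrix-vector products on (M, P).
J2 = J.reshape(M, -1)
jvp_flat = J2 @ u.reshape(-1)
vjp_flat = (J2.T @ w).reshape(N1, N2)

# Duality between the two products: <w, J u> == <J^T w, u>.
pairing_out = w @ jvp
pairing_in = np.sum(vjp * u)
```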

In [None]:
# WHAT IS A NEURAL NETWORK?
# -----------------------------------------------------------------------------
# A neural network is a parametric function f_θ that maps inputs to outputs:
#   f_θ : R^{d_in} → R^{d_out},  x ↦ ŷ = f_θ(x) (equivalently f: (x; θ) ∈ R^{d_in} × R^P  ↦ ŷ = f(x; θ))
# - x ∈ R^{d_in}: input vector of features (numeric descriptors).
# - ŷ ∈ R^{d_out}: output vector (predictions).
# - θ ∈ R^P: all learnable parameters of the model. "Learnable" means θ is adjusted during training to minimize a scalar loss L(θ) via optimization (e.g., gradient descent).
#
# Note: 
# • When we write x ∈ R^{d_in}, ŷ ∈ R^{d_out},  θ ∈ R^P, we mean that the flattened (vectorized) collections vec(x), vec(ŷ), vec(θ) have lengths d_in, d_out, and P respectively.
#   In practice, x, ŷ, θ can be a scalar, vector, matrix, higher-order tensor, or a nested collection thereof.
#
# Three regression examples for x (features) → ŷ (targets):
#   1) Housing prices:                  x = [square_footage, num_bedrooms, year_built],         ŷ = [price] (scalar)
#   2) Physics (oscillator):            x = [t]  (time), or x = [A, β, ω, δ],                   ŷ = [x(t)] or [x(t) for many t]
#   3) Physics (experiment):            x = [temperature, pressure, field_strength, …],         ŷ = [measured_signal(t) for t = t_0, t_1, …, t_{T-1}]
#
# Shapes:
#   - Single feature:                                            x ∈ R^{d_in},                  ŷ ∈ R^{d_out}
#   - Batch of B features: {x_i} for i = 0,...,B-1:              X ∈ R^{B×d_in},                Ŷ ∈ R^{B×d_out}   (each row is one example)
#
# LAYERS (WEIGHTS, BIASES, ACTIVATIONS)
# -----------------------------------------------------------------------------
# A NN is built using a composition of L "layers" labeled ℓ = 0,...,L-1: 
#
# θ := (θ_0, θ_1, …, θ_{L-1}) # whole-model parameters (one block per layer)
# f(; θ) = f_{L-1}(; θ_{L-1})∘ f_{L-2}(; θ_{L-2}) ∘ … ∘ f_0(; θ_0) # whole network as composition of layer functions
#
# Forward (single example):
#   h_0 := x ∈ R^{d_0}                              # input
#   for ℓ = 0,…,L-1:
#       h_{ℓ+1} = f_ℓ(h_ℓ ; θ_ℓ) ∈ R^{d_{ℓ+1}}      # layer-ℓ transform with its own parameters θ_ℓ
#   ŷ = h_L ∈ R^{d_L} (d_L = d_out)                 # output
#
# (Batch, row-major): X ∈ R^{B×d_0}, H_ℓ ∈ R^{B×d_ℓ}, H_{ℓ+1} = f_ℓ(H_ℓ ; θ_ℓ) ∈ R^{B×d_{ℓ+1}}, Ŷ = H_L ∈ R^{B×d_L}
#
# Width of layer ℓ = d_ℓ (number of units/channels in that layer)
# Depth of NN = L (number of layers)
#
# WEIGHTS, BIASES, ACTIVATIONS (the building blocks of each layer)
# -------------------------------------------------------------------------------
# DENSE (FULLY CONNECTED) LAYER = AFFINE MAP + (OPTIONAL) NONLINEARITY 
# We define:
#   - feature:               h_0  := input to first layer (i.e. x)
#   - activation:            h_ℓ  := input to layer ℓ (for ℓ ≥ 1, the output of layer ℓ-1); h_{ℓ+1} is the output of layer ℓ
#   - logit:                 z_ℓ  := pre-activation output of layer ℓ (the affine map with parameters θ_ℓ := (W_ℓ, b_ℓ) applied to h_ℓ)
#   - weight:                W_ℓ  := layer ℓ weight (matrix)
#   - bias:                  b_ℓ  := layer ℓ bias (translation vector)
#   - activation function:   σ    := nonlinear function (acts elementwise and preserves shape σ : R^{...×d_{ℓ+1}} → R^{...×d_{ℓ+1}})
#   - layer ℓ function:    f_ℓ(; θ_ℓ)  := σ ∘ (W_ℓ, b_ℓ)  (affine map followed by nonlinearity)
#
# Notation & shapes:
# - In code, features h_0 ∈ R^{d_0}, activations h_ℓ ∈ R^{d_ℓ}, biases b_ℓ ∈ R^{d_{ℓ+1}}, and logits z_ℓ ∈ R^{d_{ℓ+1}} (1D arrays).
# - Batches of B features H_0 ∈ R^{B×d_0}, activations H_ℓ ∈ R^{B×d_ℓ}, and logits Z_ℓ ∈ R^{B×d_{ℓ+1}} (2D arrays, rows are individual samples); biases b_ℓ ∈ R^{d_{ℓ+1}} are still 1D and broadcast across rows.
# - Weights W_ℓ are 2D arrays/matrices, shape depends on framework convention (see below).
#
# ---------------------------------- PyTorch Convention ----------------------------------
# Storage (matches torch.nn.Linear):
#   W_ℓ ∈ R^{d_{ℓ+1} × d_ℓ}      # (out, in)
#
# Single feature/activation as 1D input:
#   input h_ℓ ∈ R^{d_ℓ}
#   logit z_ℓ = W_ℓ @ h_ℓ + b_ℓ             # z_ℓ ∈ R^{d_{ℓ+1}}
#   output h_{ℓ+1} = σ(z_ℓ)                  # h_{ℓ+1} ∈ R^{d_{ℓ+1}}
#
# Single feature/activation as a ROW (keep 2D):
#   input h_ℓ_row ∈ R^{1 × d_ℓ}
#   logit z_ℓ_row = h_ℓ_row @ W_ℓ^T + b_ℓ   # z_ℓ_row ∈ R^{1 × d_{ℓ+1}}
#   output h_{ℓ+1,row} = σ(z_ℓ_row)          # h_{ℓ+1,row} ∈ R^{1 × d_{ℓ+1}}
#
# Batch of B features/activations (rows are samples):
#   input  H_ℓ ∈ R^{B × d_ℓ}
#   logits Z_ℓ = H_ℓ @ W_ℓ^T + b_ℓ           # Z_ℓ ∈ R^{B × d_{ℓ+1}}  (b_ℓ broadcasts on rows)
#   output H_{ℓ+1} = σ(Z_ℓ)                  # H_{ℓ+1} ∈ R^{B × d_{ℓ+1}}
#
# ------------------------------------ JAX Convention ------------------------------------
# Storage (transpose-free for row-major batches; core JAX has no layer class, but this is, e.g., the kernel layout of flax.linen.Dense):
#   W_ℓ ∈ R^{d_ℓ × d_{ℓ+1}}      # (in, out)
#
# Single feature/activation as 1D input:
#   input  h_ℓ ∈ R^{d_ℓ}
#   logit  z_ℓ = h_ℓ @ W_ℓ + b_ℓ             # z_ℓ ∈ R^{d_{ℓ+1}}
#   output h_{ℓ+1} = σ(z_ℓ)                  # h_{ℓ+1} ∈ R^{d_{ℓ+1}}
#
# Single feature/activation as a ROW (2D):
#   input  h_ℓ_row ∈ R^{1 × d_ℓ}
#   logit  z_ℓ_row = h_ℓ_row @ W_ℓ + b_ℓ     # z_ℓ_row ∈ R^{1 × d_{ℓ+1}}
#   output h_{ℓ+1,row} = σ(z_ℓ_row)          # h_{ℓ+1,row} ∈ R^{1 × d_{ℓ+1}}
#
# Batch of B features/activations (rows are samples):
#   input  H_ℓ ∈ R^{B × d_ℓ}
#   logits Z_ℓ = H_ℓ @ W_ℓ + b_ℓ             # Z_ℓ ∈ R^{B × d_{ℓ+1}}  (b_ℓ broadcasts on rows)
#   output H_{ℓ+1} = σ(Z_ℓ)                  # H_{ℓ+1} ∈ R^{B × d_{ℓ+1}}
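#
# Sketch (plain NumPy standing in for either framework; array names are illustrative):
# the (out, in) and (in, out) storage conventions describe the same affine map, and the
# batched form reproduces the single-example form row by row.

```python
import numpy as np

rng = np.random.default_rng(0)
B, d_in, d_out = 5, 3, 4
H = rng.normal(size=(B, d_in))              # batch, rows are samples
b = rng.normal(size=d_out)                  # 1D bias, broadcast across rows

W_out_in = rng.normal(size=(d_out, d_in))   # (out, in) storage
W_in_out = W_out_in.T                       # (in, out) storage of the same map

Z_out_in = H @ W_out_in.T + b               # (out, in) layout needs a transpose
Z_in_out = H @ W_in_out + b                 # (in, out) layout is transpose-free

z0 = W_out_in @ H[0] + b                    # single example as a 1D array
z0_row = H[:1] @ W_in_out + b               # single example kept as a (1, d_in) row
```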
#
# COMMON ACTIVATION FUNCTIONS
# -----------------------------------------------------------------------------
#     identity(z) = z
#       - No nonlinearity; used at output heads for regression tasks.
#
#     ReLU(z) = max(0, z)
#       - Elementwise on logits 
#       - Piecewise linear, zero for negatives, identity for positives.
#       - Pros: simple, fast, sparse activations; strong default.
#       - Cons: “dead ReLU” (units can get stuck at 0).
#
#     tanh(z) = (e^z - e^{-z}) / (e^z + e^{-z})
#       - Elementwise on logits
#       - Smooth, bounded in (-1, 1), zero-centered.
#       - Pros: good for smoothly varying signals.
#       - Cons: can saturate for large |z| → small gradients.
#
#     sigmoid(z) = 1 / (1 + e^{-z})
#       - Elementwise on logits
#       - Smooth, bounded in (0, 1).
#       - Use mainly at binary-classification output heads (probabilities).
#       - Cons: saturates at extremes; not ideal for hidden layers.
#
#     GELU(z) = z * Φ(z),  Φ = standard normal CDF
#       - Elementwise on logits
#       - Common approx: 0.5 * z * (1 + tanh(√(2/π) * (z + 0.044715 z^3)))
#       - Smooth “probabilistic gate”; small negatives softly down-weighted.
#       - Strong default in transformers/deep nets.
#
#     Swish(z) = z * sigmoid(z)
#       - Smooth, non-monotonic; similar in spirit to GELU.
#       - Can slightly outperform ReLU in some settings; a bit more expensive to compute.
#
#     Softmax(z)_i = exp(z_i) / Σ_j exp(z_j)   (Boltzmann distribution)
#       - Not elementwise (acts across a vector).
#       - Used at multi-class output heads to convert logits z to class probabilities (softmax(z) ∈ [0,1]^{d_out}, Σ_i softmax(z)_i = 1).
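#
# Sketch (assumes NumPy; function names are illustrative): direct implementations of the
# activations listed above. The max-subtraction inside softmax is an added numerical-stability
# step that is not part of the formula above; it leaves the result unchanged mathematically.

```python
import math
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gelu_exact(z):
    """z * Phi(z), with the normal CDF written via the error function."""
    erf = np.vectorize(math.erf)
    return 0.5 * z * (1.0 + erf(z / math.sqrt(2.0)))

def gelu_tanh(z):
    """Common tanh approximation of GELU."""
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z ** 3)))

def swish(z):
    return z * sigmoid(z)

def softmax(z, axis=-1):
    """Acts across `axis` (not elementwise); each slice sums to 1."""
    z = z - np.max(z, axis=axis, keepdims=True)   # shift for numerical stability
    e = np.exp(z)
    return e / np.sum(e, axis=axis, keepdims=True)

z_grid = np.linspace(-5.0, 5.0, 101)
gelu_gap = np.max(np.abs(gelu_tanh(z_grid) - gelu_exact(z_grid)))

logits = np.array([[1000.0, 1000.0],
                   [0.0, np.log(3.0)]])
probs = softmax(logits)
```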
# 
# INTUITION
# - Weights (W): combine/rotate/scale input features to form new features.
# - Biases  (b): shift each output unit’s activation threshold.
# - Activation σ: injects nonlinearity; without it, stacked layers collapse to one affine map.
#
# THREE REGRESSION EXAMPLES
# -----------------------------------------------------------------------------
# Notation: for hidden width h, input dim d_in, output dim d_out (= task-specific).
# A dense layer applies: h_out = σ(W @ h_in + b). Shapes shown batch-first where useful.
#
# 1) Housing prices
#    x = [square_footage, num_bedrooms, year_built] ∈ R^{d_in=3} , output ŷ ∈ R^{d_out=1} (scalar price)
#    Suggested MLP (multi-layer perceptron): 3 → h=64 → 1
#    Weights/Biases:
#      W1 ∈ R^{64×3},  b1 ∈ R^{64}
#      W2 ∈ R^{1×64},  b2 ∈ R^{1}
#    Activations:
#      hidden σ: ReLU or tanh (e.g., h1 = σ(W1 x + b1))
#    Output head:
#      identity (no σ): ŷ = W2 h1 + b2  ∈ R^{1}   # regression scalar 
#
# 2) Physics (oscillator)
#    Option A (time → position): x = [t] ∈ R^{1}, output ŷ = x(t) ∈ R^{1} (scalar position at time t)
#      MLP: 1 → h=32 → 1
#      Weights/Biases: W1 ∈ R^{32×1}, b1 ∈ R^{32}; W2 ∈ R^{1×32}, b2 ∈ R^{1}
#      Activations: hidden σ: tanh (smooth signals)
#      Output head: identity (no σ): ŷ = W2 h1 + b2     # regression scalar 
#
#    Option B (params → full waveform): x = [A, β, ω, δ] ∈ R^{4}, output ŷ ∈ R^{T} (vector of positions at T timepoints)
#      MLP: 4 → h=64 → T
#      Weights/Biases: W1 ∈ R^{64×4}, b1 ∈ R^{64}; W2 ∈ R^{T×64}, b2 ∈ R^{T}
#      Activations: hidden σ: tanh/ReLU
#      Output head: identity (no σ) ŷ = W2 h1 + b2  # vector regression over timepoints
#
# 3) Physics (experiment mapping env vars → measured signal(s))
#    x = [temperature, pressure, field_strength, …] ∈ R^{p}, output ŷ ∈ R^{m} (vector of measured_signal(t) at m = T timepoints)
#    Suggested MLP: p → h1=128 → h2=64 → m
#    Weights/Biases:
#      W1 ∈ R^{128×p},  b1 ∈ R^{128}
#      W2 ∈ R^{64×128}, b2 ∈ R^{64}
#      W3 ∈ R^{m×64},   b3 ∈ R^{m}
#    Activations:
#      hidden σ: GELU/ReLU (nonlinear instrument response)
#    Output head:
#      identity (no σ): ŷ = W3 h2 + b3  # multi-output regression (m channels)
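#
# Sketch (assumes NumPy, (out, in) weight storage, and random stand-in features; the He-style
# scaling of the initial weights is an illustrative choice): the 3 → 64 → 1 housing MLP from
# example 1, with a ReLU hidden layer and an identity output head, plus its parameter count.

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [3, 64, 1]                               # d_in -> hidden -> d_out

# One (W, b) pair per layer; W has shape (out, in), b has shape (out,).
params = [(rng.normal(size=(d_out, d_in)) * np.sqrt(2.0 / d_in), np.zeros(d_out))
          for d_in, d_out in zip(sizes[:-1], sizes[1:])]

def mlp(X, params):
    """Works for a single example (d_in,) or a batch (B, d_in)."""
    H = X
    for i, (W, b) in enumerate(params):
        Z = H @ W.T + b                          # affine map; b broadcasts over rows
        is_last = (i == len(params) - 1)
        H = Z if is_last else np.maximum(0.0, Z) # identity head, ReLU elsewhere
    return H

X = rng.normal(size=(8, 3))                      # batch of 8 stand-in feature rows
Y_hat = mlp(X, params)                           # shape (8, 1)
y_hat_single = mlp(X[0], params)                 # shape (1,)

n_params = sum(W.size + b.size for W, b in params)   # 64*3 + 64 + 1*64 + 1
```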
#
# NOTE: Not all NN layers are "dense (affine) + nonlinearity".
# - Examples:
#     • Convolutions: local, weight-shared linear ops + activation (not a full W @ x).
#     • Attention: content-based mixing (softmax(QK^T / √d) V) + projections/MLPs.
#     • Recurrent cells (LSTM/GRU): gated state updates; parameters used across time.
#     • Normalization layers (BatchNorm/LayerNorm): per-feature affine using data stats.
#     • Pooling / Residual / Embedding / Graph message passing: not plain dense maps.
#
# Parameters θ are not necessarily just (W, b). They can include:
#   - kernels/filters, projection matrices, normalization scales/shifts (γ, β),
#     gating parameters, positional encodings, etc.
#
# Shape conventions still follow the same *input/output mapping* idea:
#   - For a dense layer (vector input): x ∈ R^{B×d_in}  →  z ∈ R^{B×d_out}
#   - For structured layers, the last dimension is typically "features/channels":
#       * Conv2D: X ∈ R^{B×H×W×C_in}  →  Y ∈ R^{B×H'×W'×C_out}   (NHWC shown)
#       * Self-attention: X ∈ R^{B×T×d_model} → Y ∈ R^{B×T×d_model}
#       * LayerNorm over features: preserves shape; acts along the feature axis.
#
# Takeaway:
#   - Dense = fully connected affine (W @ x + b) + optional σ.
#   - Many layers use different/structured linear ops (or none), but you still track
#     (batch, spatial/temporal dims?, features) → (same or new features) consistently.
#
# LOSS FUNCTIONS (what we optimize during training in *supervised* learning (ground truths are known))
# -------------------------------------------------------------------------------------------------
# PURPOSE
# - A loss L(θ) is a scalar that measures how well the model f_θ fits data.
# - Given a batch {(x_i, y_i)}_{i=1..B} (x_i ∈ R^{d_in} = input feature vector and y_i ∈ R^{d_out} = ground-truth target vector), we minimize the per-batch empirical risk:
#     L(θ) = (1/B) * Σ_i  ℓ( f_θ(x_i), y_i ) (ℓ = per-example loss)
#
# SHAPES (batch-first)
# - Batch size: B
# - Inputs:   X ∈ R^{B×d_in}
# - Targets:  Y ∈ R^{B×d_out}  or  Y ∈ {0,…,K-1}^B (classification)
# - Outputs:  Ŷ = f_θ(X) ∈ R^{B×d_out} or logits Z ∈ R^{B×K}
#
# REGRESSION LOSSES
# -----------------------------------------------------------------------------
#  - Ŷ, Y ∈ R^{B×d_out}, with d_out = 1 (scalar regression) or d_out = T (vector regression)
#  - Mean Squared Error (MSE):     ℓ(ŷ_i, y_i) =  (1/d_out) Σ_j (Ŷ[i,j] - Y[i,j])²   (mean over output dims j)
#  - Mean Absolute Error (MAE):    ℓ(ŷ_i, y_i) =  1/d_out Σ_j |Ŷ[i,j] - Y[i,j]|
#  - Huber (smooth L1):            ℓ(ŷ_i, y_i) =  [1/d_out Σ_j huber_loss(Ŷ[i,j], Y[i,j])], where huber_loss(a, b) = 0.5 * (a - b)² if |a - b| < δ else δ * (|a - b| - 0.5 * δ)
#  - L(θ) = 1/B Σ_i ℓ(Ŷ[i], Y[i]) (mean over batch)
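#
# Sketch (assumes NumPy; the tiny Ŷ, Y arrays are made-up numbers for a hand-checkable example):
# per-example MSE, MAE, and Huber losses (mean over output dims), followed by the batch mean.

```python
import numpy as np

def mse(Y_hat, Y):
    return np.mean((Y_hat - Y) ** 2, axis=-1)          # one value per example

def mae(Y_hat, Y):
    return np.mean(np.abs(Y_hat - Y), axis=-1)

def huber(Y_hat, Y, delta=1.0):
    r = np.abs(Y_hat - Y)
    per_entry = np.where(r < delta, 0.5 * r ** 2, delta * (r - 0.5 * delta))
    return np.mean(per_entry, axis=-1)

def batch_loss(per_example):
    return np.mean(per_example)                        # mean over the batch

Y_hat = np.array([[1.0, 2.0],
                  [0.0, 4.0]])
Y = np.array([[1.5, 2.0],
              [3.0, 4.0]])

# Residuals are [[-0.5, 0], [-3, 0]], so by hand:
#   MSE per example   = [0.125, 4.5]
#   MAE per example   = [0.25, 1.5]
#   Huber per example = [0.0625, 1.25]   (delta = 1)
L_mse = batch_loss(mse(Y_hat, Y))
L_mae = batch_loss(mae(Y_hat, Y))
L_huber = batch_loss(huber(Y_hat, Y))
```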
#
# TRAINING LOOP (EPOCHS)
# -----------------------------------------------------------------------------
# Dataset: D = {(x_i, y_i)}_{i=1..N}                       # N total samples
# Stacked tensors: X ∈ R^{N×d_in},  Y ∈ R^{N×d_out}        # scalar regression ⇒ d_out = 1
# Batch size B (e.g., 32/64/128):
#   After shuffling each epoch, split D into M = ceil(N / B) batches.
#   One batch: {(x_i, y_i)}_{i=1..B}  →  X_b ∈ R^{B×d_in}, Y_b ∈ R^{B×d_out}
#   Note: the last batch may have < B samples unless you drop it.
#
# Iteration (a.k.a. Step, Update): ONE optimizer update using ONE batch (X_b, Y_b).
#   1) Forward:   Ŷ_b = f_θ(X_b)                  # predictions for this batch
#   2) Loss:      L   = loss_fn(Ŷ_b, Y_b)          # scalar batch loss
#   3) Gradients: g   =  ∇_θ L                     # autodiff / backprop (same shape as θ)
#   4) Update:    θ ← OPTIMIZER_STEP(θ, g)        # e.g., SGD / Adam / AdamW
#
# Epoch: one full pass over the dataset → exactly ONE iteration per batch.
#   iterations_per_epoch = M = ceil(N / B)
#
# Typical training loop:
#   for epoch in range(E):                      # E = num_epochs (hyperparameter to be chosen)
#       shuffle(D)                              # randomize order
#       split D into batches of size B.         # M = ceil(N / B) batches
#       for (X_b, Y_b) in batches:              # M iterations this epoch
#           do one iteration (forward → loss → grad → update) (updated theta passed to next iteration)
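#
# Sketch (assumes NumPy; a linear model with a hand-written MSE gradient stands in for f_θ and
# autodiff, and the values of N, B, E, lr are arbitrary): the epoch/batch loop above with plain
# SGD, counting iterations so that iterations = E * ceil(N / B).

```python
import math
import numpy as np

rng = np.random.default_rng(0)
N, d_in, B, E, lr = 10, 2, 4, 50, 0.05
X = rng.normal(size=(N, d_in))
w_true = np.array([[2.0], [-1.0]])
Y = X @ w_true                                   # noiseless targets, shape (N, 1)

theta = np.zeros((d_in, 1))

def loss_and_grad(theta, Xb, Yb):
    """MSE batch loss and its gradient w.r.t. theta (same shape as theta)."""
    R = Xb @ theta - Yb                          # residuals, shape (batch, 1)
    loss = np.mean(R ** 2)
    grad = 2.0 * (Xb.T @ R) / R.size
    return loss, grad

initial_loss, _ = loss_and_grad(theta, X, Y)

iterations = 0
for epoch in range(E):
    perm = rng.permutation(N)                    # shuffle each epoch
    for start in range(0, N, B):                 # last batch may have < B samples
        idx = perm[start:start + B]
        _, g = loss_and_grad(theta, X[idx], Y[idx])
        theta = theta - lr * g                   # one optimizer update per batch
        iterations += 1

final_loss, _ = loss_and_grad(theta, X, Y)
```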
#
# --------------------------------- Practical Notes ---------------------------------
# • Conceptually, the model is defined on a single example:
#       f_θ : R^{d_in} → R^{d_out},   x ↦ f_θ(x)
#
#   In code, we usually *write* f_θ(x) as if x were a single input, but in the training loop we feed in whole batches X_b ∈ R^{B×d_in}.
#   The framework computes Ŷ_b = f_θ(X_b) by applying the same operations independently across the leading batch dimension B (vectorization; biases are broadcast across rows).
#
# • Scalar activation functions (ReLU, tanh, sigmoid, etc.) are elementwise:
#       σ : R → R  (defined on scalars)
#   but the same implementation works on batched tensors:
#       z  ∈ R^{d_out}      → σ(z)  ∈ R^{d_out}
#       Z  ∈ R^{B×d_out}    → σ(Z) ∈ R^{B×d_out}
#   The framework just applies σ entrywise, including across the batch axis.
#
# • Conceptually, the loss starts as a per-example loss ℓ(ŷ, y) (e.g. MSE, cross-entropy). 
#
#   In code we *define* loss_fn(Ŷ_b, Y_b, [optional weights]) as a function on an arbitrary batch of predictions and targets (Ŷ_b, Y_b) by:
#       – computing ℓ per example in the batch, then
#       – aggregating them via a reduction (typically an unweighted or weighted
#         mean over the batch).
#
# • Because everything is written in terms of array/tensor operations, the SAME
#   f_θ and loss definitions work for any batch size B:
#       B = 1      → single example
#       B = N      → full-batch (one iteration per epoch)
#       1 < B < N  → (mini-)batches
#   Changing B only affects how we slice/form batches (X_b, Y_b, …), not the mathematical definitions of f_θ or the loss.
#
# - Using mean reductions in the loss (over outputs and over batch) keeps the loss scale independent of B; if you sum instead, adjust the learning rate.
# - Gradient accumulation: to simulate a larger effective batch size k * B, accumulate (sum) grads over k batches and then apply one iteration update using the average gradient. 
# - Evaluations (train/val metrics) are typically computed at epoch boundaries without updating parameters.
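#
# Sketch (assumes NumPy, a mean-reduced MSE loss on a linear model, and k equally sized
# micro-batches; all names are illustrative): averaging the gradients accumulated over k
# batches of size B reproduces the gradient of one batch of size k * B.

```python
import numpy as np

rng = np.random.default_rng(0)
k, B, d = 3, 4, 2
X = rng.normal(size=(k * B, d))
Y = rng.normal(size=(k * B, 1))
theta = rng.normal(size=(d, 1))

def grad_mse(theta, Xb, Yb):
    """Gradient of the batch-mean squared error for the linear model Xb @ theta."""
    return 2.0 * (Xb.T @ (Xb @ theta - Yb)) / len(Xb)

# Accumulate (sum) gradients over k micro-batches, then average.
acc = np.zeros_like(theta)
for j in range(k):
    sl = slice(j * B, (j + 1) * B)
    acc += grad_mse(theta, X[sl], Y[sl])
g_accumulated = acc / k

# Gradient on the full effective batch of size k * B.
g_big_batch = grad_mse(theta, X, Y)
```

# If the micro-batches have unequal sizes, a plain average no longer matches; weight each
# micro-batch gradient by its number of examples instead.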
#
# LOSS REDUCTIONS & WEIGHTING
# -----------------------------------------------------------------------------
# Default reductions
# - Per-example loss:  ℓ_i = mean over output dims (keeps scale independent of d_out)
# - Batch loss:        L_B = (1/B) * Σ_i ℓ_i(ŷ_i, y_i) 
# - Using MEANS (not sums) makes the loss insensitive to batch size B and output size d_out.
#
# Per-example weights (class imbalance, heteroskedastic noise, masks)
# - General weighted mean over the batch (weighs each example i by w_i ≥ 0):
#     L_B = [ Σ_i w_i * ℓ_i(ŷ_i, y_i) ] / [ Σ_i w_i ] 
#   • Unweighted mean: w_i = 1 for all i  → L_B = (1/B) Σ_i ℓ_i
#   • Variable batch size / masked rows: set w_i=0 for masked examples; normalization by Σ w_i stabilizes scale
#   • Inverse-variance weighting (Gaussian noise): w_i ∝ 1/σ_i^2
#
# Per-dimension / per-channel weights (multivariate targets)
# - Define the per-example loss with dimension weights α_j ≥ 0
#     ℓ_i = [ Σ_j α_j * loss_dim(ŷ_{ij}, y_{ij}) ] / [ Σ_j α_j ] (sum over output dims j)
#   Then reduce across the batch as above (optionally with w_i). Normalizing by Σ_j α_j keeps scale stable if some dims are masked (α_j=0).
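#
# Sketch (assumes NumPy; the per-example losses are made-up numbers): the weighted batch mean,
# showing that w_i = 1 recovers the plain mean and that w_i = 0 masks an example out while
# the normalization by Σ w_i keeps the scale stable.

```python
import numpy as np

def weighted_batch_loss(per_example, w):
    """L_B = sum_i w_i * l_i / sum_i w_i, with w_i >= 0 and at least one w_i > 0."""
    w = np.asarray(w, dtype=float)
    return np.sum(w * per_example) / np.sum(w)

per_example = np.array([1.0, 3.0, 100.0])

unweighted = weighted_batch_loss(per_example, [1, 1, 1])   # plain mean
masked = weighted_batch_loss(per_example, [1, 1, 0])       # third example masked out
```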
#
# L2 penalty (a.k.a. "L2 regularization") vs Weight Decay
# - L2 penalty adds (λ/2)||θ||² directly to the loss:
#     L_total = L_data + λ/2 Σ_k ||θ_k||²
#   This changes the gradient to:  ∇_θ L_total = ∇_θ L_data + λ θ
# - Weight decay shrinks parameters multiplicatively during the update step:
#     θ ← (1 - ηλ) * θ - η * g , where g = ∇_θ L_data
# - IMPORTANT EQUIVALENCE:
#   • For PLAIN SGD, "L2 penalty in the loss" ≡ "weight decay" (they produce the same update).
#   • For ADAPTIVE OPTIMIZERS (Adam/RMSProp), they are NOT equivalent. Prefer decoupled weight decay (AdamW) instead of adding λ/2||θ||² to the loss.
# - Practical note: Typically DO NOT decay biases and normalization parameters (e.g., BatchNorm/LayerNorm scale & bias).
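#
# Sketch (assumes NumPy; θ, g, η, λ are made-up numbers, and adam_first_step is an illustrative
# helper for the very first bias-corrected Adam update, where m̂ = g and v̂ = g ⊙ g): under plain
# SGD the L2 penalty and weight decay give the same update, while under an adaptive step they differ.

```python
import numpy as np

theta = np.array([1.0, -2.0])
g = np.array([0.5, 0.1])          # gradient of the data loss only
lr, lam = 0.1, 0.1

# Plain SGD: L2 penalty in the loss vs multiplicative weight decay.
sgd_l2 = theta - lr * (g + lam * theta)
sgd_decay = (1.0 - lr * lam) * theta - lr * g

def adam_first_step(theta, grad, lr, eps=1e-8):
    """First Adam update from zero moments: step = grad / (|grad| + eps)."""
    return theta - lr * grad / (np.sqrt(grad * grad) + eps)

# Adaptive step: the penalty gradient gets rescaled when it is folded into grad,
# but not when the decay is applied separately (decoupled).
adam_l2 = adam_first_step(theta, g + lam * theta, lr)
adam_decoupled = adam_first_step(theta, g, lr) - lr * lam * theta
```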
#
# -----------------------------------------------------------------------------
# COMMON OPTIMIZERS
# -----------------------------------------------------------------------------
# STOCHASTIC GRADIENT DESCENT (SGD)
# - Update:  θ ← θ - η * g,  where g = ∇_θ L(θ; X_b, Y_b)
# - Learning rate η: too large → divergent/oscillatory; too small → slow. (Typical 1e-4 to 1e-1.)
# - Pros: simple, memory-light; strong generalization for large-scale vision when tuned.
# - Cons: sensitive to scale/conditioning; may be slow in narrow valleys.
#
# MOMENTUM (SGD + momentum)
# - Velocity: v ← μ v + g
# - Update:   θ ← θ - η * v
# - μ ∈ [0,1): smooths stochastic noise; accelerates along persistent directions. Nesterov evaluates the grad at a look-ahead point and often converges faster.
#
# ADAM (Adaptive Moment Estimation)
# - Definitions:
#     g_t = ∇_θ L_t(θ; X_b, Y_b)             # gradient at step/iteration t = 1, 2, ...
#     m_0 = 0, v_0 = 0                       # initialize first/second moments to zero
# - Exponential moving averages (per-parameter):
#     m_t = β1 * m_{t-1} + (1-β1) * g_t     # EMA of gradients (momentum-like); more weight on recent grads
#     v_t = β2 * v_{t-1} + (1-β2) * (g_t ⊙ g_t)  # EMA of squared grads (RMSProp-like)
#   Unrolled (explicit) form for m_t:
#     m_t = (1-β1) * Σ_{s=1..t} β1^{t-s} * g_s
# - Bias correction (removes init bias toward zero at early t):
#     m̂_t = m_t / (1 - β1^t),    v̂_t = v_t / (1 - β2^t)
# - Elementwise adaptive step:
#     step_t = m̂_t / (sqrt(v̂_t) + ε)
#     θ ← θ - η * step_t
# - Intuition:
#   • m̂_t is a smoothed direction (reduces gradient noise).
#   • v̂_t rescales by recent magnitude (dims with large variance get smaller steps).
#   • ε (~1e-8) avoids division by zero; sets a small floor on the denominator.
# - Typical defaults: β1=0.9, β2=0.999, ε=1e-8. AMSGrad keeps a running maximum of the second-moment estimate, v_max,t = max(v_max,t-1, v_t), and uses it in the denominator for extra stability.
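#
# Sketch (assumes NumPy; adam_init/adam_step and the dict-based state are illustrative names,
# and the objective ½‖θ‖² with gradient θ is a toy choice): one Adam update written directly
# from the formulas above, plus a check of the unrolled form of m_t.

```python
import numpy as np

def adam_init(theta):
    return {"m": np.zeros_like(theta), "v": np.zeros_like(theta), "t": 0}

def adam_step(theta, g, state, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    t = state["t"] + 1
    m = b1 * state["m"] + (1 - b1) * g           # EMA of gradients
    v = b2 * state["v"] + (1 - b2) * g * g       # EMA of squared gradients
    m_hat = m / (1 - b1 ** t)                    # bias corrections
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, {"m": m, "v": v, "t": t}

# Toy objective 0.5 * ||theta||^2, whose gradient is theta itself.
theta0 = np.array([1.0, -3.0])

# First step: m_hat = g and v_hat = g*g, so the step is about lr * sign(g).
theta1, _ = adam_step(theta0, theta0, adam_init(theta0), lr=0.1)

# Run a few steps and compare m_t with its unrolled form.
grads = []
theta, state = theta0, adam_init(theta0)
for _ in range(5):
    g = theta.copy()
    grads.append(g)
    theta, state = adam_step(theta, g, state, lr=0.1)

T = len(grads)
m_unrolled = (1 - 0.9) * sum(0.9 ** (T - s) * g_s for s, g_s in enumerate(grads, start=1))
```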
#
# ADAMW (Adam with decoupled weight decay)
# - Same Adam moment updates (m_t, v_t, bias corrections).
# - Decoupled decay: apply weight decay SEPARATELY from the gradient step:
#     θ ← θ - η * [ m̂_t / (sqrt(v̂_t) + ε) ]   # Adam step
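#     θ ← θ - η * λ * θ                       # THEN decay (equivalently θ ← (1 - ηλ) θ after the Adam step). Some implementations (e.g., PyTorch's AdamW) apply the decay before the Adam step; the two orderings differ only at O(η²λ).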
# - Why prefer AdamW over Adam+L2-in-loss:
#   • For adaptive methods, decoupled decay preserves the intended regularization behavior and avoids interactions with the per-parameter scaling.
# - Good defaults: η=1e-3, weight_decay≈1e-4 (no decay on biases/normalization). Consider LR warmup for the first 100–1000 steps.
#
# PRACTICAL EXTRAS
# - Gradient clipping: clip global norm (e.g., 1–5) to avoid rare exploding updates.
# - LR schedules: warmup → cosine decay / step decay / ReduceLROnPlateau.
# - Batch size ↔ LR: larger batches often allow slightly larger η; tune together.
# - Init: He/Kaiming for ReLU/GELU; Xavier/Glorot for tanh/sigmoid.
# - Regularization: dropout, data augmentation, label smoothing, early stopping on a validation split.
# - Diagnostics:
#   • If loss plateaus early → warmup + slightly higher η, or lower β2 (e.g., 0.99) for faster adaptation.
#   • If training is noisy/unstable → lower η, increase B, add momentum/EMA of weights, or clip grads.
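#
# Global-norm gradient clipping can be sketched as follows. The function name
# `clip_by_global_norm` and the small stabilizing constant are assumptions made
# for this example; frameworks ship their own utilities for this.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Scale all gradient arrays by one common factor so the global L2 norm <= max_norm."""
    global_norm = np.sqrt(sum(float(np.sum(g * g)) for g in grads))
    scale = min(1.0, max_norm / (global_norm + 1e-12))
    return [g * scale for g in grads], global_norm

# Two parameter tensors whose joint norm is sqrt(3^2 + 4^2) = 5.
grads = [np.array([[3.0, 0.0], [0.0, 0.0]]), np.array([4.0])]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(float(np.sum(g * g)) for g in clipped))
print(norm_before, norm_after)
```

# One shared scale factor preserves the direction of the full gradient; clipping
# each tensor separately would change it.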

In [None]:
# NOTE: This needs to be fleshed out more 
# ============================ How Gradients Are Actually Computed ============================ 
#
# Context / Spaces
# -------------------------------------------------------------------------------------------------
# • Inputs:  x ∈ R^{d_in}
# • Params:  θ = {θ_1,…,θ_L}  (collection of tensors; each grad has SAME SHAPE as its tensor)
# • Outputs: ŷ = F(x; θ) ∈ R^{d_out}
# • Per-example scalar loss: ψ : R^{d_out} × R^{d_out} → R,  ψ(ŷ, y)  (y is constant/inert)
# • Scalar objective on (x, θ):
#       Φ(x, θ) := ψ(F(x; θ), y)          # this is what we differentiate in practice
#
# Model as a composition of layers
# -------------------------------------------------------------------------------------------------
# z_0 = x
# for ℓ = 1..L:
#     z_ℓ = f_ℓ(z_{ℓ-1}; θ_ℓ)            # arbitrary differentiable op (dense/conv/attn/norm/etc.)
# ŷ = z_L
#
# ======================================= REVERSE-MODE (BACKPROP / VJP) =========================================
# Goal: compute ∂Φ/∂θ (and optionally ∂Φ/∂x) in one backward sweep whose cost is a small constant
# multiple of the forward pass, regardless of the number of parameters.
#
# 1) Forward pass (cache primals/intermediates each layer needs for its VJP):
#    cache_ℓ = (z_{ℓ-1}, z_ℓ, θ_ℓ, intermediates)
#
# 2) Seed at output (loss gradient wrt ŷ):
#    λ := ∂ψ(ŷ, y)/∂ŷ = ∂Φ/∂ŷ ∈ R^{d_out}      # “backprop signal” injected at the model output
#
# 3) Reverse sweep (layer-local VJPs, i.e. chain rule through each f_ℓ):
#    zbar_L := λ                                 # zbar_ℓ ≡ ∂Φ/∂z_ℓ
#    for ℓ = L..1:
#        # layer VJP evaluated at cache_ℓ
#        # returns cotangents to inputs & params:
#        zbar_{ℓ-1}, thetabar_ℓ = VJP_fℓ(cache_ℓ, zbar_ℓ)
#        # thetabar_ℓ ≡ ∂Φ/∂θ_ℓ, same shape as θ_ℓ
#        grad[θ_ℓ] += thetabar_ℓ
#
# 4) Results (gradients of the scalar objective Φ(x, θ) := ψ(F(x; θ), y)):
#    ∂Φ/∂θ = {grad[θ_1], …, grad[θ_L]}
#    (optional) ∂Φ/∂x = zbar_0
#
# Notes:
# • No explicit Jacobians are formed; each primitive supplies a VJP rule.
# • This is what PyTorch `loss.backward()` and JAX `grad` implement for scalar objectives Φ.
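#
# The four steps above can be sketched in NumPy with hand-written layer VJPs.
# Everything here (the affine/tanh layers, the squared-error ψ, the function
# names) is an illustrative assumption, not how any framework is implemented.

```python
import numpy as np

def affine_fwd(z, theta):
    W, b = theta
    return W @ z + b, (z, W)                        # output, cache

def affine_vjp(cache, zbar):
    z, W = cache
    return W.T @ zbar, (np.outer(zbar, z), zbar)    # zbar_in, (Wbar, bbar)

def tanh_fwd(z, theta):
    out = np.tanh(z)
    return out, out                                 # cache the output

def tanh_vjp(cache, zbar):
    return zbar * (1.0 - cache ** 2), ()            # no parameters

def loss_and_grads(x, y, layers, thetas):
    # 1) Forward pass, caching what each VJP needs.
    z, caches = x, []
    for (fwd, _), theta in zip(layers, thetas):
        z, cache = fwd(z, theta)
        caches.append(cache)
    loss = 0.5 * np.sum((z - y) ** 2)
    # 2) Seed: d psi / d yhat for the squared-error loss.
    zbar = z - y
    # 3) Reverse sweep through layer-local VJPs.
    grads = [None] * len(layers)
    for idx in reversed(range(len(layers))):
        zbar, grads[idx] = layers[idx][1](caches[idx], zbar)
    # 4) grads holds dPhi/dtheta per layer; zbar is now dPhi/dx.
    return loss, grads, zbar

rng = np.random.default_rng(0)
layers = [(affine_fwd, affine_vjp), (tanh_fwd, tanh_vjp), (affine_fwd, affine_vjp)]
thetas = [(rng.normal(size=(4, 3)), rng.normal(size=4)), (),
          (rng.normal(size=(2, 4)), rng.normal(size=2))]
x, y = rng.normal(size=3), rng.normal(size=2)
loss, grads, xbar = loss_and_grads(x, y, layers, thetas)
print(grads[0][0].shape, xbar.shape)
```

# Each gradient has the shape of its parameter tensor, and no Jacobian matrix is
# ever materialized.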
#
# ========================================= FORWARD-MODE (JVP) VIEW =============================================
# Use when you want directional derivatives or input sensitivities along a few tangent directions
# (cost scales with the number of directions, not with the output dim), or to build
# Hessian-vector products via JVP-of-VJP. It propagates tangents forward.
#
# Directional-derivative setup for the scalar objective Φ(x,θ) := ψ(F(x;θ), y):
# • Pick tangents u_x (same shape as x) and u_θ = {u_{θ_ℓ}} (same shapes as θ_ℓ).
# • We will compute dΦ(x,θ)[u_x, u_θ], the directional derivative of Φ along (u_x, u_θ).
#
# 1) Initialize tangents at input:
#    v_0 := u_x                                # v_ℓ ≡ tangent for z_ℓ
#
# 2) Layerwise primal + tangent propagation (JVPs):
#    for ℓ = 1..L:
#        # Primal forward:
#        z_ℓ = f_ℓ(z_{ℓ-1}; θ_ℓ)
#        # Tangent update via local linearization:
#        v_ℓ = J_z f_ℓ(z_{ℓ-1}; θ_ℓ) @ v_{ℓ-1}  +  J_θ f_ℓ(z_{ℓ-1}; θ_ℓ) @ u_{θ_ℓ}
#
# 3) Read out scalar directional derivative at the loss:
#    λ := ∂ψ(ŷ, y)/∂ŷ = ∂Φ/∂ŷ               # same seed as reverse-mode
#    dΦ(x,θ)[u_x, u_θ] = ⟨λ, v_L⟩             # inner product over output dims
#
# Consequences:
# • To recover the FULL gradient ∂Φ/∂θ from forward-mode alone, you’d need one JVP per
#   parameter direction (expensive). So forward-mode is great for *directional* queries,
#   but reverse-mode is better for full gradients of scalar losses.
#
# ============================== DENSE-LAYER SPECIALIZATION (CLASSIC “DELTA” FORM) ===============================
# Forward (primal):
#   a_ℓ = W_ℓ z_{ℓ-1} + b_ℓ,  z_ℓ = σ_ℓ(a_ℓ)
#
# Forward (tangent):
#   ȧ_ℓ = W_ℓ v_{ℓ-1} + (u_{W_ℓ} z_{ℓ-1}) + u_{b_ℓ}
#   v_ℓ  = σ_ℓ′(a_ℓ) ⊙ ȧ_ℓ
#
# Reverse (backprop, i.e. gradients of Φ), with δ_ℓ ≡ ∂Φ/∂a_ℓ (gradient wrt the pre-activation):
#   δ_L = ∂Φ/∂z_L ⊙ σ_L′(a_L)               # elementwise σ_L (for softmax+CE with logits a_L and one-hot y: δ_L = softmax(a_L) − y)
#   δ_ℓ = (W_{ℓ+1}^T δ_{ℓ+1}) ⊙ σ_ℓ′(a_ℓ)
#   ∂Φ/∂W_ℓ = δ_ℓ (z_{ℓ-1})^T
#   ∂Φ/∂b_ℓ = δ_ℓ
#   ∂Φ/∂x   = W_1^T δ_1
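#
# A NumPy sketch of the delta recursion, assuming tanh activations on both
# layers and Φ = 0.5 * ||z_L − y||² (both are illustrative choices). It also runs
# the forward tangent recursion and checks that ⟨λ, v_L⟩ equals the gradient
# contracted with the same directions.

```python
import numpy as np

rng = np.random.default_rng(1)
sizes = [3, 5, 2]                                   # d_in, hidden, d_out (illustrative)
Ws = [rng.normal(size=(sizes[l + 1], sizes[l])) for l in range(2)]
bs = [rng.normal(size=sizes[l + 1]) for l in range(2)]
x, y = rng.normal(size=3), rng.normal(size=2)

def dtanh(a):
    return 1.0 - np.tanh(a) ** 2

# Forward (primal): a_l = W_l z_{l-1} + b_l, z_l = tanh(a_l)
zs, pre = [x], []
for W, b in zip(Ws, bs):
    pre.append(W @ zs[-1] + b)
    zs.append(np.tanh(pre[-1]))

# Reverse: dPhi/dz_L = z_L - y, then the delta recursion.
delta = (zs[-1] - y) * dtanh(pre[-1])               # delta_L
gW, gb = [None, None], [None, None]
for l in (1, 0):
    gW[l] = np.outer(delta, zs[l])                  # delta_l z_{l-1}^T
    gb[l] = delta
    if l > 0:
        delta = (Ws[l].T @ delta) * dtanh(pre[l - 1])
gx = Ws[0].T @ delta                                # dPhi/dx = W_1^T delta_1

# Forward tangent along random directions (u_x, u_W, u_b).
uW = [rng.normal(size=W.shape) for W in Ws]
ub = [rng.normal(size=b.shape) for b in bs]
ux = rng.normal(size=3)
v = ux
for l in range(2):
    a_dot = Ws[l] @ v + uW[l] @ zs[l] + ub[l]
    v = dtanh(pre[l]) * a_dot
dPhi_forward = (zs[-1] - y) @ v                     # <lambda, v_L>
dPhi_reverse = gx @ ux + sum(np.sum(gW[l] * uW[l]) + gb[l] @ ub[l] for l in range(2))
print(np.isclose(dPhi_forward, dPhi_reverse))
```

# One forward tangent pass yields a single directional derivative; the reverse
# pass yields every gradient entry at once.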
#
# ============================================== BATCH CASE ======================================================
# Batch {(x_i, y_i)}_{i=1..B}, optional per-example weights ω_i ≥ 0:
#   L_B(θ) = [Σ_i ω_i ψ(F(x_i;θ), y_i)] / [Σ_i ω_i]   # scalar objective over the batch
#
# Reverse-mode (what frameworks do for batched losses):
#   λ_i := ∂ψ(ŷ_i, y_i)/∂ŷ_i
#   ∂L_B/∂θ = Σ_i (ω_i / Σ_j ω_j) · [ J_θ F(x_i,θ)^T · λ_i ]     # sum of tensors matching θ’s shapes
#   (optional) ∂L_B/∂x_i = (ω_i / Σ_j ω_j) · [ J_x F(x_i,θ)^T · λ_i ]
#
# Forward-mode (directional batch derivative):
#   Given per-example tangents u_{x_i} and a single u_θ:
#     propagate (z_i, v_i) forward for each i to get v_{L,i},
#     dL_B[(u_{x_1},…,u_{x_B}), u_θ] = (1/Σ_j ω_j) · Σ_i ω_i ⟨ λ_i , v_{L,i} ⟩
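#
# A small check of the weighted batch gradient, assuming a linear model
# F(x; W) = W x with squared-error ψ (chosen only because J_θ F^T λ_i then
# reduces to the outer product λ_i x_i^T). The weights ω are made up.

```python
import numpy as np

rng = np.random.default_rng(2)
B, d_in, d_out = 4, 3, 2
X, Y = rng.normal(size=(B, d_in)), rng.normal(size=(B, d_out))
W = rng.normal(size=(d_out, d_in))
omega = np.array([1.0, 2.0, 0.0, 0.5])              # per-example weights

def batch_loss(W):
    per_example = 0.5 * np.sum((X @ W.T - Y) ** 2, axis=1)
    return np.sum(omega * per_example) / np.sum(omega)

Lam = X @ W.T - Y                                   # row i = lambda_i
coef = omega / omega.sum()                          # omega_i / sum_j omega_j

# Sum of per-example VJPs, written as a loop and as one matrix product.
grad_loop = sum(coef[i] * np.outer(Lam[i], X[i]) for i in range(B))
grad_vectorized = (coef[:, None] * Lam).T @ X
print(np.allclose(grad_loop, grad_vectorized))
```

# The example with ω_i = 0 contributes nothing to the gradient, as expected.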
#
# =========================================== PRACTICAL SUMMARY ==================================================
# • Training (scalar losses, many parameters): use REVERSE-MODE (backprop / VJP). That’s what
#   PyTorch `loss.backward()` and JAX `grad` do: they give ∇_θ Φ with one backward sweep.
# • Sensitivity analysis, Jacobian-vector or Hessian-vector products: use FORWARD-MODE JVP (and
#   combos like JVP-of-VJP). It gives fast directional derivatives without materializing Jacobians/Hessians.
# • Prefer MEANS (not sums) across batch/output dims so the loss and gradient scale stay stable as B and d_out change.

In [None]:
# ==============================================================================
# CLASSIFIERS: BINARY CROSS-ENTROPY → MULTI-CLASS (BOLTZMANN / SOFTMAX VIEW)
# ==============================================================================

# 0) GENERAL CLASSIFIER SETUP
# ----------------------------------------------------------------------
# • Input features (one example):       x ∈ R^{d_in}
# • Parameters:                         θ  (e.g., weights and biases of a linear or neural net)
# • Classifier as logit-producing map:  f_θ : R^{d_in} → R^{d_out}
#     – Output z = f_θ(x) are called logits: unnormalized scores (can be any real numbers).
#     – Probabilities are obtained by passing z through a suitable nonlinearity:
#         • Binary:    sigmoid/logistic
#         • Multi-class: softmax
#
# For a batch of B examples:
#   • Inputs:  X ∈ R^{B×d_in}
#   • Logits:  Z = f_θ(X) ∈ R^{B×d_out}   (row i = logits for example i)


# 1) BINARY CLASSIFICATION (2 CLASSES)
# ----------------------------------------------------------------------
# Example:  y ∈ {0,1}  (0 = negative, 1 = positive)
#
# MODEL (ONE-LOGIT FORMULATION)
#   • For each example, produce a single real-valued logit:
#         z = f_θ(x) ∈ R
#
#   • Interpret z as the log-odds of class 1 vs class 0:
#         log( p(y=1|x)/p(y=0|x) ) = z
#
#   • Solve for the probabilities:
#         p(y=1|x) = σ(z) = 1 / (1 + e^(−z)) ∈ (0,1)
#         p(y=0|x) = 1 − σ(z)
#
#   For a batch:
#         logits: z ∈ R^{B}           (or R^{B×1})
#         labels: y ∈ {0,1}^{B}
#
#
# 1.1) BINARY CROSS-ENTROPY LOSS
# ----------------------------------------------------------------------
# Cross-entropy measures how “far” the predicted Bernoulli distribution is
# from the true Bernoulli label.
#
# For a single example (logit z, label y ∈ {0,1}):
#
#   • Predicted probability of class 1: p = σ(z)
#
#   • Binary cross-entropy (in terms of p):
#         ℓ(z, y) = − [ y log p + (1 − y) log(1 − p) ]
#                  = − [ y log σ(z) + (1 − y) log(1 − σ(z)) ]
#
# Numerically stable form (in terms of logits directly):
#
#   • softplus(t) = log(1 + e^t)      (evaluate as max(t, 0) + log(1 + e^{−|t|}) so that e^t cannot overflow)
#
#   • Then:
#         ℓ(z, y) = softplus(z) − y z
#                  = log(1 + e^{z}) − y z
#
#   – These two expressions are mathematically equivalent:
#         softplus(z) − y z  =  − [ y log σ(z) + (1 − y) log(1 − σ(z)) ]
#
# Batch loss (B examples):
#         L(θ) = (1/B) Σ_{i=1}^B ℓ(z_i, y_i)
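#
# A NumPy sketch of the logit-form loss. The overflow-safe softplus used here,
# max(t, 0) + log1p(exp(−|t|)), is a standard rewrite; the function names are
# chosen for this example.

```python
import numpy as np

def softplus(t):
    """log(1 + e^t) without overflow: max(t, 0) + log1p(exp(-|t|))."""
    return np.maximum(t, 0.0) + np.log1p(np.exp(-np.abs(t)))

def bce_with_logits(z, y):
    """Mean binary cross-entropy from logits z (shape (B,)) and labels y in {0, 1}."""
    return np.mean(softplus(z) - y * z)

# Extreme logits would overflow or hit log(0) in the probability form.
z = np.array([-800.0, -2.0, 0.0, 3.0, 800.0])
y = np.array([0.0, 1.0, 1.0, 0.0, 1.0])
loss = bce_with_logits(z, y)
print(np.isfinite(loss))
```

# On moderate logits this agrees with −[y log σ(z) + (1 − y) log(1 − σ(z))].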
#
#
# 1.2) BOLTZMANN / ENERGY INTERPRETATION (BINARY CASE)
# ----------------------------------------------------------------------
# Think of the classifier as an energy-based model with two states y ∈ {0,1}.
#
#   • Assign an energy to each label:
#         E_θ(x, y)
#
#   • Boltzmann distribution at "temperature" T=1:
#         p(y | x) = exp(−E_θ(x,y)) / Σ_{y'∈{0,1}} exp(−E_θ(x,y'))
#
# For binary classification, define the logit as an energy difference:
#
#   • Let:
#         z(x) = log [ p(y=1|x) / p(y=0|x) ]
#
#   • From the Boltzmann form:
#         p(y=1|x) = exp(−E_1) / (exp(−E_0) + exp(−E_1))
#         p(y=0|x) = exp(−E_0) / (exp(−E_0) + exp(−E_1))
#
#     where E_0 = E_θ(x,0), E_1 = E_θ(x,1).
#
#   • Then:
#         log [ p(y=1|x) / p(y=0|x) ]
#       = log [ exp(−E_1) / exp(−E_0) ]
#       = −E_1 + E_0
#
#   • Identify the logit with this energy difference:
#         z(x) = E_0(x) − E_1(x)
#
#   • Substituting into the logistic formula:
#         p(y=1|x) = 1 / (1 + e^{−z(x)})
#
# So the logistic/sigmoid output can be viewed as a special case of a 2-state
# Boltzmann distribution where the logit encodes a relative energy difference.
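#
# A quick numerical check of the identity above, using hypothetical energies
# E_0 and E_1 (the values are arbitrary).

```python
import numpy as np

def boltzmann(energies):
    """p(y) proportional to exp(-E_y) at T = 1; shift by the min energy for stability."""
    w = np.exp(-(energies - energies.min()))
    return w / w.sum()

E = np.array([1.3, -0.4])               # hypothetical energies E_0, E_1
p = boltzmann(E)
z = E[0] - E[1]                         # logit as an energy difference
sigmoid_z = 1.0 / (1.0 + np.exp(-z))
print(np.isclose(p[1], sigmoid_z))
```

# Adding the same constant to both energies leaves p unchanged: only the
# difference E_0 − E_1 matters.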
#
#
# 2) MULTI-CLASS CLASSIFICATION (K ≥ 3 CLASSES)
# ----------------------------------------------------------------------
# Example:   y ∈ {0, 1, ..., K−1}   (exactly one class per example)
#
# TARGETS
#   • One example:
#         y_int ∈ {0,...,K−1}     # integer class index
#     or equivalently:
#         y_onehot ∈ {0,1}^K with Σ_c y_onehot[c] = 1.
#
# MODEL OUTPUT
#   • For each example, produce a K-dimensional logit vector:
#         z = f_θ(x) ∈ R^{K}
#
#   • Softmax converts logits to a probability distribution over classes:
#         p_c = exp(z_c) / Σ_j exp(z_j),    for c = 0,...,K−1
#
#   • Properties:
#         p_c ∈ (0,1), Σ_c p_c = 1.
#
#   For a batch:
#         logits: Z ∈ R^{B×K}   (row i = logits for example i)
#         labels: y_int ∈ {0,...,K−1}^{B}
#
#
# 2.1) CATEGORICAL CROSS-ENTROPY LOSS (MULTI-CLASS)
# ----------------------------------------------------------------------
# Cross-entropy between:
#   – predicted distribution p(·|x_i) given by softmax(Z_i,·)
#   – true distribution concentrated at y_int[i].
#
# For a single example (row Z_i ∈ R^{K}, label y_int[i]):
#
#   • With softmax written explicitly:
#         p_c = exp(Z_{i,c}) / Σ_j exp(Z_{i,j})
#
#   • Loss:
#         ℓ(Z_i, y_int[i]) = − log p_{y_int[i]}
#                           = − log softmax(Z_i)[ y_int[i] ]
#
# Numerically stable form:
#
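#   • logsumexp(Z_i) = log( Σ_j exp(Z_{i,j}) ) = m_i + log( Σ_j exp(Z_{i,j} − m_i) ),  m_i = max_j Z_{i,j}
#     (subtracting the row max before exponentiating is what prevents overflow)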
#
#   • Then:
#         ℓ(Z_i, y_int[i]) = logsumexp(Z_i) − Z_{i, y_int[i]}
#
# Batch loss:
#         L(θ) = (1/B) Σ_{i=1}^B ℓ(Z_i, y_int[i])
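#
# A NumPy sketch of the logsumexp form, assuming the usual max-shift trick for
# stability. Function names are chosen for this example.

```python
import numpy as np

def logsumexp(Z):
    """Row-wise log(sum_j exp(Z[i, j])) using the max-shift trick."""
    m = Z.max(axis=1, keepdims=True)
    return (m + np.log(np.sum(np.exp(Z - m), axis=1, keepdims=True)))[:, 0]

def cross_entropy(Z, y_int):
    """Mean categorical cross-entropy from logits Z (B, K) and integer labels (B,)."""
    picked = Z[np.arange(Z.shape[0]), y_int]        # Z[i, y_int[i]]
    return np.mean(logsumexp(Z) - picked)

# The second row would overflow exp() without the shift.
Z = np.array([[2.0, 1.0, 0.1], [1000.0, 1001.0, 999.0]])
y_int = np.array([0, 1])
loss = cross_entropy(Z, y_int)
print(np.isfinite(loss))
```

# Adding a constant to every logit in a row does not change the loss, which is
# why the shift is safe.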
#
#
# 2.2) BOLTZMANN / ENERGY INTERPRETATION (MULTI-CLASS, SOFTMAX)
# ----------------------------------------------------------------------
# Generalize the binary energy view to K classes y ∈ {0,...,K−1}.
#
#   • Assign an energy for each class:
#         E_θ(x, c)    for c = 0,...,K−1
#
#   • Boltzmann distribution:
#         p(c | x) = exp(−E_θ(x,c)) / Σ_j exp(−E_θ(x,j))
#
#   • If we define logits as negative energies:
#         z_c(x) = −E_θ(x,c)
#
#     then the Boltzmann distribution becomes:
#         p(c | x) = exp(z_c(x)) / Σ_j exp(z_j(x)) = softmax(z(x))_c
#
# So the standard softmax classifier is exactly a Boltzmann distribution over K
# discrete states, with logits playing the role of negative energies.
#
#
# 3) SHAPES / SUMMARY
# ----------------------------------------------------------------------
# BINARY (one-logit formulation):
#   • logits:   z ∈ R^{B}                 (or R^{B×1})
#   • labels:   y ∈ {0,1}^{B}
#   • probs:    σ(z) ∈ (0,1)^{B}
#   • loss:     L(θ) = (1/B) Σ_i [ softplus(z_i) − y_i z_i ]
#
# MULTI-CLASS (K classes):
#   • logits:   Z ∈ R^{B×K}
#   • labels:   y_int ∈ {0,...,K−1}^{B}
#   • probs:    P ∈ (0,1)^{B×K}, row-wise softmax (each row sums to 1)
#   • loss:     L(θ) = (1/B) Σ_i [ logsumexp(Z_i) − Z_{i,y_int[i]} ]
#
# KEY TAKEAWAYS
#   • Binary logistic regression: special case of a 2-state Boltzmann model with
#     logistic (sigmoid) link.
#   • Multi-class softmax: Boltzmann distribution over K states, with logits as
#     negative energies.
#   • Cross-entropy is the natural loss: it is the negative log-likelihood of the
#     correct class under these probability models.


In [None]:
# ----------------------------------------------------------------------------------
# 4) CONVOLUTIONAL NEURAL NETWORKS (CNNs)
# ----------------------------------------------------------------------------------
# Motivation:
#   • Many data types have an intrinsic grid structure:
#       – 1D: time series, audio waveforms
#       – 2D: images (height × width), spectrograms
#       – 3D: videos (time × height × width), volumetric data
#   • In such data, nearby inputs are strongly correlated, and the same local
#     pattern (edge, corner, texture) can appear anywhere in the grid.
#   • Fully-connected layers ignore this structure (they mix all input coordinates
#     with separate parameters), which is:
#       – parameter-inefficient,
#       – not translation-aware (a pattern learned at one location must be relearned at every other),
#       – prone to overfitting on large inputs (e.g., images).
#
# CNNs address this by:
#   • Using local receptive fields (each neuron “sees” a small patch).
#   • Sharing the same set of weights across spatial locations (convolution).
#   • Producing feature maps that are translation-equivariant.
#
#
# 4.1) BASIC OBJECTS IN A CNN
# ----------------------------------------------------------------------------------
# INPUT TENSOR (IMAGE EXAMPLE)
#   • A color image can be thought of as:
#         x ∈ R^{H×W×C_in}
#     where:
#         – H = height (pixels)
#         – W = width  (pixels)
#         – C_in = number of input channels (e.g., 3 for RGB)
#   • For a batch of B images:
#         X ∈ R^{B×H×W×C_in}    (or another convention like B×C_in×H×W)
#
# CONVOLUTIONAL FILTER / KERNEL
#   • A convolution layer has a set of learnable filters (kernels).
#   • Each filter is a small spatial window, e.g.:
#         K ∈ R^{k_h×k_w×C_in}
#     where k_h, k_w are the kernel’s height/width.
#   • The layer typically has C_out such filters:
#         – So all filters together form:
#               W_conv ∈ R^{C_out×k_h×k_w×C_in}
#         – Each filter produces one output channel (one feature map).
#
# CONVOLUTION OPERATION (CROSS-CORRELATION IN PRACTICE)
#   • At each spatial position (i, j), we:
#       – Take a local patch of the input x[i : i+k_h, j : j+k_w, :] ∈ R^{k_h×k_w×C_in}
#       – Compute a dot product with the filter weights K:
#             z[i, j, c_out] = Σ_{u,v,c_in} K_{c_out}[u,v,c_in] * x[i+u, j+v, c_in] + b_{c_out}
#         where b_{c_out} is a bias for the output channel c_out.
#   • Sliding the filter across all valid (i, j) positions yields a feature map
#     z[:,:,c_out] ∈ R^{H'×W'}, with H' = H − k_h + 1 and W' = W − k_w + 1 (stride 1, no padding).
#   • Stacking C_out such maps gives:
#         z ∈ R^{H'×W'×C_out}
#
# HYPERPARAMETERS
#   • Kernel size: (k_h, k_w) – size of the local patch.
#   • Stride: how far the filter moves between positions (e.g., stride 1 vs stride 2).
#   • Padding:
#       – “valid”: no padding, output is smaller than input.
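#       – “same”: pad borders so the output has the same spatial size as the input at stride 1.
#   • Output size with stride s and padding p per side: H' = ⌊(H + 2p − k_h)/s⌋ + 1 (likewise for W').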
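#
# A naive NumPy sketch of the cross-correlation above in the H×W×C_in layout,
# with stride and symmetric zero padding added as assumptions. The function name
# `conv2d` is hypothetical and the loops favor clarity over speed.

```python
import numpy as np

def conv2d(x, W_conv, b, stride=1, pad=0):
    """Cross-correlation. x: (H, W, C_in); W_conv: (C_out, k_h, k_w, C_in); b: (C_out,)."""
    C_out, k_h, k_w, _ = W_conv.shape
    x = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))     # zero-pad H and W only
    H, W, _ = x.shape
    H_out = (H - k_h) // stride + 1
    W_out = (W - k_w) // stride + 1
    z = np.empty((H_out, W_out, C_out))
    for i in range(H_out):
        for j in range(W_out):
            r, c = i * stride, j * stride
            patch = x[r : r + k_h, c : c + k_w, :]
            # Dot product of the patch with every filter at once.
            z[i, j, :] = np.tensordot(W_conv, patch, axes=([1, 2, 3], [0, 1, 2])) + b
    return z

x = np.ones((5, 5, 1))
W_conv = np.ones((1, 3, 3, 1))          # one 3x3 box filter
b = np.zeros(1)

z_valid = conv2d(x, W_conv, b)                  # "valid": shape (3, 3, 1)
z_same = conv2d(x, W_conv, b, pad=1)            # "same" at stride 1: shape (5, 5, 1)
z_strided = conv2d(x, W_conv, b, stride=2)      # shape (2, 2, 1)
print(z_valid.shape, z_same.shape, z_strided.shape)
```

# With zero padding, border outputs see fewer real pixels: the corner of z_same
# sums only a 2×2 region of ones.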
#
#
# 4.2) KEY PROPERTIES OF CONVOLUTION LAYERS
# ----------------------------------------------------------------------------------
# LOCAL RECEPTIVE FIELDS
#   • Each output unit depends only on a local neighborhood in the input (determined
#     by kernel size and depth of the network).
#   • Deeper layers have larger “effective” receptive fields, combining information
#     from larger regions of the original input.
#
# WEIGHT SHARING
#   • The same filter (same set of weights) is applied at every spatial location.
#   • This enforces:
#       – Translation equivariance: if the input shifts, the feature map shifts
#         similarly (ignoring boundary effects).
#       – Parameter efficiency: number of parameters depends on kernel size and
#         channel counts, not on the spatial size (H, W).
#
# FEATURE MAPS
#   • Each output channel (feature map) corresponds to one learned pattern (filter).
#   • Early layers often learn simple patterns (edges, color contrasts, textures).
#   • Deeper layers combine these into more abstract features (shapes, object parts).
#
#
# 4.3) NONLINEARITIES AND POOLING
# ----------------------------------------------------------------------------------
# NONLINEAR ACTIVATIONS
#   • After each convolution, an elementwise nonlinearity is typically applied:
#         h = σ(z)   (e.g., ReLU, GELU, etc.)
#   • Without it, stacked convolutions collapse into a single linear (affine) map; the nonlinearity
#     is what lets depth add expressive power.
#
# POOLING (OPTIONAL)
#   • Pooling reduces spatial resolution while preserving important information.
#   • Common types:
#       – Max pooling: take the maximum value over a small window (e.g., 2×2 region).
#       – Average pooling: take the mean over the window.
#   • Effects:
#       – Adds some translation invariance (small shifts don’t change pooled output much).
#       – Reduces H and W, thus reducing computation in later layers and the size of any flatten → dense head
#         (conv-layer parameter counts do not depend on H, W).
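#
# Non-overlapping pooling can be sketched with a reshape, assuming H and W are
# divisible by the window size. The function names are chosen for this example.

```python
import numpy as np

def max_pool2d(h, size=2):
    """Max over non-overlapping size x size windows. h: (H, W, C)."""
    H, W, C = h.shape
    blocks = h.reshape(H // size, size, W // size, size, C)
    return blocks.max(axis=(1, 3))

def avg_pool2d(h, size=2):
    """Mean over non-overlapping size x size windows. h: (H, W, C)."""
    H, W, C = h.shape
    blocks = h.reshape(H // size, size, W // size, size, C)
    return blocks.mean(axis=(1, 3))

h = np.arange(16, dtype=float).reshape(4, 4, 1)
print(max_pool2d(h)[:, :, 0])   # [[ 5.  7.] [13. 15.]]
print(avg_pool2d(h)[:, :, 0])   # [[ 2.5  4.5] [10.5 12.5]]
```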
#
#
# 4.4) CNN AS FEATURE EXTRACTOR + CLASSIFIER HEAD
# ----------------------------------------------------------------------------------
# VIEW AS TWO PARTS:
#
#   (1) Feature extractor (convolutional backbone):
#       • Sequence of blocks:
#             input → [conv → nonlinearity → (pool)] → ... → feature maps
#       • After L layers, we obtain:
#             F ∈ R^{H_L×W_L×C_L}
#         where H_L, W_L are reduced spatial dimensions, C_L is the number of channels.
#
#   (2) Classifier head:
#       • Convert feature maps into a vector:
#             – Flatten: F → f ∈ R^{H_L·W_L·C_L}, or
#             – Global pooling: e.g., average over spatial dimensions to get f ∈ R^{C_L}.
#       • Feed f into a standard classifier (as in Sections 1–2):
#             – Linear or MLP: f → logits z ∈ R^{K}
#             – Probabilities: softmax(z) for K-class classification.
#       • Loss: categorical cross-entropy (multi-class) or binary cross-entropy
#         (if the task is binary or multi-label), exactly as previously described.
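#
# A sketch of the two ways to build the head input, assuming a hypothetical
# backbone output of shape (7, 7, 16) and K = 10 classes. The random weights are
# placeholders, not a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
F = rng.normal(size=(7, 7, 16))         # hypothetical feature maps (H_L, W_L, C_L)
K = 10

f_flat = F.reshape(-1)                  # flatten: length H_L * W_L * C_L = 784
f_gap = F.mean(axis=(0, 1))             # global average pooling: length C_L = 16

# Linear head on the pooled vector.
W_head = 0.1 * rng.normal(size=(K, f_gap.size))
b_head = np.zeros(K)
z = W_head @ f_gap + b_head             # logits, shape (K,)

p = np.exp(z - z.max())                 # softmax with max-shift for stability
p /= p.sum()

params_flat_head = K * f_flat.size + K  # 7850 weights and biases
params_gap_head = K * f_gap.size + K    # 170 weights and biases
print(f_flat.shape, f_gap.shape, params_flat_head, params_gap_head)
```

# Global pooling keeps the head size independent of the spatial resolution,
# while flattening ties it to H_L and W_L.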
#
#
# 4.5) SUMMARY: CNNs IN THE CLASSIFICATION PIPELINE
# ----------------------------------------------------------------------------------
#   • The classifier from Sections 1–2 assumes an input vector x ∈ R^{d_in}.
#   • CNNs provide a way to map structured inputs (e.g. images) to such a vector:
#         raw image → conv layers → feature maps → pooled / flattened feature vector f
#         f → fully-connected classifier → logits → probabilities → cross-entropy loss
#
#   • Convolution layers:
#       – exploit spatial locality,
#       – reuse filters across the grid (weight sharing),
#       – produce feature maps that respond to patterns (edges, shapes, textures),
#       – feed these high-level representations into the same probabilistic framework
#         (logits + softmax/sigmoid + cross-entropy) already developed for classifiers.
