# 05 · Information Geometry & Natural Gradient on Free Energy

We compare **vanilla gradient**, **natural gradient**, and **KL mirror descent** when minimizing a simple variational free energy (the negative ELBO) over categorical beliefs.

Setup:
- A fixed vector of (unnormalized) log-joint scores $\phi_s = \log p(s,o) + \text{const}$.
- A variational family $q(s)=\mathrm{softmax}(\theta)$.
- Objective: $F(q) = \mathbb{E}_q[\log q - \phi]$.
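
Because $\phi$ is unnormalized, $F$ differs from the KL divergence to $p_\phi=\mathrm{softmax}(\phi)$ only by a constant. Writing $\log Z = \log\sum_s e^{\phi_s}$ (a symbol introduced here for convenience), so that $\log p_\phi(s) = \phi_s - \log Z$:

$$
F(q) = \sum_s q(s)\,\bigl[\log q(s) - \phi_s\bigr]
     = \sum_s q(s)\,\bigl[\log q(s) - \log p_\phi(s)\bigr] - \log Z
     = \mathrm{KL}(q\,\Vert\,p_\phi) - \log Z .
$$

Hence $F$ is minimized at $q = p_\phi$ with minimum value $-\log Z$, and the $F$ and KL curves tracked below should differ only by a constant vertical offset.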

We track:
- Convergence of $F$.
- KL($q\,\Vert\,p_\phi$), where $p_\phi=\mathrm{softmax}(\phi)$.
- **Information length** along the path using the Fisher metric on the simplex.

In [None]:
# CI-friendly parameters
import os
CI = os.getenv("CI", "").lower() in ("1","true","yes")
N_STEPS = 60 if CI else 200
STEP_VAN = 0.2 if CI else 0.1
STEP_NAT = 0.8 if CI else 0.4
STEP_MD  = 0.8 if CI else 0.4
print({"CI": CI, "N_STEPS": N_STEPS})

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from persystems.geometry import (
    softmax, elbo_grad_theta, natural_grad_from_theta, natural_step_theta,
    mirror_descent_kl, fisher_categorical, information_length
)
np.set_printoptions(precision=4, suppress=True)
plt.rcParams['figure.dpi'] = 120
rng = np.random.default_rng(0)

## Define the problem
$\phi$ induces the target distribution $p_\phi=\mathrm{softmax}(\phi)$. We'll start from random logits $\theta_0$.

In [None]:
S = 8
phi = rng.normal(0, 1, size=S) * 1.0 + np.linspace(-0.5, 0.5, S)
p_phi = softmax(phi)
theta0 = rng.normal(0, 1, size=S)
q0 = softmax(theta0)
print("p_phi:", np.round(p_phi,4), "| sum=", p_phi.sum())

Helpers for $F$, KL, and a single vanilla-gradient step in $\theta$.

In [None]:
def F_elbo(theta):
    q = softmax(theta)
    return float(np.sum(q * (np.log(q + 1e-12) - phi)))

def KL(q, p):
    q = np.clip(q, 1e-12, 1); p = np.clip(p, 1e-12, 1)
    return float(np.sum(q * (np.log(q) - np.log(p))))

def vanilla_step_theta(theta, step):
    g = elbo_grad_theta(theta, phi)  # ∂F/∂θ
    th_new = theta - step * g
    return th_new, softmax(th_new), g
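
The gradient $\partial F/\partial\theta$ used by the vanilla step has a closed form for this objective: with $\ell = \log q - \phi$, the chain rule through the softmax gives $\partial F/\partial\theta_j = q_j\,(\ell_j - \sum_i q_i \ell_i)$. The sketch below checks that formula against central finite differences using NumPy only. The helper names (`free_energy`, `grad_closed_form`, `grad_finite_diff`) are illustrative, and it is an assumption, not something verified here, that `elbo_grad_theta` computes the same quantity.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax (local stand-in, not the persystems version)."""
    z = np.exp(x - np.max(x))
    return z / z.sum()

def free_energy(theta, phi):
    """F(theta) = sum_s q_s (log q_s - phi_s) with q = softmax(theta)."""
    q = softmax(theta)
    return float(np.sum(q * (np.log(q) - phi)))

def grad_closed_form(theta, phi):
    """dF/dtheta_j = q_j * (ell_j - sum_i q_i ell_i), where ell = log q - phi."""
    q = softmax(theta)
    ell = np.log(q) - phi
    return q * (ell - np.dot(q, ell))

def grad_finite_diff(theta, phi, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    g = np.zeros_like(theta)
    for j in range(theta.size):
        e = np.zeros_like(theta)
        e[j] = eps
        g[j] = (free_energy(theta + e, phi) - free_energy(theta - e, phi)) / (2 * eps)
    return g

# Small random test problem (values are arbitrary, chosen only for the check).
rng_demo = np.random.default_rng(1)
phi_demo = rng_demo.normal(size=5)
theta_demo = rng_demo.normal(size=5)

g_exact = grad_closed_form(theta_demo, phi_demo)
g_fd = grad_finite_diff(theta_demo, phi_demo)
max_abs_err = float(np.max(np.abs(g_exact - g_fd)))
print("max |closed form - finite diff| =", max_abs_err)
print("sum of gradient components      =", float(g_exact.sum()))
```

The gradient components sum to zero (up to rounding) because adding a constant to every logit leaves $q$, and therefore $F$, unchanged.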

## Run three optimizers
- **Vanilla**: $\theta \leftarrow \theta - \alpha \nabla_\theta F$.
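- **Natural**: $\theta \leftarrow \theta - \eta\, \nabla^{\text{nat}}_\theta F$, the gradient preconditioned by the inverse Fisher metric (a centered gradient, since softmax logits are only defined up to an additive constant).
- **Mirror descent (KL)** in probability coordinates: a multiplicative update of $q$ followed by renormalization.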
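
To make the relationship between the natural and mirror updates concrete, here is a minimal NumPy sketch of one step of each. It assumes the natural gradient is the pseudo-inverse of the Fisher matrix $G = \mathrm{diag}(q) - qq^\top$ applied to $\nabla_\theta F$, and that the KL mirror step is the exponentiated-gradient update $q \leftarrow q \odot e^{-\eta \nabla_q F}$ followed by renormalization. Both are standard textbook forms, but whether `natural_step_theta` and `mirror_descent_kl` implement exactly these is an assumption; the function names below are illustrative stand-ins.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax (local stand-in for the library helper)."""
    z = np.exp(x - np.max(x))
    return z / z.sum()

def grad_theta(theta, phi):
    """Vanilla gradient dF/dtheta = G @ ell, with G = diag(q) - q q^T, ell = log q - phi."""
    q = softmax(theta)
    ell = np.log(q) - phi
    return q * (ell - np.dot(q, ell))

def natural_grad(theta, phi):
    """Assumed form: pseudo-inverse of the (singular) Fisher matrix times the gradient."""
    q = softmax(theta)
    fisher = np.diag(q) - np.outer(q, q)  # null space: constant shifts of theta
    return np.linalg.pinv(fisher, rcond=1e-10) @ grad_theta(theta, phi)

def mirror_step(q, phi, step):
    """Assumed form: exponentiated-gradient step in q, then renormalize."""
    grad_q = np.log(q) - phi + 1.0
    w = q * np.exp(-step * grad_q)
    return w / w.sum()

# Arbitrary small test problem.
rng_demo = np.random.default_rng(2)
phi_demo = rng_demo.normal(size=6)
theta_demo = rng_demo.normal(size=6)
eta = 0.4

ng = natural_grad(theta_demo, phi_demo)
centered = (theta_demo - phi_demo) - np.mean(theta_demo - phi_demo)
q_nat = softmax(theta_demo - eta * ng)                   # natural step, mapped to q
q_md = mirror_step(softmax(theta_demo), phi_demo, eta)   # mirror step in q
q_one = softmax(theta_demo - 1.0 * ng)                   # natural step with unit step size

print("natural gradient equals centered (theta - phi):", np.allclose(ng, centered))
print("natural and mirror steps agree in q:", np.allclose(q_nat, q_md))
print("unit natural step reaches softmax(phi):", np.allclose(q_one, softmax(phi_demo)))
```

Under these assumptions the natural gradient reduces to the centered difference $(\theta-\phi) - \mathrm{mean}(\theta-\phi)$. The natural and mirror updates then produce the same $q$ for equal step sizes, and a unit step lands on $p_\phi$ directly. If the library routines differ from these forms (for example through damping or a different discretization), the trajectories need not coincide.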

In [None]:
def run_vanilla(theta, nsteps, step):
    thetas, qs, Fvals, KLs = [], [], [], []
    th = theta.copy()
    for t in range(nsteps):
        th, q, g = vanilla_step_theta(th, step)
        thetas.append(th.copy()); qs.append(q.copy())
        Fvals.append(F_elbo(th)); KLs.append(KL(q, p_phi))
    return np.array(thetas), np.array(qs), np.array(Fvals), np.array(KLs)

def run_natural(theta, nsteps, step):
    thetas, qs, Fvals, KLs = [], [], [], []
    th = theta.copy()
    for t in range(nsteps):
        g = elbo_grad_theta(th, phi)
        th, q = natural_step_theta(th, g, step)
        thetas.append(th.copy()); qs.append(q.copy())
        Fvals.append(F_elbo(th)); KLs.append(KL(q, p_phi))
    return np.array(thetas), np.array(qs), np.array(Fvals), np.array(KLs)

def run_mirror(q_init, nsteps, step):
    qs, Fvals, KLs = [], [], []
    q = q_init.copy()
    # grad in q-coordinates for F = sum q(log q - phi): ∂F/∂q = log q - phi + 1
    for t in range(nsteps):
        grad_q = np.log(np.clip(q,1e-12,1)) - phi + 1.0
        q = mirror_descent_kl(q, grad_q, step)
        qs.append(q.copy())
        # Back out theta only for reporting (any logits th with softmax(th)=q work):
        th = np.log(np.clip(q,1e-12,1))
        Fvals.append(F_elbo(th)); KLs.append(KL(q, p_phi))
    return np.array(qs), np.array(Fvals), np.array(KLs)

In [None]:
th_v, q_v, F_v, KL_v = run_vanilla(theta0, N_STEPS, STEP_VAN)
th_n, q_n, F_n, KL_n = run_natural(theta0, N_STEPS, STEP_NAT)
q_m,  F_m, KL_m      = run_mirror(q0,    N_STEPS, STEP_MD)
print(F_v[-1], F_n[-1], F_m[-1])

## Information length of the optimization path
We compute the Fisher information length of each path in probability space: the accumulated distance between successive iterates, measured in the Fisher metric.
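
As a point of reference, one common way to discretize this quantity is to sum Fisher-Rao geodesic distances between consecutive iterates, where for categorical distributions $d(p,q) = 2\arccos\bigl(\sum_s \sqrt{p_s q_s}\bigr)$. The sketch below implements that version with NumPy. The names `fisher_rao_distance` and `path_length_sketch` are illustrative, and `information_length` in `persystems.geometry` may use a different discretization (for instance the local quadratic form $\sqrt{\sum_s \Delta q_s^2 / q_s}$), so the numbers need not match exactly.

```python
import numpy as np

def fisher_rao_distance(p, q):
    """Geodesic distance on the simplex under the Fisher metric."""
    bc = np.sum(np.sqrt(p * q))                    # Bhattacharyya coefficient
    return 2.0 * np.arccos(np.clip(bc, 0.0, 1.0))  # clip guards against rounding above 1

def path_length_sketch(qs):
    """Sum of Fisher-Rao distances between consecutive rows of qs."""
    qs = np.asarray(qs, dtype=float)
    return float(sum(fisher_rao_distance(qs[t], qs[t + 1]) for t in range(len(qs) - 1)))

# Toy three-state example: a direct move versus a detour through a third point.
a = np.array([0.7, 0.2, 0.1])
b = np.array([0.1, 0.2, 0.7])
c = np.array([0.1, 0.8, 0.1])

direct = path_length_sketch([a, b])
detour = path_length_sketch([a, c, b])
print({"direct": direct, "detour": detour})
```

The detour is at least as long as the direct path (triangle inequality), which is the sense in which a shorter information length indicates a more efficient route to the same endpoint. Note that the total depends on whether the starting distribution is part of the array: the run functions in this notebook record iterates after each update, so prepending `q0` would add the first segment to each path.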

In [None]:
from persystems.geometry import information_length
Lv = information_length(q_v)  # vanilla
Ln = information_length(q_n)  # natural
Lm = information_length(q_m)  # mirror descent
print({"L_vanilla": Lv, "L_natural": Ln, "L_mirror": Lm})

In [None]:
fig, ax = plt.subplots(1,2, figsize=(10,4))
ax[0].plot(F_v, label='vanilla'); ax[0].plot(F_n, label='natural'); ax[0].plot(F_m, label='mirror')
ax[0].set_title('Free energy F over iterations'); ax[0].set_xlabel('step'); ax[0].set_ylabel('F')
ax[0].legend(fontsize=8)
ax[1].plot(KL_v, label='vanilla'); ax[1].plot(KL_n, label='natural'); ax[1].plot(KL_m, label='mirror')
ax[1].set_title('KL(q || p_phi)'); ax[1].set_xlabel('step'); ax[1].set_ylabel('KL')
ax[1].legend(fontsize=8)
plt.tight_layout(); plt.show()

plt.figure(figsize=(5,3.5))
plt.bar([0,1,2],[Lv, Ln, Lm]); plt.xticks([0,1,2], ['van','nat','mirror'])
plt.ylabel('information length'); plt.title('Path efficiency (lower is better)')
plt.tight_layout(); plt.show()

### Notes
- **Natural gradient** corresponds to steepest descent in the **Fisher geometry**; often yields shorter information length and faster decrease in $F$.
- **Mirror descent** (KL) uses multiplicative updates, so iterates stay on the simplex by construction; for exponential families it can behave much like natural-gradient steps.
- These tools connect the *geometry* behind active inference updates with concrete computational benefits.