# Turning Probability into Dynamics or Optimization  
When distributions become trajectories or objectives

---

## 1. The Core Motivation

Exact probability is a static object:

$$
p(x), \qquad \int p(x)\,dx = 1
$$

But static objects are hard to work with:

- normalization constants are often intractable
- exact marginalization can cost time exponential in the number of variables
- likelihoods are often unavailable in closed form

So we ask a radical question:

Can we replace **“what is the probability?”** with **“how does it change?”** or **“what should be optimized?”**

This leads to two transformations:

- Probability → Optimization  
- Probability → Dynamics  

---

## 2. Probability → Optimization  
Inference becomes finding extrema

### 2.1 Log-Probability as an Objective

Instead of manipulating  
$$
p(x)
$$
work with:

$$
\log p(x)
$$

Why?

- products → sums
- tiny probabilities no longer underflow
- gradients are simpler and better scaled

This already turns probability into geometry.
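
A small illustration of the stability point, using made-up numbers: the product of many small probabilities underflows in double precision, while the sum of their logs stays finite and usable as an objective.

```python
import math

# 2000 independent observations, each with probability 0.01 (illustrative values).
probs = [0.01] * 2000

# Multiplying directly: the true value 1e-4000 is far below the smallest double.
naive_product = 1.0
for p in probs:
    naive_product *= p

# Summing logs instead keeps the quantity in a comfortable numeric range.
log_likelihood = sum(math.log(p) for p in probs)

print(naive_product)   # underflows to 0.0
print(log_likelihood)  # a finite negative number
```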

---

### 2.2 Maximum Likelihood Estimation (MLE)

$$
\theta^\* = \arg\max_\theta \log p(x \mid \theta)
$$

Key shift:

- distribution → single best parameter  
- integration → differentiation  

Interpretation:

Probability becomes a **loss function**.

This is the foundation of:

- regression  
- classification  
- neural networks  
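
As a minimal sketch of "integration → differentiation", assume a Gaussian model with unknown mean and unit variance (a choice made here for illustration). Gradient ascent on the average log-likelihood should then land on the sample mean, which is the closed-form MLE for this model. The function names, learning rate, and data are hypothetical.

```python
def log_likelihood_grad(mu, data):
    # d/dmu of sum_i log N(x_i | mu, 1) is sum_i (x_i - mu).
    return sum(x - mu for x in data)

def fit_mean(data, lr=0.1, steps=200):
    # Plain gradient ascent on the average log-likelihood.
    mu = 0.0
    for _ in range(steps):
        mu += lr * log_likelihood_grad(mu, data) / len(data)
    return mu

data = [1.2, 0.7, 2.1, 1.5, 0.9]
mu_hat = fit_mean(data)
sample_mean = sum(data) / len(data)
print(mu_hat, sample_mean)
```

Minimizing the negative of this objective is the usual squared-error loss, which is one way to see why the log-likelihood behaves like a loss function.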

---

### 2.3 Maximum A Posteriori (MAP)

$$
\theta^\* = \arg\max_\theta \big[ \log p(x \mid \theta) + \log p(\theta) \big]
$$

Adds:

- prior knowledge  
- regularization  

MAP collapses Bayesian inference into deterministic optimization.
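
As a worked example of the regularization reading, assume a zero-mean Gaussian prior with variance τ² (the symbol τ is introduced here for illustration and is not part of the notation above). Its log density is quadratic, so the MAP objective becomes the negative log-likelihood plus an L2 penalty:

$$
\log p(\theta) = -\frac{\|\theta\|^2}{2\tau^2} + \text{const}
\quad \Rightarrow \quad
\theta^\* = \arg\min_\theta \Big[ -\log p(x \mid \theta) + \frac{1}{2\tau^2}\|\theta\|^2 \Big]
$$

A smaller τ means a tighter prior and a stronger penalty.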

---

### 2.4 Energy-Based Reformulation

Define:

$$
E(x) = -\log p(x) - \log Z
\quad \Longleftrightarrow \quad
p(x) = \frac{e^{-E(x)}}{Z},
\qquad
Z = \int e^{-E(x)}\,dx
$$

Now:

- high probability = low energy  
- finding the most probable state = energy minimization  

This reframes probability as physics.
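
A minimal sketch on a three-state discrete space (the energies are made up) shows both points: the lowest-energy state gets the highest probability, and adding a constant to every energy changes Z but not the distribution.

```python
import math

def boltzmann(energies):
    # p(x) = exp(-E(x)) / Z with Z = sum over states of exp(-E(x)).
    weights = [math.exp(-e) for e in energies]
    z = sum(weights)
    return [w / z for w in weights]

energies = [0.5, 1.0, 3.0]
p = boltzmann(energies)

# Shifting every energy by the same constant rescales Z and cancels out.
p_shifted = boltzmann([e + 10.0 for e in energies])

print(p)
print(p_shifted)
```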

---

### 2.5 Variational Inference (Optimization over Distributions)

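Instead of optimizing parameters, optimize over distributions. Given an observation x and latent variables z, pick the member of a tractable family that is closest to the posterior:

$$
q^\* = \arg\min_q \mathrm{KL}\big(q(z) \,\|\, p(z \mid x)\big)
$$

Because the evidence does not depend on q, this is equivalent to maximizing:

$$
\mathrm{ELBO}
=
\mathbb{E}_q[\log p(x, z)]
-
\mathbb{E}_q[\log q(z)]
$$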

Here:

- probability → functional objective  
- inference → calculus of variations  

This is probability as optimization in function space.
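
A small numerical check of the bound, assuming a discrete latent variable z with three states and one fixed observation x (the joint values are hypothetical). The ELBO of any q stays at or below the log evidence, and it reaches the log evidence when q is the exact posterior.

```python
import math

def elbo(q, joint):
    # ELBO(q) = sum_z q(z) * (log p(x, z) - log q(z)), skipping zero-mass states.
    return sum(qz * (math.log(pz) - math.log(qz))
               for qz, pz in zip(q, joint) if qz > 0)

# p(x, z) for z in {0, 1, 2} at the observed x (made-up numbers).
joint = [0.10, 0.25, 0.05]

evidence = sum(joint)                       # p(x)
log_evidence = math.log(evidence)
posterior = [pz / evidence for pz in joint]  # p(z | x)
uniform = [1.0 / 3.0] * 3

print(elbo(uniform, joint), elbo(posterior, joint), log_evidence)
```

The gap between the log evidence and the ELBO is the KL divergence from q to the posterior, which is why maximizing one minimizes the other.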

---

## 3. Probability → Dynamics  
Inference becomes motion

Instead of asking:

What is  
$$
p(x)
$$

We ask:

How does a system move so that its equilibrium is  
$$
p(x)
$$

---

## 4. Gradient Flow Interpretation

### 4.1 Probability as a Landscape

Consider:

$$
\log p(x)
$$

This defines a surface.  
Gradients point toward higher probability.

---

### 4.2 Langevin Dynamics

$$
dx_t
=
\nabla \log p(x_t)\,dt
+
\sqrt{2}\,dW_t
$$

Interpretation:

- gradient ascent → move toward high probability  
- noise → explore uncertainty  

This is:

- sampling  
- optimization  
- diffusion  

all at once.

Langevin dynamics is probability turned into motion.
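
The equation above can be simulated with an Euler-Maruyama discretization. The sketch below assumes a one-dimensional standard normal target, whose score is simply -x; the step size, chain length, and burn-in are illustrative choices, not tuned values.

```python
import math
import random

def langevin_chain(score, x0, step, n_steps, rng):
    """Euler-Maruyama discretization of dx = score(x) dt + sqrt(2) dW."""
    x = x0
    samples = []
    for _ in range(n_steps):
        noise = rng.gauss(0.0, 1.0)
        x = x + step * score(x) + math.sqrt(2.0 * step) * noise
        samples.append(x)
    return samples

def standard_normal_score(x):
    # For p(x) = N(0, 1), grad log p(x) = -x.
    return -x

rng = random.Random(0)
samples = langevin_chain(standard_normal_score, x0=5.0, step=0.1,
                         n_steps=50_000, rng=rng)
kept = samples[1_000:]  # discard the initial transient from x0 = 5
mean = sum(kept) / len(kept)
var = sum((s - mean) ** 2 for s in kept) / len(kept)
print(mean, var)
```

Without a correction step, the discretized chain carries a small bias in its stationary variance that shrinks as the step size goes to zero.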

---

## 5. Stochastic Differential Equations (SDEs)

General form:

$$
dx = f(x,t)\,dt + g(t)\,dW
$$

Here:

- probability is not computed  
- probability is generated  

The distribution of  
$$
x_t
$$
*is* the probability.

---

### 5.1 Fokker–Planck Equation

Instead of tracking particles, track density:

$$
\frac{\partial p}{\partial t}
=
-\nabla \cdot (f p)
+
\frac{1}{2}\nabla^2(g^2 p)
$$

This converts:

- stochastic dynamics → deterministic PDE  
- randomness → density flow  

---

## 6. Reverse-Time Dynamics

A profound insight:

If you know how probability diffuses forward, you can reverse it.

Reverse SDE:

$$
dx
=
\big[
f(x,t)
-
g^2(t)\nabla \log p_t(x)
\big]\,dt
+
g(t)\,d\bar{W}
$$

This requires only:

$$
\nabla \log p_t(x)
$$

Not  
$$
p(x)
$$
itself.

---

## 7. Score-Based Models  
Probability via vector fields

Instead of learning  
$$
p(x)
$$
learn:

$$
s(x,t) = \nabla_x \log p_t(x)
$$

Why this matters:

- no normalization  
- local information only  
- the same formulation applies in any dimension  

Generation equals integrating a differential equation.
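
The "no normalization" point can be checked numerically. The sketch below assumes a one-dimensional Gaussian with hypothetical parameters and compares a finite-difference score of the unnormalized log density with that of the normalized one; the constant drops out under differentiation.

```python
import math

def log_unnormalized(x, mu=1.0, sigma=2.0):
    # log of exp(-(x - mu)^2 / (2 sigma^2)); the normalizer is deliberately omitted.
    return -((x - mu) ** 2) / (2.0 * sigma ** 2)

def log_normalized(x, mu=1.0, sigma=2.0):
    # Same density with its normalizing constant included.
    return log_unnormalized(x, mu, sigma) - math.log(sigma * math.sqrt(2.0 * math.pi))

def score_fd(log_density, x, eps=1e-5):
    # Central finite difference approximating d/dx log p(x).
    return (log_density(x + eps) - log_density(x - eps)) / (2.0 * eps)

x = 3.0
analytic_score = -(x - 1.0) / 2.0 ** 2  # -(x - mu) / sigma^2
print(score_fd(log_unnormalized, x), score_fd(log_normalized, x), analytic_score)
```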

---

## 8. Diffusion Models  
Probability as reversible destruction

Forward process (signal level α_t decreasing from 1 toward 0):

$$
x_t
=
\sqrt{\alpha_t}\,x_0
+
\sqrt{1-\alpha_t}\,\epsilon,
\qquad
\epsilon \sim \mathcal{N}(0, I)
$$

Reverse process:

- remove noise gradually  
- guided by learned score  

This replaces density estimation with trajectory synthesis.
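
A minimal sketch of the forward step and its link to the score, assuming the variance-preserving form x_t = sqrt(alpha) * x_0 + sqrt(1 - alpha) * eps with a scalar signal level `alpha` in (0, 1). The function names and numbers are hypothetical. Conditioned on x_0, the noised sample is Gaussian, so its score is the added noise rescaled by a known factor.

```python
import math
import random

def forward_noise(x0, alpha, rng):
    """Sample x_t = sqrt(alpha) * x0 + sqrt(1 - alpha) * eps with eps ~ N(0, 1)."""
    eps = rng.gauss(0.0, 1.0)
    xt = math.sqrt(alpha) * x0 + math.sqrt(1.0 - alpha) * eps
    return xt, eps

def conditional_score(xt, x0, alpha):
    """Gradient in x_t of log N(x_t | sqrt(alpha) * x0, 1 - alpha)."""
    return -(xt - math.sqrt(alpha) * x0) / (1.0 - alpha)

rng = random.Random(0)
x0, alpha = 1.5, 0.3
xt, eps = forward_noise(x0, alpha, rng)

score = conditional_score(xt, x0, alpha)
score_from_noise = -eps / math.sqrt(1.0 - alpha)
print(score, score_from_noise)
```

Under this assumption, predicting the added noise and predicting the conditional score differ only by the factor 1 / sqrt(1 - alpha).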

---

## 9. Optimal Transport and Probability Flow

### 9.1 Probability as Mass Movement

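Instead of random motion, move probability mass optimally. Among all maps T that push the source distribution μ onto the target ν, pick the one with the least average squared displacement:

$$
\min_{T:\,T_\# \mu = \nu} \int \|x - T(x)\|^2 \, d\mu(x)
$$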

This yields:

- deterministic flows  
- invertible mappings  

The same deterministic, invertible picture (though not always with the optimal map) appears in:

- normalizing flows  
- probability flow ODEs  
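
In one dimension with squared-distance cost and equally weighted samples, the optimal map is monotone: the i-th smallest source point goes to the i-th smallest target point. The sketch below, with made-up points and hypothetical function names, compares that sorted pairing against a brute-force search over all pairings.

```python
from itertools import permutations

def transport_cost(xs, ys):
    # Average squared displacement when xs[i] is moved to ys[i].
    return sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def monotone_coupling(xs, ys):
    # Pair the i-th smallest source point with the i-th smallest target point.
    return sorted(xs), sorted(ys)

def brute_force_cost(xs, ys):
    # Try every one-to-one assignment; only feasible for a handful of points.
    return min(transport_cost(xs, perm) for perm in permutations(ys))

source = [0.3, -1.2, 2.5, 0.9]
target = [1.0, 4.0, -0.5, 2.2]

sorted_source, sorted_target = monotone_coupling(source, target)
print(transport_cost(sorted_source, sorted_target), brute_force_cost(source, target))
```

The brute-force search grows factorially with the number of points, so it serves only as a check on small inputs.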

---

## 10. Probability Flow ODEs

Replace SDE:

$$
dx = f(x,t)\,dt + g(t)\,dW
$$

With deterministic ODE:

$$
dx
=
\big[
f(x,t)
-
\frac{1}{2}g^2(t)\nabla \log p_t(x)
\big]\,dt
$$

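Its trajectories share the same marginal densities as the SDE, which gives:

- exact likelihoods via the change-of-variables formula (when the score is exact)  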
- deterministic generation  
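
A case where everything is available in closed form makes the ODE concrete. The sketch assumes f(x, t) = -x, g = sqrt(2), and a zero-mean Gaussian start with variance `var0`; these choices are illustrative. The marginal stays Gaussian with variance 1 + (var0 - 1) * exp(-2t), so the exact score is known, and the ODE should simply rescale each point by the ratio of standard deviations.

```python
import math

def variance(t, var0):
    # Marginal variance of dx = -x dt + sqrt(2) dW started from N(0, var0).
    return 1.0 + (var0 - 1.0) * math.exp(-2.0 * t)

def flow_velocity(x, t, var0):
    drift = -x                       # f(x, t)
    g_squared = 2.0                  # g(t)^2
    score = -x / variance(t, var0)   # exact score of N(0, variance(t))
    return drift - 0.5 * g_squared * score

def integrate(x0, t_end, var0, steps=10_000):
    # Forward Euler integration of the probability flow ODE.
    dt = t_end / steps
    x, t = x0, 0.0
    for _ in range(steps):
        x += dt * flow_velocity(x, t, var0)
        t += dt
    return x

var0, x0 = 4.0, 2.0
x1 = integrate(x0, 1.0, var0)
expected = x0 * math.sqrt(variance(1.0, var0) / var0)
print(x1, expected)
```

No random numbers are drawn: the same starting point always maps to the same end point, yet a Gaussian cloud of starting points ends with the same variance the SDE would produce.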

---

## 11. Optimization–Dynamics Duality

A unifying principle:

| Optimization | Dynamics |
|-------------|----------|
| Gradient descent | Deterministic flow |
| Stochastic gradient descent | Langevin dynamics |
| Loss landscape | Energy landscape |
| Convergence | Stationary distribution |

Optimization is frozen dynamics.  
Dynamics is noisy optimization.

---

## 12. Why This Works (Deep Reason)

Probability is hard because:

- it is global  
- it requires integration  
- it needs normalization  

Dynamics and optimization:

- are local  
- use gradients  
- avoid normalization  

They turn global structure into local motion.

---

## 13. Failure Modes

This transformation fails when:

- gradients are inaccurate  
- landscapes are multimodal  
- dynamics mix slowly  
- optimization collapses diversity  

This explains:

- mode collapse  
- overconfident models  
- poor uncertainty estimates  

---

## 14. The Meta Insight

Probability does not need to be represented to be used.  
It only needs to guide motion or descent.

This is why many modern generative models can sample without:

- computing likelihoods
- normalizing densities
- storing distributions explicitly

They **move through probability**.

---

## 15. Final Synthesis

Turning probability into dynamics or optimization replaces  
**“belief as a function”** with **“belief as behavior.”**

Instead of asking:

- What is likely?

We ask:

- Where does the system move?  
- What does it converge to?

This is one of the most productive computational reframings of probability.
