# How Probability Became Computable  
An intellectual journey from belief to optimization, dynamics, and generation

---

## 1. Turning Probability into Optimization  
From uncertainty to objectives

### Ronald A. Fisher (1922–1925) — Maximum Likelihood Estimation

**Key works**
- *On the Mathematical Foundations of Theoretical Statistics*  
- *Theory of Statistical Estimation*

**Core Idea**

Early probability theory treated distributions as objects to be calculated:

$$
p(x)
$$

But Fisher introduced a radical reframing:

Do not try to compute the entire distribution.  
Instead, find the parameters that make the observed data most probable.

Formally:

$$
\theta^{*} = \arg\max_\theta \log p(x \mid \theta)
$$
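As a minimal sketch of this reframing, the snippet below treats the log-likelihood of a coin's bias as an objective and climbs it by gradient ascent. The coin-flip counts, the function names, and the step size are illustrative assumptions, not anything taken from Fisher's papers; the closed-form answer (the fraction of heads) is computed alongside for comparison.

```python
import math

def log_likelihood(theta, heads, tails):
    """Bernoulli log-likelihood of the observed counts for bias theta."""
    return heads * math.log(theta) + tails * math.log(1.0 - theta)

def mle_by_gradient_ascent(heads, tails, steps=5000, lr=0.001):
    """Maximize the log-likelihood with plain gradient ascent (illustrative)."""
    theta = 0.5  # arbitrary starting guess
    for _ in range(steps):
        # Derivative of the log-likelihood with respect to theta.
        grad = heads / theta - tails / (1.0 - theta)
        theta += lr * grad
        theta = min(max(theta, 1e-6), 1.0 - 1e-6)  # keep theta inside (0, 1)
    return theta

heads, tails = 7, 3  # hypothetical data: 7 heads in 10 flips
theta_hat = mle_by_gradient_ascent(heads, tails)
closed_form = heads / (heads + tails)
print(f"gradient ascent: {theta_hat:.6f}, closed form: {closed_form:.6f}")
```

No integral over the distribution is ever computed here; the only operation applied to the probability model is differentiation.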

**What changed conceptually?**

- Probability → objective function  
- Integration → differentiation  
- Inference → optimization  

This was the first great computational rupture in statistics.

**Why it matters historically**

This single move transformed statistics from:

- a descriptive, philosophical discipline  

into:

- a procedural, algorithmic science  

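Much of modern machine learning still rests on this idea: minimizing a cross-entropy loss, for example, is maximum likelihood estimation under another name.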

---

### Edwin T. Jaynes (1957) — Maximum Entropy

**Key work**
- *Information Theory and Statistical Mechanics*

**Core Idea**

What if:

- the distribution is unknown?  
- the full probability is impossible to compute?  
- only partial constraints are available?  

Jaynes’ answer:

Choose the distribution with maximum entropy subject to known constraints.

$$
\max_p \; H(p) \quad \text{s.t. known constraints}
$$
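A small worked case may help. The sketch below solves a dice problem of the kind Jaynes used as an illustration: among all distributions on the faces 1 to 6 with a prescribed mean, find the one with maximum entropy. It assumes the standard result that the solution has exponential form, $p_i \propto e^{\lambda i}$, and finds the multiplier $\lambda$ by bisection; the function name and the target mean of 4.5 are illustrative choices.

```python
import math

def maxent_die(target_mean, faces=(1, 2, 3, 4, 5, 6)):
    """Maximum-entropy distribution on die faces with a fixed mean.

    Assumes the exponential-family form p_i proportional to exp(lam * i)
    and solves for the Lagrange multiplier lam by bisection.
    """
    def mean_for(lam):
        weights = [math.exp(lam * f) for f in faces]
        z = sum(weights)
        return sum(f * w for f, w in zip(faces, weights)) / z

    lo, hi = -10.0, 10.0  # mean_for is increasing in lam on this bracket
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mean_for(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    weights = [math.exp(lam * f) for f in faces]
    z = sum(weights)
    return [w / z for w in weights]

probs = maxent_die(4.5)
mean = sum(f * p for f, p in zip(range(1, 7), probs))
entropy = -sum(p * math.log(p) for p in probs)
print([round(p, 4) for p in probs], round(mean, 4), round(entropy, 4))
```

With a target mean of 3.5 the same routine returns the uniform distribution, which is the sense in which maximum entropy encodes "no information beyond the constraint".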

**Conceptual shift**

- Probability → constrained optimization  
- Ignorance → principled uncertainty  
- Partial knowledge → computable distribution  

**Philosophical importance**

This gave objective Bayesian reasoning one of its main epistemological foundations:

Probability is not truth — it is rational belief under constraints.

---

## 2. Turning Probability into Approximation  
From exactness to controlled error

### Pierre-Simon Laplace (1774) — Laplace Approximation

**Key work**
- *Mémoire sur la probabilité des causes*

**Core Idea**

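Exact Bayesian posteriors often cannot be integrated in closed form.  
Laplace proposed approximating a complex distribution locally by a Gaussian centered at its mode, with covariance set by the curvature of the log-density there:

$$
p(x) \approx \mathcal{N}(\mu, \Sigma), \qquad \mu = \arg\max_x \log p(x), \quad \Sigma = \left(-\nabla^2 \log p(x)\big|_{x=\mu}\right)^{-1}
$$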
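The following sketch applies this recipe to a one-dimensional case where the exact answer is known: the unnormalized density $f(t) = t^a (1-t)^b$ on $(0, 1)$, which is a Beta posterior for a coin's bias under a flat prior. The mode and the second derivative of $\log f$ give the Gaussian, and the same two numbers give an estimate of the normalizing integral. The counts `a = 30`, `b = 10` and the function name are assumptions made for illustration.

```python
import math

def laplace_beta(a, b):
    """Laplace approximation to f(t) = t**a * (1 - t)**b on (0, 1)."""
    mode = a / (a + b)                              # where d/dt log f = 0
    curvature = -a / mode**2 - b / (1 - mode)**2    # d^2/dt^2 log f at the mode
    variance = -1.0 / curvature                     # Gaussian variance
    log_f_mode = a * math.log(mode) + b * math.log(1 - mode)
    # Integral of f is approximated by f(mode) * sqrt(2 * pi * variance).
    log_z = log_f_mode + 0.5 * math.log(2 * math.pi * variance)
    return mode, variance, log_z

a, b = 30, 10
mode, variance, log_z_laplace = laplace_beta(a, b)

# Exact values from the Beta(a + 1, b + 1) distribution, for comparison.
log_z_exact = math.lgamma(a + 1) + math.lgamma(b + 1) - math.lgamma(a + b + 2)
exact_variance = (a + 1) * (b + 1) / ((a + b + 2) ** 2 * (a + b + 3))

print(f"log Z: laplace {log_z_laplace:.4f} vs exact {log_z_exact:.4f}")
print(f"variance: laplace {variance:.5f} vs exact {exact_variance:.5f}")
```

The approximation improves as the counts grow and the posterior becomes more sharply peaked; for small counts or skewed posteriors it can be noticeably off.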

**Consequences**

- Intractable distribution → Gaussian  
- Global complexity → local curvature  
- Integration → second-order geometry  

**Historical role**

This approximation underlies much of classical large-sample Bayesian statistics.

---

### Michael I. Jordan et al. (1999) — Variational Inference

**Key work**
- *An Introduction to Variational Methods for Graphical Models*

**Core Idea**

Instead of computing:

$$
p(z \mid x)
$$

Approximate it with a tractable distribution:

$$
q(z) \approx p(z \mid x)
$$

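By solving, over a tractable family $\mathcal{Q}$:

$$
\min_{q \in \mathcal{Q}} \; \mathrm{KL}\big(q(z) \,\|\, p(z \mid x)\big)
$$

Because $\log p(x)$ does not depend on $q$, this is equivalent to maximizing the evidence lower bound $\mathbb{E}_q[\log p(x, z) - \log q(z)]$, which requires only the joint $p(x, z)$ and never the intractable posterior itself.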
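To make the optimization concrete, here is a sketch of the textbook mean-field example: a correlated two-dimensional Gaussian target approximated by a factorized $q(z_1)\,q(z_2)$. For this target the coordinate-ascent updates that minimize $\mathrm{KL}(q \,\|\, p)$ are available in closed form, so no sampling or integration is needed. The target parameters and the function name are illustrative assumptions.

```python
def mean_field_gaussian(mu, cov, sweeps=200):
    """Coordinate-ascent mean-field fit q(z1) q(z2) to a 2-D Gaussian target.

    Assumes cov is a symmetric positive-definite 2x2 matrix.
    Returns ((mean1, var1), (mean2, var2)) for the two Gaussian factors.
    """
    (s11, s12), (_, s22) = cov
    det = s11 * s22 - s12 * s12
    # Precision matrix (inverse covariance) of the target.
    l11, l12, l22 = s22 / det, -s12 / det, s11 / det
    m1, m2 = 0.0, 0.0  # initial factor means
    for _ in range(sweeps):
        # Each update holds the other factor fixed and minimizes KL(q || p).
        m1 = mu[0] - (l12 / l11) * (m2 - mu[1])
        m2 = mu[1] - (l12 / l22) * (m1 - mu[0])
    # The optimal factor variances are the inverse diagonal precisions.
    return (m1, 1.0 / l11), (m2, 1.0 / l22)

target_mean = (1.0, -2.0)
target_cov = ((1.0, 0.9), (0.9, 1.0))
(q1_mean, q1_var), (q2_mean, q2_var) = mean_field_gaussian(target_mean, target_cov)
print(q1_mean, q1_var, q2_mean, q2_var)
```

The fitted means match the target, but each factor's variance is smaller than the true marginal variance of 1. This underestimation of uncertainty is a well-known property of minimizing $\mathrm{KL}(q \,\|\, p)$ with a factorized family.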

**Conceptual transformation**

- Inference → optimization  
- Integration → factorization  
- Probability → functional approximation  

**Why this is foundational**

This work underpins:

- Modern variational inference (VI)  
- Variational Autoencoders (VAEs)  
- Large-scale Bayesian learning  

---

## 3. Turning Probability into Sampling  
From formulas to experience

### Metropolis et al. (1953)
- *Equation of State Calculations by Fast Computing Machines*

### Hastings (1970)
- *Monte Carlo Sampling Methods Using Markov Chains and Their Applications*

**Core Idea**

Do not compute the distribution — draw samples from it.

- Impossible integral → sample average  
- Probability → repeated random experiment  
- Normalization constant → unnecessary  

This shift created Markov Chain Monte Carlo (MCMC).
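A minimal random-walk Metropolis sampler illustrates why the normalization constant drops out: the acceptance rule depends only on a ratio of densities. The target below (an unnormalized Gaussian bump centered at 3), the step size, the burn-in length, and the function names are all illustrative assumptions.

```python
import math
import random

def log_unnormalized(x):
    """Log of an unnormalized target density; the constant is never needed."""
    return -0.5 * (x - 3.0) ** 2

def metropolis(log_target, n_samples, step=2.0, x0=0.0, seed=0):
    """Random-walk Metropolis with a symmetric Gaussian proposal."""
    rng = random.Random(seed)
    x, samples = x0, []
    for _ in range(n_samples):
        proposal = x + rng.gauss(0.0, step)
        log_ratio = log_target(proposal) - log_target(x)
        # Accept with probability min(1, p(proposal) / p(x)).
        if rng.random() < math.exp(min(0.0, log_ratio)):
            x = proposal
        samples.append(x)  # a rejected move repeats the current state
    return samples

samples = metropolis(log_unnormalized, n_samples=50_000)
kept = samples[1_000:]  # discard an initial burn-in segment
mean = sum(kept) / len(kept)
var = sum((s - mean) ** 2 for s in kept) / len(kept)
print(f"sample mean {mean:.3f}, sample variance {var:.3f}")
```

The sample average stands in for the integral that defines the mean; the same chain could estimate any other expectation under the target.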

---

### Stuart Geman and Donald Geman (1984) — Gibbs Sampling

**Key work**
- *Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images*

**Core Idea**

Decompose a joint distribution into simple conditionals.  
Sample each variable given the others.

- Joint probability → local conditionals  
- A standard inference tool for graphical models  
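The alternating scheme can be sketched on a standard bivariate normal with correlation $\rho$, where each conditional is itself a one-dimensional normal: $x \mid y \sim \mathcal{N}(\rho y,\, 1 - \rho^2)$, and symmetrically for $y \mid x$. The choice of target, the value of `rho`, and the function name are assumptions for illustration.

```python
import math
import random

def gibbs_bivariate_normal(rho, n_samples, seed=0):
    """Gibbs sampler for a standard bivariate normal with correlation rho."""
    rng = random.Random(seed)
    cond_sd = math.sqrt(1.0 - rho * rho)  # std. dev. of each conditional
    x, y, samples = 0.0, 0.0, []
    for _ in range(n_samples):
        x = rng.gauss(rho * y, cond_sd)   # draw x given the current y
        y = rng.gauss(rho * x, cond_sd)   # draw y given the new x
        samples.append((x, y))
    return samples

rho = 0.8
draws = gibbs_bivariate_normal(rho, n_samples=20_000)[500:]  # drop burn-in
n = len(draws)
mx = sum(x for x, _ in draws) / n
my = sum(y for _, y in draws) / n
cov_xy = sum((x - mx) * (y - my) for x, y in draws) / n
var_x = sum((x - mx) ** 2 for x, _ in draws) / n
var_y = sum((y - my) ** 2 for _, y in draws) / n
corr = cov_xy / math.sqrt(var_x * var_y)
print(f"estimated correlation {corr:.3f} (target {rho})")
```

The joint density is never evaluated; only the two conditionals are used, which is what makes the method attractive when conditionals are simple and the joint is not.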

---

## 4. Turning Probability into Dynamics  
From distributions to motion

### Paul Langevin (1908) — Langevin Dynamics

**Key work**
- *On the Theory of Brownian Motion*

**Core Idea**

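Sampling can be expressed as stochastic motion. In the overdamped form used today, the dynamics whose stationary distribution is $p(x)$ are:

$$
dx = \nabla \log p(x)\,dt + \sqrt{2}\,dW_t
$$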

**Conceptual leap**

- Probability → force  
- Sample → trajectory  
- Inference → stochastic dynamics  
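A discretized version of this motion can be sketched as the unadjusted Langevin algorithm: each step drifts along the gradient of $\log p$ and adds Gaussian noise. The sketch uses the convention in which the noise is scaled by $\sqrt{2\varepsilon}$ for step size $\varepsilon$, so that the continuous-time limit leaves $p$ invariant. The standard normal target, the step size, and the function names are illustrative assumptions.

```python
import math
import random

def grad_log_p(x):
    """Score of a standard normal target: d/dx log p(x) = -x."""
    return -x

def unadjusted_langevin(n_steps, step=0.05, x0=5.0, seed=0):
    """Euler-Maruyama discretization of overdamped Langevin dynamics."""
    rng = random.Random(seed)
    x, path = x0, []
    for _ in range(n_steps):
        noise = rng.gauss(0.0, 1.0)
        # Drift toward high probability, plus noise scaled by sqrt(2 * step).
        x = x + step * grad_log_p(x) + math.sqrt(2.0 * step) * noise
        path.append(x)
    return path

path = unadjusted_langevin(n_steps=100_000)
kept = path[1_000:]  # discard the transient from the far-off start
mean = sum(kept) / len(kept)
var = sum((v - mean) ** 2 for v in kept) / len(kept)
print(f"mean {mean:.3f}, variance {var:.3f}")
```

Because the discretization is not corrected by an accept/reject step, the sampled variance is slightly biased away from 1; shrinking the step size reduces the bias at the cost of slower exploration.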

---

### Radford Neal (2011) — Hamiltonian Monte Carlo

**Key work**
- *MCMC Using Hamiltonian Dynamics*

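**Core Idea**

Augment the state with an auxiliary momentum and simulate Hamiltonian dynamics to propose distant moves. The method was introduced as Hybrid Monte Carlo by Duane, Kennedy, Pendleton, and Roweth (1987); Neal's chapter is its standard modern exposition.

- Random walks → physical trajectories  
- Slow mixing → nearly energy-conserving motion  

This made Bayesian inference practical in much higher dimensions.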
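The mechanics can be sketched in one dimension: draw a random momentum, follow a leapfrog-discretized Hamiltonian trajectory, then accept or reject based on the change in total energy. This is a simplified illustration under assumed settings (standard normal target, unit mass, fixed step size and trajectory length), not a tuned implementation.

```python
import math
import random

def potential(x):
    """U(x) = -log p(x) up to a constant, for a standard normal target."""
    return 0.5 * x * x

def grad_potential(x):
    return x

def hmc(n_samples, step=0.2, n_leapfrog=10, x0=0.0, seed=0):
    """Basic Hamiltonian Monte Carlo with a leapfrog integrator."""
    rng = random.Random(seed)
    x, samples = x0, []
    for _ in range(n_samples):
        p = rng.gauss(0.0, 1.0)                      # resample momentum
        x_new, p_new = x, p
        p_new -= 0.5 * step * grad_potential(x_new)  # initial half step
        for i in range(n_leapfrog):
            x_new += step * p_new                    # full position step
            if i < n_leapfrog - 1:
                p_new -= step * grad_potential(x_new)  # full momentum step
        p_new -= 0.5 * step * grad_potential(x_new)  # final half step
        h_old = potential(x) + 0.5 * p * p
        h_new = potential(x_new) + 0.5 * p_new * p_new
        # Metropolis correction for the integrator's energy error.
        if rng.random() < math.exp(min(0.0, h_old - h_new)):
            x = x_new
        samples.append(x)
    return samples

samples = hmc(n_samples=20_000)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(f"mean {mean:.3f}, variance {var:.3f}")
```

Since the leapfrog integrator nearly conserves the total energy, most proposals are accepted even though each one moves far from the current state.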

---

## 5. Turning Probability into Energy  
From belief to physics

### Geoffrey Hinton (2002)
- *Training Products of Experts by Minimizing Contrastive Divergence*

### Yann LeCun et al. (2006)
- *A Tutorial on Energy-Based Learning*

**Core Idea**

Replace probability with energy:

$$
p(x) \propto e^{-E(x)}
$$

**Meaning**

- Probability → energy landscape  
- Learning → lowering energy of data  
- Normalization → often approximated or avoided  

This is the deep root of:

- Energy-Based Models (EBMs)  
- Later diffusion and score models  
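The relation $p(x) \propto e^{-E(x)}$ and the idea of learning by lowering the energy of data can be sketched on a toy model with four discrete states and one free energy per state. For this tabular case the gradient of the mean log-likelihood with respect to $E_i$ is $p_{\text{model}}(i) - p_{\text{data}}(i)$, so states that occur in the data more often than the model predicts have their energy pushed down. The data frequencies, learning rate, and function names are illustrative assumptions; real energy-based models cannot enumerate the partition function as this sketch does.

```python
import math

def model_probs(energies):
    """Boltzmann distribution p_i = exp(-E_i) / Z over a finite state set."""
    weights = [math.exp(-e) for e in energies]
    z = sum(weights)  # partition function; enumerable only because the set is tiny
    return [w / z for w in weights]

def fit_energies(data_probs, steps=2000, lr=1.0):
    """Gradient ascent on the mean log-likelihood of a tabular energy model."""
    energies = [0.0] * len(data_probs)
    for _ in range(steps):
        p = model_probs(energies)
        # Gradient w.r.t. E_i is p_model(i) - p_data(i): under-predicted
        # states get lower energy, over-predicted states get higher energy.
        energies = [e + lr * (pm - pd)
                    for e, pm, pd in zip(energies, p, data_probs)]
    return energies

data_probs = [0.5, 0.25, 0.125, 0.125]  # hypothetical empirical frequencies
energies = fit_energies(data_probs)
fitted = model_probs(energies)
print([round(e, 3) for e in energies])
print([round(p, 3) for p in fitted])
```

After training, the most frequent state has the lowest energy, and differences in energy equal differences in negative log-probability.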

---

## 6. Turning Probability into Flow  
From density to transformation

### Danilo Rezende and Shakir Mohamed (2015)
- *Variational Inference with Normalizing Flows*

### Laurent Dinh et al. (2017)
- *Density Estimation using Real NVP*

- Complex distribution → invertible transformations  
- Exact likelihood via the change-of-variables formula and the Jacobian determinant  
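The role of the Jacobian can be seen in a one-dimensional sketch. If $z \sim \mathcal{N}(0, 1)$ and $x = f(z)$ for an invertible $f$, then $\log p_X(x) = \log p_Z(f^{-1}(x)) + \log \left| \frac{d f^{-1}}{dx} \right|$. The example below assumes the particular transformation $x = \exp(a z + b)$, chosen only because its inverse and derivative are simple, and checks numerically that the resulting density integrates to about 1. Function names and parameter values are illustrative.

```python
import math

def log_standard_normal(z):
    """Log-density of the base distribution N(0, 1)."""
    return -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)

def flow_log_density(x, a=0.5, b=0.0):
    """Exact log-density of x = exp(a * z + b) with z ~ N(0, 1), for x > 0."""
    z = (math.log(x) - b) / a                       # inverse transformation
    log_abs_det = -math.log(abs(a)) - math.log(x)   # log |dz/dx|
    return log_standard_normal(z) + log_abs_det

def integrate_density(upper=60.0, n=200_000):
    """Midpoint-rule check that the transformed density integrates to ~1."""
    dx = upper / n
    return sum(math.exp(flow_log_density((i + 0.5) * dx)) * dx
               for i in range(n))

total = integrate_density()
print(f"integral of transformed density: {total:.5f}")
print(f"log-density at x = 1: {flow_log_density(1.0):.5f}")
```

Normalizing-flow models stack many such invertible layers and sum their log-Jacobian terms, which is why architectures are designed so that those determinants are cheap to compute.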

---

## 7. Turning Probability into a Vector Field  
From densities to directions

### Aapo Hyvärinen (2005) — Score Matching
- *Estimation of Non-Normalized Statistical Models by Score Matching*

### Yang Song et al. (2019–2021)
- *Score-Based Generative Modeling through Stochastic Differential Equations*

**Core Idea**

Do not learn:

$$
p(x)
$$

Learn instead:

$$
\nabla_x \log p(x)
$$

**Why this matters**

- Distribution → direction  
- Generation → solving reverse-time differential equations  
- Normalization → irrelevant  
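A small sketch shows how a model can be fit through its score alone. For a one-dimensional Gaussian model with mean $\mu$ and precision $\lambda$, the score is $s(x) = -\lambda (x - \mu)$, and Hyvärinen's objective $\mathbb{E}\big[\tfrac{1}{2} s(x)^2 + s'(x)\big]$ reduces to $\tfrac{1}{2} \lambda^2\, \mathbb{E}[(x - \mu)^2] - \lambda$. The code minimizes this by gradient descent on synthetic data; the data-generating parameters, learning rate, and function name are illustrative assumptions.

```python
import random

def fit_gaussian_by_score_matching(data, steps=20_000, lr=0.1):
    """Fit mean and variance of a Gaussian by minimizing the score-matching loss.

    The loss 0.5 * lam**2 * E[(x - mu)**2] - lam involves only the model's
    score and its derivative, never the normalizing constant.
    """
    n = len(data)
    m1 = sum(data) / n                  # empirical E[x]
    m2 = sum(x * x for x in data) / n   # empirical E[x^2]
    mu, lam = 0.0, 1.0                  # initial mean and precision
    for _ in range(steps):
        second_moment = m2 - 2.0 * mu * m1 + mu * mu   # E[(x - mu)^2]
        grad_mu = -lam * lam * (m1 - mu)
        grad_lam = lam * second_moment - 1.0
        mu -= lr * grad_mu
        lam -= lr * grad_lam
    return mu, 1.0 / lam

rng = random.Random(0)
data = [rng.gauss(2.0, 1.5) for _ in range(2_000)]  # synthetic observations
mu_hat, var_hat = fit_gaussian_by_score_matching(data)

sample_mean = sum(data) / len(data)
sample_var = sum((x - sample_mean) ** 2 for x in data) / len(data)
print(f"score matching: mean {mu_hat:.4f}, variance {var_hat:.4f}")
print(f"sample moments: mean {sample_mean:.4f}, variance {sample_var:.4f}")
```

For this simple model the score-matching estimate coincides with the sample mean and variance. Score-based generative models replace the two-parameter score with a neural network and train it with denoised variants of the same objective.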

This idea is the foundation of current diffusion models.

---

## Philosophical Synthesis

All these breakthroughs say the same thing, in different languages:

If probability is intractable, change the question.

Do not ask:

- What is the probability?

Ask instead:

- What can be optimized?  
- What can be approximated?  
- What can be sampled?  
- What can be simulated dynamically?  

---

## Final Insight

Probability did not become simpler.  
Our questions became computational.

This is how probability evolved:

- from a closed theoretical science  

into:

- the engine of modern generative intelligence
