# Approximation of Probability  
How humans learned to reason when exact probability is impossible

---

## 1. Why Approximation Is Inevitable

Exact probability becomes impossible when:

- integrals are high-dimensional,
- normalization constants are unknown,
- distributions have no closed form,
- dependencies are dense or nonlinear.

Formally, problems like:

$$
p(x) = \int p(x, z)\,dz
$$

$$
Z = \int e^{-E(x)}\,dx
$$

$$
\mathbb{E}_p[f(x)]
$$

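have no closed form in general, and computing them exactly is known to be NP-hard for broad model classes such as general Bayesian networks.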

So approximation is not a shortcut; for most realistic models it is the only option.

Approximate probability is the science of controlled error.

---

## 2. The Four Grand Approximation Strategies

Most approximation methods fall into one of four broad families:

| Strategy | Replace probability with |
|--------|--------------------------|
| Optimization | A mode |
| Local geometry | A Gaussian |
| Sampling | Empirical averages |
| Functional bounds | Simpler distributions |

Everything else is a refinement.

---

## 3. Optimization-Based Approximation  
Probability collapses into a single point

### 3.1 Maximum Likelihood (MLE)

$$
\theta^* = \arg\max_\theta p(x \mid \theta)
$$

- Ignores uncertainty  
- Keeps only the most plausible parameter  
- Among the oldest approaches: early forms appear in Gauss and Laplace, and Fisher later formalized it
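
A minimal sketch of the idea, assuming a Bernoulli (coin-flip) model and made-up counts of 7 heads and 3 tails. The function names are illustrative, not from any library. It compares the closed-form maximizer with a brute-force search over a grid of candidate parameters.

```python
import math

def bernoulli_log_likelihood(theta, heads, tails):
    """log p(x | theta) for coin flips summarized by head/tail counts."""
    return heads * math.log(theta) + tails * math.log(1.0 - theta)

def mle_grid(heads, tails, grid_size=999):
    """Maximize the log-likelihood over an evenly spaced grid inside (0, 1)."""
    grid = [(i + 1) / (grid_size + 1) for i in range(grid_size)]
    return max(grid, key=lambda t: bernoulli_log_likelihood(t, heads, tails))

heads, tails = 7, 3                          # hypothetical data
theta_closed_form = heads / (heads + tails)  # analytic maximizer
theta_numeric = mle_grid(heads, tails)       # numerical maximizer
print(theta_closed_form, theta_numeric)
```

Both routes return a single number; nothing in the output says how sharply the likelihood is peaked around it.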

---

### 3.2 Maximum A Posteriori (MAP)

$$
\theta^* = \arg\max_\theta p(\theta \mid x)
$$

- Bayesian inference → optimization  
- Replaces posterior distribution with a delta function  

MAP approximates probability by belief in the best explanation.
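
As a hedged illustration, assume a Bernoulli likelihood with a Beta(a, b) prior, so the posterior is Beta(heads + a, tails + b) and its mode has a closed form. The data and the helper name `map_bernoulli` are hypothetical.

```python
def map_bernoulli(heads, tails, a=2.0, b=2.0):
    """Mode of the Beta(heads + a, tails + b) posterior under a Beta(a, b) prior.

    Valid when both posterior shape parameters exceed 1.
    """
    return (heads + a - 1.0) / (heads + tails + a + b - 2.0)

heads, tails = 7, 3                                      # hypothetical data
theta_mle = heads / (heads + tails)                      # likelihood only
theta_map = map_bernoulli(heads, tails)                  # pulled toward 0.5 by the prior
theta_flat = map_bernoulli(heads, tails, a=1.0, b=1.0)   # flat prior: MAP equals MLE
print(theta_mle, theta_map, theta_flat)
```

With a flat prior the MAP estimate coincides with the MLE; an informative prior shifts the point estimate but still discards the spread of the posterior.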

---

## 4. Local Geometry Approximation  
Probability is “almost Gaussian”

### 4.1 Laplace Approximation

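Approximate the posterior by a Gaussian centered at its mode:

$$
p(\theta \mid x) \approx \mathcal{N}(\theta^*, H^{-1})
$$

Where  
$$
H = -\nabla_\theta^2 \log p(\theta \mid x)\,\Big|_{\theta = \theta^*}
$$
is the Hessian of the negative log-posterior, evaluated at the mode.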

Assumes:

- unimodality  
- smooth curvature  

Used in:

- Bayesian statistics  
- model evidence estimation  
- early Bayesian neural networks  
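
A small sketch of the recipe, assuming a one-dimensional Beta(8, 4) posterior chosen only because its exact variance is known. The Hessian is estimated by finite differences; all names are illustrative.

```python
import math

def neg_log_posterior(theta, alpha=8.0, beta=4.0):
    """Unnormalized negative log density of a Beta(alpha, beta) posterior."""
    return -(alpha - 1.0) * math.log(theta) - (beta - 1.0) * math.log(1.0 - theta)

def second_derivative(f, x, h=1e-4):
    """Central finite-difference estimate of f''(x)."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)

alpha, beta = 8.0, 4.0
mode = (alpha - 1.0) / (alpha + beta - 2.0)       # posterior mode
H = second_derivative(neg_log_posterior, mode)    # curvature at the mode
laplace_var = 1.0 / H                             # variance of the Gaussian fit
exact_var = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1.0))
print(mode, laplace_var, exact_var)
```

The Gaussian fit is centered on the mode rather than the mean, and its variance differs from the exact Beta variance because the true posterior is skewed.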

---

### 4.2 Saddle-Point & Asymptotic Expansions

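Use the maximizer of the exponent and the curvature there to approximate integrals. In one dimension, as n grows:

$$
\int e^{n f(x)}\,dx \approx e^{n f(x^*)}\sqrt{\frac{2\pi}{n\,|f''(x^*)|}}, \qquad x^* = \arg\max_x f(x)
$$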

Extremely powerful for:

- large-data regimes  
- statistical physics  
- information theory  
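
A classic worked case is Stirling's formula. Writing the factorial as an integral of the form above with f(x) = log x - x gives a maximizer at 1, f(1) = -1 and f''(1) = -1, so the Laplace-method estimate with its curvature prefactor is n^(n+1) e^(-n) sqrt(2π/n). The sketch below simply checks that estimate against exact factorials; the function name is illustrative.

```python
import math

def stirling(n):
    """Laplace-method estimate of n!: n^(n+1) * exp(n * f(1)) * sqrt(2*pi / (n * |f''(1)|))."""
    return n ** (n + 1) * math.exp(-n) * math.sqrt(2.0 * math.pi / n)

# Ratio of the approximation to the exact factorial for growing n
ratios = {n: stirling(n) / math.factorial(n) for n in (1, 5, 20, 100)}
for n, ratio in ratios.items():
    print(n, ratio)
```

The ratio approaches 1 as n grows, which is the sense in which the approximation is asymptotic.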

---

## 5. Sampling-Based Approximation  
Probability becomes frequency

### 5.1 Monte Carlo Approximation

$$
\mathbb{E}_p[f(x)] \approx \frac{1}{N}\sum_{i=1}^N f(x_i), \qquad x_i \overset{\text{iid}}{\sim} p
$$

Converts:

- integrals → averages  
- uncertainty → randomness  

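Key theorem: the Law of Large Numbers guarantees convergence, and the Central Limit Theorem gives an error that shrinks like 1/√N at a rate that does not depend on dimension.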
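
A minimal sketch, assuming the target is a standard normal and the quantity of interest is the second moment (whose true value is 1). The helper name and the seed are arbitrary choices.

```python
import random

def monte_carlo_mean(f, sampler, n):
    """Average f over n independent draws produced by sampler()."""
    return sum(f(sampler()) for _ in range(n)) / n

rng = random.Random(0)  # fixed seed so the run is reproducible
estimate = monte_carlo_mean(lambda x: x * x, lambda: rng.gauss(0.0, 1.0), 100_000)
print(estimate)  # close to 1, the second moment of N(0, 1)
```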

---

### 5.2 Importance Sampling

Rewrite the expectation under a proposal q, which must be positive wherever f(x)p(x) is nonzero:

$$
\mathbb{E}_p[f(x)]
=
\mathbb{E}_q\!\left[f(x)\frac{p(x)}{q(x)}\right]
$$

Trade:

- difficult distribution  
- for easier one + weights  

Failure mode: weight degeneracy, where a handful of samples carry almost all the weight because q matches p poorly.
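
One place the reweighting pays off is rare events. The sketch below, under the assumption that p is a standard normal and the proposal q is a unit-variance normal centered on the threshold, estimates a tail probability that plain Monte Carlo would almost never hit. Names and the seed are illustrative.

```python
import math
import random

def tail_prob_importance(threshold, n, rng):
    """Estimate P(X > threshold) for X ~ N(0, 1) using proposal q = N(threshold, 1)."""
    total = 0.0
    for _ in range(n):
        x = rng.gauss(threshold, 1.0)          # draw from the proposal q
        if x > threshold:                      # f(x) is the indicator of the tail event
            # weight p(x) / q(x) = exp(-threshold * x + threshold^2 / 2)
            total += math.exp(-threshold * x + 0.5 * threshold ** 2)
    return total / n

rng = random.Random(0)
estimate = tail_prob_importance(4.0, 100_000, rng)
exact = 0.5 * math.erfc(4.0 / math.sqrt(2.0))  # closed-form Gaussian tail
print(estimate, exact)
```

Here the proposal puts about half its samples in the region that matters, so the weights stay well behaved; a proposal centered far from the tail would show the degeneracy described above.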

---

### 5.3 Markov Chain Monte Carlo (MCMC)

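Avoid computing the normalization constant: these methods need only an unnormalized density (its ratios, conditionals, or log-gradients).

Construct a Markov chain whose stationary distribution is the target: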

$$
\pi(x) = p(x)
$$

Methods:

- Metropolis–Hastings  
- Gibbs sampling  
- Hamiltonian Monte Carlo  
- Langevin dynamics  

MCMC approximates probability by time: averages along the chain stand in for averages over the distribution.
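
A minimal random-walk Metropolis sketch, assuming a one-dimensional target known only up to a constant (here an unnormalized Gaussian with mean 2 and unit variance, so the answer can be checked). Step size, chain length, burn-in, and seed are arbitrary illustrative choices.

```python
import math
import random

def metropolis(log_unnorm, x0, n_steps, step, rng):
    """Random-walk Metropolis using only an unnormalized log density."""
    x, logp = x0, log_unnorm(x0)
    chain = []
    for _ in range(n_steps):
        proposal = x + rng.gauss(0.0, step)    # symmetric proposal
        logp_new = log_unnorm(proposal)
        # Accept with probability min(1, ratio); the constant Z cancels in the ratio
        if rng.random() < math.exp(min(0.0, logp_new - logp)):
            x, logp = proposal, logp_new
        chain.append(x)
    return chain

rng = random.Random(0)
chain = metropolis(lambda x: -0.5 * (x - 2.0) ** 2, x0=0.0, n_steps=50_000, step=1.0, rng=rng)
kept = chain[1_000:]                           # discard burn-in
mean = sum(kept) / len(kept)
var = sum((x - mean) ** 2 for x in kept) / len(kept)
print(mean, var)
```

Successive states are correlated, so the chain carries fewer effective samples than its length suggests.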

---

## 6. Variational Approximation  
Probability becomes a solvable optimization problem

### 6.1 Variational Inference (VI)

Replace the intractable posterior  
$$
p(z \mid x)
$$
with a tractable  
$$
q(z)
$$

$$
q^* = \arg\min_{q \in \mathcal{Q}} \mathrm{KL}\big(q(z) \,\|\, p(z \mid x)\big)
$$

Equivalent to maximizing:

$$
\mathrm{ELBO}
=
\mathbb{E}_q[\log p(x,z)]
-
\mathbb{E}_q[\log q(z)]
$$

Key idea:

Approximate inference becomes function fitting.
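
The equivalence rests on the identity log p(x) = ELBO + KL(q ‖ p(z | x)): the left side does not depend on q, so raising the ELBO lowers the KL. The sketch below checks this numerically for a discrete latent variable with three states; the joint probabilities are made-up numbers.

```python
import math

def elbo(q, joint):
    """E_q[log p(x, z)] - E_q[log q(z)] for a discrete latent z."""
    return sum(qz * (math.log(pxz) - math.log(qz))
               for qz, pxz in zip(q, joint) if qz > 0.0)

def kl(q, p):
    """KL(q || p) for discrete distributions on the same support."""
    return sum(qz * math.log(qz / pz) for qz, pz in zip(q, p) if qz > 0.0)

joint = [0.10, 0.25, 0.05]                     # hypothetical p(x, z) at the observed x
evidence = sum(joint)                          # p(x), tractable only because z is tiny
posterior = [pxz / evidence for pxz in joint]  # exact p(z | x)
log_evidence = math.log(evidence)

q_uniform = [1.0 / 3.0] * 3                    # a deliberately poor approximation
gap = log_evidence - elbo(q_uniform, joint)    # equals KL(q || posterior)
print(gap, kl(q_uniform, posterior), elbo(posterior, joint), log_evidence)
```

The bound is tight exactly when q equals the posterior; any other q leaves a gap equal to the KL divergence.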

---

### 6.2 Mean-Field Approximation

Assume:

$$
q(z) = \prod_i q_i(z_i)
$$

- Breaks dependencies  
- Makes inference tractable  
- Introduces bias  

Used heavily in:

- physics  
- large Bayesian models  
- VAEs  
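
The bias can be seen in a textbook case: a zero-mean bivariate Gaussian with unit variances and correlation rho, approximated by a product of two independent Gaussians. The sketch assumes the standard coordinate-ascent updates for this model, in which each factor's variance is the inverse of the corresponding diagonal precision entry. Names are illustrative.

```python
def mean_field_gaussian(rho, n_iters=200):
    """Coordinate-ascent mean-field fit to N(0, [[1, rho], [rho, 1]])."""
    lam_diag = 1.0 / (1.0 - rho ** 2)          # diagonal entries of the precision matrix
    lam_off = -rho / (1.0 - rho ** 2)          # off-diagonal entry of the precision matrix
    m1, m2 = 1.0, -1.0                         # arbitrary initial factor means
    for _ in range(n_iters):
        m1 = -(lam_off / lam_diag) * m2        # update q_1 holding q_2 fixed
        m2 = -(lam_off / lam_diag) * m1        # update q_2 holding q_1 fixed
    factor_var = 1.0 / lam_diag                # variance of each factor
    return (m1, factor_var), (m2, factor_var)

(m1, v1), (m2, v2) = mean_field_gaussian(rho=0.9)
print(m1, v1, m2, v2)  # true marginal variances are 1.0
```

The factor means converge to the true means, but each factor's variance is 1 - rho², well below the true marginal variance of 1: breaking the dependency makes the approximation overconfident.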

---

### 6.3 Expectation Propagation (EP)

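- Instead of minimizing one global KL(q ‖ p), match moments factor by factor (locally minimizing the reverse direction, KL(p ‖ q))  
- Refine the local approximations iteratively  
- Often more accurate, but convergence is not guaranteed  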

---

## 7. Discretization & Relaxation  
Turn continuous into computable

### 7.1 Grid-Based Approximation

- Discretize space  
- Approximate integrals with sums  
- Curse of dimensionality applies  
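
A short sketch of both halves of the story, assuming a midpoint-rule grid and a standard normal density as the integrand. The grid resolution of 200 cells per axis is an arbitrary choice.

```python
import math

def normal_pdf(x):
    """Density of the standard normal distribution."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def grid_integral_1d(f, lo, hi, k):
    """Midpoint-rule approximation of the integral of f over [lo, hi] with k cells."""
    width = (hi - lo) / k
    return sum(f(lo + (i + 0.5) * width) for i in range(k)) * width

mass = grid_integral_1d(normal_pdf, -6.0, 6.0, 200)   # should be very close to 1
# Keeping the same resolution per axis, the number of cells grows as k ** d
cells_needed = {d: 200 ** d for d in (1, 2, 5, 10)}
print(mass, cells_needed)
```

In one dimension 200 evaluations suffice; at the same resolution in ten dimensions the grid has 200^10 cells, which is why grids are limited to a handful of dimensions.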

---

### 7.2 Relaxation of Constraints

- Replace hard constraints with soft penalties  
- Common in optimization-based probabilistic models  

---

## 8. Asymptotic Probability  
Let infinity do the work

### 8.1 Law of Large Numbers

Sample averages → deterministic limits

### 8.2 Central Limit Theorem

Normalized sums of many independent, finite-variance terms → Gaussian

### 8.3 Large Deviations Theory

Rare events become exponentially structured

This justifies:

- Gaussian approximations  
- confidence intervals  
- error bars  
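
A quick simulation of why Gaussian error bars work, assuming means of 30 Uniform(0, 1) draws as the statistic. If the standardized mean is close to Gaussian, about 95% of replicates should fall within 1.96 standard errors. Sample sizes and the seed are arbitrary.

```python
import math
import random

def standardized_mean(n, rng):
    """Standardize the mean of n Uniform(0, 1) draws (mean 1/2, variance 1/12)."""
    mean = sum(rng.random() for _ in range(n)) / n
    return (mean - 0.5) / math.sqrt(1.0 / (12.0 * n))

rng = random.Random(0)
z_values = [standardized_mean(30, rng) for _ in range(20_000)]
coverage = sum(abs(z) < 1.96 for z in z_values) / len(z_values)
print(coverage)  # near 0.95 if the Gaussian approximation is good
```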

---

## 9. Functional Approximation  
Approximate probability indirectly

### 9.1 Energy-Based Approximation

Model:

$$
p(x) = \frac{e^{-E(x)}}{Z}
$$

Approximate:

- gradients  
- energy differences  

not  
$$
Z
$$

---

### 9.2 Score Approximation

Learn:

$$
\nabla_x \log p(x)
$$

Instead of  
$$
p(x)
$$

Used in:

- score matching  
- diffusion models  
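
The reason the score is attractive is that the normalization constant drops out when taking the gradient of the log. The sketch below illustrates this with an unnormalized Gaussian whose overall scale is varied; the parameter `scale` is a hypothetical stand-in for 1/Z, and the gradient is taken by finite differences.

```python
import math

def log_unnormalized(x, mu=1.0, sigma=2.0, scale=1.0):
    """log of scale * exp(-(x - mu)^2 / (2 sigma^2)); `scale` plays the role of 1/Z."""
    return math.log(scale) - (x - mu) ** 2 / (2.0 * sigma ** 2)

def numerical_score(log_density, x, h=1e-5):
    """Central finite-difference estimate of d/dx log p(x)."""
    return (log_density(x + h) - log_density(x - h)) / (2.0 * h)

x = 3.0
score_a = numerical_score(lambda t: log_unnormalized(t, scale=1.0), x)
score_b = numerical_score(lambda t: log_unnormalized(t, scale=1e-6), x)
analytic = -(x - 1.0) / 2.0 ** 2               # -(x - mu) / sigma^2
print(score_a, score_b, analytic)
```

Changing the scale by six orders of magnitude leaves the score unchanged, so a model of the score never needs to know the normalization constant.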

---

## 10. Dynamic Approximation  
Replace static probability with evolution

### 10.1 Langevin Dynamics

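$$
x_{t+1}
=
x_t
+
\epsilon \nabla_x \log p(x_t)
+
\sqrt{2\epsilon}\,\xi_t,
\qquad
\xi_t \sim \mathcal{N}(0, I)
$$

Approximates sampling via stochastic motion: for a small step size, the iterates are approximately distributed according to p.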
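
A minimal sketch of the update rule, assuming a standard normal target so that the score is simply -x. Step size, chain length, burn-in, and seed are illustrative choices, and no Metropolis correction is applied.

```python
import math
import random

def langevin_chain(score, x0, step, n_steps, rng):
    """Unadjusted Langevin iteration: gradient drift plus Gaussian noise."""
    x = x0
    samples = []
    for _ in range(n_steps):
        x = x + step * score(x) + math.sqrt(2.0 * step) * rng.gauss(0.0, 1.0)
        samples.append(x)
    return samples

rng = random.Random(0)
samples = langevin_chain(lambda x: -x, x0=5.0, step=0.05, n_steps=200_000, rng=rng)
kept = samples[2_000:]                          # discard the transient from x0
mean = sum(kept) / len(kept)
var = sum((x - mean) ** 2 for x in kept) / len(kept)
print(mean, var)
```

Because the step size is finite, the chain's stationary variance is slightly above 1 (for this target it is 1 / (1 - step / 2)); shrinking the step reduces the bias at the cost of slower mixing.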

---

### 10.2 Diffusion Processes

- Gradually destroy structure  
- Then reverse it  

This approximates:

- complex distributions  
- without ever evaluating likelihoods  

---

## 11. Learning-Based Approximation  
Let neural networks approximate probability

### 11.1 Autoregressive Models

Exact factorization, approximate conditionals

### 11.2 Normalizing Flows

Exact likelihood, approximate transformations

### 11.3 Implicit Models (GANs)

Approximate samples, no density

---

## 12. Approximation Tradeoffs (The Triangle)

Every approximation buys a desirable property by paying a cost, and no method gets all three properties for free:

| Property | Cost |
|--------|------|
| Accuracy | Computation |
| Tractability | Bias |
| Speed | Variance |

No free lunch.

---

## 13. Deep Unifying Insight

Exact probability is a mathematical object.  
Approximate probability is an epistemic stance.

Each approximation answers:

- What do I care about preserving?  
  - mass?  
  - modes?  
  - moments?  
  - samples?  

Different methods preserve different truths.

---

## 14. Final Synthesis

Approximation turns probability from an object you cannot compute into a belief you can act upon.

It does so by:

- collapsing distributions,  
- smoothing geometry,  
- averaging randomness,  
- or bounding uncertainty.
