# Factorization of Probability  


## 1. Why Factorization Exists at All

At its core, probability becomes intractable because of combinatorial explosion.

For a joint distribution:

$$
p(x_1,x_2,\dots,x_n)
$$

If each variable has $k$ states, then the joint space contains:

$$
k^n
$$

possibilities.

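Storing the table, marginalizing over it, and running inference on it all cost time and memory exponential in $n$. With $k=2$ and $n=100$, that is already about $10^{30}$ entries.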

Factorization is the act of rewriting a joint probability into smaller, structured components such that:

- computation becomes polynomial or linear  
- inference becomes local  
- learning becomes feasible  

Factorization is not just a modeling trick; it is a survival mechanism.

---

## 2. The Fundamental Identity (Chain Rule)

Everything starts here:

$$
p(x_1,\dots,x_n)
=
\prod_{i=1}^{n} p(x_i \mid x_1,\dots,x_{i-1})
$$

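This identity is:

- exact
- true for every joint distribution and every ordering of the variables
- still intractable

It is intractable because each conditional depends on everything before it: the last factor, $p(x_n \mid x_1,\dots,x_{n-1})$, alone needs a table with $k^{n-1}$ rows.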
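
As a sanity check, the identity can be verified numerically on a small table. The sketch below uses a made-up joint over three binary variables (the weights are arbitrary assumptions) and rebuilds each joint probability from prefix marginals.

```python
from itertools import product

# Hypothetical joint table over three binary variables (weights are arbitrary).
weights = {xs: 1 + xs[0] + 2 * xs[1] * xs[2] + 3 * xs[0] * xs[2]
           for xs in product((0, 1), repeat=3)}
total = sum(weights.values())
joint = {xs: w / total for xs, w in weights.items()}

def marginal(prefix):
    """p(x_1..x_m = prefix), obtained by summing out the remaining variables."""
    m = len(prefix)
    return sum(p for xs, p in joint.items() if xs[:m] == prefix)

def chain_rule(xs):
    """Product of p(x_i | x_1..x_{i-1}), each built from two prefix marginals."""
    prob = 1.0
    for i in range(len(xs)):
        prob *= marginal(xs[:i + 1]) / marginal(xs[:i])
    return prob

# Largest difference between the table entry and its chain-rule reconstruction.
max_gap = max(abs(chain_rule(xs) - joint[xs]) for xs in joint)
```

The reconstruction is exact up to floating-point rounding, but note that `marginal` still sums over the full table: the identity reorganizes the joint without making it any cheaper.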

So the real question is:

How can we simplify the conditionals?

---

## 3. Independence as the First Factorization

### Full Independence

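Assume the variables are mutually independent, i.e. each one is independent of all the others jointly (pairwise independence, $x_i \perp x_j$ for all $i \ne j$, is not enough):

$$
x_i \perp \{x_j : j \ne i\} \quad \forall\, i
$$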

Then:

$$
p(x_1,\dots,x_n)=\prod_i p(x_i)
$$

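- maximal simplification: $n(k-1)$ parameters instead of $k^n-1$
- almost never true in reality

Used in:

- Naive Bayes (where features are assumed independent given the class label)
- Classical probability tables
- Early statistics

This is the bluntest possible factorization: it buys tractability by discarding every dependency.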
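
One caveat is worth checking numerically: the product form needs mutual independence, and pairwise independence alone does not deliver it. The sketch below uses the standard XOR counterexample (two fair coins and their exclusive-or); the helper `prob` is a hypothetical name introduced for illustration.

```python
from itertools import product

# x1 and x2 are fair coins and x3 = x1 XOR x2.
joint = {(a, b, a ^ b): 0.25 for a, b in product((0, 1), repeat=2)}

def prob(assignment):
    """Probability that the variables at the given indices take the given values."""
    return sum(p for xs, p in joint.items()
               if all(xs[i] == v for i, v in assignment.items()))

# Every pair of variables is independent.
pairwise_ok = all(
    abs(prob({i: a, j: b}) - prob({i: a}) * prob({j: b})) < 1e-12
    for i, j in ((0, 1), (0, 2), (1, 2))
    for a, b in product((0, 1), repeat=2)
)

# Yet the triple (1, 1, 1) is impossible, while the product of marginals is 1/8.
product_of_marginals = prob({0: 1}) * prob({1: 1}) * prob({2: 1})
true_joint = joint.get((1, 1, 1), 0.0)
```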

---

## 4. Conditional Independence: The Real Power

Instead of saying:

“Nothing depends on anything”

We say:

“Things depend only on what matters”

Formally:

$$
x_i \perp x_j \mid z
$$

Meaning: once $z$ is known, $x_i$ carries no further information about $x_j$, i.e. $p(x_i, x_j \mid z) = p(x_i \mid z)\,p(x_j \mid z)$.

Applied to each factor of the chain rule, this lets $x_i$ drop every predecessor except a small set of parents:

$$
p(x_1,\dots,x_n)=\prod_i p(x_i \mid \mathrm{parents}(x_i))
$$

This is the birth of graphical models.
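
A minimal sketch of what conditional independence looks like in numbers, assuming a hypothetical common-cause model $z \to x_1$, $z \to x_2$ with made-up probabilities: the pair factorizes once $z$ is fixed, but is dependent when $z$ is summed out.

```python
from itertools import product

# Hypothetical common-cause model; all numbers are made up for illustration.
p_z = {0: 0.5, 1: 0.5}
p_x1_given_z = {0: 0.1, 1: 0.8}   # P(x1 = 1 | z)
p_x2_given_z = {0: 0.2, 1: 0.9}   # P(x2 = 1 | z)

def bern(p_one, x):
    return p_one if x == 1 else 1.0 - p_one

# Joint over (z, x1, x2) built from the local factors.
joint = {(z, x1, x2): p_z[z] * bern(p_x1_given_z[z], x1) * bern(p_x2_given_z[z], x2)
         for z, x1, x2 in product((0, 1), repeat=3)}

def prob(assignment):
    """Probability that the variables at the given indices take the given values."""
    return sum(p for xs, p in joint.items()
               if all(xs[i] == v for i, v in assignment.items()))

# Given z: p(x1, x2 | z) equals p(x1 | z) * p(x2 | z) for every configuration.
cond_gap = max(
    abs(prob({0: z, 1: a, 2: b}) / prob({0: z})
        - (prob({0: z, 1: a}) / prob({0: z})) * (prob({0: z, 2: b}) / prob({0: z})))
    for z, a, b in product((0, 1), repeat=3)
)

# Without z: p(x1=1, x2=1) differs from p(x1=1) * p(x2=1).
marg_gap = prob({1: 1, 2: 1}) - prob({1: 1}) * prob({2: 1})
```

With these numbers `cond_gap` is zero up to rounding, while `marg_gap` is clearly positive: the correlation between $x_1$ and $x_2$ is entirely explained by $z$.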

---

## 5. Directed Factorization (Bayesian Networks)

If variables form a DAG:

$$
p(x_1,\dots,x_n)=\prod_i p(x_i \mid \mathrm{Pa}(x_i))
$$

### Key Properties

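- encodes generative structure (a causal reading requires additional assumptions)
- each factor is local
- global joint reconstructed exactly

### Consequences

- storage: $O(k^n)$ → $O(n\,k^{m+1})$ when every node has at most $m$ parents, i.e. linear in $n$
- inference: global summation → message passing (exact and efficient on trees, NP-hard in general)
- learning: with fully observed data, the likelihood splits into one local term per node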
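
The storage consequence can be made concrete by counting free parameters. This is a sketch under the assumption of tabular conditionals with $k$ states per variable; the function names and the two example structures are hypothetical.

```python
def full_table_params(n, k):
    """Free parameters in an unrestricted joint over n variables with k states each."""
    return k ** n - 1

def bayes_net_params(num_parents, k):
    """Free parameters when node i has num_parents[i] parents: sum of (k-1) * k^|Pa|."""
    return sum((k - 1) * k ** m for m in num_parents)

n, k = 20, 2
chain = [0] + [1] * (n - 1)   # hypothetical chain x1 -> x2 -> ... -> xn
dense = list(range(n))        # fully connected DAG: node i has i parents

full = full_table_params(n, k)            # 2**20 - 1
chain_count = bayes_net_params(chain, k)  # grows linearly in n
dense_count = bayes_net_params(dense, k)  # no savings without sparsity
```

The fully connected DAG is just the chain rule again, so it needs exactly as many parameters as the full table. The savings come from sparse parent sets, not from the directed form itself.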

This is the most important factorization in probabilistic reasoning.

---

## 6. Undirected Factorization (Markov Random Fields)

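Instead of conditionals, we use non-negative potentials over the cliques $C$ of an undirected graph:

$$
p(x)=\frac{1}{Z}\prod_{c\in C}\psi_c(x_c)
$$

Where:

- $\psi_c \ge 0$: compatibility functions, which are not probabilities and need not normalize
- $Z=\sum_x \prod_{c\in C}\psi_c(x_c)$: partition function, a sum over all $k^n$ joint states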

This trades:

- easy local structure  

for:

- hard global normalization  

Used when:

- symmetry matters  
- no natural direction  
- constraints dominate  
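
To illustrate the trade, here is a minimal sketch of a pairwise MRF on a 3-cycle of binary variables. The agreement potential and the coupling value are assumptions chosen for illustration. The local factors are trivial to write down, but $Z$ is computed by brute force over every joint state.

```python
from itertools import product
from math import exp

# Hypothetical pairwise MRF: three binary variables on a cycle.
edges = [(0, 1), (1, 2), (0, 2)]
coupling = 1.0   # arbitrary strength of the preference for agreement

def psi(a, b):
    """Compatibility function: larger when the two endpoints agree."""
    return exp(coupling) if a == b else 1.0

def unnormalized(xs):
    """Product of the potentials over all edges."""
    score = 1.0
    for i, j in edges:
        score *= psi(xs[i], xs[j])
    return score

states = list(product((0, 1), repeat=3))
Z = sum(unnormalized(xs) for xs in states)        # brute force: 2**3 terms
p = {xs: unnormalized(xs) / Z for xs in states}   # normalized distribution
```

The sum defining `Z` has $k^n$ terms, which is exactly the cost that the local structure does not remove.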

---

## 7. Factor Graphs: Explicit Factorization Objects

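Instead of a graph over variables alone, we use a bipartite graph with variable nodes and factor nodes, where each factor $f$ touches only its own subset $x_f$:

$$
p(x)=\frac{1}{Z}\prod_f f(x_f)
$$

Here $Z=1$ when the factors are already normalized conditionals, as in a Bayesian network.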

This makes factorization:

- explicit  
- modular  
- algorithm-friendly  

Belief propagation operates directly on this structure.

---

## 8. Exchangeability and Symmetry Factorization

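When order does not matter, i.e. the joint is invariant under every permutation of the variables (exchangeability), there is a mixing measure $\pi$ over a parameter $\theta$ such that:

$$
p(x_1,\dots,x_n)
=
\int \prod_i p(x_i\mid \theta)\, d\pi(\theta)
$$

This is de Finetti’s theorem. It holds for infinitely exchangeable sequences; for a finite exchangeable set the representation can fail.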

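Meaning:

- all dependence among the $x_i$ is carried by a latent parameter $\theta$; given $\theta$, they are i.i.d.

This justifies:

- Bayesian models (a prior over $\theta$ plus an i.i.d. likelihood)
- hierarchical models
- much of the modern foundation of probabilistic learning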
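
A small sketch of the mixture representation, assuming a Beta prior on a Bernoulli parameter (the hyperparameters `a=2`, `b=3` are arbitrary). The integral has a closed form in terms of Beta functions, so the sequence probability depends only on the counts, not on the order.

```python
from itertools import permutations
from math import exp, lgamma

def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def seq_prob(xs, a=2.0, b=3.0):
    """p(x_1..x_n): integral of prod_i theta^x_i (1-theta)^(1-x_i) under Beta(a, b)."""
    ones = sum(xs)
    zeros = len(xs) - ones
    return exp(log_beta(a + ones, b + zeros) - log_beta(a, b))

# Exchangeability: every reordering of the sequence has the same probability.
xs = (1, 1, 0, 1, 0)
values = [seq_prob(perm) for perm in permutations(xs)]
spread = max(values) - min(values)

# Not independent: seeing a 1 raises the probability of the next 1.
p_one = seq_prob((1,))
predictive = seq_prob((1, 1)) / p_one   # p(x_2 = 1 | x_1 = 1)
```

The variables are exchangeable but not independent: the dependence flows entirely through the shared, unobserved $\theta$.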

---

## 9. Temporal Factorization

Time introduces structure:

$$
p(x_{1:T})
=
p(x_1)\prod_{t=2}^{T} p(x_t \mid x_{t-1})
$$

This relies on:

- the first-order Markov assumption, $x_t \perp x_{1:t-2} \mid x_{t-1}$
- a chain structure that dynamic programming can exploit

Extensions:

- higher-order Markov  
- hidden states  
- state-space models  
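
To show why the chain structure is dynamic-programming friendly, here is a sketch for the hidden-state extension: a hypothetical 2-state hidden Markov model with made-up parameters. The forward recursion computes the likelihood of an observation sequence in $O(TK^2)$ time, and a brute-force sum over all $K^T$ hidden paths confirms the result.

```python
from itertools import product

# Hypothetical 2-state HMM; all numbers are made up for illustration.
init = [0.6, 0.4]                    # p(z_1)
trans = [[0.7, 0.3], [0.2, 0.8]]     # trans[i][j] = p(z_t = j | z_{t-1} = i)
emit = [[0.9, 0.1], [0.3, 0.7]]      # emit[i][o] = p(x_t = o | z_t = i)
obs = [0, 1, 1, 0, 1]

def forward_likelihood(obs):
    """p(x_{1:T}) by the forward recursion: O(T * K^2) work."""
    alpha = [init[i] * emit[i][obs[0]] for i in range(2)]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * trans[i][j] for i in range(2)) * emit[j][o]
                 for j in range(2)]
    return sum(alpha)

def brute_force_likelihood(obs):
    """Same quantity by summing over all K^T hidden paths."""
    total = 0.0
    for path in product(range(2), repeat=len(obs)):
        p = init[path[0]] * emit[path[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= trans[path[t - 1]][path[t]] * emit[path[t]][obs[t]]
        total += p
    return total

fast = forward_likelihood(obs)
slow = brute_force_likelihood(obs)
```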

---

## 10. Spatial and Locality-Based Factorization

Used in:

- physics  
- vision  
- grids  

Assumption: variables interact only with their neighbors, so the product runs over adjacent pairs $\langle i,j\rangle$:

$$
p(x)\propto \prod_{\langle i,j\rangle}\psi(x_i,x_j)
$$

This is the structure behind:

- Ising models
- image priors
- CNN-inspired probabilistic models
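
A practical payoff of locality is that the conditional of one variable given all the others depends only on its neighbors. The sketch below checks this on a hypothetical 2x3 Ising grid with spins in $\{-1,+1\}$ and potentials $\psi(x_i,x_j)=\exp(J x_i x_j)$; the grid size and coupling `J` are assumptions.

```python
from itertools import product
from math import exp

# Hypothetical 2x3 grid of spins with nearest-neighbour coupling J.
rows, cols, J = 2, 3, 0.5
sites = [(r, c) for r in range(rows) for c in range(cols)]
edges = [((r, c), (r, c + 1)) for r in range(rows) for c in range(cols - 1)] \
      + [((r, c), (r + 1, c)) for r in range(rows - 1) for c in range(cols)]

def weight(x):
    """Unnormalized probability: product of pairwise potentials exp(J * x_i * x_j)."""
    return exp(J * sum(x[i] * x[j] for i, j in edges))

def conditional_from_joint(site, x):
    """p(x_site = +1 | all other spins), using the full unnormalized joint."""
    up, down = dict(x), dict(x)
    up[site], down[site] = 1, -1
    return weight(up) / (weight(up) + weight(down))

def conditional_from_neighbours(site, x):
    """Same conditional using only the spins adjacent to `site`."""
    field = sum(x[j] if i == site else x[i]
                for i, j in edges if site in (i, j))
    return 1.0 / (1.0 + exp(-2.0 * J * field))

# Compare the two computations for every configuration and every site.
max_gap = 0.0
for values in product((-1, 1), repeat=len(sites)):
    x = dict(zip(sites, values))
    for site in sites:
        gap = abs(conditional_from_joint(site, x) - conditional_from_neighbours(site, x))
        max_gap = max(max_gap, gap)
```

This local conditional is what makes samplers such as Gibbs sampling cheap on grids: each update touches a handful of neighbors and never needs $Z$.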

---

## 11. Autoregressive Factorization

Instead of assuming independence, we fix an ordering and keep every term of the chain rule:

$$
p(x)=\prod_i p(x_i \mid x_{<i})
$$

Properties:

- exact likelihood  
- no independence assumption  
- still tractable when all conditionals share one compact parametric model (e.g. a neural network) instead of separate tables

Used in:

- language models  
- pixel models  
- sequence generators  
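
A minimal sketch of an autoregressive model over binary sequences, assuming a hypothetical logistic conditional that depends on the prefix only through its sum (the `bias` and `weight` values are arbitrary). Two parameters define a valid distribution over all $2^n$ sequences, and the log-likelihood is an exact sum of log-conditionals.

```python
from itertools import product
from math import exp, log

def p_one(prefix, bias=-0.5, weight=0.8):
    """Hypothetical conditional p(x_i = 1 | x_<i): logistic in the prefix sum."""
    return 1.0 / (1.0 + exp(-(bias + weight * sum(prefix))))

def log_likelihood(xs):
    """Exact log p(x) as a sum of log-conditionals, one per position."""
    total = 0.0
    for i, x in enumerate(xs):
        p = p_one(xs[:i])
        total += log(p if x == 1 else 1.0 - p)
    return total

# The conditionals are individually normalized, so the joint sums to one
# without any partition function.
n = 8
total_mass = sum(exp(log_likelihood(xs)) for xs in product((0, 1), repeat=n))
```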

---

## 12. Latent Variable Factorization

Introduce hidden variables $z$ and marginalize them out (the integral becomes a sum when $z$ is discrete):

$$
p(x)=\int p(x\mid z)p(z)\,dz
$$

Why this helps:

- high-dimensional dependencies among the components of $x$ become conditional independence given $z$
- explains correlation via shared cause  

Examples:

- mixture models  
- VAEs  
- topic models  

---

## 13. Mean-Field Factorization (Approximate)

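When no tractable exact factorization exists, we approximate the target (usually a posterior) with a fully factorized $q$, typically chosen to minimize $\mathrm{KL}(q\,\|\,p)$: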

$$
p(x)\approx \prod_i q_i(x_i)
$$

This is not a property of reality, but a computational compromise.

Used in:

- variational inference  
- physics  
- large-scale Bayesian systems  
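
A sketch of the compromise in its smallest form: a correlated target over two binary variables (the table is made up) approximated by a product $q_1 q_2$ using coordinate updates that each minimize $\mathrm{KL}(q\,\|\,p)$ in one factor. The update rule used here is the standard mean-field one, $\log q_1(a) = \mathbb{E}_{q_2}[\log p(a,b)] + \text{const}$.

```python
from math import exp, log

# Hypothetical correlated target over two binary variables.
p = {(0, 0): 0.5, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.3}

def normalize(ws):
    s = sum(ws)
    return [w / s for w in ws]

def kl(q1, q2):
    """KL(q || p) for the product distribution q(a, b) = q1[a] * q2[b]."""
    return sum(q1[a] * q2[b] * log(q1[a] * q2[b] / p[(a, b)])
               for a in (0, 1) for b in (0, 1))

q1, q2 = [0.5, 0.5], [0.5, 0.5]
history = [kl(q1, q2)]
for _ in range(20):
    # Update q1 with q2 held fixed, then q2 with the new q1 held fixed.
    q1 = normalize([exp(sum(q2[b] * log(p[(a, b)]) for b in (0, 1))) for a in (0, 1)])
    q2 = normalize([exp(sum(q1[a] * log(p[(a, b)]) for a in (0, 1))) for b in (0, 1)])
    history.append(kl(q1, q2))
```

The KL divergence never increases across updates, but it cannot reach zero here because the target is not a product distribution. That residual gap is the price of the approximation.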

---

## 14. Factorization via Dynamics

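Instead of factoring a distribution over states, we factor the path that leads to the final state into one-step transitions:

$$
p(x_{1:T} \mid x_0)=\prod_{t=0}^{T-1} p(x_{t+1}\mid x_t)
$$

The endpoint distribution follows by integrating out the intermediate states:

$$
p(x_T \mid x_0)=\int \prod_{t=0}^{T-1} p(x_{t+1}\mid x_t)\, dx_{1:T-1}
$$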

This converts:

- static complexity → sequential simplicity  

Foundation of:

- diffusion models  
- Langevin dynamics  
- score-based models  
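
A small sketch of how one-step transitions compose, assuming a hypothetical 3-state transition matrix. The product of transitions gives the probability of a whole path; the endpoint distribution $p(x_T \mid x_0)$ is obtained by summing that product over the intermediate states, which the step-by-step recursion does without enumerating paths.

```python
from itertools import product

# Hypothetical transition matrix: P[i][j] = p(x_{t+1} = j | x_t = i).
P = [[0.8, 0.2, 0.0],
     [0.1, 0.7, 0.2],
     [0.0, 0.3, 0.7]]
K, T, start = 3, 4, 0

def endpoint_by_recursion(start, T):
    """p(x_T | x_0 = start): push the distribution through one transition at a time."""
    dist = [1.0 if i == start else 0.0 for i in range(K)]
    for _ in range(T):
        dist = [sum(dist[i] * P[i][j] for i in range(K)) for j in range(K)]
    return dist

def endpoint_by_paths(start, T):
    """Same distribution by summing the product of transitions over every path x_1..x_T."""
    dist = [0.0] * K
    for path in product(range(K), repeat=T):
        prob, prev = 1.0, start
        for state in path:
            prob *= P[prev][state]
            prev = state
        dist[path[-1]] += prob
    return dist

fast = endpoint_by_recursion(start, T)
slow = endpoint_by_paths(start, T)
gap = max(abs(a - b) for a, b in zip(fast, slow))
```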

---

## 15. Functional Factorization

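Instead of probabilities, factor log-probabilities, energies, or scores:

$$
\log p(x)=\sum_i \phi_i(x) - \log Z
$$

Or, taking the gradient, which removes the constant $\log Z$:

$$
\nabla_x \log p(x)=\sum_i \nabla_x \phi_i(x)=\sum_i s_i(x)
$$

The score form avoids normalization entirely.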
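
The following sketch checks the score identity on a case where the normalizer is known. It assumes two hypothetical one-dimensional Gaussian factors $\phi_i(x) = -(x-m_i)^2/(2 s_i^2)$, whose product is again Gaussian, and compares the sum of per-factor scores with a finite-difference derivative of the normalized log-density.

```python
from math import log, pi

# Two hypothetical Gaussian factors given as (mean, std); values are arbitrary.
experts = [(0.0, 1.0), (2.0, 0.5)]

def score_sum(x):
    """Sum of per-factor scores d(phi_i)/dx; no normalizer is involved."""
    return sum(-(x - m) / s ** 2 for m, s in experts)

# For Gaussian factors the normalized product has a closed form.
precision = sum(1.0 / s ** 2 for _, s in experts)
mean = sum(m / s ** 2 for m, s in experts) / precision

def log_p(x):
    """Normalized log-density of the product of the two factors."""
    return -0.5 * precision * (x - mean) ** 2 + 0.5 * log(precision / (2 * pi))

def finite_diff_score(x, h=1e-5):
    """Central-difference estimate of d/dx log p(x)."""
    return (log_p(x + h) - log_p(x - h)) / (2 * h)

gaps = [abs(score_sum(x) - finite_diff_score(x)) for x in (-1.0, 0.5, 3.0)]
```

`score_sum` never touches `precision`, `mean`, or the log-normalizer, yet it matches the gradient of the normalized density, because $\log Z$ does not depend on $x$.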

---

## 16. What Factorization Really Does

Across all forms, factorization:

- localizes computation  
- turns one global sum or integral into many small local ones
- exposes conditional structure  
- enables reuse of information  
- makes learning modular  

Probability is intractable because it is global.  
Factorization makes it local.

---

## 17. The Deep Meta-Truth

All probabilistic models are arguments about which dependencies matter and which can be ignored.

Factorization is:

- epistemology (what we assume)  
- computation (what we can compute)  
- modeling (what structure we impose)  

---

## Final One-Line Synthesis

Factorization is the art of replacing an exponential joint belief with a network of small, reusable, local beliefs that reconstruct, exactly or approximately, the same global uncertainty.
