### 1) The original difficulty: an open probabilistic space

If one tries to model the true data distribution

$$
p(x),
$$

where \( x \) may be an image, text, audio, or video sample, the problem quickly becomes intractable. The space is extremely high-dimensional, the integrals needed for normalization and marginalization have no closed form and are too costly to approximate numerically, the normalization constant is therefore usually unknown, and the distribution can be neither enumerated exhaustively nor fitted directly as a normalized density. In this sense, direct modeling of the full distribution is beyond what brute-force computation (or human reasoning) can handle.
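To give a rough sense of scale, the sketch below counts how many distinct samples exist for one small discrete data type. The 32×32 RGB image with 8 bits per channel is a hypothetical example chosen for illustration, and the helper name `log10_num_states` is not from any library.

```python
import math

def log10_num_states(num_values, num_dims):
    """log10 of the number of distinct samples when each of
    num_dims coordinates can take num_values different values."""
    return num_dims * math.log10(num_values)

# Hypothetical example: a 32x32 RGB image with 8 bits per channel.
dims = 32 * 32 * 3
digits = log10_num_states(256, dims)
print(f"{dims} dimensions, about 10^{digits:.0f} possible images")
```

Even at this tiny resolution, summing or integrating over every possible sample to obtain a normalization constant is out of the question.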

---

### 2) The conceptual shift

Rather than asking, “What is the probability of the entire sample?”, modern generative modeling reframes the goal as:

“What local, mathematically stable quantity can be learned reliably?”

This reframing motivates a family of “noble hacks”—not an escape from probability, but a redesign of what is modeled so the task becomes feasible.

---

### 3) Autoregressive generation (e.g., GPT): atomizing the distribution into local steps

Autoregressive models rewrite the joint distribution through the chain rule:

$$
p(x_1, x_2, \dots, x_T)
=
\prod_{t=1}^{T} p(x_t \mid x_{<t}).
$$

The core idea is not to model the entire space at once, but to predict one token at a time. This is effective because each conditional is a softmax over a finite vocabulary, so normalization is a local sum over vocabulary entries rather than an integral over all sequences; maximum-likelihood training reduces to a per-token cross-entropy loss; and numerical stability is easier to maintain.

The model is not “understanding language” in a human sense; it excels at answering a local question repeatedly:

“Which token is most plausible next?”

Coherent global behavior then emerges from accumulating these local predictions, each conditioned on everything generated so far.
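The chain-rule factorization can be made concrete with a toy model. The following is a minimal sketch, not how any real language model is implemented: the vocabulary and logits are invented, and the conditional looks only at the previous token (a bigram table) rather than the full prefix \( x_{<t} \), purely to keep the example small.

```python
import math
import random

# Hypothetical toy vocabulary and hand-picked logits (illustrative only).
VOCAB = ["<s>", "a", "b", "</s>"]

# LOGITS[prev][i]: unnormalized score of VOCAB[i] given the previous token.
# Assumption: a bigram table stands in for a model conditioned on the full prefix.
LOGITS = {
    "<s>": [-10.0, 1.0, 0.5, -1.0],
    "a":   [-10.0, 0.2, 1.5, 0.0],
    "b":   [-10.0, 1.0, 0.1, 0.8],
}

def softmax(logits):
    """Normalize scores over the finite vocabulary (max-subtracted for stability)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_probs(prev):
    """Local conditional distribution p(next | prev) as a dict over VOCAB."""
    return dict(zip(VOCAB, softmax(LOGITS[prev])))

def sequence_log_prob(tokens):
    """Chain rule: log p(x_1..x_T) = sum over t of log p(x_t | previous tokens)."""
    log_p, prev = 0.0, "<s>"
    for tok in tokens:
        log_p += math.log(next_token_probs(prev)[tok])
        prev = tok
    return log_p

def sample(rng, max_len=20):
    """Generate by repeatedly sampling from the local next-token distribution."""
    prev, out = "<s>", []
    for _ in range(max_len):
        probs = next_token_probs(prev)
        tok = rng.choices(VOCAB, weights=[probs[v] for v in VOCAB])[0]
        if tok == "</s>":
            break
        out.append(tok)
        prev = tok
    return out

print(sequence_log_prob(["a", "b", "</s>"]))
print(sample(random.Random(0)))
```

Note that normalization only ever happens inside `softmax`, over four vocabulary entries; no sum over whole sequences is needed to score or to sample one.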

---

### 4) Attention (Q, K, V): not probability, but geometric aggregation of information

Self-attention is not a probabilistic model in the classical sense. Queries, keys, and values are learned linear projections of the token representations; the softmax here functions primarily as a weighting and normalization mechanism rather than a literal probability model of the world.

The output is a geometric aggregation of information: each token's output is a weighted average of value vectors, with non-negative weights that sum to one, effectively defining a context-dependent “influence field” over tokens. This helps explain why the mechanism stays numerically well behaved despite the model's complexity: every attention output is a convex combination of the value vectors, so much of the computation is controlled linear algebra with normalized mixing.
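The normalized mixing described above can be sketched as single-head scaled dot-product attention. This is a minimal illustration in plain Python: the matrices `Q`, `K`, and `V` are made-up numbers supplied directly, whereas in a real model they would come from learned projections, and there is no masking or multi-head logic.

```python
import math

def softmax(scores):
    """Turn scores into non-negative weights that sum to one."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attention(Q, K, V):
    """Single-head scaled dot-product attention over lists of vectors."""
    d_k = len(K[0])
    outputs, all_weights = [], []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [dot(q, k) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        # Output is a weighted average (convex combination) of value vectors.
        out = [sum(w_j * v[i] for w_j, v in zip(w, V)) for i in range(len(V[0]))]
        outputs.append(out)
        all_weights.append(w)
    return outputs, all_weights

# Hypothetical 3-token example with 2-dimensional vectors.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out, weights = attention(Q, K, V)

for w, o in zip(weights, out):
    print([round(p, 3) for p in w], [round(c, 3) for c in o])
```

Each row of `weights` sums to one, so every output coordinate stays between the smallest and largest corresponding value coordinate.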

---

### 5) GANs: replacing explicit likelihood with a game

Generative adversarial networks make the reframing even more explicit. There is no explicit likelihood, no direct \( p(x) \), and no normalization constant to compute. Instead, generation is learned through a min–max objective:

$$
\min_G \max_D
\;
\mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)]
+
\mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))].
$$

Here the generator \( G \) maps noise \( z \sim p_z \) to samples, and the discriminator \( D \) outputs the probability that its input is real. The distribution is learned implicitly: it is never written down or computed explicitly.
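To see what the objective measures, the sketch below estimates the value function by Monte Carlo for fixed, hand-written players instead of trained networks. Everything here is an assumption made for illustration: the "real" data are draws from \( \mathcal{N}(2, 1) \), the noise is \( \mathcal{N}(0, 1) \), and the names `G_bad`, `G_good`, `D_blind`, and `D_sharp` are hypothetical.

```python
import math
import random

def value_estimate(D, G, data_samples, noise_samples):
    """Monte Carlo estimate of E_x[log D(x)] + E_z[log(1 - D(G(z)))]."""
    real_term = sum(math.log(D(x)) for x in data_samples) / len(data_samples)
    fake_term = sum(math.log(1.0 - D(G(z))) for z in noise_samples) / len(noise_samples)
    return real_term + fake_term

rng = random.Random(0)
data = [rng.gauss(2.0, 1.0) for _ in range(5000)]   # assumed "real" data: N(2, 1)
noise = [rng.gauss(0.0, 1.0) for _ in range(5000)]  # latent noise z ~ N(0, 1)

def G_bad(z):
    """Generator whose samples follow N(0, 1): easy to tell from the data."""
    return z

def G_good(z):
    """Generator whose samples follow N(2, 1): matches the data distribution."""
    return z + 2.0

def D_blind(x):
    """Discriminator that cannot distinguish anything."""
    return 0.5

def D_sharp(x):
    """Logistic discriminator; equals p_data / (p_data + p_g) for N(2,1) vs N(0,1)."""
    return 1.0 / (1.0 + math.exp(-(2.0 * x - 2.0)))

for g in (G_bad, G_good):
    for d in (D_blind, D_sharp):
        print(g.__name__, d.__name__, round(value_estimate(d, g, data, noise), 3))
```

Against `G_bad`, the sharp discriminator scores higher than the blind one, which is the inner maximization at work. Against `G_good`, the blind discriminator's value of \( -\log 4 \) can no longer be beaten, so the discriminator has nothing left to exploit.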

---

### 6) Diffusion and score-based models: modeling a derivative instead of a function

One of the most striking modern inversions replaces density modeling with score modeling. Instead of learning \( p(x) \), these methods learn

$$
\nabla_x \log p(x).
$$

This is powerful because the score does not involve the normalization constant: writing \( p(x) = \tilde{p}(x)/Z \), we have \( \nabla_x \log p(x) = \nabla_x \log \tilde{p}(x) \), since \( Z \) does not depend on \( x \). The problem becomes learning a vector field, which can be easier than learning a normalized density, and sampling becomes the task of running a dynamical process (for example, Langevin dynamics or a stochastic differential equation) to produce data.
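Sampling from a score alone can be sketched with unadjusted Langevin dynamics. This is a minimal one-dimensional illustration under stated assumptions: the target is a Gaussian with made-up parameters `MU` and `SIGMA`, and its exact score is used in place of the learned network a real score-based model would provide.

```python
import math
import random

MU, SIGMA = 3.0, 0.5  # hypothetical target: 1-D Gaussian N(MU, SIGMA^2)

def score(x):
    """Exact score of N(MU, SIGMA^2): d/dx log p(x) = -(x - MU) / SIGMA^2.
    Assumption: stands in for a learned score network."""
    return -(x - MU) / SIGMA ** 2

def langevin_sample(rng, n_steps=500, step=0.01, x0=0.0):
    """Unadjusted Langevin update: x <- x + (step/2) * score(x) + sqrt(step) * noise."""
    x = x0
    for _ in range(n_steps):
        x = x + 0.5 * step * score(x) + math.sqrt(step) * rng.gauss(0.0, 1.0)
    return x

rng = random.Random(0)
samples = [langevin_sample(rng) for _ in range(2000)]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(f"sample mean {mean:.3f} (target {MU}), sample variance {var:.3f} (target {SIGMA**2})")
```

The sampler never evaluates \( p(x) \) or a normalization constant; it only follows the vector field plus injected noise. Because the step size is finite, the samples match the target only approximately.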

---

### 7) Energy-based and Boltzmann-style models: probability is present but the normalizer is intractable

Energy-based models define

$$
p(x) = \frac{e^{-E(x)}}{Z},
$$

where the partition function \( Z = \int e^{-E(x)} \, dx \) (a sum, for discrete \( x \)) is computationally prohibitive in high dimensions. Practical learning therefore relies on alternative strategies such as contrastive divergence, score matching, and noise-contrastive estimation, once again avoiding direct computation of the fully normalized distribution while still grounding the model in probabilistic principles.
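The cost of \( Z \) is easy to see on a small discrete model. The sketch below uses a hypothetical chain of \( \pm 1 \) spins with an invented coupling strength; it computes \( Z \) by brute force, which takes \( 2^n \) terms, and contrasts that with a probability ratio, which needs no \( Z \) at all.

```python
import itertools
import math

def energy(x, coupling=0.5):
    """Hypothetical chain energy over spins in {-1, +1}: neighbors prefer to agree."""
    return -coupling * sum(a * b for a, b in zip(x, x[1:]))

def partition_function(n):
    """Brute-force Z: sum of exp(-E(x)) over all 2^n configurations."""
    return sum(math.exp(-energy(x)) for x in itertools.product((-1, 1), repeat=n))

def prob(x):
    """Normalized probability; requires the full 2^n-term sum."""
    return math.exp(-energy(x)) / partition_function(len(x))

def prob_ratio(x, y):
    """p(x) / p(y) = exp(E(y) - E(x)); Z cancels and is never computed."""
    return math.exp(energy(y) - energy(x))

for n in (4, 8, 16):
    print(n, 2 ** n, round(partition_function(n), 3))

x, y = (1, 1, 1, 1), (1, -1, 1, -1)
print(prob_ratio(x, y), prob(x) / prob(y))
```

Each additional variable doubles the number of terms in the sum, while `prob_ratio` stays constant-cost. Methods that only compare configurations, or only use gradients of \( E \), exploit the same cancellation.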

---

### 8) Not an escape from probability, but a maturation of it

Modern generative modeling does not abandon probability; it grows more sophisticated about what must be modeled to make probabilistic learning workable. The implicit philosophy is:

Do not attempt to capture the world in one shot; decompose it into stable local operations, and let accumulation produce global structure and meaning.

---

### 9) The precise scientific framing

These models did not flee from probability. They redefined the modeling target (factorizations, games, vector fields, and unnormalized energies) so that generative learning becomes computationally viable while remaining deeply connected to probabilistic foundations.
