# Historical and Conceptual Lineage of Denoising to Diffusion Models

## 1. Denoising predates modern machine learning by decades

Denoising has been a central problem in signal processing and statistics since the mid-20th century. The core objective, removing noise while preserving structure, long predates modern deep learning.

### Classical roots (1940s–1970s)

**Wiener filtering (1949)**  
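The optimal *linear* estimator in the mean-squared-error sense (and the optimal estimator overall when signal and noise are jointly Gaussian). For an observation $z$ of a signal $x$, it minimizes over linear maps $L$:
$$
\hat{x} = \hat{L}z,\quad \hat{L} = \arg\min_{L} \ \mathbb{E}\big[\|x-Lz\|^2\big],
$$
with the solution expressible entirely via second-order statistics (covariances or power spectra).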
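
As a minimal sketch of the idea, consider the scalar case $z = x + n$ with zero-mean, uncorrelated signal and noise. The variances `signal_var` and `noise_var` and the helper name `linear_mmse_gain` are illustrative assumptions, not taken from a specific reference implementation. The MSE-optimal linear estimate is then a single shrinkage gain applied to the observation:

```python
import numpy as np

def linear_mmse_gain(signal_var, noise_var):
    """Gain w minimizing E[(x - w*z)^2] for z = x + n,
    with x and n zero-mean and uncorrelated."""
    return signal_var / (signal_var + noise_var)

rng = np.random.default_rng(0)
signal_var, noise_var = 4.0, 1.0  # assumed toy values

x = rng.normal(0.0, np.sqrt(signal_var), size=100_000)      # clean signal
z = x + rng.normal(0.0, np.sqrt(noise_var), size=x.shape)   # noisy observation

w = linear_mmse_gain(signal_var, noise_var)
x_hat = w * z  # linear shrinkage toward the prior mean (zero)

mse_noisy = np.mean((z - x) ** 2)       # error of using z directly
mse_wiener = np.mean((x_hat - x) ** 2)  # error after linear shrinkage
print(f"gain={w:.2f}  mse_noisy={mse_noisy:.3f}  mse_wiener={mse_wiener:.3f}")
```

Only the two variances enter the estimator, which is the sense in which the solution depends on second-order statistics alone.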

**Kalman filtering (1960)**  
Recursive Bayesian inference for dynamical systems:
$$
x_t = F x_{t-1} + \epsilon_t,\quad z_t = H x_t + \eta_t,
$$
where $\epsilon_t$ and $\eta_t$ are Gaussian process and observation noise, and the posterior over $x_t$ is updated sequentially as each observation $z_t$ arrives.
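
The recursion can be sketched for a scalar state as follows. The noise variances `Q` and `R`, the initial moments `m0` and `P0`, and all numeric values are hypothetical choices made for illustration only:

```python
import numpy as np

def kalman_filter_1d(z, F, H, Q, R, m0=0.0, P0=1.0):
    """Scalar Kalman filter for x_t = F x_{t-1} + eps_t, z_t = H x_t + eta_t.
    Q and R are the variances of eps_t and eta_t. Returns filtered means."""
    m, P = m0, P0
    means = []
    for z_t in z:
        # Predict: push the previous posterior through the dynamics.
        m_pred = F * m
        P_pred = F * P * F + Q
        # Update: correct the prediction with the new observation.
        K = P_pred * H / (H * P_pred * H + R)  # Kalman gain
        m = m_pred + K * (z_t - H * m_pred)
        P = (1.0 - K * H) * P_pred
        means.append(m)
    return np.array(means)

rng = np.random.default_rng(0)
F, H, Q, R = 0.95, 1.0, 0.1, 1.0  # assumed toy parameters
T = 2000

# Simulate the state-space model.
x = np.zeros(T)
for t in range(1, T):
    x[t] = F * x[t - 1] + rng.normal(0.0, np.sqrt(Q))
z = H * x + rng.normal(0.0, np.sqrt(R), size=T)

x_filt = kalman_filter_1d(z, F, H, Q, R)
mse_raw = np.mean((z - x) ** 2)        # error of raw observations
mse_filt = np.mean((x_filt - x) ** 2)  # error of filtered estimates
print(f"mse_raw={mse_raw:.3f}  mse_filt={mse_filt:.3f}")
```

Each step is itself a small denoising problem: the gain `K` plays the role of a Wiener-style shrinkage factor that is recomputed as uncertainty evolves.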

**Core point**: these methods already framed denoising as statistical inference.  
**Core limitation**: linearity and Gaussian modeling.

---

## 2. Nonlinear and statistical denoising (1980s–1990s)

This era moved denoising beyond “variance reduction” toward “structure-aware inference”: separating signal from noise by exploiting the non-Gaussian statistics of the signal, a shift that anticipates the statistical view behind later generative denoising.

### Projection pursuit and non-Gaussian structure

Projection pursuit emphasized that informative signal structure often appears in non-Gaussian projections:
$$
y = w^\top x,\quad \text{seek } w \text{ such that } y \text{ is maximally non-Gaussian.}
$$
This links “signal” to higher-order statistics, not just covariance.
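
A minimal sketch of the search, under assumptions chosen purely for illustration: the data are two-dimensional and already whitened, non-Gaussianity is measured by absolute excess kurtosis (one of several possible indices), and the direction is found by a brute-force scan over angles rather than by an optimizer.

```python
import numpy as np

def excess_kurtosis(y):
    """Excess kurtosis of a 1-D sample (zero for a Gaussian)."""
    y = (y - y.mean()) / y.std()
    return np.mean(y ** 4) - 3.0

rng = np.random.default_rng(0)
n = 50_000

# Two unit-variance sources: one Laplace (non-Gaussian), one Gaussian.
sources = np.stack([
    rng.laplace(0.0, 1.0 / np.sqrt(2.0), n),
    rng.normal(0.0, 1.0, n),
])

# Hide the sources behind a rotation by an assumed angle.
theta_true = 0.6
rotation = np.array([
    [np.cos(theta_true), -np.sin(theta_true)],
    [np.sin(theta_true),  np.cos(theta_true)],
])
x = rotation @ sources  # observed data, shape (2, n)

# Scan unit vectors w(a) = [cos a, sin a] and score each projection y = w^T x.
angles = np.linspace(0.0, np.pi, 181)
scores = [
    abs(excess_kurtosis(np.array([np.cos(a), np.sin(a)]) @ x))
    for a in angles
]
theta_best = angles[int(np.argmax(scores))]
print(f"true angle={theta_true:.3f}  recovered angle={theta_best:.3f}")
```

The covariance of `x` is the identity in every direction, so second-order statistics alone cannot find the Laplace source; the fourth-order index can.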

---

## 3. Wavelet shrinkage (early 1990s)

Wavelet shrinkage is a major milestone preceding Sparse Code Shrinkage: it made nonlinear denoising practical at scale.

Given a transform (wavelet basis) producing coefficients $c_i$, the key heuristic insight is:

- small coefficients are likely noise  
- large coefficients are likely signal

A common form is soft-thresholding:
$$
\hat{c}_i = \mathrm{sign}(c_i)\max(|c_i|-\lambda,0),
$$
followed by inverse transform reconstruction.
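
The transform, shrink, and reconstruct pipeline can be sketched with a one-level orthonormal Haar transform written directly in NumPy. The test signal, the noise level `sigma`, and the use of the universal threshold $\lambda = \sigma\sqrt{2\log n}$ are assumptions made for this illustration; practical systems typically use multi-level transforms from a wavelet library.

```python
import numpy as np

def soft_threshold(c, lam):
    """Soft-thresholding: shrink each coefficient toward zero by lam."""
    return np.sign(c) * np.maximum(np.abs(c) - lam, 0.0)

def haar_forward(x):
    """One-level orthonormal Haar transform: (approximation, detail)."""
    even, odd = x[0::2], x[1::2]
    return (even + odd) / np.sqrt(2.0), (even - odd) / np.sqrt(2.0)

def haar_inverse(approx, detail):
    """Inverse of haar_forward."""
    x = np.empty(2 * approx.size)
    x[0::2] = (approx + detail) / np.sqrt(2.0)
    x[1::2] = (approx - detail) / np.sqrt(2.0)
    return x

rng = np.random.default_rng(0)
sigma = 0.3                                      # assumed known noise level
clean = np.repeat([0.0, 2.0, -1.0, 1.0], 64)     # piecewise-constant signal
noisy = clean + rng.normal(0.0, sigma, clean.size)

approx, detail = haar_forward(noisy)
lam = sigma * np.sqrt(2.0 * np.log(detail.size))  # universal threshold (assumed)
denoised = haar_inverse(approx, soft_threshold(detail, lam))

mse_noisy = np.mean((noisy - clean) ** 2)
mse_denoised = np.mean((denoised - clean) ** 2)
print(f"mse_noisy={mse_noisy:.4f}  mse_denoised={mse_denoised:.4f}")
```

Only the detail coefficients are shrunk here; the approximation coefficients carry the coarse signal and are left untouched.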

**Conceptual leap**: denoising via nonlinear shrinkage in a coefficient space.  
**Limitation**: the representation is fixed (not learned), and the shrinkage is typically chosen by analytic/heuristic rules.

---

## 4. ICA and Sparse Code Shrinkage (mid–late 1990s)

This is a direct conceptual ancestor of diffusion/score-based thinking: denoising as probabilistic inference in a learned coordinate system.

### ICA representation learning

ICA learns a linear transform (unmixing) that produces statistically independent, non-Gaussian coordinates:
$$
x = A s,\quad s = W x,
$$
with independence:
$$
p(s) = \prod_i p(s_i).
$$

### Sparse Code Shrinkage (SCS): denoising as inference

With additive Gaussian noise:
$$
z = x + n,\quad n\sim \mathcal{N}(0,\sigma^2 I),
$$
and a sparse (super-Gaussian) prior on latent coefficients:
$$
p(s_i)\propto \exp(-\phi(s_i)),
$$
SCS performs ML/MAP-style inference:
$$
\hat{s}=\arg\max_s \ \log p(z\mid s)+\log p(s)
\quad \Longleftrightarrow \quad
\hat{s}=\arg\min_s \left[\frac{1}{2\sigma^2}\|z-As\|^2+\sum_i \phi(s_i)\right].
$$

When $W$ is orthogonal (so that $A = W^\top$), the objective separates across components and yields a *nonlinear*, component-wise shrinkage rule in the learned latent space:
$$
\hat{s}_i = g(y_i),\quad y = Wz,
$$
where $g(\cdot)$ is derived from the assumed prior and noise model rather than from a hand-designed threshold.
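
As a worked instance of how $g$ follows from the model, assume (an illustrative choice, not the only one used in practice) that $W$ is orthogonal, so $A = W^\top$, and that each coefficient has a Laplace prior $\phi(s_i) = |s_i|/b$ with a hypothetical scale $b > 0$. Because $\|z - As\|^2 = \|Wz - s\|^2$ for orthogonal $W$, the objective separates:
$$
\frac{1}{2\sigma^2}\|z-As\|^2+\sum_i \phi(s_i)
= \sum_i \left[\frac{(y_i - s_i)^2}{2\sigma^2} + \frac{|s_i|}{b}\right],\quad y = Wz.
$$
For $s_i \neq 0$, setting the derivative to zero gives $s_i = y_i - \tfrac{\sigma^2}{b}\,\mathrm{sign}(s_i)$, which is self-consistent only when $|y_i| > \sigma^2/b$; otherwise the minimizer is $s_i = 0$. Hence
$$
g(y_i) = \mathrm{sign}(y_i)\max\!\left(|y_i| - \frac{\sigma^2}{b},\, 0\right).
$$
Under these assumptions the MAP rule is soft-thresholding with threshold $\sigma^2/b$, now fixed by the prior and noise level; other sparse priors lead to different shrinkage functions.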

**Key point**: denoising is framed as statistical inference, not smoothing.

---

## 5. Energy-based and score ideas (1990s–2000s)

The score function was in use well before diffusion models:
$$
\nabla_x \log p(x).
$$
Scores appear implicitly/explicitly across:
- energy-based modeling, where $p(x)\propto e^{-E(x)}$ and $\nabla_x \log p(x) = -\nabla_x E(x)$
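- learning rules for non-Gaussian models and independence criteria (in maximum-likelihood ICA, the nonlinearity is the score of the assumed source density), and later score matching for unnormalized models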
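
The energy-score relation can be checked numerically. The double-well energy below is a hypothetical example chosen for illustration; the point of the sketch is that the score is computed from the unnormalized log-density, so the normalizing constant never has to be evaluated.

```python
import numpy as np

def energy(x):
    """Assumed double-well energy; p(x) is proportional to exp(-energy(x))."""
    return 0.25 * x ** 4 - 0.5 * x ** 2

def score(x):
    """Analytic score: grad_x log p(x) = -dE/dx."""
    return -(x ** 3 - x)

def numerical_score(x, h=1e-5):
    """Central finite difference of the unnormalized log-density -E(x)."""
    return (-energy(x + h) + energy(x - h)) / (2.0 * h)

xs = np.linspace(-2.0, 2.0, 9)
max_err = np.max(np.abs(score(xs) - numerical_score(xs)))
print(f"max abs difference: {max_err:.2e}")
```

The score vanishes at the stationary points of the energy ($x = 0, \pm 1$ here) and points toward the nearby modes elsewhere.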

**What was still missing**: an explicit framing of *generation* as a multi-step denoising trajectory driven by a controlled noise process.

---

## 6. What Sohl-Dickstein et al. introduced (2015): denoising as a generative mechanism

The decisive novelty was not denoising itself, but *turning denoising into sampling* by defining:

1) a forward noising process (diffusion) that maps data to noise  
2) a learned or approximated reverse process that generates data by iterative denoising

Conceptually:
$$
x_0 \to x_1 \to \cdots \to x_T \quad (\text{forward noise}),
$$
then sampling via:
$$
x_T \to x_{T-1} \to \cdots \to x_0 \quad (\text{reverse denoise}).
$$
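
The two chains can be sketched on a one-dimensional toy problem. Several choices below are assumptions made for illustration and are not taken from the 2015 paper: the variance-preserving update $x_t = \sqrt{1-\beta_t}\,x_{t-1} + \sqrt{\beta_t}\,\varepsilon$, the linear schedule for $\beta_t$, and Gaussian data, which makes the score of every noisy marginal available in closed form so that no network needs to be trained.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)   # assumed linear noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)      # cumulative signal retention

mu0, s0 = 2.0, 0.5                   # toy data distribution N(mu0, s0^2)

def marginal_score(x, t):
    """Exact score of the noisy marginal after forward step index t.
    Closed form exists only because the data are Gaussian; in practice
    this function is replaced by a learned model."""
    mean = np.sqrt(alpha_bars[t]) * mu0
    var = alpha_bars[t] * s0 ** 2 + (1.0 - alpha_bars[t])
    return -(x - mean) / var

# Forward chain: data -> noise.
x = rng.normal(mu0, s0, size=20_000)
for t in range(T):
    x = np.sqrt(alphas[t]) * x + np.sqrt(betas[t]) * rng.normal(size=x.shape)
x_T = x

# Reverse chain: noise -> data, one denoising step at a time.
x = rng.normal(size=20_000)
for t in reversed(range(T)):
    noise = rng.normal(size=x.shape) if t > 0 else 0.0  # no noise on last step
    x = (x + betas[t] * marginal_score(x, t)) / np.sqrt(alphas[t])
    x = x + np.sqrt(betas[t]) * noise
samples = x

print(f"x_T:     mean={x_T.mean():.3f}  std={x_T.std():.3f}")
print(f"samples: mean={samples.mean():.3f}  std={samples.std():.3f}")
```

The forward chain ends near a standard Gaussian, and the reverse chain, started from fresh Gaussian noise, ends near the data distribution. Generation is literally repeated denoising.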

This reframed generation as a time-reversed stochastic process, drawing on nonequilibrium statistical physics and stochastic dynamics.

**Not new**: Gaussian corruption and inference-based denoising.  
**New**: generation as repeated denoising along a defined stochastic path.

---

## 7. What Song et al. unified (2019–2021): score-based modeling as iterative denoising

A key consolidation was showing deep connections among:
- score matching
- denoising objectives
- diffusion-like procedures

The denoising view becomes explicitly score-driven: each denoising step moves the sample along a score estimated at the current noise level.
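
One way to see the link between denoising objectives and scores is a toy denoising score matching fit. The setup is an assumption chosen so that everything has a closed form: scalar Gaussian data, a single noise level `sigma`, and a linear score model $s_a(z) = a z$ fitted by least squares. The regression target is the score of the Gaussian corruption kernel, $(x - z)/\sigma^2$, and the final line applies Tweedie's formula, $\mathbb{E}[x \mid z] = z + \sigma^2 \nabla_z \log p_\sigma(z)$, to turn the fitted score into a denoiser.

```python
import numpy as np

rng = np.random.default_rng(0)
s, sigma = 2.0, 1.0  # assumed data std and noise std

x = rng.normal(0.0, s, size=200_000)          # clean samples
z = x + rng.normal(0.0, sigma, size=x.shape)  # noisy samples

# Denoising score matching target: score of the corruption kernel p(z | x).
target = (x - z) / sigma ** 2

# Fit the linear score model s_a(z) = a * z by least squares.
a_hat = np.sum(z * target) / np.sum(z * z)

# True score of the noisy marginal N(0, s^2 + sigma^2) is a_true * z.
a_true = -1.0 / (s ** 2 + sigma ** 2)

# Tweedie's formula: denoise by stepping along the estimated score.
x_hat = z + sigma ** 2 * a_hat * z

mse_noisy = np.mean((z - x) ** 2)
mse_denoised = np.mean((x_hat - x) ** 2)
print(f"a_hat={a_hat:.4f}  a_true={a_true:.4f}")
print(f"mse_noisy={mse_noisy:.3f}  mse_denoised={mse_denoised:.3f}")
```

In this Gaussian case the resulting denoiser is $z \cdot s^2/(s^2+\sigma^2)$, the same linear shrinkage as the Wiener solution, which illustrates how the classical estimator reappears as a special case of score-based denoising.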

---

## 8. Where Hyvärinen fits historically

Hyvärinen’s contributions sit much closer to the conceptual core of diffusion models than common “novelty narratives” suggest:

- non-Gaussian modeling as a signal principle  
- sparse latent inference under explicit probabilistic assumptions  
- shrinkage as statistically derived estimation  
- learning a coordinate system where inference becomes simple and component-wise

**What he did not do**:
- frame denoising as a time-reversed stochastic process  
- construct a generative sampler by repeated denoising along a noise schedule

---

## 9. One-sentence academic truth

Diffusion models did not invent denoising; they transformed decades of statistical denoising, sparse inference, and score estimation into a generative dynamical system.

---

## 10. Conceptual lineage (clean summary)

$$
\text{Wiener / Kalman}
\ \rightarrow\
\text{Projection Pursuit}
\ \rightarrow\
\text{Wavelet Shrinkage}
\ \rightarrow\
\text{ICA \& Sparse Code Shrinkage}
\ \rightarrow\
\text{Energy Models / Score Ideas}
\ \rightarrow\
\text{Denoising Autoencoders}
\ \rightarrow\
\text{Diffusion \& Score-Based Models}.
$$

---

## Final takeaway

The “novelty” of diffusion models is best stated precisely:

- Denoising as inference is old.  
- Nonlinear shrinkage and sparse priors are old.  
- Score concepts are old.  
- The distinctive innovation is turning denoising from a tool into a principled *sampling mechanism* via a stochastic forward process and its learned (or modeled) reverse.
