# DDPM diffusion

Diffusion models transform data along two paths. The first is the forward process, which turns data into noise through a sequence of predefined steps, simulating a diffusion-like process that gradually corrupts the input. The forward process is typically parameterized so that the final step produces (approximately) pure Gaussian noise. The second is the backward (reverse) process, which denoises the sample step by step to reconstruct the original data.
During training, a diffusion model learns to predict the noise added by the forward process, which enables it to perform the backward process during sampling. Together, these processes form the foundation of diffusion models. The diffusion model proposed in "Denoising Diffusion Probabilistic Models" (DDPM) relies on a few key mathematical assumptions:

### Key Assumptions

1. The forward process is Markovian and can be defined as: $$q(\mathbf{x}_t \vert \mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t; \sqrt{1 - \beta_t} \mathbf{x}_{t-1}, \beta_t\mathbf{I})$$
   Here, $q$ is the probability density of the next, slightly noisier image given the previous, cleaner image, and $t$ indexes the step in the forward process. Each step is a draw from a Gaussian: the cleaner image $\mathbf{x}_{t-1}$ is scaled by $\sqrt{1 - \beta_t}$ and Gaussian noise with variance $\beta_t$ is added to produce the output of the step, $\mathbf{x}_t$.

2. A nice property of this definition is that we do not need to go through each step iteratively: we can jump to an arbitrary step $t$ with a closed-form expression by using the reparameterization trick. Namely, if we define $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{i=1}^t \alpha_i$, then: $$\begin{aligned}
\mathbf{x}_t 
&= \sqrt{\alpha_t}\mathbf{x}_{t-1} + \sqrt{1 - \alpha_t}\boldsymbol{\epsilon}_{t-1} & \text{ ;where } \boldsymbol{\epsilon}_{t-1}, \boldsymbol{\epsilon}_{t-2}, \dots \sim \mathcal{N}(\mathbf{0}, \mathbf{I}) \\
&= \sqrt{\alpha_t \alpha_{t-1}} \mathbf{x}_{t-2} + \sqrt{1 - \alpha_t \alpha_{t-1}} \bar{\boldsymbol{\epsilon}}_{t-2} & \text{ ;where } \bar{\boldsymbol{\epsilon}}_{t-2} \text{ merges two Gaussians (*).} \\
&= \dots \\
&= \sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\boldsymbol{\epsilon} \\
q(\mathbf{x}_t \vert \mathbf{x}_0) &= \mathcal{N}(\mathbf{x}_t; \sqrt{\bar{\alpha}_t} \mathbf{x}_0, (1 - \bar{\alpha}_t)\mathbf{I})
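\end{aligned}$$ (*) The sum of two independent zero-mean Gaussians with variances $\sigma_1^2\mathbf{I}$ and $\sigma_2^2\mathbf{I}$ is again a Gaussian, with variance $(\sigma_1^2 + \sigma_2^2)\mathbf{I}$; here the merged variance is $\alpha_t(1 - \alpha_{t-1}) + (1 - \alpha_t) = 1 - \alpha_t\alpha_{t-1}$. In other words, we leap over an arbitrary number of steps with a single Gaussian that accumulates the incremental noise of every step. Usually larger update steps are used as the sample gets noisier, so $\beta_1 < \beta_2 < \dots < \beta_T$ and therefore $\bar{\alpha}_1 > \dots > \bar{\alpha}_T$.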

3. We are very interested in the reverse probability $q(\mathbf{x}_{t-1} \vert \mathbf{x}_t)$ because it would allow us to reverse the corruption process and gradually clean the noisy sample. Applying Bayes' rule to it directly leads to a dead end, because the marginals it requires depend on the whole data distribution. However, the reverse conditional probability becomes tractable when conditioned on $\mathbf{x}_0$: $$q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_{t-1}; \tilde{\boldsymbol{\mu}}(\mathbf{x}_t, \mathbf{x}_0), \tilde{\beta}_t \mathbf{I})$$
Applying Bayes' rule now gives: $$q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0)= q(\mathbf{x}_t \vert \mathbf{x}_{t-1}, \mathbf{x}_0) \frac{ q(\mathbf{x}_{t-1} \vert \mathbf{x}_0) }{ q(\mathbf{x}_t \vert \mathbf{x}_0) }$$ All the terms on the right-hand side are Gaussians, and because the forward process is Markovian, $q(\mathbf{x}_t \vert \mathbf{x}_{t-1}, \mathbf{x}_0) = q(\mathbf{x}_t \vert \mathbf{x}_{t-1})$. Substituting $\mathbf{x}_0 = \frac{1}{\sqrt{\bar{\alpha}_t}}(\mathbf{x}_t - \sqrt{1 - \bar{\alpha}_t}\boldsymbol{\epsilon}_t)$ from the closed form in item 2, we can derive the following expressions for the mean and the variance: <br>
Mean $$\tilde{\boldsymbol{\mu}}_t = \frac{1}{\sqrt{\alpha_t}} \Big( \mathbf{x}_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \boldsymbol{\epsilon}_t \Big)$$ Variance
$$\tilde{\beta}_t =\frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \cdot \beta_t$$

4. Substituting these expressions back into the conditional distribution gives a Gaussian whose variance is a constant determined by the schedule parameters and whose mean depends only on $\mathbf{x}_t$ and the noise $\boldsymbol{\epsilon}_t$:
$$q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \boldsymbol{\epsilon}_t)=  \mathcal{N}\Big(\mathbf{x}_{t-1}; \frac{1}{\sqrt{\alpha_t}} \Big( \mathbf{x}_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \boldsymbol{\epsilon}_t \Big),\frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \cdot \beta_t \mathbf{I}\Big)$$

5. Since we aim to learn the reverse diffusion process, we would ideally like a learned distribution $p_\theta(\mathbf{x}_{t-1} \vert \mathbf{x}_t)$ to be as close as possible to $q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0)$. For this we need a simplifying assumption, namely that $p_\theta(\mathbf{x}_{t-1} \vert \mathbf{x}_t)$ is itself Gaussian. Under this assumption, the task is to fit one Gaussian as closely as possible to another, so we only need to match their moments. Since the variance is a fixed constant set by the noise schedule, the neural network only has to estimate the mean. Furthermore, since the only unknown in the expression for the mean is the noise, the model only needs to learn to predict the noise $\hat{\boldsymbol{\epsilon}}_t$.
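
The forward step, the closed-form jump, and the posterior moments from the list above can be sketched in NumPy. This is a minimal illustration, not a reference implementation: the function names (`make_schedule`, `q_step`, `q_sample`, `posterior_mean_variance`), the linear $\beta$ schedule with its endpoint values, and the scalar toy "images" are all assumptions made for the example and do not come from the text above.

```python
import numpy as np

def make_schedule(T=100, beta_start=1e-4, beta_end=0.02):
    """Increasing beta schedule (linear spacing is an assumed choice)."""
    betas = np.linspace(beta_start, beta_end, T)
    alphas = 1.0 - betas              # alpha_t = 1 - beta_t
    alpha_bars = np.cumprod(alphas)   # alpha_bar_t = prod_{i<=t} alpha_i
    return betas, alphas, alpha_bars

def q_step(x_prev, t, betas, rng):
    """One forward step, x_t ~ q(x_t | x_{t-1}); t is 1-based."""
    beta_t = betas[t - 1]
    eps = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * eps

def q_sample(x0, t, alpha_bars, eps):
    """Closed-form jump: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps."""
    abar_t = alpha_bars[t - 1]
    return np.sqrt(abar_t) * x0 + np.sqrt(1.0 - abar_t) * eps

def posterior_mean_variance(x_t, eps, t, betas, alphas, alpha_bars):
    """Mean and variance of q(x_{t-1} | x_t, x_0), written in terms of the noise."""
    beta_t, alpha_t, abar_t = betas[t - 1], alphas[t - 1], alpha_bars[t - 1]
    abar_prev = alpha_bars[t - 2] if t > 1 else 1.0   # convention: abar_0 = 1
    mean = (x_t - (1.0 - alpha_t) / np.sqrt(1.0 - abar_t) * eps) / np.sqrt(alpha_t)
    var = (1.0 - abar_prev) / (1.0 - abar_t) * beta_t
    return mean, var

rng = np.random.default_rng(0)
betas, alphas, alpha_bars = make_schedule(T=100)
x0 = np.full(20000, 2.0)   # 20000 copies of a scalar toy "image"

# Route 1: apply all 100 forward steps one after another.
x_iter = x0.copy()
for t in range(1, 101):
    x_iter = q_step(x_iter, t, betas, rng)

# Route 2: jump straight to t = 100 with the closed form.
eps = rng.standard_normal(x0.shape)
x_jump = q_sample(x0, 100, alpha_bars, eps)

# Both routes should agree in distribution (similar mean and variance).
print("iterated   :", x_iter.mean(), x_iter.var())
print("closed form:", x_jump.mean(), x_jump.var())

mean, var = posterior_mean_variance(x_jump, eps, 100, betas, alphas, alpha_bars)
```

The two routes draw different random numbers, so they match only statistically, not sample by sample. In practice the closed-form jump is the one used during training, since it avoids simulating every intermediate step.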


### Loss function

Using the key assumptions, we get a setup which is very similar to the one used for Variational Autoencoders (VAEs), and thus we can minimize the negative log-likelihood through its variational bound. After a long derivation, we arrive at the following term for step $t$:
$$
\begin{aligned}
L_t = \mathbb{E}_{\mathbf{x}_0, \boldsymbol{\epsilon}} \Big[\frac{ (1 - \alpha_t)^2 }{2 \alpha_t (1 - \bar{\alpha}_t) \| \boldsymbol{\Sigma}_\theta \|^2_2} \|\boldsymbol{\epsilon}_t - \boldsymbol{\epsilon}_\theta(\sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\boldsymbol{\epsilon}_t, t)\|^2 \Big] 
\end{aligned}
$$
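
The weighting term can be traced back to the KL divergence between two Gaussians with the same covariance, which reduces to a scaled squared distance between their means. As a sketch, assume the model mean $\boldsymbol{\mu}_\theta(\mathbf{x}_t, t)$ has the same form as $\tilde{\boldsymbol{\mu}}_t$ with the predicted noise $\boldsymbol{\epsilon}_\theta$ in place of $\boldsymbol{\epsilon}_t$ (this parameterization is an assumption, in line with item 5 of the key assumptions). Then, up to terms that do not depend on $\theta$:

$$\begin{aligned}
L_t &= \mathbb{E}_{\mathbf{x}_0, \boldsymbol{\epsilon}} \Big[\frac{1}{2 \| \boldsymbol{\Sigma}_\theta \|^2_2} \| \tilde{\boldsymbol{\mu}}_t - \boldsymbol{\mu}_\theta(\mathbf{x}_t, t) \|^2 \Big] \\
\tilde{\boldsymbol{\mu}}_t - \boldsymbol{\mu}_\theta(\mathbf{x}_t, t) &= \frac{1}{\sqrt{\alpha_t}} \cdot \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \big( \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) - \boldsymbol{\epsilon}_t \big) \\
\| \tilde{\boldsymbol{\mu}}_t - \boldsymbol{\mu}_\theta(\mathbf{x}_t, t) \|^2 &= \frac{(1 - \alpha_t)^2}{\alpha_t (1 - \bar{\alpha}_t)} \| \boldsymbol{\epsilon}_t - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) \|^2
\end{aligned}$$

Inserting the last line into the first, with $\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\boldsymbol{\epsilon}_t$, recovers the weighted expression for $L_t$.
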
However, the authors of DDPM found that training the diffusion model works better with a simplified objective that ignores the weighting term:
$$\begin{aligned}
L_t^\text{simple}
&= \mathbb{E}_{t \sim [1, T], \mathbf{x}_0, \boldsymbol{\epsilon}_t} \Big[\|\boldsymbol{\epsilon}_t - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\|^2 \Big] \\
&= \mathbb{E}_{t \sim [1, T], \mathbf{x}_0, \boldsymbol{\epsilon}_t} \Big[\|\boldsymbol{\epsilon}_t - \boldsymbol{\epsilon}_\theta(\sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\boldsymbol{\epsilon}_t, t)\|^2 \Big]
\end{aligned}$$
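
To make the simplified objective concrete, here is a minimal NumPy sketch of a Monte Carlo estimate of $L^\text{simple}$, together with one reverse step that uses the predicted noise. Everything named here is an assumption for illustration: `simple_loss`, `p_sample_step`, the `eps_model(x_t, t)` callable interface, the linear schedule, and the toy dataset in which every sample is the zero vector (chosen only because the ideal noise predictor is then known in closed form). A real setup would use a trained neural network as `eps_model`.

```python
import numpy as np

def simple_loss(eps_model, x0, alpha_bars, rng):
    """Monte Carlo estimate of L_simple for a batch x0 of shape (batch, dim)."""
    batch, T = x0.shape[0], len(alpha_bars)
    t = rng.integers(1, T + 1, size=batch)      # t drawn uniformly from {1, ..., T}
    eps = rng.standard_normal(x0.shape)         # the noise the model must predict
    abar_t = alpha_bars[t - 1][:, None]
    x_t = np.sqrt(abar_t) * x0 + np.sqrt(1.0 - abar_t) * eps
    eps_hat = eps_model(x_t, t)
    return np.mean(np.sum((eps - eps_hat) ** 2, axis=1))

def p_sample_step(eps_model, x_t, t, betas, alpha_bars, rng):
    """One reverse step x_t -> x_{t-1} from the predicted noise; t is 1-based."""
    beta_t = betas[t - 1]
    alpha_t = 1.0 - beta_t
    abar_t = alpha_bars[t - 1]
    abar_prev = alpha_bars[t - 2] if t > 1 else 1.0   # convention: abar_0 = 1
    eps_hat = eps_model(x_t, np.full(x_t.shape[0], t))
    mean = (x_t - (1.0 - alpha_t) / np.sqrt(1.0 - abar_t) * eps_hat) / np.sqrt(alpha_t)
    var = (1.0 - abar_prev) / (1.0 - abar_t) * beta_t  # the fixed variance beta_tilde_t
    return mean + np.sqrt(var) * rng.standard_normal(x_t.shape)

# Assumed linear schedule with T = 100 steps.
betas = np.linspace(1e-4, 0.02, 100)
alpha_bars = np.cumprod(1.0 - betas)
rng = np.random.default_rng(0)

def oracle(x_t, t):
    """Ideal predictor for the all-zeros toy dataset: x_t = sqrt(1 - abar_t) * eps."""
    return x_t / np.sqrt(1.0 - alpha_bars[t - 1])[:, None]

x0 = np.zeros((256, 4))
loss_oracle = simple_loss(oracle, x0, alpha_bars, rng)
loss_zero = simple_loss(lambda x_t, t: np.zeros_like(x_t), x0, alpha_bars, rng)
print("oracle loss:", loss_oracle, "| always-zero predictor loss:", loss_zero)

# Sampling: start from pure noise and apply the reverse step for t = T, ..., 1.
x = rng.standard_normal((256, 4))
for t in range(100, 0, -1):
    x = p_sample_step(oracle, x, t, betas, alpha_bars, rng)
```

With the oracle predictor the loss is essentially zero and the sampling loop ends at the (all-zeros) data, whereas a predictor that always outputs zero leaves a loss close to the data dimension. The sketch uses $\tilde{\beta}_t$ as the reverse-step variance, matching the posterior variance derived earlier.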
