# Diffusion models

In a diffusion model, we have a forward process $q(\mathbf{x}_t|\mathbf{x}_{t-1})$ that adds noise according to a variance schedule $\{\beta_t\}_{t=1}^T$ with each $\beta_t\in (0,1)$ (where $\mathbf{x}_0$ denotes the original image). Formally, define

$$q(\mathbf{x}_t|\mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t|\sqrt{1-\beta_t}\mathbf{x}_{t-1}, \beta_t \mathbf{I})$$

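Letting $\alpha_t=1-\beta_t$ and $\bar{\alpha}_t = \prod_{i=1}^t \alpha_i$, and writing $\epsilon_{t-1}, \epsilon_{t-2}, \dots \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ for independent noise variables, we have

$$
\begin{align*}
    \mathbf{x}_t&= \sqrt{\alpha_t}\mathbf{x}_{t-1} + \sqrt{1-\alpha_t} \epsilon_{t-1}\\
    &= \sqrt{\alpha_t}(\sqrt{\alpha_{t-1}}\mathbf{x}_{t-2} + \sqrt{1-\alpha_{t-1}}\epsilon_{t-2}) + \sqrt{1-\alpha_t}\epsilon_{t-1}\\
    &= \sqrt{\alpha_t\alpha_{t-1}}\mathbf{x}_{t-2} + \sqrt{1-\alpha_t\alpha_{t-1}}\bar{\epsilon}_{t-2}\\
    &= \dots\\
    &= \sqrt{\bar{\alpha}_t} \mathbf{x}_{0} + \sqrt{1-\bar{\alpha}_t}\epsilon
\end{align*}
$$

Here $\bar{\epsilon}_{t-2}$ and $\epsilon$ are again standard normal: each merges the independent noise terms accumulated so far.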
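The step from the second to the third line of this derivation uses the fact that a sum of independent zero-mean Gaussians is again Gaussian, with the variances adding. Written out for the two noise terms involved:

$$
\begin{align*}
    \sqrt{\alpha_t(1-\alpha_{t-1})}\,\epsilon_{t-2} + \sqrt{1-\alpha_t}\,\epsilon_{t-1} &\sim \mathcal{N}\big(\mathbf{0}, [\alpha_t(1-\alpha_{t-1}) + (1-\alpha_t)]\mathbf{I}\big)\\
    &= \mathcal{N}\big(\mathbf{0}, (1-\alpha_t\alpha_{t-1})\mathbf{I}\big)
\end{align*}
$$

So the two noise terms can be replaced by a single standard normal variable scaled by $\sqrt{1-\alpha_t\alpha_{t-1}}$, and repeating the argument down to $\mathbf{x}_0$ yields the coefficient $\sqrt{1-\bar{\alpha}_t}$.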

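Since each $\alpha_t\in (0,1)$, the product $\bar{\alpha}_t$ is decreasing in $t$, and $\bar{\alpha}_t\rightarrow 0$ as $t\to\infty$ provided the $\beta_t$ do not decay too quickly (for example, if they are bounded away from $0$). Consequently, as $t\to\infty$,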

$$q(\mathbf{x}_t|\mathbf{x}_0) = \mathcal{N}(\mathbf{x}_t|\sqrt{\bar{\alpha}_t} \mathbf{x}_{0}, (1-\bar{\alpha}_t)\mathbf{I})\rightarrow \mathcal{N}(\mathbf{0}, \mathbf{I})$$
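As a quick numerical illustration of this limit, the sketch below samples $\mathbf{x}_T$ directly from the closed form $q(\mathbf{x}_T|\mathbf{x}_0)$ and checks that it looks like standard normal noise. The linear schedule, its endpoints, $T=1000$, and the names `linear_beta_schedule` and `q_sample` are assumptions made for this example, not something fixed by the derivation above.

```python
import numpy as np

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Assumed schedule: beta_1..beta_T increase linearly (illustrative values)."""
    return np.linspace(beta_start, beta_end, num_steps)

def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t ~ q(x_t | x_0) in one shot; t is 1-indexed as in the text."""
    noise = rng.standard_normal(x0.shape)
    a_bar = alpha_bar[t - 1]
    return np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * noise

T = 1000
betas = linear_beta_schedule(T)
alpha_bar = np.cumprod(1.0 - betas)  # alpha_bar[t-1] = prod_{i<=t} (1 - beta_i)

rng = np.random.default_rng(0)
x0 = np.full(100_000, 3.0)           # toy "image": every pixel equals 3
xT = q_sample(x0, T, alpha_bar, rng)

print("alpha_bar_T:", alpha_bar[-1])            # close to 0
print("mean, var of x_T:", xT.mean(), xT.var()) # close to 0 and 1
```

Under these assumed settings almost nothing of $\mathbf{x}_0$ survives at step $T$, which is what justifies starting generation from $\mathcal{N}(\mathbf{0}, \mathbf{I})$.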

In other words, the distribution approaches a standard isotropic Gaussian. If we can reverse the above process, then we can generate new samples by sampling $\mathbf{x}_T$ from $\mathcal{N}(\mathbf{0}, \mathbf{I})$ and then repeatedly sampling from $q(\mathbf{x}_{t-1}|\mathbf{x}_t)$. Unfortunately, we cannot easily estimate $q(\mathbf{x}_{t-1}|\mathbf{x}_t)$, since it depends on the entire data distribution. Therefore, we consider a parameterized approximation $p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_t)$ of these distributions. Note that this is essentially a latent variable model with latent variables $\mathbf{z}=\mathbf{x}_{1:T}$. Therefore, applying the variational lower bound, we have

$$
\begin{align*}
\log p(\mathbf{x}_0) &\geq \mathbb{E}_{\mathbf{x}_{1:T}|\mathbf{x}_0} [\log p(\mathbf{x}_0|\mathbf{x}_{1:T})] - \mathcal{D}_{KL}(q(\mathbf{x}_{1:T}|\mathbf{x}_0)||p(\mathbf{x}_{1:T}))\\
    &= \mathbb{E}_{\mathbf{x}_{1:T}|\mathbf{x}_0} \bigg[\log \frac{p(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T}|\mathbf{x}_0)}\bigg]\\
    &= \mathbb{E}_{\mathbf{x}_{1:T}|\mathbf{x}_0} \bigg[\log \frac{p(\mathbf{x}_T)p_{\theta}(\mathbf{x}_0|\mathbf{x}_1)\prod_{t=1}^{T-1}p_{\theta}(\mathbf{x}_{t}|\mathbf{x}_{t+1})}{q(\mathbf{x}_T|\mathbf{x}_{T-1})\prod_{t=1}^{T-1} q(\mathbf{x}_t|\mathbf{x}_{t-1})}\bigg]\\
    &= \mathbb{E}_{\mathbf{x}_{1:T}|\mathbf{x}_0}[\log p_{\theta}(\mathbf{x}_0|\mathbf{x}_1)] + \mathbb{E}_{\mathbf{x}_{1:T}|\mathbf{x}_0} \bigg[\log\frac{p(\mathbf{x}_T)}{q(\mathbf{x}_T|\mathbf{x}_{T-1})}\bigg] + \sum_{t=1}^{T-1} \mathbb{E}_{\mathbf{x}_{1:T}|\mathbf{x}_0} \bigg[\log\frac{p_{\theta}(\mathbf{x}_{t}|\mathbf{x}_{t+1})}{q(\mathbf{x}_t|\mathbf{x}_{t-1})}\bigg]\\
    &= \mathbb{E}_{\mathbf{x}_{1}|\mathbf{x}_0}[\log p_{\theta}(\mathbf{x}_0|\mathbf{x}_1)] + \mathbb{E}_{\mathbf{x}_{T-1}, \mathbf{x}_T|\mathbf{x}_0} \bigg[\log\frac{p(\mathbf{x}_T)}{q(\mathbf{x}_T|\mathbf{x}_{T-1})}\bigg] + \sum_{t=1}^{T-1} \mathbb{E}_{\mathbf{x}_{t-1}, \mathbf{x}_{t}, \mathbf{x}_{t+1}|\mathbf{x}_{0}} \bigg[\log\frac{p_{\theta}(\mathbf{x}_{t}|\mathbf{x}_{t+1})}{q(\mathbf{x}_t|\mathbf{x}_{t-1})}\bigg]\\
    &= \underbrace{\mathbb{E}_{\mathbf{x}_{1}|\mathbf{x}_0}[\log p_{\theta}(\mathbf{x}_0|\mathbf{x}_1)]}_{\text{reconstruction term}} -
    \underbrace{\mathbb{E}_{\mathbf{x}_{T-1}|\mathbf{x}_0} [\mathcal{D}_{KL}(q(\mathbf{x}_T|\mathbf{x}_{T-1})||p(\mathbf{x}_T))]}_{\text{prior matching term}}
    - \sum_{t=1}^{T-1} \underbrace{\mathbb{E}_{\mathbf{x}_{t-1}, \mathbf{x}_{t+1}|\mathbf{x}_{0}} [\mathcal{D}_{KL}(q(\mathbf{x}_t|\mathbf{x}_{t-1})||p_{\theta}(\mathbf{x}_{t}|\mathbf{x}_{t+1}))]}_{\text{consistency term}}
\end{align*}
$$
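The third line of this derivation relies on the Markov structure of both processes. Spelled out, the joint model and the forward process factorize as

$$
\begin{align*}
    p(\mathbf{x}_{0:T}) &= p(\mathbf{x}_T)\prod_{t=1}^{T} p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_t) = p(\mathbf{x}_T)\,p_{\theta}(\mathbf{x}_0|\mathbf{x}_1)\prod_{t=1}^{T-1} p_{\theta}(\mathbf{x}_{t}|\mathbf{x}_{t+1})\\
    q(\mathbf{x}_{1:T}|\mathbf{x}_0) &= \prod_{t=1}^{T} q(\mathbf{x}_t|\mathbf{x}_{t-1}) = q(\mathbf{x}_T|\mathbf{x}_{T-1})\prod_{t=1}^{T-1} q(\mathbf{x}_t|\mathbf{x}_{t-1})
\end{align*}
$$

The second form of each product only re-indexes the factors so that the numerator and denominator can be paired term by term at each intermediate step $t$.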

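As with a variational autoencoder, the variational lower bound decomposes into interpretable terms, here three:

1. Reconstruction term: measures the likelihood of reconstructing $\mathbf{x}_0$ from the first latent layer $\mathbf{x}_1$.
2. Prior matching term: ensures that the final latent distribution matches the prior distribution $p(\mathbf{x}_T)$.
3. Consistency term: ensures that, at every intermediate step $t$, the distribution of $\mathbf{x}_t$ obtained by noising $\mathbf{x}_{t-1}$ with $q$ agrees with the one obtained by denoising $\mathbf{x}_{t+1}$ with $p_{\theta}$.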

## A simplified training scheme

Every term in the variational lower bound can be estimated by Monte Carlo sampling. However, the consistency term is an expectation over two random variables, $\mathbf{x}_{t-1}$ and $\mathbf{x}_{t+1}$, so its estimate has higher variance than one that samples a single variable. It is therefore desirable to reformulate the ELBO so that each expectation is taken over only one random variable at a time. We start from the negative ELBO, factorized using the Markov structure of $q$ and $p_{\theta}$:




$$
\begin{align*}
     \mathbb{E}_{\mathbf{x}_{1:T}|\mathbf{x}_0} \bigg[\log \frac{q(\mathbf{x}_{1:T}|\mathbf{x}_0)}{p(\mathbf{x}_{0:T})}\bigg] &=  \mathbb{E}_{\mathbf{x}_{1:T}|\mathbf{x}_0} \bigg[\log\frac{\prod_{t=1}^T q(\mathbf{x}_{t}|\mathbf{x}_{t-1})}{p(\mathbf{x}_T)\prod_{t=1}^T p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_t)}\bigg]\\
     &= \mathbb{E}_{\mathbf{x}_{1:T}|\mathbf{x}_0} \bigg[-\log p(\mathbf{x}_T) + \sum_{t=1}^T \log \frac{ q(\mathbf{x}_{t}|\mathbf{x}_{t-1})}{p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_t)}\bigg]\\
\end{align*}
$$









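We can further simplify by noting that, by the Markov property of the forward process and Bayes' rule, $q(\mathbf{x}_t|\mathbf{x}_{t-1}) = q(\mathbf{x}_t|\mathbf{x}_{t-1}, \mathbf{x}_0) = \frac{q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0)\,q(\mathbf{x}_t|\mathbf{x}_0)}{q(\mathbf{x}_{t-1}|\mathbf{x}_0)}$ for $t\geq 2$. Substituting this into each term with $t\geq 2$ gives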

$$
\begin{align*}
     -\log p(\mathbf{x}_0) &\leq \mathbb{E}_{\mathbf{x}_{1:T}|\mathbf{x}_0} \bigg[-\log p(\mathbf{x}_T) + \sum_{t=1}^T \log \frac{ q(\mathbf{x}_{t}|\mathbf{x}_{t-1})}{p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_t)}\bigg]\\
     &= \mathbb{E}_{\mathbf{x}_{1:T}|\mathbf{x}_0} \bigg[-\log p(\mathbf{x}_T) + \sum_{t=2}^T \log \bigg(\frac{ q(\mathbf{x}_{t-1}|\mathbf{x}_{t}, \mathbf{x}_0)}{p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_t)}\cdot \frac{q(\mathbf{x}_{t}|\mathbf{x}_0)}{q(\mathbf{x}_{t-1}|\mathbf{x}_0)}\bigg)+\log \frac{q(\mathbf{x}_1|\mathbf{x}_0)}{p_{\theta}(\mathbf{x}_0|\mathbf{x}_1)}\bigg]\\
     &= \mathbb{E}_{\mathbf{x}_{1:T}|\mathbf{x}_0} \bigg[-\log p(\mathbf{x}_T) + \sum_{t=2}^T \log \frac{ q(\mathbf{x}_{t-1}|\mathbf{x}_{t}, \mathbf{x}_0)}{p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_t)}+\sum_{t=2}^T \log\frac{q(\mathbf{x}_{t}|\mathbf{x}_0)}{q(\mathbf{x}_{t-1}|\mathbf{x}_0)}+\log \frac{q(\mathbf{x}_1|\mathbf{x}_0)}{p_{\theta}(\mathbf{x}_0|\mathbf{x}_1)}\bigg]\\
     &= \mathbb{E}_{\mathbf{x}_{1:T}|\mathbf{x}_0} \bigg[-\log p(\mathbf{x}_T) + \sum_{t=2}^T \log \frac{ q(\mathbf{x}_{t-1}|\mathbf{x}_{t}, \mathbf{x}_0)}{p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_t)}+\log\frac{q(\mathbf{x}_{T}|\mathbf{x}_0)}{q(\mathbf{x}_{1}|\mathbf{x}_0)}+\log \frac{q(\mathbf{x}_1|\mathbf{x}_0)}{p_{\theta}(\mathbf{x}_0|\mathbf{x}_1)}\bigg]\\
     &= \mathbb{E}_{\mathbf{x}_{1:T}|\mathbf{x}_0} \bigg[\log\frac{q(\mathbf{x}_T|\mathbf{x}_0)}{p(\mathbf{x}_T)} + \sum_{t=2}^T \log \frac{ q(\mathbf{x}_{t-1}|\mathbf{x}_{t}, \mathbf{x}_0)}{p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_t)}-\log p_{\theta}(\mathbf{x}_0|\mathbf{x}_1)\bigg]\\
     &= \mathbb{E}_{\mathbf{x}_{T}|\mathbf{x}_0} \bigg[\log\frac{q(\mathbf{x}_T|\mathbf{x}_0)}{p(\mathbf{x}_T)}\bigg] + \sum_{t=2}^T \mathbb{E}_{\mathbf{x}_{t-1}, \mathbf{x}_{t}|\mathbf{x}_0}\bigg[\log \frac{ q(\mathbf{x}_{t-1}|\mathbf{x}_{t}, \mathbf{x}_0)}{p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_t)}\bigg]- \mathbb{E}_{\mathbf{x}_{1}|\mathbf{x}_0}\bigg[\log p_{\theta}(\mathbf{x}_0|\mathbf{x}_1)\bigg]\\
     &= \mathcal{D}_{KL}(q(\mathbf{x}_T|\mathbf{x}_0)||p(\mathbf{x}_T)) + \sum_{t=2}^T \mathbb{E}_{ \mathbf{x}_{t}|\mathbf{x}_0}[\mathcal{D}_{KL}(q(\mathbf{x}_{t-1}|\mathbf{x}_{t}, \mathbf{x}_0)||p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_t))] - \mathbb{E}_{\mathbf{x}_{1}|\mathbf{x}_0}[\log p_{\theta}(\mathbf{x}_0|\mathbf{x}_1)]
\end{align*}
$$

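Each of these terms can be computed efficiently. The first term contains no trainable parameters when $p(\mathbf{x}_T)$ is fixed to $\mathcal{N}(\mathbf{0}, \mathbf{I})$, and the last term can be estimated with a single sample of $\mathbf{x}_1$. Each KL term involves $q(\mathbf{x}_{t-1}|\mathbf{x}_{t}, \mathbf{x}_0)$, which is Gaussian, so it has a closed form whenever $p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_t)$ is also chosen to be Gaussian, and its outer expectation is over the single variable $\mathbf{x}_t$.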
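To make the KL terms concrete, here is a minimal sketch of how one of them could be evaluated. It relies on the standard closed form of the posterior, which is not derived in these notes: $q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0)$ is Gaussian with mean $\frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{1-\bar{\alpha}_t}\mathbf{x}_0 + \frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\mathbf{x}_t$ and variance $\frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\beta_t\mathbf{I}$. The function names, the linear schedule, and the stand-in for the model's predicted mean are all hypothetical, and the sketch assumes $p_{\theta}$ is an isotropic Gaussian.

```python
import numpy as np

def posterior_params(x0, xt, t, betas, alpha_bar):
    """Mean and scalar variance of q(x_{t-1} | x_t, x_0), for 1-indexed t >= 2."""
    beta_t = betas[t - 1]
    ab_t, ab_prev = alpha_bar[t - 1], alpha_bar[t - 2]
    coef_x0 = np.sqrt(ab_prev) * beta_t / (1.0 - ab_t)
    coef_xt = np.sqrt(1.0 - beta_t) * (1.0 - ab_prev) / (1.0 - ab_t)
    mean = coef_x0 * x0 + coef_xt * xt
    var = (1.0 - ab_prev) / (1.0 - ab_t) * beta_t
    return mean, var

def kl_isotropic_gaussians(mean_q, var_q, mean_p, var_p):
    """Closed-form KL( N(mean_q, var_q I) || N(mean_p, var_p I) )."""
    d = mean_q.size
    sq_dist = np.sum((mean_q - mean_p) ** 2)
    return 0.5 * (d * var_q / var_p + sq_dist / var_p - d
                  + d * np.log(var_p / var_q))

# Assumed schedule and toy data, for illustration only.
betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1.0 - betas)
rng = np.random.default_rng(0)
x0 = rng.standard_normal(4)
t = 500
xt = (np.sqrt(alpha_bar[t - 1]) * x0
      + np.sqrt(1.0 - alpha_bar[t - 1]) * rng.standard_normal(4))

mean_q, var_q = posterior_params(x0, xt, t, betas, alpha_bar)
mean_p = mean_q + 0.1  # stand-in for a model's predicted mean (hypothetical)
kl_t = kl_isotropic_gaussians(mean_q, var_q, mean_p, var_q)
print("KL term at t=500:", kl_t)
```

When the two variances are set equal, as in this sketch, the KL reduces to a squared distance between the two means scaled by $1/(2\sigma^2)$, which is one reason a Gaussian $p_{\theta}$ with a fixed variance is a convenient choice.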