## Overview
This notebook contains the code for the paper "<>" by <>. The paper is available at <>.

DDPM stands for Denoising Diffusion Probabilistic Model. DDPMs are also known as diffusion models, score-based generative models, or simply [autoencoders](https://benanne.github.io/2022/01/31/diffusion.html). Researchers have achieved remarkable results with them for (un)conditional image/audio/video generation, for example GLIDE, DALL-E 2, Latent Diffusion, and Imagen. This article employs the discrete-time (latent variable model) perspective.

## What is a diffusion model?

A (denoising) diffusion model is not that complex if you compare it to other generative models such as [Normalizing](https://aisuko.gitbook.io/wiki/ai-techniques/framework/ml_training_components#normalization) Flows, GANs, or [VAEs](https://aisuko.gitbook.io/wiki/ai-techniques/stable-diffusion/vae): they all convert noise from some simple distribution to a data sample. This is also the case here, where **a neural network learns to gradually denoise data** starting from pure noise.

In a bit more detail for images, the set-up consists of two processes:

* a fixed (or predefined) forward diffusion process q of our choosing, which gradually adds Gaussian noise to an image until you end up with pure noise.
* a learned reverse denoising process $$p_\theta$$, where a neural network is trained to gradually denoise an image, starting from pure noise, until you end up with an actual image.

For more examples, see [here](https://aisuko.gitbook.io/wiki/ai-techniques/stable-diffusion/diffusion-in-image).

![](https://huggingface.co/blog/assets/78_annotated-diffusion/diffusion_figure.png)

According to the picture above, both the forward and reverse processes, indexed by t, run for a finite number of time steps T (the DDPM authors use T=1000). You start at t=0, where you sample a real image $$\mathbf{x}_0$$ from your data distribution, and the forward process samples some noise from a Gaussian distribution at each time step t, which is added to the image of the previous time step. Given a sufficiently large T and a well-behaved schedule for adding noise at each time step, you end up with what is called an isotropic Gaussian distribution at t=T via a gradual process.

## In more mathematical form

We need a tractable loss function which our neural network needs to optimize. Let $$q(\mathbf{x}_0)$$ be the real data distribution, say of "real images". We can sample from this distribution to get an image, $$\mathbf{x}_0 \sim q(\mathbf{x}_0)$$. We define the forward diffusion process $$q(\mathbf{x}_t | \mathbf{x}_{t-1})$$, which adds Gaussian noise at each time step t, according to a known variance schedule
$$0 < \beta_1 < \beta_2 < ... < \beta_T < 1\\ q(\mathbf{x}_t | \mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t; \sqrt{1 - \beta_t} \mathbf{x}_{t-1}, \beta_t \mathbf{I}).$$

### Gaussian distribution

Recall that a normal distribution (also called Gaussian distribution) is defined by 2 parameters:
* a mean $\mu$
* a variance $\sigma^2 \geq 0$

### Conditional Gaussian Distribution

Basically, each new (slightly noisier) image at time step t is drawn from a **conditional Gaussian distribution** with mean 
$$\mathbf{\mu}_t=\sqrt{1 - \beta_t} \mathbf{x}_{t-1}$$
and variance
$$\mathbf{\sigma}_t^2=\beta_t$$
which we can do by sampling
$$\mathbf{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$$
and then setting
$$\mathbf{x}_t = \sqrt{1-\beta_t} \mathbf{x}_{t-1} + \sqrt{\beta_t} \mathbf{\epsilon}$$
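
As a minimal sketch of the sampling rule above, the snippet below applies a single forward step with NumPy. The function name `forward_step`, the value of `beta_t`, and the 3x8x8 stand-in "image" are assumptions made for illustration only; they do not come from the paper.

```python
import numpy as np

def forward_step(x_prev, beta_t, rng):
    """Draw x_t ~ N(sqrt(1 - beta_t) * x_prev, beta_t * I) for one time step."""
    eps = rng.standard_normal(x_prev.shape)  # epsilon ~ N(0, I)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * eps

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(3, 8, 8))  # hypothetical stand-in for an image
x1 = forward_step(x0, beta_t=0.02, rng=rng)  # slightly noisier version of x0
```

With `beta_t = 0` the step returns its input unchanged, and with `beta_t = 1` the input is discarded and the output is pure Gaussian noise.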


### Variance schedule
Note that the $$\beta_t$$ aren't constant at each time step t (hence the subscript). In fact, one defines a so-called "variance schedule", which can be linear, cosine, etc. So, starting from $$\mathbf{x}_0$$, we end up with $$\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_T$$, where $$\mathbf{x}_T$$ is pure Gaussian noise if we set the schedule appropriately.
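
To make the idea of a schedule concrete, here is a hedged sketch of a linear and a cosine schedule in NumPy. The function names, the endpoints `1e-4` and `0.02`, the offset `s=0.008`, and the clipping bounds are assumptions chosen for illustration; treat them as common defaults rather than values fixed by this notebook.

```python
import numpy as np

def linear_beta_schedule(timesteps, beta_start=1e-4, beta_end=0.02):
    """Betas increasing linearly from beta_start to beta_end (assumed endpoints)."""
    return np.linspace(beta_start, beta_end, timesteps)

def cosine_beta_schedule(timesteps, s=0.008):
    """Betas derived from a squared-cosine curve for the cumulative product of alphas."""
    steps = np.arange(timesteps + 1) / timesteps
    alphas_cumprod = np.cos((steps + s) / (1 + s) * np.pi / 2) ** 2
    alphas_cumprod = alphas_cumprod / alphas_cumprod[0]     # normalize so the curve starts at 1
    betas = 1.0 - alphas_cumprod[1:] / alphas_cumprod[:-1]  # beta_t = 1 - abar_t / abar_{t-1}
    return np.clip(betas, 1e-4, 0.9999)                     # keep every beta strictly inside (0, 1)

linear_betas = linear_beta_schedule(1000)
cosine_betas = cosine_beta_schedule(1000)
```

Both functions return one value of $$\beta_t$$ per time step, so the rest of the forward process can index into the array by t.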

### Solving the problem in reverse
Now, if we knew the conditional distribution
$$p(\mathbf{x}_{t-1}|\mathbf{x}_t)$$
then we could run the process in reverse: by sampling some random Gaussian noise $$\mathbf{x}_T$$, and then gradually "denoising" it so that we end up with a sample from the real distribution, $$\mathbf{x}_0$$.

However, we don't know $$p(\mathbf{x}_{t-1}|\mathbf{x}_t)$$. It is intractable, since it requires knowing the distribution of all possible images in order to calculate the conditional probability. Hence, we are going to leverage a neural network to **approximate (learn) this conditional probability distribution**. Let's call it $$p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)$$, with $$\theta$$ being the parameters of the neural network, updated via gradient descent.

### Building the noise predictor (neural network/ loss function)

We need a neural network to represent a (conditional) probability distribution of the backward process. If we assume this reverse process is Gaussian as well, then recall that any Gaussian distribution is defined by 2 parameters:

* a mean parameterized by $$\mu_\theta$$
* a variance parameterized by $$\Sigma_\theta$$

### Parameterizing the process

The process can then be parametrized as in the formula below, where the mean and variance are also conditioned on the noise level t.

$$ p_\theta (\mathbf{x}_{t-1} | \mathbf{x}_t) = \mathcal{N}(\mathbf{x}_{t-1}; \mu_\theta(\mathbf{x}_{t},t), \Sigma_\theta (\mathbf{x}_{t},t))$$

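### The neural network learns/represents the mean and variance

Learning the variance is not done in the paper, which keeps the variance fixed and lets the neural network learn only the mean. This was later improved in other papers, where a neural network also learns the variance of the backward process, besides the mean.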

## Defining an objective function (by reparametrizing the mean)

To derive an objective function for learning the mean of the backward process, the paper observes that the combination of q and $$p_\theta$$ can be seen as a variational auto-encoder (VAE) (Kingma et al., 2013). Hence, the **variational lower bound** (also called ELBO) can be used to minimize the negative log-likelihood with respect to the ground truth data sample $$\mathbf{x}_0$$. It turns out that the ELBO for this process is a sum of losses at each time step t, $$L = L_0 + L_1 + \dots + L_T$$.
By construction of the forward process q and the backward process, each term (except for $$L_0$$) of the loss is actually the **KL divergence between 2 Gaussian distributions**, which can be written explicitly as an L2-loss with respect to the means.
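
As a hedged illustration of why a KL term turns into an L2-loss, consider the simplified case where both Gaussians share the same isotropic covariance $$\sigma^2 \mathbf{I}$$. The symbols $$\mathbf{\mu}_1$$, $$\mathbf{\mu}_2$$ and $$\sigma$$ are introduced here for illustration only and are not notation from the paper. Under that assumption, the log-determinant and trace terms of the general Gaussian KL formula cancel, leaving only a scaled squared distance between the means:

$$D_{KL}\left( \mathcal{N}(\mathbf{\mu}_1, \sigma^2 \mathbf{I}) \,\|\, \mathcal{N}(\mathbf{\mu}_2, \sigma^2 \mathbf{I}) \right) = \frac{1}{2\sigma^2} \| \mathbf{\mu}_1 - \mathbf{\mu}_2 \|^2$$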

A direct consequence of the constructed forward process q, as shown by Sohl-Dickstein et al., is that we can sample $$\mathbf{x}_t$$ at any arbitrary noise level conditioned on $$\mathbf{x}_0$$ (since a sum of Gaussians is also Gaussian). This is very convenient: we don't need to apply q repeatedly in order to sample $$\mathbf{x}_t$$. We have that:
$$q(\mathbf{x}_t | \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_t; \sqrt{\bar{\alpha}_t} \mathbf{x}_0, (1- \bar{\alpha}_t) \mathbf{I})$$
with $$\alpha_t := 1 - \beta_t$$ and $$\bar{\alpha}_t := \prod_{s=1}^{t} \alpha_s$$

Let's refer to this equation as the "nice property". This means we can sample Gaussian noise, scale it appropriately, and add it to \\(\mathbf{x}_0\\) to get \\(\mathbf{x}_t\\) directly. Note that the \\(\bar{\alpha}_t\\) are functions of the known \\(\beta_t\\) variance schedule and thus are also known and can be precomputed. This then allows us, during training, to **optimize random terms of the loss function \\(L\\)** (or in other words, to randomly sample \\(t\\) during training and optimize \\(L_t\\)).
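
The nice property can be sketched in a few lines of NumPy. The linear schedule, the name `q_sample`, and the 1-indexed convention for `t` are assumptions made for this illustration.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)    # assumed linear variance schedule
alphas = 1.0 - betas
alphas_cumprod = np.cumprod(alphas)   # alpha-bar_t for t = 1..T, precomputed once

def q_sample(x0, t, noise):
    """Sample x_t ~ q(x_t | x_0) in one shot; t is 1-indexed."""
    a_bar = alphas_cumprod[t - 1]
    return np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * noise

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(3, 8, 8))  # hypothetical stand-in for an image
noise = rng.standard_normal(x0.shape)
x_500 = q_sample(x0, t=500, noise=noise)     # jump straight to noise level 500
```

Because `alphas_cumprod` shrinks towards 0 as t grows, the contribution of `x0` fades and the sample approaches pure noise for large t.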

Another beauty of this property, as shown in the paper, is that one can instead **reparametrize the mean to make the neural network learn (predict) the added noise (via a network \\(\mathbf{\epsilon}_\theta(\mathbf{x}_t, t)\\)) for noise level \\(t\\)** in the KL terms which constitute the losses. This means that our **neural network becomes a noise predictor, rather than a (direct) mean predictor**. The mean can be computed as follows:

$$ \mathbf{\mu}_\theta(\mathbf{x}_t, t) = \frac{1}{\sqrt{\alpha_t}} \left(  \mathbf{x}_t - \frac{\beta_t}{\sqrt{1- \bar{\alpha}_t}} \mathbf{\epsilon}_\theta(\mathbf{x}_t, t) \right)$$
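
As a rough sketch, the mean formula above translates directly into NumPy. The linear schedule and the name `predicted_mean` are assumptions, and `eps_pred` stands in for the output of a trained noise predictor; here the true noise is passed in purely to check the formula.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)        # assumed linear variance schedule
alphas_cumprod = np.cumprod(1.0 - betas)  # alpha-bar_t for t = 1..T

def predicted_mean(x_t, eps_pred, t):
    """Mean of p_theta(x_{t-1} | x_t) computed from a noise prediction; t is 1-indexed."""
    beta_t = betas[t - 1]
    alpha_t = 1.0 - beta_t
    a_bar = alphas_cumprod[t - 1]
    return (x_t - beta_t / np.sqrt(1.0 - a_bar) * eps_pred) / np.sqrt(alpha_t)

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(3, 8, 8))  # hypothetical stand-in for an image
eps = rng.standard_normal(x0.shape)
x1 = np.sqrt(alphas_cumprod[0]) * x0 + np.sqrt(1.0 - alphas_cumprod[0]) * eps
mu = predicted_mean(x1, eps, t=1)            # true noise used in place of a network output
```

At t=1 the factor $$\beta_1 / \sqrt{1 - \bar{\alpha}_1}$$ equals $$\sqrt{\beta_1}$$, so a perfect noise prediction makes the mean coincide with `x0`.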

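The final objective function $$L_t$$ then looks as follows (for a random time step t, given $$\mathbf{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$$):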
$$ \| \mathbf{\epsilon} - \mathbf{\epsilon}_\theta(\mathbf{x}_t, t) \|^2 = \| \mathbf{\epsilon} - \mathbf{\epsilon}_\theta( \sqrt{\bar{\alpha}_t} \mathbf{x}_0 + \sqrt{(1- \bar{\alpha}_t)  } \mathbf{\epsilon}, t) \|^2.$$


### Training algorithm

Here, \\(\mathbf{x}_0\\) is the initial (real, uncorrupted) image, and \\(\mathbf{x}_t\\) is the sample at noise level t, given directly by the fixed forward process. \\(\mathbf{\epsilon}\\) is the pure noise sampled at time step t, and \\(\mathbf{\epsilon}_\theta (\mathbf{x}_t, t)\\) is our neural network. The neural network is optimized using a simple mean squared error (MSE) between the true and the predicted Gaussian noise.

<p align="center">
    <img src="https://huggingface.co/blog/assets/78_annotated-diffusion/training.png" width="500" />
</p>

According to the picture above:
* we take a random sample \\(\mathbf{x}_0\\) from the real, unknown, and possibly complex data distribution \\(q(\mathbf{x}_0)\\)
* we sample a noise level t uniformly between 1 and T (i.e., a random time step)
* we sample some noise from a Gaussian distribution and corrupt the input by this noise at level t (using the nice property defined above)
* the neural network is trained to predict this noise based on the corrupted image \\(\mathbf{x}_t\\) (i.e., noise applied on \\(\mathbf{x}_0\\) based on the known schedule \\(\beta_t\\))

In reality, all of this is done on batches of data, as one uses stochastic gradient descent to optimize the neural network parameters.
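
The steps listed above can be sketched for one batch as follows. This is only an outline in NumPy: `dummy_model` is a hypothetical placeholder that always predicts zero noise, the linear schedule and tensor shapes are assumptions, and no gradient step is performed. A real implementation would use a trainable network and an optimizer.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)        # assumed linear variance schedule
alphas_cumprod = np.cumprod(1.0 - betas)  # alpha-bar_t for t = 1..T

def dummy_model(x_t, t):
    """Hypothetical stand-in for epsilon_theta(x_t, t); always predicts zero noise."""
    return np.zeros_like(x_t)

def training_loss(model, x0_batch, rng):
    """MSE between true and predicted noise for one batch of shape (B, C, H, W)."""
    batch_size = x0_batch.shape[0]
    t = rng.integers(1, T + 1, size=batch_size)         # one time step per sample, uniform in {1..T}
    noise = rng.standard_normal(x0_batch.shape)         # epsilon ~ N(0, I)
    a_bar = alphas_cumprod[t - 1].reshape(-1, 1, 1, 1)  # broadcast over C, H, W
    x_t = np.sqrt(a_bar) * x0_batch + np.sqrt(1.0 - a_bar) * noise  # nice property
    return np.mean((noise - model(x_t, t)) ** 2)

rng = np.random.default_rng(0)
x0_batch = rng.uniform(-1.0, 1.0, size=(4, 3, 8, 8))    # hypothetical batch of images
loss = training_loss(dummy_model, x0_batch, rng)
```

Since the placeholder predicts zeros, the loss here is just the mean squared value of the sampled noise, which is close to 1; a trained network would drive it lower.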

## The neural network

The neural network needs to take in a noised image at a particular time step and return the predicted noise. Note that the predicted noise is a tensor that has the same size/resolution as the input image. So, technically, the network takes in and outputs tensors of the same shape. What type of neural network can we use for this?

What is typically used here is very similar to that of an [Autoencoder](https://aisuko.gitbook.io/wiki/ai-techniques/stable-diffusion/vae#variational-autoencoder), which you may remember from typical "intro to deep learning" tutorials.

In terms of architecture, the paper went for a U-Net, introduced by ([Ronneberger et al., 2015](https://arxiv.org/abs/1505.04597)) (which, at the time, achieved state-of-the-art results for medical image segmentation). This network, like any autoencoder, consists of a bottleneck in the middle that makes sure the network learns only the most important information. Importantly, it introduced **residual connections** between the encoder and decoder, greatly improving gradient flow (inspired by ResNet in He et al., 2015).

<p align="center">
    <img src="https://huggingface.co/blog/assets/78_annotated-diffusion/unet_architecture.jpg" width="500" />
</p>

As can be seen, a U-Net model first downsamples the input (i.e. makes the input smaller in terms of spatial resolution), after which upsampling is performed.
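
To illustrate only the change in spatial resolution, here is a small NumPy sketch with no learned layers. The average pooling and nearest-neighbour upsampling are assumptions chosen for simplicity; an actual U-Net uses learned convolutions and connections between encoder and decoder.

```python
import numpy as np

def downsample(x):
    """Halve H and W with 2x2 average pooling; x has shape (C, H, W) with even H and W."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def upsample(x):
    """Double H and W by repeating each pixel (nearest-neighbour upsampling)."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

x = np.random.default_rng(0).standard_normal((3, 32, 32))  # hypothetical noisy image
bottleneck = downsample(downsample(x))    # shape (3, 8, 8)
out = upsample(upsample(bottleneck))      # shape (3, 32, 32), same as the input
```

The output has the same shape as the input, which is what the noise predictor requires.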

## Let's implement this network

### Preparation
Install the required packages and import the libraries. `einops` is a flexible and powerful tool for tensor operations; see [more details](https://github.com/arogozhnikov/einops).

In [None]:
# For the Kaggle environment
!pip install torch==2.0.1 einops==0.6.1 datasets==2.13.1 matplotlib==3.7.1 tqdm==4.65.0

In [None]:
#TODO

## Credit

* [The Annotated Diffusion Model](https://huggingface.co/blog/annotated-diffusion)