# Diffusion models

All of this is basically an application of compression theory.

Let's [read a new paper](https://arxiv.org/pdf/2209.00796.pdf) and highlight important things.

Diffusion models are deep generative models, used for images, videos, and molecular design.
[The first fundamental paper](https://arxiv.org/pdf/1503.03585.pdf) is from 2015, so very recent.
[Another one, on denoising diffusion models](https://arxiv.org/pdf/2006.11239.pdf), is from 2020.
This field is moving very fast.
The main application is text-to-image generation, but there are many others.

Up to now, we saw encoders, GANs, and so on.
Also, score-based models use stochastic differential equations.

How to?
We take a sample from the real data, then add a tiny amount of noise to it (a gaussian perturbation).
Repeating this step, we basically destroy all information: a forward process (a Markov gaussian process) goes from the image to full noise.
Next, we add a neural network to learn the reverse: we **learn an impossible process**!
We're actually asking a suitable neural network to remove noise from an image, i.e. to create information.
It will still be a combination of gaussian processes, but a complex one.

Let's go into the math.
From the first paper we see measures, factorization, entropy... all topics of this course.
We've two Markov chains: one forward and one reverse.
The forward one adds noise, the reverse removes it by learning transition kernels.
New data points are generated from random vectors (completely gaussian images) going through the reverse chain.
As physicists, we can think of this process as acting on generic data (e.g. non-euclidean).

Let $\vec{x}_0$ be a point in our space, i.e. an image, following a distribution $q(\vec{x}_0)$.
Then, with the forward Markov chain we generate a sequence of random variables using transition kernels $q(\vec{x}_t|\vec{x}_{t-1})$.
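As we know, we can factorize the process thanks to the Markov property: $q(\vec{x}_1, \dots, \vec{x}_T|\vec{x}_0) = \prod_{t=1}^{T} q(\vec{x}_t|\vec{x}_{t-1})$.
What's a transition kernel?
It could be any conditional measure which satisfies a normalization condition.
In particular, we choose a gaussian distribution whose mean is the previous point, slightly shrunk (perturbed): $q(\vec{x}_t|\vec{x}_{t-1}) = \mathcal{N}\left(\sqrt{1-\beta_t}\,\vec{x}_{t-1}, \beta_t \mathbb{I}\right)$.
In the same way we also change the variance.
This shift $\beta_t$ is a hyperparameter of the model; in what follows $\alpha_t = 1 - \beta_t$.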
Each pixel of the next image is generated by the gaussian, **independently**.
Why gaussian?
Well, the composition of several gaussians is still a gaussian... which is a very useful property.
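
The forward chain can be sketched in a few lines of NumPy. This is only an illustration, assuming the parametrization $\mathcal{N}\left(\sqrt{1-\beta_t}\,\vec{x}_{t-1}, \beta_t \mathbb{I}\right)$ used in the denoising diffusion paper linked above; the linear schedule, the toy 8x8 "image" and the name `forward_step` are assumptions made for the example.

```python
import numpy as np

def forward_step(x_prev, beta, rng):
    """Sample x_t ~ N(sqrt(1 - beta) * x_prev, beta * I)."""
    noise = rng.standard_normal(x_prev.shape)   # independent noise for each pixel
    return np.sqrt(1.0 - beta) * x_prev + np.sqrt(beta) * noise

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(8, 8))   # toy "image" (assumed scaled to [-1, 1])
betas = np.linspace(1e-4, 0.02, 1000)      # assumed linear noise schedule

x = x0
for beta in betas:                          # run the whole forward Markov chain
    x = forward_step(x, beta, rng)
# after this many steps almost nothing of x0 is left in x
```

Each step only needs the previous point, which is exactly the Markov property used in the factorization.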

Given that

$\vec{x}_t = \sqrt{\alpha_t}\vec{x}_{t-1} + \sqrt{1-\alpha_t}\epsilon_1$

$\vec{x}_{t-1} = \sqrt{\alpha_{t-1}}\vec{x}_{t-2} + \sqrt{1-\alpha_{t-1}}\epsilon_2$

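and that the sum of two independent gaussian variables with means $\bar{x}_1, \bar{x}_2$ and variances $\sigma^2_1, \sigma^2_2$ follows $\mathcal{N}\left(\bar{x}_1 + \bar{x}_2, \sigma^2_1 + \sigma^2_2\right)$,

we can compose the entire process: $\vec{x}_t = \sqrt{\bar{\alpha}_t}\vec{x}_0 + \sqrt{1-\bar{\alpha}_t}\epsilon$, with $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$.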
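
To see the composition at work: substituting the second update equation above into the first gives a mean rescaled by $\sqrt{\alpha_t \alpha_{t-1}}$ and a variance $\alpha_t(1-\alpha_{t-1}) + (1-\alpha_t) = 1 - \alpha_t\alpha_{t-1}$. The sketch below iterates this rule and checks it against the cumulative product $\bar{\alpha}_t$. The schedule and the names `alpha_bar` and `q_sample` are illustrative assumptions, not part of these notes.

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)   # assumed linear schedule
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)          # alpha_bar[t] = alphas[0] * ... * alphas[t]

def q_sample(x0, t, rng):
    """Jump directly from x_0 to the noised point at (0-based) step t."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

# Check the composition rule by propagating mean coefficient and variance.
mean_coef, var = 1.0, 0.0
for a in alphas:
    mean_coef = np.sqrt(a) * mean_coef   # the mean is rescaled by sqrt(alpha_t)
    var = a * var + (1.0 - a)            # variances of independent gaussians add

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(8, 8))
x_t, eps = q_sample(x0, 499, rng)        # one shot instead of 500 small steps
```

At the end of the loop `mean_coef**2` matches `alpha_bar[-1]` and `var` matches `1 - alpha_bar[-1]`, so any intermediate point can be sampled directly from $\vec{x}_0$.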
Given $\vec{x}_0$, we add the first noise $\epsilon$, normally distributed.
We're not only destroying the image, we're destroying its distribution.
Once everything has been destroyed, we start from noise and try to remove it, to obtain a clean image.
We actually want to remove noise and keep a structure.
We try to go back in time, i.e. we use the reverse process $p_\theta(\vec{x}_{t-1}|\vec{x}_t)$.
This process is still a gaussian, but a non-trivial one.
Average and standard deviation have a complex form (matrices), but the idea is the same.
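
A rough sketch of the reverse chain, assuming the parametrization of the denoising diffusion paper linked above, in which the network predicts the noise and the reverse variance is fixed to $\beta_t$. The function `predict_noise` is a hypothetical placeholder for a trained network, so the output here is not a meaningful image.

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)   # assumed linear schedule
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def predict_noise(x_t, t):
    """Hypothetical placeholder for the trained network eps_theta(x_t, t)."""
    return np.zeros_like(x_t)

def reverse_step(x_t, t, rng):
    """Sample from p_theta(x_{t-1} | x_t): a gaussian whose mean depends on the network."""
    eps_hat = predict_noise(x_t, t)
    mean = (x_t - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(alphas[t])
    if t == 0:
        return mean                      # no noise is added at the last step
    return mean + np.sqrt(betas[t]) * rng.standard_normal(x_t.shape)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))          # start from a completely gaussian "image"
for t in reversed(range(len(betas))):    # walk the reverse chain back to t = 0
    x = reverse_step(x, t, rng)
```

Swapping `predict_noise` for a real trained model is the only change needed to turn this loop into a sampler.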
Now we can sample and train, minimizing the Kullback-Leibler divergence.
Due to measure factorization, we can write the KL divergence as a sum.
The KL between two gaussian distributions with the same covariance reduces to a squared $\mathbb{L}^2$ distance between their means.
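
A quick numerical check of that last statement, using the closed-form KL divergence between gaussians with diagonal covariance. When the two variances coincide, only the distance between the means survives, divided by $2\sigma^2$. The helper name `kl_diag_gaussians` and the test values are made up for the example.

```python
import numpy as np

def kl_diag_gaussians(mu1, sigma1, mu2, sigma2):
    """KL( N(mu1, sigma1^2 I) || N(mu2, sigma2^2 I) ), summed over dimensions."""
    per_dim = (np.log(sigma2 / sigma1)
               + (sigma1**2 + (mu1 - mu2)**2) / (2.0 * sigma2**2)
               - 0.5)
    return np.sum(per_dim)

rng = np.random.default_rng(0)
mu1, mu2 = rng.standard_normal(16), rng.standard_normal(16)
sigma = 0.3                                       # same standard deviation for both

kl = kl_diag_gaussians(mu1, sigma, mu2, sigma)
l2 = np.sum((mu1 - mu2)**2) / (2.0 * sigma**2)    # squared L2 distance of the means
# kl and l2 agree up to floating point error
```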

That's it.
We're free to choose the neural network to use.
One can find more articles [here](https://github.com/heejkoo/Awesome-Diffusion-Models).

For more mathematics, see [this reference](https://lilianweng.github.io/posts/2021-07-11-diffusion-models/).

> Let's try to implement this following [DiffusionFastForward](https://github.com/mikonvergence/DiffusionFastForward) course.

Another technique is to embed pixels in a lower dimensional space, using convolution operators.
This is called [Latent diffusion](https://arxiv.org/pdf/2112.10752.pdf).

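Notice that, in the main formula (the variational bound $L = L_T + \sum_{t>1} L_{t-1} + L_0$), $L_T$ has no trainable parameters, since it only compares the fully noised image with pure noise, while $L_0$ is a separate reconstruction term.
The remaining terms are Kullback-Leibler divergences between gaussians.
Computing them results in a squared $\mathbb{L}^2$ norm of the difference between the true noise and the noise predicted by the network.
This formula simplifies a lot, in a way we can implement it.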
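
A minimal sketch of the simplified objective, assuming the form used in the denoising diffusion paper linked above: pick a random step, noise the image in one shot, and ask the network to recover the noise. Here `predict_noise` is a hypothetical stand-in for the network, and the gradient step is left out.

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)   # assumed linear schedule
alpha_bar = np.cumprod(1.0 - betas)

def predict_noise(x_t, t):
    """Hypothetical stand-in for the network eps_theta(x_t, t); it always predicts zero."""
    return np.zeros_like(x_t)

def simple_loss(x0, rng):
    """One Monte Carlo estimate of the squared error between true and predicted noise."""
    t = rng.integers(len(betas))                  # uniformly random time step
    eps = rng.standard_normal(x0.shape)           # the noise the network should recover
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    # mean over pixels, i.e. a rescaled squared L2 norm
    return np.mean((eps - predict_noise(x_t, t)) ** 2)

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(8, 8))   # toy "image"
loss = simple_loss(x0, rng)
```

In a real training loop this value would be averaged over a batch and minimized with respect to the network parameters.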

In the original paper there's a strong assumption: the covariance matrix is diagonal.

The most important t[rick](https://www.youtube.com/watch?v=dQw4w9WgXcQ&ab_channel=RickAstley) for learning is to use latent diffusion.
In the latent space we can embed together image and text!

Then we can do conditioned generation (e.g. text to image).