# Goal

* The goal of this notebook is to understand the theory behind the paper [Denoising Diffusion Probabilistic Models](https://arxiv.org/abs/2006.11239).

* Since the Diffusion Model involves a lot of maths, we will also work through the mathematics it is built on.

* Our ultimate goal is to implement this paper in PyTorch, which we will cover in the following notebook (to be done).

# Diffusion Model - The Intuition

* The literal meaning of the word Diffusion is the net movement of anything (for example, atoms, ions, molecules, energy) generally from **a region of higher concentration to a region of lower concentration**.

* In our context, Diffusion is the **gradual movement** of a sample (for example, image) from a **Simple Probability Distribution to a Complex Probability Distribution or vice versa**.

* You can consider the distribution over images of cats, dogs, churches, etc. a Complex Distribution, and a noisy gibberish image drawn from a Normal Distribution N(0, 1) a Simple Distribution.

* In simple words, a Diffusion Model learns **how to transform a Normal Distribution into a Complex Distribution**, i.e. the mapping Simple --> Complex Distribution.

* Once this mapping is learnt, you can simply draw the samples from a Normal Distribution and generate the images of Cat/Dog/Church, etc by mapping it to the learnt Complex Distribution.

# Diffusion Model - The Structure

We can break the Diffusion Model down into two parts:

1. Forward Process
2. Reverse Process

## 1. Forward Process (Complex => Simple)

* The Forward Process samples an image from the input data and gradually adds Gaussian noise to it over T time steps, with a **carefully chosen mean and variance** at each step.

![picture](https://drive.google.com/uc?export=view&id=1LYkdsDmA_bEfXr5TFW233p_rn6K7GV2H)

* We will show that if T is large enough, the final image x<sub>T</sub> at time step T approximately follows a Normal Distribution with mean = 0 and variance = 1.

* Forward Process is **completely known to us** since we know the input image, as well as Parameters of the Gaussian noise being added at each step. i.e. we control the forward process.

* The Forward Process is also **Markovian**: given the current state x<sub>t-1</sub>, the next state x<sub>t</sub> does not depend on any earlier state.

* Let x<sub>t</sub> be the image at time step t and β<sub>t</sub> the parameter that controls how much noise is added at that step. Under the Markovian assumption, the authors define the forward process as:

$$ q(x_{t} | x_{t-1}) := \mathcal{N} \left( x_{t}; \sqrt{1 - \beta_{t}}\,x_{t-1}, \beta_{t} I \right) \text{ where } 0 \lt \beta_{t} \lt 1 \text{ for all } t $$
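As a minimal sketch of this definition, one forward step can be sampled as x<sub>t</sub> = √(1 − β<sub>t</sub>)·x<sub>t-1</sub> + √β<sub>t</sub>·ε with ε ~ N(0, 1). The function name `forward_step` and the use of a single scalar "pixel" are illustrative assumptions; a real implementation would apply the same formula elementwise to an image tensor.

```python
import math
import random

def forward_step(x_prev, beta_t, rng=random):
    """Sample x_t ~ N(sqrt(1 - beta_t) * x_prev, beta_t) for one scalar pixel."""
    if not 0.0 < beta_t < 1.0:
        raise ValueError("beta_t must lie in (0, 1)")
    noise = rng.gauss(0.0, 1.0)  # epsilon ~ N(0, 1)
    return math.sqrt(1.0 - beta_t) * x_prev + math.sqrt(beta_t) * noise

rng = random.Random(0)           # seeded generator so the run is repeatable
x0 = 0.8                         # hypothetical pixel value
x1 = forward_step(x0, beta_t=0.02, rng=rng)
print(x1)
```

With a small β<sub>t</sub> the output stays close to the input; the signal is only slightly shrunk and slightly perturbed at each step.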

* The reason for choosing this form is that if we keep adding noise this way, the image at time step T approaches a Normal Distribution with mean = 0 and var = 1. This is exactly what we want from the forward process: a mapping from a Complex Distribution to a Simple Distribution. Let's prove this.

![picture](https://drive.google.com/uc?export=view&id=1Vbop6oRfz9hpmc5isW89AN-e8QHJQd0M)

![picture](https://drive.google.com/uc?export=view&id=1Vg-h5Mm5Jui5-ArmlwjkndCzoFlPnKZR)
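The convergence claim can also be checked numerically. Because every forward step is linear with Gaussian noise, the mean and variance of x<sub>t</sub> given x<sub>0</sub> obey the recursions mean<sub>t</sub> = √(1 − β<sub>t</sub>)·mean<sub>t-1</sub> and var<sub>t</sub> = (1 − β<sub>t</sub>)·var<sub>t-1</sub> + β<sub>t</sub>. The sketch below assumes a linear schedule from 1e-4 to 0.02 with T = 1000 (the values reported in the paper); the helper names are illustrative.

```python
import math

def linear_betas(T, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise levels beta_1..beta_T (schedule values are an assumption)."""
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]

def moments_after_forward(x0, betas):
    """Exact mean and variance of x_T given a scalar starting pixel x0."""
    mean, var = x0, 0.0           # x_0 is known, so it starts with zero variance
    for beta in betas:
        mean = math.sqrt(1.0 - beta) * mean
        var = (1.0 - beta) * var + beta
    return mean, var

mean_T, var_T = moments_after_forward(x0=5.0, betas=linear_betas(T=1000))
print(mean_T, var_T)              # mean close to 0, variance close to 1
```

Even for a starting value far from zero, the mean is driven toward 0 and the variance toward 1, which is the behaviour the derivation above establishes.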

* For the Forward Process, we do not have to add noise step by step: x<sub>t</sub> at any time step t can be sampled directly from x<sub>0</sub> in closed form. Please check the derivation below.

![picture](https://drive.google.com/uc?export=view&id=1cTdqurscidt8fdzKrvK0tXw8lLuPJ0R0)
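The closed form from the paper is x<sub>t</sub> = √ᾱ<sub>t</sub>·x<sub>0</sub> + √(1 − ᾱ<sub>t</sub>)·ε, where α<sub>t</sub> = 1 − β<sub>t</sub> and ᾱ<sub>t</sub> is the product of α<sub>1</sub> … α<sub>t</sub>. A minimal sketch, assuming a linear β schedule and scalar pixels (the function names are illustrative, not taken from any library):

```python
import math

def linear_betas(T, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise levels beta_1..beta_T (schedule values are an assumption)."""
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]

def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def q_sample(x0, t, alpha_bars, eps):
    """Closed-form x_t for a scalar pixel; t is 1-indexed, eps is a draw from N(0, 1)."""
    a_bar = alpha_bars[t - 1]
    return math.sqrt(a_bar) * x0 + math.sqrt(1.0 - a_bar) * eps

betas = linear_betas(T=1000)
alpha_bars = cumulative_alphas(betas)   # computed once, reused for every t
x_500 = q_sample(x0=0.8, t=500, alpha_bars=alpha_bars, eps=0.3)
print(x_500)
```

Because `alpha_bars` is computed once up front, any x<sub>t</sub> costs a single multiply-add, which is what makes training on randomly chosen time steps cheap.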

## 2. Reverse Process (Simple => Complex)

* Given the noisy image x<sub>T</sub> at time step T obtained by the forward process, the goal of the reverse process is to **gradually denoise the image step by step to reconstruct the original image x<sub>0</sub>.**

![picture](https://drive.google.com/uc?export=view&id=1JvdxspnCoyga747DachDR8a0YzZjvgKC)

* In other words, we want to find the **parameterized reverse distribution** $$ p_{\theta}(x_{t-1} | x_t) $$

* The parameters θ come from the **U-Net**, which is used to predict *x<sub>t-1</sub>* from *x<sub>t</sub>*.

* These parameters are learnt by maximizing the expected log likelihood of the observed input samples $$ E[\log p_{\theta}(x_0)] $$

* Let's dig into this term to get to the root of it.

![picture](https://drive.google.com/uc?export=view&id=1rgNwdpvO8qRABUS1XEyQ_kdBkckf_JVp)

![picture](https://drive.google.com/uc?export=view&id=1VfFn2uONsxpUS73D4HT6lB_m3PhQV0Se)

The Final Loss function can be written as:

![picture](https://drive.google.com/uc?export=view&id=1hhsdB-saGctLMSX3UUh-e6VMgAlkY4_2)
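For reference, the same decomposition as written in the paper (its Eq. 5) is given below. It is transcribed from the paper rather than from the figure above, so the layout may differ slightly:

$$ L = \mathbb{E}_q \left[ \underbrace{D_{KL}\left(q(x_T | x_0) \,\|\, p(x_T)\right)}_{L_T} + \sum_{t > 1} \underbrace{D_{KL}\left(q(x_{t-1} | x_t, x_0) \,\|\, p_{\theta}(x_{t-1} | x_t)\right)}_{L_{t-1}} \underbrace{- \log p_{\theta}(x_0 | x_1)}_{L_0} \right] $$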

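* Note that *L<sub>T</sub>* has no learnable parameters and hence can be dropped from the Loss Function.

* According to [this source](https://learnopencv.com/denoising-diffusion-probabilistic-models/), the authors also drop the *L<sub>0</sub>* term because they obtained better results without it. (In the paper itself, *L<sub>0</sub>* is defined through a separate discrete decoder.)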

* So our Final Loss function boils down to:

![picture](https://drive.google.com/uc?export=view&id=16BU2koPc-R5rf5eo3Q0flMDUeLfj_xqf)

* This can be thought of as **minimizing the distance (KL divergence) between the Ground Truth Reverse Distribution and the Predicted Reverse Distribution**.

* Let's first derive the Ground Truth Reverse Distribution *q(x<sub>t-1</sub>|x<sub>t</sub>, x<sub>0</sub>)* also known as the forward process posterior distribution.

![picture](https://drive.google.com/uc?export=view&id=1ZbGDC3U_iuKDfzj93AASfMLkfTr31fwo)

* Solving this, we find that *q(x<sub>t-1</sub>|x<sub>t</sub>, x<sub>0</sub>)* is also a Gaussian Distribution.

![picture](https://drive.google.com/uc?export=view&id=1QdVNJZBa5pLu0NexZwLg2Fek23gvIkbe)
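The posterior parameters given in the paper are μ̃<sub>t</sub> = (√ᾱ<sub>t-1</sub>·β<sub>t</sub> / (1 − ᾱ<sub>t</sub>))·x<sub>0</sub> + (√α<sub>t</sub>·(1 − ᾱ<sub>t-1</sub>) / (1 − ᾱ<sub>t</sub>))·x<sub>t</sub> and β̃<sub>t</sub> = ((1 − ᾱ<sub>t-1</sub>) / (1 − ᾱ<sub>t</sub>))·β<sub>t</sub>. A small sketch that evaluates them for a scalar pixel follows; the function names and the three-step schedule are assumptions chosen only to keep the example short.

```python
import math

def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def posterior_params(x0, x_t, t, betas, alpha_bars):
    """Mean and variance of q(x_{t-1} | x_t, x_0); t is 1-indexed and must be >= 2."""
    if t < 2:
        raise ValueError("the posterior is defined here for t >= 2")
    beta_t = betas[t - 1]
    alpha_t = 1.0 - beta_t
    a_bar_t = alpha_bars[t - 1]
    a_bar_prev = alpha_bars[t - 2]
    coef_x0 = math.sqrt(a_bar_prev) * beta_t / (1.0 - a_bar_t)
    coef_xt = math.sqrt(alpha_t) * (1.0 - a_bar_prev) / (1.0 - a_bar_t)
    mean = coef_x0 * x0 + coef_xt * x_t
    var = (1.0 - a_bar_prev) / (1.0 - a_bar_t) * beta_t   # beta_tilde_t
    return mean, var

betas = [0.1, 0.2, 0.3]                    # toy schedule, illustration only
alpha_bars = cumulative_alphas(betas)
print(posterior_params(x0=0.8, x_t=0.5, t=3, betas=betas, alpha_bars=alpha_bars))
```

Note that the variance β̃<sub>t</sub> depends only on the schedule, not on the image, which is why it can be fixed in advance.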

* The authors model *p<sub>θ</sub>(x<sub>t-1</sub>|x<sub>t</sub>)* as a Gaussian Distribution as well, with a learned mean µ<sub>θ</sub>(x<sub>t</sub>, t) and a fixed variance β<sup>~</sup><sub>t</sub> (the same as the Ground Truth Reverse Distribution derived in the previous step).

* The KL Divergence between two Gaussian Distributions can be written as: [[Source]](https://stats.stackexchange.com/questions/7440/kl-divergence-between-two-univariate-gaussians)

![picture](https://drive.google.com/uc?export=view&id=1fMJqjXltqvRjMTTXss5fa-OfDxboooo3)

* Using this formula and plugging in σ<sub>1</sub> = σ<sub>2</sub>, the log term vanishes and the loss function boils down to:

![picture](https://drive.google.com/uc?export=view&id=1mBAbufeF3hWcGmSVqSl37xe3jgdCkOll)
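The univariate formula KL(N(μ<sub>1</sub>, σ<sub>1</sub>²) ‖ N(μ<sub>2</sub>, σ<sub>2</sub>²)) = log(σ<sub>2</sub>/σ<sub>1</sub>) + (σ<sub>1</sub>² + (μ<sub>1</sub> − μ<sub>2</sub>)²) / (2σ<sub>2</sub>²) − 1/2 and its equal-variance special case can be checked with a few lines of Python. The function names are illustrative.

```python
import math

def kl_gaussians(mu1, sigma1, mu2, sigma2):
    """KL( N(mu1, sigma1^2) || N(mu2, sigma2^2) ) for univariate Gaussians."""
    if sigma1 <= 0.0 or sigma2 <= 0.0:
        raise ValueError("standard deviations must be positive")
    return (math.log(sigma2 / sigma1)
            + (sigma1 ** 2 + (mu1 - mu2) ** 2) / (2.0 * sigma2 ** 2)
            - 0.5)

def kl_equal_variance(mu1, mu2, sigma):
    """Special case sigma1 == sigma2: only the squared mean difference survives."""
    return (mu1 - mu2) ** 2 / (2.0 * sigma ** 2)

print(kl_gaussians(1.0, 2.0, 3.0, 2.0), kl_equal_variance(1.0, 3.0, 2.0))
```

With equal variances the two functions agree, which is why the per-step loss turns into a scaled squared distance between the two means.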

* The loss term can be further rewritten in terms of the noise *ε* as shown below.

![picture](https://drive.google.com/uc?export=view&id=1VETWuKOJDis_20X2npeZH6p9vAUa9nQL)

* Plugging equations 1 and 2 into the loss term *L<sub>t-1</sub>*, it reduces to:

![picture](https://drive.google.com/uc?export=view&id=12q2yAclDoUXgPeDxcm01KasyykQp505R)
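In the paper, any time-dependent weight in front of the squared error is then dropped, which gives the simplified objective used for training (its Eq. 14). It is transcribed here as a sketch, with ε<sub>θ</sub> denoting the U-Net's noise prediction and t drawn uniformly from 1 to T:

$$ L_{\text{simple}}(\theta) := \mathbb{E}_{t, x_0, \epsilon} \left[ \left\| \epsilon - \epsilon_{\theta}\left( \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon,\ t \right) \right\|^2 \right] $$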

# Training

![picture](https://drive.google.com/uc?export=view&id=1F4pkWxQFrGpFW1qztKN-awGyUj6FmaVF)
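The training algorithm above boils down to: pick a data sample, pick a random time step, noise the sample in closed form, and penalize the squared error of the predicted noise. The sketch below computes only that loss for scalar pixels; `eps_model` is a hypothetical stand-in for the U-Net, and the gradient step that a real PyTorch implementation would take on this loss is left out.

```python
import math
import random

def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def training_loss(eps_model, x0_batch, alpha_bars, rng):
    """Monte Carlo estimate of the noise-prediction loss over one batch."""
    T = len(alpha_bars)
    total = 0.0
    for x0 in x0_batch:
        t = rng.randint(1, T)                      # t ~ Uniform({1, ..., T})
        eps = rng.gauss(0.0, 1.0)                  # eps ~ N(0, 1)
        a_bar = alpha_bars[t - 1]
        x_t = math.sqrt(a_bar) * x0 + math.sqrt(1.0 - a_bar) * eps
        total += (eps - eps_model(x_t, t)) ** 2    # squared error on the noise
    return total / len(x0_batch)

alpha_bars = cumulative_alphas([1e-4 + i * (0.02 - 1e-4) / 999 for i in range(1000)])
untrained = lambda x_t, t: 0.0                     # toy model that always predicts zero noise
print(training_loss(untrained, [0.8, -0.2, 0.5, 0.1], alpha_bars, random.Random(0)))
```

A model that predicts the added noise perfectly drives this value to zero, while the zero-predicting toy model above leaves it near the variance of ε.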

* Once training is done, sampling is performed using the following algorithm:

![picture](https://drive.google.com/uc?export=view&id=1SoSM6DHLnMLJiOqx6qCXTIr1Zxqp3_2U)
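The sampling loop can be sketched in the same scalar setting: start from x<sub>T</sub> ~ N(0, 1) and repeatedly apply x<sub>t-1</sub> = (x<sub>t</sub> − β<sub>t</sub>/√(1 − ᾱ<sub>t</sub>)·ε<sub>θ</sub>(x<sub>t</sub>, t)) / √α<sub>t</sub> + σ<sub>t</sub>·z, with no noise added at the last step. This sketch assumes σ<sub>t</sub>² = β̃<sub>t</sub>, and it replaces the trained U-Net with a hypothetical oracle for a "dataset" that contains the single value `c`, so that the result is easy to check.

```python
import math
import random

def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def p_sample_step(x_t, t, eps_pred, betas, alpha_bars, z):
    """One reverse step x_t -> x_{t-1} for a scalar pixel (t is 1-indexed)."""
    beta_t = betas[t - 1]
    alpha_t = 1.0 - beta_t
    a_bar_t = alpha_bars[t - 1]
    a_bar_prev = alpha_bars[t - 2] if t > 1 else 1.0
    mean = (x_t - beta_t / math.sqrt(1.0 - a_bar_t) * eps_pred) / math.sqrt(alpha_t)
    var = (1.0 - a_bar_prev) / (1.0 - a_bar_t) * beta_t   # beta_tilde_t, zero at t = 1
    return mean + math.sqrt(var) * z

def sample(eps_model, betas, rng):
    """Run the full reverse chain from x_T ~ N(0, 1) down to x_0."""
    alpha_bars = cumulative_alphas(betas)
    x = rng.gauss(0.0, 1.0)
    for t in range(len(betas), 0, -1):
        z = rng.gauss(0.0, 1.0) if t > 1 else 0.0         # no noise on the final step
        x = p_sample_step(x, t, eps_model(x, t), betas, alpha_bars, z)
    return x

def make_oracle(c, betas):
    """Toy noise predictor that is exact when every training image equals c."""
    alpha_bars = cumulative_alphas(betas)
    def eps_model(x_t, t):
        a_bar = alpha_bars[t - 1]
        return (x_t - math.sqrt(a_bar) * c) / math.sqrt(1.0 - a_bar)
    return eps_model

betas = [1e-4 + i * (0.02 - 1e-4) / 999 for i in range(1000)]
print(sample(make_oracle(0.8, betas), betas, random.Random(0)))
```

With the oracle, the chain ends at `c` up to floating-point error; with a trained network in place of the oracle, the same loop produces new samples from the learnt distribution.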

# References:

1. [Original Paper: Denoising Diffusion Probabilistic Models](https://arxiv.org/abs/2006.11239)
2. [Amazing YouTube Video on DDPM by ExplainingAI](https://www.youtube.com/watch?v=H45lF4sUgiE&lc=UgyatBNQ6TinVIgOgap4AaABAg.A2XIyuLyPwjA2_RsiJBPLP)
3. [Another Amazing YouTube Video on ELBO by Umar Jamil](https://www.youtube.com/watch?v=H45lF4sUgiE&lc=UgyatBNQ6TinVIgOgap4AaABAg.A2XIyuLyPwjA2_RsiJBPLP)