# A Basic Tutorial of Stable Diffusion with Math and Code

As learners like you, when we first encountered the state-of-the-art Stable Diffusion model, we found it complicated and perplexing. Because it is built on empirical results and combines several distinct parts, the model is hard to appreciate, and even hard to follow, without a clear and logical path😖. If you feel the same way, don't worry, we were all in the same boat😤. We looked through many materials and tutorials, but could not find one that was comprehensive enough in both logic and code😔.

Therefore, we aim to create a natural and straightforward introduction to both the foundations and the implementation of the model🤗.

So let's begin our journey!

# What is Stable Diffusion?

According to Wikipedia and its GitHub page, Stable Diffusion is a deep learning, text-to-image model released in 2022 and based on diffusion techniques. From this, we know its backbone is essentially a diffusion model. So, before we dive into Stable Diffusion itself, we think it's better to learn what diffusion is, or more specifically, DDPM: Denoising Diffusion Probabilistic Models.

# Denoising Diffusion Probabilistic Models

If you are interested in reading the original DDPM paper by Ho et al., published in 2020, here is the link: https://arxiv.org/pdf/2006.11239.pdf.

However, the original paper is really math-heavy and skips a lot of details. Thus, we provide our own understanding, which should be slightly easier (?) to comprehend.

In short, the DDPM (diffusion) model is basically a process of adding noise and then denoising, which can be easily divided into two separate directions: the forward process (adding noise) and the reverse process (denoising). The general idea is this: if, by definition, we know how to iteratively add noise to an image until it eventually becomes completely noisy (in a sense, equivalent to random samples), could we use a deep learning network to learn the noise added to the image at each step?

And if we could accomplish that, does it mean that we have a way to predict the added noise at each state of the process (the difference in noise level between consecutive timesteps)? Then, using the predicted noise, could we start from random noise and keep subtracting the noise we predict, so that in the end we get an image without noise? That is the generative result we want.

Above is a very brief introduction/intuition for you to have something in mind before we head into the detailed math. But math is something that we are not able to avoid. We'll start with the forward process.

# Forward Process - Adding Noise

![](forward_pass.png)

- In the forward process, the paper defines $X_t = \sqrt{a_t}X_{t-1} + \sqrt{1-a_t}Z_t$,
    - where $X_t$ is the image at timestep $t$, after noise has been added $t$ times.
    - $Z_t$ is the noise added at timestep $t$, sampled from the standard normal distribution: $Z_t$ ~ $N(0, 1)$.
    - Yes, we also need to explain what timestep $t$ means. It counts how many noising steps have been applied so far; in the figure above, it is the subscript under each state. The final timestep $T$ is the total number of steps needed to reach the fully noisy image.
    - $a_t$ controls how much of the previous image is kept at timestep $t$. It is defined as $a_t = 1 - \beta_t$, where $\beta_t$ is a hyperparameter (the variance of the noise added at step $t$) that increases with $t$. As a result, $a_t$ decreases through the timesteps.
        - So why do we define $a_t$ this way? Observe the formula: when $a_t$ decreases through the timesteps, $\sqrt{1-a_t}$ conversely increases, so the amount of noise added to the image grows over time. What does this mean? Imagine we start from a clean image and add noise to it: even a little bit of noise is noticeable. But if we add the same small amount of noise to an already noisy image, the change is insignificant. So, at later timesteps, we need to add more noise for it to be influential. This is the intuition behind the definition of $a_t$.
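
To make the single noising step concrete, here is a minimal sketch in plain Python (standard library only, so it runs anywhere; a real implementation would use NumPy or PyTorch tensors). The function names `linear_beta_schedule` and `forward_step` are our own, and the linear schedule with $\beta$ going from `1e-4` to `0.02` over `T = 1000` steps is an assumption for illustration, not something fixed by the formula above.

```python
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Return beta_1..beta_T, increasing linearly (values are an assumption)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def forward_step(x_prev, a_t, rng):
    """One noising step: X_t = sqrt(a_t) * X_{t-1} + sqrt(1 - a_t) * Z_t."""
    keep = a_t ** 0.5                 # how much of the previous image survives
    noise_scale = (1.0 - a_t) ** 0.5  # how much fresh noise is mixed in
    return [keep * x + noise_scale * rng.gauss(0.0, 1.0) for x in x_prev]

T = 1000
betas = linear_beta_schedule(T)
alphas = [1.0 - b for b in betas]     # a_t = 1 - beta_t, so a_t decreases with t

rng = random.Random(0)                # fixed seed for reproducibility
x = [0.5, -0.25, 1.0, 0.0]            # toy "image" flattened to 4 pixels
for a_t in alphas[:10]:               # apply the first 10 noising steps
    x = forward_step(x, a_t, rng)
```

Each call draws a fresh $Z_t$, which is why reaching a large $t$ this way requires $t$ sequential steps.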

- However, we don't want to actually add noise timestep by timestep; we want to noise the image in one go. Given our image $X_0$, we want to obtain $X_t$ for any timestep $t$ (including the final $X_T$) immediately. So we need to relate $X_t$ to $X_0$ directly.
    - We can do this by expanding the formula $X_t = \sqrt{a_t}X_{t-1} + \sqrt{1-a_t}Z_t$, substituting the same formula for $X_{t-1}$.
    - => $X_t = \sqrt{a_t}(\sqrt{a_{t-1}}X_{t-2} + \sqrt{1-a_{t-1}}Z_{t-1}) + \sqrt{1-a_t}Z_t$
    - => $X_t = \sqrt{a_ta_{t-1}}X_{t-2} + \sqrt{a_t(1-a_{t-1})}Z_{t-1} + \sqrt{1-a_t}Z_t$
    - The last two terms are a sum of two independent, scaled Gaussian variables, $Z_{t-1}$ and $Z_t$. We can merge them using two properties of Gaussian distributions:
        - if we multiply a Gaussian variable by a constant $c$: new $\mu' = c\mu$, new $\sigma' = |c|\sigma$ => $\sigma'^2 = c^2\sigma^2$
        - so $\sqrt{a_t(1-a_{t-1})}Z_{t-1}$ ~ $N(0, a_t(1-a_{t-1}))$ and $\sqrt{1-a_t}Z_t$ ~ $N(0, 1-a_t)$ are still Gaussian
        - for any Gaussian variables with distributions $N(\mu_1, \sigma_1^2)$ and $N(\mu_2, \sigma_2^2)$, if they are independent, then their sum is distributed as $N(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$
    - => $X_t = \sqrt{a_ta_{t-1}}X_{t-2} + \sqrt{1 - a_t + a_t - a_ta_{t-1}}\bar Z$ where $\bar Z$ ~ $N(0, 1)$
    - => $X_t = \sqrt{a_ta_{t-1}}X_{t-2} + \sqrt{1 - a_ta_{t-1}}\bar Z$
    - Continuing this process, we get $X_t = \sqrt{a_ta_{t-1}...a_1}X_0 + \sqrt{1 - a_ta_{t-1}...a_1}\bar Z$
- So if we let $\bar a_t = \prod_{i=1}^t a_i$,
- => $X_t = \sqrt{\bar a_t}X_0 + \sqrt{1 - \bar a_t}\bar Z$
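
The closed-form result can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not a reference implementation: the names `cumulative_alphas` and `q_sample` are ours, and the linear $\beta$ schedule (`1e-4` to `0.02`, `T = 1000`) is assumed only so there are concrete numbers to work with. The loop in the middle is a sanity check of the derivation: if we track the noise variance step by step, it should agree with $1 - \bar a_t$ at every timestep.

```python
import math
import random

def cumulative_alphas(alphas):
    """Return [abar_1, ..., abar_T] with abar_t = a_1 * a_2 * ... * a_t."""
    alpha_bars, running = [], 1.0
    for a in alphas:
        running *= a
        alpha_bars.append(running)
    return alpha_bars

def q_sample(x0, t, alpha_bars, rng):
    """Jump from X_0 straight to X_t (t is 1-indexed) using the closed form."""
    abar = alpha_bars[t - 1]
    signal = math.sqrt(abar)        # coefficient on the clean image
    noise = math.sqrt(1.0 - abar)   # coefficient on the merged noise Z-bar
    return [signal * x + noise * rng.gauss(0.0, 1.0) for x in x0]

T = 1000
betas = [1e-4 + i * (0.02 - 1e-4) / (T - 1) for i in range(T)]  # assumed schedule
alphas = [1.0 - b for b in betas]
alpha_bars = cumulative_alphas(alphas)

# Step-by-step noise variance: var_t = a_t * var_{t-1} + (1 - a_t), var_0 = 0.
# The derivation says this must equal 1 - abar_t at every timestep.
var, max_gap = 0.0, 0.0
for a, abar in zip(alphas, alpha_bars):
    var = a * var + (1.0 - a)
    max_gap = max(max_gap, abs(var - (1.0 - abar)))

rng = random.Random(0)
x0 = [0.5, -0.25, 1.0, 0.0]                   # toy "image" with 4 pixels
x_mid = q_sample(x0, 500, alpha_bars, rng)    # partly noised
x_T = q_sample(x0, T, alpha_bars, rng)        # almost pure noise
```

With this assumed schedule, `alpha_bars[-1]` ends up very close to zero, so $X_T$ keeps almost nothing of $X_0$, and `max_gap` stays at floating-point rounding level.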

And this is our final equation for adding noise in the forward process. If you find the math above a bit hard to follow, that's fine: just remember this formula, because we will use it in the actual implementation.
This concludes the forward process. Hurrah! We are halfway there! 😀
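
For readers comparing with the paper's notation, the single-step formula and the one-shot formula can also be written as conditional distributions. This is only a restatement of the two equations from this section; $I$ denotes the identity matrix (our notation, assuming the noise is independent for each pixel):

$$q(X_t \mid X_{t-1}) = N\left(X_t;\ \sqrt{a_t}\,X_{t-1},\ (1-a_t)I\right), \qquad q(X_t \mid X_0) = N\left(X_t;\ \sqrt{\bar a_t}\,X_0,\ (1-\bar a_t)I\right)$$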

Next, we will move on to the reverse process - denoising.

# Reverse Process - Denoising

![](backward_pass.png)

As the name suggests, the reverse process is the opposite of the forward process. In the forward process, we add noise to the image, and in the reverse process, we remove the noise from the image. So if we are given the noisy image $X_t$, how can we know the less noisy image $X_{t-1}$?

Basically, we want the probability $q(X_{t-1}|X_t)$