---
layout: post
title:  "Stable Diffusion"
date:   2023-06-13 10:14:54 +0700
categories: DeepLearning
---

# Introduction

Diffusion is a physical process in which one substance gradually spreads into another, for example, milk into coffee. This idea underlies diffusion models such as Stable Diffusion, which generate new images. The model involves two processes: a diffusion process and a denoising process.

In the diffusion process, the model starts with original images and gradually adds Gaussian noise to them over a series of time steps. This slow process chips away at the original image's structure and detail, turning it into a simpler distribution. At the end, the image is pure noise (following a simple Gaussian distribution).

The denoising process is the reverse, and it is used to generate new samples. It starts from samples of the Gaussian distribution (noisy images) and uses a neural network to trace out a reverse path, gradually removing the noise that was supposedly added during the diffusion process. If successful, the denoising process learns to transform the Gaussian blob back into a structured piece of data. The same neural network can then be used to generate a new image from a random noisy image. The generated image looks real even though it is hallucinated, since the neural network has learned the underlying patterns of real images.

Training the neural network involves comparing the denoised data to the original data at each step and adjusting the network's parameters to minimize the difference.

# Diffusion model

Given a data distribution $$ x_0 \sim q(x_0) $$, a forward noising process $$ q $$ that adds Gaussian noise with variance $$ \beta_t \in (0,1) $$ to the data at each time step $$ t $$ is defined as follows:

$$ q(x_1, ...,x_T \mid x_0) = \prod_{t=1}^T q(x_t \mid x_{t-1} ) $$

$$ q(x_t \mid x_{t-1}) = N(x_t; \sqrt{1 - \beta_t}x_{t-1}, \beta_t I) $$ 

The first step is to choose a sequence of $$ \beta_t $$ values, one per time step $$ t $$. This sequence, called the variance schedule, gives the variance of the Gaussian noise added to the data at each step. Given a large enough $$ T $$ and a well-behaved schedule of $$ \beta_t $$, the latent at time step $$ T $$ is almost pure noise, i.e. $$ x_T $$ approximately follows an isotropic Gaussian distribution.

Given $$ \beta_t $$, we scale the data $$ x_{t-1} $$ by the factor $$ \sqrt{1 - \beta_t} $$, scale standard Gaussian noise by the factor $$ \sqrt{\beta_t} $$ (so that its variance is $$ \beta_t $$), and add the two to obtain a sample $$ x_t $$. The larger $$ \beta_t $$ is, the more of its structure the image loses in that step.
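The forward step can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the linear schedule and its endpoints (`1e-4` to `0.02`), the function names, and the 8x8 toy "image" are all assumptions made here. Note that the noise is multiplied by $$ \sqrt{\beta_t} $$, the standard deviation, because $$ \beta_t $$ is a variance.

```python
import numpy as np

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Evenly spaced variances beta_1..beta_T (an assumed schedule)."""
    return np.linspace(beta_start, beta_end, num_steps)

def forward_step(x_prev, beta_t, rng):
    """Draw x_t ~ N(sqrt(1 - beta_t) * x_prev, beta_t * I)."""
    noise = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * noise

rng = np.random.default_rng(0)
betas = linear_beta_schedule(1000)

# Toy stand-in for an image with pixel values in [-1, 1].
x = rng.uniform(-1.0, 1.0, size=(8, 8))

# Run the whole chain x_0 -> x_1 -> ... -> x_T, one step at a time.
for beta_t in betas:
    x = forward_step(x, beta_t, rng)
```

After the loop, `x` plays the role of $$ x_T $$ and should be statistically close to a draw from a standard Gaussian.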

Each step from $$ x_{t-1} $$ to $$ x_t $$ is random and depends only on $$ x_{t-1} $$, so the entire diffusion process can be viewed as a Markov chain: a sequence of random variables in which the value at each step depends only on that of the previous step.

If we knew the reverse distribution $$ q(x_{t-1} \mid x_t) $$, we could sample $$ x_T \sim N(0,I) $$ and run the process backward to get a sample from $$ q(x_0) $$. However, $$ q(x_{t-1} \mid x_t) $$ depends on the entire data distribution and cannot be computed directly, so it is approximated with a neural network:

$$ p_{\theta} (x_{t-1} \mid x_t) = N(x_{t-1}; \mu_{\theta}(x_t, t), \Sigma_{\theta}(x_t, t)) $$

To denoise, we first define $$ \alpha_t = 1 - \beta_t $$ and $$ \bar{\alpha}_t = \prod_{s=1}^t \alpha_s $$. With these, the noised latent at an arbitrary step $$ t $$ can be sampled directly, conditioned on the original input $$ x_0 $$:

$$ q(x_t \mid x_0) = N(x_t; \sqrt{\bar{\alpha}_t}x_0, (1 - \bar{\alpha}_t) I) $$ and $$ x_t = \sqrt{\bar{\alpha}_t}x_0 + \sqrt{1-\bar{\alpha}_t}\epsilon $$ where $$ \epsilon \sim N(0,I) $$.
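The closed form means $$ x_t $$ can be drawn in one shot instead of iterating through all earlier steps. A minimal NumPy sketch follows, assuming a linear $$ \beta_t $$ schedule (the schedule, the name `q_sample`, and the 1-indexed `t` convention are choices made for this example):

```python
import numpy as np

def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t ~ q(x_t | x_0) directly; t is 1-indexed (t = 1..T)."""
    a_bar = alpha_bars[t - 1]
    eps = rng.standard_normal(x0.shape)  # eps ~ N(0, I)
    x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps
    return x_t, eps

betas = np.linspace(1e-4, 0.02, 1000)  # assumed linear schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)        # alpha_bar_t = alpha_1 * ... * alpha_t

rng = np.random.default_rng(0)
x0 = np.ones(50_000)                   # toy data: every entry equals 1
x_t, eps = q_sample(x0, 500, alpha_bars, rng)
```

With this schedule, `alpha_bars[-1]` is close to zero, which is the numerical counterpart of the statement that $$ x_T $$ is almost pure Gaussian noise.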

Second, when the backward step is additionally conditioned on $$ x_0 $$, it becomes a Gaussian that can be computed in closed form:

$$ \tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \beta_t $$

$$ \tilde{\mu}_t(x_t,x_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{1-\bar{\alpha}_t}x_0 + \frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}x_t  $$

$$ q(x_{t-1} \mid x_t, x_0) = N(x_{t-1};\tilde{\mu}_t(x_t,x_0), \tilde{\beta}_t I) $$
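The three formulas above translate almost line by line into code. The sketch below is illustrative only: the function name `q_posterior`, the linear schedule, and the convention $$ \bar{\alpha}_0 = 1 $$ (the empty product, used for $$ t = 1 $$) are assumptions of this example.

```python
import numpy as np

def q_posterior(x0, x_t, t, betas, alpha_bars):
    """Mean and variance of q(x_{t-1} | x_t, x_0); t is 1-indexed."""
    beta_t = betas[t - 1]
    alpha_t = 1.0 - beta_t
    a_bar_t = alpha_bars[t - 1]
    # Assumed convention: alpha_bar_0 = 1 (empty product) when t = 1.
    a_bar_prev = alpha_bars[t - 2] if t > 1 else 1.0

    var = (1.0 - a_bar_prev) / (1.0 - a_bar_t) * beta_t      # beta_tilde_t
    coef_x0 = np.sqrt(a_bar_prev) * beta_t / (1.0 - a_bar_t)
    coef_xt = np.sqrt(alpha_t) * (1.0 - a_bar_prev) / (1.0 - a_bar_t)
    mean = coef_x0 * x0 + coef_xt * x_t                      # mu_tilde_t
    return mean, var

betas = np.linspace(1e-4, 0.02, 1000)  # assumed linear schedule
alpha_bars = np.cumprod(1.0 - betas)

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(8, 8))  # toy stand-in for an image
t = 500

# Noise x0 to step t with the closed form, then take one step back.
eps = rng.standard_normal(x0.shape)
x_t = np.sqrt(alpha_bars[t - 1]) * x0 + np.sqrt(1.0 - alpha_bars[t - 1]) * eps
mean, var = q_posterior(x0, x_t, t, betas, alpha_bars)
x_prev = mean + np.sqrt(var) * rng.standard_normal(x0.shape)
```

This step needs the true $$ x_0 $$, which is only available during training; at generation time the network $$ p_{\theta} $$ has to stand in for it.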

# Latent diffusion model

# Conclusion
