# FIT5230 Week 8: Generative Diffusion Models (GDMs)

## 1. Introduction to Generative Diffusion Models (GDMs)

**Generative Diffusion Models (GDMs)** are a class of deep learning models that have emerged as a powerful alternative to GANs, often outperforming them in high-fidelity image synthesis.

The core idea is to learn how to generate data by modeling a reversal of a gradual noising process. The model consists of two main processes:

1.  **Forward Diffusion Process**: A fixed process where an input image is progressively destroyed by adding Gaussian noise over a series of `T` timesteps.
2.  **Reverse Diffusion Process**: A learned process where a neural network is trained to gradually remove noise, starting from a pure noise input, to generate a clean, coherent image.



---

## 2. The Forward Diffusion Process (Noising)

This is a predefined Markov chain that adds a small amount of noise at each of `T` discrete timesteps. The amount of noise is controlled by a **noise scheduler**.

### The Noise Scheduler (`βₜ`)

The scheduler determines the variance of the noise, `βₜ`, added at each timestep `t`. This schedule is crucial for training stability and sample quality.
* `αₜ = 1 - βₜ`: Represents the portion of the image signal that is preserved at step `t`.
* `ᾱₜ = Πᵢ₌₁ᵗ αᵢ`: The cumulative product of the `α` values. This term represents the total amount of signal from the original image `x₀` that remains at step `t`.

Using `ᾱₜ`, we can sample a noisy image `xₜ` directly from the original `x₀` at any timestep `t`, without simulating the intermediate steps:

$$x_t = \sqrt{\bar{\alpha}_t}x_0 + \sqrt{1 - \bar{\alpha}_t}\epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$

* **Conceptual Meaning**: This equation shows that any noisy image is simply a weighted blend of the original image (`x₀`) and pure Gaussian noise (`ϵ`). As `t` increases, `ᾱₜ` decreases, so the influence of the original image fades while the noise dominates.
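
The closed-form sampling step above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the linear schedule endpoints (`1e-4` to `0.02`), `T = 1000`, the helper names `linear_beta_schedule` and `q_sample`, and the random 3×8×8 array standing in for an image are all assumptions made for the example.

```python
import numpy as np

def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1..beta_T (endpoints are assumed values)."""
    return np.linspace(beta_start, beta_end, T)

def q_sample(x0, t, alpha_bar, rng):
    """Sample x_t directly from x_0 using the closed-form expression.

    t is a 1-based timestep, so alpha_bar[t - 1] is the cumulative product up to step t.
    Returns both the noisy image and the noise that was mixed in.
    """
    eps = rng.standard_normal(x0.shape)
    a_bar = alpha_bar[t - 1]
    x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps
    return x_t, eps

T = 1000
betas = linear_beta_schedule(T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)  # cumulative product of the alpha values

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(3, 8, 8))  # stand-in for an image scaled to [-1, 1]

for t in (1, 250, 500, 1000):
    x_t, _ = q_sample(x0, t, alpha_bar, rng)
    signal_w = np.sqrt(alpha_bar[t - 1])
    noise_w = np.sqrt(1.0 - alpha_bar[t - 1])
    print(f"t={t:4d}  signal weight={signal_w:.4f}  noise weight={noise_w:.4f}")
```

Printing the two weights makes the blend explicit: the signal weight shrinks toward 0 and the noise weight grows toward 1 as `t` approaches `T`.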

**Types of Schedulers:**
* **Linear Scheduler**: `βₜ` increases linearly with `t`. This can add noise too quickly, destroying image information well before the final timestep and making the reverse process harder to learn.
* **Cosine Scheduler**: The cumulative product `ᾱₜ` is defined to follow a squared-cosine curve, and `βₜ` is derived from it. Noise is added more slowly at the beginning, which preserves information for longer and has been shown to improve training and final image quality.
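
To make the difference between the two schedules concrete, the sketch below compares how much signal survives under each. It assumes the commonly used cosine formulation in which the cumulative product is `f(t) / f(0)` with `f(t) = cos²(((t/T) + s) / (1 + s) · π/2)`; the offset `s = 0.008`, the clipping value `0.999`, the linear endpoints, and the function names are assumptions for illustration.

```python
import numpy as np

def linear_alpha_bar(T, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal fraction under a linear beta schedule (assumed endpoints)."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def cosine_alpha_bar(T, s=0.008):
    """Cumulative signal fraction defined directly by a squared-cosine curve."""
    steps = np.arange(T + 1)
    f = np.cos((steps / T + s) / (1.0 + s) * np.pi / 2.0) ** 2
    return f[1:] / f[0]  # entries for t = 1..T, normalised so that t = 0 gives 1

def betas_from_alpha_bar(alpha_bar, max_beta=0.999):
    """Recover per-step variances: beta_t = 1 - alpha_bar_t / alpha_bar_{t-1}."""
    prev = np.concatenate(([1.0], alpha_bar[:-1]))
    return np.clip(1.0 - alpha_bar / prev, 0.0, max_beta)

T = 1000
lin_ab = linear_alpha_bar(T)
cos_ab = cosine_alpha_bar(T)
cos_betas = betas_from_alpha_bar(cos_ab)

for t in (100, 250, 500, 750, 1000):
    print(f"t={t:4d}  linear={lin_ab[t - 1]:.4f}  cosine={cos_ab[t - 1]:.4f}")
```

Under these assumed settings the cosine curve retains noticeably more signal through the middle of the process, which is the behaviour the bullet above describes.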

---

## 3. The Reverse Diffusion Process (Denoising)

This is where the learning occurs. The goal is to train a neural network `ϵ_θ` (with parameters `θ`) to reverse the diffusion process by predicting the noise that was added at each step.

* Starting with pure noise `x_T`, the model predicts the noise `ϵ_θ(xₜ, t)` at the current step `t`.
* This prediction is used to compute a less noisy image `xₜ₋₁`.
* The process is repeated for all `T` steps to arrive at the generated image `x₀`.

The formula to go from `xₜ` to `xₜ₋₁` is:
$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \epsilon_\theta(x_t, t) \right) + \sigma_t z$$

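* **Conceptual Meaning**: This equation essentially says: "To get the image from the previous step (`xₜ₋₁`), take the current image (`xₜ`), subtract a scaled version of the noise the model (`ϵ_θ`) thinks was added, rescale the result, and add a small amount of fresh random noise for variability." Here `z` is standard Gaussian noise for `t > 1` and is set to zero at the final step, and `σₜ` controls how much noise is re-injected (`σₜ² = βₜ` is a common choice).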
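
A single reverse step and the full sampling loop can be sketched as follows. This is an illustration under stated assumptions: `σₜ² = βₜ` is one common choice rather than the only one, the short schedule (`T = 50`) is chosen only to keep the demo fast, and `dummy_eps_model` is a hypothetical placeholder that returns zeros where a trained U-Net would go, so the output is not a meaningful image.

```python
import numpy as np

def p_sample_step(x_t, t, eps_pred, betas, alphas, alpha_bar, rng):
    """One ancestral sampling step x_t -> x_{t-1} (t is 1-based)."""
    i = t - 1
    # Remove the scaled noise estimate, then rescale by 1 / sqrt(alpha_t).
    mean = (x_t - betas[i] / np.sqrt(1.0 - alpha_bar[i]) * eps_pred) / np.sqrt(alphas[i])
    if t == 1:
        return mean  # no fresh noise on the final step (z = 0)
    sigma_t = np.sqrt(betas[i])  # assumed choice: sigma_t^2 = beta_t
    z = rng.standard_normal(x_t.shape)
    return mean + sigma_t * z

def dummy_eps_model(x_t, t):
    """Hypothetical stand-in for the trained noise predictor; always predicts zero noise."""
    return np.zeros_like(x_t)

T = 50
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 8, 8))  # start from pure Gaussian noise x_T
for t in range(T, 0, -1):
    x = p_sample_step(x, t, dummy_eps_model(x, t), betas, alphas, alpha_bar, rng)

print(x.shape)
```

One sanity check on the formula: if the noise estimate at `t = 1` equals the noise actually used to produce `x₁`, the step returns `x₀` exactly.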

### The U-Net Architecture

The network `ϵ_θ` is typically a **U-Net** because its architecture is well-suited for image-to-image tasks: the predicted noise has the same spatial shape as the noisy input image.

* It has an encoder (downsampling path) and a decoder (upsampling path).
* **Skip connections** link the encoder and decoder, allowing the network to preserve high-resolution details during reconstruction.
* **Time Embedding**: The timestep `t` is encoded into a vector and fed into the network's residual blocks. This is crucial because the network needs to know the noise level to make an accurate prediction.
* **Attention Mechanisms**: Self-attention layers are often added to help the model capture long-range dependencies within the image.
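
The notes do not specify how the timestep is encoded. One widely used option is a sinusoidal embedding in the style of Transformer positional encodings, sketched below as an assumption; the function name, `dim = 128`, and `max_period = 10000` are illustrative choices. In many implementations the resulting vector is passed through a small MLP before being added to the features inside each residual block.

```python
import numpy as np

def sinusoidal_time_embedding(timesteps, dim, max_period=10000.0):
    """Map integer timesteps to vectors of sines and cosines at geometric frequencies.

    timesteps: sequence of B timestep values. dim: embedding size (assumed even).
    Returns an array of shape (B, dim): first half sines, second half cosines.
    """
    timesteps = np.asarray(timesteps, dtype=np.float64)
    half = dim // 2
    # Frequencies decay geometrically from 1 down towards 1 / max_period.
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = timesteps[:, None] * freqs[None, :]
    return np.concatenate([np.sin(args), np.cos(args)], axis=1)

emb = sinusoidal_time_embedding([0, 1, 999], dim=128)
print(emb.shape)  # (3, 128)
```

Each timestep gets a distinct, smoothly varying vector, so the network can tell a lightly noised input (small `t`) from a heavily noised one (large `t`).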

### The Loss Function

The training objective is surprisingly simple: minimize the Mean Squared Error (MSE) between the actual noise `ϵ` added during the forward process and the noise `ϵ_θ(xₜ, t)` predicted by the model.
$$L = \mathbb{E}_{x_0, t, \epsilon} [||\epsilon - \epsilon_\theta(x_t, t)||^2]$$
This direct, stable regression loss, with no adversarial game between two networks, is a major reason why diffusion models are easier to train than GANs.
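
A Monte Carlo estimate of this loss for one batch can be sketched as follows. The sketch assumes timesteps are drawn uniformly from `1..T` and averages the squared error over all elements (which differs from the squared norm only by a constant factor). The function name `diffusion_loss`, the batch of random arrays, and the zero-predicting placeholder model are assumptions for illustration; a real training loop would backpropagate this value through the network in an autodiff framework.

```python
import numpy as np

def diffusion_loss(eps_model, x0_batch, alpha_bar, rng):
    """Estimate E[(eps - eps_model(x_t, t))^2] on one batch.

    eps_model: callable (x_t, t) -> predicted noise with the same shape as x_t.
    x0_batch:  array of shape (B, ...) holding clean images.
    """
    B = x0_batch.shape[0]
    T = alpha_bar.shape[0]
    t = rng.integers(1, T + 1, size=B)        # one uniform timestep in 1..T per sample
    eps = rng.standard_normal(x0_batch.shape)  # the noise the model must predict
    # Broadcast the per-sample cumulative product over the image dimensions.
    a_bar = alpha_bar[t - 1].reshape(B, *([1] * (x0_batch.ndim - 1)))
    x_t = np.sqrt(a_bar) * x0_batch + np.sqrt(1.0 - a_bar) * eps
    return np.mean((eps - eps_model(x_t, t)) ** 2)

T = 1000
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, T))  # assumed linear schedule

rng = np.random.default_rng(0)
x0_batch = rng.uniform(-1.0, 1.0, size=(16, 3, 8, 8))  # stand-in for a batch of images

def zero_model(x_t, t):
    """Hypothetical untrained predictor that always outputs zero noise."""
    return np.zeros_like(x_t)

print(diffusion_loss(zero_model, x0_batch, alpha_bar, rng))
```

Because the target noise has unit variance, a predictor that always outputs zero scores a loss close to 1, which gives a useful baseline when reading training curves.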

---

## 4. Advanced Topics and Applications

### Denoising Diffusion Implicit Models (DDIM)

A major drawback of standard diffusion models (DDPMs) is slow inference: generating one sample typically requires on the order of 1000 sequential network evaluations. **DDIMs** accelerate this significantly.
* DDIMs use a **non-Markovian** formulation, which allows the reverse process to skip steps (e.g., jumping from `t=1000` to `t=900`). The same trained noise-prediction network can be reused without retraining. This produces high-quality samples in far fewer steps (e.g., 20-100), trading a small amount of quality for a massive speedup.
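
The step-skipping idea can be sketched with the deterministic DDIM update: estimate `x₀` from the current sample and the predicted noise, then re-noise that estimate directly to an earlier timestep. The evenly spaced subsequence of 20 timesteps, the linear schedule, and the helper names are assumptions. To keep the example checkable without a trained network, `make_oracle` builds a hypothetical "perfect" noise predictor that already knows the target image; it is a testing device only and has no counterpart in a real sampler.

```python
import numpy as np

def ddim_step(x_t, eps_pred, a_bar_t, a_bar_prev):
    """Deterministic DDIM update from a timestep with cumulative product a_bar_t
    to an earlier one with a_bar_prev (use a_bar_prev = 1.0 for the final jump to x_0)."""
    x0_pred = (x_t - np.sqrt(1.0 - a_bar_t) * eps_pred) / np.sqrt(a_bar_t)
    return np.sqrt(a_bar_prev) * x0_pred + np.sqrt(1.0 - a_bar_prev) * eps_pred

def ddim_timesteps(T, num_steps):
    """Evenly spaced, decreasing subsequence of 1-based timesteps from T down to 1."""
    return np.linspace(T, 1, num_steps).round().astype(int)

def make_oracle(x0, alpha_bar):
    """Hypothetical perfect predictor: returns the noise consistent with a known x0."""
    def eps_model(x_t, t):
        a = alpha_bar[t - 1]
        return (x_t - np.sqrt(a) * x0) / np.sqrt(1.0 - a)
    return eps_model

T = 1000
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, T))  # assumed linear schedule

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(3, 8, 8))  # target image known only to the oracle
eps_model = make_oracle(x0, alpha_bar)

ts = ddim_timesteps(T, num_steps=20)  # 20 network calls instead of 1000
x = rng.standard_normal(x0.shape)     # start from pure noise
for i, t in enumerate(ts):
    a_t = alpha_bar[t - 1]
    a_prev = alpha_bar[ts[i + 1] - 1] if i + 1 < len(ts) else 1.0
    x = ddim_step(x, eps_model(x, t), a_t, a_prev)

print(np.allclose(x, x0, atol=1e-8))
```

With a real network the noise estimates are imperfect, so fewer steps generally cost some sample quality, which is the trade-off noted above.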

### State-of-the-Art Models

* **DALL·E 2 (OpenAI)**: Uses a two-part system. A "prior" model maps a text caption to a CLIP image embedding, which captures the semantics of the image. A "decoder" diffusion model then generates an image based on this embedding.
* **Imagen (Google)**: Showed that scaling the **text encoder** (using a large, frozen language model like T5) is more important for photorealism and text alignment than scaling the diffusion U-Net. It uses a cascade of diffusion models to generate a low-resolution image and then upscale it.

### Security and Robustness

Diffusion models are a double-edged sword for security.
* **Defense**: They can be used for **Adversarial Purification**. An adversarial image can be partially noised and then denoised, effectively "washing away" the malicious perturbation before feeding it to a classifier.
* **Threat**: Like any generative model, they can be used to create synthetic identities for fraud or phishing. They can also be adapted to generate more natural and potent adversarial examples that are harder to detect (e.g., AdvDiffuser).