# Conditional Latent Diffusion (CIFAR-10)

## Step 1. Encode images into latent space (VAE)

* Input image: $x \in \mathbb{R}^{3 \times 32 \times 32}$
* The VAE encoder learns a Gaussian posterior:

  $$
  q_\phi(z|x) = \mathcal{N}\!\big(z;\,\mu_\phi(x),\,\sigma_\phi^2(x)\big), \quad z \in \mathbb{R}^{4 \times 8 \times 8}.
  $$
  - **$q_\phi(z|x)$**: Approximate posterior over the latent variable $z$ given an input $x$.
  - **$\mu_\phi(x)$**: Predicted mean of the latent distribution, one value per latent dimension.
  - **$\sigma^2_\phi(x)$**: Predicted variance (uncertainty) for each latent dimension, i.e. a diagonal covariance.
* Reparameterization trick: Allows gradients to flow through sampling by injecting noise explicitly.

  $$
  z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0,I).
  $$
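
The sampling step above can be sketched in a few lines of NumPy. This is a minimal illustration, not the project's encoder: it assumes the encoder outputs a log-variance (a common but here hypothetical parameterization), and `mu` and `log_var` are stand-in arrays rather than real encoder outputs.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I)."""
    sigma = np.exp(0.5 * log_var)        # sigma = sqrt(exp(log sigma^2))
    eps = rng.standard_normal(mu.shape)  # noise drawn independently of the encoder outputs
    return mu + sigma * eps

rng = np.random.default_rng(0)
mu = np.zeros((2, 4, 8, 8))             # stand-in for mu_phi(x), batch of 2
log_var = np.full((2, 4, 8, 8), -2.0)   # stand-in for log sigma_phi^2(x) (assumed parameterization)
z = reparameterize(mu, log_var, rng)
print(z.shape)  # (2, 4, 8, 8)
```

Because the randomness lives entirely in `eps`, `z` is a deterministic, differentiable function of `mu` and `log_var`, which is what lets gradients reach the encoder in an autodiff framework.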

### VAE loss (negative β-VAE ELBO)

$$
\mathcal{L}_{\text{VAE}} = \|x - \hat{x}\|_1 + \beta \cdot D_{\mathrm{KL}}(q_\phi(z|x)\,\|\,\mathcal{N}(0,I))
$$

* **Reconstruction term**: $\|x-\hat{x}\|_1$

  * Measures pixel-wise error between input and reconstruction
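* **KL divergence** (closed form for a diagonal Gaussian posterior, summed over latent dimensions $j$):

  $$
  D_{\mathrm{KL}} = \frac{1}{2}\sum_j\big(\mu_j^2 + \sigma_j^2 - \log\sigma_j^2 - 1\big)
  $$

  * Keeps the latent distribution close to the Gaussian prior $\mathcal{N}(0,I)$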
* $\beta$: hyperparameter controlling tradeoff between reconstruction and regularization
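
As a rough sketch, the loss can be computed as follows. The reduction (sum over pixels and latent dimensions, mean over the batch), the log-variance parameterization, and the default `beta` are assumptions made for illustration; the text does not fix them.

```python
import numpy as np

def vae_loss(x, x_hat, mu, log_var, beta=1.0):
    """L1 reconstruction plus beta-weighted KL to N(0, I), averaged over the batch."""
    batch = x.shape[0]
    # ||x - x_hat||_1 per sample
    recon = np.abs(x - x_hat).reshape(batch, -1).sum(axis=1)
    # 0.5 * sum_j (mu_j^2 + sigma_j^2 - log sigma_j^2 - 1) per sample
    kl = 0.5 * (mu**2 + np.exp(log_var) - log_var - 1.0).reshape(batch, -1).sum(axis=1)
    return float(np.mean(recon + beta * kl))

x = np.random.default_rng(0).random((2, 3, 32, 32))  # stand-in images in [0, 1)
zeros = np.zeros((2, 4, 8, 8))

# Perfect reconstruction with posterior equal to the prior gives zero loss.
print(vae_loss(x, x, zeros, zeros))  # 0.0
```

A larger `beta` pulls the posterior harder toward the prior, usually at the cost of blurrier reconstructions.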



## Step 2. Define forward noising process (Diffusion forward)

### Noising process

$$
z_t = \sqrt{\bar\alpha_t}\,z_0 + \sqrt{1-\bar\alpha_t}\,\epsilon,\quad \epsilon \sim \mathcal{N}(0,I)
$$

* $z_0$: clean latent from VAE
* $z_t$: noisy version after $t$ steps
* $\bar\alpha_t$: cumulative product of noise schedule (see below)
* $\epsilon$: standard Gaussian noise
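
A minimal sketch of this closed-form jump from $z_0$ to $z_t$, assuming the cumulative keep-rate for the chosen step is passed in as a plain number (`alpha_bar_t` is a hypothetical argument name):

```python
import numpy as np

def q_sample(z0, alpha_bar_t, rng):
    """Return (z_t, eps) with z_t = sqrt(abar_t) * z0 + sqrt(1 - abar_t) * eps."""
    eps = rng.standard_normal(z0.shape)
    z_t = np.sqrt(alpha_bar_t) * z0 + np.sqrt(1.0 - alpha_bar_t) * eps
    return z_t, eps

rng = np.random.default_rng(0)
z0 = rng.standard_normal((2, 4, 8, 8))   # stand-in for a clean VAE latent

z_mid, eps_mid = q_sample(z0, 0.5, rng)  # equal parts signal and noise (by variance)
z_end, eps_end = q_sample(z0, 0.0, rng)  # no signal left: z_t is exactly eps
```

The noise `eps` is returned alongside `z_t` because it is the regression target for the denoiser.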


### Noise schedule

* Per-step variance parameter:

  $$
  \beta_t \in (0,1)
  $$
* Per-step keep-rate:

  $$
  \alpha_t = 1 - \beta_t
  $$
* Cumulative keep-rate:

  $$
  \bar\alpha_t = \prod_{s=1}^t \alpha_s
  $$

Intuition: every $\alpha_s < 1$, so $\bar\alpha_t$ shrinks as $t$ grows; less of $z_0$ survives and $z_t$ approaches pure noise.
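
For concreteness, here is one possible schedule. The linear range $10^{-4}$ to $0.02$ over $T = 1000$ steps is the setting reported by Ho et al. 2020 and is only an assumption here; the text does not say which schedule this project uses.

```python
import numpy as np

T = 1000                              # assumed number of diffusion steps
betas = np.linspace(1e-4, 0.02, T)    # beta_1 ... beta_T (assumed linear schedule)
alphas = 1.0 - betas                  # per-step keep-rate
alpha_bar = np.cumprod(alphas)        # cumulative keep-rate; alpha_bar[t - 1] holds abar_t

print(alpha_bar[0])    # close to 1: almost no noise after one step
print(alpha_bar[-1])   # close to 0: almost pure noise at t = T
```

Note the off-by-one: arrays are 0-indexed, so step $t$ lives at index `t - 1`.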


## Step 3. Train the denoiser UNet (Reverse process)

### Goal

Train a UNet $\epsilon_\theta$ to predict the noise $\epsilon$ contained in the noisy latent $z_t$, given the timestep $t$ and class label $y$:

$$
\epsilon_\theta: (z_t, t, y) \mapsto \hat{\epsilon}
$$


### Loss function

$$
\mathcal{L}_{\text{DDPM}} = \mathbb{E}_{x,\epsilon,t}\;\|\epsilon - \epsilon_\theta(z_t,t,\tilde{y})\|^2
$$

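* $\epsilon$: true noise used in the forward process to produce $z_t$
* $\hat{\epsilon} = \epsilon_\theta(z_t,t,\tilde{y})$: predicted noise
* $y$: class label (0–9 for CIFAR-10)
* $\tilde{y}$: the real label $y$, or a “null” (unconditional) token substituted at random during training, so that one network supports classifier-free guidance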
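
One training-loss evaluation might look like the sketch below. Several pieces are assumptions for illustration: the null token is encoded as an extra class index `NULL_LABEL`, labels are dropped with probability `P_UNCOND`, the schedule is linear, and `eps_model` is any callable standing in for the UNet. The squared norm is averaged over elements, which only rescales the loss.

```python
import numpy as np

NULL_LABEL = 10   # hypothetical index reserved for the "null" class
P_UNCOND = 0.1    # assumed label-drop probability; not specified in the text

alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))  # assumed linear schedule

def drop_labels(y, rng, p_uncond=P_UNCOND):
    """Replace each label with NULL_LABEL with probability p_uncond."""
    mask = rng.random(y.shape) < p_uncond
    return np.where(mask, NULL_LABEL, y)

def ddpm_loss(eps_model, z0, y, alpha_bar, rng):
    """Monte Carlo estimate of E ||eps - eps_theta(z_t, t, y_tilde)||^2 for one batch."""
    batch = z0.shape[0]
    t = rng.integers(1, len(alpha_bar) + 1, size=batch)   # uniform t in {1, ..., T}
    a = alpha_bar[t - 1].reshape(batch, 1, 1, 1)           # broadcast over [C, H, W]
    eps = rng.standard_normal(z0.shape)
    z_t = np.sqrt(a) * z0 + np.sqrt(1.0 - a) * eps         # forward noising
    y_tilde = drop_labels(y, rng)                          # label dropout for CFG
    eps_hat = eps_model(z_t, t, y_tilde)
    return float(np.mean((eps - eps_hat) ** 2))

# Stand-in "network" that always predicts zero noise.
zero_model = lambda z_t, t, y_tilde: np.zeros_like(z_t)

rng = np.random.default_rng(0)
z0 = rng.standard_normal((8, 4, 8, 8))
y = rng.integers(0, 10, size=8)
loss = ddpm_loss(zero_model, z0, y, alpha_bar, rng)
```

With the zero-predicting stand-in the loss sits near 1, the variance of the noise itself; a trained denoiser should do better than that baseline.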

## Step 4. Sampling (reverse diffusion)

We start from pure noise $z_T \sim \mathcal{N}(0,I)$ and iteratively denoise.

### Predict clean latent

$$
\hat{z}_0 = \frac{z_t - \sqrt{1-\bar\alpha_t}\,\hat{\epsilon}}{\sqrt{\bar\alpha_t}}
$$

* Uses the predicted noise $\hat{\epsilon}$ to recover an approximation of the original latent; this is the forward noising equation solved for $z_0$.
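
A quick sanity check of this inversion: if the predicted noise equals the true noise, the estimate recovers $z_0$ exactly (up to floating-point error). The value `alpha_bar_t = 0.3` is arbitrary.

```python
import numpy as np

def predict_z0(z_t, eps_hat, alpha_bar_t):
    """Invert z_t = sqrt(abar_t) * z0 + sqrt(1 - abar_t) * eps for z0."""
    return (z_t - np.sqrt(1.0 - alpha_bar_t) * eps_hat) / np.sqrt(alpha_bar_t)

rng = np.random.default_rng(0)
z0 = rng.standard_normal((2, 4, 8, 8))
eps = rng.standard_normal(z0.shape)
alpha_bar_t = 0.3                                   # arbitrary example value

z_t = np.sqrt(alpha_bar_t) * z0 + np.sqrt(1.0 - alpha_bar_t) * eps
z0_hat = predict_z0(z_t, eps, alpha_bar_t)          # oracle: eps_hat = true eps
print(np.allclose(z0_hat, z0))  # True
```

In practice $\hat{\epsilon}$ is imperfect, and dividing by $\sqrt{\bar\alpha_t}$ amplifies its error at large $t$, where $\bar\alpha_t$ is small.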



### DDPM update (stochastic)

$$
z_{t-1} = \sqrt{\bar\alpha_{t-1}} \hat{z}_0 + \sqrt{1-\bar\alpha_{t-1}-\sigma_t^2}\,\hat{\epsilon} + \sigma_t \xi,
\quad \xi\sim\mathcal{N}(0,I)
$$

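* $\sigma_t^2 = \beta_t \cdot \frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}$: variance of the DDPM posterior $q(z_{t-1} \mid z_t, z_0)$, with $\bar\alpha_0 = 1$ so that $\sigma_1 = 0$.
* Fresh Gaussian noise $\xi$ is added at every step (ancestral sampling).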
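
A single ancestral step can be sketched as below, written directly from the update above. The linear schedule and the random `eps_hat` are stand-ins; in a real sampler `eps_hat` would come from the UNet.

```python
import numpy as np

def ddpm_step(z_t, eps_hat, t, betas, alpha_bar, rng):
    """One ancestral step z_t -> z_{t-1}; t is 1-indexed and abar_0 is taken as 1."""
    a_t = alpha_bar[t - 1]
    a_prev = alpha_bar[t - 2] if t > 1 else 1.0
    sigma2 = betas[t - 1] * (1.0 - a_prev) / (1.0 - a_t)          # posterior variance
    z0_hat = (z_t - np.sqrt(1.0 - a_t) * eps_hat) / np.sqrt(a_t)  # predicted clean latent
    xi = rng.standard_normal(z_t.shape)
    return (np.sqrt(a_prev) * z0_hat
            + np.sqrt(max(1.0 - a_prev - sigma2, 0.0)) * eps_hat
            + np.sqrt(sigma2) * xi)

T = 1000
betas = np.linspace(1e-4, 0.02, T)       # assumed linear schedule
alpha_bar = np.cumprod(1.0 - betas)

rng = np.random.default_rng(0)
z_t = rng.standard_normal((2, 4, 8, 8))
eps_hat = rng.standard_normal(z_t.shape)  # stand-in for the UNet output
z_prev = ddpm_step(z_t, eps_hat, 500, betas, alpha_bar, rng)
```

At $t = 1$ the variance is zero, so the final step returns $\hat{z}_0$ without adding noise.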



### DDIM update (deterministic or stochastic)

$$
z_{t-1} = \sqrt{\bar\alpha_{t-1}} \hat{z}_0 + \sqrt{1-\bar\alpha_{t-1}}\,\hat{\epsilon},\quad (\eta=0)
$$

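* With $\eta=0$: no noise is injected, so sampling is deterministic given $z_T$. DDIM can also run on a shortened subsequence of timesteps, which is what makes it fast.
* With $\eta>0$: the noise term $\sigma_t \xi$ from the DDPM update returns, scaled by $\eta$, for more diversity; $\eta=1$ recovers ancestral sampling.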
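
The sketch below covers both cases in one function. The expression for `sigma` follows the $\eta$-parameterization from Song et al. 2021, which the text above does not spell out, so treat it as an assumption. The function takes $\bar\alpha_t$ and $\bar\alpha_{t-1}$ directly so that the "previous" step can be many timesteps earlier.

```python
import numpy as np

def ddim_step(z_t, eps_hat, a_t, a_prev, eta=0.0, rng=None):
    """One DDIM step from cumulative keep-rate a_t to a_prev (a_prev > a_t)."""
    # Noise scale per Song et al. 2021; eta = 0 gives sigma = 0 (deterministic).
    sigma = eta * np.sqrt((1.0 - a_prev) / (1.0 - a_t)) * np.sqrt(1.0 - a_t / a_prev)
    z0_hat = (z_t - np.sqrt(1.0 - a_t) * eps_hat) / np.sqrt(a_t)
    direction = np.sqrt(max(1.0 - a_prev - sigma**2, 0.0)) * eps_hat
    noise = 0.0
    if eta > 0.0:
        rng = np.random.default_rng() if rng is None else rng
        noise = sigma * rng.standard_normal(z_t.shape)
    return np.sqrt(a_prev) * z0_hat + direction + noise

alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))  # assumed linear schedule
rng = np.random.default_rng(0)
z0 = rng.standard_normal((2, 4, 8, 8))
eps = rng.standard_normal(z0.shape)

# Jump 50 timesteps at once: from t = 1000 to t = 950.
a_t, a_prev = alpha_bar[999], alpha_bar[949]
z_t = np.sqrt(a_t) * z0 + np.sqrt(1.0 - a_t) * eps
z_prev = ddim_step(z_t, eps, a_t, a_prev)         # eta = 0, oracle noise
```

With the true noise and $\eta = 0$, the step lands exactly on the less-noisy latent built from the same $z_0$ and $\epsilon$, which is why large jumps between timesteps remain consistent.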



## Step 5. Classifier-Free Guidance at sampling
Idea: steer samples toward the condition $y$ while keeping them realistic. A single model, trained on both conditional and unconditional inputs, supplies both predictions, and the guidance scale trades fidelity to the condition against sample diversity.

Run the network twice:

* Unconditional: $\hat{\epsilon}_u = \epsilon_\theta(z_t, t, \varnothing)$
* Conditional: $\hat{\epsilon}_c = \epsilon_\theta(z_t, t, y)$

Combine:

$$
\hat{\epsilon} = \hat{\epsilon}_u + s \cdot (\hat{\epsilon}_c - \hat{\epsilon}_u),
$$

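where $s \ge 1$ is the guidance scale: $s=1$ recovers the purely conditional prediction $\hat{\epsilon}_c$, and larger $s$ pushes samples harder toward class $y$ at the cost of diversity.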
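
The combination rule is short enough to sketch directly. Here `eps_model` is any callable standing in for the UNet, `toy_model` is a made-up stand-in whose output simply equals the label, and encoding the null token as class index 10 is an assumption.

```python
import numpy as np

def cfg_eps(eps_model, z_t, t, y, null_label, s):
    """Classifier-free guidance: eps_u + s * (eps_c - eps_u)."""
    eps_u = eps_model(z_t, t, np.full_like(y, null_label))  # unconditional pass
    eps_c = eps_model(z_t, t, y)                            # conditional pass
    return eps_u + s * (eps_c - eps_u)

def toy_model(z_t, t, y):
    """Hypothetical stand-in: 'predicts' a constant equal to the label value."""
    return np.zeros_like(z_t) + y.reshape(-1, 1, 1, 1)

z_t = np.zeros((2, 4, 8, 8))
t = np.array([5, 5])
y = np.array([3, 7])

guided = cfg_eps(toy_model, z_t, t, y, null_label=10, s=2.0)
print(guided[0, 0, 0, 0], guided[1, 0, 0, 0])  # -4.0 4.0
```

With $s = 2$ the result extrapolates past the conditional prediction, away from the unconditional one: $10 + 2(3 - 10) = -4$ and $10 + 2(7 - 10) = 4$. The two passes are often batched into one forward call, but that is an optimization rather than part of the rule.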

## Step 6. Decode latent back to image

- $z_0$: the final latent once reverse diffusion finishes, with shape $[B,4,8,8]$.
- Decode $z_0$ with the VAE decoder:

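$$
\hat{x} = \text{Dec}_\psi(z_0),
$$

where $\psi$ denotes the VAE decoder parameters (distinct from the UNet parameters $\theta$). The output is a batch of CIFAR-10-sized images with shape $[B,3,32,32]$.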

## References

| Step | Concept                                   | Paper                                                            |
| ---- | ----------------------------------------- | ---------------------------------------------------------------- |
| 1    | VAE (posterior, reparameterization)       | [Kingma & Welling 2013](https://arxiv.org/abs/1312.6114)         |
| 1    | β-VAE (disentanglement, KL weighting)     | [Higgins et al. 2017](https://openreview.net/forum?id=Sy2fzU9gl) |
| 2    | Forward diffusion (noising process)       | [Ho et al. 2020](https://arxiv.org/abs/2006.11239)               |
| 3    | Noise prediction training objective       | [Ho et al. 2020](https://arxiv.org/abs/2006.11239)               |
| 4    | DDPM sampling (ancestral reverse process) | [Ho et al. 2020](https://arxiv.org/abs/2006.11239)               |
| 4    | DDIM sampling (fast, deterministic)       | [Song et al. 2021](https://arxiv.org/abs/2010.02502)             |
| 5    | Classifier-Free Guidance (CFG)            | [Ho & Salimans 2021](https://arxiv.org/abs/2207.12598)           |
| 6    | Latent Diffusion (VAE + DDPM combo)       | [Rombach et al. 2022](https://arxiv.org/abs/2112.10752)          |
