# Recent Core Papers in Modern Generative Modeling (2021+), Grouped by Family

## 1) Transformers (Generative Models & Foundation Models)

### Text/Foundation Models
- **Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity** (2021)  
- **Training Compute-Optimal Large Language Models (Chinchilla)** (2022)  
- **PaLM: Scaling Language Modeling with Pathways** (2022)  
- **FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness** (2022)  
- **LLaMA: Open and Efficient Foundation Language Models** (2023)  
- **GPT-4 Technical Report** (2023)

### Transformer-based Visual Generation
- **MaskGIT: Masked Generative Image Transformer** (2022)  
- **Scaling Autoregressive Models for Content-Rich Text-to-Image Generation (Parti)** (2022)

Key primitive (scaled dot-product attention):
$$
\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V.
$$
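
A minimal sketch of the formula above in plain Python, using lists of row vectors so it runs without an array library. The function names `softmax` and `attention` and the toy `Q`, `K`, `V` values are illustrative assumptions, not taken from any of the listed papers.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for lists of row vectors.

    Q: n_q x d_k, K: n_k x d_k, V: n_k x d_v (lists of lists).
    """
    d_k = len(K[0])
    out = []
    for q in Q:
        # Scores q . k / sqrt(d_k) against every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted average of the value rows.
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out

Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
result = attention(Q, K, V)                   # leans toward the first value row
uniform = attention([[0.0, 0.0]], K, V)       # zero query gives equal weights
```

Each output row is a convex combination of the rows of `V`, so a query with no preference among the keys returns their plain average.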

Autoregressive likelihood factorization:
$$
p(x)=\prod_{t=1}^{T} p(x_t \mid x_{<t}).
$$
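
As an illustration of the chain-rule factorization, the sketch below scores a sequence under a hypothetical two-token model whose conditional depends only on the previous token. The tables `FIRST` and `NEXT` are made-up numbers; a Transformer would replace them with a learned conditional over the full prefix.

```python
import itertools
import math

# Hypothetical toy model: p(x_t | x_{<t}) depends only on the previous token.
VOCAB = ["a", "b"]
FIRST = {"a": 0.5, "b": 0.5}                 # p(x_1)
NEXT = {"a": {"a": 0.9, "b": 0.1},           # p(x_t | x_{t-1} = "a")
        "b": {"a": 0.4, "b": 0.6}}           # p(x_t | x_{t-1} = "b")

def log_prob(seq):
    """log p(x) = sum_t log p(x_t | x_{<t}) via the chain rule."""
    total = math.log(FIRST[seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        total += math.log(NEXT[prev][cur])
    return total

lp = log_prob(["a", "a", "b"])               # log(0.5 * 0.9 * 0.1)

# Because every conditional is normalized, the sequence probabilities sum to 1.
total_mass = sum(math.exp(log_prob(list(s)))
                 for s in itertools.product(VOCAB, repeat=3))
```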

---

## 2) Diffusion / Score-Based Models

- **Improved Denoising Diffusion Probabilistic Models** (2021)  
- **Diffusion Models Beat GANs on Image Synthesis** (2021)  
- **High-Resolution Image Synthesis with Latent Diffusion Models (LDM)** (2021/2022)  
- **Classifier-Free Diffusion Guidance** (2022)  
- **Elucidating the Design Space of Diffusion-Based Generative Models (EDM)** (2022)  
- **Consistency Models** (2023)

Forward noising (DDPM-style):
$$
q(x_t \mid x_{t-1})=\mathcal{N}\!\left(\sqrt{1-\beta_t}\,x_{t-1},\,\beta_t I\right),
\qquad
q(x_t \mid x_0)=\mathcal{N}\!\left(\sqrt{\bar{\alpha}_t}\,x_0,\,(1-\bar{\alpha}_t)I\right),
$$
where $\alpha_t=1-\beta_t$ and $\bar{\alpha}_t=\prod_{s=1}^{t}\alpha_s$.
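
The closed form for $q(x_t \mid x_0)$ can be checked numerically. The sketch below assumes a linear $\beta$ schedule with illustrative endpoints (the source does not fix a schedule) and uses 0-based timestep indices; `linear_betas`, `alpha_bars`, and `q_sample` are hypothetical helper names.

```python
import math
import random

def linear_betas(T, beta_start=1e-4, beta_end=0.02):
    """Linear variance schedule (illustrative endpoint values)."""
    return [beta_start + (beta_end - beta_start) * t / (T - 1) for t in range(T)]

def alpha_bars(betas):
    """Cumulative products bar_alpha_t = prod_{s<=t} (1 - beta_s)."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def q_sample(x0, t, abar, rng):
    """Draw x_t ~ q(x_t | x_0) in one shot; t is a 0-based index into abar."""
    a = abar[t]
    return [math.sqrt(a) * v + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for v in x0]

betas = linear_betas(T=1000)
abar = alpha_bars(betas)

# Consistency check: iterating the one-step kernel from a fixed x_0 yields
# mean coefficient sqrt(abar_T) and variance 1 - abar_T.
mean_coef, var = 1.0, 0.0
for b in betas:
    mean_coef *= math.sqrt(1.0 - b)
    var = (1.0 - b) * var + b

x_t = q_sample([1.0, -1.0], t=500, abar=abar, rng=random.Random(0))
```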

Denoising/score learning objective (one common form):
$$
\min_{\theta}\; \mathbb{E}_{t,x_0,\varepsilon}\left[\left\|\varepsilon-\varepsilon_\theta(x_t,t)\right\|_2^2\right],
\qquad
x_t=\sqrt{\bar{\alpha}_t}\,x_0+\sqrt{1-\bar{\alpha}_t}\,\varepsilon,\;\varepsilon\sim\mathcal{N}(0,I).
$$
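
A hedged sketch of a Monte Carlo estimate of this objective. The schedule values in `abar`, the all-zero dataset, and the two stand-in predictors (`zero_model`, `oracle_model`) are assumptions chosen so the result is easy to sanity-check; a real $\varepsilon_\theta$ would be a trained network.

```python
import math
import random

def ddpm_loss(eps_model, x0_batch, abar, rng):
    """Monte Carlo estimate of E ||eps - eps_model(x_t, t)||^2 over a batch."""
    total = 0.0
    for x0 in x0_batch:
        t = rng.randrange(len(abar))                 # uniform timestep (0-based)
        eps = [rng.gauss(0.0, 1.0) for _ in x0]      # target noise
        a = abar[t]
        x_t = [math.sqrt(a) * v + math.sqrt(1.0 - a) * e for v, e in zip(x0, eps)]
        pred = eps_model(x_t, t)
        total += sum((e - p) ** 2 for e, p in zip(eps, pred))
    return total / len(x0_batch)

abar = [0.9, 0.5, 0.1]                               # illustrative schedule values
data = [[0.0, 0.0]] * 64                             # every sample is the origin

def zero_model(x_t, t):
    return [0.0 for _ in x_t]

def oracle_model(x_t, t):
    # With x_0 = 0, x_t = sqrt(1 - abar_t) * eps, so eps is recoverable exactly.
    return [v / math.sqrt(1.0 - abar[t]) for v in x_t]

loss_zero = ddpm_loss(zero_model, data, abar, random.Random(0))
loss_oracle = ddpm_loss(oracle_model, data, abar, random.Random(0))
```

The oracle drives the loss to (numerically) zero, while predicting zero noise leaves a loss close to the noise dimension.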

Classifier-free guidance (schematic):
$$
\hat{\varepsilon}(x_t,t,c)=\varepsilon_\theta(x_t,t,\varnothing)+w\left(\varepsilon_\theta(x_t,t,c)-\varepsilon_\theta(x_t,t,\varnothing)\right).
$$
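
In the parameterization written above, $w=0$ gives the unconditional prediction and $w=1$ the conditional one; some papers use a shifted weight, so the convention is worth checking before comparing guidance scales. A small sketch, with `cfg_combine` and the toy vectors as illustrative assumptions:

```python
def cfg_combine(eps_uncond, eps_cond, w):
    """eps_hat = eps_uncond + w * (eps_cond - eps_uncond), elementwise."""
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]

eps_uncond = [0.0, 1.0]
eps_cond = [1.0, 3.0]

no_guidance = cfg_combine(eps_uncond, eps_cond, w=0.0)   # unconditional prediction
plain_cond = cfg_combine(eps_uncond, eps_cond, w=1.0)    # conditional prediction
amplified = cfg_combine(eps_uncond, eps_cond, w=3.0)     # extrapolates past the conditional
```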

---

## 3) Flow-Based Models / Continuous Normalizing Flows

- **Flow Matching for Generative Modeling** (2022 / ICLR 2023)  
- **Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow** (2022)  
- **Improving Rectified Flow with Boundary Conditions** (2025)

Continuous normalizing flow (CNF) dynamics:
$$
\frac{dx}{dt}=f_\theta(x,t).
$$

Instantaneous change of variables (log-density evolution):
$$
\frac{d}{dt}\log p(x(t))=-\mathrm{Tr}\!\left(\frac{\partial f_\theta}{\partial x}\right).
$$
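
The two equations above can be integrated together. The sketch below assumes a hypothetical one-dimensional linear field $f(x,t)=Ax$ (so the trace is just $A$) and a fixed-step Euler solver, then compares the result with the analytic density of the pushed-forward Gaussian.

```python
import math

A = -0.5   # hypothetical linear field f(x, t) = A * x, so df/dx = A

def f(x, t):
    return A * x

def integrate(x0, logp0, t1=1.0, steps=10_000):
    """Euler-integrate the state and its log-density from t = 0 to t1."""
    x, logp, dt = x0, logp0, t1 / steps
    for i in range(steps):
        t = i * dt
        x += f(x, t) * dt
        logp += -A * dt          # d/dt log p = -Tr(df/dx) = -A in one dimension
    return x, logp

def std_normal_logpdf(x):
    return -0.5 * x * x - 0.5 * math.log(2.0 * math.pi)

x0 = 1.3
x1, logp1 = integrate(x0, std_normal_logpdf(x0))

# Analytic check: x(1) = x0 * exp(A), which is N(0, exp(2A)) when x0 ~ N(0, 1).
sigma = math.exp(A)
exact = -0.5 * (x1 / sigma) ** 2 - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)
```

In higher dimensions the trace is the expensive part, which is one reason practical CNFs often rely on stochastic trace estimators rather than the exact Jacobian.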

Flow matching (high-level target: match a velocity field along an interpolation path):
$$
\min_{\theta}\; \mathbb{E}_{t}\,\mathbb{E}_{x(t)}\left[\left\|f_\theta(x(t),t)-v^\star(x(t),t)\right\|_2^2\right].
$$
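
One common instantiation, assumed here for illustration, takes the straight-line path $x_t=(1-t)x_0+t\,x_1$ between a noise sample $x_0$ and a data sample $x_1$, whose velocity is $x_1-x_0$. The sketch estimates the regression loss under that assumption; `cfm_loss`, `SHIFT`, and the toy coupling are hypothetical.

```python
import random

def cfm_loss(velocity_model, pairs, rng):
    """Estimate E ||f(x_t, t) - (x1 - x0)||^2 on the path x_t = (1 - t) x0 + t x1."""
    total = 0.0
    for x0, x1 in pairs:
        t = rng.random()                                   # t ~ Uniform[0, 1)
        x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
        target = [b - a for a, b in zip(x0, x1)]           # velocity of the straight path
        pred = velocity_model(x_t, t)
        total += sum((p - v) ** 2 for p, v in zip(pred, target))
    return total / len(pairs)

rng = random.Random(0)
SHIFT = [2.0, -1.0]                                        # hypothetical data offset
pairs = []
for _ in range(128):
    x0 = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]        # noise sample
    x1 = [a + s for a, s in zip(x0, SHIFT)]                # "data" = shifted noise
    pairs.append((x0, x1))

def constant_field(x_t, t):
    return list(SHIFT)            # exact velocity for this toy coupling

def zero_field(x_t, t):
    return [0.0 for _ in x_t]

loss_exact = cfm_loss(constant_field, pairs, random.Random(1))
loss_zero = cfm_loss(zero_field, pairs, random.Random(1))
```

For this toy coupling the true velocity is the constant `SHIFT`, so the exact field reaches zero loss and the zero field pays the squared norm of `SHIFT`.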

---

## 4) GANs / Adversarial Generative Models

- **Alias-Free Generative Adversarial Networks (StyleGAN3)** (2021)  
- **StyleGAN-XL: Scaling StyleGAN to Large and Diverse Datasets** (2022)  
- **StyleSwin: Transformer-Based GAN for High-Resolution Image Generation** (2022)  
- **Designing an Encoder for StyleGAN Image Manipulation (e4e)** (2021)

Standard GAN minimax objective:
$$
\min_{G}\max_{D}\;
\mathbb{E}_{x\sim p_{\text{data}}}\left[\log D(x)\right]
+\mathbb{E}_{z\sim p(z)}\left[\log\left(1-D(G(z))\right)\right].
$$
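
A small sketch of the empirical value function, taking discriminator outputs as given numbers rather than from a trained network (the lists below are illustrative). It reproduces the known value $-\log 4$ when the discriminator outputs $1/2$ everywhere.

```python
import math

def gan_value(d_real, d_fake):
    """Empirical V(D, G) = mean log D(x) + mean log(1 - D(G(z))).

    d_real: discriminator outputs in (0, 1) on real samples.
    d_fake: discriminator outputs in (0, 1) on generated samples.
    """
    real_term = sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return real_term + fake_term

# A discriminator that cannot tell the two apart outputs 1/2 everywhere.
v_confused = gan_value([0.5] * 4, [0.5] * 4)       # equals -log 4
# A confident, correct discriminator pushes the value toward 0.
v_confident = gan_value([0.99] * 4, [0.01] * 4)
```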

---

## 5) Adversarial Neural Networks (Robustness & Adversarial Training, not GANs)

- **Fast Is Better Than Free: Revisiting Adversarial Training** (2020)

Robust optimization perspective (here $\varepsilon$ is the perturbation budget, not the diffusion noise of Section 2):
$$
\min_{\theta}\;\mathbb{E}_{(x,y)}\left[\max_{\|\delta\|\le \varepsilon}\;\mathcal{L}\big(f_\theta(x+\delta),y\big)\right].
$$
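
For a linear classifier with logistic loss the inner maximization has a closed form, which makes the min-max structure easy to inspect. The sketch assumes an $\ell_\infty$ ball (the formula above leaves the norm unspecified), labels in $\{-1,+1\}$, and hypothetical fixed parameters `w`, `b`; for deep networks the inner maximum is usually approximated with gradient-based attacks instead.

```python
import math

def logistic_loss(w, b, x, y):
    """Loss log(1 + exp(-y * (w . x + b))) for a label y in {-1, +1}."""
    margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
    return math.log1p(math.exp(-margin))

def sign(v):
    return (v > 0) - (v < 0)

def worst_case_delta(w, y, eps):
    """Exact inner maximizer over the l-infinity ball for a linear model."""
    return [-eps * y * sign(wi) for wi in w]

w, b = [2.0, -1.0], 0.5          # hypothetical fixed parameters
x, y, eps = [1.0, 1.0], 1, 0.25

delta = worst_case_delta(w, y, eps)
clean = logistic_loss(w, b, x, y)
robust = logistic_loss(w, b, [xi + di for xi, di in zip(x, delta)], y)
```

The worst-case perturbation shrinks the margin by $\varepsilon\|w\|_1$, so the robust loss is the clean loss evaluated at that reduced margin.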

---

## 6) Energy-Based Models (EBMs)

- **Energy-Based Models for Anomaly Detection: A Manifold Diffusion Recovery Approach** (NeurIPS 2023)  
- **Energy-Based Diffusion Language Model (EDLM)** (2024)

Energy-based density:
$$
p_\theta(x)=\frac{\exp(-E_\theta(x))}{Z_\theta},
\qquad
Z_\theta=\int \exp(-E_\theta(x))\,dx.
$$

Negative log-likelihood (note that $\log Z_\theta$ is constant in $x$ but depends on $\theta$):
$$
-\log p_\theta(x)=E_\theta(x)+\log Z_\theta.
$$
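
On a finite state space the partition function is a sum, so the density and negative log-likelihood can be computed exactly. The sketch below assumes a hypothetical quadratic energy and a five-point support; neither comes from the listed papers.

```python
import math

def energy(x, theta):
    """Hypothetical quadratic energy E_theta(x) = theta * x^2."""
    return theta * x * x

STATES = [-2, -1, 0, 1, 2]       # finite support, so Z is a sum rather than an integral

def log_partition(theta):
    return math.log(sum(math.exp(-energy(x, theta)) for x in STATES))

def nll(x, theta):
    """-log p_theta(x) = E_theta(x) + log Z_theta."""
    return energy(x, theta) + log_partition(theta)

theta = 0.5
probs = [math.exp(-nll(x, theta)) for x in STATES]   # normalized probabilities
```

In continuous, high-dimensional settings $Z_\theta$ is generally intractable, which is why EBM training typically relies on sampling-based or score-based estimators rather than this direct computation.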
