From b4ee3a6440108405430fb9cfb61014b2b455f975 Mon Sep 17 00:00:00 2001
From: Felix Fu <841808303@qq.com>
Date: Wed, 3 Sep 2025 21:36:31 -0700
Subject: [PATCH 1/2] fix the latex rendering issue, update the mathjax.html
 file, explain the ODE distillation in detail.

---
 _includes/scripts/mathjax.html    | 28 +++++++++---
 _posts/2025-08-18-diff-distill.md | 71 +++++++++++++------------------
 2 files changed, 50 insertions(+), 49 deletions(-)

diff --git a/_includes/scripts/mathjax.html b/_includes/scripts/mathjax.html
index c55ec05..72f0dff 100644
--- a/_includes/scripts/mathjax.html
+++ b/_includes/scripts/mathjax.html
@@ -1,12 +1,26 @@
 {%- if site.enable_math -%}
-  <script>…</script>
-  <script>…</script>
-  <script>…</script>
+  <script>…</script>
+  <script>…</script>
 {%- endif %}

diff --git a/_posts/2025-08-18-diff-distill.md b/_posts/2025-08-18-diff-distill.md
index 9b46cb6..610f305 100644
--- a/_posts/2025-08-18-diff-distill.md
+++ b/_posts/2025-08-18-diff-distill.md
@@ -2,7 +2,7 @@
 layout: distill
 title: A Unified Framework for Diffusion Distillation
 description: The explosive growth in one-step and few-step diffusion models has taken the field deep into the weeds of complex notations. In this blog, we cut through the confusion by proposing a coherent set of notations that reveal the connections among these methods.
-tags: generative-models diffusion flows
+tags: generative-models diffusion flow
 giscus_comments: true
 date: 2025-08-21
 featured: true
@@ -15,12 +15,6 @@ authors:

 bibliography: 2025-08-18-diff-distill.bib

-# Optionally, you can add a table of contents to your post.
-# NOTES:
-#   - make sure that TOC names match the actual section names
-#     for hyperlinks within the post to work correctly.
-#   - we may want to automate TOC generation in the future using
-#     jekyll-toc plugin (https://github.com/toshimaru/jekyll-toc).
 toc:
   - name: Introduction
   - name: Notation at a Glance
@@ -36,24 +30,6 @@ toc:
   - name: ReFlow
   - name: Inductive Moment Matching
   - name: Closing Thoughts
-
-# Below is an example of injecting additional post-specific styles.
-# If you use this post as a template, delete this _styles block.
-# _styles: >
-#   .fake-img {
-#     background: #bbb;
-#     border: 1px solid rgba(0, 0, 0, 0.1);
-#     box-shadow: 0 0px 4px rgba(0, 0, 0, 0.1);
-#     margin-bottom: 12px;
-#   }
-#   .fake-img p {
-#     font-family: monospace;
-#     color: white;
-#     text-align: left;
-#     margin: 12px 0;
-#     text-align: center;
-#     font-size: 16px;
-#   }
 ---

## Introduction

-Diffusion and flow-based models, […] [quantization](https://github.com/bitsandbytes-foundation/bitsandbytes), and parameter-efficient fine-tuning. In this blog, we focus on an orthogonal approach, **ODE distillation**, which minimizes the Number of Function Evaluations (NFEs) so that we can generate high-quality samples with as few denoising steps as possible.
+This challenge has spurred research into acceleration strategies across multiple granular levels, including hardware optimization, mixed precision training, [quantization](https://github.com/bitsandbytes-foundation/bitsandbytes), and parameter-efficient fine-tuning. In this blog, we focus on an orthogonal approach named **Ordinary Differential Equation (ODE) distillation**. This method introduces an auxiliary structure that bypasses explicit ODE solving, thereby reducing the Number of Function Evaluations (NFEs). As a result, we can generate high-quality samples with fewer denoising steps.

Distillation, in general, is a technique that transfers knowledge from a complex, high-performance model (the *teacher*) to a more efficient, customized model (the *student*).
Recent distillation methods have achieved remarkable reductions in sampling steps, from hundreds to just a few, and even **one** step, while preserving sample quality. This advancement paves the way for real-time applications and deployment in resource-constrained environments.
<div class="row mt-3">
    <div class="col-sm mt-3 mt-md-0">
-        {% include video.liquid path="blog/2025/diff-distill/diff-distill.mp4" class="img-fluid rounded z-depth-1" controls=true autoplay=true loop=true%}
+        {% include video.liquid path="/blog/2025/diff-distill/diff-distill.mp4" class="img-fluid rounded z-depth-1" controls=true autoplay=true loop=true %}
    </div>
</div>

## Notation at a Glance

-<div class="row mt-3">
-    <div class="col-sm mt-3 mt-md-0">
-        {% include figure.liquid loading="eager" path="blog/2025/diff-distill/teaser_probpath_velocity_field.png" class="img-fluid rounded z-depth-1" %}
-    </div>
-</div>
-<div class="caption">
-From left to right: conditional and marginal probability paths, conditional and marginal velocity fields. The velocity field induces a flow that dictates its instantaneous movement across all points in space.
-</div>
-The modern approaches of generative modelling consist of picking some samples from a base distribution $$\mathbf{x}_1\sim p_{\text{noise}}$$, typically an isotropic Gaussian, and learning a map such that $$\mathbf{x}_0\sim p_{\text{data}}$$. The connection between these two distributions can be expressed by establishing an initial value problem controlled by the **velocity field** $v(\mathbf{x}_t, t)$,
+Modern approaches to generative modelling consist of drawing samples from a base distribution $$ \mathbf{x}_{1} \sim p_ {\text{noise}} $$, typically an isotropic Gaussian, and learning a map such that $$ \mathbf{x}_{0} \sim p_ {\text{data}} $$. The connection between these two distributions can be expressed by establishing an initial value problem controlled by the **velocity field** $$v(\mathbf{x}_{t}, t)$$,

$$
\require{physics}
\begin{equation}\label{eq:1}
\dv{t}\psi_t(\mathbf{x}) = v(\psi_t(\mathbf{x}), t), \qquad \psi_0(\mathbf{x}) = \mathbf{x},
\end{equation}
$$

-where the **flow** $\psi_t:\mathbb{R}^d\times[0,1]\to \mathbb{R}^d$ is a diffeomorphic map with $$\psi_t(\mathbf{x}_t)$$ defined as the solution to the above ODE (\ref{eq:1}). If the flow satisfies the push-forward equation<d-footnote>This is also known as the change of variable equation: $[\phi_t]_\# p_0(x) = p_0(\phi_t^{-1}(x)) \det \left[ \frac{\partial \phi_t^{-1}}{\partial x}(x) \right].$</d-footnote> $$p_t=[\psi_t]_\#p_0$$, we say a **probability path** $$(p_t)_{t\in[0,1]}$$ is generated from the velocity vector field. The goal of flow matching is to find a velocity field $$v_\theta(\mathbf{x}_t, t)$$ so that it transforms $$\mathbf{x}_1\sim p_{\text{noise}}$$ to $$\mathbf{x}_0\sim p_{\text{data}}$$ when integrated. In order to receive supervision at each time step, one must predefine a conditional probability path $$p_t(\cdot \vert \mathbf{x}_0)$$<d-footnote>In practice, the most common one is the Gaussian conditional probability path. This arises from a Gaussian conditional vector field, whose analytical form can be derived from the continuity equation $$\frac{\partial p_t}{\partial t} + \nabla \cdot (p_t v) = 0.$$ See the table for details.</d-footnote> associated with its velocity field. For each datapoint $$\mathbf{x}_0\in \mathbb{R}^d$$, let $$v(\mathbf{x}_t, t\vert\mathbf{x}_0)=\mathbb{E}_{p_t(v_t \vert \mathbf{x}_0)}[v_t]$$ denote a conditional velocity vector field so that the corresponding ODE (\ref{eq:1}) yields the conditional flow.
+where the **flow** $$\psi_t:\mathbb{R}^d\times[0,1]\to \mathbb{R}^d$$ is a diffeomorphic map with $$\psi_t(\mathbf{x}_t)$$ defined as the solution to the above ODE (\ref{eq:1}). If the flow satisfies the push-forward equation<d-footnote>This is also known as the change of variable equation: $[\phi_t]_\# p_0(x) = p_0(\phi_t^{-1}(x)) \det \left[ \frac{\partial \phi_t^{-1}}{\partial x}(x) \right].$</d-footnote> $$p_t=[\psi_t]_\#p_0$$, we say a **probability path** $$(p_t)_{t\in[0,1]}$$ is generated from the velocity vector field. The goal of flow matching is to find a velocity field $$v_\theta(\mathbf{x}_t, t)$$ so that it transforms $$\mathbf{x}_1\sim p_{\text{noise}}$$ to $$\mathbf{x}_0\sim p_{\text{data}}$$ when integrated. In order to receive supervision at each time step, one must predefine a conditional probability path $$p_t(\cdot \vert \mathbf{x}_0)$$<d-footnote>In practice, the most common one is the Gaussian conditional probability path. This arises from a Gaussian conditional vector field, whose analytical form can be derived from the continuity equation $$\frac{\partial p_t}{\partial t} + \nabla \cdot (p_t v) = 0.$$ See the table for details.</d-footnote> associated with its velocity field.
+For each datapoint $$\mathbf{x}_0\in \mathbb{R}^d$$, let $$v(\mathbf{x}_t, t\vert\mathbf{x}_0)=\mathbb{E}_{p_t(v_t \vert \mathbf{x}_0)}[v_t]$$ denote a conditional velocity vector field so that the corresponding ODE (\ref{eq:1}) yields the conditional flow.
+
+<div class="row mt-3">
+    <div class="col-sm mt-3 mt-md-0">
+        {% include figure.liquid loading="eager" path="/blog/2025/diff-distill/teaser_probpath_velocity_field.png" class="img-fluid rounded z-depth-1" %}
+    </div>
+</div>
+<div class="caption">
+From left to right: conditional and marginal probability paths, conditional and marginal velocity fields. The velocity field induces a flow that dictates its instantaneous movement across all points in space.
+</div>
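To ground the notation, here is a minimal sketch (PyTorch is assumed throughout this post's snippets; `velocity` stands in for any learned $$v_\theta(\mathbf{x}_t, t)$$) of generating a sample by integrating the ODE (\ref{eq:1}) with plain Euler steps from noise at $$t=1$$ to data at $$t=0$$. Each step costs one function evaluation, which is exactly the NFE count that distillation aims to reduce:

```python
import torch

@torch.no_grad()
def euler_sample(velocity, x1, num_steps=50):
    """Integrate dx/dt = v(x, t) from t = 1 (noise) down to t = 0 (data)."""
    x = x1
    ts = torch.linspace(1.0, 0.0, num_steps + 1)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        t = torch.full((x.shape[0],), t_cur.item(), device=x.device)
        x = x + (t_next - t_cur) * velocity(x, t)  # dt < 0: moving toward data
    return x
```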
Most of the conditional probability paths are designed as the **differentiable** interpolation between noise and data for simplicity, and we can express sampling from a marginal path $$\mathbf{x}_t = \alpha(t)\mathbf{x}_0 + \beta(t)\mathbf{x}_1$$ where $$\alpha(t), \beta(t)$$ are predefined schedules. The stochastic interpolant paper defines this probability path that summarizes all diffusion models, with several assumptions. Here, we use a simpler interpolant for clean illustration.

@@ -127,6 +106,7 @@ where $$w(t)$$ is a reweighting functionThe weighting function modul

## ODE Distillation methods

+
Before introducing ODE distillation methods, it is imperative to define a general continuous-time flow map $$f_{t\to s}(\mathbf{x}_t, t, s)$$, which maps any noisy input $$\mathbf{x}_t, t\in[0,1]$$ to any point $$\mathbf{x}_s, s\in[0,1]$$ on the ODE (\ref{eq:1}) that describes the probability flow aforementioned. This is a generalization of flow-based distillation and consistency models within a single unified framework. The flow map is well-defined only if its **boundary conditions** satisfy $$f_{t\to t}(\mathbf{x}_t, t, t) = \mathbf{x}_t$$ for all time steps. One popular way to meet the condition is to parameterize the model as $$ f_{t\to s}(\mathbf{x}_t, t, s)= c_{\text{skip}}(t, s)\mathbf{x}_t + c_{\text{out}}(t,s)F_{t\to s}(\mathbf{x}_t, t, s)$$ where $$c_{\text{skip}}(t, t) = 1$$ and $$c_{\text{out}}(t, t) = 0$$ for all $$t$$.

At its core, ODE distillation boils down to how to strategically construct the training objective of the flow map $$f_{t\to s}(\mathbf{x}_t, t, s)$$ so that it can be efficiently evaluated during sampling. In addition, we need to orchestrate the schedule of $$(t,s)$$ pairs for better training dynamics.
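For concreteness, the boundary-condition parameterization can be read as the following sketch; the specific choices $$c_{\text{skip}}=1$$ and $$c_{\text{out}}=s-t$$ are merely illustrative (they satisfy $$c_{\text{skip}}(t,t)=1$$ and $$c_{\text{out}}(t,t)=0$$), not a prescription from any particular method:

```python
import torch

def flow_map(F_net, x_t, t, s):
    """f_{t->s}(x_t, t, s) = c_skip(t, s) * x_t + c_out(t, s) * F(x_t, t, s).

    Since c_skip(t, t) = 1 and c_out(t, t) = 0, the boundary condition
    f_{t->t}(x_t, t, t) = x_t holds no matter what F_net predicts.
    Shapes: x_t is (batch, d); t and s are (batch,).
    """
    c_skip = torch.ones_like(t)  # illustrative choice: c_skip = 1 everywhere
    c_out = s - t                # illustrative choice: vanishes when s = t
    return c_skip.view(-1, 1) * x_t + c_out.view(-1, 1) * F_net(x_t, t, s)
```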
@@ -134,6 +114,7 @@ At its core, ODE distillation boils down to how to strategically construct the t

In the context of distillation, the forward direction is $$s<t$$. During sampling, the conditional probability path is traversed twice. In our flow map formulation, this can be replaced with the flow maps $$f_{\tau_i\to 0}(\mathbf{x}_{\tau_i}, \tau_i, 0), f_{0\to \tau_{i-1}}(\mathbf{x}_0, 0, \tau_{i-1})$$ where $$0<\tau_{i-1}<\tau_i<1$$. Intuitively, the flow map $$f_{t\to s}(\mathbf{x}_t, t, s)$$ represents a direct mapping of some **displacement field** where $$F_{t\to s}(\mathbf{x}_t, t, s)$$ measures the increment which corresponds to a **velocity field**.

### MeanFlow

+
MeanFlow can be trained from scratch or distilled from a pretrained FM model. The conditional probability path is defined as the linear interpolation between noise and data $$\mathbf{x}_t = (1-t)\mathbf{x}_0 + t\mathbf{x}_1$$ with the corresponding default conditional velocity field OT target $$v(\mathbf{x}_t, t \vert \mathbf{x}_0)=\mathbf{x}_1- \mathbf{x}_0.$$ The main contribution consists of identifying and defining an **average velocity field** which coincides with our flow map as

$$
F_{t\to s}(\mathbf{x}_t, t, s) = \frac{1}{t-s}\int_s^t v(\mathbf{x}_\tau, \tau)\,\mathrm{d}\tau.
$$

@@ -172,6 +153,7 @@ where $$F_{t\to s}^{\text{tgt}}(\mathbf{x}_t, t, s\vert\mathbf{x}_0)=v - (t-s)(v

<details>
<summary>Full derivation of the target</summary>
Based on the MeanFlow identity, we can compute the target as follows:
+
$$
\require{physics}
\require{cancel}
\begin{align*}
F_{t\to s}^{\text{tgt}}(\mathbf{x}_t, t, s\vert\mathbf{x}_0) &= \dv{\mathbf{x}_t}{t} - (t-s)\dv{t}F_{t\to s}(\mathbf{x}_t, t, s) \\
& = v - (t-s)\left(\dv{\mathbf{x}_t}{t} \nabla_{\mathbf{x}_t} F_{t\to s}(\mathbf{x}_t, t, s) + \partial_t F_{t\to s}(\mathbf{x}_t, t, s) + \cancel{\dv{s}{t}\partial_s F_{t\to s}(\mathbf{x}_t, t, s)}\right) \\
& = v - (t-s)\left(v \nabla_{\mathbf{x}_t} F_{t\to s}(\mathbf{x}_t, t, s) + \partial_t F_{t\to s}(\mathbf{x}_t, t, s)\right). \\
\end{align*}
$$
+
Note that in MeanFlow $$\dv{\mathbf{x}_t}{t} = v(\mathbf{x}_t, t\vert \mathbf{x}_0)$$ and $$\dv{s}{t}=0$$ since $s$ is independent of $t$.
</details>
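A minimal training-step sketch of this target, assuming a network `F_net(x_t, t, s)` and using `torch.func.jvp` to form the total derivative along the tangent $$(v, 1, 0)$$; the plain squared error below stands in for MeanFlow's adaptive weighting:

```python
import torch
from torch.func import jvp

def meanflow_loss(F_net, x0, x1, t, s):
    """Sketch of the MeanFlow objective on the linear path x_t = (1-t)x0 + t*x1.

    The conditional velocity is v = x1 - x0, so dx_t/dt = v, and dF/dt is a
    JVP along the tangent (dx/dt, dt/dt, ds/dt) = (v, 1, 0).
    """
    t_col = t.view(-1, 1)
    x_t = (1 - t_col) * x0 + t_col * x1
    v = x1 - x0
    F, dF_dt = jvp(F_net, (x_t, t, s), (v, torch.ones_like(t), torch.zeros_like(s)))
    target = v - (t - s).view(-1, 1) * dF_dt    # F_tgt from the derivation above
    return ((F - target.detach()) ** 2).mean()  # stop-gradient on the target
```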
@@ -250,7 +233,8 @@ Type (b) backward loss, while CTMs opt

Similar to MeanFlow preliminaries, Flow Anchor Consistency Model (FACM) also adopts the linear conditional probability path $$\mathbf{x}_t = (1-t)\mathbf{x}_0 + t\mathbf{x}_1$$ with the corresponding default conditional velocity field OT target $$v(\mathbf{x}_t, t \vert \mathbf{x}_0)=\mathbf{x}_1- \mathbf{x}_0.$$ In our flow map notation, FACM parameterizes the model as $$ f^\theta_{t\to 0}(\mathbf{x}_t, t, 0)= \mathbf{x}_t - tF^\theta_{t\to 0}(\mathbf{x}_t, t, 0) $$ where $$c_{\text{skip}}(t,s)=1$$ and $$c_{\text{out}}(t,s)=-t$$.

-FACM imposes a **consistency property** which requires the total derivative of the consistency function to be zero
+FACM imposes a **consistency property** which requires the total derivative of the consistency function to be zero
+
$$
\require{physics}
\dv{t}f^\theta_{t \to 0}(\mathbf{x}, t, 0) = 0.
$$

@@ -268,7 +252,7 @@ Notice this is equivalent to [MeanFlow](#meanflow) where $$s=0$$. This indicates

Training: The FACM training algorithm is shown below, expressed in our flow map notation. Notice that $$d_1, d_2$$ are the $$\ell_2$$-with-cosine loss<d-footnote>$L_{\cos}(\mathbf{x}, \mathbf{y}) = 1 - \dfrac{\mathbf{x} \cdot \mathbf{y}}{\|\mathbf{x}\|_{2} \, \|\mathbf{y}\|_{2}}$</d-footnote> and the norm $$\ell_2$$ loss<d-footnote>$L_{\text{norm}}(\mathbf{x}, \mathbf{y}) =\dfrac{\|\mathbf{x}-\mathbf{y}\|^2}{\sqrt{\|\mathbf{x}-\mathbf{y}\|^2+c}}$ where $c$ is a small constant. This is a special case of the adaptive L2 loss proposed in MeanFlow.</d-footnote> respectively, plus reweighting. Interestingly, they separate the training of FM and CM on disentangled time intervals. When training with the CM target, we let $$s=0, t\in[0,1]$$. On the other hand, we set $$t'=2-t, t'\in[1,2]$$ when training with FM anchors.
<div class="row mt-3">
    <div class="col-sm mt-3 mt-md-0">
-        {% include figure.liquid loading="eager" path="blog/2025/diff-distill/facm_training.png" class="img-fluid rounded z-depth-1" %}
+        {% include figure.liquid loading="eager" path="/blog/2025/diff-distill/facm_training.png" class="img-fluid rounded z-depth-1" %}
    </div>
</div>
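For concreteness, the two distance functions from the footnotes might be implemented as the following sketch (`eps` and the constant `c` are illustrative values):

```python
import torch

def cosine_loss(x, y, eps=1e-8):
    """L_cos(x, y) = 1 - <x, y> / (||x||_2 ||y||_2), computed per sample."""
    x, y = x.flatten(1), y.flatten(1)
    cos = (x * y).sum(-1) / (x.norm(dim=-1) * y.norm(dim=-1) + eps)
    return 1.0 - cos

def norm_l2_loss(x, y, c=1e-3):
    """L_norm(x, y) = ||x - y||^2 / sqrt(||x - y||^2 + c)."""
    sq = (x - y).flatten(1).pow(2).sum(-1)
    return sq / torch.sqrt(sq + c)
```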
@@ -306,12 +290,14 @@ Type (b) backward loss for AYF-EMD, type (a) forward loss for AYF-LMD.

## Connections

+
Now it is time to connect the dots with prior methods. Let's frame their objectives in our flow map notation and identify their loss types where possible.

### Shortcut Models

+
<div class="row mt-3">
    <div class="col-sm mt-3 mt-md-0">
-        {% include figure.liquid loading="eager" path="blog/2025/diff-distill/shortcut_model.png" class="img-fluid rounded z-depth-1" %}
+        {% include figure.liquid loading="eager" path="/blog/2025/diff-distill/shortcut_model.png" class="img-fluid rounded z-depth-1" %}
    </div>
</div>

@@ -337,7 +323,7 @@ Type (c) tri-consistency loss

### ReFlow
<div class="row mt-3">
    <div class="col-sm mt-3 mt-md-0">
-        {% include figure.liquid loading="eager" path="blog/2025/diff-distill/rectifiedflow.png" class="img-fluid rounded z-depth-1" %}
+        {% include figure.liquid loading="eager" path="/blog/2025/diff-distill/rectifiedflow.png" class="img-fluid rounded z-depth-1" %}
    </div>
</div>

@@ -350,9 +336,10 @@ Unlike most ODE distillation methods that learn to jump from $$t\to s$$ accordin

Sampling: Same as FMs.

### Inductive Moment Matching

+
<div class="row mt-3">
    <div class="col-sm mt-3 mt-md-0">
-        {% include figure.liquid loading="eager" path="blog/2025/diff-distill/IMM.png" class="img-fluid rounded z-depth-1" %}
+        {% include figure.liquid loading="eager" path="/blog/2025/diff-distill/IMM.png" class="img-fluid rounded z-depth-1" %}
    </div>
</div>

From f0605a9a099659d6a6919409ff1c8b5d41951548 Mon Sep 17 00:00:00 2001
From: Felix Fu <841808303@qq.com>
Date: Wed, 3 Sep 2025 22:17:00 -0700
Subject: [PATCH 2/2] add the caption for all the visuals

---
 _posts/2025-08-18-diff-distill.md | 34 +++++++++++++++++++-------------
 1 file changed, 21 insertions(+), 13 deletions(-)

diff --git a/_posts/2025-08-18-diff-distill.md b/_posts/2025-08-18-diff-distill.md
index 610f305..f1ead0a 100644
--- a/_posts/2025-08-18-diff-distill.md
+++ b/_posts/2025-08-18-diff-distill.md
@@ -45,13 +45,17 @@ Distillation, in general, is a technique that transfers knowledge from a complex
<div class="row mt-3">
    <div class="col-sm mt-3 mt-md-0">
        {% include video.liquid path="/blog/2025/diff-distill/diff-distill.mp4" class="img-fluid rounded z-depth-1" controls=true autoplay=true loop=true %}
    </div>
+    <div class="caption">
+        A video illustrating the basic flow matching concepts and three categories of ODE distillation objectives.
+    </div>
</div>
+

## Notation at a Glance

-Modern approaches to generative modelling consist of drawing samples from a base distribution $$ \mathbf{x}_{1} \sim p_ {\text{noise}} $$, typically an isotropic Gaussian, and learning a map such that $$ \mathbf{x}_{0} \sim p_ {\text{data}} $$. The connection between these two distributions can be expressed by establishing an initial value problem controlled by the **velocity field** $$v(\mathbf{x}_{t}, t)$$,
+Modern approaches to generative modelling consist of drawing samples from a base distribution $$ \mathbf{x}_{1} \sim p_{\text{noise}} $$, typically an isotropic Gaussian, and learning a map such that $$ \mathbf{x}_{0} \sim p_{\text{data}} $$. The connection between these two distributions can be expressed by establishing an initial value problem controlled by the **velocity field** $$v(\mathbf{x}_{t}, t)$$,

$$
\require{physics}
\begin{equation}\label{eq:1}
\dv{t}\psi_t(\mathbf{x}) = v(\psi_t(\mathbf{x}), t), \qquad \psi_0(\mathbf{x}) = \mathbf{x},
\end{equation}
$$

@@ -66,11 +70,12 @@ where the **flow** $$\psi_t:\mathbb{R}^d\times[0,1]\to \mathbb{R}^d$$ is a diffe
<div class="row mt-3">
    <div class="col-sm mt-3 mt-md-0">
        {% include figure.liquid loading="eager" path="/blog/2025/diff-distill/teaser_probpath_velocity_field.png" class="img-fluid rounded z-depth-1" %}
    </div>
+    <div class="caption">
+        From left to right: conditional and marginal probability paths, conditional and marginal velocity fields. The velocity field induces a flow that dictates its instantaneous movement across all points in space.
+    </div>
</div>
-
-<div class="caption">
-From left to right: conditional and marginal probability paths, conditional and marginal velocity fields. The velocity field induces a flow that dictates its instantaneous movement across all points in space.
-</div>
+

Most of the conditional probability paths are designed as the **differentiable** interpolation between noise and data for simplicity, and we can express sampling from a marginal path $$\mathbf{x}_t = \alpha(t)\mathbf{x}_0 + \beta(t)\mathbf{x}_1$$ where $$\alpha(t), \beta(t)$$ are predefined schedules. The stochastic interpolant paper defines this probability path that summarizes all diffusion models, with several assumptions. Here, we use a simpler interpolant for clean illustration.
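As a minimal sketch of drawing a training tuple from this interpolant (`alpha` and `beta` are whatever schedules one picks; the linear pair $$\alpha(t)=1-t, \beta(t)=t$$ used by the methods above is one instance):

```python
import torch

def sample_marginal_path(x0, alpha, beta):
    """Draw x_t = alpha(t) * x0 + beta(t) * x1 with x1 ~ N(0, I), t ~ U[0, 1]."""
    t = torch.rand(x0.shape[0], device=x0.device)
    x1 = torch.randn_like(x0)
    t_b = t.view(-1, *([1] * (x0.dim() - 1)))  # broadcast t over data dims
    x_t = alpha(t_b) * x0 + beta(t_b) * x1
    return x_t, t, x1

# Linear path example: sample_marginal_path(x0, lambda t: 1 - t, lambda t: t)
```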
@@ -298,11 +303,12 @@ Now it is time to connect the dots with prior methods. Let's frame their objecti

<div class="row mt-3">
    <div class="col-sm mt-3 mt-md-0">
        {% include figure.liquid loading="eager" path="/blog/2025/diff-distill/shortcut_model.png" class="img-fluid rounded z-depth-1" %}
    </div>
+    <div class="caption">
+        The diagram of Shortcut Models
+    </div>
</div>
-
-<div class="caption">
-The diagram of Shortcut Models
-</div>
+

In essence, Shortcut Models augment the standard flow matching objective with a self-consistency regularization term. This additional loss component ensures that the learned vector field satisfies a midpoint consistency property: the result of a single large integration step should match the composition of two smaller steps traversing the same portion of the ODE (\ref{eq:1}) trajectory.

Training: In the training objective, we omit the input arguments and focus on the core transition between time steps. Again, we express it in our flow map notation.
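A sketch of that midpoint self-consistency term, with `f(x, t, s)` denoting the flow map's jump from $$t$$ to $$s$$ (time decreasing toward data at $$0$$); stopping gradients through the two-step target is a common stabilization choice rather than something this summary prescribes:

```python
import torch

def shortcut_consistency_loss(f, x_t, t, d):
    """One big jump of size 2d should match two chained jumps of size d."""
    with torch.no_grad():
        x_mid = f(x_t, t, t - d)             # first small step
        target = f(x_mid, t - d, t - 2 * d)  # second small step
    return ((f(x_t, t, t - 2 * d) - target) ** 2).mean()
```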
@@ -324,11 +330,12 @@ Type (c) tri-consistency loss

<div class="row mt-3">
    <div class="col-sm mt-3 mt-md-0">
        {% include figure.liquid loading="eager" path="/blog/2025/diff-distill/rectifiedflow.png" class="img-fluid rounded z-depth-1" %}
    </div>
+    <div class="caption">
+        The diagram of rectified flow and the ReFlow process
+    </div>
</div>
-
-<div class="caption">
-The diagram of rectified flow and ReFlow process
-</div>
+

Unlike most ODE distillation methods that learn to jump from $$t\to s$$ according to our defined flow map $$f_{t\to s}(\mathbf{x}_t, t, s)$$, ReFlow takes a different approach by establishing new noise-data couplings so that the new model will generate straighter trajectories.<d-footnote>In the rectified flow paper, the straightness of any continuously differentiable process $$Z=\{Z_t\}$$ can be measured by $$S(Z)=\int_0^1\mathbb{E}\|(Z_1-Z_0)-\dot{Z}_t\|^2 dt$$, where $S(Z)=0$ implies the trajectories are perfectly straight.</d-footnote> Straighter trajectories allow the ODE (\ref{eq:1}) to be solved with fewer steps and larger step sizes. To some extent, this resembles the preconditioning from OT-CFM, where noise and data pairs are intentionally sampled jointly from an optimal transport map $$\pi(\mathbf{x}_0, \mathbf{x}_1)$$ instead of independent marginals.

Training: Pair synthesized data from the pretrained model with its generating noise, then use this new coupling to train a student model with the standard FM objective.
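A minimal sketch of the coupling construction, where `teacher_sampler` is any pretrained FM sampler (for instance, the Euler sketch earlier in this post):

```python
import torch

@torch.no_grad()
def build_reflow_coupling(teacher_sampler, num_pairs, shape, device):
    """Fix noise x1, then let the pretrained teacher produce its paired x0_hat."""
    x1 = torch.randn(num_pairs, *shape, device=device)
    x0_hat = teacher_sampler(x1)  # teacher integrates the ODE from noise to data
    return x0_hat, x1

# The student is then trained with the standard FM objective on this coupling:
# x_t = (1 - t) * x0_hat + t * x1, with regression target v = x1 - x0_hat.
```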
@@ -340,11 +347,12 @@ Unlike most ODE distillation methods that learn to jump from $$t\to s$$ accordin

<div class="row mt-3">
    <div class="col-sm mt-3 mt-md-0">
        {% include figure.liquid loading="eager" path="/blog/2025/diff-distill/IMM.png" class="img-fluid rounded z-depth-1" %}
    </div>
+    <div class="caption">
+        The diagram of IMM
+    </div>
</div>
-
-<div class="caption">
-The diagram of IMM
-</div>
+

This recent method trains our flow map from scratch via matching the distributions of $$f^{\theta}_{t\to s}(\mathbf{x}_t, t, s)$$ and $$f^{\theta}_{r\to s}(\mathbf{x}_r, r, s)$$ where $$s < r < t$$.

Training: In our flow map notation, the training objective becomes