In [None]:
We can express the result in compact matrix‐form. Recall the elementwise result

\[
\frac{\partial \beta_i}{\partial v_j}=\frac{r}{d}\left(\delta_{ij}-s_i\,s_j+\frac{s_i}{K\,(\pmb{A}_{i^*}\cdot\vec{s})}\,A_{i^*,j}\right),
\]

where:
- \(\vec{s}=\vec{v}/d\) with \(d=\|\vec{v}\|_2\),
- \(r=\Bigl(\frac{d}{\alpha^{\max}}\Bigr)^{1/K}\),
- and \(\pmb{A}_{i^*}\) is the active (row) constraint, so that \(\pmb{A}_{i^*}\cdot\vec{s}\) is a scalar.

In outer product notation, define:
- the identity matrix \(I\),
- the rank‑one projection \(\vec{s}\vec{s}^T\),
- and note that \(\vec{s}\,\pmb{A}_{i^*}\) is the outer product of the \(K\)-vector \(\vec{s}\) with the \(K\)-dimensional row vector \(\pmb{A}_{i^*}\).

Then the Jacobian can be written as

\[
\boxed{
\frac{\partial \vec{\beta}}{\partial \vec{v}}=\frac{r}{d}\left[I-\vec{s}\vec{s}^T+\frac{1}{K\,(\pmb{A}_{i^*}\cdot\vec{s})}\,\vec{s}\,\pmb{A}_{i^*}\right].
}
\]

This is the desired matrix formulation of the Jacobian.
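
As a sanity check, the boxed expression can be compared against central finite differences. The sketch below assumes that \(\vec{\beta} = r\,\vec{s}\) and that \(\alpha^{\max} = b_{i^*}/(\pmb{A}_{i^*}\cdot\vec{s})\) for a fixed active constraint \(i^*\). Neither definition is restated in this section, so both, together with all numerical values and function names, are assumptions made for illustration only.

```python
import numpy as np

def beta(v, a_row, b_val):
    """Assumed forward map: beta = r * s, with alpha_max = b_val / (a_row . s)."""
    K = v.size
    d = np.linalg.norm(v)
    s = v / d
    alpha_max = b_val / (a_row @ s)      # assumption, not given in the text
    r = (d / alpha_max) ** (1.0 / K)
    return r * s

def jacobian_closed_form(v, a_row, b_val):
    """(r/d) * [I - s s^T + s a_row / (K * (a_row . s))]."""
    K = v.size
    d = np.linalg.norm(v)
    s = v / d
    a_dot_s = a_row @ s
    r = (d * a_dot_s / b_val) ** (1.0 / K)
    return (r / d) * (np.eye(K) - np.outer(s, s)
                      + np.outer(s, a_row) / (K * a_dot_s))

def jacobian_fd(v, a_row, b_val, eps=1e-6):
    """Central finite differences; column j holds d beta / d v_j."""
    K = v.size
    J = np.zeros((K, K))
    for j in range(K):
        e = np.zeros(K)
        e[j] = eps
        J[:, j] = (beta(v + e, a_row, b_val) - beta(v - e, a_row, b_val)) / (2 * eps)
    return J

# Hypothetical values; a_row . v > 0 so the constraint is hit along direction s.
v = np.array([0.7, -0.2, 0.4])
a_row = np.array([1.0, 0.5, 2.0])
b_val = 1.5

J_cf = jacobian_closed_form(v, a_row, b_val)
J_fd = jacobian_fd(v, a_row, b_val)
max_abs_err = np.max(np.abs(J_cf - J_fd))
print("max |closed form - finite differences| =", max_abs_err)
```

The check only holds while the active constraint \(i^*\) does not change under the perturbation, which is why a small step size is used.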

Below is an expanded version of your text that integrates the ideas of flow matching, along with the necessary mathematical formulations for a probability path, a conditional probability path, and an optimal transport conditional probability path.

---

\subsection{Continuous Normalizing Flows and Flow Matching}

Pushing the idea of composing flows to its continuous limit, we arrive at continuous normalizing flows (CNFs) or, equivalently, neural ODEs \cite{chen_neural_2019, grathwohl_ffjord_2018}. In this continuous formulation, the transformation of a state is described by an ODE:
\begin{align}
\frac{d\vec{h}(t)}{dt} = \mathbf{g}(\vec{h}(t), t; \theta), \quad \vec{h}(0) = \vec{b}, \quad \vec{h}(T) = \vec{y},
\end{align}
with the corresponding evolution of the density governed by
\begin{align}
\frac{d\log p(\vec{h}(t))}{dt} = -\operatorname{Tr}\!\left(\frac{\partial \mathbf{g}}{\partial \vec{h}(t)}\right),
\end{align}
so that integrating over time yields
\begin{align}
p(\vec{y}) = p(\vec{b}) \exp\!\left(-\int_0^T \operatorname{Tr}\!\left(\frac{\partial \mathbf{g}}{\partial \vec{h}(t)}\right) dt\right).
\end{align}

This continuous perspective not only offers a smooth, adaptive way to model transformations but also provides an alternative viewpoint to the discrete stacking of flows. Notably, it obviates the need for expensive determinant computations, thereby allowing for transformations with non-triangular Jacobians. However, in maximum likelihood training, evaluating the Jacobian trace (or the equivalent determinant in the discrete case) can become a computational bottleneck, especially in high-dimensional settings.
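
The trace formula can be checked numerically on a small example. The sketch below is illustrative only: it assumes a hypothetical, time-independent velocity field \(\mathbf{g}(\vec{h}) = \tanh(W\vec{h})\) with made-up weights, integrates the state and the log-density change jointly with a hand-written RK4 scheme, and compares the result with the log-determinant of a finite-difference Jacobian of the flow map.

```python
import numpy as np

W = np.array([[0.5, -1.0],
              [1.0,  0.3]])          # hypothetical weights

def g(h):
    """Assumed velocity field g(h) = tanh(W h)."""
    return np.tanh(W @ h)

def trace_jac(h):
    """Tr(dg/dh): d g_i / d h_i = (1 - tanh^2((W h)_i)) * W_ii."""
    return np.sum((1.0 - np.tanh(W @ h) ** 2) * np.diag(W))

def aug_rhs(state):
    """Right-hand side for the augmented state [h, delta_logp]."""
    h = state[:-1]
    return np.append(g(h), -trace_jac(h))

def flow(b, T=1.0, n_steps=200):
    """RK4 integration of the augmented ODE from t = 0 to t = T."""
    state = np.append(b, 0.0)
    dt = T / n_steps
    for _ in range(n_steps):
        k1 = aug_rhs(state)
        k2 = aug_rhs(state + 0.5 * dt * k1)
        k3 = aug_rhs(state + 0.5 * dt * k2)
        k4 = aug_rhs(state + dt * k3)
        state = state + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
    return state[:-1], state[-1]

def flow_jacobian_fd(b, eps=1e-5):
    """Finite-difference Jacobian of the map b -> y = h(T)."""
    K = b.size
    J = np.zeros((K, K))
    for j in range(K):
        e = np.zeros(K)
        e[j] = eps
        y_plus, _ = flow(b + e)
        y_minus, _ = flow(b - e)
        J[:, j] = (y_plus - y_minus) / (2 * eps)
    return J

b = np.array([0.4, -0.7])
y, delta_logp = flow(b)                       # delta_logp = log p(y) - log p(b)
sign, logabsdet = np.linalg.slogdet(flow_jacobian_fd(b))
gap = abs(delta_logp + logabsdet)             # should be close to zero
print("integrated -Tr:", delta_logp, " -log|det J|:", -logabsdet)
```

The two quantities agree up to discretization error, which is the continuous counterpart of the discrete change-of-variables formula.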

\subsubsection{Incorporating Flow Matching}

An alternative training paradigm is \emph{flow matching}, where instead of directly optimizing likelihoods, one learns the velocity field by matching it to a \emph{target} velocity field that induces a prescribed probability evolution. This approach can be understood in three related settings:

\paragraph{1. Probability Path}  
Define a \emph{probability path} \(\{p_t(\vec{x})\}_{t \in [0, T]}\) that continuously interpolates between a base distribution \(p(\vec{b}) = p_0(\vec{x})\) and a target distribution \(p(\vec{y}) = p_T(\vec{x})\). The evolution of the density along this path satisfies the continuity equation:
\begin{align}
\frac{\partial p_t(\vec{x})}{\partial t} + \nabla \cdot \left( p_t(\vec{x})\, v_t(\vec{x}) \right) = 0,
\end{align}
where \(v_t(\vec{x})\) is the velocity field corresponding to the probability flow. Flow matching then trains the neural network to match this target velocity field by minimizing the objective
\begin{align}
\min_\theta \; \mathbb{E}_{t\sim U(0,T),\; \vec{x}\sim p_t(\vec{x})} \Big[\big\| \mathbf{g}(\vec{x}, t; \theta) - v_t(\vec{x}) \big\|^2 \Big].
\end{align}

\paragraph{2. Conditional Probability Path}  
In many applications, we wish to model conditional distributions. For a conditioning variable \(\vec{c}\), define a \emph{conditional probability path} \(\{p_t(\vec{y}\,|\,\vec{c})\}_{t\in[0,T]}\) that interpolates between a conditional base distribution \(p_0(\vec{y}\,|\,\vec{c})\) and the target conditional distribution \(p_T(\vec{y}\,|\,\vec{c})\). The associated velocity field is now \(v_t(\vec{y}\,|\,\vec{c})\), and the flow matching objective becomes
\begin{align}
\min_\theta \; \mathbb{E}_{\vec{c}\sim p(\vec{c}),\; t\sim U(0,T),\; \vec{y}\sim p_t(\vec{y}\,|\,\vec{c})} \Big[\big\| \mathbf{g}(\vec{y}, t; \theta, \vec{c}) - v_t(\vec{y}\,|\,\vec{c}) \big\|^2 \Big].
\end{align}

\paragraph{3. Optimal Transport Conditional Probability Path}  
A particularly appealing choice for the conditional probability path is one derived from optimal transport. For a fixed \(\vec{c}\), let \(T_{\vec{c}}\) denote the optimal transport map that pushes forward the base conditional density to the target conditional density, i.e.,
\[
T_{\vec{c}} \sharp p_0(\vec{y}\,|\,\vec{c}) = p_T(\vec{y}\,|\,\vec{c}).
\]
McCann’s displacement interpolation then provides the optimal transport conditional probability path:
\begin{align}
p_t(\vec{y}\,|\,\vec{c}) = \Big((1-t/T)\,\mathrm{id} + (t/T)\,T_{\vec{c}}\Big) \sharp p_0(\vec{y}\,|\,\vec{c}).
\end{align}
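The corresponding velocity is constant along each trajectory. Writing \(\vec{y}_t = (1-t/T)\,\vec{y}_0 + (t/T)\,T_{\vec{c}}(\vec{y}_0)\) for the position at time \(t\) of a particle that starts at \(\vec{y}_0 \sim p_0(\vec{y}\,|\,\vec{c})\), the velocity field evaluated at that position is
\begin{align}
v_t(\vec{y}_t\,|\,\vec{c}) = \frac{T_{\vec{c}}(\vec{y}_0) - \vec{y}_0}{T}.
\end{align}
Note that the right-hand side is evaluated at the starting point \(\vec{y}_0\), not at the current position \(\vec{y}_t\); the two coincide only at \(t = 0\). Thus, the optimal transport flow matching objective becomes
\begin{align}
\min_\theta \; \mathbb{E}_{\vec{c}\sim p(\vec{c}),\; t\sim U(0,T),\; \vec{y}_0\sim p_0(\vec{y}\,|\,\vec{c})} \Bigg[\Big\| \mathbf{g}(\vec{y}_t, t; \theta, \vec{c}) - \frac{T_{\vec{c}}(\vec{y}_0) - \vec{y}_0}{T} \Big\|^2 \Bigg].
\end{align}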
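
A Monte Carlo estimate of this objective is straightforward when the transport map is known. The sketch below is a toy illustration, not the method of any particular paper: it assumes a one-dimensional standard normal base, a hypothetical Gaussian target \(\mathcal{N}(\mu, \sigma^2)\), and no conditioning variable, in which case the monotone map \(y \mapsto \mu + \sigma y\) is the optimal transport map. Samples are drawn at \(t = 0\), moved along straight lines, and the regression target is the constant velocity of each trajectory.

```python
import numpy as np

rng = np.random.default_rng(0)
T_end = 1.0
mu, sigma = 2.0, 0.5          # hypothetical target N(mu, sigma^2); base is N(0, 1)

def ot_map(y0):
    """Monotone (hence optimal, in 1D) transport map for this Gaussian pair."""
    return mu + sigma * y0

def sample_path(n):
    """Draw (t, y_t, target velocity) along displacement-interpolation paths."""
    t = rng.uniform(0.0, T_end, size=n)
    y0 = rng.standard_normal(n)
    tau = t / T_end
    yt = (1.0 - tau) * y0 + tau * ot_map(y0)
    target = (ot_map(y0) - y0) / T_end     # constant along each trajectory
    return t, yt, target

def fm_loss(field, n=10_000):
    """Monte Carlo estimate of E || field(y_t, t) - target ||^2."""
    t, yt, target = sample_path(n)
    return np.mean((field(yt, t) - target) ** 2)

def exact_field(y, t):
    """Closed-form velocity at position y: invert y_t -> y_0, then evaluate."""
    tau = t / T_end
    a = 1.0 - tau + tau * sigma            # y_t = a * y_0 + tau * mu
    y0 = (y - tau * mu) / a
    return (ot_map(y0) - y0) / T_end

loss_exact = fm_loss(exact_field)                       # numerically zero
loss_zero = fm_loss(lambda y, t: np.zeros_like(y))      # baseline: no motion
print("loss of exact field:", loss_exact)
print("loss of zero field: ", loss_zero)
```

In practice `exact_field` is unknown and is replaced by the network \(\mathbf{g}\), whose parameters are trained on the same sampled triples.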

\paragraph{Discussion}  
Incorporating flow matching into the training of continuous normalizing flows provides an alternative to maximum likelihood estimation. By guiding the learned dynamics to match a prescribed probability path (whether unconditional or conditional), flow matching can yield more direct control over the transport of probability mass. In particular, when using an optimal transport conditional probability path, the flow is encouraged to follow the most efficient trajectory (in terms of transport cost) between the base and target densities, which can be advantageous in high-dimensional settings or when modeling complex conditional distributions.

---

This expanded framework not only maintains the benefits of continuous formulations but also introduces a flexible training objective via flow matching that can alleviate computational bottlenecks and potentially lead to more interpretable transport dynamics.

Below is an explanation—both in mathematical notation and in descriptive text—that you can include in your publication. It explains how our ConvexPolytopeManifold projects points onto the polytope as well as tangent vectors onto the tangent cone, and how the exponential and logarithmic maps are defined.

---

### Mathematical Description

**1. Projection onto the Polytope**

Let  
\[
P = \{ x \in \mathbb{R}^n : A x \leq b \}
\]
be a convex polytope defined by \( m \) linear inequalities. The Euclidean projection of any point \( x \in \mathbb{R}^n \) onto \( P \) is given by:
\[
\operatorname{proj}_P(x) = \arg\min_{z \in P} \| z - x \|^2.
\]
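This problem is convex and can be reformulated in its dual form. Scaling the objective by \( \tfrac{1}{2} \) (which leaves the minimizer unchanged) and introducing multipliers \( \lambda \in \mathbb{R}^m \) for the constraints \( A z \le b \), the dual quadratic program is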
\[
\min_{\lambda \ge 0} \; \frac{1}{2}\lambda^T Q \lambda - \lambda^T (Ax - b),
\]
where \( Q = A A^T \). Once the optimal dual variable \( \lambda^\star \) is obtained, the projected point is recovered via the KKT conditions as:
\[
\operatorname{proj}_P(x) = x - A^T \lambda^\star.
\]
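
A minimal sketch of how this dual problem might be solved with projected gradient descent is given below. The step size \(1/\lambda_{\max}(Q)\), the iteration count, and the function name are assumptions chosen for illustration; they are not taken from the actual ConvexPolytopeManifold implementation.

```python
import numpy as np

def project_onto_polytope(x, A, b, n_iter=500):
    """Euclidean projection onto {z : A z <= b} via the dual QP.

    Minimizes 0.5 * lam^T Q lam - lam^T (A x - b) over lam >= 0 with
    projected gradient descent, then recovers z = x - A^T lam.
    """
    Q = A @ A.T
    c = A @ x - b
    step = 1.0 / np.linalg.eigvalsh(Q).max()    # 1 / Lipschitz constant of the gradient
    lam = np.zeros(A.shape[0])
    for _ in range(n_iter):
        grad = Q @ lam - c
        lam = np.maximum(0.0, lam - step * grad)  # gradient step, then clip to lam >= 0
    return x - A.T @ lam, lam

# Unit square: x <= 1, y <= 1, -x <= 0, -y <= 0
A = np.array([[ 1.0,  0.0],
              [ 0.0,  1.0],
              [-1.0,  0.0],
              [ 0.0, -1.0]])
b = np.array([1.0, 1.0, 0.0, 0.0])

x = np.array([1.5, 0.5])                 # outside the square
x_proj, lam = project_onto_polytope(x, A, b)
print("projection:", x_proj)             # expected to be close to [1.0, 0.5]
print("dual variable:", lam)
```

Only the multiplier of the violated constraint \(x \le 1\) ends up positive, in line with complementary slackness.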

**2. Projection onto the Tangent Cone**

For a point \( x \in P \), the tangent cone at \( x \) is defined as
\[
T_x P = \{ v \in \mathbb{R}^n : A_i v \leq 0 \quad \text{for all active constraints } i \text{ (i.e. } A_i x = b_i \text{)} \}.
\]
(In practice, constraints for which \( A_i x \approx b_i \) within a tolerance are considered active.) Given a vector \( u \in \mathbb{R}^n \) (which might not lie in \( T_xP \)), its projection onto the tangent cone is defined by
\[
\operatorname{proj}_{T_x P}(u) = \arg\min_{v \in T_x P} \| u - v \|^2.
\]
Analogous to the point projection, one can derive a dual formulation for this projection. Let \( A_{\mathcal{A}} \) denote the submatrix of \( A \) formed by the rows that are active at \( x \), and let \( Q_{\mathcal{A}} = A_{\mathcal{A}} A_{\mathcal{A}}^T \). One solves the dual problem
\[
\min_{\lambda \ge 0} \; \frac{1}{2}\lambda^T Q_{\mathcal{A}} \lambda - \lambda^T A_{\mathcal{A}} u,
\]
and then recovers the projected tangent vector as
\[
\operatorname{proj}_{T_x P}(u) = u - A_{\mathcal{A}}^T \lambda^\star.
\]
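
The tangent-cone projection can be sketched in the same way, restricted to the active rows. The tolerance, the solver settings, and the function name below are illustrative assumptions rather than the settings of the actual implementation.

```python
import numpy as np

def project_onto_tangent_cone(u, x, A, b, tol=1e-8, n_iter=500):
    """Project u onto {v : A_i v <= 0 for all constraints active at x}."""
    active = np.abs(A @ x - b) <= tol        # rows with A_i x = b_i up to tol
    A_act = A[active]
    if A_act.shape[0] == 0:                  # interior point: cone is all of R^n
        return u.copy()
    Q = A_act @ A_act.T
    c = A_act @ u
    step = 1.0 / np.linalg.eigvalsh(Q).max()
    lam = np.zeros(A_act.shape[0])
    for _ in range(n_iter):
        lam = np.maximum(0.0, lam - step * (Q @ lam - c))
    return u - A_act.T @ lam

# Unit square: x <= 1, y <= 1, -x <= 0, -y <= 0
A = np.array([[ 1.0,  0.0],
              [ 0.0,  1.0],
              [-1.0,  0.0],
              [ 0.0, -1.0]])
b = np.array([1.0, 1.0, 0.0, 0.0])

# On the right edge, the outward x-component is removed.
v_edge = project_onto_tangent_cone(np.array([1.0, 1.0]), np.array([1.0, 0.5]), A, b)
# At the top-right corner, only the component violating x <= 1 is removed.
v_corner = project_onto_tangent_cone(np.array([1.0, -2.0]), np.array([1.0, 1.0]), A, b)
# In the interior, the vector is returned unchanged.
u_int = np.array([0.3, -0.4])
v_interior = project_onto_tangent_cone(u_int, np.array([0.5, 0.5]), A, b)
print(v_edge, v_corner, v_interior)
```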

**3. Exponential and Logarithmic Maps**

- **Exponential Map:**  
  To “exponentiate” a tangent vector \( u \) at a base point \( x \in P \) and obtain a point on the manifold, we define
  \[
  \exp_x(u) = \operatorname{proj}_P(x + u).
  \]
  That is, one first takes a Euclidean step \( x + u \) in the ambient space and then projects back onto the polytope \( P \).

- **Logarithmic Map:**  
  The logarithmic map is chosen to be the simple Euclidean difference:
  \[
  \log_x(y) = y - x.
  \]
  This choice is natural when the ambient metric is Euclidean and the projection operation is used to “correct” points that lie outside \( P \).
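
The interplay of the two maps can be illustrated on the unit box \([0,1]^2\), a special case assumed here because its Euclidean projection reduces to coordinate-wise clipping. The function names are hypothetical. The sketch shows that \(\exp_x(\log_x(y)) = y\) for \(y \in P\), while \(\log_x(\exp_x(u))\) differs from \(u\) whenever the step \(x + u\) leaves the polytope.

```python
import numpy as np

lower, upper = np.zeros(2), np.ones(2)     # the box [0, 1]^2 as a simple polytope

def proj_box(z):
    """Euclidean projection onto the box (closed form for this special case)."""
    return np.clip(z, lower, upper)

def exp_map(x, u):
    """exp_x(u) = proj_P(x + u)."""
    return proj_box(x + u)

def log_map(x, y):
    """log_x(y) = y - x."""
    return y - x

x = np.array([0.8, 0.5])
y = np.array([0.2, 0.9])

# For y inside P, exp undoes log.
roundtrip = exp_map(x, log_map(x, y))

# A step that leaves P is clipped, so log does not recover it.
u = np.array([0.5, 0.0])
landed = exp_map(x, u)                     # stops at the face x_1 = 1
recovered = log_map(x, landed)             # shorter than u
print(roundtrip, landed, recovered)
```

The two maps are therefore inverse to each other only for tangent vectors whose Euclidean step stays inside \(P\).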

---

### Descriptive Explanation

**Projection of Points onto the Polytope**

Given a convex polytope defined by a set of linear inequalities \( Ax \leq b \), the natural way to project an arbitrary point \( x \) onto the polytope is by finding the point within \( P \) that is closest to \( x \) in the Euclidean sense. This is formulated as a convex optimization problem:
\[
\min_{z \in P} \| z - x \|^2.
\]
By duality, this problem can be expressed in terms of a dual variable \( \lambda \ge 0 \), leading to the dual quadratic program
\[
\min_{\lambda \ge 0} \; \frac{1}{2}\lambda^T (A A^T) \lambda - \lambda^T (Ax - b).
\]
Once the optimal \( \lambda^\star \) is computed (approximately, via an iterative gradient descent scheme in our implementation), the projected point is recovered through the relationship
\[
\operatorname{proj}_P(x) = x - A^T \lambda^\star.
\]
Applied to the shifted point \( x + u \), this projection serves as the “exponential map” of the manifold.

**Projection of Tangent Vectors**

When working on a manifold, it is often necessary to project arbitrary vectors in the ambient space onto the tangent space (or, in the case of a manifold with boundary, onto the tangent cone). For a point \( x \) on the polytope, the tangent cone \( T_x P \) consists of all directions that do not immediately violate the constraints. In our method, we first determine which constraints are active at \( x \) (those rows \( i \) for which \( A_i x \) equals \( b_i \) up to a tolerance). The projection of a vector \( u \) onto the tangent cone is then found by solving a similar dual optimization problem restricted to the active rows, collected in the submatrix \( A_{\mathcal{A}} \). The dual variable \( \lambda \) for the tangent projection is computed so that
\[
\operatorname{proj}_{T_x P}(u) = u - A_{\mathcal{A}}^T \lambda^\star.
\]
This operation ensures that the resulting vector lies within the feasible direction set of the polytope.

**Exponential and Logarithmic Maps**

In our framework, the exponential map is defined by “moving” from a point \( x \) along a tangent vector \( u \) and then projecting back onto the polytope:
\[
\exp_x(u) = \operatorname{proj}_P(x + u).
\]
This reflects the idea that while the ambient space is Euclidean, the manifold’s structure is enforced by the projection. Conversely, the logarithmic map is defined in the simplest possible way as
\[
\log_x(y) = y - x,
\]
which is consistent with the Euclidean metric. (The projection is not needed for the logarithmic map because it is assumed that \( y \) lies on the manifold.)

---

This formulation combines standard Euclidean operations with projections to enforce the constraint \( Ax \leq b \). Provided the dual problems are solved to sufficient accuracy, both the point and tangent projections respect the geometry of the convex polytope. The Euclidean projection onto \( P \) is 1-Lipschitz and piecewise affine, so the resulting maps are differentiable almost everywhere (with kinks where the active set changes) and consistent with the underlying Euclidean metric.

In [1]:
import arviz as az
import numpy as np

# Generate some fake data:
n_chains = 2
n_draws = 1000
n_params = 3

# Posterior samples for a parameter "theta" of shape (chains, draws, params)
theta = np.random.randn(n_chains, n_draws, n_params)
# Fake log-probabilities for the sample stats (shape: chains x draws)
log_prob = np.random.randn(n_chains, n_draws)

# Define dims: here, the non-sample dimension for theta is named "theta_id"
dims = {"theta": ["theta_id"]}

# Define coords: the coordinate "theta_id" has three values (one for each parameter)
coords = {"theta_id": ["alpha", "beta", "gamma"]}

# Define sample statistics:
sample_stats = {"lp": log_prob}

# Define some custom attributes (these can be any additional metadata)
attrs = {"model": "Example Mixture Model", "n_params": n_params}

# Create the InferenceData object using arviz.from_dict
idata = az.from_dict(
    posterior={"theta": theta},  # theta is a (chains x draws x params) array
    dims=dims,
    coords=coords,
    sample_stats=sample_stats,
    attrs=attrs
)

# Now you can access the posterior variable "theta"
print("Shape of theta:", idata.posterior.theta.values.shape)
# Expected output: Shape of theta: (2, 1000, 3)

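# You can also inspect the coordinates. InferenceData has no top-level
# `coords` attribute; coordinates live on each group (an xarray.Dataset).
print("Coordinates for 'theta_id':", idata.posterior.coords["theta_id"].values)


Shape of theta: (2, 1000, 3)
Coordinates for 'theta_id': ['alpha' 'beta' 'gamma']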

Below is an expanded section on flow matching that you could incorporate into your manuscript:

---

\subsection{Flow Matching in Continuous Normalizing Flows}

A central challenge in training continuous normalizing flows (CNFs) is the efficient evaluation of the change-of-variables formula. In the continuous setting, the evolution of the log-density is governed by
\[
\frac{d\log p(\vec{h}(t))}{dt} = -\operatorname{Tr}\!\left(\frac{\partial \mathbf{g}(\vec{h}(t),t;\theta)}{\partial \vec{h}(t)}\right),
\]
so that, after integrating from \(t=0\) to \(t=T\), the density at the terminal state \(\vec{y} = \vec{h}(T)\) is given by
\[
p(\vec{y}) = p(\vec{b}) \exp\!\left(-\int_0^T \operatorname{Tr}\!\left(\frac{\partial \mathbf{g}(\vec{h}(t),t;\theta)}{\partial \vec{h}(t)}\right) dt\right).
\]
In maximum likelihood training, the numerical evaluation of the Jacobian trace (or the equivalent determinant in the discrete case) can become a computational bottleneck, particularly in high-dimensional settings.

A promising alternative is provided by \emph{flow matching}, a method that directly aligns the model’s instantaneous velocity field with a target vector field derived from the data distribution. Instead of integrating the Jacobian trace along simulated trajectories, one defines a loss that penalizes discrepancies between the model’s velocity and a prescribed target field. Concretely, let
\[
\mathbf{v}^*(\vec{h}(t),t)
\]
denote a target velocity field that transports samples from a simple base distribution to the target distribution (see, e.g., \cite{lipman_flow_2023, lipman_flow_2024, chen_flow_2024}). The flow matching loss is then defined as
\[
\mathcal{L}_{\text{FM}} = \mathbb{E}_{t \sim \mathcal{U}(0,T),\ \vec{h}(0) \sim p(\vec{h}(0))} \left[ \left\Vert \mathbf{g}(\vec{h}(t), t;\theta) - \mathbf{v}^*(\vec{h}(t),t) \right\Vert_2^2 \right].
\]
By minimizing \(\mathcal{L}_{\text{FM}}\), the training procedure encourages the model’s dynamics to closely follow the probability path generated by \(\mathbf{v}^*\) (which is an optimal transport path only if \(\mathbf{v}^*\) is constructed from one). This approach offers several advantages:

\begin{enumerate}
    \item \textbf{Avoidance of Jacobian Trace Computations:} Unlike maximum likelihood training, which requires computing (or approximating) the trace of the Jacobian of \(\mathbf{g}\) along each trajectory, flow matching sidesteps this calculation entirely. This can lead to a significant reduction in computational cost, particularly when the dimensionality is high.
    
    \item \textbf{Stability of Optimization:} The squared-error loss in \(\mathcal{L}_{\text{FM}}\) is typically smoother and more stable than the direct likelihood objective. This can facilitate convergence and allow for more robust training of CNFs.
    
    \item \textbf{Flexibility in Target Design:} The target vector field \(\mathbf{v}^*\) can be chosen to incorporate domain knowledge or to impose specific transport properties, thereby tailoring the flow to better match the structure of the target distribution.
\end{enumerate}

Empirical studies have shown that flow matching can lead to competitive or even superior performance compared to traditional maximum likelihood training \cite{lipman_flow_2023, lipman_flow_2024, chen_flow_2024}. The method is especially attractive when the target distribution exhibits complex geometry or when the support of the distribution is constrained (for instance, when defined on convex polytopes). In such cases, the flexibility of the target vector field and the computational benefits of avoiding explicit determinant calculations make flow matching a compelling alternative.

In summary, while traditional CNF training relies on the explicit evaluation of the Jacobian trace, flow matching offers a conceptually and computationally attractive alternative. By aligning the instantaneous dynamics of the model with a carefully chosen target vector field, one can achieve efficient and stable training, even in high-dimensional or geometrically complex settings. This approach not only broadens the applicability of CNFs but also paves the way for new methods in generative modeling where the geometry of the data plays a central role.

---

This expanded discussion should provide a clear and thorough account of flow matching, highlighting its motivation, formulation, and advantages over traditional training methods.

In [1]:
import cvxpy as cp
import numpy as np

# Example polytope: Ax <= b
# For demonstration, let the polytope be defined in 2D.
# Example: A square defined by:
#   x <= 1, y <= 1, -x <= 0, -y <= 0
A = np.array([
    [ 1,  0],
    [ 0,  1],
    [-1,  0],
    [ 0, -1]
])
b = np.array([1, 1, 0, 0])

# Define the point x0 (can be inside or outside the polytope)
x0 = np.array([1.5, 0.5])  # For example, outside the polytope

# Define the variable (same dimension as x0)
x = cp.Variable(x0.shape[0])

# Objective: minimize the squared distance to x0
objective = cp.Minimize(cp.sum_squares(x - x0))

# Constraint: x must lie in the polytope Ax <= b
constraints = [A @ x <= b]

# Formulate and solve the problem
prob = cp.Problem(objective, constraints)
result = prob.solve()  # You can specify a solver if needed, e.g., solver=cp.OSQP

# Retrieve the optimal point and compute the distance
x_star = x.value
distance = np.linalg.norm(x_star - x0)

print("Optimal point (projection):", x_star)
print("Distance from x0 to the polytope:", distance)


Optimal point (projection): [1.  0.5]
Distance from x0 to the polytope: 0.5


Below is a revised version of your introduction that is written in a more accessible, reader-friendly style while preserving all the essential mathematical details:

---

Normalizing flows provide a flexible framework for modeling complex, high-dimensional distributions. The central idea is to start with a simple base distribution (often a multivariate normal) and transform it into a more complicated target distribution using a smooth, invertible mapping (a diffeomorphism). In our formulation, a normalizing flow maps samples from the base distribution \(p(\vec{B}; \theta^b)\) to the target distribution \(p(\vec{Y})\) via a transformation \(\mathscr{f}\) with parameters \(\theta^\mathscr{f}\). Formally, if \(\vec{b}\) is a sample from the base distribution, then

\[
\vec{y} = \mathscr{f}(\vec{b}; \theta^\mathscr{f}), \quad \vec{b} \sim p(\vec{B}; \theta^b), \quad \vec{b} \in \mathcal{B}, \quad \vec{y} \in \mathcal{Y}.
\]

The transformation \(\mathscr{f}\) is a diffeomorphism, meaning it is smooth and invertible. More precisely, we require that

\[
\begin{aligned}
\mathscr{f} &: \mathcal{B} \to \mathcal{Y}, \\
\mathscr{f}^{-1} &: \mathcal{Y} \to \mathcal{B}, \\
\mathscr{f} &\in C^\infty(\mathcal{B}, \mathcal{Y}), \quad \mathscr{f}^{-1} \in C^\infty(\mathcal{Y}, \mathcal{B}).
\end{aligned}
\]

Once we have defined the mapping, we can compute the probability density of the transformed variable using the change of variables formula. If \(\pmb{J}^\mathscr{f}(\vec{b})\) denotes the Jacobian matrix of \(\mathscr{f}\) at \(\vec{b}\), then

\[
p(\vec{y}) = p(\vec{b}) \cdot \left|\det\!\left(\pmb{J}^\mathscr{f}(\vec{b})\right)\right|^{-1},
\]
or equivalently,
\[
p(\vec{y}) = p\!\left(\mathscr{f}^{-1}(\vec{y})\right) \cdot \left|\det\!\left(\pmb{J}^{\mathscr{f}^{-1}}(\vec{y})\right)\right|.
\]
Here, the Jacobian matrix is given by

\[
\pmb{J}^\mathscr{f}(\vec{b}) =
\begin{bmatrix}
\frac{\partial y_1}{\partial b_1} & \dots & \frac{\partial y_1}{\partial b_K} \\[1mm]
\vdots & \ddots & \vdots \\[1mm]
\frac{\partial y_K}{\partial b_1} & \dots & \frac{\partial y_K}{\partial b_K}
\end{bmatrix}.
\]
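
For an affine map the change of variables formula can be verified against a known density. The sketch below assumes a hypothetical two-dimensional flow \(\mathscr{f}(\vec{b}) = L\vec{b} + \vec{m}\) with a standard normal base, so that \(\vec{y} \sim \mathcal{N}(\vec{m}, LL^T)\); the matrix, the shift, and the evaluation point are made up for illustration.

```python
import numpy as np

def log_std_normal(b):
    """Log-density of the standard normal base distribution."""
    K = b.size
    return -0.5 * (K * np.log(2.0 * np.pi) + b @ b)

L = np.array([[2.0, 0.0],
              [0.5, 1.5]])               # hypothetical invertible matrix (Jacobian of f)
m = np.array([1.0, -1.0])                # hypothetical shift

def f(b):
    return L @ b + m

def f_inv(y):
    return np.linalg.solve(L, y - m)

b = np.array([0.3, -1.2])
y = f(b)

# Change of variables: log p(y) = log p(f^{-1}(y)) - log|det J^f|
_, logabsdet = np.linalg.slogdet(L)
log_p_flow = log_std_normal(f_inv(y)) - logabsdet

# Direct evaluation of the N(m, L L^T) log-density at y
cov = L @ L.T
diff = y - m
_, logdet_cov = np.linalg.slogdet(cov)
log_p_direct = -0.5 * (2 * np.log(2.0 * np.pi) + logdet_cov
                       + diff @ np.linalg.solve(cov, diff))

print(log_p_flow, log_p_direct)          # the two values should coincide
```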

For learning the target distribution, it is crucial to evaluate the density efficiently. This requires fast computation of both the inverse transformation \(\mathscr{f}^{-1}\) and the determinant of its Jacobian. Likewise, efficient sampling from the flow depends on quickly computing \(\mathscr{f}\) itself. In many practical implementations, the design of the transformation ensures that the Jacobian is triangular, which significantly simplifies the determinant calculation (see, e.g., \textcite{papamakarios_normalizing_2019} for a detailed discussion).

Typically, the base distribution is chosen as a multivariate normal (MVN), whose support is all of \(\mathbb{R}^K\). This choice works well for many applications, but it can create challenges when the target distribution naturally lives on a space with non-Euclidean geometry—such as on tori or spheres \cite{rezende_normalizing_2020, gemici_normalizing_2016}—or on compact subsets of \(\mathbb{R}^K\), like a ball or a convex polytope \(\mathcal{F} \subset \mathbb{R}^K\). In such cases, a diffeomorphism mapping the base distribution onto the entire target space may not exist.

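A further, separate limitation is the expressiveness of a single tractable transformation. Because a composition of diffeomorphisms is again a diffeomorphism, composing layers cannot remove the topological obstruction described above, but it does increase the flexibility of the mapping. Instead of a single transformation \(\mathscr{f}\), we define a composite mapping as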

\[
\mathscr{f} = \mathscr{f}_L \circ \mathscr{f}_{L-1} \circ \cdots \circ \mathscr{f}_1,
\]
where each \(\mathscr{f}_\ell\) is a diffeomorphism between intermediate spaces (with \(\mathcal{H}_0 = \mathcal{B}\) and \(\mathcal{H}_L = \mathcal{Y}\)). The change of variables formula for the composite flow becomes

\[
p(\vec{y}) = p(\vec{b}) \prod_{\ell=1}^L \left|\det\!\left(\pmb{J}^{\mathscr{f}_\ell}(\vec{h}_{\ell-1})\right)\right|^{-1},
\]
where \(\vec{h}_0 = \vec{b}\) and \(\vec{h}_\ell = \mathscr{f}_\ell(\vec{h}_{\ell-1})\). This layered construction enables us to model highly complex transformations by combining simpler, tractable ones.
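
In code, the layered construction amounts to accumulating per-layer log-determinants while passing the sample through the layers. The sketch below uses two hypothetical layers (an affine map and an elementwise residual \(\tanh\) map, both chosen only for illustration) and checks the accumulated sum against the log-determinant of a finite-difference Jacobian of the composite map.

```python
import numpy as np

L = np.array([[ 1.5, 0.0],
              [-0.4, 0.8]])              # hypothetical affine weights
m = np.array([0.2, -0.1])

def affine(h):
    """h -> L h + m, with log|det J| = log|det L|."""
    return L @ h + m, np.linalg.slogdet(L)[1]

def residual_tanh(h):
    """Elementwise h -> h + 0.5 tanh(h); the Jacobian is diagonal and positive."""
    y = h + 0.5 * np.tanh(h)
    log_det = np.sum(np.log1p(0.5 * (1.0 - np.tanh(h) ** 2)))
    return y, log_det

layers = [affine, residual_tanh]

def flow_with_logdet(b):
    """Apply all layers and sum log|det J^{f_l}(h_{l-1})|."""
    h, total = b, 0.0
    for layer in layers:
        h, log_det = layer(h)
        total += log_det
    return h, total

def fd_jacobian(b, eps=1e-6):
    """Finite-difference Jacobian of the composite map."""
    K = b.size
    J = np.zeros((K, K))
    for j in range(K):
        e = np.zeros(K)
        e[j] = eps
        J[:, j] = (flow_with_logdet(b + e)[0] - flow_with_logdet(b - e)[0]) / (2 * eps)
    return J

b = np.array([0.3, -0.6])
y, total_log_det = flow_with_logdet(b)   # log p(y) = log p(b) - total_log_det
fd_log_det = np.linalg.slogdet(fd_jacobian(b))[1]
print(total_log_det, fd_log_det)
```

Working with sums of log-determinants rather than products of determinants is the usual choice in practice because it is numerically more stable.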

Pushing this idea further, one may consider the limit as the number of composed flows tends to infinity. In this continuous limit, the discrete sequence of transformations is replaced by a continuous evolution governed by an ordinary differential equation (ODE). This approach leads to **continuous normalizing flows (CNFs)** or, equivalently, **neural ODEs**. In the continuous formulation, we describe the transformation by a time-dependent state \(\vec{h}(t)\) that evolves according to

\[
\frac{d\vec{h}(t)}{dt} = \mathbf{g}(\vec{h}(t), t; \theta), \quad \vec{h}(0) = \vec{b}, \quad \vec{h}(T) = \vec{y},
\]
where \(\mathbf{g}\) is an instantaneous velocity field typically modeled by a neural network. The corresponding evolution of the density is given by

\[
\frac{d\log p(\vec{h}(t))}{dt} = -\operatorname{Tr}\!\left(\frac{\partial \mathbf{g}}{\partial \vec{h}(t)}\right),
\]
and integration over time yields

\[
p(\vec{y}) = p(\vec{b}) \exp\!\left(-\int_0^T \operatorname{Tr}\!\left(\frac{\partial \mathbf{g}}{\partial \vec{h}(t)}\right) dt\right).
\]
This continuous perspective not only offers a smooth, adaptive way to model transformations but also provides an alternative viewpoint to the discrete stacking of flows.

A recent and promising development in this area is **flow matching**. Traditional training of normalizing flows via maximum likelihood requires the computation of the Jacobian determinant, which can be challenging in high dimensions. Flow matching offers a different approach by directly aligning the model’s instantaneous velocity field \(\mathbf{g}(\vec{h}(t), t; \theta)\) with a target vector field derived from the data. By constructing a loss function that penalizes discrepancies between the two, flow matching avoids the need for explicit determinant calculations, potentially simplifying optimization and enhancing stability (see, e.g., \textcite{flow_matching_reference}).

In summary, normalizing flows transform a simple base distribution into a complex target distribution through smooth, invertible mappings. Enhancements such as composing multiple diffeomorphisms, taking the continuous limit to form neural ODEs, and employing novel training methods like flow matching all contribute to the robustness and flexibility of these models. These advances allow normalizing flows to be applied even when the target distribution has a non-Euclidean geometry or is defined on a compact support, such as in the case of convex polytopes.

--- 

This version maintains the mathematical rigor of your original text while presenting the ideas in a clearer and more reader-friendly manner.