Merged
24 commits
e332d8c
remove the type `ParamSpaceSGD`
Red-Portal Sep 15, 2025
1f35cc9
run formatter
Red-Portal Sep 15, 2025
c8404b6
run formatter
Red-Portal Sep 15, 2025
0cc7538
run formatter
Red-Portal Sep 15, 2025
ede91c6
fix rename file paramspacesgd.jl to interface.jl
Red-Portal Oct 13, 2025
625f429
Merge branch 'remove_paramspacesgd' of github.com:TuringLang/Advanced…
Red-Portal Oct 13, 2025
e3c2761
Merge branch 'main' of github.com:TuringLang/AdvancedVI.jl into remov…
Red-Portal Oct 13, 2025
683a09d
throw invalid state for unknown paramspacesgd type
Red-Portal Oct 13, 2025
570fe11
add docstring for union type of paramspacesgd algorithms
Red-Portal Oct 13, 2025
2d5f373
fix remove custom state types for paramspacesgd algorithms
Red-Portal Oct 13, 2025
e0221eb
fix remove custom state types for paramspacesgd
Red-Portal Oct 13, 2025
e51ab3c
fix file path
Red-Portal Oct 13, 2025
e49c680
fix bug in BijectorsExt
Red-Portal Oct 13, 2025
3c5b56f
fix include `SubSampleObjective` as part of `ParamSpaceSGD`
Red-Portal Oct 13, 2025
30f5160
fix formatting
Red-Portal Oct 13, 2025
008c4ea
fix revert adding SubsampledObjective into ParamSpaceSGD
Red-Portal Oct 13, 2025
8a18902
refactor flatten algorithms
Red-Portal Oct 13, 2025
b002e1e
fix error update paths in main file
Red-Portal Oct 13, 2025
1ba361f
refactor flatten the tests to reflect new structure
Red-Portal Oct 13, 2025
86baa07
fix file include path in tests
Red-Portal Oct 13, 2025
67e9375
fix missing operator in subsampledobj tests
Red-Portal Oct 13, 2025
9b2eabb
fix formatting
Red-Portal Oct 13, 2025
922e5d7
update docs
Red-Portal Oct 20, 2025
7d3ed86
Merge branch 'remove_paramspacesgd' of github.com:TuringLang/Advanced…
Red-Portal Oct 20, 2025
14 changes: 3 additions & 11 deletions docs/make.jl
Original file line number Diff line number Diff line change
@@ -23,17 +23,9 @@ makedocs(;
"Normalizing Flows" => "tutorials/flows.md",
],
"Algorithms" => [
"KLMinRepGradDescent" => "paramspacesgd/klminrepgraddescent.md",
"KLMinRepGradProxDescent" => "paramspacesgd/klminrepgradproxdescent.md",
"KLMinScoreGradDescent" => "paramspacesgd/klminscoregraddescent.md",
"Parameter Space SGD" => [
"General" => "paramspacesgd/general.md",
"Objectives" => [
"Overview" => "paramspacesgd/objectives.md",
"RepGradELBO" => "paramspacesgd/repgradelbo.md",
"ScoreGradELBO" => "paramspacesgd/scoregradelbo.md",
],
],
"KLMinRepGradDescent" => "klminrepgraddescent.md",
"KLMinRepGradProxDescent" => "klminrepgradproxdescent.md",
"KLMinScoreGradDescent" => "klminscoregraddescent.md",
],
"Variational Families" => "families.md",
"Optimization" => "optimization.md",
2 changes: 1 addition & 1 deletion docs/src/families.md
@@ -1,6 +1,6 @@
# [Reparameterizable Variational Families](@id families)

The [RepGradELBO](@ref repgradelbo) objective assumes that the members of the variational family have a differentiable sampling path.
Algorithms such as [`KLMinRepGradDescent`](@ref klminrepgraddescent) assume that the members of the variational family have a differentiable sampling path.
We provide multiple pre-packaged variational families that can be readily used.

## [The `LocationScale` Family](@id locscale)
1 change: 0 additions & 1 deletion docs/src/index.md
@@ -10,7 +10,6 @@ VI algorithms perform scalable and computationally efficient Bayesian inference

# List of Algorithms

- [ParamSpaceSGD](@ref paramspacesgd)
- [KLMinRepGradDescent](@ref klminrepgraddescent) (alias of `ADVI`)
- [KLMinRepGradProxDescent](@ref klminrepgradproxdescent)
- [KLMinScoreGradDescent](@ref klminscoregraddescent) (alias of `BBVI`)
@@ -1,49 +1,75 @@
# [Reparameterization Gradient Estimator](@id repgradelbo)
# [`KLMinRepGradDescent`](@id klminrepgraddescent)

## Overview
## Description

The `RepGradELBO` objective implements the reparameterization gradient estimator[^HC1983][^G1991][^R1992][^P1996] of the ELBO gradient.
The reparameterization gradient, also known as the push-in gradient or the pathwise gradient, was introduced to VI in [^TL2014][^RMW2014][^KW2014].
For the variational family $\mathcal{Q} = \{q_{\lambda} \mid \lambda \in \Lambda\}$, suppose the process of sampling from $q_{\lambda}$ can be described by some differentiable reparameterization function $$T_{\lambda}$$ and a *base distribution* $$\varphi$$ independent of $$\lambda$$ such that
This algorithm aims to minimize the exclusive (or reverse) Kullback-Leibler (KL) divergence via stochastic gradient descent in the space of parameters.
Specifically, it uses the *reparameterization gradient estimator*.
As a result, this algorithm is most applicable when the target log-density is differentiable and the sampling process of the variational family is differentiable.
(See the [methodology section](@ref klminrepgraddescent_method) for more details.)
This algorithm is also commonly referred to as automatic differentiation variational inference, black-box variational inference with the reparameterization gradient, and stochastic gradient variational inference.
`KLMinRepGradDescent` is also an alias of `ADVI`.

```@docs
KLMinRepGradDescent
```

## [Methodology](@id klminrepgraddescent_method)

This algorithm aims to solve the problem

[^HC1983]: Ho, Y. C., & Cao, X. (1983). Perturbation analysis and optimization of queueing networks. Journal of optimization theory and Applications, 40(4), 559-582.
[^G1991]: Glasserman, P. (1991). Gradient estimation via perturbation analysis (Vol. 116). Springer Science & Business Media.
[^R1992]: Rubinstein, R. Y. (1992). Sensitivity analysis of discrete event systems by the “push out” method. Annals of Operations Research, 39(1), 229-250.
[^P1996]: Pflug, G. C. (1996). Optimization of stochastic models: the interface between simulation and optimization (Vol. 373). Springer Science & Business Media.
[^TL2014]: Titsias, M., & Lázaro-Gredilla, M. (2014). Doubly stochastic variational Bayes for non-conjugate inference. In *International Conference on Machine Learning*.
[^RMW2014]: Rezende, D. J., Mohamed, S., & Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. In *International Conference on Machine Learning*.
[^KW2014]: Kingma, D. P., & Welling, M. (2014). Auto-encoding variational bayes. In *International Conference on Learning Representations*.
```math
z \sim q_{\lambda} \qquad\Leftrightarrow\qquad
z \stackrel{d}{=} T_{\lambda}\left(\epsilon\right);\quad \epsilon \sim \varphi \; .
\mathrm{minimize}_{q \in \mathcal{Q}}\quad \mathrm{KL}\left(q, \pi\right)
```

In these cases, denoting the target log-density as $\log \pi$, we can effectively estimate the gradient of the ELBO by directly differentiating the stochastic estimate of the ELBO objective
where $\mathcal{Q}$ is some family of distributions, often called the variational family, by running stochastic gradient descent in the (Euclidean) space of parameters.
That is, for all $$q_{\lambda} \in \mathcal{Q}$$, we assume there is a corresponding vector of parameters $$\lambda \in \Lambda$$, where the space of parameters is Euclidean such that $$\Lambda \subset \mathbb{R}^p$$.

Since we usually only have access to the unnormalized densities of the target distribution $\pi$, we don't have direct access to the KL divergence.
Instead, the ELBO maximization strategy maximizes a surrogate objective, the *evidence lower bound* (ELBO; [^JGJS1999])

```math
\widehat{\mathrm{ELBO}}\left(\lambda\right) = \frac{1}{M}\sum^M_{m=1} \log \pi\left(T_{\lambda}\left(\epsilon_m\right)\right) + \mathbb{H}\left(q_{\lambda}\right),
\mathrm{ELBO}\left(q\right) \triangleq \mathbb{E}_{\theta \sim q} \log \pi\left(\theta\right) + \mathbb{H}\left(q\right),
```

where $$\epsilon_m \sim \varphi$$ are Monte Carlo samples.
The resulting gradient estimate is called the reparameterization gradient estimator.
which is equivalent to the negative KL divergence up to an additive constant (the log evidence).
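
To see why, here is a short derivation under the assumption that only an unnormalized density is available. The symbols $\tilde{\pi}$ (unnormalized target) and $Z$ (normalizing constant, i.e. the evidence) are introduced here for illustration only, with $\pi = \tilde{\pi} / Z$ and the ELBO evaluated using $\tilde{\pi}$:

```math
\begin{aligned}
\mathrm{KL}\left(q, \pi\right)
&= \mathbb{E}_{\theta \sim q} \left[ \log q\left(\theta\right) - \log \pi\left(\theta\right) \right] \\
&= -\mathbb{H}\left(q\right) - \mathbb{E}_{\theta \sim q} \log \tilde{\pi}\left(\theta\right) + \log Z \\
&= \log Z - \mathrm{ELBO}\left(q\right) .
\end{aligned}
```

Since $\log Z$ does not depend on $q$, maximizing the ELBO over $\mathcal{Q}$ is the same problem as minimizing the KL divergence over $\mathcal{Q}$.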

In addition to the reparameterization gradient, `AdvancedVI` provides the following features:
Algorithmically, `KLMinRepGradDescent` iterates the step

1. **Posteriors with constrained supports** are handled through [`Bijectors`](https://github.com/TuringLang/Bijectors.jl), which is known as the automatic differentiation VI (ADVI; [^KTRGB2017]) formulation. (See [this section](@ref bijectors).)
2. **The gradient of the entropy** can be estimated through various strategies depending on the capabilities of the variational family. (See [this section](@ref entropygrad).)
```math
\lambda_{t+1} = \mathrm{operator}\big(
\lambda_{t} + \gamma_t \widehat{\nabla_{\lambda} \mathrm{ELBO}} (q_{\lambda_t})
\big) ,
```

## `RepGradELBO`
where $\widehat{\nabla \mathrm{ELBO}}(q_{\lambda})$ is the reparameterization gradient estimate[^HC1983][^G1991][^R1992][^P1996] of the ELBO gradient and $$\mathrm{operator}$$ is an optional operator (*e.g.* projections, identity mapping).
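
The update above can be sketched on a toy problem. The following is a minimal illustration, not the `AdvancedVI` implementation: the target (a standard normal), the one-dimensional Gaussian family, the helper names `noisy_elbo_grad`, `project`, and `run_sgd`, and all hyperparameters are assumptions chosen for this example.

```julia
using Random

# Hypothetical setup: target π = N(0, 1), variational family q_λ = N(m, σ²), λ = (m, σ).
# For this pair, ELBO(q_λ) = -(m² + σ²)/2 + log(σ) + const, maximized at m = 0, σ = 1.

# Noisy gradient of the ELBO averaged over M draws of z = m + σϵ with ϵ ~ N(0, 1).
function noisy_elbo_grad(rng, m, σ, M)
    gm, gσ = 0.0, 0.0
    for _ in 1:M
        ϵ = randn(rng)
        z = m + σ * ϵ
        gm += -z / M       # d/dm of log π(z), using ∇ log π(z) = -z
        gσ += -z * ϵ / M   # d/dσ of log π(z)
    end
    return gm, gσ + 1 / σ  # add the entropy gradient d/dσ log(σ)
end

# Example operator: project the scale onto [0.1, ∞) so that it stays positive.
project(m, σ) = (m, max(σ, 0.1))

function run_sgd(rng; m=3.0, σ=2.0, γ=0.01, T=2_000, M=10)
    for _ in 1:T
        gm, gσ = noisy_elbo_grad(rng, m, σ, M)
        m, σ = project(m + γ * gm, σ + γ * gσ)  # ascent step, then the operator
    end
    return m, σ
end

m_hat, σ_hat = run_sgd(Random.MersenneTwister(1))
```

With these settings the iterates should settle near the maximizer $(m, \sigma) = (0, 1)$, up to stochastic gradient noise. The projection plays the role of $$\mathrm{operator}$$; replacing it with the identity recovers plain stochastic gradient ascent on the ELBO.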

To use the reparameterization gradient, `AdvancedVI` provides the following variational objective:
The reparameterization gradient, also known as the push-in gradient or the pathwise gradient, was introduced to VI in [^TL2014][^RMW2014][^KW2014].
For the variational family $$\mathcal{Q}$$, suppose the process of sampling from $$q_{\lambda} \in \mathcal{Q}$$ can be described by some differentiable reparameterization function $$T_{\lambda}$$ and a *base distribution* $$\varphi$$ independent of $$\lambda$$ such that

```@docs
RepGradELBO
```math
z \sim q_{\lambda} \qquad\Leftrightarrow\qquad
z \stackrel{d}{=} T_{\lambda}\left(\epsilon\right);\quad \epsilon \sim \varphi \; .
```

## [Handling Constraints with `Bijectors`](@id bijectors)
In these cases, denoting the target log-density as $\log \pi$, we can effectively estimate the gradient of the ELBO by directly differentiating the stochastic estimate of the ELBO objective

As mentioned in the docstring, the `RepGradELBO` objective assumes that the variational approximation $$q_{\lambda}$$ and the target distribution $$\pi$$ have the same support for all $$\lambda \in \Lambda$$.
```math
\widehat{\mathrm{ELBO}}\left(q_{\lambda}\right) = \frac{1}{M}\sum^M_{m=1} \log \pi\left(T_{\lambda}\left(\epsilon_m\right)\right) + \mathbb{H}\left(q_{\lambda}\right),
```

where $$\epsilon_m \sim \varphi$$ are Monte Carlo samples.
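
As a rough sketch of what differentiating this estimate looks like, the example below hand-derives the gradient for a mean-field Gaussian with a closed-form entropy. Everything here is an assumption made for illustration: the standard normal target, the log-scale parameterization $\lambda = (m, s)$, and the helper names `grad_logpi`, `reparam`, and `repgrad_elbo`. `AdvancedVI` itself relies on automatic differentiation rather than hand-written derivatives.

```julia
using Random

# Hypothetical target: standard normal, log π(z) = -‖z‖²/2 up to a constant.
grad_logpi(z) = -z

# Reparameterization function T_λ(ϵ) = m + exp(s) ⊙ ϵ for a mean-field Gaussian,
# with λ = (m, s) and base distribution φ = N(0, I).
reparam(m, s, ϵ) = m .+ exp.(s) .* ϵ

function repgrad_elbo(rng, m, s, M)
    d = length(m)
    grad_m = zeros(d)
    grad_s = zeros(d)
    for _ in 1:M
        ϵ = randn(rng, d)
        g = grad_logpi(reparam(m, s, ϵ))   # ∇_z log π at z = T_λ(ϵ)
        grad_m .+= g ./ M                  # chain rule: ∂z/∂m = I
        grad_s .+= g .* exp.(s) .* ϵ ./ M  # chain rule: ∂z/∂s = diag(exp(s) ⊙ ϵ)
    end
    grad_s .+= 1.0  # closed-form entropy gradient, since H(q_λ) = Σᵢ sᵢ + const
    return grad_m, grad_s
end

rng = Random.MersenneTwister(1)
m, s = [1.0, -2.0], [0.0, 0.5]
grad_m, grad_s = repgrad_elbo(rng, m, s, 10_000)
```

For this particular target the exact ELBO gradient is available in closed form ($-m$ for the location and $1 - e^{2s}$ for the log-scale), so the Monte Carlo estimate can be checked against it.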

[^JGJS1999]: Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., & Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine learning, 37, 183-233.
[^HC1983]: Ho, Y. C., & Cao, X. (1983). Perturbation analysis and optimization of queueing networks. Journal of optimization theory and Applications, 40(4), 559-582.
[^G1991]: Glasserman, P. (1991). Gradient estimation via perturbation analysis (Vol. 116). Springer Science & Business Media.
[^R1992]: Rubinstein, R. Y. (1992). Sensitivity analysis of discrete event systems by the “push out” method. Annals of Operations Research, 39(1), 229-250.
[^P1996]: Pflug, G. C. (1996). Optimization of stochastic models: the interface between simulation and optimization (Vol. 373). Springer Science & Business Media.
[^TL2014]: Titsias, M., & Lázaro-Gredilla, M. (2014). Doubly stochastic variational Bayes for non-conjugate inference. In *International Conference on Machine Learning*.
[^RMW2014]: Rezende, D. J., Mohamed, S., & Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. In *International Conference on Machine Learning*.
[^KW2014]: Kingma, D. P., & Welling, M. (2014). Auto-encoding variational bayes. In *International Conference on Learning Representations*.
## [Handling Constraints with `Bijectors`](@id bijectors)

As mentioned in the docstring, `KLMinRepGradDescent` assumes that the variational approximation $$q_{\lambda}$$ and the target distribution $$\pi$$ have the same support for all $$\lambda \in \Lambda$$.
However, in general, it is most convenient to use variational families that have the whole Euclidean space $$\mathbb{R}^d$$ as their support.
This is the case for the [location-scale distributions](@ref locscale) provided by `AdvancedVI`.
For target distributions whose support is not the full $$\mathbb{R}^d$$, we can apply some transformation $$b$$ to $$q_{\lambda}$$ to match its support such that
@@ -57,9 +83,11 @@ where $$b$$ is often called a *bijector*, since it is often chosen among bijecti
This idea is known as automatic differentiation VI[^KTRGB2017] and has subsequently been improved by TensorFlow Probability[^DLTBV2017].
In Julia, [Bijectors.jl](https://github.com/TuringLang/Bijectors.jl)[^FXTYG2020] provides a comprehensive collection of bijections.

One caveat of ADVI is that, after applying the bijection, a Jacobian adjustment needs to be applied.
That is, the objective is now

[^KTRGB2017]: Kucukelbir, A., Tran, D., Ranganath, R., Gelman, A., & Blei, D. M. (2017). Automatic differentiation variational inference. *Journal of Machine Learning Research*, 18(14), 1-45.
[^DLTBV2017]: Dillon, J. V., Langmore, I., Tran, D., Brevdo, E., Vasudevan, S., Moore, D., ... & Saurous, R. A. (2017). TensorFlow Distributions. arXiv.
[^FXTYG2020]: Fjelde, T. E., Xu, K., Tarek, M., Yalburgi, S., & Ge, H. (2020). Bijectors.jl: Flexible transformations for probability distributions. In *Symposium on Advances in Approximate Bayesian Inference*.
One caveat of ADVI is that, after applying the bijection, a Jacobian adjustment needs to be applied.
That is, the objective is now
```math
\mathrm{ADVI}\left(\lambda\right)
\triangleq
@@ -84,13 +112,10 @@ q_transformed = Bijectors.TransformedDistribution(q, binv)
By passing `q_transformed` to `optimize`, the Jacobian adjustment for the bijector `b` is automatically applied.
(See the [Basic Example](@ref basic) for a fully working example.)

[^KTRGB2017]: Kucukelbir, A., Tran, D., Ranganath, R., Gelman, A., & Blei, D. M. (2017). Automatic differentiation variational inference. *Journal of Machine Learning Research*.
[^DLTBV2017]: Dillon, J. V., Langmore, I., Tran, D., Brevdo, E., Vasudevan, S., Moore, D., ... & Saurous, R. A. (2017). TensorFlow Distributions. arXiv.
[^FXTYG2020]: Fjelde, T. E., Xu, K., Tarek, M., Yalburgi, S., & Ge, H. (2020). Bijectors.jl: Flexible transformations for probability distributions. In *Symposium on Advances in Approximate Bayesian Inference*.
## [Entropy Estimators](@id entropygrad)
## [Entropy Gradient Estimators](@id entropygrad)

For the gradient of the entropy term, we provide three choices with varying requirements.
The user can select the entropy estimator by passing it as a keyword argument when constructing the `RepGradELBO` objective.
The user can select the entropy estimator by passing it as a keyword argument when constructing the algorithm object.

| Estimator | `entropy(q)` | `logpdf(q)` | Type |
|:--------------------------- |:------------:|:-----------:|:-------------------------------- |
@@ -179,7 +204,7 @@ end

In this example, the true posterior is contained within the variational family.
This setting is known as "perfect variational family specification."
In this case, the `RepGradELBO` estimator with `StickingTheLandingEntropy` is the only estimator known to converge exponentially fast ("linear convergence") to the true solution.
In this case, `KLMinRepGradDescent` with `StickingTheLandingEntropy` is the only estimator known to converge exponentially fast ("linear convergence") to the true solution.

Recall that the original ADVI objective with a closed-form entropy (CFE) is given as follows:

@@ -281,7 +306,7 @@ Furthermore, in a lot of cases, a low-accuracy solution may be sufficient.
[^KMG2024]: Kim, K., Ma, Y., & Gardner, J. (2024). Linear Convergence of Black-Box Variational Inference: Should We Stick the Landing?. In International Conference on Artificial Intelligence and Statistics (pp. 235-243). PMLR.
## Advanced Usage

There are two major ways to customize the behavior of `RepGradELBO`
There are two major ways to customize the behavior of `KLMinRepGradDescent`

- Customize the `Distributions` functions: `rand(q)`, `entropy(q)`, `logpdf(q)`.
- Customize `AdvancedVI.reparam_with_entropy`.
61 changes: 61 additions & 0 deletions docs/src/klminrepgradproxdescent.md
@@ -0,0 +1,61 @@
# [`KLMinRepGradProxDescent`](@id klminrepgradproxdescent)

## Description

This algorithm is a slight variation of [`KLMinRepGradDescent`](@ref klminrepgraddescent) specialized to [location-scale families](@ref locscale).
Therefore, it also aims to minimize the exclusive (or reverse) Kullback-Leibler (KL) divergence over the space of parameters.
However, instead of plain stochastic gradient descent, it uses stochastic proximal gradient descent with the [proximal operator](@ref proximalocationscaleentropy) of the entropy of location-scale variational families, as discussed in [^D2020][^KMG2024][^DGG2023].
The remainder of the section will only discuss details specific to `KLMinRepGradProxDescent`.
Thus, for general usage and additional details, please refer to the docs of `KLMinRepGradDescent` instead.

```@docs
KLMinRepGradProxDescent
```

It implements the stochastic proximal gradient descent-based algorithm described in [^D2020][^KMG2024][^DGG2023].

## Methodology

Recall that [KLMinRepGradDescent](@ref klminrepgraddescent) maximizes the ELBO.
Now, the ELBO can be re-written as follows:

```math
\mathrm{ELBO}\left(q\right) \triangleq \mathcal{E}\left(q\right) + \mathbb{H}\left(q\right),
```

where

```math
\mathcal{E}\left(q\right) = \mathbb{E}_{\theta \sim q} \log \pi\left(\theta\right)
```

is often referred to as the *negative energy functional*.
`KLMinRepGradProxDescent` attempts to address the fact that maximizing the whole ELBO can be unstable due to non-smoothness of $$\mathbb{H}\left(q\right)$$[^D2020].
For this, `KLMinRepGradProxDescent` relies on proximal stochastic gradient descent, where the problematic term $$\mathbb{H}\left(q\right)$$ is separately handled via a *proximal operator*.
Specifically, `KLMinRepGradProxDescent` first estimates the gradient of the energy $$\mathcal{E}\left(q\right)$$ only via the reparameterization gradient estimator.
Let us denote this as $$\widehat{\nabla_{\lambda} \mathcal{E}}\left(q_{\lambda}\right)$$.
Then `KLMinRepGradProxDescent` iterates the step

```math
\lambda_{t+1} = \mathrm{prox}_{-\gamma_t \mathbb{H}}\big(
\lambda_{t} + \gamma_t \widehat{\nabla_{\lambda} \mathcal{E}}(q_{\lambda_t})
\big) ,
```

where

```math
\mathrm{prox}_{h}(\lambda_t)
= \argmin_{\lambda \in \Lambda}\left\{
h(\lambda) + \frac{1}{2}{\lVert \lambda - \lambda_t \rVert}_2^2
\right\}
```

is a proximal operator for the entropy.
As long as $$\mathrm{prox}_{-\gamma_t \mathbb{H}}$$ can be evaluated efficiently, this scheme can side-step the fact that $$\mathbb{H}(\lambda)$$ is difficult to deal with via gradient descent.
For location-scale families, it turns out the proximal operator of the entropy can be evaluated efficiently[^D2020], which is implemented as [`ProximalLocationScaleEntropy`](@ref proximalocationscaleentropy).
This has been empirically shown to be more robust[^D2020][^KMG2024].
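
As a worked illustration of why such a proximal step is cheap, consider a single scale parameter $\sigma > 0$ whose contribution to the entropy is $\log \sigma$ up to a constant. This scalar setting and the symbols $\sigma$, $\sigma_t$, $\gamma$ are simplifying assumptions made here, the proximal operator is taken with the conventional $\frac{1}{2}$ factor on the squared distance, and the derivation is not necessarily how `ProximalLocationScaleEntropy` is implemented:

```math
\begin{aligned}
\mathrm{prox}_{-\gamma \mathbb{H}}(\sigma_t)
&= \argmin_{\sigma > 0}\left\{
  -\gamma \log \sigma + \frac{1}{2}{\left(\sigma - \sigma_t\right)}^2
\right\} , \\
0 &= -\frac{\gamma}{\sigma} + \sigma - \sigma_t
\quad\Leftrightarrow\quad
\sigma^2 - \sigma_t \sigma - \gamma = 0 , \\
\mathrm{prox}_{-\gamma \mathbb{H}}(\sigma_t)
&= \frac{\sigma_t + \sqrt{\sigma_t^2 + 4 \gamma}}{2} .
\end{aligned}
```

The result is available in closed form, is strictly positive, and is always larger than $\sigma_t$, so under these assumptions the step never drives the scale to zero, where the gradient of $\log \sigma$ blows up.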

[^D2020]: Domke, J. (2020). Provable smoothness guarantees for black-box variational inference. In *International Conference on Machine Learning*.
[^KMG2024]: Kim, K., Ma, Y., & Gardner, J. (2024). Linear Convergence of Black-Box Variational Inference: Should We Stick the Landing?. In International Conference on Artificial Intelligence and Statistics (pp. 235-243). PMLR.
[^DGG2023]: Domke, J., Gower, R., & Garrigos, G. (2023). Provable convergence guarantees for black-box variational inference. Advances in neural information processing systems, 36, 66289-66327.