Basic rewrite of the package 2023 edition #45

Closed
wants to merge 147 commits

Conversation

Red-Portal
Member

@Red-Portal Red-Portal commented Mar 14, 2023

Hi, this is the initial pull request for the rewrite of AdvancedVI, as a successor to #25.

The following panel will be updated in real-time, reflecting the discussions happening below.

Roadmap

  • Change the gradient computation interface such that different algorithms can directly manipulate the gradients.
  • Migrate to the LogDensityProblems interface.
  • Migrate to AbstractDifferentiation.jl. Not mature enough yet.
  • Use the ADTypes interface.
  • Use Functors.jl for flattening/unflattening variational parameters.
  • Add more interfaces for calling optimize (see "Missing API method" #32).
  • Add pre-packaged variational families.
    • location-scale family
    • Reduce memory usage of full-rank parameterization (seems like there's an unfavorable compute-memory trade-off; see this thread)
  • Migrate to Optimisers.jl.
  • Implement minibatch subsampling (probably requires changes upstream, e.g., in DynamicPPL, too) (separate issue)
  • Add callback option (see "Callback function during training" #5)
  • Add control variate interface
  • Add BBVI (score gradient). Not urgent.
  • Tests
  • Benchmarks
    • Compare performance against the current version.
    • ~~Compare against competing libraries (e.g., NumPyro, Stan, and probably a bare-bones Julia/C++ implementation).~~
  • Support GPU computation (although Bijectors will be a bottleneck for this). (separate issue)

Topics to Discuss

  • Should we use AbstractDifferentiation? Not now.
  • ✔️ Should we migrate to Optimisers? (probably yes)
  • ✔️ Should we call restructure inside of optimize so that the flattening/unflattening is completely abstracted away from the user? In the current state of things, Flux would then have to be added as a dependency; otherwise we'd have to roll our own implementation of destructure. destructure is now part of Optimisers, which is much more lightweight (see the sketch after this list).
  • ✔️ Should we keep TruncatedADAGrad and DecayedADAGrad? I think these are quite outdated and I would advise people against using them, so how about deprecating them? Planning to deprecate.
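
To make the destructure point concrete, here is a small sketch (not code from this PR; θ is a made-up container of variational parameters) of flattening and rebuilding with Optimisers.destructure:

    using Optimisers

    # Hypothetical container of variational parameters.
    θ = (location = zeros(3), scale = ones(3))

    # Flatten into a single vector plus a closure that rebuilds the container.
    λ, restructure = Optimisers.destructure(θ)   # λ == [0, 0, 0, 1, 1, 1]
    θ′ = restructure(λ .+ 0.5)                   # rebuild with updated parameters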

Demo

    using Turing
    using Bijectors
    using Optimisers
    using ForwardDiff
    using ADTypes
    using DynamicPPL
    using LinearAlgebra
    using LogDensityProblems

    import AdvancedVI as AVI

    μ_y, σ_y = 1.0, 1.0
    μ_z, Σ_z = [1.0, 2.0], [1.0 0.; 0. 2.0]

    Turing.@model function normallognormal()
        y ~ LogNormal(μ_y, σ_y)
        z ~ MvNormal(μ_z, Σ_z)
    end
    model   = normallognormal()
    b       = Bijectors.bijector(model)
    b⁻¹     = inverse(b)
    prob    = DynamicPPL.LogDensityFunction(model)
    d       = LogDensityProblems.dimension(prob)

    # Initial mean-field Gaussian approximation: random location, unit scale.
    μ = randn(d)
    L = Diagonal(ones(d))
    q = AVI.MeanFieldGaussian(μ, L)

    # Construct the ADVI objective and optimize for up to n_max_iter iterations.
    n_max_iter = 10^4
    q, stats = AVI.optimize(
        AVI.ADVI(prob, b⁻¹, 10),
        q,
        n_max_iter;
        adbackend = AutoForwardDiff(),
        optimizer = Optimisers.Adam(1e-3)
    )

@Red-Portal Red-Portal added enhancement New feature or request help wanted Extra attention is needed labels Mar 14, 2023
@Red-Portal Red-Portal changed the title from "Basic rewrite of the package 2023 edition [WIP]" to "[WIP] Basic rewrite of the package 2023 edition" Mar 14, 2023
@Red-Portal Red-Portal removed enhancement New feature or request help wanted Extra attention is needed labels Mar 14, 2023
@Red-Portal Red-Portal marked this pull request as draft March 16, 2023 20:55
This is to avoid having to reconstruct transformed distributions all
the time. The direct use of bijectors also avoids going through lots
of abstraction layers that could break.

Instead, transformed distributions could be constructed only once when
returning the VI result.
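
As a small illustration of this idea (a sketch, not code from this PR; q and b⁻¹ refer to the variational approximation and inverse bijector from the demo above, and I'm assuming q behaves like an ordinary Distributions.jl distribution):

using Bijectors

# Inside the optimization loop, samples are drawn from q and pushed through
# b⁻¹ directly, e.g. z = b⁻¹(rand(q)), so no transformed-distribution object
# is rebuilt at every step. Only the final result is wrapped, once:
q_result = Bijectors.transformed(q, b⁻¹)
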
@torfjelde
Member

I'll have a look at the PR itself later, but for now:

AdvancedVI.jl naively reconstructs/deconstructs MvNormal from its variational parameters. This is okay for the mean-field parameterization, but for a full-rank or non-diagonal covariance parameterization it is a little more complicated, since MvNormal in Distributions.jl asks for a PDMat. So the variational parameters must first be converted to a matrix, then to a PDMat, and then fed to MvNormal. For high-dimensional problems, I'm not sure this is ideal.
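
For a concrete picture of the conversion being described, here is a hypothetical unflatten for the full-rank case (illustration only, not code from this PR):

using Distributions, LinearAlgebra, PDMats

# Hypothetical reconstruction of a full-rank MvNormal from a flat parameter
# vector λ = [μ; vec(L)], where L is a d×d lower-triangular Cholesky factor.
function unflatten_fullrank(λ::AbstractVector, d::Int)
    μ = λ[1:d]
    L = LowerTriangular(reshape(λ[d+1:end], d, d))
    # Distributions.jl wants a positive-definite matrix type; PDMat
    # materializes Σ = L*L' internally before MvNormal will accept it.
    return MvNormal(μ, PDMat(Cholesky(L)))
end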

In relation to the topic above, I'm starting to believe that implementing our own custom distribution (just as @theogf previously did in #25) might be a good idea in terms of performance, especially for reparameterization-trick-based methods. However, instead of reinventing the wheel (by implementing every distribution in existence) or tying ourselves to a small number of specific distributions (a custom MvNormal, that is), I think implementing a single general LocationScale distribution would be feasible, where the user provides the underlying univariate base distribution. Through this, we could support distributions like the multivariate Laplace, which is not even supported in Distributions.jl, with a single general object.
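
A minimal sketch of what such a general location-scale family could look like (hypothetical names, not the design being proposed in this PR):

using Distributions, LinearAlgebra, Random

# q(z) = location + scale*ε, where ε has iid entries drawn from a
# user-chosen univariate base distribution (Normal, Laplace, TDist, ...).
struct LocationScale{L,S,D<:ContinuousUnivariateDistribution}
    location::L
    scale::S     # e.g. Diagonal (mean-field) or LowerTriangular (full-rank)
    base::D
end

function Random.rand(rng::Random.AbstractRNG, q::LocationScale)
    ε = rand(rng, q.base, length(q.location))
    return q.location + q.scale * ε
end

# Change-of-variables density: only a diagonal/triangular solve and a
# log-determinant of the scale are needed, never the covariance itself.
function logdensity(q::LocationScale, z)
    u = q.scale \ (z - q.location)
    return sum(logpdf.(q.base, u)) - logdet(q.scale)
end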

Maybe we should make this into a discussion. I feel like there are several different approaches we can take here.

For flattening the parameters, @theogf has proposed ParameterHandling.jl, but it currently does not work well with AD. The current alternative is ModelWrappers.jl, but it comes with many dependencies, which is potentially a governance concern.

For this one in particular, we have an implementation in DynamicPPL that could potentially be moved to its own package if we really want to: https://github.com/TuringLang/DynamicPPL.jl/blob/b23acff013a9111c8ce2c89dbf5339e76234d120/src/utils.jl#L434-L473

But this has a couple of issues:

  1. Requires 2× the memory, since we can't release the original object (we need it as the first argument for reconstruction, since these things often depend on runtime information, e.g. the dimensionality of an MvNormal).
  2. Can't specialize on which parameters we actually want, e.g. maybe we only want to learn the mean parameter of an MvNormal.

(1) can be addressed by instead taking a closure-approach a la Functors.jl:

using Distributions, LinearAlgebra

function flatten(d::MvNormal{<:Real,<:Diagonal})
    dim = length(d)
    # The closure only needs `dim`, so the original `d` can be released.
    function MvNormal_unflatten(x)
        return MvNormal(x[1:dim], Diagonal(x[dim+1:end]))
    end

    return vcat(d.μ, diag(d.Σ)), MvNormal_unflatten
end
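
Usage would then look like the following (assuming Distributions and LinearAlgebra are loaded):

d = MvNormal(zeros(2), Diagonal(ones(2)))
λ, unflatten = flatten(d)    # λ == [0.0, 0.0, 1.0, 1.0]
d′ = unflatten(λ .+ 0.1)     # rebuild from the flat vector; `d` itself is no longer needed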

For (2), we have a couple of immediate options:
a) Define "wrapper" distributions.
b) Take a contextual dispatch approach.

For (a) we'd have something like:

using Distributions

abstract type WrapperDistribution{V,F,D<:Distribution{V,F}} <: Distribution{V,F} end

# HACK: Probably shouldn't do this.
inner_dist(x::WrapperDistribution) = x.inner

# TODO: Specialize further on `x` to avoid hitting default implementations?
Distributions.logpdf(d::WrapperDistribution, x) = logpdf(inner_dist(d), x)
# Etc.

struct MeanParameterized{V,F,D<:Distribution{V,F}} <: WrapperDistribution{V,F,D}
    inner::D
end
MeanParameterized(inner::Distribution{V,F}) where {V,F} =
    MeanParameterized{V,F,typeof(inner)}(inner)

function flatten(d::MeanParameterized{V,F,<:MvNormal}) where {V,F}
    μ = mean(d.inner)
    function MeanParameterized_MvNormal_unflatten(x)
        return MeanParameterized(MvNormal(x, d.inner.Σ))
    end

    return μ, MeanParameterized_MvNormal_unflatten
end

Pros:

  • It's fairly simple to implement.

Cons:

  • Requires wrapping all the distributions all the time.
  • Nice until we have other sorts of nested distributions, in which case this can get real ugly real fast.

For (b) we'd have something like

struct MeanOnly end

function flatten(::MeanOnly, d::MvNormal)
    μ = mean(d)
    function MvNormal_meanonly_unflatten(x)
        # Only the mean is learned; the covariance of `d` is kept fixed.
        return MvNormal(x, d.Σ)
    end

    return μ, MvNormal_meanonly_unflatten
end

Pros:

  • Cleaner as it avoids nesting.
  • Can easily support "wrapper" distributions since it can just pass the context downwards.

Cons:

  • Somewhat unclear to me how to make all this composable, e.g. how do we handle arbitrary structs containing distributions?

@Red-Portal
Member Author

Red-Portal commented Mar 23, 2023

Hi @torfjelde

Maybe we should make this into a discussion. I feel like there are several different approaches we can take here.

Should we proceed here or create a separate issue?

Whatever approach we take, I think the key would be to avoid inverting or even computing the covariance matrix, provided that we operate with a Cholesky factor. None of the steps of ADVI require any of these, except for the STL estimator, where we do need to invert the Cholesky factor.
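
To spell this out with a small sketch (illustrative only, not this package's code):

using LinearAlgebra

d = 5
μ = randn(d)
L = LowerTriangular(0.1*randn(d, d) + I)   # Cholesky factor, so Σ = L*L'

# Reparameterized sampling touches only L, never Σ:
ε = randn(d)
z = μ + L*ε

# The entropy term needs only the log-determinant of L:
entropy = d/2*(1 + log(2π)) + sum(log ∘ abs, diag(L))

# Only the STL estimator needs ∇_z log q(z) = -Σ⁻¹(z - μ), which amounts to
# two triangular solves against L rather than forming or inverting Σ:
score = -(L' \ (L \ (z - μ)))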

@torfjelde
Member

Created a discussion: #46

@Red-Portal
Member Author

Red-Portal commented Jun 9, 2023

@torfjelde Hi, I have significantly changed the sketch for the project structure.

  1. As you previously suggested, the ELBO objective is now formed in a modular way.
  2. I've also migrated to AbstractDifferentiation instead of rolling our own custom differentiation glue functions.

Any comments on the new structure? Also, do you approve the use of AbstractDifferentiation?

@yebai
Member

yebai commented Jun 9, 2023

Also, do you approve the use of AbstractDifferentiation?

@devmotion what are your current thoughts on AbstractDifferentiation?

@Red-Portal
Member Author

Red-Portal commented Jun 9, 2023

I've now added the pre-packaged location-scale family. Overall, to the user, the basic interface looks like the following:

    using Turing, Bijectors, ForwardDiff, ADTypes
    using DynamicPPL, LogDensityProblems, LinearAlgebra
    import Flux
    import AdvancedVI as AVI

    μ_y, σ_y = 1.0, 1.0
    μ_z, Σ_z = [1.0, 2.0], [1.0 0.; 0. 2.0]

    Turing.@model function normallognormal()
        y ~ LogNormal(μ_y, σ_y)
        z ~ MvNormal(μ_z, Σ_z)
    end
    model   = normallognormal()
    b       = Bijectors.bijector(model)
    b⁻¹     = inverse(b)
    prob    = DynamicPPL.LogDensityFunction(model)
    d       = LogDensityProblems.dimension(prob)

    μ = randn(d)
    L = Diagonal(ones(d))
    q = AVI.MeanFieldGaussian(μ, L)

    λ₀, restructure  = Flux.destructure(q)

    function rebuild(λ′)
        restructure(λ′)
    end
    λ = AVI.optimize(
        AVI.ADVI(prob, b⁻¹, 10),
        rebuild,
        10000,
        λ₀;
        optimizer = Flux.ADAM(1e-3),
        adbackend = AutoForwardDiff()
    )
    q = restructure(λ)

    # Extract the location μ and scale factor L of the fitted approximation.
    μ = q.transform.outer.a
    L = q.transform.inner.a
    Σ = L*L'

    μ_true      = vcat(μ_y, μ_z)
    Σ_diag_true = vcat(σ_y, diag(Σ_z))

    @info("VI Estimation Error",
          norm- μ_true),
          norm(diag(Σ) - Σ_diag_true),)

Some additional notes to the comments above,

  1. Should we call restructure inside of optimize so that the flattening/unflattening is completely abstracted away from the user? In the current state of things, Flux would then have to be added as a dependency; otherwise we'd have to roll our own implementation of destructure.
  2. Should we keep TruncatedADAGrad and DecayedADAGrad? I think these are quite outdated and I would advise people against using them, so how about deprecating them?
  3. We should probably migrate to Optimisers.jl; the current optimization infrastructure is quite old. A rough sketch of what an Optimisers.jl-based update loop could look like follows below.
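
(The sketch below uses illustrative placeholders: the parameters and the gradient are stand-ins, not this package's API.)

using Optimisers

function fit(λ; n_iter = 10_000)
    state = Optimisers.setup(Optimisers.Adam(1e-3), λ)
    for _ in 1:n_iter
        g = randn(length(λ))          # stand-in for an estimated ELBO gradient
        state, λ = Optimisers.update(state, λ, g)
    end
    return λ
end

λ = fit(randn(10))                    # hypothetical flat variational parameters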

@yebai
Member

yebai commented Dec 22, 2023

@Red-Portal is there anything in this PR not yet merged by #49 and #50?

@Red-Portal
Member Author

@yebai Yes, we have the documentation still left. I'm currently working on it.

@yebai yebai closed this Jun 3, 2024