Flattening of parameters: what do we do? #46

torfjelde · 2023-03-25T16:02:59Z

torfjelde
Mar 25, 2023
Maintainer

Overview

Flattening of parameters is an issue that has been discussed many times in the Julia community, and there are numerous attempts at addressing this:

- ParameterHandling.jl
- Functors.jl
- Small piece of code doing something similar in DynamicPPL: https://github.com/TuringLang/DynamicPPL.jl/blob/b23acff013a9111c8ce2c89dbf5339e76234d120/src/utils.jl#L434-L473
  - Basically the same approach as in ParameterHandling.jl, but less extensive and more AD-friendly.
- And probably more.

This is also an issue we're facing here in AdvancedVI.jl, and it becomes particularly annoying when combined with Distributions.jl as we might want to work with different parameterizations of a given distribution, etc.

So the issue is:

How do we "flatten" nested structs, etc. into something we can give to AD frameworks, i.e. an AbstractVector{<:Real}, while allowing the user to specify exactly which parameters are considered "learnable"?

torfjelde · 2023-03-25T16:03:03Z

torfjelde
Mar 25, 2023
Maintainer Author

The following is basically a copy-paste from #45 (comment), as I think this covers some of the discussion.

For flattening the parameters, @theogf has proposed ParameterHandling.jl, but it currently does not work well with AD. The current alternative is ModelWrappers.jl, but it comes with many dependencies, which is potentially a governance topic.

For this one in particular we have an implementation in DynamicPPL that can potentially be moved to its own package if we really want to: https://github.com/TuringLang/DynamicPPL.jl/blob/b23acff013a9111c8ce2c89dbf5339e76234d120/src/utils.jl#L434-L473

But this has a couple of issues:

1. Requires 2n memory, since we can't release the original object (we need it as the first argument for construction since these things often depend on runtime information, e.g. the dimensionality of a MvNormal).
2. Can't specialize on which parameters we actually want, e.g. maybe we only want to learn the mean-parameter for a MvNormal.

(1) can be addressed by instead taking a closure-approach a la Functors.jl:

function flatten(d::MvNormal{<:AbstractVector,<:Diagonal})
    dim = length(d)
    # The closure reads from the flat vector `x` and captures only `dim`, not `d`.
    function MvNormal_unflatten(x)
        return MvNormal(x[1:dim], Diagonal(x[dim+1:end]))
    end

    return vcat(d.μ, diag(d.Σ)), MvNormal_unflatten
end
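
To make the round trip concrete without depending on Distributions.jl, here is a minimal sketch of the same closure idea. `DiagGaussian` is a hypothetical stand-in type, not something from any of the packages mentioned above, and the field names are assumptions made for illustration.

```julia
# Hypothetical stand-in for a diagonal Gaussian; not a Distributions.jl type.
struct DiagGaussian{T<:Real}
    μ::Vector{T}
    variances::Vector{T}
end

function flatten(d::DiagGaussian)
    dim = length(d.μ)
    # Only `dim` is captured by the closure, so `d` itself can be released.
    unflatten(x::AbstractVector) = DiagGaussian(x[1:dim], x[dim+1:end])
    return vcat(d.μ, d.variances), unflatten
end

d = DiagGaussian([0.0, 1.0], [1.0, 4.0])
θ, re = flatten(d)   # θ is a plain vector that can be handed to an AD framework
d2 = re(2 .* θ)      # rebuild a DiagGaussian from a modified flat vector
```

The closure only needs the runtime information required for reconstruction (here the dimension), which is what avoids keeping the original object around.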

For (2), we have a couple of immediate options:
a) Define "wrapper" distributions.
b) Take a contextual dispatch approach.

For (a) we'd have something like:

# `V` and `F` have to be explicit type parameters for the supertype to be well-defined.
abstract type WrapperDistribution{V,F,D<:Distribution{V,F}} <: Distribution{V,F} end

# HACK: Probably shouldn't do this.
inner_dist(x::WrapperDistribution) = x.inner

# TODO: Specialize further on `x` to avoid hitting default implementations?
Distributions.logpdf(d::WrapperDistribution, x) = logpdf(inner_dist(d), x)
# Etc.

struct MeanParameterized{V,F,D<:Distribution{V,F}} <: WrapperDistribution{V,F,D}
    inner::D
end

function flatten(d::MeanParameterized{<:Any,<:Any,<:MvNormal})
    μ = mean(d.inner)
    # The covariance is held fixed; only the mean is exposed as a parameter.
    function MeanParameterized_MvNormal_unflatten(x)
        return MeanParameterized(MvNormal(x, d.inner.Σ))
    end

    return μ, MeanParameterized_MvNormal_unflatten
end

Pros:
- It's fairly simple to implement.

Cons:
- Requires wrapping all the distributions all the time.
- Nice until we have other sorts of nested distributions, in which case this can get real ugly real fast.

For (b) we'd have something like

struct MeanOnly end

function flatten(::MeanOnly, d::MvNormal)
    μ = mean(d)
    function MvNormal_meanonly_unflatten(x)
        # No wrapper needed: the context, not the type, selects the learnable parameters.
        return MvNormal(x, d.Σ)
    end

    return μ, MvNormal_meanonly_unflatten
end

Pros:
- Cleaner as it avoids nesting.
- Can easily support "wrapper" distributions since it can just pass the context downwards.

Cons:
- Somewhat unclear to me how to make all this composable, e.g. how do we handle arbitrary structs containing distributions?
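
As a rough illustration of "passing the context downwards", here is a minimal sketch using only hypothetical types (`Gaussian`, `Shifted`, `AllParams` are made up for this example and are not Distributions.jl types). It only covers the wrapper case; it does not answer the question of arbitrary structs.

```julia
# Hypothetical types for illustration only.
struct Gaussian
    μ::Vector{Float64}
    σ::Vector{Float64}
end

struct Shifted{D}
    inner::D
    offset::Float64
end

struct MeanOnly end    # context: only the mean is learnable
struct AllParams end   # context: every parameter is learnable

flatten(::MeanOnly, d::Gaussian) = (copy(d.μ), x -> Gaussian(x, d.σ))

function flatten(::AllParams, d::Gaussian)
    n = length(d.μ)
    return vcat(d.μ, d.σ), x -> Gaussian(x[1:n], x[n+1:end])
end

# The wrapper does not interpret the context; it just forwards it.
function flatten(ctx, d::Shifted)
    θ, unflatten_inner = flatten(ctx, d.inner)
    return θ, x -> Shifted(unflatten_inner(x), d.offset)
end

q = Shifted(Gaussian([0.0, 1.0], [1.0, 2.0]), 3.0)
θ_mean, re_mean = flatten(MeanOnly(), q)
θ_all, re_all = flatten(AllParams(), q)
q_new = re_mean([5.0, 6.0])   # only the mean changes; σ and offset are kept
```

The same object yields different flat vectors depending on the context, without any change to its type.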

2 replies

torfjelde Mar 26, 2023
Maintainer Author

@devmotion I feel like you might just know of an existing solution that does exactly what we're after 😅

Red-Portal Jun 26, 2023
Maintainer

About excluding "trainable variables": I just learned that Functors.jl lets you specify which fields are to be flattened. I guess this problem is also solved through Functors.jl.
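
For reference, a minimal sketch of what restricting the fields looks like with `Functors.@functor`. The `LocScale` struct and its field names are hypothetical and only serve as an example.

```julia
using Functors

# Hypothetical variational family; only `location` is registered as a child,
# so `scale` is left untouched when the struct is traversed.
struct LocScale{M,S}
    location::M
    scale::S
end
Functors.@functor LocScale (location,)

q = LocScale([0.0, 1.0], [1.0, 2.0])
q2 = fmap(x -> x .+ 10, q)   # applies the function to `location` only
```

Fields left out of the tuple are carried over unchanged when the struct is reconstructed.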

Red-Portal · 2023-03-25T21:33:41Z

Red-Portal
Mar 25, 2023
Maintainer

I have a few additional considerations.

1. If we only plan on supporting ADVI and not BBVI, I think composability would be less of a concern since we could implement a single LocationScale distribution and specialize everything to it. If we would like to support BBVI, it would only make sense if the users could compose their own variational family, so we have a problem. Not sure how large of a market there is for BBVI, though.
2. Is there a demand for fixing certain parameters? I've personally never thought of such a use-case for VI.
3. None of the currently suggested solutions can handle Cholesky-valued parameters out of the box. And for me, this seems to be the key to low overhead.

Questions:

What do you mean by nested distribution?

8 replies

Red-Portal Mar 27, 2023
Maintainer

I meant here that ParameterHandling, for example, does not support Choleskys directly, but only PD matrices. For us, there is no need to form a PD matrix during inference, so this is slightly inconvenient.

But a PDMat can be constructed using a Cholesky: https://github.com/JuliaStats/PDMats.jl/blob/fff131e11e23403931a42f5bfb3384f0d2b114c9/src/pdmat.jl#L20 :)

Hi Tor,
I'm not sure why you're mentioning PDMat here because ParameterHandling.jl doesn't support PDMat, but its custom wrapper positive_definite. But even if it did, PDMat reconstructs the full matrix during construction, and MvNormal operates with PDMats anyway so there is a whole chain of events that I think are not ideal, which motivated the original idea about implementing our custom LocationScale.

But, maybe I should worry about the Cholesky-only way of things later and just stick with PDMats first. What do you think?

torfjelde Mar 27, 2023
Maintainer Author

I'm not sure why you're mentioning PDMat here because ParameterHandling.jl doesn't support PDMat, but its custom wrapper positive_definite.

Aaah I see; my bad! Thanks!

But even if it did, PDMat reconstructs the full matrix during construction

I thought the constructor taking a Cholesky avoided this, but looking into AbstractMatrix(::Cholesky) I see it actually constructs the entire thing 😕 So you're completely right 👍 Hmm, that is annoying.

MvNormal operates with PDMats anyway so there is a whole chain of events that I think are not ideal, which motivated the original idea about implementing our custom LocationScale.

Not happy with just defining it using Scale and Shift from Bijectors.jl, where the Scale just uses the Cholesky? As in, it's already possible to do:

using Bijectors
transformed(MvNormal(...), Shift(μ) ∘ Scale(cholesky(Σ).L))

and work with this. So the only remaining part is making Scale(::Cholesky) valid, I think.

But, maybe I should worry about the Cholesky-only way of things later and just stick with PDMats first. What do you think?

But yes, I wouldn't really worry about this right now 👍
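
For context on the Cholesky-only idea, here is a minimal sketch of a location-scale family that works directly with the lower-triangular factor and never materializes the full covariance. `LocationScaleChol`, `transform`, and `logdetcov` are hypothetical names for this example; this is not the Bijectors.jl API.

```julia
using LinearAlgebra

# Hypothetical location-scale family parameterized by the Cholesky factor L.
struct LocationScaleChol{V<:AbstractVector,S<:LowerTriangular}
    μ::V
    L::S
end

# Push a base sample z through x = μ + L z; Σ = L Lᵀ is never formed.
transform(q::LocationScaleChol, z::AbstractVector) = q.μ .+ q.L * z

# log|det Σ| = 2 Σᵢ log Lᵢᵢ, read directly off the diagonal of the factor.
logdetcov(q::LocationScaleChol) = 2 * sum(log, diag(q.L))

Σ = [4.0 2.0; 2.0 3.0]
q = LocationScaleChol([1.0, -1.0], cholesky(Σ).L)
x = transform(q, [1.0, 0.0])
```

Here `Σ` is only built to obtain a factor for the example; during inference one would presumably store and update `L` itself.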

Red-Portal Mar 27, 2023
Maintainer

Not happy with just defining it using Scale and Shift from Bijectors.jl, where the Scale just uses the Cholesky? As in, it's already possible to do:
using Bijectors
transformed(MvNormal(...), Shift(μ) ∘ Scale(cholesky(Σ).L))
and work with this. So the only remaining part is making Scale(::Cholesky) valid, I think.

Oh, I'm happy with that idea for now. I'll try to work with that. Although that doesn't solve our flattening problem yet! :(

torfjelde Mar 27, 2023
Maintainer Author

Although that doesn't solve our flattening problem yet! :(

Haha yeah it doesn't..

theogf Apr 4, 2023
Maintainer

I meant here that ParameterHandling, for example, does not support Choleskys directly, but only PD matrices. For us, there is no need to form a PD matrix during inference, so this is slightly inconvenient.

But a PDMat can be constructed using a Cholesky: https://github.com/JuliaStats/PDMats.jl/blob/fff131e11e23403931a42f5bfb3384f0d2b114c9/src/pdmat.jl#L20 :)

Hi Tor, I'm not sure why you're mentioning PDMat here because ParameterHandling.jl doesn't support PDMat, but its custom wrapper positive_definite. But even if it did, PDMat reconstructs the full matrix during construction, and MvNormal operates with PDMats anyway so there is a whole chain of events that I think are not ideal, which motivated the original idea about implementing our custom LocationScale.

But, maybe I should worry about the Cholesky-only way of things later and just stick with PDMats first. What do you think?

Regarding ParameterHandling.jl, if you feel that one transformation is missing we can very easily add it to the list. One could even reuse some of the functions (like vec_to_tril).
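
To illustrate the kind of transformation this would involve, here is a minimal sketch that maps a lower-triangular factor to a flat vector and back. The helper names echo `vec_to_tril`, but this is a hypothetical standalone version, not ParameterHandling.jl's implementation, and it is not written with AD compatibility in mind (it mutates an array).

```julia
using LinearAlgebra

# Collect the lower triangle column by column into a flat vector.
function tril_to_vec(L::LowerTriangular)
    n = size(L, 1)
    return [L[i, j] for j in 1:n for i in j:n]
end

# Inverse of `tril_to_vec`: rebuild the factor from n(n+1)/2 entries.
function vec_to_tril(v::AbstractVector)
    n = (isqrt(8 * length(v) + 1) - 1) ÷ 2
    n * (n + 1) ÷ 2 == length(v) || throw(ArgumentError("length is not a triangular number"))
    M = zeros(eltype(v), n, n)
    k = 0
    for j in 1:n, i in j:n
        k += 1
        M[i, j] = v[k]
    end
    return LowerTriangular(M)
end

L = LowerTriangular([1.0 0.0; 2.0 3.0])
v = tril_to_vec(L)
L2 = vec_to_tril(v)
```

Only the n(n+1)/2 free entries end up in the flat vector, so the structural zeros above the diagonal are never treated as parameters.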

ToucheSir · 2023-04-02T18:02:20Z

ToucheSir
Apr 2, 2023

Stumbled upon this while reading through some interesting GSoC proposals. My main advice would be to keep the scope as narrow as possible. Trying for a very general system quite quickly leads to JuliaGaussianProcesses/ParameterHandling.jl#43, and down that road is madness 😅.

0 replies

Red-Portal · 2023-04-03T18:26:30Z

Red-Portal
Apr 3, 2023
Maintainer

I think I'm leaning toward Functors.jl. I think it already does everything we want to do, and I personally found it to be quite reliable already. Can anybody think of potential limitations? @ToucheSir @torfjelde

Provided that we make our simple-enough-to-maintain pre-packaged functors (again, another reason to use the location-scale abstraction IMO), I think it should be good. I also guess the flows in Bijectors might also need to be made compatible with Functors.jl.

13 replies

ToucheSir Apr 10, 2023

@torfjelde Functors.AbstractWalk and StructWalk.WalkStyle (which it was inspired by) would work how you suspect, yes. Have a look at Optimisers.jl for how we use the former to handle trainable params/flattening and the StructWalk.jl repo/Transformers.jl for the latter.

ComposedFunction should have functor already defined. If that isn't working for you, mind filing an issue?

To @theogf's point, Functors.jl doesn't put that much effort into flattening or adding constraints on parameters. That's the main philosophical point IMO: we do not want to require custom wrapper types to express such constraints, so all information is stored out-of-band. If you're not bound by such constraints, maybe a more specialized library would be better.

Red-Portal Jun 7, 2023
Maintainer

Sorry for the hiatus. I've been recovering from NeurIPS submissions.

To @theogf's point, Functors.jl doesn't put that much effort into flattening or adding constraints on parameters. That's the main philosophical point IMO: we do not want to require custom wrapper types to express such constraints, so all information is stored out-of-band. If you're not bound by such constraints, maybe a more specialized library would be better.

I have a recent paper studying the effect of constraints on the variational parameters, and I think the conclusion is pretty clear: we don't want to put constraints to get fast convergence. So I think that won't constrain us for the choice of framework here.

Red-Portal Jun 7, 2023
Maintainer

@torfjelde what are the plans for NormalizingFlows.jl in terms of flattening/unflattening?

theogf Jun 8, 2023
Maintainer

I have a recent paper studying the effect of constraints on the variational parameters, and I think the conclusion is pretty clear: we don't want to put constraints to get fast convergence. So I think that won't constrain us for the choice of framework here.

This looks like a really cool paper! I will have a look!

torfjelde Jun 8, 2023
Maintainer Author

Sorry for the hiatus. I've been recovering from NeurIPS submissions.

Good luck!

what are the plans for NormalizingFlows.jl in terms of flattening/unflattening?

As a starting point, I think we're just going to use Functors.jl and then the user will have to specify transformations themselves.

Basically, the approach I mention in #46 (reply in thread) or just Flux.destructure.

torfjelde · 2024-06-06T07:52:37Z

torfjelde
Jun 6, 2024
Maintainer Author

Maybe worth adding a comment as to why this was closed, @Red-Portal? Will be useful for future reference :)

1 reply

Red-Portal Jun 6, 2024
Maintainer

Yeah I should have done that!

So we tried out Functors.jl + Optimisers.jl and it worked like a charm. The latency was low enough so that the v0.3 rewrite is able to achieve performance similar to v0.2. Also, CUDA worked well without issue. So, I think we'll stick to this combination for now.
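
For future readers, a minimal sketch of what the Functors.jl + Optimisers.jl combination looks like for flattening. The `LocScale` struct is a hypothetical example and not the actual AdvancedVI.jl variational family.

```julia
using Functors, Optimisers

# Hypothetical variational family used only for this example.
struct LocScale{M,S}
    location::M
    scale::S
end
Functors.@functor LocScale

q = LocScale([0.0, 1.0], [1.0, 2.0])

# `destructure` returns the flat parameter vector and a reconstructor.
θ, re = Optimisers.destructure(q)
q2 = re(θ .+ 1)   # rebuild a LocScale from a modified flat vector
```

The reconstructor plays the same role as the unflatten closures discussed earlier in the thread.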
