
Make Distributions GenerativeFunctions #274

Open
georgematheos wants to merge 47 commits into master

Conversation

georgematheos (Contributor)

This builds on #263 and resolves #259.

I make Distribution <: GenerativeFunction true and remove a lot of code in the static and dynamic DSLs that was specialized to distributions. I also remove the ChoiceAt combinator.
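For concreteness, a minimal sketch of the intended behavior (my illustration of the new uniformity, not an excerpt from the diff):

using Gen

# With Distribution <: GenerativeFunction, distributions respond to the
# generative function interface directly:
tr = simulate(normal, (0.0, 1.0))   # with this PR, returns a DistributionTrace
x = get_retval(tr)                  # the sampled value
get_choices(tr)                     # a ValueChoiceMap wrapping x
get_score(tr)                       # logpdf(normal, x, 0.0, 1.0)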

Note that I have not significantly modified the gradient code, since I have not yet taken the time to understand how it works. (So it still dispatches on whether something is a distribution or a generative function.) We may be able to further reduce the code footprint by removing specialization to distributions in the gradient calculations.

I have also added a test/benchmark folder with a couple of initial MH benchmarks adapted from the examples; we can add and revise benchmarks as we create them. Currently the benchmarks run MH on a static DSL model and a dynamic DSL model taken from the examples folder.

My initial benchmark results were volatile (they varied a lot between runs), so more careful benchmarking should be done. It looks like the changes somewhat improve performance for static code but slow down dynamic code by somewhere between 1.1x and 1.6x. The dynamic slowdown appears to be caused by my previous PR (ValueChoiceMap), not by the distributions-as-generative-functions changes; that said, I don't see how to do this PR without building on that one. I have not thoroughly investigated what causes the dynamic performance regression; we may be able to improve it.

Benchmarking results:

After this PR:

Simple static DSL (including CallAt nodes) MH on regression model:
  0.379326 seconds (4.19 M allocations: 309.771 MiB, 15.82% gc time)
  0.383679 seconds (4.19 M allocations: 309.771 MiB, 13.12% gc time)

Simple dynamic DSL MH on regression model:
  9.365213 seconds (88.12 M allocations: 4.972 GiB, 13.14% gc time)
  9.583431 seconds (88.12 M allocations: 4.972 GiB, 12.99% gc time)

After the ValueChoiceMap PR but before this PR:

Simple static DSL (including CallAt nodes) MH on regression model:
  0.658002 seconds (3.89 M allocations: 300.707 MiB, 9.39% gc time)
  0.618939 seconds (3.89 M allocations: 300.707 MiB, 10.44% gc time)

Simple dynamic DSL MH on regression model:
  9.536382 seconds (83.63 M allocations: 4.799 GiB, 12.63% gc time)
  9.662581 seconds (83.63 M allocations: 4.799 GiB, 12.26% gc time)

Before either PR:

Simple static DSL (including CallAt nodes) MH on regression model:
  0.423469 seconds (4.35 M allocations: 309.954 MiB, 19.35% gc time)
  0.416202 seconds (4.35 M allocations: 309.954 MiB, 17.90% gc time)

Simple dynamic DSL MH on regression model:
  8.099392 seconds (68.65 M allocations: 4.524 GiB, 16.40% gc time)
  6.430965 seconds (68.65 M allocations: 4.524 GiB, 17.02% gc time)

alex-lew (Contributor)

@georgematheos For these benchmark results, could you also run the experiment but at 10x the number of datapoints? That will help ensure that nothing is sneakily asymptotically slower (though I don't see why it would be).

georgematheos (Contributor, Author)

@alex-lew Here is some more benchmarking for asymptotics. I used 10x the datapoints for the static DSL and 1/10 the datapoints for the dynamic DSL (the dynamic DSL is slow and scales superlinearly, so it takes a very long time at 10x the datapoints). I also did a few runs of each, since I found the results varied a fair amount:

This PR:

Simple static DSL (including CallAt nodes) MH on regression model:
  3.954642 seconds (50.89 M allocations: 4.416 GiB, 19.14% gc time)
  4.119761 seconds (50.89 M allocations: 4.416 GiB, 23.29% gc time)

Simple dynamic DSL MH on regression model:
  0.138685 seconds (1.70 M allocations: 96.116 MiB, 9.75% gc time)
  0.139786 seconds (1.70 M allocations: 96.116 MiB, 8.17% gc time)

georgematheos@Georges-MacBook-Pro-3 benchmarks % julia run_benchmarks.jl
Simple static DSL (including CallAt nodes) MH on regression model:
  3.978059 seconds (50.89 M allocations: 4.416 GiB, 18.58% gc time)
  4.196043 seconds (50.89 M allocations: 4.416 GiB, 20.04% gc time)

Simple dynamic DSL MH on regression model:
  0.162830 seconds (1.70 M allocations: 96.116 MiB, 13.55% gc time)
  0.165859 seconds (1.70 M allocations: 96.116 MiB, 13.06% gc time)

georgematheos@Georges-MacBook-Pro-3 benchmarks % julia run_benchmarks.jl
Simple static DSL (including CallAt nodes) MH on regression model:
  3.833077 seconds (50.89 M allocations: 4.416 GiB, 18.93% gc time)
  3.992323 seconds (50.89 M allocations: 4.416 GiB, 20.45% gc time)

Simple dynamic DSL MH on regression model:
  0.150765 seconds (1.70 M allocations: 96.116 MiB, 14.91% gc time)
  0.151998 seconds (1.70 M allocations: 96.116 MiB, 13.49% gc time)

Master branch:

Simple static DSL (including CallAt nodes) MH on regression model:
  3.726138 seconds (52.32 M allocations: 4.416 GiB, 15.33% gc time)
  3.742488 seconds (52.32 M allocations: 4.416 GiB, 16.26% gc time)

Simple dynamic DSL MH on regression model:
  0.092530 seconds (1.33 M allocations: 87.372 MiB, 11.15% gc time)
  0.090397 seconds (1.33 M allocations: 87.372 MiB, 8.60% gc time)

georgematheos@Georges-MacBook-Pro-3 benchmarks % julia run_benchmarks.jl
Simple static DSL (including CallAt nodes) MH on regression model:
  3.941206 seconds (52.32 M allocations: 4.416 GiB, 17.40% gc time)
  4.000897 seconds (52.32 M allocations: 4.416 GiB, 18.58% gc time)

Simple dynamic DSL MH on regression model:
  0.111098 seconds (1.33 M allocations: 87.372 MiB, 20.31% gc time)
  0.106125 seconds (1.33 M allocations: 87.372 MiB, 17.79% gc time)

georgematheos@Georges-MacBook-Pro-3 benchmarks % julia run_benchmarks.jl
Simple static DSL (including CallAt nodes) MH on regression model:
  3.872930 seconds (52.32 M allocations: 4.416 GiB, 18.37% gc time)
  3.955795 seconds (52.32 M allocations: 4.416 GiB, 19.18% gc time)

Simple dynamic DSL MH on regression model:
  0.110984 seconds (1.33 M allocations: 87.372 MiB, 19.15% gc time)
  0.111851 seconds (1.33 M allocations: 87.372 MiB, 17.00% gc time)

So it looks like there is no asymptotic difference between the old implementation and the new one. Again, the dynamic DSL is slowed down a bit by this PR (I wouldn't be surprised if the slowdown is larger in this small example, since there's less time for the compiler to optimize). It also looks like the static DSL speedup does not appear in these runs (though there doesn't seem to be a significant slowdown either).

georgematheos (Contributor, Author)

@marcoct I agree that changing the choicemap interface may be reasonable. Soon, I will implement the AddressTree concept discussed in the design doc for modifying the update function (I want the new update function for my OUPM research), and this would provide a place to standardize some sort of "address tree iteration" interface.

In terms of implementing iteration behavior instead of get_submaps_shallow, this sounds reasonable to me, but we should decide whether the iterator should behave like get_submaps_shallow or instead yield pairs :complete_address => value. @bzinberg and I have discussed that before we commit to making the "address tree" nature of choicemaps a user-facing feature, rather than an implementation detail, we should think carefully about whether this is really a fundamental part of choicemaps, and how certain we are that we want to commit to it.
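For concreteness, a sketch of the two candidate behaviors (get_submaps_shallow is the existing interface; the flat iterator is hypothetical):

using Gen

cm = choicemap((:z, 2.0), (:x => :y, 1.0))

# Behavior 1: shallow iteration, as get_submaps_shallow already does --
# one level of the address tree at a time (with this PR, leaf values
# appear wrapped as ValueChoiceMaps):
for (addr, submap) in get_submaps_shallow(cm)
    # yields :z => (ValueChoiceMap of 2.0) and :x => (choicemap containing :y)
end

# Behavior 2 (hypothetical): a flat iterator over complete hierarchical
# addresses, yielding :z => 2.0 and (:x => :y) => 1.0, with no tree
# structure exposed to the caller.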

georgematheos marked this pull request as ready for review July 3, 2020 16:33
georgematheos marked this pull request as draft July 3, 2020 16:34
alex-lew marked this pull request as ready for review August 19, 2022 19:13
alex-lew (Contributor) commented Aug 19, 2022

@georgematheos I'm interested in seeing if we can merge this in for the next breaking-changes release. (This PR mostly doesn't break anything for users, but it does change the ChoiceMap interface a bit, e.g. the behavior of get_submaps_shallow.) I think it's really nice work! I've merged recent changes from master into the PR.

One question I have about the implementation is why the gradient-based GFI methods still special-case on Distribution and DistributionTrace. It seems we should be able to just treat calls to distributions as calls to black-box generative functions, as long as we implement the backprop-related methods correctly on distributions themselves. (This appears to be the strategy you took for simulate, generate, assess, propose, update, and regenerate.) Is there a reason you avoided doing this for backprop-related methods?

georgematheos (Contributor, Author)

@alex-lew I'm not totally sure why I did this -- I can take a look in more detail next week.

If I remember correctly, the reason may be that at the time I wrote the pull request, I did not understand Gen's backprop code, so I tried to keep it as similar to the previous version as possible, to make sure I didn't break anything.

Review thread on this excerpt from the DistributionTrace definition:

    args
    score::Float64
end
@inline dist(::DistributionTrace{T, Dist}) where {T, Dist} = Dist()
(Contributor)

@georgematheos Unfortunately, this implementation does not support more complicated distributions that are not 'singleton' structs, like the Mixture distributions. The problem is that they do not have zero-argument constructors. For example, HomogeneousMixture's definition looks like this:

struct HomogeneousMixture{T} <: Distribution{T}
    base_dist::Distribution{T}
    dims::Vector{Int}
end

If someone creates such a distribution and then simulates a trace from it, the trace does not remember the base_dist and the dims, so Dist() cannot reconstruct it and the trace cannot implement Gen.get_gen_fn.

One option is for the DistributionTrace to store a reference to the distribution object, and to have dist(d::DistributionTrace) = d.dist. That adds slight storage overhead, but maybe not enough to worry about -- it's one more pointer per choice in a trace.

Another option is to say that the Mixtures are not literally subtypes of Distribution -- they are just additional generative functions with ValueChoiceMaps.

I lean toward the first option -- thoughts?
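A minimal sketch of option 1 (the val field name and Trace supertype are my assumptions, not taken from the diff):

struct DistributionTrace{T, Dist <: Distribution{T}} <: Trace
    val::T           # assumed field for the sampled value
    args
    score::Float64
    dist::Dist       # new: store the distribution object itself
end

@inline dist(tr::DistributionTrace) = tr.dist
Gen.get_gen_fn(tr::DistributionTrace) = dist(tr)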

(Member)

> One option is for the DistributionTrace to store a reference to the distribution object, and to have dist(d::DistributionTrace) = d.dist. That adds slight storage overhead, but maybe not enough to worry about -- it's one more pointer per choice in a trace.

I think this may actually add zero storage overhead if the DistributionTrace is parametrically typed, and the type in question is a singleton (as it is for Normal etc.)! My guess is that Julia optimizes that sort of thing away. But it's worth checking.
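For instance, a quick standalone check (illustrative stand-ins, not Gen's actual types):

struct FakeNormal end          # singleton, like Gen's Normal

struct TraceWithDist{D}
    val::Float64
    dist::D
end

sizeof(TraceWithDist(1.0, FakeNormal()))   # 8 -- the singleton field adds no bytes
Base.issingletontype(FakeNormal)           # true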

alex-lew (Contributor)

All right, I think I've removed most (all?) of the special-casing that the dynamic and static DSLs do on the Distribution type -- so now the DSLs themselves are just glue holding together calls to black-box generative functions (including calls to Distributions). All such calls are handled uniformly. In many ways I think this is quite a simplification!

One interesting benefit of this PR's changes is that it is now possible to determine, from the trace of a program, which distribution each choice was drawn from. This could be used to implement, e.g., automatic Gibbs inference for discrete choices without the user needing to pass in a list of valid values, because the inference function could deduce the support of the choice from the get_args of the distribution's trace (see the sketch below).
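Something along these lines becomes possible (a hypothetical sketch; discrete_support is not part of this PR):

using Gen

# Deduce the support of a discrete choice from its subtrace, with no
# user-provided list of values:
function discrete_support(subtrace)
    d = get_gen_fn(subtrace)           # with this PR: the Distribution itself
    if d === categorical
        probs = get_args(subtrace)[1]  # categorical's probability vector
        return 1:length(probs)
    elseif d === bernoulli
        return (true, false)
    else
        error("support not known for $d")
    end
end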

However, merging this PR would break external packages that examine the IR of Gen static DSL functions and expect to find RandomChoiceNodes, including GenVariableElimination.jl and the not-exactly-a-package GenCompileContinuous.jl. It should be an easy fix to make those packages special-case on whether a GenerativeFunctionCallNode is a call to a Distribution, though.
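The fix might look roughly like this (a sketch; the IR accessor and node field names are my assumptions about Gen's static IR and may differ):

using Gen

@gen (static) function my_static_gf()
    x ~ normal(0, 1)
    return x
end

# With this PR, the node for x is a GenerativeFunctionCallNode whose callee
# is a Distribution; packages can special-case on that:
ir = Gen.get_ir(typeof(my_static_gf))
for node in ir.nodes
    if node isa Gen.GenerativeFunctionCallNode
        is_choice = node.generative_function isa Gen.Distribution
        # is_choice == true means: formerly a RandomChoiceNode
    end
end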
