Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: DistributionsBase.jl for packages defining custom distributions #1139

Open
tpapp opened this issue Jun 17, 2020 · 17 comments
Open

RFC: DistributionsBase.jl for packages defining custom distributions #1139

tpapp opened this issue Jun 17, 2020 · 17 comments

Comments

@tpapp
Copy link
Contributor

tpapp commented Jun 17, 2020

Distributions.jl is a high-quality implementation of many commonly used distributions, benefiting from continuous contributions and peer review from the members of the Julia community.

While it is natural to contribute commonly used distributions to this package, some distributions may not be generic enough to warrant this (eg distributions used in a very narrow subfield, invented for a specific application and not in widespread use yet, etc). We should nevertheless make it easy for distributions living in other packages to share the API without incurring the cost of dependencies that are used by Distributions.jl. Of course, distributions from such packages could be migrated to this one later on if necessary.

I am proposing that a minimal API is extracted to a small, lightweight package, which could be called DistributionsBase.jl, defining the

  1. functions cdf, pdf, ... that operate on Distributions (from the current export list,

  2. a @reexport_DistributionsBase macro that reexports these, for use by packages to make it easy to maintain a consistent common exported interface.

I am not sure that I would export the type hierarchy in the first pass, as I don't think would be used commonly by packages defining custom distributions. In any case, I would start with the bare minimum, more can be added on demand later.

(related: #525)

@nickrobinson251
Copy link
Contributor

nickrobinson251 commented Jun 17, 2020

What's the benefit to the interface being in a separate package? Is the idea that a MyNicheDistribution.jl package would not want to depend on Distributions.jl because it's a large dependency? (Is this a large dependency? does it take a long time to compile?) Or is there another advantage to a separate minimal "base"/"interface" package?

@tpapp
Copy link
Contributor Author

tpapp commented Jun 18, 2020

Yes, Distributions.jl is a somewhat large dependency by necessity (eg ultimately a lot of special functions come from https://github.com/JuliaStats/Rmath.jl, which is a binary dependency, too). A lot of distributions could be much more lightweight.

@tpapp
Copy link
Contributor Author

tpapp commented Jun 19, 2020

An additional benefit for doing something similar in Tables.jl was clarifying all details of the API.

@oxinabox
Copy link
Contributor

A thing is its often hard to define more niche Distributions without depending on Distributions anyway.
E.g. I defined PertBeta in terms of the Beta distribution.
I could do it without Distributions.jl, but in that case i would end up needing to depend on SpecialFunctions.jl and/or Rmath.jl anyway.

Another thing to bear in mind is it is desirable for Distributions.jl to be a one-stop-shop for all of your distributions.
So really, it would be good if niche Distribitions were just PR'd into Distributions.jl

@ViralBShah
Copy link
Contributor

Just curious - what's the latest here? I recently heard a couple of people echoing the sentiment for this sort of refactoring.

@tpapp
Copy link
Contributor Author

tpapp commented Oct 11, 2020

My understanding is that for this particular package, this does not have the support of maintainers so I didn't start any work on it.

Generally I still think that factoring out a generic API for packages like this at some point in their lifecycle is the best approach.

@ViralBShah
Copy link
Contributor

Wondering if @simonbyrne and @andreasnoack can chime in.

@cscherrer
Copy link

cscherrer commented Apr 26, 2021

Some design choices in Distributions can cause problems, like type constraints on constructors (making it unusable with Symbolics) and inconsistent argument checking in the form of the check_args keyword argument (making things difficult for PPL). At the same time, it grabs lots of names, and has a huge number of reverse dependencies.

At this point it's difficult to change the library itself; I've seen perfectly reasonable discussions devolve into bikeshedding. That's not meant to accuse anyone of any bad behavior; it's a natural consequence of a large codebase with a wide range of use-cases.

It's important for the Julia ecosystem to find a balance between standards and agility, and Distributions.jl is very heavily on the side of the former. As a result, there's a strong tendency to hack around its limitations. It has become so heavy-weight that in many cases people start a new package rather than try to update it. This seems to be the case with DistributionsAD.jl, and is certainly the case for MeasureTheory.jl.

I can see a great value in factoring on a DistributionsBase. Ideally, there could even be a way to allow parameterized distributions like we have in MeasureTheory, and make it more natural to build libraries like this as extensions instead of independent projects.

@oschulz
Copy link
Contributor

oschulz commented Sep 28, 2022

We could call it DistributionsInterface, to stress that's ist mostly about function definitions, not concrete implementations. But whatever the name, I think having definitions of pdf-, cdf- and similar functions in a lightweight package would be extremely helpful.

@ChrisRackauckas
Copy link

Can we at least start by moving some of the abstract types to a package? Things like SciML generally don't need all of the distributions defined but just need to dispatch on AbstractSampler and call logpdf. There's a few interface pieces that could be pulled out and large swaths of the ecosystem will get a pretty big dependency reduction.

@oschulz
Copy link
Contributor

oschulz commented Dec 4, 2022

This could also remove some function name redundancy between MeasureBase/MeasureTheory and Distributions, since MeasureBase could depend on DistributionsInterface.

@devmotion
Copy link
Member

I'm in favor of extracting an interface (see #1641 (review)) but I think one should make sure it is clearly defined and consider that there are multiple things that it seems would be good to change in a next breaking release of Distributions.

Things like SciML generally don't need all of the distributions defined but just need to dispatch on AbstractSampler and call logpdf.

I looked around a bit but it seems e.g. DifferenceEquations.jl, DiffEqFlux.jl, and DiffEqBayes.jl all use concrete distributions such as MvNormal as well, so logpdf wouldn't be sufficient. Is that a general pattern or is there actually a package that could depend only on such an interface package? (There is no type of the name AbstractSampler defined in Distributions.)

@oscardssmith
Copy link

The point isn't DiffEqFlux and DiffEqBayes. The problem is that currentlly DiffEqBase relies on Distributions only for https://github.com/SciML/DiffEqBase.jl/blob/3fe9a5b9676caccc1c0ba2184bc1aca9d353a8a2/src/solve.jl#L1085. This one line of code add .8 seconds to the using time of DiffEqBase (a package that has 4s using time in total). Removing this dependency would save a noticable amount of time from people using OrdinaryDiffEq without Distributions (which is fairly common).

@ChrisRackauckas
Copy link

DiffEqBase and QuasiMonteCarlo are the two I'm thinking about. They want to dispatch on "if I see a Distribution", but don't need any distributions themselves. And as Oscar says, it's a pretty big hit to support that right now, and that 0.8 seconds can be approximately 0.

@devmotion
Copy link
Member

Oscar's example doesn't use logpdf though, hence it did not show up in my search.

This gets a bit off-topic: I didn't know this functionality exists and honestly I am a bit surprised that basically all problem types support distributions as initial values - I wouldn't have thought, e.g., that's supported for e.g. ODEs (and it's not documented in https://docs.sciml.ai/DiffEqDocs/stable/types/ode_types/#SciMLBase.ODEProblem it seems, according to the docstring u0 should be an array or number). handle_distribution_u0 seems also a bit prone to surprising non-reproducibility issues since there is no way to specify an RNG and it will always use the global one, even if one specifies e.g. a seed in solve (https://github.com/SciML/StochasticDiffEq.jl/blob/454bf4ea6e21a0024d2ccbb8f05c72555884501d/src/solve.jl#L67) or possibly some other RNG (if that's supported somewhere). It seems the main reason for handle_distribution_u0 is to support random initial conditions for which a concrete value can be generated with rand. So putting the issue with the RNG aside, maybe one could just call rand on anything that's not a Number or AbstractArray (functions are handled before it seems)? And maybe it's not even needed since ::Functions are supported anyway and one could just use (_, _) -> rand(Normal()) (which would even support custom RNGs and non-Distributions distributions)?

@oschulz
Copy link
Contributor

oschulz commented Dec 5, 2022

They want to dispatch on "if I see a Distribution", but don't need any distributions themselves.

This use case may be (partially) solvable using the trait-based DensityInterface.jl API, which both Distributions.jl and MeasureBase.jl/MeasureTheory.jl support already. DensityKind(d) tells if d has or is a density or neither, and logdensityof(d, x) can be used instead of logpdf. DensityInterface.jl has no notion of sampling, though. We could add a "supports rand" trait to DensityInterface.jl though, that might be useful anyway.

Independent of that, I still think and AbstractDistributions.jl would be very useful.

(Note: In DensityInterface.jl, DensityKind(obj) === HasDensity() is essentially equivalent to "obj is a (probability or non-normalized) measure", whereas DensityKind(obj) === IsDensity() is equivalent to "obj is the density of some measure in respect to some reference measure". Likelihoods also fall under IsDensity(). The default is DensityKind(::Any) === NoDensity().)

@ChrisRackauckas
Copy link

This gets a bit off-topic: I didn't know this functionality exists and honestly I am a bit surprised that basically all problem types support distributions as initial values - I wouldn't have thought, e.g., that's supported for e.g. ODEs (and it's not documented in https://docs.sciml.ai/DiffEqDocs/stable/types/ode_types/#SciMLBase.ODEProblem it seems, according to the docstring u0 should be an array or number). handle_distribution_u0 seems also a bit prone to surprising non-reproducibility issues since there is no way to specify an RNG and it will always use the global one, even if one specifies e.g. a seed in solve (https://github.com/SciML/StochasticDiffEq.jl/blob/454bf4ea6e21a0024d2ccbb8f05c72555884501d/src/solve.jl#L67) or possibly some other RNG (if that's supported somewhere). It seems the main reason for handle_distribution_u0 is to support random initial conditions for which a concrete value can be generated with rand. So putting the issue with the RNG aside, maybe one could just call rand on anything that's not a Number or AbstractArray (functions are handled before it seems)? And maybe it's not even needed since ::Functions are supported anyway and one could just use (_, _) -> rand(Normal()) (which would even support custom RNGs and non-Distributions distributions)?

It came before those featurse, but indeed we could deprecate it. It is off topic though and we should discuss this in DiffEqBase.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

9 participants