Minimal API that a new distribution must implement #525

Open

johnmyleswhite opened this issue Aug 13, 2016 · 9 comments

@johnmyleswhite (Member) commented Aug 13, 2016

While working on some new distributions for a separate package, I've found that I frequently hit ambiguity warnings when defining operations like pdf(d::MyNewAbstractDistribution, x::Real) because those operations are ambiguous with operations like pdf(d::Distribution, x::Int).
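The pattern is easy to reproduce in isolation; here is a toy sketch (hypothetical names standing in for the pdf methods above):

abstract type MyDistribution end
struct MyNormal <: MyDistribution end

f(d::MyNormal, x::Real) = 1        # analogue of the new package's method
f(d::MyDistribution, x::Int) = 2   # analogue of the Distributions fallback

# For f(MyNormal(), 1), neither method is more specific than the other:
# the first wins on the distribution argument, the second on x.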

In the past, I could resolve these issues by only implementing a basic core API that defined operations like pdf(d::MyNewConcreteDistribution, x::Float64) and trusting that the fallbacks would handle conversions for me. With the increase in abstraction we've achieved thanks to the work on parametric distributions, I've lost track of what the core minimal API is. Which functions must a new distribution implement?

If someone is familiar with what the minimal API is off the top of their head, I can write documentation for it. If we need to decide this, I'd be happy to be involved in figuring out what the minimal API ought to be.

@jmxpearson (Contributor)

I don't have an immediate answer, but this is also important for, e.g., parameterizing Truncated, which has exactly this kind of ambiguity problem.

I'm happy to help discuss and document, though as I said, I don't know the answer.

@johnmyleswhite (Member, Author) commented Aug 16, 2016

OK, I'm starting to wrap my head around what exists. Here's my thinking about the big high-level design principles we might aspire to.

  • A parametric distribution of type T is essentially a template for code that operates in the space of values of type T. Put another way, every function on something like Normal{T} should evaluate its code using only operations that preserve the type T whenever possible. This includes ensuring that we only work with constants that are of type T.

This is a pretty complicated restriction and may be intractable given our use of integer parameters for some distributions. But I think it would help to resolve some of the current issues in the code that worry me as we move forward with parametric types.

Consider calling the following methods on both Normal{Float32} and Normal{BigFloat}:

  • skewness(d::Normal) = 0.0

To me, this seems like it ought to be something like skewness(d::Normal{T}) = zero(T), so that you can make sure you don't accidentally introduce a Float64 value into a computation that you're trying to carry out in reduced precision. The other direction seems less problematic, because low-precision values will be promoted up as necessary, so your worst-case scenario is the risk of type instability rather than precision errors.
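(A quick REPL check of the failure mode:

julia> typeof(0.0 * one(Float32))
Float64

julia> typeof(zero(Float32) * one(Float32))
Float32

One stray Float64 literal silently widens the whole Float32 computation.)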

For another example, consider:

  • entropy(d::Normal) = 0.5 * (log2π + 1.0) + log(d.σ)

This seems to have the same problem of fixing precision back towards Float64.
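A fully generic version might look something like this (just a sketch; the point is to compute the constant in T so that even BigFloat keeps full precision):

entropy(d::Normal{T}) where {T<:Real} = (log(2 * T(π)) + 1) / 2 + log(d.σ)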

In contrast, a method definition like this one strikes me as being near perfect:

  • rand(d::Normal) = d.μ + d.σ * randn()

The only problem I can see here is the use of randn, which probably ought to use an RNG with more precision for something like BigFloat. But it also has the danger of pushing things upwards, so that rand on a Normal{Float16} produces a number of much higher precision than you might expect.
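A type-aware sketch (randn(T) covers Float16/Float32/Float64 but not BigFloat, hence the restriction):

rand(d::Normal{T}) where {T<:AbstractFloat} = d.μ + d.σ * randn(T)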

The other alternative is where I think we lived before: all methods are generally restricted to Int and Float64. I think it's awesome that we're pushing for a much higher level of abstraction, but I think we might want to resolve some of these design decisions before we figure out what the contracts would be for methods like pdf(d, x) in general.

@andreasnoack (Member) commented Aug 16, 2016

This is very much the content of @jmxpearson's PRs, and I think we are almost there. The rules are that you cannot use floating-point literals in any method, and that you should let the inputs determine the type of the result by putting them first in expressions: e.g. x/2, not 1/2*x, because 1/2 evaluates to the Float64 value 0.5. Random number generation is a problem because randn doesn't support anything but Float64 right now, but for the other functions it should be possible to write generic versions, although it can be a little challenging. However, I think that being able to use Distributions with automatic differentiation outweighs the cost of the extra work required to write the methods.
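To illustrate the literal rule with hypothetical helpers:

halve_bad(x)  = 1/2 * x   # 1/2 is the Float64 value 0.5, so a Float32 input comes back as Float64
halve_good(x) = x / 2     # dividing by an Int preserves the input's float type

halve_bad(1.0f0)   # -> Float64
halve_good(1.0f0)  # -> Float32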

@johnmyleswhite (Member, Author)

Good to know.

We might not support randn with a rigorous algorithm for these types, but it already works beyond Float64 in Julia 0.5:

julia> randn(Float32)
-1.8167182f0

julia> randn(Float16)
Float16(-1.386)

If this is the direction we're heading, should we then also move toward having new distributions aspire to define a method like pdf(d, x::Real)? They could potentially work with something like _pdf(d, x::Float64) behind the scenes, but exporting only methods at the maximum level of generality would remove our need for most fallbacks, which ought to remove most of the ambiguity warnings.
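As a concrete sketch of that pattern (MyUnit is a made-up uniform-on-[0,1] distribution, written in current syntax, purely for illustration):

using Distributions

struct MyUnit <: ContinuousUnivariateDistribution end

# Single public method at maximal generality; no fallbacks needed.
Distributions.pdf(d::MyUnit, x::Real) = _pdf(d, Float64(x))

# Internal Float64 kernel.
_pdf(d::MyUnit, x::Float64) = 0.0 <= x <= 1.0 ? 1.0 : 0.0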

Does that seem feasible as our eventual API? We'd have to write down all of the functions, but if they all have the form foo(d, args::Real...), it seems tractable to make progress on this quickly.

@andreasnoack (Member)

Thanks for correcting me. I don't know why I haven't noticed that change.

I think most methods could just be defined for Real, so it might be feasible to remove many of the fallbacks, but we'll have to do some experiments. Basically, only methods that exploit specific floating-point properties should be restricted. An example could be a core algorithm that evaluates a series to some precision, where the number of terms in the series is chosen for 64-bit floating point; most of the methods in Distributions, though, work at a higher level than that.
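For instance (a made-up kernel, with the term count only roughly tuned, to show the kind of method that should stay Float64-only):

# Series for expm1(x) = exp(x) - 1; the number of terms is chosen for
# Float64 precision at small |x|, so the signature stays Float64-only
# while the public method above it accepts Real.
function _expm1_series(x::Float64)
    s = 0.0
    term = 1.0
    for k in 1:17
        term *= x / k
        s += term
    end
    return s
end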

@johnmyleswhite (Member, Author)

Ok. I'll try to write up the minimal API in a Gist and link to it here. We can adapt it over time (along with strategies for handling hard problems like the one you mention) as John's work continues to show how these issues play out for specific distributions.

@johnmyleswhite (Member, Author)

I made a start here: https://gist.github.com/johnmyleswhite/c2ec16dd57ebb43799ed5e141c6a1a93

The hardest thing for me is deciding on the return types for these functions. For the moment I've used promote_type to express the idea that you get the more precise of the distribution's parameter type and the input value's type.
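Schematically, the convention looks like this (a sketch of the gist's idea using Normal as the example, not a drop-in method definition):

function pdf(d::Normal{T}, x::S) where {T<:Real, S<:Real}
    R = promote_type(T, S)              # result type: the wider of T and S
    z = (R(x) - R(d.μ)) / R(d.σ)
    return exp(-z * z / 2) / (R(d.σ) * sqrt(2 * R(π)))
end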

@jmxpearson (Contributor) commented Aug 16, 2016

Just to echo Andreas's remarks:

  • We've tried to be very careful about type stability in most of the API. The last remaining exceptions are the MLE routines, which are on my to-do list. That's not to say there aren't lots of tricky edge cases lurking, though. I suspect a number of them will only emerge once people start autodiffing and find that they run into errors.
  • It still makes sense to me to restrict pdf(d::DiscreteUnivariateDistribution, x::Integer) and the equivalent when x is an array of integers. However, I think we all agree that it would be better for the signature in continuous cases to be pdf(d::ContinuousUnivariateDistribution, x::Real).

I'll also add some comments to the gist.

Oh, and loosening the signature for pdf, etc. will really help with Truncated, as I mentioned above, since the same method ambiguity problems arise when one tries to define pdf(d::Truncated, x::Real).

@taqtiqa-mark

Thanks for taking the time to collate those ideas @johnmyleswhite.
Unless Distributions.jl constrains its scope to some (arbitrary?) definition of popular/common distributions, I wonder whether the minimal requirements, as they stand now, aren't too heavy.
They seem to mix three distinct, yet interrelated, domains (arguably four or five if you count moments/cumulants):

  1. distribution functions
  2. statistics and
  3. estimation

There seem to be common (natural?) cases where you just want to compile code related to one of those domains:

  1. simulation (functions)
  2. data analysis (statistics)
  3. modeling (estimation)

Given that compile times will be a perennial issue (it'll never be fast enough), there is some user benefit to keeping the package smaller by keeping it specialized, rather than keeping it smaller by limiting it to some subset of distributions with the bells and whistles included.

It would also lower the barrier to contributions within each domain (though it would raise it a little for contributing across domains, with multiple packages/repos).

Of course, there would be the question of whether these three packages should be synchronized. I'd argue for loose coupling, and that the current package contents are an MVP. Distributions.jl could then become an umbrella package, so the split would be backwards compatible.

Bike shedding names:
DistributionFunctions.jl
DistributionStatistics.jl
DistributionEstimators.jl
