Minimal API that a new distribution must implement #525

Open

johnmyleswhite opened this issue Aug 13, 2016 · 9 comments

@johnmyleswhite (Member) commented Aug 13, 2016

While working on some new distributions for a separate package, I've found that I frequently hit ambiguity warnings when defining operations like pdf(d::MyNewAbstractDistribution, x::Real) because those operations are ambiguous with operations like pdf(d::Distribution, x::Int).
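The pattern is easy to reproduce in isolation; here is a toy sketch (hypothetical names standing in for the pdf methods above):

abstract type MyDistribution end
struct MyNormal <: MyDistribution end

f(d::MyNormal, x::Real) = 1        # analogue of the new package's method
f(d::MyDistribution, x::Int) = 2   # analogue of the Distributions fallback

# For f(MyNormal(), 1), neither method is more specific than the other:
# the first wins on the distribution argument, the second on x.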

In the past, I could resolve these issues by only implementing a basic core API that defined operations like pdf(d::MyNewConcreteDistribution, x::Float64) and trusting that the fallbacks would handle conversions for me. With the increase in abstraction we've achieved thanks to the work on parametric distributions, I've lost track of what the core minimal API is. Which functions must a new distribution implement?

If someone is familiar with what the minimal API is off the top of their head, I can write documentation for it. If we need to decide this, I'd be happy to be involved in figuring out what the minimal API ought to be.

@jmxpearson (Contributor)

I don't have an immediate answer, but this is also important for, e.g., parameterizing Truncated, which has exactly this kind of ambiguity problem.

I'm happy to help discuss and document, though as I said, I don't know the answer.

@johnmyleswhite (Member, Author) commented Aug 16, 2016

OK, I'm starting to wrap my head around what exists. Here's my thinking about the big high-level design principles we might aspire to.

  • A parametric distribution of type T is essentially a template for code that operates in the space of values of type T. Put another way, every function on something like Normal{T} should evaluate its code using only operations that preserve the type T whenever possible. This includes ensuring that we only work with constants that are of type T.

This is a pretty complicated restriction and may be intractable given our use of integer parameters for some distributions. But I think it would help to resolve some of the current issues in the code that worry me as we move forward with parametric types.

Consider calling the following methods on both Normal{Float32} and Normal{BigFloat}:

  • skewness(d::Normal) = 0.0

To me, this seems like it ought to be something like skewness(d::Normal{T}) = zero(T), so that you can make sure you don't accidentally introduce a Float64 value into a computation that you're trying to carry out in reduced precision. The other direction seems less problematic, because low-precision values will be promoted up as necessary, so your worst-case scenario is the risk of type instability rather than precision errors.
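(A quick REPL check of the failure mode:

julia> typeof(0.0 * one(Float32))
Float64

julia> typeof(zero(Float32) * one(Float32))
Float32

One stray Float64 literal silently widens the whole Float32 computation.)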

For another example, consider:

  • entropy(d::Normal) = 0.5 * (log2π + 1.0) + log(d.σ)

This seems to have the same problem of fixing precision back towards Float64.
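A fully generic version might look something like this (just a sketch; the point is to compute the constant in T so that even BigFloat keeps full precision):

entropy(d::Normal{T}) where {T<:Real} = (log(2 * T(π)) + 1) / 2 + log(d.σ)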

In contrast, a method definition like this one strikes me as being near perfect:

  • rand(d::Normal) = d.μ + d.σ * randn()

The only problem I can see here is the use of randn, which probably ought to use an RNG with more precision for something like BigFloat. But it also has the danger of pushing things upwards, so that rand on a Normal{Float16} produces a number of much higher precision than you might expect.
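A type-aware sketch (randn(T) covers Float16/Float32/Float64 but not BigFloat, hence the restriction):

rand(d::Normal{T}) where {T<:AbstractFloat} = d.μ + d.σ * randn(T)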

The other alternative is where I think we lived before: all methods are generally restricted to Int and Float64. I think it's awesome that we're pushing for a much higher level of abstraction, but I think we might want to resolve some of these design decisions before we figure out what the contracts would be for methods like pdf(d, x) in general.

@andreasnoack (Member) commented Aug 16, 2016

This is very much the content of @jmxpearson's PRs, and I think we are almost there. The rules are that you cannot use floating-point literals in any method, and that you should let the inputs determine the type of the result by putting them first in expressions: e.g. x/2, not 1/2*x, because 1/2 evaluates to the Float64 value 0.5. Random number generation is a problem because randn doesn't support anything but Float64 right now, but for the other functions it should be possible to write generic versions, although it can be a little challenging. However, I think that being able to use Distributions with automatic differentiation outweighs the cost of the extra work required to write the methods.
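To illustrate the literal rule with hypothetical helpers:

halve_bad(x)  = 1/2 * x   # 1/2 is the Float64 value 0.5, so a Float32 input comes back as Float64
halve_good(x) = x / 2     # dividing by an Int preserves the input's float type

halve_bad(1.0f0)   # -> Float64
halve_good(1.0f0)  # -> Float32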

@johnmyleswhite (Member, Author)

Good to know.

We might not support randn with a rigorous algorithm for these types, but it already works beyond Float64 in Julia 0.5:

julia> randn(Float32)
-1.8167182f0

julia> randn(Float16)
Float16(-1.386)

If this is the direction we're heading, should we then also move toward having new distributions aspire to define a method like pdf(d, x::Real)? They could potentially work with something like _pdf(d, x::Float64) behind the scenes, but exporting only methods at the maximum level of generality would remove our need for most fallbacks, which ought to remove most of the ambiguity warnings.
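As a concrete sketch of that pattern (MyUnit is a made-up uniform-on-[0,1] distribution, written in current syntax, purely for illustration):

using Distributions

struct MyUnit <: ContinuousUnivariateDistribution end

# Single public method at maximal generality; no fallbacks needed.
Distributions.pdf(d::MyUnit, x::Real) = _pdf(d, Float64(x))

# Internal Float64 kernel.
_pdf(d::MyUnit, x::Float64) = 0.0 <= x <= 1.0 ? 1.0 : 0.0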

Does that seem feasible as our eventual API? We'd have to write down all of the functions, but if they all have the form foo(d, args::Real...), it seems tractable to make progress on this quickly.

@andreasnoack (Member)

Thanks for correcting me. I don't know why I haven't noticed that change.

I think most methods could just be defined for Real, so it might be feasible to remove many of the fallbacks, but we'll have to do some experiments. Basically, only methods that exploit specific floating-point properties should be restricted. An example could be a core algorithm that evaluates a series to some precision, where the number of terms in the series is chosen for 64-bit floating point; most of the methods in Distributions, though, work at a higher level than that.
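For instance (a made-up kernel, with the term count only roughly tuned, to show the kind of method that should stay Float64-only):

# Series for expm1(x) = exp(x) - 1; the number of terms is chosen for
# Float64 precision at small |x|, so the signature stays Float64-only
# while the public method above it accepts Real.
function _expm1_series(x::Float64)
    s = 0.0
    term = 1.0
    for k in 1:17
        term *= x / k
        s += term
    end
    return s
end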

@johnmyleswhite (Member, Author)

Ok. I'll try to write up the minimal API in a Gist and link to it here. We can adapt it over time (along with strategies for handling hard problems like the one you mention) as John's work continues to show how these issues play out for specific distributions.

@johnmyleswhite (Member, Author)

I made a start here: https://gist.github.com/johnmyleswhite/c2ec16dd57ebb43799ed5e141c6a1a93

The hardest thing for me is deciding on the return types for these functions. For the moment I've used promote_type to express the idea that you get the more precise of the distribution's parameter type and the input value's type.
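Schematically, the convention looks like this (a sketch of the gist's idea using Normal as the example, not a drop-in method definition):

function pdf(d::Normal{T}, x::S) where {T<:Real, S<:Real}
    R = promote_type(T, S)              # result type: the wider of T and S
    z = (R(x) - R(d.μ)) / R(d.σ)
    return exp(-z * z / 2) / (R(d.σ) * sqrt(2 * R(π)))
end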

@jmxpearson (Contributor) commented Aug 16, 2016

Just to echo Andreas's remarks:

  • We've tried to be very careful about type stability in most of the API. The last remaining exceptions are the MLE routines, which are on my to-do list. That's not to say there aren't lots of tricky edge cases lurking, though. I suspect a number of them will only emerge once people start autodiffing and find that they run into errors.
  • It still makes sense to me to restrict pdf(d::DiscreteUnivariateDistribution, x::Integer) and the equivalent when x is an array of integers. However, I think we all agree that it would be better for the signature in continuous cases to be pdf(d::ContinuousUnivariateDistribution, x::Real).

I'll also add some comments to the gist.

Oh, and loosening the signature for pdf, etc. will really help with Truncated, as I mentioned above, since the same method ambiguity problems arise when one tries to define pdf(d::Truncated, x::Real).

@taqtiqa-mark

Thanks for taking the time to collate those ideas @johnmyleswhite.
Unless Distributions.jl constrains its scope to some (arbitrary?) definition of popular/common distributions, I wonder whether the minimal requirements, as they stand now, aren't too heavy.
They seem to mix three distinct, yet interrelated, domains (arguably four or five if you count moments/cumulants):

  1. distribution functions
  2. statistics and
  3. estimation

There seem to be common (natural?) cases where you just want to compile code related to one of those domains:

  1. simulation (functions)
  2. data analysis (statistics)
  3. modeling (estimation)

Given that compile times will be a perennial issue (it'll never be fast enough), there is some user benefit to keeping the package smaller by keeping it specialized, rather than keeping it smaller by limiting it to some subset of distributions with the bells and whistles included.

It would also lower the barrier to contributions within each domain (though it would raise it a little for contributing across domains, with multiple packages/repos).

Of course, there would be the question of whether these three packages should be synchronized. I'd argue for loose coupling, and that the current package contents are an MVP. Distributions.jl could then become an umbrella package, so the split would be backwards compatible.

Bike shedding names:
DistributionFunctions.jl
DistributionStatistics.jl
DistributionEstimators.jl
