Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Migrate UnivariateFinite (for categorical distributions) out to new package #504

Closed
ablaom opened this issue Jan 20, 2021 · 16 comments
Closed
Projects

Comments

@ablaom
Copy link
Member

ablaom commented Jan 20, 2021

In line with #416, I propose we move UnivariateFinite out to a new package called CategoricalDistributions.jl.

If this were okay with the current host of MLJBase.jl (I need to check this @vollmersj) it might make sense for this package to live at JuliaData (host of CategoricalArrays.jl) or JuliaStats (host of Distrtibutions.jl). I wonder what curators of those organisations think of that idea?

@nalimilan @bkamins @andreasnoack @devmotion @matbesancon

Recall that UnivariateFinite consists of the following:

  • A composite type UnivariateFinite{S,V,R,P<:Real} for encoding the probability distribution associated with a finite labelled set of points, as opposed to the distribution Categorical from Distributions.jl, whose sample space is always a collection of integers. The sample space of a UnivariateFinite instance is a CategoricalPool object from CategoricalArrays.jl.

  • Implementation of relevant parts of the Distributions.jl API, including rand, pdf, logpdf support, params, mode, and fit (which fits to a CatgoricalVector).

  • A wrapper UnivariateFiniteArray for arrays of such objects (sharing a common sample space / pool). This type, implementing the AbstractArray API, is optimised for fast indexing, and for broadcasting of pdf, and logpdf (which turned out to be essential in our applications to machine learning).

  • A fairly elaborate constructor for UnivariateFiniteArray objects from matrices of probabilities. See this docstring

Technical note. I'm hoping this migration should be fairly painless but there is one issue to be aware of: Currently the UnivariateFinite constructor stub lives in MLJModelInterface but the type and all real functionality lives in MLJBase (which depends on MLJModelInterface). The reason for this was to keep MLJModelInterface (the sole dependency of third party packages inplementing MLJ's model API) super lightweight. So this needs sorting out.

@ablaom
Copy link
Member Author

ablaom commented Jan 20, 2021

cc @OkonSamuel

@DilumAluthge
Copy link
Member

@cscherrer might also be interested in this.

@devmotion
Copy link
Contributor

devmotion commented Jan 20, 2021

as opposed to the distribution Categorical from Distributions.jl, whose sample space is always a collection of integers.

While this is correct for Categorical, the support of the more general DiscreteNonParametric, of which Categorical is a special case, can be an arbitrary finite set of Reals.

@cscherrer
Copy link
Contributor

Thanks @DilumAluthge . I guess it will get a little complicated since we're moving away from Distributions.jl, but I think we'll still be able to import this and wrap it as a Measure

@ablaom
Copy link
Member Author

ablaom commented Jan 21, 2021

I guess it will get a little complicated since we're moving away from Distributions.jl,

Yes, Distributions is a hefty dependency. The new package could probably avoid Distributions as dependency and just implement the "functional" side of the Distributions API. Ie, drop the sub typing UnivariateFinite{...} <: Distribution or whatever. Unless Distributions wants to move types out to a new Base package - don't see that happening to be honest.

@OkonSamuel
Copy link
Member

I guess it would be easier if Distributions.jl has a separate package containing basic functions which we can extend

@devmotion
Copy link
Contributor

This was suggested and discussed in JuliaStats/Distributions.jl#1139 but it seems it was not supported by the maintainers.

@cscherrer
Copy link
Contributor

cscherrer commented Jan 21, 2021

Yes, Distributions is a hefty dependency.

Yes, but it's so widely used I don't see much getting around it. We actually have it as a dependency for MeasureTheory anyway, and use it for some of the sampling routines.

I'd love if this can be made nicely compatible with MeasureTheory, if it's not already. Here are some guidelines for that:

  1. Structs themselves should not constrain types beyond what's necessary. For example, <: Real constraints impose a boundary for symbolic analysis that takes special care to get around. (edit by @ablaom: now fixed in MLJBase)
  2. Any constructor with bounds checking should call to another function without it. The keyword-argument approach in Distributions.jl is awkward because it changes from one distribution to the next. (edit by @blaom: mostly fixed)
  3. I'd love if there's a way to call this with different parameterizations. For example, log-weights or as input to the UnitVector transform from TransformVariables. Or the one from Bijectors, though honestly that one still confuses me since it's not bijective. We have a nice way to write parameterized measures in MeasureTheory.
  4. If possible, any normalization constants should be computed separately (edit by @blaom: UnivariateFinite is now a general finite measure (unnormalised) so these don't appear, except in constructor when augment=true)

These are just off the top of my head, and partly for my notes as well.

@nalimilan
Copy link

CategoricalDistributions.jl really sounds like a narrow scope for a package. And IIUC it would define a lot of the Distributions.jl API (like pdf and such), and Distributions would have to depend on it (otherwise names would conflict)? At that point, we have to admit it would be very close to what a potential DistributionsBase package would be.

Why can't UnivariateFinite live in Distributions? Is it because you don't want to depend on Rmath?

@matbesancon
Copy link

I think this issue should be taken a bit broader than categorical distributions only, but my overall observation is that Distributions has become very hard to maintain because it is a central utility with few and rarely available maintainers. Most people just drop by for an issue or PR but don't want (understandably) to get involved in its maintenance, and many former maintainers are not involved anymore.

I agree with @nalimilan that splitting a package just for categorical distributions would be too fragmenting, but splitting a DistributionsBase, and possibly using this opportunity to clean the house, should be considered

@ablaom
Copy link
Member Author

ablaom commented Jan 22, 2021

Why can't UnivariateFinite live in Distributions?

There were conversations but resistance. They said "you can just have Distributions as a dependency and overload..." I think the main objection was to include anything with a sample space that was not Array{Float64,N} or Array{Int,N} (curiously, probabilities have generic type, but the sample space types implemented are not generic).

@matbesancon
Copy link

include anything with a sample space that was not Array{Float64,N} or Array{Int,N}

(or Float64 / Int themselves). But overall I agree that this has been one of the biggest blocking points in Distributions, because some parts of the code base were non-trivial to adapt to something more generic. That being said, the type parameters of distributions can be extended (VariateForm and ValueSupport)

@ablaom
Copy link
Member Author

ablaom commented Jan 22, 2021

but my overall observation is that Distributions has become very hard to maintain because it is a central utility with few and rarely available maintainers. Most people just drop by for an issue or PR but don't want (understandably) to get involved in its maintenance, and many former maintainers are not involved anymore.

This is also my impression and, if true, a real issue for Julia, in my opinion. The concept of a distribution extends far beyond what is in Distributions.jl (objects sampled using Monte-Carlo-Markov-Chain, pdfs on manifolds, priors for tree-like parameter spaces, etc). It may not make sense to add to what is there, but without a base API package it's very messy to extend.

@ablaom
Copy link
Member Author

ablaom commented Jan 22, 2021

That being said, the type parameters of distributions can be extended (VariateForm and ValueSupport)

Yes, that is what we currently do in MLJ. But it means having Distributions.jl as a dependency in places we'd rather not. We had to perform some acrobatics to create our light-weight MLJModelInterface, partly for this reason.

@ablaom
Copy link
Member Author

ablaom commented May 28, 2021

Thanks to contributors to this discussion.

Update: UnivariateFinite is now generalized, in the sense that is no longer assumed or enforced that probabilities add to one. Ie, one can use it to represent arbitrary measures over a "labelled" finite set. This would appear to make it more useful re the MeasureTheory project (see above #504 (comment))

I think I should like to proceed with moving this functionality to new package, dependent on Distributions for pdf, logpdf, support, mode , and params with the great hope that sometime in the future these methods will be moved to StatsBase or a new DistributionsBase.

Some technical details for myself or whoever takes this on. Currently the super-lightweight MLJModelInterface defines a method only, UnivariateFinite. This method is needed for 3rd party packages implementing the MLJ api for classifiers, for example. However, the method has only dummy behaviour, unless MLJBase is also in scope. MLJBase adds the true functionality, ie, makes it a bone-fide constructor, now that the type UnivariateConstructor actually exists. This is handled with something resembling trait-dispatch where the trait is "is MLJBase loaded?". These gymnastics were necessary to avoid having Distributions and CategoricalArrays as a dependencies of MLJModelInterface.jl. See here and here for details.

Moving forward, no changes are made to MLJModelInterface. The new package will not depend on MLJ in any way, and in particular does not extend the constructor from MLJModelInterface. Rather:

  • The new package, FiniteMeasures.jl, say, defines both a new type FiniteMeasure (and FiniteMeasureArray) and all constructors ab initio, using the code from MLJBase with renaming UnivariateFinite -> FiniteMeasure. The only change is we are not extending some other constructor, just making a new one.
  • MLJBase, which imports FiniteMeasures.jl, continues to extend the UnivariateFinite constructor from MLJModelInterface; however the added functionality is just to call the FiniteMeasures.FiniteMeasure constructor, returning a FiniteMeasure object. MLJBase will now re-export both MLJModelInterface.UnivariateFinite and FiniteMeasures.FiniteMeasure. A deprecation warning is added to the "MLJBase" version of the method (the version dispatching on FullInterface) as we intend MLJ users to move to FiniteMeasure, but deprecation itself means only that we stop exporting UnivariateFinite; we want to keep UnivariateFinite in MLJModelInterface, as this would be too painful to change now, in my opinion.

In principle the renaming is not necessary, but it simplifies the explanation and makes sense given the recent generalization. However, I'm open to suggestions re the naming/renaming question.

@ablaom
Copy link
Member Author

ablaom commented Aug 13, 2021

Another reason to do this: https://github.com/JuliaAI/ScientificTypes.jl/issues/142

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Development

No branches or pull requests

7 participants