Is there support for a StatsModelsAPI for method stubs only? #625

Closed
ablaom opened this issue Dec 29, 2020 · 13 comments

@ablaom

ablaom commented Dec 29, 2020

I wonder if there is support and interest for a StatsModelsAPI package that is just for method stubs, for extension by packages in the Stats/ML ecosystem, modelled on the DataAPI.jl package.

From the DataAPI readme:

This package provides a namespace for data-related generic function definitions to solve the optional dependency problem; packages wishing to share and/or extend functions can avoid depending directly on each other by moving the function definition to DataAPI.jl and each package taking a dependency on it. As such, it is paramount for DataAPI.jl to be as minimal as possible, defining only generic function stubs and very little else. PRs proposing external dependencies or involved definitions will not be accepted.
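To make the pattern concrete, here is a minimal sketch of how such a stubs-only package could work. Everything in it is hypothetical: the module name `StatsModelsAPISketch`, the two "packages", and their model types are invented for illustration and are not taken from DataAPI.jl or StatsBase. The modules are nested in one script only so that the example is self-contained; in practice each would be a separate package depending on the stubs package.

```julia
# Hypothetical stubs-only package: generic functions with no methods.
module StatsModelsAPISketch

"""
    fit(model, args...; kwargs...)

Generic function stub; packages add their own methods.
"""
function fit end

function predict end

end # module

# Hypothetical package A: depends only on the stubs package.
module PackageA
import ..StatsModelsAPISketch

struct MeanModel end
struct FittedMean
    mu::Float64
end

# Extend the shared stubs for types that PackageA owns (no type piracy).
StatsModelsAPISketch.fit(::MeanModel, y::AbstractVector{<:Real}) =
    FittedMean(sum(y) / length(y))
StatsModelsAPISketch.predict(m::FittedMean, n::Integer) = fill(m.mu, n)

end # module

# Hypothetical package B: knows nothing about PackageA.
module PackageB
import ..StatsModelsAPISketch

struct MaxModel end
struct FittedMax
    value::Float64
end

StatsModelsAPISketch.fit(::MaxModel, y::AbstractVector{<:Real}) =
    FittedMax(maximum(y))
StatsModelsAPISketch.predict(m::FittedMax, n::Integer) = fill(m.value, n)

end # module

# A user can load both packages and call one shared `fit` without a name clash.
a = StatsModelsAPISketch.fit(PackageA.MeanModel(), [1.0, 2.0, 3.0])
b = StatsModelsAPISketch.fit(PackageB.MaxModel(), [1.0, 2.0, 3.0])
```

Because both packages add methods to the same generic function rather than each defining their own `fit`, neither needs to depend on the other.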

This question was already raised here: JuliaData/DataAPI.jl#6

I learned of this proposal in a slack discussion. I am copying my original post from that thread here:

Wondering if there is any stomach for creating a light-weight method-stubs only package for stats and ml. For example, MLJBase used to extend fit and predict from StatsBase but the demand for a light-weight base package (MLJModelInterface) meant we no longer do so. I'm bumping up against this again as I try to separate out a metrics package from MLJBase because metrics share a bunch of traits with models (for example prediction_type - probabilistic or deterministic). The new StatisticalMeasures package ought not to depend on any of our machine learning model code, but without introducing a third package for the trait stubs I’m stuck. I could create one but it seems the more sensible thing would be to get buy-in from StatsBase, and other packages. Thoughts anyone? (I know there is MLBase, but this is more than method stubs).

I might be interested in creating a StatsModelsAPI package, if I could expect buy-in from StatsBase.

Supposing a number of key method stubs (no types) were extracted from https://github.com/JuliaStats/StatsBase.jl/blob/master/src/statmodels.jl and put into such a package, would the maintainers of StatsBase be prepared to add the new StatsModelsAPI as a dependency and extend the stubs? I could propose a list of methods.

Any other thoughts?

cc: @nalimilan @kleinschmidt @andreasnoack @DilumAluthge @lindahua @ararslan @oxinabox @johnmyleswhite

@DilumAluthge
Contributor

Sounds great to me.

@AriMKatz

AriMKatz commented Dec 29, 2020

@DhairyaLGandhi and @CarloLucibello

@ablaom
Author

ablaom commented Dec 29, 2020

@ven-k

@kleinschmidt
Member

I think the main question (as I think was discussed on slack) is how opinionated these functions are meant to be about the actual API. At one end of the spectrum, such a package could simply provide the function definitions so people don't get conflicts when they try to use different packages which provide some kind of fit. At the other end, the package could provide specific guidance about the order of arguments, whether data is row- or column-oriented, etc. We don't necessarily need to decide this right away since we can start with a minimal package and make it more opinionated in later releases but it's at least worth thinking a bit about how much the various stakeholders are invested in the different uses.

@nalimilan
Member

@ablaom Do you think MLJ and StatsModels could agree on a relatively precise definition of what fit and predict should do?

@ablaom
Author

ablaom commented Jan 12, 2021

Well, how precise do we really need to be to avoid problems? I'm asking because I'm not sure I appreciate all the possibilities for abuse. What's wrong with

  • fit(::CustomType, args...; kwargs...) returning anything, not mutating arguments
  • predict(::CustomType, args...; kwargs...) returning anything, not mutating arguments
  • ditto transform, inverse_transform (or reconstruct or whatever); we also have predict_mode, predict_median, predict_mean for models making probabilistic predictions; and predict_joint. But I don't think we need to inflict these on non-MLJ packages.
  • fit!(::CustomType, args...; kwargs...) allowed to mutate first argument only

?
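As a rough illustration of the contract proposed in the list above, the sketch below implements it for one model type. The type `RunningMean` is hypothetical, and the stubs are declared inline only to keep the example self-contained; under the proposal they would come from the shared API package.

```julia
# Stubs declared inline for this sketch; in the proposal they would be
# imported from the shared stubs-only package.
function fit end
function fit! end
function predict end

# Hypothetical model type owned by the implementing package.
mutable struct RunningMean
    mu::Float64
    n::Int
end
RunningMean() = RunningMean(0.0, 0)

# fit: returns a new fitted object and mutates none of its arguments.
function fit(model::RunningMean, y::AbstractVector{<:Real})
    n = model.n + length(y)
    mu = (model.mu * model.n + sum(y)) / n
    return RunningMean(mu, n)
end

# fit!: allowed to mutate its first argument only.
function fit!(model::RunningMean, y::AbstractVector{<:Real})
    updated = fit(model, y)
    model.mu, model.n = updated.mu, updated.n
    return model
end

# predict: returns a value and mutates none of its arguments.
predict(model::RunningMean, n::Integer) = fill(model.mu, n)

m = RunningMean()
fitted = fit(m, [1.0, 2.0, 3.0])  # `m` is unchanged here
fit!(m, [1.0, 3.0])               # `m` is updated in place
```

Note that nothing in the contract constrains the return type of `fit` or `predict`; here both are choices made by the hypothetical implementing package.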

@nalimilan
Member

Strictly speaking, to "avoid problems" we just need to forbid type piracy so that packages don't conflict. But usually in Julia methods overriding a common function are expected to have some degree of consistency -- otherwise we keep them as separate functions. I'd say your proposal meets these two criteria.

In this particular case, as @kleinschmidt noted, it would be interesting to see whether we can agree on whether rows represent observations (like in Tables and StatsModels) or variables (like in MultivariateStats). Do you have requirements about it?

Also, regarding predict, could we be more specific about the return type? Should it be either a vector, a matrix or a Tables.jl object?

Finally, transform is a difficult case as it's used in DataFrames for a quite different operation. That's one of the reasons why StatsBase keeps it unexported. Technically there's no dispatch conflict though.

@ablaom
Author

ablaom commented Jan 14, 2021

@nalimilan Thanks for that.

> But usually in Julia methods overriding a common function are expected to have some degree of consistency -- otherwise we keep them as separate functions.

In that case, my proposal is probably too ambitious.

> In this particular case, as @kleinschmidt noted, it would be interesting to see whether we can agree on whether rows represent observations (like in Tables and StatsModels) or variables (like in MultivariateStats). Do you have requirements about it?

Currently, for our model predict method, the observations should be "rows", if the training data is in the form of a table, matrix or vector, or some slurped tuple of these. However, we are planning on allowing model-specific representations of the data, with model-specific methods to do observation subsampling. So then, the data could be essentially anything, and "rows" is whatever you define it to be...

What I am describing applies to our predict(::MLJModelInterface.Model, ...) methods (what a 3rd party pkg implements to buy into the MLJ API) but we also have predict(::MLJBase.Machine, ...) methods, which are for user interaction, and whose data arguments (the args in my comment above) are more restricted.

> Also, regarding predict, could we be more specific about the return type? Should it be either a vector, a matrix or a Tables.jl object?

In MLJ, predict typically returns a table or (abstract) vector, but again, the container type is not actually fixed by the API (although the implementation of the API does need to declare what this type would look like).

@ablaom
Author

ablaom commented Jan 17, 2021

On further reflection, I think I will instead put my efforts into a StatisticalTraits.jl package that focuses on traits shared by models and metrics, as we developed for use in MLJ. It seems that the fit/predict stories are too divergent to meaningfully integrate them at this point, and the traits are something we need to sort out more urgently.

Thanks again for the feedback!

@nalimilan
Member

What kind of functions would you define in StatisticalTraits? The name sounds quite broad so if it reflects its contents it would make sense to have some coordination with JuliaStats packages.

@ablaom
Author

ablaom commented Jan 22, 2021

@nalimilan
Member

OK, I don't see anything that would be common with JuliaStats packages currently.

@nalimilan
Member

FWIW, I've created a StatsAPI package which currently only contains pairwise: https://github.com/JuliaStats/StatsAPI.jl
