Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Disintegration of MLJBase (discussion and tracking issue) #416

Open
4 of 5 tasks
ablaom opened this issue Aug 27, 2020 · 12 comments
Open
4 of 5 tasks

Disintegration of MLJBase (discussion and tracking issue) #416

ablaom opened this issue Aug 27, 2020 · 12 comments

Comments

@ablaom
Copy link
Member

ablaom commented Aug 27, 2020

It seems there is a case to be made for further modularisation of MLJBase. These all sound reasonable to me but happy to hear other ideas:

The benefits of the 3rd has come in Soss.jl integration but also for models wanting to specify UnivariateFinite distribution priors as hyper-parameters (originally BayesianLDA was this way). The third because it's so big.

@ablaom
Copy link
Member Author

ablaom commented Aug 28, 2020

pre-compiling MLJBase on my newish macbook pro just too 37seconds 😢

@DilumAluthge
Copy link
Member

For easy discoverability, maybe a consistent naming scheme as such:

  1. MLJComposition.jl
  2. MLJDistributions.jl - would include UnivariateFinite; other distributions could be added later
  3. MLJMeasures.jl - would include the statistical measures
  4. MLJSerialization.jl

Of course, these packages (e.g. MLJDistributions.jl) can definitely be used outside of MLJ.

@DilumAluthge
Copy link
Member

Anyway, this sounds like a great idea.

@DilumAluthge
Copy link
Member

DilumAluthge commented Aug 28, 2020

Of course, the hard part is keeping all the dependencies, compat entries, etc. in sync. Tools like CompaHelper can help. There are probably additional tools that we can develop to make development and maintenance easier.

@ablaom
Copy link
Member Author

ablaom commented Aug 28, 2020

Thanks for the feedback!

Others have criticised the adoption of the "MLJ" prefix for packages that are really meant for more general consumption. (MLJScientificTypes is a case in point - but we already had ScientificTypes for the light version.) The criticism is that developers see "MLJ" and think that means the package is only for use with MLJ packages. I think there's validity in this.

Of course, the hard part is keeping all the dependencies, compat entries, etc. in sync. Tools like CompaHelper can help. There are probably additional tools that we can develop to make development and maintenance easier.

Yes, these things help and I am probably not familiar enough with those tools, so appreciate your continued input here.

@DilumAluthge
Copy link
Member

Others have criticised the adoption of the "MLJ" prefix for packages that are really meant for more general consumption. (MLJScientificTypes is a case in point - but we already had ScientificTypes for the light version.) The criticism is that developers see "MLJ" and think that means the package is only for use with MLJ packages. I think there's validity in this.

Yeah that's definitely a concern.

I guess we should identify which subparts of MLJBase are useful in general, and which parts are really only useful in the MLJ context.

As you point out, the UnivariateFinite and statistical measures stuff are probably useful for general consumption.

@DilumAluthge
Copy link
Member

DilumAluthge commented Aug 28, 2020

For the measures, StatisticalMeasures.jl is a good name.

For the UnivariateFinite distribution, it might be good to have a slightly more general name. Maybe CategoricalDistributions.jl or MachineLearningDistributions.jl? That way, if we have more distributions to add in the future, we can put them in the same package.

The advantage of something like MachineLearningDistributions.jl is that we can later add e.g. non-categorical distributions if the need arises.

@OkonSamuel
Copy link
Member

OkonSamuel commented Aug 28, 2020

Am happy were finally discussing about this. This may reduce package testing time for 3rd party developers.

@OkonSamuel
Copy link
Member

MachineLearningDistributions.jl

Why don't we just call it MLDistributions for short

@ven-k ven-k mentioned this issue Sep 4, 2020
@DilumAluthge
Copy link
Member

Same comment that I posted here.

The MLJ.jl README has a very nice visualization of the dependency graph, stored in the repo at https://github.com/alan-turing-institute/MLJ.jl/blob/master/material/MLJ_stack.svg

As we break up MLJModels and MLJBase, we should definitely remember to keep this image up-to-date.

@giordano
Copy link
Member

An issue for MLJSerialization is that
https://github.com/alan-turing-institute/MLJBase.jl/blob/757b5ee6d673eb212d1710409f7087764bed6e5e/src/machines.jl#L743-L762
requires the Machine type, which at the moment is defined in MLJBase. What to do with that? Move the definition to another package?

One weird thing is that unless the Machine is defined in MLJSerialization, this method will be type-piracy, as both the function and the types of all arguments will be defined in different packages. I think it's kinda OK, since it's all coordinated, but something to keep in mind.

@ablaom
Copy link
Member Author

ablaom commented Mar 22, 2021

@giordano I'm not sure I understand. MLJSerialization is not intended to be independent of MLJBase, so it can just import Machine, no?

We could avoid type piracy by not importing MMI.save. MLJSerialization can define it's own save. I'm pretty sure save is not exported by any of the MLJ packages. In MLJ.jl, we currently have import MLJ.save (not re-exported) and that is going to break. For now we replace it with import MLJSerialization.save so that the new changes do not necessitate a breaking release of MLJ (only compat updates).

What do you think?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants