This repository has been archived by the owner on May 23, 2022. It is now read-only.

What needs to be defined here #1

Closed
Evizero opened this issue Jun 25, 2016 · 14 comments

Comments

@Evizero
Member

Evizero commented Jun 25, 2016

I transferred only as much code to this package as I think is absolutely needed.

Let the discussion on what is missing / should be changed / should be added begin.

To start off: I chose to define only the base class Loss here in LearnBase, and will define ModelLoss and ParameterLoss in MLModels instead. The motivation is that, as it turns out, anyone programming something that falls into the ModelLoss / ParameterLoss framework probably needs to import MLModels anyway. For example, there are a lot of property functions there, such as isnemitski, that are useful or in some cases even needed to implement an algorithm properly (at least in some cases with SVMs).
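To make the proposed split concrete, here is a minimal sketch; everything beyond the abstract Loss is an assumption based on the names in this comment, not actual package code:

```julia
# Sketch only: LearnBase would own the single abstract root.
abstract type Loss end

# A downstream package (MLModels) would add the two subtrees:
abstract type ModelLoss <: Loss end
abstract type ParameterLoss <: Loss end

# Property functions such as isnemitski get a conservative fallback,
# and concrete losses opt in explicitly:
isnemitski(::Loss) = false

struct ToyHingeLoss <: ModelLoss end   # hypothetical concrete loss
isnemitski(::ToyHingeLoss) = true
```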

@tbreloff
Member

Thanks @Evizero!

I chose to only define the baseclass Loss here in LearnBase

I'd really prefer to keep all the abstracts in LearnBase so that someone can define a model loss without needing MLModels... It's a funny distinction, I agree. It might be best to think of MLModels purely as implementations of abstractions that exist in LearnBase. Thoughts?

@Evizero
Member Author

Evizero commented Jun 25, 2016

I am agnostic about this, but it would result in a few more special function definitions in LearnBase:

isdifferentiable,
istwicedifferentiable,
isconvex,
isstronglyconvex,
isnemitski,
isunivfishercons,
isfishercons,
islipschitzcont,
islocallylipschitzcont,
islipschitzcont_deriv,
isclipable,
ismarginbased,
isclasscalibrated,
isdistancebased,
issymmetric
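Defining these in LearnBase would amount to little more than conservative fallbacks on the abstract type, which concrete losses override downstream. A rough sketch of the pattern (trait names from the list above; the concrete loss is hypothetical):

```julia
abstract type Loss end

# Placeholder definitions with conservative fallbacks in LearnBase:
isdifferentiable(::Loss) = false
isconvex(::Loss) = false
islipschitzcont(::Loss) = false

# A concrete implementation elsewhere (e.g. MLModels) opts in:
struct ToyL2Loss <: Loss end          # hypothetical
isdifferentiable(::ToyL2Loss) = true
isconvex(::ToyL2Loss) = true
```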

@tbreloff
Member

Where are those methods used, though? Can we put the abstract types and core methods in LearnBase, and define these other methods closer to where they are used?

I'm ok with adding a bunch of methods like this to LearnBase, by the way, and maybe that's the best option. The fallback methods would be defined for the abstract types, but all the concrete implementations would be in MLModels (and other future packages).

@Evizero
Member Author

Evizero commented Jun 25, 2016

Well, I guess something like isdifferentiable or isconvex might actually be useful for a bunch of things down the road. It is not like they take up a lot of space. I pushed the change.

@Evizero
Member Author

Evizero commented Jun 25, 2016

Where are those methods used, though?

At the same place where value or deriv would also be used. For example an optimization algorithm for SVMs that should throw a warning if you use it with losses that the theory doesn't support.
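For illustration, a solver could guard on such a trait like this. This is a sketch only; the solver and loss types are made up:

```julia
abstract type Loss end
isconvex(::Loss) = false               # conservative fallback

struct ToyNonconvexLoss <: Loss end    # hypothetical loss the theory doesn't cover

function solve(loss::Loss)
    if !isconvex(loss)
        @warn "This solver assumes a convex loss; $(typeof(loss)) is not known to be convex."
    end
    # ... the actual optimization would go here ...
    return nothing
end
```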

@tbreloff
Member

Gotcha... then I think they all belong as placeholder defs in LearnBase.

@ahwillia
Contributor

What do you all think about a series of functions like:

nobs(::Model)      # number of observations
nfeatures(::Model) # number of inputs
nparams(::Model)   # number of trainable parameters

More appropriate for MLDataUtils or MLModels? I'm agnostic about where these end up, but am curious what convention we want to use.
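For a concrete (entirely hypothetical) model type, the three accessors could look like this:

```julia
# Illustrative only: a toy linear model with an n × p design matrix.
struct ToyLinearModel
    X::Matrix{Float64}   # n × p design matrix
    w::Vector{Float64}   # p trainable coefficients
end

nobs(m::ToyLinearModel)      = size(m.X, 1)   # number of observations
nfeatures(m::ToyLinearModel) = size(m.X, 2)   # number of inputs
nparams(m::ToyLinearModel)   = length(m.w)    # number of trainable parameters
```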

@Evizero
Member Author

Evizero commented Jun 25, 2016

Good point, actually; I forgot about those. I think LearnBase should define such functions (or at least declare their existence). (Do we want to avoid a nobs clash with StatsBase?) Another one would be getobs, which basically allows for all the MLDataUtils magic.
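One plausible shape for getobs, sketching the idea rather than any actual MLDataUtils implementation: index into the observation dimension of a container, with observations stored as columns.

```julia
# Hypothetical sketch: getobs selects observations by index.
getobs(X::AbstractMatrix, idx) = X[:, idx]   # observations are columns
getobs(y::AbstractVector, idx) = y[idx]
```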

@ahwillia
Contributor

Clashes with StatsBase are going to be thorny. Maybe we should get @andreasnoack and others to weigh in here.

Is there a reason we can't have LearnBase import some of these functions from StatsBase? Do we want to be completely independent?

@andreasnoack

It's tricky. In a perfect world, we would stop using different names for the same concepts in statistics and machine learning. However, it will be time consuming to sort out which concepts are the same. Furthermore, which of the terminologies should we choose? I'd vote for statistics terminology, since it is the older of the two and happens to be the one I'm familiar with, but machine learning is more popular right now, so more contributors will probably be familiar with machine learning terminology.

Maybe the way to ask this is: how annoying do you think it would be to depend on StatsBase.jl and something like https://github.com/johnmyleswhite/Loss.jl? Probably quite a bit, so maybe it's not worth the trouble of trying to speak the same language even though we use the same methods.

@Evizero
Member Author

Evizero commented Jun 26, 2016

We have had discussions like this many times now (and they are still worth having). I still don't have a good personal answer. Until now I have tried to follow the mantra of StatsBase with pretty decent success, and I am also OK with the statistics naming conventions.

However, here is what I have learned while coding: some conventions will just straight-up not work for us. The StatisticalModel type, while great for what it currently does, makes some ML stuff cumbersome. For example, one convention is to call fit(Type{MyModelType}, ...), where the result would then be of type MyModelType. In the case of SVMs I did not find a useful way to keep that convention. Similarly, I was unable to make any use of the clever tricks in place to allow for transparent use of data frames as input.
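For reference, the StatsBase-style convention being discussed, in its simplest form: fit takes the model type and returns an instance of that type. The model type here is made up purely for illustration:

```julia
# A toy model that just stores the mean of the data.
struct ToyMeanModel
    mu::Float64
end

# fit(::Type{T}, data...) returns a T — the convention that proved hard
# to keep for SVMs:
fit(::Type{ToyMeanModel}, y::AbstractVector) = ToyMeanModel(sum(y) / length(y))
```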

My point being that, regardless of whether we choose to depend on StatsBase or not, we will surely have to break conventions established there (i.e. signature conventions for functions, etc.). The question that remains is whether that would do more harm than good. A fresh start might give us more flexibility and generate less confusion all around. That said, I am also not strongly opposed to using StatsBase.

Appendix:
To paint a picture of the complexity of a fit signature: SVMs are especially tricky, because one has different kinds of specifications and different kinds of results that depend on the hyperparameters. For example, I can specify an SVM using the Nu formalism or the C formalism, which are very different. In the latter case I specify the loss, penalty, kernel, and the penalty parameter C. To add to those choices, I could be interested in either the dual solution or the primal solution (both could be the output of both formalisms). They all have a very different structure and deserve their respective types. That in turn implies that I can't preallocate the variables I want to learn in the specifying structure passed to fit.
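One way to express that combinatorial structure in types, with purely hypothetical names, just to illustrate why a single fit signature with preallocated outputs is awkward:

```julia
# Two distinct specifications of the "same" SVM:
struct CSpec                 # C formalism: loss, penalty, kernel + parameter C
    C::Float64
end
struct NuSpec                # Nu formalism
    nu::Float64
end

# Two distinct kinds of results, with different structure:
struct PrimalSolution
    w::Vector{Float64}
end
struct DualSolution
    alpha::Vector{Float64}
end

# The desired output type becomes part of the call, so the solution cannot
# be preallocated inside the specification object (dummy bodies only):
solve(spec::CSpec,  ::Type{PrimalSolution}) = PrimalSolution(zeros(2))
solve(spec::NuSpec, ::Type{DualSolution})   = DualSolution(zeros(3))
```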

@tbreloff
Member

@andreasnoack I wish you had more time to join us when we were discussing at the hackathon! Thanks again for organizing... it was a really great event.

I think, now that we've settled on abstractions that seem to cover many different perspectives and disciplines, it's valuable to do another deep dive into StatsBase and maybe Loss? We need to really get into the weeds and pinpoint which abstractions are appropriate for everyone involved, and specific cases which break the abstractions. My gut feeling is that, as @Evizero just said, the StatsBase abstractions were limiting for use-cases outside of statistics, and that the LearnBase abstractions may be more general and encompass the needs of the JuliaStats community.

If my gut feeling is right (i.e. that it's not a good idea for LearnBase to depend on StatsBase), then I propose keeping LearnBase and StatsBase independent in the short term, and adding a package to "bind together" the abstractions, converting one abstraction into another where feasible. That might seem like more work, but it will allow us to continue to use tools made for either side.

Thoughts?

@ahwillia
Contributor

ahwillia commented Jun 27, 2016

My philosophy is to push forward separately and then revisit this once the packages below LearnBase are more fleshed out. It is too easy to over-specify at this stage. Ultimately, I would favor deferring to StatsBase for functionality that we feel overlaps. We could just cherry-pick and import the types/functions we want, right?

Edit: Actually, could we start by having MLDataUtils depend on StatsBase and import all the nice functions like zscore, mean_and_var, etc.? I don't see why importing these is a bad idea.

@Evizero
Member Author

Evizero commented Jun 27, 2016

I guess we have a consensus then. Let us, for now, move forward with LearnBase independently and explore what we really need, use case by use case. If we go down that path, I vote for avoiding the names chosen by StatsBase, to prevent confusion.

it's valuable to do another deep dive into StatsBase and maybe Loss

I explored all loss-related packages I could find when I started on KSVM / LearnBase (now MLModels), none of which satisfied my particular needs/goals. I think (and hope) the loss-related code in MLModels should put us in pretty good shape as a starting point, in terms of both flexibility and performance.

Actually, could we start by having MLDataUtils depend on StatsBase and import all the nice functions like zscore, mean_and_var, etc.? I don't see why importing these is a bad idea.

MLDataUtils does currently depend on StatsBase, and it looks like it will remain that way for a while at least.
