
Recent proposal for design of package interface #16

Closed · ablaom opened this issue Nov 9, 2018 · 22 comments
Labels
design discussion Discussing design issues

Comments

ablaom (Member) commented Nov 9, 2018

Please provide any new feedback on the proposed glue-code specification below. @fkiraly has posted some comments here. It would be helpful also to have reactions to the two bold items below.

I will probably move the “update” instructions for the fit2 method to model hyperparameters, leaving keyword arguments for package-specific features (there are not many use cases). It will be simplified, becoming an argument-mutating function without data arguments. (If the data really needs to be revisited, a reference to it can be passed via the cache.) The document will explain the use cases for this better.

I will require all Model field types to be concrete.

Immutable models. To improve performance, @tlienart has
recommended making models immutable. Mutable models are more
convenient because they avoid the need to implement a copy function,
and because you can make a function (e.g., loss) a hyperparameter (you
don't need to copy it). The first annoyance can be dealt with (mostly)
with a macro. To deal with the second, you replace a function with a
concrete type ("reference") and use type dispatch within fit to get
the function you actually want. Or something like that. In particular,
you need to know ahead of time what functions you might want to
implement. For unity, we might want to prescribe this part of the
abstraction (for common loss functions, optimisers, metrics, etc.)
ourselves (or borrow from an existing library).
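
For illustration only (LossRef, lossfun, and MyRegressor are hypothetical names, not part of any proposed spec), the function-to-concrete-type replacement might look roughly like this:

abstract type LossRef end
struct L2Loss <: LossRef end
struct HingeLoss <: LossRef end

# map each reference type to the function it stands for
lossfun(::L2Loss)    = (ŷ, y) -> sum(abs2, ŷ .- y)
lossfun(::HingeLoss) = (ŷ, y) -> sum(max.(0, 1 .- y .* ŷ))

struct MyRegressor{L<:LossRef}   # immutable model; all field types concrete
    lambda::Float64
    loss::L
end

function fit(model::MyRegressor, X, y)
    loss = lossfun(model.loss)   # recover the actual function by dispatch
    # (real training elided; just evaluate the loss at the zero predictor)
    return loss(zeros(length(y)), y)
end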

When I wrote my Flux interface for Koala I found it very
convenient to use a function as a hyperparameter to generate the
desired architecture, essentially because a "model" in Flux is a
function. (I suppose one could (should?) encode the architecture
à la ONNX or similar.)

My vote is to keep models mutable, to make it more convenient for
package interface writers and because I'm guessing the performance
drawbacks are small. However, others may have a more informed opinion
than I do. For what it is worth, ScikitLearn.jl has mutable models.

What do others think about making models immutable?

Defaults for hyperparameter ranges. Is there a desire for
interfaces to prescribe a range (and scale type) for
hyperparameters, in addition to default values? (To address one
of @fkiraly's comments, default values and types of parameters
are already exposed to MLJ through the package interface's model definition.)

ablaom changed the title from "Recent proposal for and package interface design" to "Recent proposal for design of package interface" Nov 9, 2018
fkiraly (Collaborator) commented Nov 9, 2018 via email

ablaom (Member, Author) commented Nov 11, 2018

I have just realised that it may be important to be able to make copies of models (containers for hyperparameters). If I want a composite model to automatically suppress unnecessary retraining of a component model, then the question of whether or not a component needs retraining will depend on which hyperparameters of the composite have changed (i.e., which sub-hyperparameters have changed). However, to compare the current value of a model with the previous one, I need to keep a copy of the previous one (not just a reference, whose target will have changed!). The alternative is to not automate the suppression of retraining, but to allow "freezing" and "unfreezing" of component models to be controlled via additional hyperparameters of the composite model.
One might view this as a dangerous proposition, for one would then need to know more than just the hyperparameters to know what training has actually been carried out; one would need to know the complete sequence of hyperparameter settings.
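
For illustration only (a rough sketch, not from the spec), the automated suppression could look something like the following; note that it relies on a meaningful == (or a deep copy plus field-wise comparison) for the component model, which is exactly the point at issue:

mutable struct Composite
    component            # component model (hyperparameter container)
    previous_component   # deep copy taken at the last call to fit!
end

function fit!(c::Composite, data)
    if c.previous_component === nothing || c.component != c.previous_component
        # hyperparameters changed (or first fit): retrain the component here
    end
    c.previous_component = deepcopy(c.component)   # snapshot, not a reference
    return c
end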

fkiraly (Collaborator) commented Nov 11, 2018

I'm not sure why you think mutating hyper-parameters cannot be avoided? Can you explain the issue precisely, because perhaps I don't understand it fully.

In general, the "stylized" model interface design I often have in mind will have hyper-parameters which do not change when fitting (e.g., regularization constant in ridge regression), and model parameters which do (e.g., coefficients of the linear functional). Conversely, the user can manually set hyper-parameters from the outside, but not model parameters. The distinction is not mathematically justified, but purely an interface convention, as above.

In the simplest instance of "learning networks", namely, grid-tuned hyper-parameters, the separation can still be maintained as follows:

  • in the "primitive learner" X, the parameters to tune are hyper-parameters and should not be changed.
  • in the "tuned learner", which is a first-order learner, e.g., tuned(X, tunectrl), the tuned parameters become model parameters, while the others remain hyper-parameters. The hyper-parameters of X which are tuned are then set by the tuned model's fit.

This requires encapsulating the two kinds of parameters in the first-order operation, though - which is why an explicit hyper-parameter interface would be nicer than a non-explicit usage convention, in my opinion.
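
For illustration only (TunedLearner, lambdas, and evaluate are hypothetical names; a mutable primitive model with a lambda field is assumed), the tuned-learner idea might look roughly like this:

struct TunedLearner{M}
    model::M                   # primitive learner X; its tuned field is no longer user-set
    lambdas::Vector{Float64}   # wrapper hyperparameter: the grid to search
end

function fit(t::TunedLearner, evaluate, X, y)
    best, best_score = nothing, Inf
    for λ in t.lambdas
        candidate = deepcopy(t.model)
        candidate.lambda = λ               # the wrapper's fit sets X's hyperparameter
        score = evaluate(candidate, X, y)  # e.g. a cross-validated estimate of the loss
        if score < best_score
            best, best_score = candidate, score
        end
    end
    return best   # the tuned value is now a learned parameter of the wrapper
end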

tlienart (Collaborator) commented Nov 11, 2018

FWIW, this is what I mentioned to you by PM, @ablaom; I agree with what I believe @fkiraly is suggesting.
One way of doing this may indeed be to have mutable models where some of the attributes are immutable and represent (e.g.) regularization. Here's, e.g., what I had written for a draft generalised linear regression:

mutable struct GeneralizedLinearRegression <: RegressionModel
    loss::Loss                                     # hyperparameter (immutable type)
    penalty::Penalty                               # hyperparameter (immutable type)
    fit_intercept::Bool                            # hyperparameter
    n_features::Union{Nothing, Int}
    intercept::Union{Nothing, Real}                # set by fit
    coefs::Union{Nothing, AbstractVector{<:Real}}  # set by fit
end

The particulars are not very important, but what is maybe of interest is that loss and penalty are immutable types, not functions; this is (I believe) more "Julian", on top of providing the encapsulation that I believe @fkiraly is talking about above.

Of course there's always the ambiguity that you could in theory change the loss function after such a model has been fitted. If that's important, you could use (maybe uglier) Ref to make this harder:

struct GeneralizedLinearRegression <: RegressionModel
    loss::Loss
    penalty::Penalty
    fit_intercept::Ref{Bool}
    n_features::Ref{Int}
    intercept::Ref{<:Real}
    coefs::Ref{<:AbstractVector{<:Real}}
end

(as an aside, for those who may not be familiar with Julia:)

struct Bar
    val::Ref{<:Real}
end
Bar(v::T) where T<:Real = Bar(Ref(v))
fit!(b::Bar) = (b.val[] = 0.5)   # mutates through the Ref, though Bar itself is immutable
b = Bar(1.0)
fit!(b) # b.val[] == 0.5

fkiraly (Collaborator) commented Nov 12, 2018

@tlienart, yes, I think we mean the same thing, but just to make clear what I meant:
In the example above, there would have to be two cases:

(a) the linear regression model is not tuned. In this case loss, penalty, fit_intercept, and n_features cannot be changed by the model; they can be set by the user at initialization of the model ("hyper-parameters"). On the other hand, intercept and coefs are set by the model's "fit" method, given data.
(b) the linear regression model is tuned (e.g., by grid search) within a learning network. If all parameters are fully tuned, all six parameters are model parameters and cannot be set by the user at initialization. Instead, they are determined through fitting. The model has no hyper-parameters (or only ones for how the tuning is done).

For the distinction to be possible, I think there needs to be an abstraction which tells you which fields of the struct are of which kind. The way it is in your code, the tuning method, any workflow abstraction, or the user would not know how "coefs" is different from "penalty" or "fit_intercept". But, in my opinion, these are clearly different kinds of parameters (as outlined above).
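
For illustration only (hyperparameters and learnedparameters are hypothetical functions, not part of any existing interface), such an abstraction could be a simple trait declaring which fields are user-settable:

hyperparameters(::Type) = ()   # fallback: no declared hyperparameters

# for the untuned GeneralizedLinearRegression above (case (a)):
hyperparameters(::Type{GeneralizedLinearRegression}) =
    (:loss, :penalty, :fit_intercept, :n_features)

# everything else is treated as a learned/model parameter
learnedparameters(T::Type) =
    Tuple(setdiff(fieldnames(T), hyperparameters(T)))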

tlienart (Collaborator) commented Nov 12, 2018

(Edited quite a bit; the Ref thing is unnecessary.)
In fact, something that may combine what I understand to be @fkiraly's idea and what I'm suggesting above is something like this:

mutable struct LearnedParameter{T}   # mutable wrapper for values set by fit!
    p::T
end
struct Model                         # the model itself stays immutable
    x::Int                                  # hyperparameter
    c::LearnedParameter{<:Vector{<:Real}}   # learned parameter
end
m = Model(1, LearnedParameter(randn(5)))
fit!(m::Model, v::Vector{<:Real}) = (m.c.p = v; m)
fit!(m, [1.0, 2.0, 3.0])

ablaom (Member, Author) commented Nov 12, 2018

There seems to be quite a bit of confusion about my question "Should models be mutable?" for I don't think we all mean the same thing by "model". My apologies for any part of the confusion.

According to my definitions, a model is a container for hyperparameters only (things like regularisation). Taking a purely practical point of view, a hyperparameter is something I can pass to an external package's fit method (e.g., regularisation); parameters learned by the package algorithm (e.g., coefficients in a linear model or weights in a neural network) are not hyperparameters. They will form part of what has been called the fit-result, on which we dispatch our MLJ predict method; but the details of what is inside a fit-result are generally not exposed to MLJ. (As an aside, if one really does want access to internals, the way to do this is to return the desired information (e.g., coefficients) in the report dictionary returned by the package interface's fit method, as I explain in the spec.)
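
For illustration only (the exact signatures here are mine, not the spec's), the shape being described is roughly:

using LinearAlgebra

mutable struct RidgeRegressor   # "model" = container for hyperparameters only
    lambda::Float64
end

function fit(model::RidgeRegressor, X::Matrix{Float64}, y::Vector{Float64})
    coefs = (X'X + model.lambda * I) \ (X'y)   # learned parameters
    fitresult = coefs                          # opaque to MLJ; predict dispatches on it
    report = Dict(:coefficients => coefs)      # optional exposure of internals
    return fitresult, report
end

predict(::RidgeRegressor, fitresult, Xnew::Matrix{Float64}) = Xnew * fitresult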

As @fkiraly points out, in a tuning (meta)model that we construct within MLJ, the hyperparameters of a model now take on the role of learned parameters.

With this clarification: should models be mutable? And should I be allowed to have functions (e.g., loss) as hyperparameters (i.e., fields of a model)?

It is my feeling that we should make it very easy for someone to write an MLJ interface for their package. Ideally, they shouldn't need to understand a bunch of conventions about how to represent stuff, or be familiar with this or that library. So I'm inclined to say that they can make any object they like a hyperparameter, provided they are able to implement a copy function and an == function for the model type. I think we need this; see here. That said, I can't now see why we shouldn't make models immutable, except for the extra hassle when implementing tuning. But I admit I am still a bit nervous about doing so.

tlienart (Collaborator) commented Nov 12, 2018

I think, to clear up the confusion, it'd be good to have a bare-bones API (either mutable or immutable), plus an answer to where the learned parameters are stored if not in the model itself.

In the example for the DecisionTree it's indeed just hyperparameters; however, this may not be the case for, say, a generalised linear regression model, where you'd want to have one container for different regressions (e.g. Ridge, Lasso, ...) and not one container per specific regression. In that case you may have a mix of hyperparameters and also parameters that actually define what the model is.

ablaom (Member, Author) commented Nov 13, 2018

I would call these hyperparameters also. I guess your point is that some of the fields of Model might not be changed in tuning, only at some higher level, like benchmarking/model selection or whatever?

ablaom (Member, Author) commented Nov 13, 2018

@fkiraly If I want to check whether a Model instance has really changed, it needs to be immutable in Julia, unless I want to overload the default == method:

mutable struct Foo
    x
end

f = Foo(3)
g = Foo(4)
g.x = 3
f == g # false (for mutable types the default == falls back to ===)

And even:

Foo(3) == Foo(3) # false

While,

struct Bar
    x
end

f = Bar(3)
g = Bar(3)
f == g # true
f === g # true

(The last result shows that f and g are indistinguishable, in the sense that no program could tell them apart.)
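
For completeness, overloading == for the mutable case could look like this (a sketch; Foo as defined above, and equal_fields is a hypothetical generic helper, not from the spec):

Base.:(==)(a::Foo, b::Foo) = a.x == b.x
Foo(3) == Foo(3)   # now true

# or, generically, for any model type M:
equal_fields(a::M, b::M) where M =
    all(getfield(a, f) == getfield(b, f) for f in fieldnames(M))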

fkiraly (Collaborator) commented Nov 22, 2018

@ablaom @tlienart
I think we have two different designs here for the modelling strategy and the fitted model, so we need to think carefully; a decision will have to be made in favour of (at most) one of them.

The two are:
(a) the modelling strategy is a struct which contains both parameters and hyperparameters. Fitting mutates the struct, but only the parameters ("LearnedParameters" in @tlienart's design), which jointly encode the fitted model.
(b) the modelling strategy is a struct which contains only hyperparameters. Fitting produces a fitted model, distinct from the model struct.

I can see pros/cons of either:
(a) makes it easier to keep the fitted model in one place with the modelling strategy specification it came from.
(b) makes it easier to fit the same modelling strategy to different data without overcopying the specification.

fkiraly (Collaborator) commented Nov 22, 2018

On a tangential note: assume we were to do something similar to Keras, where in fitting the component models can update each other sequentially and multiple times, e.g., through backprop being a meta-algorithm applied to interconnected GLMs.
Which of the two designs would work better?
We may want to rapidly update the fits in a specific sequence, which also needs access to the specs.

ablaom (Member, Author) commented Nov 23, 2018

My experience in Julia is that it is a bad idea to inseparably fuse together data structures that have different functionality - in this case the model strategy (hyperparameters) and the learned parameters. Indeed, in my first attempt at Koala I did exactly this and lived to regret it.

In my vision these two are separate at the ground (i.e. package interface) level but come together at an intermediate level of abstraction in a "trainable model" which combines:

  • the model (aka model strategy)
  • the fit-result (aka learned parameters), undefined until training
  • ALL the data (i.e., task)
  • some cache that allows secondary calls to fit to be more efficient where possible (e.g. in learning pipelines)

This is pretty much the specification of Learner in MLR3, incidentally, without the cache.

When you call fit! on a trainable model, you call it on the rows of data you want to train and the lower-level fit/update methods create or update the fit-result (and cache).
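
For illustration only (the field names are mine, and package_fit stands in for the lower-level, package-supplied fit/update method), the trainable-model layer might look like:

mutable struct TrainableModel{M}
    model::M     # hyperparameter container ("model strategy")
    fitresult    # learned parameters; nothing until the first fit!
    cache        # package-specific state to make refitting cheaper
    X            # all the data (the task)
    y
end

TrainableModel(model, X, y) = TrainableModel(model, nothing, nothing, X, y)

function fit!(tm::TrainableModel, rows)
    # the package interface's fit/update creates or updates the fit-result
    # (and cache) using only the requested rows:
    tm.fitresult, tm.cache = package_fit(tm.model, tm.X[rows, :], tm.y[rows])
    return tm
end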

fkiraly (Collaborator) commented Nov 23, 2018

Fair enough - but just to reiterate the point: is it correct that, in consequence, you reject both @tlienart's design and your own earlier one, from Koala?

I have no strong opinion as long as it is done consistently but, as I said, this is one of those early design decisions which, in the worst case, one comes to regret, so it is worth thinking carefully about...

tlienart (Collaborator) commented Nov 23, 2018

I may be a bit dense, but I don't see the problems/difficulties; having (abstract) code exposing the issues would be a great help in fixing (my) ideas.

Originally, my thinking was that one model = one function that acts as a proxy for some function you care about but don't have access to ("nature"). That justified (IMO) having characterising elements (immutable) and values for the parameters. Then you can just apply that object, as you would a function, to predict.

struct Model1
  hparam::HyperParameter{...} # e.g. what loss, what penalty + their parameters, how / when to prune, tree depth, ....
  param::LearnedParameter{...} # e.g. regression coefficients, tree splits
end
(::Model1)(X) = .... # effectively "compute the function on X" or "predict"
fit!(m::Model1, X, y) = ... # update m.param via refs.

Anyway, that much I imagine is clear to you. In terms of hyperparameter tuning there's no real problem: for each hyperparameter setting to check (e.g., from a grid search), one such model is created, fitted, and kept if necessary. Composing such structures into meta-models is also easy, afaict. It also seems advantageous that this whole structure is pretty simple compared to effectively doubling up all structures (e.g., for a regression there would be one container for the hyperparameters etc. and one container for the regression coefficients). I also imagine that this could hurt you in the backprop/graphical-model setting you're talking about, Franz, but again maybe I just don't clearly see the proposal that @ablaom is suggesting.

I'm also not sure I see point (b) of what you suggest in your earlier summary, @fkiraly: if you have multiple dispatch, with fit! methods that go from very broad ("fallback" methods, e.g. LBFGS) to very specific (e.g. "analytic" for ridge), then applying the same strategy effectively just means calling fit! with the same parameters and updating the learned parameters. There's also something to be said for having a "starting-point model" that gets iteratively fitted as data comes along; the learned parameters then effectively act as the "cache".
But again, maybe I'm too fixed on a simple understanding of the problems. I think seeing some elementary code for some of the issues discussed above by @ablaom would make it clearer why "my" idea might in fact not make sense, or just not do the job.

Also, like @fkiraly, I'm not actually hell-bent on this; I'm perfectly happy to go with one idea and just stick with it. To be fair, I just don't really understand the problems, so it's more a question of trying to understand them.

fkiraly (Collaborator) commented Nov 23, 2018

Maybe it would be helpful to understand what went wrong for you, @ablaom, with design (a)?

Re. "I don't see (b)" for @tlienart : the situation is in which you want to fit the model with the same hyperparameter settings to many different datasets or data views. In this situation, you have a single model specification container, but many fitted model containers.

fkiraly (Collaborator) commented Nov 23, 2018

Btw, there is another argument I can see in favour of (b): it follows the design principle of separating "instruction" from "result", if you equate the two with the model strategy specification and the fitted model. Of course, it makes sense to have the "result" point to the "instruction".

And here's one in favour of (a): in (b), how would you easily update a fitted model with regards to hyper-parameters, without losing the reference to the instructions it arose from?

Another one in favour of (a): it follows more closely the "learning machine" idea, where the fitted model is interpreted as a "model of the world" which the AI (the model/class, etc.) has. Note that this interpretation is in contradiction with the design which sees the fitted model as a "result"; instead, it sees it as a "state".

fkiraly (Collaborator) commented Nov 23, 2018

So... is it true that advanced operations such as update, inference, adaptation, or composition happen more naturally in design (a)? I feel @ablaom is most likely to contradict this, so I would be keen to hear counterarguments.

ablaom (Member, Author) commented Nov 26, 2018

The one model - one function paradigm is not ideal. A transformer (e.g., an NN encoder-decoder) has two methods: transform and inverse_transform. Also, we might want to consider classifiers as having a predict method and a predict_proba method. And a resampling method might predict a mean performance, or a standard error, and so forth. That is, multiple methods may need to dispatch on the same fitted model. This complicates learning network design, because you may have different nodes implementing different methods on the same fit-result (e.g., transforming the target for training a model and inverse_transforming the predictions of that model).
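
For illustration only (names are mine), a univariate standardizer shows several methods dispatching on one fit-result:

using Statistics

struct Standardizer end   # model: no hyperparameters in this toy case

fit(::Standardizer, v::AbstractVector{<:Real}) = (mu = mean(v), sigma = std(v))

transform(::Standardizer, fitresult, v)         = (v .- fitresult.mu) ./ fitresult.sigma
inverse_transform(::Standardizer, fitresult, w) = w .* fitresult.sigma .+ fitresult.mu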

As to whether we should adopt design (a) or design (b) at the level of glue code, practical concerns make (b) the clear choice: we can always combine parameters and hyperparameters at a higher level of abstraction (which is what I intend, and sketch above), but we can never unfuse them if they are already joined in the bottom-level abstraction.

@fkiraly At present I would allow the user to mutate the hyperparameter part of a combined "machine" and then call fit! (without specifying hyperparameters) to update the fit-result (learned parameters). However, this means that, before the fit! call, the hyperparameters and learned parameters are not yet in sync. Although there is potential for trouble, this is very convenient. What do you think?

fkiraly (Collaborator) commented Nov 26, 2018

@ablaom , I think there is an important difference between "one model - one function" and "one model - one container".

The first ("1 model - 1 function") is arguably silly and if I understand it no one in this thread would want it. We may need to dispatch on fit, predict, trafo, backtrafo, predict_proba, what_are_my_hyperparameters, predict_supercalifragilistic and similar, but at least on fit/predict, which is 2>1.

The distinction between (a) and (b) is where each stands on the "one model - one container" issue; I think both designs agree with "one model - many interface methods", which is in my opinion reasonable.

Also, (b) is not so clear-cut: even if you have a single container, you can always write multiple accessor methods, say instructions_of(container) and fitted_model_in(container).

I still don't have a strong preference, just playing the devil's advocate and pointing out what I see as a gap in reasoning.

fkiraly (Collaborator) commented Nov 26, 2018

More precisely, I don't think the situation is that clear: whenever you have a collection of (dispatch-on or OOP) objects and class methods, there's always a decision on a spectrum between tacking things together and taking them apart. Axes along which to consider this are user friendliness, clarity of code, and the semantic/operational sensibility of tacking the things together or not.

For model instructions and fitted models this is not so clear to me: the first can live almost entirely without data, and the second may be invoked multiple times for a given instruction set ("hyper-parameters"). Python/sklearn and base R handle the issue differently, and I can see the merits of both.

ysimillides added the "design discussion" label Dec 11, 2018
ablaom (Member, Author) commented Feb 13, 2019

The design has solidified considerably since this discussion and I am closing the issue.
