Recent proposal for design of package interface #16
> What do others think about making models immutable?
I’d argue for them to be kept *mutable* - I’m not Julia-literate enough to see all the consequences, but I have a feeling that the various proposed interfaces, including the hyperparameter interface, would not work well with immutable models, or may require brittle (and potentially obscure) workarounds such as referencing/macros.
> Defaults for hyperparameter ranges. Is there a desire for interfaces to prescribe a range (and scale type) for hyperparameters, in addition to default values?
Personally, I don’t feel this needs to be in the MVP/core design if there is error capture and solid evaluation – something outside the “real range” would simply be caught as an error or a badly performing setting.
For something more involved, I’m copying part of one of my e-mails from an earlier discussion with Diego; it contains a more elaborate (but potentially fiddly) proposal. As said, I would not prioritize it, but include it for consideration.
From discussion on hyperparameter abstraction <quote start>
having thought about the parameter set interface, I’ve understood that the distinction between grids, ranges, and single parameters is, on the mathematical side, largely artificial, since all three are sets – one may be infinite, another finite, and the third has to be a single-element set.
Julia’s dispatch offers a very interesting solution to this:
We could have a *single* dispatch (or flag) structure of parameter set structs, where the ordering is single-element-set < discrete-set < set.
There’s further the special case of Cartesian products, and single-parameter (not element) ranges, which can be any of these.
Cartesian products can be taken over multiple parameter set structs, and end up in the highest class they are taken over.
So what I called “ParamSet, ParamRange, ParamGrid” are actually instances of the same type structure, let’s call the top element of it “ParamSet”.
Though I’m not sure how much of this to encode in type dispatch order, and how much through flags and nesting.
But in this view, “makeGrid” would be easy: it takes a set and makes it a discrete set, which can then be dispatched upon by “getNumOfPts” and “getKthGridPt”.
As signatures go:

```
makeGrid     : ParamSet -> discreteParamSet
getNumOfPts  : discreteParamSet -> integer
getKthGridPt : discreteParamSet x integer -> singleParamSet
```
Kindly let me know if you have any questions – does this make sense?
<quote end>
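A rough Julia sketch of the quoted proposal may help fix ideas. The type names and the concrete representations below are assumptions for illustration; only the signatures `makeGrid`, `getNumOfPts`, and `getKthGridPt` come from the quote:

```julia
# Encode the single-element-set < discrete-set < set ordering in the
# type hierarchy, so generic code can dispatch on "how discrete" a
# parameter set is.
abstract type ParamSet end                      # the top element: any set
abstract type DiscreteParamSet <: ParamSet end  # finite sets
struct SingleParamSet{T} <: DiscreteParamSet    # single-element sets
    value::T
end
struct GridParamSet{T} <: DiscreteParamSet      # an explicit finite grid
    points::Vector{T}
end
struct RangeParamSet{T<:Real} <: ParamSet       # a (possibly infinite) interval
    lo::T
    hi::T
end

# makeGrid : ParamSet -> discreteParamSet
makeGrid(s::DiscreteParamSet, n::Int=0) = s     # already discrete: nothing to do
makeGrid(s::RangeParamSet, n::Int) =
    GridParamSet(collect(range(s.lo, s.hi; length=n)))

# getNumOfPts : discreteParamSet -> integer
getNumOfPts(s::SingleParamSet) = 1
getNumOfPts(s::GridParamSet) = length(s.points)

# getKthGridPt : discreteParamSet x integer -> singleParamSet
getKthGridPt(s::GridParamSet, k::Int) = SingleParamSet(s.points[k])
getKthGridPt(s::SingleParamSet, ::Int) = s
```

A Cartesian product type could then be layered on top, landing in the highest class of its factors, as the quote suggests.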
I have just realised that it may be important to be able to make copies of models (containers for hyperparameters). If I want a composite model to automatically suppress unnecessary retraining of a component model, then the question of whether or not a component needs retraining will depend on which hyperparameters of the composite have changed (i.e. which sub-hyperparameters have changed). However, to compare the previous value of …
I'm not sure why you think mutating hyper-parameters cannot be avoided? In general, the "stylized" model interface design I often have in mind will have hyper-parameters which do not change when fitting (e.g., the regularization constant in ridge regression), and model parameters which do (e.g., the coefficients of the linear functional). Conversely, the user can manually set hyper-parameters from the outside, but not model parameters. The distinction is not mathematically justified, but purely an interface convention, as above. In the simplest instance of "learning networks", namely, grid-tuned hyper-parameters, the separation can still be maintained as follows: …
This requires encapsulating the two kinds of parameters in the first-order operation, though - which is why a nice explicit hyper-parameter interface would be nicer than a non-explicit usage convention, in my opinion.
FWIW, this is what I mentioned to you by PM, @ablaom; I agree with what I believe @fkiraly is suggesting.

```julia
mutable struct GeneralizedLinearRegression <: RegressionModel
    loss::Loss
    penalty::Penalty
    fit_intercept::Bool
    n_features::Union{Void, Int}
    intercept::Union{Void, Real}
    coefs::Union{Void, AbstractVector{Real}}
end
```

The particulars are not very important, but what is maybe of interest is that … Of course there's always the ambiguity that you could in theory change the loss function after such a model has been fitted. If that's important, you could use the (maybe uglier)

```julia
struct GeneralizedLinearRegression <: RegressionModel
    loss::Loss
    penalty::Penalty
    fit_intercept::Ref{Bool}
    n_features::Ref{Int}
    intercept::Ref{<:Real}
    coefs::Ref{<:Real}
end
```

(As an aside, for those who may not be familiar with Julia:)

```julia
struct Bar
    val::Ref{<:Real}
end
Bar(v::T) where T<:Real = Bar(Ref(v))
fit!(b::Bar) = (b.val[] = 0.5)

b = Bar(1.0)
fit!(b)  # now b.val[] == 0.5
```
@tlienart, yes, I think we mean the same thing, but just to make clear what I meant: (a) the linear regression model is not tuned. In this case loss, penalty, fit_intercept, and n_features cannot be changed by the model; they can be set by the user at initialization of the model ("hyper-parameters"). On the other hand, intercept and coefs are set by the model's "fit" method, given data. For the distinction to be possible, I think there needs to be an abstraction which tells you which fields of the struct are of which kind. The way it is in your code, the tuning method, any workflow abstraction, or the user would not know how "coefs" is different from "penalty" or "fit_intercept". But, in my opinion, these are clearly different kinds of parameters (as outlined above).
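An abstraction of the kind described - one that tells generic code which fields are user-set and which are learned - might look roughly like this. All the names here (`hyperparams`, `learnedparams`, `set_hyper!`, and the `RidgeRegressor` example) are illustrative assumptions, not an agreed API:

```julia
# A model struct mixing the two kinds of fields, plus trait functions
# that declare which field is which.
mutable struct RidgeRegressor
    lambda::Float64                        # hyperparameter
    fit_intercept::Bool                    # hyperparameter
    coefs::Union{Nothing,Vector{Float64}}  # learned parameter (set by fit)
    intercept::Union{Nothing,Float64}      # learned parameter (set by fit)
end

hyperparams(::Type{RidgeRegressor}) = (:lambda, :fit_intercept)
learnedparams(::Type{RidgeRegressor}) = (:coefs, :intercept)

# Generic workflow code can now treat the two kinds differently;
# e.g. a tuner is only allowed to touch hyperparameters:
set_hyper!(m, name::Symbol, v) =
    name in hyperparams(typeof(m)) ? setfield!(m, name, v) :
    error("$name is not a hyperparameter")
```

With this, `set_hyper!(m, :lambda, 1.0)` succeeds while `set_hyper!(m, :coefs, ...)` raises an error, which is exactly the distinction the tuning method needs.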
(Edited quite a bit.) The idea, with a mutable wrapper for the learned parameters:

```julia
mutable struct LearnedParameter{T}
    p::T
end

struct Model
    x::Int
    c::LearnedParameter{Vector{T}} where T <: Real
end

m = Model(1, LearnedParameter(randn(5)))
fit!(m::Model, v::Vector{<:Real}) = (m.c.p = v; m)
fit!(m, [1.0, 2.0, 3.0])
```
There seems to be quite a bit of confusion about my question "Should models be mutable?", for I don't think we all mean the same thing by "model". My apologies for any part of the confusion. According to my definitions, a model is a container for hyperparameters only (things like regularisation). Taking a purely practical point of view, a hyperparameter is something I can pass to an external package's …

As @fkiraly points out, in a tuning (meta)model that we construct within MLJ, the hyperparameters of a model then take on the role of learned parameters.

With this clarification: should models be mutable? And should I be allowed to have functions (e.g., a loss) as hyperparameters (i.e. fields of a model)?

It is my feeling that we should make it very easy for someone to write an MLJ interface for their package. Ideally, they shouldn't need to understand a bunch of conventions about how to represent stuff or be familiar with this or that library. So I'm inclined to say that they can make any object they like a hyperparameter, provided they are able to implement a copy function and an == function for the model type. I think we need this; see here.

This said, I can't now see why we shouldn't make models immutable, except for the extra hassle in implementing tuning. But I admit I am still a bit nervous about doing so.
I think to clarify the confusion it'd be good to have a barebones API (either mutable or immutable). Plus: where are the learned parameters stored, if not in the model itself? In the example for the DecisionTree it's indeed just hyperparameters; however, this may not be the case in, for instance, a generalised linear regression model, where you'd want to have one container for different regressions (e.g. Ridge, Lasso, ...) and not one container per specific regression. In that case you may have a mix of hyperparams and also params that actually define what the model is.
I would call these hyperparameters also. I guess your point is that some of the fields of …
@fkiraly If I want to check if a …

```julia
mutable struct Foo
    x
end

f = Foo(3)
g = Foo(4)
g.x = 3
f == g  # false: for mutable structs, == falls back to object identity (===)
```

And even: …

While: …

(The last result shows that …)
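As the snippet shows, `==` on mutable structs falls back to object identity. A sketch of the kind of value-based `==` and `copy` a package author could supply under the proposed convention follows; `Foo2` is a stand-in type, and the `fieldnames`-based pattern is one possible approach (generic enough to be generated by a macro later):

```julia
mutable struct Foo2
    x
end

# Value-based equality and copy, defined field-by-field.
import Base: ==, copy
==(a::Foo2, b::Foo2) =
    all(getfield(a, f) == getfield(b, f) for f in fieldnames(Foo2))
copy(a::Foo2) = Foo2((getfield(a, f) for f in fieldnames(Foo2))...)
```

With these definitions `Foo2(3) == Foo2(3)` holds, and a tuning metamodel can safely snapshot a model with `copy` before mutating its hyperparameters.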
@ablaom @tlienart One is: … I can see pros/cons of either: …
On a tangential note: assume we were to do something similar to Keras, where in fitting the component models can update each other sequentially and multiple times, e.g., through backprop being a meta-algorithm applied to interconnected GLMs.
My experience in Julia is that it is a bad idea to inseparably fuse together data structures that have different functionality - in this case the model strategy (hyperparameters) and the learned parameters. Indeed, in my first attempt at Koala I did exactly this and lived to regret it. In my vision these two are separate at the ground (i.e. package interface) level but come together at an intermediate level of abstraction in a "trainable model", which combines: …

This is pretty much the specification of Learner in mlr3, incidentally, without the cache. When you call fit! on a trainable model, you call it on the rows of data you want to train on, and the lower-level fit/update methods create or update the fit-result (and cache).
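The "trainable model" idea might be sketched roughly as follows. All names here (`TrainableModel`, `lowlevel_fit`, `MeanPredictor`) are assumptions for illustration, not MLJ's actual API:

```julia
# The model holds hyperparameters only; the fit-result and cache live
# in the wrapper, keeping strategy and learned state separate.
mutable struct TrainableModel{M}
    model::M        # hyperparameter container ("strategy")
    fitresult::Any  # learned parameters, set by fit!
    cache::Any      # package-private state, e.g. for warm restarts
end
TrainableModel(model) = TrainableModel(model, nothing, nothing)

# A toy model and lower-level fit, standing in for a package interface.
struct MeanPredictor end  # no hyperparameters
lowlevel_fit(::MeanPredictor, y) = (sum(y) / length(y), nothing)

# fit! delegates to the lower-level method and stores its results.
function fit!(tm::TrainableModel, y)
    tm.fitresult, tm.cache = lowlevel_fit(tm.model, y)
    return tm
end

predict(tm::TrainableModel, n::Int) = fill(tm.fitresult, n)
```

The point of the indirection is that a user can mutate `tm.model` (the hyperparameters) and call `fit!` again, while `fitresult` remains the package's private representation of the learned parameters.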
Fair enough - but just to iterate the point: is it correct that, in consequence, you reject both the design of @tlienart and your own earlier one from Koala? I have no strong opinion as long as it is done consistently, but as said, it is one of those early design decisions which one, in the worst case, comes to regret, so it is worth thinking carefully about...
I may be a bit dense, but I don't see the problems/difficulties; having (abstract) code exposing the issues would be a great plus to fix (my) ideas. At the origin, my thinking was that 1 model = 1 function that acts as a proxy for some function you care about but don't have access to ("nature"). That justified (IMO) having characterising elements (immutable) and values for the parameters. Then you can just apply that object as you would a function to predict.

```julia
struct Model1
    hparam::HyperParameter{...}  # e.g. what loss, what penalty + their parameters, how/when to prune, tree depth, ...
    param::LearnedParameter{...} # e.g. regression coefficients, tree splits
end

(::Model1)(X) = ...          # effectively "compute the function on X" or "predict"
fit!(m::Model1, X, y) = ...  # update m.param via refs
```

Anyway, that much I imagine is clear to you. In terms of hyperparameter tuning, there's no real problem: for each hyperparameter setting to check (e.g. from a grid search), one such model is created, fitted, and kept if necessary. Composing such structures into meta-models is also easy, afaict. It also seems advantageous that this whole structure is pretty simple compared to effectively doubling up all structures (e.g. for a regression there would be one container for the hyperparameters etc. and one container for the regression coefficients (?)). I also imagine that this could hurt you in the backprop/graphical-model setting you're talking about, Franz (but again, maybe I just don't clearly see the proposal that @ablaom is suggesting). I'm also not sure I see point (b) of what you suggest in your earlier summary, @fkiraly: if you have multiple dispatch with …

Also, like @fkiraly, I'm actually not hell-bent on this; I'm perfectly happy to go with one idea and just stick with it. To be fair, I just don't really understand the problems, so it's more a question of trying to understand them.
Maybe it would be helpful to understand what went wrong for you, @ablaom, with design (a)? Re. "I don't see (b)", for @tlienart: the situation is one in which you want to fit the model with the same hyperparameter settings to many different datasets or data views. In this situation, you have a single model specification container, but many fitted-model containers.
Btw, there is another argument I can see in favour of (b): it follows the design principle of separating "instruction" from "result", if you were to equate the two with the model strategy specification and the fitted model. Of course it makes sense to have the "result" point to the "instruction". And here's one in favour of (a): in (b), how would you easily update a fitted model with regard to hyper-parameters, without losing the reference to the instructions it arose from? Another one in favour of (a): it follows more closely the "learning machine" idea, where the fitted model is interpreted as a "model of the world" which the AI (the model/class etc.) has. Note that this interpretation is in contradiction with the design which sees the fitted model as a "result" - instead it sees it as a "state".
So ... is it true that advanced operations such as update, inference, adaptation, or composition happen more naturally in design (a)? I feel @ablaom is most likely to contradict this, so I would be keen to hear counterarguments.
The one model - one function paradigm is not ideal. A transformer (e.g., an NN encoder-decoder) has two methods: transform and inverse-transform. Also, we might want to consider classifiers as having a …

On whether we should adopt design (a) or design (b) at the level of glue-code: practical concerns make (b) the clear choice. We can always combine parameters and hyperparameters at a higher level of abstraction (which is what I intend and sketch above), but we can never unfuse them if they are already joined at the bottom-level abstraction.

@fkiraly At present I would allow the user to mutate the hyperparameter part of a combined "machine" and then call fit! (without specifying hyperparameters) to update the fit-result (learned parameters). However, this means that before the fit! call, the hyperparameters and learned parameters are not yet in sync. Although there is potential for trouble, this is very convenient. What do you think?
@ablaom, I think there is an important difference between "one model - one function" and "one model - one container". The first ("1 model - 1 function") is arguably silly, and if I understand it, no one in this thread would want it. We may need to dispatch on fit, predict, trafo, backtrafo, predict_proba, what_are_my_hyperparameters, predict_supercalifragilistic and similar, but at least on fit/predict, which is 2 > 1. The distinction between (a) and (b) is where each stands on the "one model - one container" issue; I think both designs agree with "one model - many interface methods", which is in my opinion reasonable.

Also, (b) is not so clear-cut: even if you have a single container, you can always write multiple accessor methods, say instructions_of(container) and fitted_model_in(container).

I still don't have a strong preference; I am just playing the devil's advocate and pointing out what I see as a gap in the reasoning.
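The accessor-method point might be sketched like this. The container type `FusedRidge` is a hypothetical example; only the accessor names `instructions_of` and `fitted_model_in` come from the text above:

```julia
# Design (a): one container fusing strategy and learned state. The
# logical separation of design (b) is recovered through accessors.
mutable struct FusedRidge
    lambda::Float64                         # "instruction" part
    coefs::Union{Nothing,Vector{Float64}}   # "result" part, set by fit
end

# Views of the two logical halves, returned as named tuples.
instructions_of(m::FusedRidge) = (lambda = m.lambda,)
fitted_model_in(m::FusedRidge) = (coefs = m.coefs,)
```

Generic code that only needs the "instructions" (e.g. a tuner comparing settings) then goes through `instructions_of` and never sees the learned state.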
More precisely, I don't think the situation is that clear: whenever you have a collection of (dispatch-on or OOP) objects and class methods, there is always a decision on a spectrum between tacking things together and taking them apart. Axes on which to consider these are user-friendliness, clarity of code, and the semantic/operational sensibility of tacking the things together or not. For model instructions and fitted models this is not so clear to me: the first can live almost entirely without data, and the second may be invoked multiple times for a given instruction set ("hyper-parameters"). Python/sklearn and R-base handle the issue differently, and I can see the merits of both.
The design has solidified considerably since this discussion and I am closing the issue. |
Please provide any new feedback on the proposed glue-code specification below. @fkiraly has posted some comments here. It would also be helpful to have reactions to the two bold items below.
I will probably move the “update” instructions for the `fit2` method to model hyperparameters, leaving keyword arguments for package-specific features (not so many use cases). It will be simplified, made into an argument-mutating function without data as arguments. (If data really needs to be revisited, a reference to it can be passed via the cache.) The document will explain the use cases for this better.

I will require all `Model` field types to be concrete.

**Immutable models.** To improve performance, @tlienart has
recommended making models immutable. Mutable models are more convenient because they avoid the need to implement a copy function, and you can make a function (e.g., a loss) a hyperparameter (because you don't need to copy it). The first annoyance can be dealt with (mostly) with a macro. To deal with the second, you replace a function with a concrete type ("reference") and use type dispatch within `fit` to get the function you actually want. Or something like that. In particular, you need to know ahead of time what functions you might want to implement. For unity, we might want to prescribe this part of the abstraction (for common loss functions, optimisers, metrics, etc.) ourselves (or borrow from an existing library).
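The "replace a function with a concrete type and dispatch within fit" idea might look like the following sketch. All names here (`Loss`, `SquaredLoss`, `lossfun`, `ImmutableModel`, `empirical_risk`) are illustrative assumptions, in the spirit of loss-type libraries such as LossFunctions.jl:

```julia
# Instead of storing a function in the model (awkward to copy/compare,
# and a concrete-type obstacle for immutable models), store a concrete
# loss *type* and recover the function by dispatch inside fit.
abstract type Loss end
struct SquaredLoss <: Loss end
struct AbsoluteLoss <: Loss end

# The actual function lives behind dispatch, not in the model:
lossfun(::SquaredLoss) = (y, yhat) -> (y - yhat)^2
lossfun(::AbsoluteLoss) = (y, yhat) -> abs(y - yhat)

struct ImmutableModel{L<:Loss}  # immutable: cheap to copy, == "just works"
    loss::L
    lambda::Float64
end

# Inside a fit routine, the concrete function is recovered by dispatch:
empirical_risk(m::ImmutableModel, y, yhat) =
    sum(lossfun(m.loss).(y, yhat)) / length(y)
```

Note that because the struct is immutable with bits-type fields, two models built with the same settings compare equal out of the box, with no hand-written `==` or copy function - which is the performance/convenience trade-off under discussion.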
When I wrote my Flux interface for Koala I found it very convenient to use a function as a hyperparameter to generate the desired architecture, essentially because a "model" in Flux is a function. (I suppose one could (should?) encode the architecture à la ONNX or similar.)

My vote is to keep models mutable, to make it more convenient for package interface writers and because I'm guessing the performance drawbacks are small. However, others may have a more informed opinion than I do. For what it is worth, ScikitLearn.jl has mutable models.
What do others think about making models immutable?
**Defaults for hyperparameter ranges.** Is there a desire for interfaces to prescribe a range (and scale type) for hyperparameters, in addition to default values? (To address one of @fkiraly's comments: default values and types of parameters are already exposed to MLJ through the package interface's model definition.)