add trainstep! #666

Open
oxinabox opened this issue Mar 6, 2019 · 9 comments

Comments

@oxinabox
Member

oxinabox commented Mar 6, 2019

Following up from #607

We should expose functionality that lets the user write a training loop while thinking only about the loss, rather than about gradient and update!.
The loss is a higher-level concept than gradients.

Custom training loops are important since many things do not comfortably fit into the abstraction of
train!(args->loss(args...), data, params, opt, callbacks).
The train! function is good for things that comfortably fit supervised training,
and while it can do anything, it becomes increasingly awkward the further you are from that.
At the other end is writing a custom training loop, invoking gradient and update!.
This is fully general: you can do all kinds of things, like messing with the gradients during the training loop.
But there is a middle ground,
where you can define the loss but have nothing to say about the gradients.

For this I think we should have

train_step!(getloss, ps, opt), where getloss is a 0-arg closure returning the loss (and using the model).
This would have a pleasing symmetry in name and arguments
with train!(loss, ps, data, opt), where loss is a closure taking args as provided by iterating data.

This would be useful because you are not required to use the abstraction of having data, but you keep the rest.

The implementation would be very simple, but I feel the abstraction away from gradients is worth it.

function train_step!(getloss, ps, opt)
    gs = gradient(getloss, ps)  # gradients of the loss w.r.t. the (implicit) params
    update!(opt, ps, gs)        # apply one in-place optimiser step
end
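
For instance, a hypothetical usage sketch (assuming a model mdl, a single batch (x, y), a loss function loss, and an optimiser opt; all names illustrative):

ps = params(mdl)                              # implicit parameters of the model
train_step!(() -> loss(mdl(x), y), ps, opt)   # one optimisation step, no data iteration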

This would go into the core of train!.
Replacing

gs = gradient(ps) do
    loss(d...)
end
update!(opt, ps, gs)

with

train_step!(() -> loss(d...), ps, opt)
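
A minimal sketch of train! rebuilt on top of it (simplified; callbacks and error handling omitted):

function train!(loss, ps, data, opt)
    for d in data
        train_step!(() -> loss(d...), ps, opt)  # one step per item yielded by data
    end
end
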
@MikeInnes
Member

I'm on board. It will make sense to have this as the default interface for gradients if we go full force on #628, since the gradient object will be a little more complex than usual.

I think this should be written step!(opt) do ... (it's kind of a method of the optimiser). It may as well also return the loss.

Initially it will have to be step!(opt, ps) do ..., but we can remove the ps with #628 (and also reorder train! arguments to be more consistent with this).
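
A sketch of that shape, assuming step! captures and returns the loss value (hypothetical; names illustrative):

function step!(getloss, opt, ps)
    local l
    gs = gradient(ps) do
        l = getloss()  # record the loss so step! can return it
    end
    update!(opt, ps, gs)
    return l
end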

@oxinabox up for sketching it out?

@oxinabox
Member Author

oxinabox commented Mar 7, 2019

Do you really think train! needs to have its argument order changed too?
I like the consistency,
but if #628 will remove ps,
then reordering the args will mean just an extra round of deprecations for little gain.

@MikeInnes
Member

Doesn't have to be an extra round – we can do train!(loss, ps, data, opt) -> train!(loss, opt, data) in one go, separately from implementing step!.

@MikeInnes
Member

MikeInnes commented Mar 10, 2019

I think this actually has to be step!(loss, opt, x...), where x is whatever training data you have. The reason is that if we move to implicit parameters and you write

step!(opt) do 
  mse(W*X .+ b, Y)
end

or similar, then you're going to be surprised when Flux improves your loss by optimising the training data. The other option is to write dropgrad(X), but, meh. It seems pretty consistent to say: closed-over variables are parameters; formal arguments and constants are not.

You can also view this as being like the gradient(f, x) interface but instead of getting dx and ignoring f, we actually get df and ignore x. This seems weird because it's exactly opposite to every other framework and AD tool in existence, but I think it's necessary if we want to shift the focus to optimising programs like f rather than values like x.
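
To illustrate the contrast (a hypothetical sketch; step! does not exist yet, and opt is assumed to be some optimiser):

W, b = rand(2, 3), rand(2)
f(x) = W*x .+ b
x = rand(3)

# Usual AD interface: differentiate with respect to the explicit argument x.
dx, = gradient(x -> sum(f(x)), x)

# Proposed interface: differentiate with respect to the closed-over W and b;
# the formal argument x is treated as a constant.
step!(opt, x) do x
    sum(f(x))
end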

I will sketch this out, along with #379, in #669, and also write up some simple usage examples to give a feel for it.

@oxinabox
Member Author

One thing I have been thinking about is: should multiple return values be allowed?

It is often more convenient to calculate certain other things while you are calculating the loss, e.g. some metrics, perhaps used in early stopping.

Right now, for those I have just been modifying a vector in the parent scope.
But an alternative is that step! returns whatever getloss returns, but only optimises the first of them.
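
A minimal sketch of that alternative, assuming the step!(getloss, opt, ps) shape discussed above (names illustrative):

function step!(getloss, opt, ps)
    local result
    gs = gradient(ps) do
        result = getloss()
        # differentiate only the loss, i.e. the first returned value
        result isa Tuple ? first(result) : result
    end
    update!(opt, ps, gs)
    return result
end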

I think the implicit parameter thing makes sense. This is distinct from train!, as step! does not iterate its arguments:
they are one minibatch, or one online sample, or ...
In fact it knows nothing about them at all; using them is purely the responsibility of the getloss function.

@MikeInnes
Member

In effect train! would become a thin wrapper around step! that replaces the single input to loss with an iterator of them. I think seeing it as a slight generalisation of step! is a nice way to look at it.

As far as calculating other values goes, I think it's closures FTW here:

total_loss = 0
for ...
  step!(opt) do
    total_loss += loss(...)
  end
end

Of course if we return the loss from step! then the example doesn't need to be written that way, but generally you could accumulate or inspect any intermediate result this way.
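
For example, assuming step! returns the loss (data, mdl, and loss are illustrative names), the accumulation needs no closure tricks:

total_loss = 0.0
for (x, y) in data
    total_loss += step!(opt) do
        loss(mdl(x), y)
    end
end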

@oxinabox
Member Author

oxinabox commented Mar 10, 2019

Closures like that are what I have been doing so far.
I might play around with my current code and see what it looks like without closures.

Hmm, when writing it up it might be good to highlight that

step!(opt, X, y) do X, y
    # code
    ...
end

can equivalently be written as

step!(opt) do
    X′ = dropgrad(X)  # fresh names: writing X = dropgrad(X) inside the closure
    y′ = dropgrad(y)  # would make X local and error before it is read
    # code, using X′ and y′
    ...
end

@oxinabox
Member Author

Just an example I started to write for other reasons:

struct Stop <: Exception end  # sentinel exception for early stopping

callback = let
    prev_dev_loss = Inf
    function ()
        dev_loss = loss(mdl(X_dev), Y_dev)
        @info(dev_loss)
        dev_loss > prev_dev_loss && throw(Stop())
        prev_dev_loss = dev_loss
    end
end

train!((x, y) -> loss(mdl(x), y), params(mdl), Iterators.repeated((X_train, Y_train), 1000), opt, cb = callback)

With step!, one would instead write:

prev_dev_loss = Inf
ps = params(mdl)
for epoch in 1:1000
    train_loss = step!(opt, ps) do
        loss(mdl(X_train), Y_train)
    end

    dev_loss = loss(mdl(X_dev), Y_dev)
    @info(dev_loss)
    dev_loss > prev_dev_loss && break   # Early stopping
    prev_dev_loss = dev_loss
end

@CarloLucibello
Member

CarloLucibello commented Feb 7, 2020

My 2 cents: both the current train! abstraction and the step! one proposed here add just a little conciseness while making the code more obscure, compared to the "unrolled" training loop documented in #994.
