Optimizers #26
Conversation
My first thought is that this puts a fairly large burden on the backend, which ends up implementing much of the optimisation process; I was expecting that a backend wouldn't have to do much more than forward calls through. I think it would be a good idea to implement the most straightforward version of this that works in pure-Julia mode first. Hopefully you'll find that straightforward to do, but let me know if it's not clear.
I'm not very clear about the semantics here. To address this, we need to clearly define what updating the parameters should mean.
If we go the first way, there should be a way to "turn off" the synchronization of params in the training process; otherwise transferring weights and grads every batch is inevitable. This would make things very complex. If we choose 2 or 3, things become much easier, since we can do the conversion up front.
As for where to put the state: yes, I can have the optimizer keep its own internal state.
Great questions. I'll try to explain my current thinking on this as much as I can. In general, functions of models have "wrapper semantics". I don't think any of those options should make things significantly more or less complex, then; it's just a question of when the conversion happens.
Keep the thoughts coming, especially if that's not clear.
51a9482 to 21af9f8
Great, this looks much improved.
Although I started this out with a recursive `update!`, I'm wondering if it would just be cleaner to grab all params up front – like `params(Affine(10,5)) == [Param(10,5), Param(1,5)]` – and then `opt = SGD(params(model))`. Then you could call `update!(opt)` to carry out the update. What do you think?
That would also make it easier to get rid of the nested closures and just have an `SGD` object with the appropriate state, and an `update!` method.
I really like the way you can compose optimisers together, but it would be nice if that was built on top of the basic framework, rather than special cased at the bottom. For example, with the tweaks above:
```julia
struct Multi
    fs
end

update!(m::Multi) = foreach(update!, m.fs)
```
That would also avoid the need to repeat the decay stuff many times. If the user wants a decay they can easily just compose the basic optimiser with a decay themselves.
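As a self-contained illustration of that composition idea (toy types, not the PR's actual optimisers), an SGD step composed with a decay step via `Multi` might look like:

```julia
# Toy illustration of composing optimisers with Multi. SGDStep and
# DecayStep are stand-ins; each just implements its own update!.
mutable struct P
    x::Float64   # parameter value
    Δ::Float64   # gradient
end

struct SGDStep;   p::P; η::Float64; end
struct DecayStep; p::P; ρ::Float64; end

update!(o::SGDStep)   = (o.p.x -= o.η * o.p.Δ)
update!(o::DecayStep) = (o.p.x -= o.ρ * o.p.x)

struct Multi
    fs
end
update!(m::Multi) = foreach(update!, m.fs)

p   = P(1.0, 0.5)
opt = Multi([SGDStep(p, 0.1), DecayStep(p, 0.01)])
update!(opt)   # SGD step first, then the decay step
```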
```julia
decay::Real=0,
nesterov::Bool=false)

@restrict_range lr "[0, ∞)"
```
Why not just `@assert 0 < lr < ∞`?
@MikeInnes you mean `0 <= lr || throw(ArgumentError("lr must be >= 0"))`. The square bracket indicates inclusive in this notation. And `ArgumentError` is semantically different from `@assert`. In particular, one day I hope to see the ability to disable asserts in optimised mode, and it is code using `@assert` to check user logic that is stopping that from happening. (I'm sure you've seen the issues/PRs in the JuliaLang repo.)
`@assert` is not for this purpose, as @oxinabox indicated. However, https://github.com/jw3126/ArgCheck.jl could be a good option. I just feel it's still not clean enough when many checks happen together:
```julia
@argcheck lr >= 0
@argcheck 0 <= beta1 <= 1
@argcheck 0 <= beta2 <= 1
@argcheck epsilon > 0
@argcheck decay >= 0
```
compared with this, which I think is more aligned and mathematical:
```julia
@restrict_range lr      "[0, ∞)"
@restrict_range beta1   "[0, 1]"
@restrict_range beta2   "[0, 1]"
@restrict_range epsilon "(0, ∞)"
@restrict_range decay   "[0, ∞)"
```
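For what it's worth, here is a hedged sketch of how such a macro could be implemented; this is purely an illustration of the idea, not the PR's actual `@restrict_range` code. It parses interval notation like `"[0, ∞)"` at macro-expansion time and emits an `ArgumentError` check rather than an `@assert`:

```julia
# Illustrative sketch (not the PR's implementation): parse interval
# notation like "[0, ∞)" and emit an ArgumentError check, not an @assert.
macro restrict_range(var, range)
    s = strip(range)
    lo_inc = startswith(s, "[")          # '[' means inclusive lower bound
    hi_inc = endswith(s, "]")            # ']' means inclusive upper bound
    lo_s, hi_s = strip.(split(chop(s; head=1, tail=1), ","))
    parseval(t) = t == "∞" ? Inf : t == "-∞" ? -Inf : parse(Float64, t)
    lo, hi = parseval(lo_s), parseval(hi_s)
    lo_op = lo_inc ? :(>=) : :(>)
    hi_op = hi_inc ? :(<=) : :(<)
    name = string(var)
    quote
        v = $(esc(var))
        ($lo_op)(v, $lo) && ($hi_op)(v, $hi) ||
            throw(ArgumentError(string($name, " must lie in ", $s)))
    end
end
```

Usage would then look like `@restrict_range lr "[0, ∞)"`: with this sketch, `lr = 0.0` passes (the bracket makes the bound inclusive) while `lr = -1.0` throws an `ArgumentError`.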
I merged this in 97ecb26 (though it's not active yet, given that I need to make some tweaks for the big refactor).
I tried to add some optimizers. As for the definition of a model in src/model.jl line 8: optimizers are essentially models, which have the signature `(param, grad) -> delta` along with their internal state. So we can implement them just as models, using `@net`. This PR does exactly that. The only problem is that we cannot use the same `back!` and `update!` to update optimizers. Currently I introduced two APIs called `axpy!` and `zero!` and found them enough to update almost all common optimizers. Maybe there is a better way to design this. A definition of an optimizer looks like this:
And training a model using optimizers (SGD with momentum and decay) looks like this:
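(The original code block isn't shown in this excerpt. As a rough, self-contained stand-in for what one step of such training could look like — all names below are illustrative assumptions, not the PR's API:)

```julia
# Illustrative only: one training step with SGD plus momentum and decay,
# folding weight decay into the gradient before the momentum update.
mutable struct MomentumSGD
    η::Float64           # learning rate
    ρ::Float64           # momentum
    wd::Float64          # weight decay
    v::Vector{Float64}   # velocity state
end

function step!(p::Vector{Float64}, g::Vector{Float64}, o::MomentumSGD)
    g   = g .+ o.wd .* p       # weight decay contributes to the gradient
    o.v .= o.ρ .* o.v .+ g     # momentum accumulation
    p  .-= o.η .* o.v          # apply the resulting delta
    return p
end

p = [1.0]
o = MomentumSGD(0.1, 0.9, 0.01, zeros(1))
step!(p, [0.5], o)   # p ≈ [0.949]
```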
Currently these work on the MXNet backend. What are your opinions of this approach?