Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[NewOptimizer] make optimisers not terrible #283

Closed
wants to merge 1 commit into from
Closed

Conversation

MikeInnes
Copy link
Member

@MikeInnes MikeInnes commented Jun 4, 2018

I'm opening this early on to get some API feedback. This looks something like:

x = rand(...)
delta = similar(x)
opt = Momentum(0.1)
delta = update!(opt, x, delta)
x .-= delta
opt.eta = 0.2 # alter learning rate
opt2 = Compose(InvDecay(...), ADAM(...))
# etc

The key benefits of this design are:

  • Learning rates can be changed after initialisation
  • Easier to create combinations of decay with other optimisers
  • Parameters and optimisers are not tied together (better for distributed) – and params can be created on-the-fly
  • Generally clearer what's going on compared to the current closure nest

Small thing, but to make some slackers happy I'm proposing to rename SGD to Descent, any thoughts?

It would be nice if we could write SGD+Momentum as a combination of those two optimisers, but while equivalent it would change the meaning of those parameters slightly. I suspect it's more important to be consistent with other frameworks here, but would take other opinions.

I expect that this is going to require changing training syntax to

Flux.train!(model, params, optimiser, callback)

or something equivalent. I'm not stoked about the extra parameter to train!, but it's really just making explicit what was there already. We could technically support the current-style API Descent(params, ...) but I'm not sure it's semantically justified (and isn't great for things like distributed training).

Also need to figure out how to not break everyone's code here, which will probably require re-implementing the current API temporarily anyway.

@datnamer
Copy link

datnamer commented Jun 4, 2018

https://github.com/JuliaML/LearningStrategies.jl Has composable optimizers

@staticfloat
Copy link
Contributor

If you're going to rename SGD, I would follow TensorFlow's lead and call it GradientDescent. I don't care that it's called SGD, but really the problem is the S, not the G, right? ;)

Generally clearer what's going on compared to the current closure nest

This is the biggest motivator for me; while it's neat to have everything implicit, there are large benefits to readability and debugability to making things explicit like this, so 👍 from me.

I expect that this is going to require changing training syntax

TBH I have 3-4 different versions of training harnesses floating around within my largest ML codebase. Without a more opinionated training harness, I would guess that most users of Flux end up using a customized version. (E.g. something that passes epoch numbers to callbacks, provides support for regularizers by passing in a list of parameters, supports multiple outputs and multiple loss functions for adversarial networks, etc...)

@MikeInnes
Copy link
Member Author

Included in #379.

@MikeInnes MikeInnes closed this Feb 22, 2019
@CarloLucibello CarloLucibello deleted the newoptimizer branch April 7, 2022 07:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants