[NewOptimizer] make optimisers not terrible #283

MikeInnes · 2018-06-04T15:17:16Z

I'm opening this early on to get some API feedback. This looks something like:

x = rand(...)
delta = similar(x)
opt = Momentum(0.1)
delta = update!(opt, x, delta)
x .-= delta
opt.eta = 0.2 # alter learning rate
opt2 = Compose(InvDecay(...), ADAM(...))
# etc

The key benefits of this design are:

Learning rates can be changed after initialisation
Easier to create combinations of decay with other optimisers
Parameters and optimisers are not tied together (better for distributed) – and params can be created on-the-fly
Generally clearer what's going on compared to the current closure nest

Small thing, but to make some slackers happy I'm proposing to rename SGD to Descent, any thoughts?

It would be nice if we could write SGD+Momentum as a combination of those two optimisers, but while equivalent it would change the meaning of those parameters slightly. I suspect it's more important to be consistent with other frameworks here, but would take other opinions.

I expect that this is going to require changing training syntax to

Flux.train!(model, params, optimiser, callback)

or something equivalent. I'm not stoked about the extra parameter to train!, but it's really just making explicit what was there already. We could technically support the current-style API Descent(params, ...) but I'm not sure it's semantically justified (and isn't great for things like distributed training).

Also need to figure out how to not break everyone's code here, which will probably require re-implementing the current API temporarily anyway.

datnamer · 2018-06-04T19:55:13Z

https://github.com/JuliaML/LearningStrategies.jl Has composable optimizers

staticfloat · 2018-06-08T17:47:35Z

If you're going to rename SGD, I would follow TensorFlow's lead and call it GradientDescent. I don't care that it's called SGD, but really the problem is the S, not the G, right? ;)

Generally clearer what's going on compared to the current closure nest

This is the biggest motivator for me; while it's neat to have everything implicit, there are large benefits to readability and debugability to making things explicit like this, so 👍 from me.

I expect that this is going to require changing training syntax

TBH I have 3-4 different versions of training harnesses floating around within my largest ML codebase. Without a more opinionated training harness, I would guess that most users of Flux end up using a customized version. (E.g. something that passes epoch numbers to callbacks, provides support for regularizers by passing in a list of parameters, supports multiple outputs and multiple loss functions for adversarial networks, etc...)

MikeInnes · 2019-02-22T15:58:35Z

Included in #379.

initial sketch

f1ec0c6

This was referenced Jun 5, 2018

Tracker.grad on a non-param StackOverflows #284

Closed

Optimisers make me sad #234

Closed

willtebbutt mentioned this pull request Jul 26, 2018

Add L-BFGS to Turing TuringLang/Turing.jl#421

Closed

DhairyaLGandhi mentioned this pull request Aug 28, 2018

New optimisers interface #379

Merged

MikeInnes force-pushed the master branch from 65799d1 to 193c4de Compare September 5, 2018 15:53

MikeInnes closed this Feb 22, 2019

CarloLucibello deleted the newoptimizer branch April 7, 2022 07:02

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[NewOptimizer] make optimisers not terrible #283

[NewOptimizer] make optimisers not terrible #283

MikeInnes commented Jun 4, 2018 •

edited

Loading

datnamer commented Jun 4, 2018

staticfloat commented Jun 8, 2018

MikeInnes commented Feb 22, 2019

[NewOptimizer] make optimisers not terrible #283

[NewOptimizer] make optimisers not terrible #283

Conversation

MikeInnes commented Jun 4, 2018 • edited Loading

datnamer commented Jun 4, 2018

staticfloat commented Jun 8, 2018

MikeInnes commented Feb 22, 2019

MikeInnes commented Jun 4, 2018 •

edited

Loading