Optimisers make me sad #234

Closed
MikeInnes opened this issue Apr 14, 2018 · 3 comments

Comments

MikeInnes (Member) commented Apr 14, 2018

I've been meaning to redesign our optimisers for a while, but I figured I'd write down some thoughts in case anyone else wants to take a crack at the design.

The main challenge here is keeping optimisers composable. This means that concepts like SGD, momentum and "SGD with momentum" should all be the same kind of object, and we should be able to construct the latter easily from the first two, like optimiser(SGD(η), Momentum(ρ)) (or perhaps SGD(η, Momentum(ρ))).

What makes this slightly difficult is that the update step x -= Δx should be applied once after all the other optimisers. One easy way to solve this is to make that update explicit (optimiser(SGD(η), Momentum(ρ), Update())), but this is fairly bad for convenience. Another way is to make the update itself part of the training loop, but then it can't be adjusted by an optimiser (maybe that's something we don't need to support anyway?).
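A minimal sketch of that composition, with entirely hypothetical names (apply!, optimiser and Update are illustrations, not an existing API): each piece transforms the gradient in turn, and the explicit Update() step applies x .-= Δ last.

```julia
struct SGD
  η::Float64
end
apply!(o::SGD, x, Δ) = Δ .* o.η        # scale the gradient by the learning rate

struct Update end
apply!(::Update, x, Δ) = (x .-= Δ; Δ)  # the actual parameter update, applied last

struct Optimiser
  os::Vector{Any}
end
optimiser(os...) = Optimiser(collect(os))

function apply!(o::Optimiser, x, Δ)
  for opt in o.os
    Δ = apply!(opt, x, Δ)              # each step rewrites or replaces the gradient
  end
  return Δ
end

# "SGD with momentum" then composes as, e.g.
#   apply!(optimiser(Momentum(ρ), SGD(η), Update()), x, Δ)
```

A Momentum step would slot into the same chain; one way it could store its state is sketched further down.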

Some other concerns that I think are less hard to address: implementing new optimisers should be easy (~5 lines), parameters should be adjustable (#233), and the update should be fast. Also, we can't assume in future that gradients will be updated in-place, as we will likely have a more functional AD (#86). See also #228, #106.

Edit: People with knowledge have informed me that making the update part of the train loop would be OK. What I'm thinking in that case is that an optimiser takes a param and grad, and updates the grad in place. If it needs to store state, it can use the param itself as an ID. I think that addresses everything we need.
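A sketch of that revised plan (hypothetical names again): the optimiser never touches the parameter itself; it rewrites the gradient in place and keeps any per-parameter state in an IdDict keyed on the param array, while the train loop owns the final x .-= Δ.

```julia
struct Momentum
  η::Float64
  ρ::Float64
  velocity::IdDict{Any,Any}               # per-parameter state, keyed by the param
end
Momentum(η = 0.01, ρ = 0.9) = Momentum(η, ρ, IdDict())

function apply!(o::Momentum, x, Δ)
  η, ρ = o.η, o.ρ
  v = get!(() -> zero(x), o.velocity, x)  # the param array itself acts as the ID
  @. v = ρ * v + η * Δ
  @. Δ = v                                # rewrite the gradient in place
  return Δ
end

# In the train loop, for each param x and its gradient Δ:
#   apply!(opt, x, Δ)
#   x .-= Δ
```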

jekbradbury (Contributor) commented
> we can't in future assume that gradients will be updated in-place

We were just talking in Slack about how in-place gradient updates are actually two separate optimizations: avoiding the temporary allocation of an update array, and fusing the loops of the optimizer expression and the update operation. Few if any deep learning frameworks do the latter (what in Julia would simply be new_param .-= eta .* grad), and the update doesn't usually account for a significant fraction of training time, but all frameworks I'm familiar with include the first optimization for memory reasons. Often (not always) the parameters represent a major chunk of the network's memory usage; if they do you really only want to have one copy each of parameters and gradients live at any given time, so you have to do the update itself in-place.
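In Julia the two optimisations show up directly in how the update is written (a toy illustration):

```julia
η     = 0.01
param = randn(10^6)
grad  = randn(10^6)

# Out-of-place update: the broadcast fuses into one loop, but it allocates a
# fresh result array, so two full copies of the parameters are briefly live.
param = param .- η .* grad

# In-place, fused update: a single loop writing straight into `param`, with no
# temporary allocation at all.
param .-= η .* grad
```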

MikeInnes (Member, Author) commented

#283

MikeInnes (Member, Author) commented

Optimisers no longer make me sad! They are not perfect though, so see #637.
