
Move all optimizers to Optimisers.jl #9

Merged (15 commits into FluxML:master, Feb 8, 2021)

Conversation

@darsnack (Member) commented Jan 29, 2021:

This PR finishes the already good work in this repo to move all the optimizers from Flux to Optimisers.jl. The main advantage of the implementation in Optimisers.jl is the use of functors and explicit state. This eliminates the need for any IdDicts and simplifies how compositions of basic optimizers work.

One notable exception: this implementation is mutable. The simple reason for this is that immutable updates will copy entire models and result in high memory usage. Until we have a compiler optimization to address this, this PR opts to have mutable updates for now. You can read more about the reasoning in this design note.
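For illustration, a minimal sketch of the explicit-state pattern this PR moves to. The Descent rule and the state/update! names appear in the diff excerpts quoted in this thread; the constructor signature and the exact return convention of update! are assumptions, not statements of the final API.

# Sketch only: constructor shown with a positional learning rate purely for
# illustration; the return convention (new array plus new state) is assumed.
opt = Descent(0.1)
x   = rand(2)
st  = state(opt, x)                 # explicit state, initialised once (no IdDict)

dx = 2 .* x                         # gradient of sum(x.^2), written out by hand
x, st = update!(opt, x, dx, st)     # the updated state is returned and passed to the next step
x, st = update!(opt, x, 2 .* x, st)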

Remaining tasks:

  • docstrings
  • clean up tests

cc @DhairyaLGandhi @CarloLucibello @mcabbott @ToucheSir

src/rules.jl Outdated

function (o::Descent)(m, m̄)
    update(o, m, m̄, state(o, m))[1]
end

(o::Descent)(m, dm, st) = update!(o, m, dm, st)
A Member commented:

Following from FluxML/FluxML-Community-Call-Minutes#22 (comment), I'm not sure we should make these structs callable at all. Given that none of these methods does anything more than forward to update(!) with the optimizer object itself, why not get rid of them altogether?

@darsnack (Member Author) replied:

I agree, but I left them in the PR so we could discuss and make a decision.

@darsnack (Member Author) replied:

It also (wrongly) makes it look like the optimizer's state is completely internal, even though this call will still return an updated state.

@darsnack (Member Author) commented:

Sorry for the mass cc's, but I figured this was an important enough change that we should have a full discussion.

Review threads on src/rules.jl and src/interface.jl (outdated, resolved)
@DhairyaLGandhi (Member) left a comment:

Thanks so much for looking into this!

I haven't gone through the entire PR yet, but with Optimisers.jl we can't do in-place updates, and it has a slightly different design to allow things like XLA (and even the GPU compilation work in Julia) to do more aggressive optimisations.

Review threads on src/Optimisers.jl, src/interface.jl, and src/rules.jl (resolved)
@darsnack (Member Author) commented:

Okay but I don't want to halt other progress until XLA/GPUCompiler is ready. We could always support both in-place and immutable updates for now.

@darsnack (Member Author) commented:

I'm happy to switch this PR to be completely immutable if we already have something ready to address the allocation issue.

@DhairyaLGandhi (Member) commented:

Supporting both doesn't seem worth it, and the work will immediately be useful in 1.6, so it's not halting progress there. We should be alright

@darsnack (Member Author) commented:

Do you mean optimization passes through AbstractInterpreter?

@DhairyaLGandhi (Member) commented Jan 29, 2021:

No, I mean the GPU codegen work that has already been merged into Base.

@darsnack (Member Author) commented:

Sorry, do you mind linking PRs? So this will make all the out-of-place updates mutating as an optimization for code involving CuArrays?

@DhairyaLGandhi (Member) commented:

There isn't one PR directly linked with it, unfortunately. I guess I could link to the recent GPU work in the language. It's not something that will enable universally better code, I don't think, but it fits in with the programming model of GPUs, where you do end up wanting to free up memory as soon as possible.

Allocations can be improved just as much by improving the function calls, so this isn't inherently something to hold this up on.

Review threads on src/rules.jl (outdated, resolved)
@CarloLucibello (Member) commented Jan 29, 2021:

This is breaking the current convention of passing the lr as the first positional argument; it is going to break basically all existing code. Did we already have some discussion on that? Also, having a unicode keyword argument that is used so often is not ideal. I'm not opposing the change, but we have to be careful about it; maybe do it somewhere else.
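For context, an illustrative contrast between the two constructor styles being discussed (both signatures below are assumptions for illustration, not quotes from the diff):

opt = Descent(0.1)       # current Flux convention: learning rate as the first positional argument
opt = Descent(η = 0.1)   # the keyword style referred to above, using a unicode keyword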

@DhairyaLGandhi (Member) commented:

I would leave out the changes in interface.jl for the purposes of this PR.

It may make things faster to model structure as in #3

@darsnack (Member Author) commented:

> This is breaking the current convention of passing the lr as the first positional argument; it is going to break basically all existing code. Did we already have some discussion on that?

No, and sorry about that! TBH, in all my time using Flux I never realized that the first argument was always the LR. I always end up looking it up, because it is easier to double-check than to have my training not work because I set the momentum instead of the LR. Hence the change to keyword args, where it is easier to remember. That being said, I have no strong feelings on this change and am happy to go back to avoid breaking things.


The changes in interface.jl are mostly the mutating stuff. I feel like we should discuss this, because the key feature of functional optimizers is explicit state, not immutability. If there is something ready to roll out in a few weeks that makes immutable updates mutating as an optimization, then I'll just drop the changes to interface.jl and make everything immutable.

But if it will be months before we get such features ready for use, then I would strongly recommend we consider keeping mutability right now. At the end of the day, there is no world in which we can have optimizers that make copies of all the model parameters/gradients (even on Flux#master). It seems silly to not provide what we think is a better optimizer and scheduling just because we also want feature X. Why not make feature X part of a later PR if it will be ready later?


Re: ParameterSchedulers.jl: @DhairyaLGandhi you mentioned you wanted to discuss here. What are the design assumptions that need to be addressed? We can fix ParameterSchedulers.jl to account for them.

@DhairyaLGandhi (Member) commented:

> explicit state, not immutability

Both are.

> be months before we get such features ready for use, then I would strongly recommend we consider keeping mutability right now

Not sure which features you're talking about here? But I'm pretty sure immutable updates are an explicit design here, and we would maintain that.

Re: ParameterSchedulers.jl: apologies, I meant that we should move the discussion and implementation from FluxML/Flux.jl#1487 to FluxML/Flux.jl#1481, and (as a follow-up) the scheduling interface should be an implementation of the optimisers + training without needing additional handling on either side. In other words, we need to generalise the interface around these methods to allow for scheduling.

@DhairyaLGandhi (Member) commented:

Either way, the optimisation around mutability is orthogonal, and would not need weeks if it's required urgently.

@darsnack (Member Author) commented:

> Not sure which features you're talking about here?

I'm talking about the feature where x, st = update(o, x, dx, st) does not make copies of x or dx even though update is immutable.

> the scheduling interface should be an implementation of the optimisers + training without needing additional handling on either side. In other words, we need to generalise the interface around these methods to allow for scheduling.

The explicit state is that generalization. That was the purpose of this PR.

Based on this discussion, I think we have two options: either make a PR to Flux.jl itself that makes all the optimizers use explicit state and leave Optimisers.jl alone, or merge some version of this PR into Optimisers.jl, with the work around mutability left for the future.

@DhairyaLGandhi (Member) commented:

Also, I just pushed some code I had implemented some time back for the remaining optimisers in dg/adams, which we can merge; it's tested and everything.

@DhairyaLGandhi (Member) commented:

> The explicit state is that generalization. That was the purpose of this PR.

Agreed, but it's incomplete in and of itself. As an example, making some information around training available to the scheduler would be very helpful.

> Either make a PR to Flux.jl itself that makes all the optimizers use explicit state

Maybe I'm not following this properly, apologies if I am, but I'm pretty sure that Optimisers.jl is that.

@darsnack (Member Author) commented:

> As an example, making some information around training available to the scheduler would be very helpful.

What information would that be? Scheduling policies require the current iteration, which is easily specified through the explicit state.

> Maybe I'm not following this properly, apologies if I am, but I'm pretty sure that Optimisers.jl is that.

That's what I'm trying to say!

As a requirement for doing scheduling right, it became clear that the optimizers needed to move to explicit state. So I opened the discussion topic and then prepared this PR in light of that discussion. This PR follows the design principles of Optimisers.jl in using explicit state. What isn't required (right now) is immutability. So I am suggesting we save the immutability design goal for later, so that we can have explicit-state optimizers in Flux now.

@DhairyaLGandhi (Member) commented:

> What information would that be?

The epoch of training, for example.

Immutability is part of the design and interface, and I'd keep that. I would much prefer an issue to follow up rather than block the current work.

@darsnack (Member Author) commented:

> The epoch of training, for example.

The epoch of training wouldn't necessarily be information in Flux.train!. In any case, ParameterSchedulers.jl already has usage patterns in place when you want to schedule by epoch, iteration, or some mixture. We can save the discussion on scheduling for later maybe.

> Immutability is part of the design and interface, and I'd keep that. I would much prefer an issue to follow up rather than block the current work.

That's why I suggested that if we can't budge on immutability in Optimisers.jl, then I can submit a PR to Flux.jl that implements explicit state but not immutability.

Review thread on src/rules.jl (outdated, resolved)
@CarloLucibello (Member) commented:

This looks fine, except for the fact that the composite optimizer should be renamed.
I'm worried about the breakage, though. Everyone's code with custom training loops (i.e. everything using update!, like all my Flux code) is going to be broken. Of minor importance, custom optimizers overloading apply! will fail.
Do we have a deprecation path?
Maybe we should temporarily add a state field to each optimizer here so that we can repurpose update! in Flux while deprecating it at the same time.

@DhairyaLGandhi (Member) commented:

All good concerns, thanks!

I think the naming is not the end of the world, but I would definitely prefer a better name. I don't think we want to add a state field, though; this is a different design, and we shouldn't be too worried about breakages that can be managed elsewhere (see the corresponding PR to Flux where I have updated the commented-out update!).

We should definitely come up with a smoother breakage path though, although I'm not sure we can get around everything.

@darsnack (Member Author) commented Feb 3, 2021:

AFAICT, even with the commented update! in FluxML/Flux.jl#1481, there is still a breaking change since the explicit state must be passed into update!.

I don't think there is really any clean way around it. The only way to avoid passing around the state would be to store it in some structure. The only structures available to update! are the optimizer, Params, and Grads. Params/Grads are clearly poor candidates, leaving only the optimizer, as @CarloLucibello suggested.

We could do something like

struct OptAndState{O, S}
  opt::O
  state::S
end

Then use @forward to forward methods to the wrapped optimizer. All this code would be in Flux.jl, not Optimisers.jl, so we can save the discussion for FluxML/Flux.jl#1481. Can we agree that the soft-breakage code doesn't belong in this repo?

@ToucheSir (Member) commented:

Will update! and Optimiser get deprecation warnings as part of this bringup?

@CarloLucibello (Member) commented:

@darsnack this is a basic script that I'd like to see still working after the change:

using Flux
using Flux.Optimise

opt = ADAM(0.1)
x = rand(2)

for t=1:10
    g = gradient(x -> sum(x.^2), x)
    update!(opt, x, g[1])
    opt.eta /= 2
end

If we keep this PR as it is, in order to provide a deprecation path in Flux we would have to create a parallel family of optimizers, ADAM, Nesterov, etc... where

struct ADAM
   opt::Optimisers.ADAM
   state   
end

then @forward a few methods and also overload Base.setproperty! and Base.getproperty. This could be done, but wouldn't it be more practical to just add a state field to the optimizers here?

@DhairyaLGandhi (Member) commented:

In that case I'd rather just have a breaking change.

@darsnack (Member Author) commented Feb 4, 2021:

Okay, renamed to ChainOptimiser from SequenceOptimiser, as was agreed on the last ML community call.

I disagree with a breaking change. This is such a fundamental piece of the framework that we shouldn't play fast and loose with people's code.

@CarloLucibello I think we could put deprecation-path code in this repo in a flux.jl file, but I still feel that the code belongs in Flux.jl itself. Materially, I don't think it matters too much, since Flux.jl will probably be the only downstream user of Optimisers.jl. But I don't think any users of Optimisers.jl should be allowed to use implicit state. It was never part of the package, so I think we shouldn't introduce it only to deprecate it. The only users who can expect implicit state are Flux.jl users, so that's where I think the deprecation path belongs.

If there is a concern about code complexity, then I think something like the following should resolve it:

for Opt in (:Descent, :ADAM) # the real code would include all the rules
  @eval begin
    mutable struct $Opt{T}
      opt::Optimisers.$Opt
      state::T
    end

    # forward property access to the wrapped rule; getfield avoids recursing
    # through the getproperty overload itself
    Base.getproperty(o::$Opt, p::Symbol) = getproperty(getfield(o, :opt), p)
    Base.setproperty!(o::$Opt, p::Symbol, v) = setproperty!(getfield(o, :opt), p, v)

    function apply!(o::$Opt, x, dx)
      dx, s = apply(getfield(o, :opt), x, dx, getfield(o, :state))
      setfield!(o, :state, s)

      return dx
    end
  end
end

@CarloLucibello (Member) commented:

OK, we'll cook something like that in Flux, handling init as well.

Two more things (feel free to ignore both):

  • the "chain" in ChainOptimizer feels like a verb; maybe OptimizersChain or OptChain would be better?
  • can anyone give or point me to a very short summary of the advantages of the external-state approach compared to the IdDict one? One disadvantage I see is that having to init the state and pass it around is slightly annoying.

@DhairyaLGandhi (Member) commented:

I don't think we would play fast and loose; instead we would deprecate the current optimizers as a whole, pointing to the new explicit-state approach, and handle both of them inside Flux for a technically breaking change. That way we can preserve custom loops for one more release cycle.

@darsnack (Member Author) commented Feb 4, 2021:

What about ChainedOptimiser? Happy to do OptimiserChain too.

There are probably more reasons than I list here for explicit state, so @ToucheSir and @DhairyaLGandhi can chime in:

With IdDicts (or any form of implicit state), the state is an implementation detail instead of part of the interface. This is an issue for people trying to extend the interface, since using state is a requirement. For example, if you are writing a scheduler, you might do:

mutable struct Scheduler{S, O}
  opt::O
  schedule::S
  iter::Int
end

function apply!(o::Scheduler, x, dx)
  o.opt.eta = o.schedule[o.iter]
  o.iter += 1
  return apply!(o.opt, x, dx)
end

This seems like a perfectly reasonable and logical thing to write, but it won't work unless o is optimizing exactly one array x. This is because the state which is associated with each x that o is applied to is not accounted for anywhere. This can be rectified by replacing iter with an IdDict and doing o.iter[x], but it wouldn't be obvious to anyone extending the interface at first glance.

So, state is a requirement for almost every optimizer. Aside from a complicated paragraph in the docs, the best way to make this clear to users is to make state part of the interface, since it is a requirement and not an implementation detail. It also allows anyone writing an optimizer to write the natural code (w.r.t. a single x), while we manage passing the state around for complicated data structures.
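For comparison, a hypothetical sketch of the same scheduler written against an explicit-state interface. The init/apply names and the returned-state convention are assumptions based on this thread, not the actual Optimisers.jl code:

struct Scheduler{O, S}
  opt::O
  schedule::S
end

# the iteration counter lives in the state that is passed in and returned,
# so a single Scheduler can drive any number of arrays
init(o::Scheduler, x) = (t = 1, opt = init(o.opt, x))

function apply(o::Scheduler, x, dx, st)
  o.opt.eta = o.schedule[st.t]             # assumes a mutable wrapped rule, as in this PR
  dx, opt_st = apply(o.opt, x, dx, st.opt)
  return dx, (t = st.t + 1, opt = opt_st)
end

Compared with the IdDict version, nothing here depends on the identity of x; all per-array bookkeeping lives in the state that is passed in and returned.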

I think, largely, the benefits are that this is a clearer, more extensible design. It should make for safer code with easier-to-detect errors. There's also the XLA side of this, where explicit state and immutability can allow compiler optimizations, but those are further down the line.

@darsnack (Member Author) commented Feb 4, 2021:

It also means that the code for an optimizer is stable. When we want to special case how we handle the multiple states associated with a data structure, it only needs to be done once, and none of the optimizer rules need to change their code.

@darsnack (Member Author) commented Feb 5, 2021:

I think everything should be resolved in the latest commit. I was convinced that OptimiserChain was the best name.

Docstrings are in place, but we should make setting up the docs a separate PR. Tests are completely ported from Flux.jl.

@CarloLucibello (Member) commented Feb 7, 2021:

@DhairyaLGandhi can I have write access to this repo? (actually, if it could be made org-wide it would be much more convenient)

@darsnack (Member Author) commented Feb 8, 2021:

Bump: can we get write access for this repo to match the org? I think this can be merged.

@DhairyaLGandhi merged commit 638ace1 into FluxML:master on Feb 8, 2021.
@DhairyaLGandhi (Member) commented:

Thanks!
