
Discriminative learning rates #35

Closed
lorenzoh opened this issue Apr 11, 2021 · 9 comments
Assignees: lorenzoh
Labels: enhancement (New feature or request), fastai-parity (Feature that fastai has), numfocusgrant (Part of @lorenzoh's NumFOCUS grant)

Comments

@lorenzoh
Member

Discriminative learning rates means using different learning rates for different parts of a model, so-called layer groups. This is used in fastai when finetuning models.

@lorenzoh lorenzoh self-assigned this Apr 11, 2021
@lorenzoh lorenzoh added the enhancement, fastai-parity, and numfocusgrant labels Apr 11, 2021
@lorenzoh
Member Author

lorenzoh commented Apr 11, 2021

As practiced in fastai (see fastai._BaseOptimizer.set_hyper, fastai.fine_tune), this can usually be reduced to one absolute learning rate, with different parts of the model trained at that learning rate times a constant factor. For example, when finetuning, the pretrained backbone might be trained at a tenth of that learning rate while the randomly initialized head of the model uses the full learning rate.
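
To make this concrete, here is a tiny numeric sketch (the base learning rate and the factors are made-up values):

# Made-up values: one absolute learning rate plus a constant factor per layer group.
base_lr = 1e-3
factors = (backbone = 0.1, head = 1.0)

# Effective learning rate for each layer group.
effective_lrs = map(f -> f * base_lr, factors)
# (backbone = 0.0001, head = 0.001)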

Sorting through the Flux.jl optimizer documentation and its source, I think this can be reduced to a wrapper optimizer that scales the gradient by the factor associated with a parameter. Leaving the model splitting and constructor aside, the gradient-rescaling step (Flux's apply! hook, which update! calls internally) could look something like this:

import Flux.Optimise: apply!

mutable struct DiscriminativeLR
    factors::IdDict   # maps a parameter array to its learning-rate factor
end

# Rescale the gradient of `x` by its factor; parameters without an entry
# default to a factor of 1 and are left untouched.
function apply!(o::DiscriminativeLR, x, Δ::AbstractArray{T}) where T
    factor = convert(T, get(o.factors, x, 1))
    if factor != 1
        @. Δ *= factor
    end
    return Δ
end

An optimizer could then be adapted to use discriminative learning rates as follows:

optim = Optimiser(DiscriminativeLR(...), ADAM())
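
For the common Chain(backbone, head) case, the factors could be assembled along these lines (a rough sketch; the concrete layers and the 0.1 factor are placeholders, and this constructor code is not part of the proposal above):

using Flux

backbone = Chain(Dense(10, 32, relu))   # stands in for a pretrained backbone
head = Dense(32, 2)                     # randomly initialized head
model = Chain(backbone, head)

# Give every backbone parameter a reduced factor; head parameters fall back
# to the default factor of 1 and so train at the full learning rate.
factors = IdDict{Any,Number}()
for p in Flux.params(model[1])
    factors[p] = 0.1
end

optim = Flux.Optimise.Optimiser(DiscriminativeLR(factors), ADAM())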

Now, this implementation requires an extra multiplication for every parameter, but I'm not sure whether that makes a noticeable performance difference in practice, and the same is already done in the implementation of ExpDecay.

@darsnack, I recall you being involved with the Flux.jl optimizers, what do you think about this approach?

@ToucheSir
Member

This looks reasonable and wouldn't be too difficult to port to https://github.com/FluxML/Optimisers.jl either, but the real challenge is how to generate the factors for the optimizer in the first place. Fastai has a concept of param groups generated by a splitting function, but after working with it I'm not convinced that's the best approach (it requires deep knowledge of the model structure, where you want to split, etc.). This will likely require a couple of design iterations :)

@lorenzoh
Member Author

I totally agree that that's the hard part 😅 Splitting models into groups will also be useful for other things like freezing and advanced visualizations, so it's definitely important to find a nice API for it. I think we should start with the basic but common case where you have a Chain and index ranges that define param groups/layer groups. When fine-tuning (which is the main use case for discriminative learning rates, as far as I can tell) a model created by methodmodel, you will usually have a 2-element Chain(backbone, head), where this works great. For now, in finetune! we could check for a 2-element Chain and use its elements as the groups, or else tell the user to pass in a splitter (e.g. indices).

And then, as you said, iterate on how a splitter is represented and on what other ways of splitting besides indices might be useful. Does that sound reasonable? A rough sketch of the index-based case follows below.
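
To sketch the index-based splitting (grouparams is a hypothetical helper, not an existing API; it assumes model is a 2-element Chain(backbone, head) as in the earlier example and reuses the DiscriminativeLR wrapper from above):

using Flux

# Hypothetical helper: map index ranges of a Chain to learning-rate factors
# and collect them into the IdDict that DiscriminativeLR expects.
function grouparams(model::Chain, groups::Pair...)
    factors = IdDict{Any,Number}()
    for (indices, factor) in groups
        for p in Flux.params(model[indices])
            factors[p] = factor
        end
    end
    return factors
end

# Backbone at a tenth of the learning rate, head at the full rate.
factors = grouparams(model, 1:1 => 0.1, 2:2 => 1.0)
optim = Flux.Optimise.Optimiser(DiscriminativeLR(factors), ADAM())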

@darsnack
Member

Implementation-wise, this is the correct way to do this for the current state of Flux's optimizers. The IdDict approach is simple to port to FluxML/Optimisers.jl too.

That being said, I don't like the idea of something called "discriminative LR" using the optimizer interface (similarly, I advocate for ExpDecay and friends to no longer be provided as "optimizers"). The optimizer interface is great for optimizers, but not good for hyper-parameter schedules. So, it makes sense to push all the hyper-parameter-related code into a schedule/hyper-parameter interface.

Once we move to FluxML/Optimisers.jl, optimizer instances will be cheap. So, I think a good API might include something like a GroupOptimiser that looks like:

opt = GroupOptimiser(
    Scheduler(Exp(λ = 1e-3, γ = 0.75), ADAM()) => params(model[1]), # backbone parameters
    Scheduler(Exp(λ = 1e-4, γ = 0.5), ADAM()) => params(model[2]),  # head parameters
)

Of course, this is thinking about it from the vanilla perspective and not specifically FastAI.jl. I haven't entirely figured out how GroupOptimiser would iterate over the parameter groups and optimizers. Certainly, if there were a broader interface for parameter groups, this would be easier.
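
For illustration only, one naive way such a GroupOptimiser could iterate its groups under the current Flux.Optimise interface (the Pair-based layout and the two-argument update! method are assumptions, not a settled design):

using Flux

# Hypothetical container: each entry pairs an optimizer (possibly wrapped in a
# scheduler) with the Params it should update.
struct GroupOptimiser
    groups::Vector{Pair}
end
GroupOptimiser(groups::Pair...) = GroupOptimiser(collect(groups))

# Apply each group's optimizer to that group's parameters only.
function Flux.Optimise.update!(o::GroupOptimiser, gs)
    for (opt, ps) in o.groups
        Flux.Optimise.update!(opt, ps, gs)
    end
end

# e.g. opt = GroupOptimiser(ADAM(1e-4) => params(model[1]), ADAM(1e-3) => params(model[2]))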

@darsnack
Member

Performance-wise, I think the extra multiply is negligible. So no need to worry about that.

@ToucheSir
Member

The issue with => params(model[1]) is that it necessitates the introduction of an IdDict again, which is exactly what we're trying to avoid with Optimisers.jl. That said, I do think the idea of a GroupOptimiser is interesting. One could imagine a mechanism like FluxTraining's stateaccess where you return a structure with the same schema as the model, but with schedulers at the places that should use a different learning rate.

@darsnack
Member

darsnack commented Apr 11, 2021

The issue with => params(model[1]) is that it necessitates the introduction of an IdDict again

Yeah, I agree. To use GroupOptimiser with minimal mental overhead for the user, I think you need an IdDict.

That's why I ended up suggesting working on a parameter group interface: something that lets us reason about collections of parameters as a structure, better than Params does.

return a structure with the same schema as the model, but with schedulers at areas that should use a different learning rate.

I think if we iterate on the "schema" here, then that might be what I'm going for with parameter groups.

@lorenzoh
Member Author

Okay, I see this is a larger topic than I thought, and it will take some time and effort across packages for it to stabilize. For now, I'll go with the "old" optimizer solution I suggested above, with support for splitting on indices only, and we can move to a better solution once things are ironed out and Optimisers.jl takes center stage.

@lorenzoh
Member Author

Now implemented and used in finetune!.

Computer vision (NumFOCUS development grant) automation moved this from In progress to Done May 13, 2021