
Discriminative learning rates #35

Closed
lorenzoh opened this issue Apr 11, 2021 · 9 comments
Assignees: lorenzoh
Labels: enhancement (New feature or request), fastai-parity (Feature that fastai has), numfocusgrant (Part of @lorenzoh's NumFOCUS grant)

Comments

@lorenzoh
Member

Discriminative learning rates means using different learning rates for different parts of a model, so-called layer groups. This is used in fastai when finetuning models.

@lorenzoh lorenzoh self-assigned this Apr 11, 2021
@lorenzoh lorenzoh added the enhancement, fastai-parity, and numfocusgrant labels Apr 11, 2021
@lorenzoh
Member Author

lorenzoh commented Apr 11, 2021

As practiced in fastai (see fastai._BaseOptimizer.set_hyper, fastai.fine_tune), this can usually be reduced to one absolute learning rate, with different parts of the model trained at that learning rate times a constant factor. For example, when finetuning, the pretrained backbone might be trained at a tenth of that learning rate while the randomly initialized head of the model uses the full learning rate.
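
To make this concrete, here is a tiny numeric sketch (the base learning rate and the factors are made-up values):

# Made-up values: one absolute learning rate plus a constant factor per layer group.
base_lr = 1e-3
factors = (backbone = 0.1, head = 1.0)

# Effective learning rate for each layer group.
effective_lrs = map(f -> f * base_lr, factors)
# (backbone = 0.0001, head = 0.001)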

Sorting through the Flux.jl optimizer documentation and its source, I think this can be reduced to a wrapper optimizer that scales the gradient by the factor associated with a parameter. Leaving the model splitting and constructor aside, the gradient-rescaling step (Flux's apply! hook, which update! calls internally) could look something like this:

import Flux.Optimise: apply!

mutable struct DiscriminativeLR
    factors::IdDict   # maps a parameter array to its learning-rate factor
end

# Rescale the gradient of `x` by its factor; parameters without an entry
# default to a factor of 1 and are left untouched.
function apply!(o::DiscriminativeLR, x, Δ::AbstractArray{T}) where T
    factor = convert(T, get(o.factors, x, 1))
    if factor != 1
        @. Δ *= factor
    end
    return Δ
end

An optimizer could then be adapted to use discriminative learning rates as follows:

optim = Optimiser(DiscriminativeLR(...), ADAM())
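
For the common Chain(backbone, head) case, the factors could be assembled along these lines (a rough sketch; the concrete layers and the 0.1 factor are placeholders, and this constructor code is not part of the proposal above):

using Flux

backbone = Chain(Dense(10, 32, relu))   # stands in for a pretrained backbone
head = Dense(32, 2)                     # randomly initialized head
model = Chain(backbone, head)

# Give every backbone parameter a reduced factor; head parameters fall back
# to the default factor of 1 and so train at the full learning rate.
factors = IdDict{Any,Number}()
for p in Flux.params(model[1])
    factors[p] = 0.1
end

optim = Flux.Optimise.Optimiser(DiscriminativeLR(factors), ADAM())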

Now, this implementation requires an extra multiplication for every parameter, but I'm not sure whether that makes a noticeable performance difference in practice, and the same is already done in the implementation of ExpDecay.

@darsnack, I recall you being involved with the Flux.jl optimizers, what do you think about this approach?

@ToucheSir
Member

This looks reasonable and wouldn't be too difficult to port to https://github.com/FluxML/Optimisers.jl either, but the real challenge is how to generate the factors for the optimizer in the first place. Fastai has a concept of param groups generated by a splitting function, but after working with it I'm not convinced that's the best approach (it requires deep knowledge of the model structure, where you want to split, etc.). This will likely require a couple of design iterations :)

@lorenzoh
Member Author

I totally agree that that's the hard part 😅 Splitting models into groups will also be useful for other things like freezing and advanced visualizations, so it's definitely important to find a nice API for it. I think we should start with the basic but common case where you have a Chain and index ranges that define param groups/layer groups. When fine-tuning (which is the main use case for discriminative learning rates, as far as I can tell) a model created by methodmodel, you will usually have a 2-element Chain(backbone, head), where this works great. For now, in finetune! we could check for a 2-element Chain and use its elements as the groups, or else tell the user to pass in a splitter (e.g. indices).

And then, as you said, iterate on how a splitter is represented and on what other ways of splitting besides indices might be useful. Does that sound reasonable? A rough sketch of the index-based case follows below.
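
To sketch the index-based splitting (grouparams is a hypothetical helper, not an existing API; it assumes model is a 2-element Chain(backbone, head) as in the earlier example and reuses the DiscriminativeLR wrapper from above):

using Flux

# Hypothetical helper: map index ranges of a Chain to learning-rate factors
# and collect them into the IdDict that DiscriminativeLR expects.
function grouparams(model::Chain, groups::Pair...)
    factors = IdDict{Any,Number}()
    for (indices, factor) in groups
        for p in Flux.params(model[indices])
            factors[p] = factor
        end
    end
    return factors
end

# Backbone at a tenth of the learning rate, head at the full rate.
factors = grouparams(model, 1:1 => 0.1, 2:2 => 1.0)
optim = Flux.Optimise.Optimiser(DiscriminativeLR(factors), ADAM())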

@darsnack
Member

Implementation-wise, this is the correct way to do this for the current state of Flux's optimizers. The IdDict approach is simple to port to FluxML/Optimisers.jl too.

That being said, I don't like the idea of something called "discriminative LR" using the optimizer interface (similarly, I advocate for ExpDecay and friends to no longer be provided as "optimizers"). The optimizer interface is great for optimizers, but not good for hyper-parameter schedules. So, it makes sense to push all the hyper-parameter-related code into a schedule/hyper-parameter interface.

Once we move to FluxML/Optimisers.jl, optimizer instances will be cheap. So, I think a good API might include something like a GroupOptimiser that looks like:

opt = GroupOptimiser(
    Scheduler(Exp(λ = 1e-3, γ = 0.75), ADAM()) => params(model[1]), # backbone parameters
    Scheduler(Exp(λ = 1e-4, γ = 0.5), ADAM()) => params(model[2]),  # head parameters
)

Of course, this is thinking about it from the vanilla perspective and not specifically FastAI.jl. I haven't entirely figured out how GroupOptimiser would iterate over the parameter groups and optimizers. Certainly, if there were a broader interface for parameter groups, this would be easier.
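
For illustration only, one naive way such a GroupOptimiser could iterate its groups under the current Flux.Optimise interface (the Pair-based layout and the two-argument update! method are assumptions, not a settled design):

using Flux

# Hypothetical container: each entry pairs an optimizer (possibly wrapped in a
# scheduler) with the Params it should update.
struct GroupOptimiser
    groups::Vector{Pair}
end
GroupOptimiser(groups::Pair...) = GroupOptimiser(collect(groups))

# Apply each group's optimizer to that group's parameters only.
function Flux.Optimise.update!(o::GroupOptimiser, gs)
    for (opt, ps) in o.groups
        Flux.Optimise.update!(opt, ps, gs)
    end
end

# e.g. opt = GroupOptimiser(ADAM(1e-4) => params(model[1]), ADAM(1e-3) => params(model[2]))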

@darsnack
Member

Performance-wise, I think the extra multiply is negligible. So no need to worry about that.

@ToucheSir
Member

The issue with => params(model[1]) is that it necessitates the introduction of an IdDict again, which is exactly what we're trying to avoid with Optimisers.jl. That said, I do think the idea of a GroupOptimiser is interesting. One could imagine a mechanism like FluxTraining's stateaccess where you return a structure with the same schema as the model, but with schedulers at the places that should use a different learning rate.

@darsnack
Member

darsnack commented Apr 11, 2021

The issue with => params(model[1]) is that it necessitates the introduction of an IdDict again

Yeah, I agree. To use GroupOptimiser with minimal mental overhead for the user, I think you need an IdDict.

That's why I ended up suggesting working on a parameter group interface: something that lets us reason about collections of parameters as a structure, better than Params does.

return a structure with the same schema as the model, but with schedulers at areas that should use a different learning rate.

I think if we iterate on the "schema" here, then that might be what I'm going for with parameter groups.

@lorenzoh
Member Author

Okay, I see this is a larger topic than I thought, and it will take some time and effort across packages for it to stabilize. For now, I'll go with the "old" optimizer solution I suggested above, with support for splitting on indices only, and we can move to a better solution once things are ironed out and Optimisers.jl takes center stage.

@lorenzoh
Member Author

Now implemented and used in finetune!.

Computer vision (NumFOCUS development grant) automation moved this from In progress to Done May 13, 2021