Discriminative learning rates #35

Discriminative learning rates mean using different learning rates for different parts of a model, so-called layer groups. This is used in fastai when fine-tuning models.
So, as practiced in fastai (see the fastai documentation on discriminative learning rates): after sorting through the Flux.jl optimizer documentation and its source, I think this can be reduced to a wrapper optimizer that discounts each parameter's gradient by an associated factor. Leaving the model splitting and constructor aside:

```julia
mutable struct DiscriminativeLR
    factors::IdDict
end

# Scale the gradient of `x` by its associated factor (default 1, i.e. unchanged).
# Implementing Flux's `apply!` lets this compose with other optimizers via `Optimiser`.
function Flux.Optimise.apply!(o::DiscriminativeLR, x, Δ::AbstractArray{T}) where T
    factor = convert(T, get(o.factors, x, 1))
    if factor != 1
        @. Δ *= factor
    end
    return Δ
end
```

An optimizer could then be adapted to use discriminative learning rates as follows:

```julia
optim = Optimiser(DiscriminativeLR(...), ADAM())
```

Now, this implementation requires an extra multiplication for every parameter, but I'm not sure whether that makes a large performance difference in practice, and I see the same being done in an existing implementation.

@darsnack, I recall you being involved with the Flux.jl optimizers; what do you think about this approach?
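For concreteness, here is a rough sketch of how the `factors` mapping could be constructed; the model and the 0.1 discount are made-up examples:

```julia
using Flux
using Flux.Optimise: Optimiser, ADAM

# Made-up model: layers 1-2 act as the backbone, layer 3 as the head.
model = Chain(Dense(10, 32, relu), Dense(32, 32, relu), Dense(32, 2))

# Discount the backbone's gradients by 0.1 so its effective learning
# rate is a tenth of the head's.
factors = IdDict()
for p in Flux.params(model[1:2])
    factors[p] = 0.1
end

optim = Optimiser(DiscriminativeLR(factors), ADAM(1e-3))
```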
This looks reasonable and wouldn't be too difficult to port to https://github.com/FluxML/Optimisers.jl either, but the real challenge is in how to generate the per-parameter factors, i.e. how to split a model into parameter groups in the first place.
I totally agree that that's the hard part 😅 Splitting models into groups will also be useful for other things like freezing and advanced visualizations, so it is definitely important to find a nice API for it. I think we should start with the basic but common case where you have a `Chain` and split its layers at given indices. And then, as you said, iterate on how a splitter is represented and what other ways of splitting besides indices might be useful. Does that sound reasonable?
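To make the index-based case concrete, here's a sketch; the helper name `splitat` is made up, and it assumes some `Chain` named `model`:

```julia
# Made-up helper: split a Chain's parameters into layer groups at the
# given indices, e.g. splitat(model, [3]) returns the parameters of
# layers 1:2 and of layers 3:end as two separate groups.
function splitat(model::Chain, indices)
    bounds = [1; indices; length(model) + 1]
    return [Flux.params(model[bounds[i]:(bounds[i+1] - 1)]) for i in 1:(length(bounds) - 1)]
end

backbone_ps, head_ps = splitat(model, [3])
```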
Implementation-wise, this is the correct way to do this for the current state of Flux's optimizers. That being said, I don't like the idea of something called "discriminative LR" going through the optimizer interface (similarly, I advocate for keeping learning rate schedules separate from the optimizers themselves). Once we move to FluxML/Optimisers.jl, optimizer instances will be cheap. So, I think a good API might have something like:

```julia
opt = GroupOptimiser(
    Scheduler(Exp(λ = 1e-3, γ = 0.75), ADAM()) => params(model[1]), # backbone parameters
    Scheduler(Exp(λ = 1e-4, γ = 0.5), ADAM()) => params(model[2]),  # head parameters
)
```

Of course, this is thinking about it from the vanilla perspective and not specifically FastAI.jl; I haven't entirely figured out how it would all fit together there.
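To sketch what I mean, a minimal version of such a `GroupOptimiser` under the current `apply!` interface could look like this; the type is hypothetical, nothing like it exists yet:

```julia
# Hypothetical: pair each inner optimizer with the parameter group it owns,
# and dispatch apply! to whichever optimizer's group contains `x`.
struct GroupOptimiser
    groups::Vector{Pair}  # optimizer => parameter collection
end
GroupOptimiser(pairs::Pair...) = GroupOptimiser(collect(pairs))

function Flux.Optimise.apply!(o::GroupOptimiser, x, Δ)
    for (opt, ps) in o.groups
        if any(p -> p === x, ps)
            return Flux.Optimise.apply!(opt, x, Δ)
        end
    end
    return Δ  # a real implementation would need a policy for ungrouped parameters
end
```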
Performance-wise, I think the extra multiply is negligible, so no need to worry about that.
The issue with building groups via `params(model[1])` is that the result is a flat, unstructured collection, so the information about how the model was split is lost.
Yeah, I agree. To use something like `params(model[1])`, you already need to know how the model is structured. That's why I ended up suggesting working on a parameter group interface: something that allows us to reason about collections of parameters as a structure, better than `Params` does.
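As a strawman for what such a structure might look like (the `ParamGroups` name and group names are invented):

```julia
# Invented strawman: parameter groups that keep their names and origin
# instead of collapsing into a flat Params.
struct ParamGroups
    groups::Dict{Symbol,Any}  # group name => parameter collection
end

pg = ParamGroups(Dict(
    :backbone => Flux.params(model[1:2]),
    :head     => Flux.params(model[3]),
))
```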
I think if we iterate on the "schema" here, then that might be what I'm going for with parameter groups.
Okay, I see this is a larger topic than I thought, and it will take some time and effort across packages for this to stabilize. For now I'll go with the "old" optimizer solution I suggested above, with support for splitting on indices only, and we can move to a better solution once things are ironed out and Optimisers.jl moves center stage.
Now implemented and used in