
Unclear wording in "Composing Optimizers" section of docs #1627

Open · StevenWhitaker opened this issue Jun 23, 2021 · 2 comments

@StevenWhitaker (Contributor)

In the Composing Optimizers section of the docs it states:

> opt = Optimiser(ExpDecay(0.001, 0.1, 1000, 1e-4), Descent())
>
> Here we apply exponential decay to the Descent optimiser.

I think that last sentence is a bit misleading to someone who is not very familiar with Flux. When I read "apply exponential decay to the Descent optimiser", I take that to mean that we use the learning rate defined by Descent, but update it according to some schedule defined by ExpDecay. But in reality, if I understand correctly, the above example uses an effective initial learning rate of 0.0001 (ExpDecay's 0.001 multiplied by Descent's default learning rate of 0.1), not 0.1 as I would expect.
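For concreteness, here is a minimal sketch of where the 0.0001 comes from, assuming the Flux 0.12-era Optimise API in which each optimiser in a composed `Optimiser` rescales the gradient in turn via `apply!`:

```julia
using Flux
using Flux.Optimise: Optimiser, ExpDecay, Descent, apply!

opt = Optimiser(ExpDecay(0.001, 0.1, 1000, 1e-4), Descent())  # Descent() defaults to η = 0.1

x = rand(3)         # dummy parameter
Δ = ones(3)         # dummy gradient

# Each optimiser in the chain scales the gradient: 1.0 * 0.001 (ExpDecay) * 0.1 (Descent)
apply!(opt, x, Δ)   # Δ ≈ [1e-4, 1e-4, 1e-4], i.e. an effective initial learning rate of 0.0001
```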

(Somewhat tangential note: One possible reason for the confusion might be the fact that ExpDecay by itself can be used to do gradient descent, but that isn't conveyed by the name ExpDecay. I would think that something called ExpDecay would have to be paired with an optimizer, and would not have its own learning rate.)

Two possible ways to improve the clarity of the docs:

  1. Update the example with

    opt = Optimiser(ExpDecay(1, 0.1, 1000, 1e-4), Descent())

    and leave the rest of the wording as is.

  2. Use a different example, especially since

    opt = Optimiser(ExpDecay(0.001, 0.1, 1000, 1e-4), Descent())

    is equivalent to

    opt = ExpDecay(0.0001, 0.1, 1000, 1e-4)

    (if I understand correctly), so the current example doesn't seem as useful as another might be (one hypothetical alternative is sketched right after this list).
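One hypothetical alternative (just a sketch on my part, not wording from the docs) would be to compose the schedule with a stateful optimiser such as Momentum, since that composition cannot be collapsed into a single ExpDecay:

```julia
using Flux.Optimise: Optimiser, ExpDecay, Momentum

# Hypothetical alternative example: ExpDecay(1, ...) acts purely as a schedule here,
# while Momentum supplies both the base learning rate and the momentum state.
opt = Optimiser(ExpDecay(1.0, 0.1, 1000, 1e-4), Momentum(0.01))
```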

I am happy to open a PR to make the simple change suggested in 1., but I don't know Flux well enough to do 2.

@darsnack (Member)

Short-term fix: perhaps the default learning rate in ExpDecay should be 1 so that it composes sensibly with other optimizers.

Long-term rant:

> One possible reason for the confusion might be the fact that ExpDecay by itself can be used to do gradient descent, but that isn't conveyed by the name ExpDecay. I would think that something called ExpDecay would have to be paired with an optimizer, and would not have its own learning rate.

Thank you for taking the time to type this up, because what you are highlighting here is something I've brought up before. Schedules and optimizers are not the same thing. It's a cute trick that certain schedules can be "composed" with our optimizers, but I feel strongly that we should fix this by making schedules their own distinct thing.

Schedulers (in general) wrap an optimizer; they do not compose with it. The fact that ExpDecay can currently be used like an optimizer when it isn't one underscores this point well.
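For illustration, a hypothetical sketch (not an existing Flux API) of what wrapping could look like: the scheduler owns the optimiser and rewrites its learning rate each step, and has no learning rate of its own.

```julia
using Flux
using Flux.Optimise: Descent

# Hypothetical wrapper type: holds the optimiser and a schedule.
mutable struct Scheduled{O,S}
    opt::O        # the wrapped optimiser, e.g. Descent(0.1)
    schedule::S   # maps step count -> learning rate
    t::Int
end

function Flux.Optimise.apply!(s::Scheduled, x, Δ)
    s.t += 1
    s.opt.eta = s.schedule(s.t)              # the schedule sets η; the optimiser does the update
    return Flux.Optimise.apply!(s.opt, x, Δ)
end

# Exponentially decay Descent's own learning rate (0.1) by 10x every 1000 steps:
opt = Scheduled(Descent(0.1), t -> 0.1 * 0.1^(t ÷ 1000), 0)
```

With something like this, the initial learning rate is unambiguously the wrapped optimiser's, and exponential decay becomes a pure schedule rather than another "optimiser".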

@DhairyaLGandhi (Member)

Option 1 should be fine for now.

bors bot added a commit that referenced this issue Jun 24, 2021
1628: Update "Composing Optimisers" docs r=darsnack a=StevenWhitaker

Addresses #1627 (perhaps only partially).

Use `1` instead of `0.001` for the first argument of `ExpDecay` in the example, so that the sentence following the example, i.e.,

> Here we apply exponential decay to the `Descent` optimiser.

makes more sense.

It was also [suggested](#1627 (comment)) in the linked issue that it might be worth changing the default learning rate of `ExpDecay` to `1`. Since this PR doesn't address that, I'm not sure merging this PR should necessarily close the issue.

Co-authored-by: StevenWhitaker <steventwhitaker@gmail.com>