Fix LR scheduler behaviour with AMP #16229

Open
wants to merge 24 commits into base: master

Conversation

milesial commented Jan 3, 2023

What does this PR do?

When training with native AMP and an LR scheduler, we get this warning, which indicates that an LR step was taken even though the optimizer step was skipped (expected at the beginning of training with native AMP):

/usr/local/lib/python3.8/dist-packages/torch/optim/lr_scheduler.py:138: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
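
For context, a minimal raw-PyTorch sketch of how the warning arises (not code from this repository; the toy model, random data, and a CUDA device are assumptions): while GradScaler is still calibrating the loss scale it may skip `optimizer.step()` on a batch, but an unconditional `scheduler.step()` still runs, so the scheduler gets ahead of the optimizer.

```python
import torch

model = torch.nn.Linear(8, 1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1)
scaler = torch.cuda.amp.GradScaler()  # the initial scale typically overflows on the first batches

for _ in range(10):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = model(torch.randn(4, 8, device="cuda")).sum()
    scaler.scale(loss).backward()
    scaler.step(optimizer)   # silently skipped when inf/nan gradients are found
    scaler.update()
    scheduler.step()         # still runs for skipped steps -> the UserWarning above
```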

Fixes #16228 #5558

Does your PR introduce any breaking changes? If yes, please list them.

No

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or minor internal changes/refactors)

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet-list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

milesial (Author) commented Jan 3, 2023

In the process of fixing tests, I discovered and fixed a bug where the scheduler wouldn't match its optimizer when multiple optimizers are configured with frequencies. Now the optimizers and schedulers match and alternate as they should, resetting the cycle every epoch.
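
For reference, a hedged sketch of the kind of setup affected (a hypothetical LightningModule; the generator/discriminator submodules and concrete optimizers are illustrative assumptions, not code from this PR): with frequencies 2 and 1, batches should cycle opt_g, opt_g, opt_d, and each scheduler should step only together with its own optimizer.

```python
import pytorch_lightning as pl
import torch


class GANLikeModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.generator = torch.nn.Linear(16, 16)
        self.discriminator = torch.nn.Linear(16, 1)

    def configure_optimizers(self):
        opt_g = torch.optim.Adam(self.generator.parameters(), lr=1e-3)
        opt_d = torch.optim.Adam(self.discriminator.parameters(), lr=1e-3)
        sched_g = torch.optim.lr_scheduler.StepLR(opt_g, step_size=1)
        sched_d = torch.optim.lr_scheduler.StepLR(opt_d, step_size=1)
        # "frequency" makes the trainer alternate optimizers: 2 batches with
        # opt_g, then 1 batch with opt_d, repeating within each epoch.
        return (
            {"optimizer": opt_g, "frequency": 2, "lr_scheduler": sched_g},
            {"optimizer": opt_d, "frequency": 1, "lr_scheduler": sched_d},
        )
```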

milesial (Author) commented Jan 9, 2023

@carmocca Ready for final review

@@ -390,7 +391,7 @@ def update_lr_schedulers(self, interval: str, update_plateau_schedulers: bool) -
         if interval == "step" and self._should_accumulate():
             return
         active_optimizers = _get_active_optimizers(
-            self.trainer.optimizers, self.trainer.optimizer_frequencies, self.total_batch_idx
+            self.trainer.optimizers, self.trainer.optimizer_frequencies, self.batch_idx
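
The distinction matters because the frequency cycle is meant to restart at each epoch. A simplified, standalone sketch of the indexing idea (for illustration only; this is not the actual _get_active_optimizers implementation):

```python
from itertools import accumulate
from typing import List


def active_optimizer_index(batch_idx: int, frequencies: List[int]) -> int:
    # batch_idx is the per-epoch index, so the cycle restarts every epoch;
    # a monotonically increasing total_batch_idx would drift out of sync once
    # an epoch length is not a multiple of the cycle length.
    pos = batch_idx % sum(frequencies)
    for i, boundary in enumerate(accumulate(frequencies)):
        if pos < boundary:
            return i
    raise RuntimeError("unreachable")


# frequencies [2, 1]: optimizer 0 for two batches, then optimizer 1 for one batch
assert [active_optimizer_index(i, [2, 1]) for i in range(6)] == [0, 0, 1, 0, 0, 1]
```
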
Contributor: Could you add a test to verify this works properly?

Author: I modified the third case of test_step_scheduling_for_multiple_optimizers_with_frequency so that it tests this.

@awaelchli added the community and optimization labels Jan 9, 2023
@carmocca requested a review from Borda as a code owner January 10, 2023 16:13
carmocca (Member) left a comment:

Can you check the failing tests?

setup.cfg Outdated
@@ -34,6 +34,7 @@ markers =
     cloud:Run the cloud tests for example
 filterwarnings =
     error::FutureWarning
+    error:Detected call of `lr_scheduler.step\(\)` before `optimizer.step\(\)`:UserWarning

Member: I added this line so that our CI fails if this warning appears. This way it tests that your patch works as expected.
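
For anyone unfamiliar with this pytest mechanism, a small self-contained illustration (an assumption for explanation, not part of the PR) of how an `error:<message regex>:UserWarning` filter escalates a matching warning into a test failure:

```python
import warnings

import pytest


def test_lr_warning_is_escalated():
    with pytest.raises(UserWarning):
        with warnings.catch_warnings():
            # Equivalent in spirit to the setup.cfg filterwarnings entry:
            # warnings matching the regex are raised as errors and fail the test.
            warnings.filterwarnings(
                "error",
                message=r"Detected call of `lr_scheduler\.step\(\)` before `optimizer\.step\(\)`",
                category=UserWarning,
            )
            warnings.warn(
                "Detected call of `lr_scheduler.step()` before `optimizer.step()`.",
                UserWarning,
            )
```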

Author: Thanks, but this also makes the IPU tests fail, and this PR is focused on GPU. I'm not sure where to fix the issue on IPUs.

Contributor: @milesial IMHO, we could just skip the IPU test failures for now, as done in my old PR #11755.

milesial (Author) commented Jan 10, 2023

The way I fixed the tests/tests_pytorch/models/test_hooks.py::test_trainer_model_hook_system_fit[True-kwargs1] test is very flaky, so I'd appreciate it if someone more familiar with these tests could come up with a better fix.
edit: seems like it didn't even fix it...

milesial (Author) commented:

I also modified native_amp.py in both lightning_fabric and pytorch_lightning. It doesn't seem like the lightning_fabric one is called in a typical workflow, so I'm not sure this is the right approach.
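
The underlying idea, sketched very roughly (this is an assumption about the mechanism, not the PR's actual diff to native_amp.py): detect whether GradScaler skipped `optimizer.step()`, and step the LR scheduler only when it did not.

```python
import torch


def scaler_step_was_skipped(
    scaler: torch.cuda.amp.GradScaler, optimizer: torch.optim.Optimizer
) -> bool:
    """Run scaler.step/update and report whether optimizer.step() was skipped.

    GradScaler lowers its scale after finding inf/nan gradients and skipping the
    optimizer step, so a decreased scale after update() is a common heuristic
    for detecting a skipped step.
    """
    scale_before = scaler.get_scale()
    scaler.step(optimizer)
    scaler.update()
    return scaler.get_scale() < scale_before
```

A training loop would then call `scheduler.step()` only when this returns False.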

stale bot added the won't fix label Mar 18, 2023
stale bot removed the won't fix label Mar 20, 2023
stale bot added the won't fix label Apr 13, 2023
Lightning-AI deleted a comment from stale bot Apr 14, 2023
stale bot removed the won't fix label Apr 14, 2023
mrembalski commented:

Hi @Borda, I am also encountering the same issue. Will this be merged?

Lightning-AI deleted a comment from stale bot Apr 25, 2023
Borda (Member) commented Apr 25, 2023

> Hi @Borda, I am also encountering the same issue. Will this be merged?

Let me check what is missing here...

mergify bot removed the has conflicts label Apr 26, 2023
alimoezzi commented:

Has this PR been merged already? I'm still having this issue.

Borda (Member) commented May 23, 2023

There were some failing tests. @milesial, would you mind having a look?

gitguardian bot commented Nov 18, 2023

⚠️ GitGuardian has uncovered 2 secrets following the scan of your pull request.

Please consider investigating the findings and remediating the incidents. Failure to do so may lead to compromising the associated services or software components.

🔎 Detected hardcoded secrets in your pull request:
  • Generic High Entropy Secret (commit 78fa3af) in tests/tests_app/utilities/test_login.py
  • Base64 Basic Authentication (commit 78fa3af) in tests/tests_app/utilities/test_login.py
🛠 Guidelines to remediate hardcoded secrets
  1. Understand the implications of revoking this secret by investigating where it is used in your code.
  2. Replace and store your secret safely, following best practices.
  3. Revoke and rotate this secret.
  4. If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.

🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request.


codecov bot commented Nov 18, 2023

Codecov Report

Merging #16229 (965fc03) into master (6497e36) will decrease coverage by 54%.
Report is 1 commit behind head on master.
The diff coverage is 29%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #16229      +/-   ##
==========================================
- Coverage      83%      29%     -54%     
==========================================
  Files         450      442       -8     
  Lines       38089    37941     -148     
==========================================
- Hits        31803    11015   -20788     
- Misses       6286    26926   +20640     

mergify bot removed the has conflicts label Feb 16, 2024
carmocca removed their assignment May 6, 2024
Labels
  • community: This PR is from the community
  • fabric: lightning.fabric.Fabric
  • optimization
  • pl: Generic label for PyTorch Lightning package
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Wrong LR scheduler behaviour when using AMP
8 participants