Mixed precision: scheduler and optimizer are called in the wrong order #5558
Comments
Hi! Thanks for your contribution. Great first issue!
Looking at the warning message, this seems to be a precision-related problem. As explained in the documentation, when 16-bit precision is used, optimization is managed automatically by PyTorch Lightning. Since PyTorch 1.1.0, the warning `Detected call of lr_scheduler.step() before optimizer.step()` is raised when the scheduler is stepped first.
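To illustrate the mechanism, here is a minimal non-Lightning sketch (model, optimizer, and scheduler choices are illustrative): under AMP, `GradScaler.step(optimizer)` skips the underlying `optimizer.step()` when it finds inf/NaN gradients, so a scheduler that is stepped unconditionally can end up running before any optimizer step, which is exactly what PyTorch's warning detects.

```python
import torch

# Minimal sketch (not Lightning) of why AMP triggers the warning: when
# GradScaler finds inf/NaN gradients it skips optimizer.step(), so a
# scheduler stepped unconditionally effectively runs "before" the optimizer.
model = torch.nn.Linear(2, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1)
scaler = torch.cuda.amp.GradScaler(enabled=False)  # CPU-friendly stand-in

loss = model(torch.ones(1, 2)).sum()
scaler.scale(loss).backward()
scaler.step(optimizer)   # under real AMP this may be skipped on overflow
scaler.update()
scheduler.step()         # stepping here regardless is what the warning flags
```

With `enabled=False` the scaler is a pass-through; under real fp16 training the `scaler.step(optimizer)` line is the one that can silently skip.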
I'm getting the same warning.
Same issue. |
I am still getting the same issue as well.
@griff4692 @sanxing-chen Hi, thank you for your report. Which version are you using? Could you try with the latest version of `pytorch-lightning`?
@akihironitta I have run again the colab from the beginning of the issue and the warning problem is still there. This is the environment printed by the collecting script:
@javierlorenzod Thanks a lot for your report! Let me look into it.
I am using […]
As another data point, I'm seeing this issue with pytorch-lightning==1.3.4.
Same issue here.
Same issue.
I'm using pytorch-lightning==1.6.4 and still see the same issue.
A quick workaround is to override `optimizer_step` and `lr_scheduler_step`:

```python
class YourLightningModule(LightningModule):
    def optimizer_step(self, epoch, batch_idx, optimizer, optimizer_idx, optimizer_closure, **kwargs):
        self.should_skip_lr_scheduler_step = False
        scaler = getattr(self.trainer.strategy.precision_plugin, "scaler", None)
        if scaler:
            scale_before_step = scaler.get_scale()
        optimizer.step(closure=optimizer_closure)
        if scaler:
            scale_after_step = scaler.get_scale()
            # the scale only decreases when the step was skipped due to inf/NaN grads
            self.should_skip_lr_scheduler_step = scale_before_step > scale_after_step

    def lr_scheduler_step(self, scheduler, optimizer_idx, metric):
        if self.should_skip_lr_scheduler_step:
            return
        scheduler.step()
```

See here for a complete script using BoringModel: https://github.com/akihironitta/gist/blob/repro/5558-amp-scheduler-workaround/pl_boring_model/main.py
I'm not using PTL right now, but I'm interested in the "right" solution here. As others have said, the issue has nothing to do with PTL. @akihironitta, a couple of comments/questions.
Cheers,

Edit: I couldn't successfully suppress the warning, and ended up comparing the return value to None instead.

Edit2: Testing for None as a return value doesn't work for all optimizers, e.g. AdamW without a closure returns None even when it actually stepped. So testing the scale before and after seems like the best way.
Hi @collinmccarthy, thank you for your comment.

Yes, as I commented a while ago in #5558 (comment), this issue stems from how amp is implemented.

That's totally fine if you're fine with it. However, some people might still prefer to use the hack above. To suppress the excessive warnings themselves:

```python
import warnings
warnings.filterwarnings("ignore", "Detected call of", UserWarning)
```

https://docs.python.org/3/library/warnings.html#warnings.filterwarnings
@akihironitta Hi! I'm using the `YourLightningModule` code above, but I get `ValueError: Tried to step 42552 times. The specified number of total steps is 42550`:

```
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/data/asr_proj/stt/RNNTransducer/.venv/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/data/asr_proj/stt/RNNTransducer/.venv/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 220, in advance
    self.update_lr_schedulers("step", update_plateau_schedulers=False)
  File "/data/asr_proj/stt/RNNTransducer/.venv/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 397, in update_lr_schedulers
    self._update_learning_rates(
  File "/data/asr_proj/stt/RNNTransducer/.venv/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 458, in _update_learning_rates
    self.trainer._call_lightning_module_hook(
  File "/data/asr_proj/stt/RNNTransducer/.venv/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1305, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/data/asr_proj/stt/RNNTransducer/model.py", line 200, in lr_scheduler_step
    scheduler.step()
  File "/data/asr_proj/stt/RNNTransducer/.venv/lib/python3.9/site-packages/torch/optim/lr_scheduler.py", line 161, in step
    values = self.get_lr()
  File "/data/asr_proj/stt/RNNTransducer/.venv/lib/python3.9/site-packages/torch/optim/lr_scheduler.py", line 1686, in get_lr
    raise ValueError("Tried to step {} times. The specified number of total steps is {}"
ValueError: Tried to step 42552 times. The specified number of total steps is 42550
```

My code looks like this:

```python
def optimizer_step(self, epoch, batch_idx, optimizer, optimizer_idx, optimizer_closure, **kwargs):
    self.should_skip_lr_scheduler_step = False
    scaler = getattr(self.trainer.strategy.precision_plugin, "scaler", None)
    if scaler:
        scale_before_step = scaler.get_scale()
    optimizer.step(closure=optimizer_closure)
    if scaler:
        scale_after_step = scaler.get_scale()
        self.should_skip_lr_scheduler_step = scale_before_step > scale_after_step

def lr_scheduler_step(self, scheduler, optimizer_idx, metric):
    if self.should_skip_lr_scheduler_step:
        return
    scheduler.step()

def configure_optimizers(self):
    optimizer = torch.optim.AdamW(
        [{"params": [p for p in self.parameters()], "name": "OneCycleLR"}],
        lr=self.args.learning_rate,
        weight_decay=self.args.weight_decay,
    )
    scheduler = torch.optim.lr_scheduler.OneCycleLR(
        optimizer,
        max_lr=self.args.max_lr,
        steps_per_epoch=self.steps_per_epoch,
        epochs=self.trainer.max_epochs,
        pct_start=0.05,
    )
    lr_scheduler = {"interval": "step", "scheduler": scheduler, "name": "AdamW"}
    return [optimizer], [lr_scheduler]
```
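For what it's worth, `OneCycleLR` raises exactly this `ValueError` once it is stepped more times than its `total_steps`, so any mismatch between `steps_per_epoch * epochs` and the real number of scheduler steps will eventually trip it. A hedged, standalone sketch of a guard (numbers are illustrative, not the reporter's actual config):

```python
import torch

model = torch.nn.Linear(2, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

total_steps = 10  # illustrative; should match the real number of optimizer steps
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=1e-2, total_steps=total_steps
)

# Even if the training loop tries to step a few extra times, this guard
# avoids OneCycleLR's "Tried to step ... times" ValueError:
for _ in range(total_steps + 5):
    optimizer.step()
    if scheduler.last_epoch < scheduler.total_steps - 1:
        scheduler.step()
```

The cleaner fix is to make `total_steps` agree with how many times the loop actually steps the scheduler; the guard only papers over an off-by-a-few mismatch.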
@collinmccarthy
I proposed a fix in #16229. The issue is not on the PyTorch side; it's on the PTL side. When using a per-step LR scheduler together with AMP, the PyTorch user (here, PTL) should check that the optimizer step wasn't skipped by the grad scaler before stepping the scheduler. In that PR I use the same check that PyTorch uses to generate the warning.
I think we should implement pytorch/pytorch#67590 (PyTorch). Any additions in Lightning would always be workarounds. |
Following and waiting. |
any update? |
|
In my case, the warning is raised during the first four steps, while an epoch consists of 500+ steps. Since the warning occurs on the first step, I also get the "scheduler called before optimizer" warning. I'd like to address these warnings not only because they are annoying and can lead others on the project to assume there is a significant problem, but also because there is no guarantee that the number of skipped optimizer steps will always stay small. I noticed that my optimizer (AdamW) has a `_step_count` attribute, and after debugging I observed that the count is not incremented during skipped steps. Therefore, another possible workaround would be:
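A standalone sketch of that observation (it relies on the private `optimizer._step_count` attribute that PyTorch's schedulers install, so it may change across versions; the model and optimizer here are illustrative):

```python
import torch

# Attaching an LRScheduler makes PyTorch wrap optimizer.step() with a
# counter; optimizer._step_count (private API) then counts only the steps
# that actually ran, so a GradScaler-skipped step leaves it unchanged.
param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.AdamW([param], lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10)

before = optimizer._step_count
param.grad = torch.ones(1)
optimizer.step()  # a real step: the counter advances
after = optimizer._step_count

# Workaround sketch: only step the scheduler when the counter advanced.
if after > before:
    scheduler.step()
```

Since `_step_count` is private, the scale-comparison workaround earlier in the thread is probably the safer of the two.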
I'm having the same warning, but it disappears when I set […]
🐛 Bug
When using mixed-precision training, the scheduler and optimizer are called in the wrong order, and this warning is generated:
Please reproduce using the BoringModel
https://colab.research.google.com/drive/1G7pk6E9XUYq-pS41DXKhqM9Srx8sikiP?usp=sharing
There are four tests. Three of them don't raise the warning.
This test case raises the warning:
To Reproduce
- Define the scheduler in `configure_optimizers` using the dictionary style
- Set `precision=16` in the `Trainer`
Note
When the scheduler is defined in another way, the issue does not seem to occur:
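For concreteness, a hedged sketch of the two `configure_optimizers` return styles being contrasted (the optimizer and scheduler choices are illustrative, not the colab's exact ones):

```python
import torch

model = torch.nn.Linear(2, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1)

# Dictionary style with a per-step interval -- the configuration that
# triggers the warning under precision=16 in this report:
dict_style = ([optimizer], [{"scheduler": scheduler, "interval": "step"}])

# Plain style (Lightning steps it once per epoch by default) -- reported
# not to raise the warning:
plain_style = ([optimizer], [scheduler])
```

The difference is plausibly just step frequency: a per-step scheduler has many more chances to fire right after an AMP-skipped optimizer step.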
Expected behavior
No warning
Environment
cc @tchaton @rohitgr7 @carmocca @justusschock @awaelchli @akihironitta