
Scheduler is still stepped when optimizer stepping is skipped. #18828

Closed

oguz-hanoglu opened this issue Oct 19, 2023 · 3 comments
Labels
bug (Something isn't working), duplicate (This issue or pull request already exists), precision: amp (Automatic Mixed Precision), ver: 2.0.x

Comments

@oguz-hanoglu commented Oct 19, 2023

Bug description

In the 16-mixed precision use case, the optimizer step is sometimes skipped.

(This is expected behaviour, as explained in the PyTorch documentation: 'If no inf/NaN gradients are found, invokes optimizer.step() using the unscaled gradients. Otherwise, optimizer.step() is skipped to avoid corrupting the params.' https://pytorch.org/docs/stable/amp.html)

In those cases, Lightning still steps the scheduler afterwards. As a result, the scheduler updates the learning rate as if this were a valid optimizer step.
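
For reference, this is the plain-PyTorch AMP pattern in which the mismatch shows up (a minimal sketch to illustrate the behaviour, not Lightning's actual loop code; requires a CUDA device):

import torch

model = torch.nn.Linear(4, 1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1)
scaler = torch.cuda.amp.GradScaler()

for _ in range(3):
    optimizer.zero_grad()
    with torch.autocast("cuda", dtype=torch.float16):
        loss = model(torch.randn(8, 4, device="cuda")).mean()
    scaler.scale(loss).backward()
    # scaler.step() unscales the gradients and silently skips optimizer.step()
    # if it finds inf/NaN gradients.
    scaler.step(optimizer)
    scaler.update()
    # The scheduler is stepped unconditionally; if the very first optimizer
    # step was skipped, this is where PyTorch emits the
    # "lr_scheduler.step() before optimizer.step()" UserWarning.
    scheduler.step()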

What version are you seeing the problem on?

v2.0

How to reproduce the bug

No response

Error messages and logs

# Error messages and logs here please

Environment

Current environment
#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0):
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 2.0):
#- Python version (e.g., 3.9):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
#- Running environment of LightningApp (e.g. local, cloud):

More info

By chance, the very first step of my first batch is a 'skipped step'. As a result, I get PyTorch's 'UserWarning: Detected call of lr_scheduler.step() before optimizer.step()' warning.

My current workaround is overriding lr_scheduler_step as follows:

def lr_scheduler_step(self, scheduler, metric):
    ...  # compute optimizer_is_skipped (see below)
    if optimizer_is_skipped:
        return
    super().lr_scheduler_step(scheduler, metric)

I calculate optimizer_is_skipped by comparing global_step with the step counter of the optimizer object. If the offset between them stays the same, the optimizer actually stepped and the scheduler can be stepped as well; if the offset grows, the last step was skipped. A sketch of this idea is shown below.
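
For concreteness, here is a hedged sketch of that detection inside a LightningModule. The helper _optimizer_step_count, the reliance on Adam-style per-parameter state["step"] counters, and the _last_offset bookkeeping are illustrative assumptions of mine, not Lightning API:

import torch
import lightning.pytorch as pl


class MyModule(pl.LightningModule):
    # training_step, configure_optimizers, etc. omitted for brevity

    def __init__(self):
        super().__init__()
        # Offset between Lightning's global_step and the optimizer's own
        # step counter, as seen at the previous scheduler step.
        self._last_offset = 0

    @staticmethod
    def _optimizer_step_count(optimizer):
        # Hypothetical helper: Adam-style optimizers keep a per-parameter
        # "step" counter in their state; return it from the first parameter
        # that has state, or 0 before any successful step.
        for state in optimizer.state.values():
            if "step" in state:
                step = state["step"]
                return int(step.item()) if torch.is_tensor(step) else int(step)
        return 0

    def lr_scheduler_step(self, scheduler, metric):
        optimizer = self.optimizers().optimizer  # unwrap the LightningOptimizer
        offset = self.global_step - self._optimizer_step_count(optimizer)
        # global_step still advances when scaler.step() skips the optimizer,
        # so a growing offset means the last optimizer step was skipped.
        optimizer_is_skipped = offset > self._last_offset
        self._last_offset = offset
        if optimizer_is_skipped:
            return
        super().lr_scheduler_step(scheduler, metric)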

cc @carmocca @justusschock @awaelchli

oguz-hanoglu added the bug and needs triage labels Oct 19, 2023
@oguz-hanoglu (Author)

Checking whether the optimizer step was skipped and, if so, skipping the scheduler step is also discussed here:
https://discuss.pytorch.org/t/optimizer-step-before-lr-scheduler-step-error-using-gradscaler/92930/10

oguz-hanoglu changed the title from "Scheduler is still called when optimizer fails." to "Scheduler is still called when optimizer stepping is skipped." Oct 19, 2023
oguz-hanoglu changed the title from "Scheduler is still called when optimizer stepping is skipped." to "Scheduler is still stepped when optimizer stepping is skipped." Oct 19, 2023
@awaelchli (Member)

@oguz-hanoglu This is essentially a duplicate of the long-standing issue #5558. If you read that thread, you'll find that we would need pytorch/pytorch#67590 in PyTorch to implement the needed logic. For now, the only thing you can do as a user is to ignore the warning. This should be fine in most cases, since optimizer step skips from AMP are infrequent and thus shouldn't affect the learning rate schedule much in general.

awaelchli added the duplicate and precision: amp labels and removed the needs triage label Oct 23, 2023
@oguz-hanoglu (Author) commented Oct 23, 2023

@awaelchli is quite right, this is a duplicate. I've posted my experience, including the workaround, to #5558.
