Scheduler is still stepped when optimizer stepping is skipped. #18828
Labels
bug, duplicate, precision: amp, ver: 2.0.x
Bug description
In the 16-mixed precision use case, the optimizer step is sometimes skipped.
(This is expected behaviour as explained in torch documentation - 'If no inf/NaN gradients are found, invokes optimizer.step() using the unscaled gradients. Otherwise, optimizer.step() is skipped to avoid corrupting the params.' https://pytorch.org/docs/stable/amp.html)
In those cases, Lightning still calls the scheduler afterwards, and as a result the scheduler updates the learning rate as if this had been a valid step.
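For context, the skip happens inside `GradScaler.step()`. The snippet below is a plain PyTorch AMP loop, not Lightning's internal code; the toy model, optimizer, scheduler and data are only placeholders to illustrate where the skip occurs and why stepping the scheduler unconditionally mis-advances the learning rate:

```python
import torch

# Toy model/optimizer/scheduler/data, purely for illustration.
model = torch.nn.Linear(10, 1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.9)
scaler = torch.cuda.amp.GradScaler()
data = [torch.randn(4, 10, device="cuda") for _ in range(3)]

for batch in data:
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(batch).sum()
    scaler.scale(loss).backward()
    # scaler.step() unscales the gradients and calls optimizer.step() only if
    # no inf/NaN gradients were found; otherwise the optimizer step is skipped.
    scaler.step(optimizer)
    scaler.update()
    # Stepping the scheduler unconditionally advances the learning rate even
    # when the optimizer step above was skipped -- the behaviour this issue reports.
    scheduler.step()
```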
What version are you seeing the problem on?
v2.0
How to reproduce the bug
No response
Error messages and logs
Environment
More info
By chance, the first step of my first batch is a 'skipped step'. As a result, I get PyTorch's `UserWarning: Detected call of lr_scheduler.step() before optimizer.step()` warning. My current workaround is to override the lr_scheduler stepping: I calculate `optimizer_is_skipped` by comparing `global_step` with the step counter of the optimizer object, and if the offset between them stays the same, the optimizer actually stepped and the scheduler can be stepped as well (see the sketch below).
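A minimal sketch of that idea, using Lightning's `lr_scheduler_step` hook, might look like the following. This is not the author's exact code: the class name is hypothetical, it relies on the private `optimizer._step_count` counter that PyTorch's LR schedulers install on the optimizer, and it assumes `global_step` has already been incremented by the time the hook runs.

```python
from lightning.pytorch import LightningModule


class AMPSafeSchedulerModule(LightningModule):
    # Hypothetical module; only the scheduler hook relevant to the workaround is shown.

    def __init__(self):
        super().__init__()
        # Offset between Lightning's global_step and the optimizer's real step
        # count, as observed at the previous scheduler call.
        self._last_offset = 0

    def lr_scheduler_step(self, scheduler, metric):
        # `_step_count` is a private counter that PyTorch attaches to the optimizer
        # once an LRScheduler wraps it; GradScaler-skipped steps do not increment it.
        # Relying on it is an assumption of this sketch.
        real_steps = getattr(scheduler.optimizer, "_step_count", None)
        if real_steps is None:
            # Fall back to the default behaviour if the counter is unavailable.
            return super().lr_scheduler_step(scheduler, metric)

        offset = self.global_step - real_steps
        # If the offset grew since the last call, the optimizer step was skipped.
        optimizer_was_skipped = offset != self._last_offset
        self._last_offset = offset

        if optimizer_was_skipped:
            return  # do not advance the LR schedule for a skipped step
        super().lr_scheduler_step(scheduler, metric)
```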
cc @carmocca @justusschock @awaelchli