I think this is also related to #18595. Saving the ModelCheckpoint before all of the progress counters have been incremented seems to lead to a host of unforeseen and hard-to-debug issues.
Bug description
The checkpoint callback runs before batch_progress.increment_completed() is called in the advance method of training_epoch_loop. As a result, in a saved checkpoint the value of
checkpoint['loops']['fit_loop']['epoch_loop.batch_progress']['total']['completed'] (e.g. 9)
is one smaller than, for example,
checkpoint['loops']['fit_loop']['epoch_loop.batch_progress']['total']['processed'] (e.g. 10)
or the global step. The same applies to
checkpoint['loops']['fit_loop']['epoch_loop.state_dict']['_batches_that_stepped']
Consequently, when training resumes from the checkpoint, the batch with batch_idx 9 is run again, even though the optimizer step for that batch has already been performed.
This behavior is unexpected enough to warrant at least a note in the documentation, if it is not regarded as a bug.
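The ordering problem can be illustrated with a toy model. The sketch below is not Lightning's code: ToyProgress and run_batches are hypothetical names, and the loop only imitates the order described above (processed is incremented, the checkpoint is taken, then completed is incremented).

```python
from dataclasses import dataclass, asdict


@dataclass
class ToyProgress:
    """Toy stand-in for a batch progress tracker (not Lightning's class)."""
    processed: int = 0
    completed: int = 0


def run_batches(num_batches, save_at_batch_idx):
    """Run a toy loop and return (snapshot, final_progress).

    The snapshot is taken in the "callback" slot, which sits between the
    processed and completed increments, mirroring the order in this report.
    """
    progress = ToyProgress()
    snapshot = None
    for batch_idx in range(num_batches):
        # ... the optimizer step for this batch would happen here ...
        progress.processed += 1
        if batch_idx == save_at_batch_idx:
            # The checkpoint is taken before the completed counter is bumped.
            snapshot = asdict(progress)
        progress.completed += 1
    return snapshot, progress


snapshot, final = run_batches(num_batches=10, save_at_batch_idx=9)
print(snapshot)  # {'processed': 10, 'completed': 9}
```

In this toy model, any resume logic that restarts from the completed counter would begin again at batch_idx 9, although that batch's step already happened before the snapshot was taken.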
What version are you seeing the problem on?
master
How to reproduce the bug
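One way to observe the mismatch might be to read the counters from a saved checkpoint. The sketch below assumes the checkpoint has already been loaded into a plain dict and uses only the key paths quoted in the bug description. The helper name read_progress_counters is hypothetical, and the sample dict is hand-written for illustration: 9 and 10 are the example values from the description, while the _batches_that_stepped value is a placeholder.

```python
def read_progress_counters(checkpoint):
    """Return the counters named in this report from a checkpoint dict."""
    fit_loop = checkpoint["loops"]["fit_loop"]
    total = fit_loop["epoch_loop.batch_progress"]["total"]
    state = fit_loop["epoch_loop.state_dict"]
    return {
        "completed": total["completed"],
        "processed": total["processed"],
        "_batches_that_stepped": state["_batches_that_stepped"],
    }


# Hypothetical hand-written sample; a real checkpoint would be loaded from disk.
sample = {
    "loops": {
        "fit_loop": {
            "epoch_loop.batch_progress": {
                "total": {"completed": 9, "processed": 10}
            },
            # Placeholder value, not taken from a real run.
            "epoch_loop.state_dict": {"_batches_that_stepped": 10},
        }
    }
}

counters = read_progress_counters(sample)
gap = counters["processed"] - counters["completed"]
print(gap)  # 1
```

A gap of 1 between processed and completed is the off-by-one described above.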
Error messages and logs
None
Environment
Current environment
More info
No response