Fix fit loop restart logic to enable resume using the checkpoint #12821
Conversation
def on_run_start(self) -> None:  # type: ignore[override]
    """Calls the ``on_train_start`` hook."""
    # update the current_epoch in-case of checkpoint reload
    if not self._iteration_based_training():
During restart, if training is not iteration-based, we need to update the current epoch so that training starts from a fresh epoch rather than the old one, for the case where the checkpoint being reloaded was saved before on_train_end.
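A minimal standalone sketch of that idea (not the actual FitLoop code; EpochProgress and on_restart are illustrative names): when an epoch-based run resumes from a checkpoint that was saved before on_train_end, the epoch counter is advanced so the loop begins a fresh epoch instead of re-entering the recorded one.

from dataclasses import dataclass


@dataclass
class EpochProgress:
    """Toy stand-in for the loop's epoch progress tracking."""
    processed: int = 0  # epochs whose batches were fully consumed
    completed: int = 0  # epochs for which on_train_epoch_end has finished


def on_restart(progress: EpochProgress, iteration_based: bool) -> None:
    """Advance the epoch counter when resuming an epoch-based run."""
    if not iteration_based:
        # the checkpoint was written before the epoch was marked completed,
        # so align `completed` with `processed` to start the next epoch cleanly
        progress.completed = progress.processed


# checkpoint saved after processing epoch 2 but before marking it completed
progress = EpochProgress(processed=3, completed=2)
on_restart(progress, iteration_based=False)
assert progress.completed == 3  # the resumed run begins a fresh epoch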
Does this make the following 2 lines unnecessary?
Since on_run_start runs before done, and stop_epochs is only valid under "not iteration-based".
yes! good catch! will update
but even if that's the case, do we increment epoch_progress during iteration-based training? not sure. need to check
> Does this make the following 2 lines unnecessary?
maybe... will check. I remember a test was failing: https://github.com/PyTorchLightning/pytorch-lightning/runs/6101153350?check_suite_focus=true
but I guess it shouldn't, since we increment current.completed after every epoch.
Okay, it looks like done is called within skip, so we need to keep it.
Hey @rohitgr7
I don't understand this logic. There is no true iteration-based training in Lightning; we always have epochs. We may restart from a completed epoch or from an incomplete epoch, regardless of how the max_* flags on the trainer are set.
If a user is training with max_steps=7 and restarts from a checkpoint saved at step=4, we need to start from step 5. Although I guess the dataloaders are re-iterated from the beginning in that case.
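A hedged end-to-end illustration of that scenario (TinyModel and the checkpoint filename are made up for the example and are not part of this PR): the first run stops after 4 optimizer steps and saves a checkpoint; resuming it with max_steps=7 should continue from step 5 rather than replaying steps 1-4.

import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class TinyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(4, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


def loader():
    x, y = torch.randn(16, 4), torch.randn(16, 1)
    return DataLoader(TensorDataset(x, y), batch_size=2)


# first run: stop after 4 optimizer steps and save a checkpoint
trainer = pl.Trainer(max_steps=4, logger=False, enable_checkpointing=False)
trainer.fit(TinyModel(), loader())
trainer.save_checkpoint("step4.ckpt")  # hypothetical path

# resumed run: with max_steps=7 this should train steps 5-7 only
resumed = pl.Trainer(max_steps=7, logger=False, enable_checkpointing=False)
resumed.fit(TinyModel(), loader(), ckpt_path="step4.ckpt")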
This explanation is not satisfying. It doesn't answer how max_steps has anything to do with the way we restart.
You could also have max_epochs=1 where the epoch size is 7 (equivalent to max_steps=7) and you would still restore the checkpoint on step 4 the exact same way.
max_steps / max_epochs are the stopping conditions. Why they should affect the way we restart is beyond my understanding.
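For concreteness, a small sketch of the equivalence being pointed out here (the 7 comes from the example above): with a 7-batch training epoch, both configurations stop after 7 optimizer steps, so a checkpoint taken at step 4 looks the same in either case.

import pytorch_lightning as pl

# iteration-based stopping condition: stop after 7 optimizer steps
trainer_steps = pl.Trainer(max_steps=7)

# epoch-based stopping condition: with a 7-batch epoch this also stops after 7 steps
trainer_epochs = pl.Trainer(max_epochs=1)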
> You could also have max_epochs=1 where the epoch size is 7 (equivalent to max_steps=7) and you would still restore the checkpoint on step 4 the exact same way.
Yes, it does, but in that case we do start from step=4, just on an entirely new epoch. The reason we need to separate this a little is that then we don't need to update the current epoch. Although I just noticed that we do update current_epoch at the end even if the user is training based on max_steps. Do you think we should remove this condition and keep incrementing the current epoch on each restart even when training is based on max_steps?
Whatever the fix is, it will not be conditioned on the max_steps / max_epochs trainer flags at all. I don't know how to solve this problem yet; we need to brainstorm what should be fixed.
All I know right now is that this PR did something weird 😅; that can't be the fix.
def on_run_start(self) -> None:  # type: ignore[override]
    """Calls the ``on_train_start`` hook."""
    # update the current_epoch in-case of checkpoint reload
"in-case" → "in case"
…ckpoint - #12821 Summary: patch fix the PR Lightning-AI/pytorch-lightning#12821 Reviewed By: hudeven, rayhou0710 Differential Revision: D36193410 fbshipit-source-id: 0adf2d3e5202ed85d8d0a305906df9be2ee696c3
…ckpoint - #12821 (#1249) Summary: Pull Request resolved: #1249 patch fix the PR Lightning-AI/pytorch-lightning#12821 Reviewed By: hudeven, rayhou0710 Differential Revision: D36193410 fbshipit-source-id: 6594c4daf8fe5be1eaec72d42c45789f5da36125
What does this PR do?
Fixes #12724
Does your PR introduce any breaking changes? If yes, please list them.
Before submitting
PR review
Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet-list:
Did you have fun?
Make sure you had fun coding 🙃
cc @Borda @tchaton @rohitgr7 @carmocca @justusschock @ananthsub @ninginthecloud