
Conversation

Contributor

@rohitgr7 rohitgr7 commented Apr 20, 2022

What does this PR do?

Fixes #12724

Does your PR introduce any breaking changes? If yes, please list them.

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or minor internal changes/refactors)

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet-list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

cc @Borda @tchaton @rohitgr7 @carmocca @justusschock @ananthsub @ninginthecloud

@rohitgr7 rohitgr7 added the bug (Something isn't working) and loops (Related to the Loop API) labels Apr 20, 2022
@rohitgr7 rohitgr7 added this to the 1.6.x milestone Apr 20, 2022
def on_run_start(self) -> None:  # type: ignore[override]
    """Calls the ``on_train_start`` hook."""
    # update the current_epoch in-case of checkpoint reload
    if not self._iteration_based_training():
Contributor Author

@rohitgr7 rohitgr7 Apr 26, 2022

During restart, if training is not iteration-based, we need to bump the current epoch so that training resumes on a fresh epoch rather than the old one, for cases where the checkpoint being reloaded was saved before ``on_train_end``.
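
To make that concrete, here is a minimal, self-contained sketch of the restart bookkeeping described above; ``Progress`` and ``on_restart`` are illustrative stand-ins, not the actual Lightning internals.

from dataclasses import dataclass


@dataclass
class Progress:
    processed: int = 0  # epochs that finished running
    completed: int = 0  # epochs whose end-of-epoch bookkeeping also finished


def on_restart(progress: Progress, iteration_based: bool) -> None:
    # A checkpoint saved before ``on_train_end`` can have processed an epoch
    # without marking it completed; syncing the counters on restart makes the
    # next epoch start fresh, unless training is driven purely by max_steps.
    if not iteration_based:
        progress.completed = progress.processed


p = Progress(processed=3, completed=2)
on_restart(p, iteration_based=False)
assert p.completed == 3  # training resumes with a fresh 4th epoch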

Contributor

Does this make the following 2 lines unnecessary?

https://github.com/PyTorchLightning/pytorch-lightning/blob/46c59d04db4156ae98e184e1d9321932f7e2ebf7/pytorch_lightning/loops/fit_loop.py#L169-L171

Since on_run_start runs before done, and stop_epochs is only valid under "not iteration-based".
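
For context, the done check referenced above has roughly this shape; this is a simplified approximation for illustration, not the exact lines at the linked commit.

def fit_loop_done(global_step: int, max_steps: int, current_epoch: int, max_epochs: int) -> bool:
    # Simplified sketch: -1 stands for "no limit", mirroring the Trainer defaults.
    stop_steps = max_steps != -1 and global_step >= max_steps
    stop_epochs = max_epochs != -1 and current_epoch >= max_epochs
    return stop_steps or stop_epochs


# e.g. 4 of 7 steps done and no epoch limit: not done yet
assert fit_loop_done(global_step=4, max_steps=7, current_epoch=1, max_epochs=-1) is False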

Contributor Author

@rohitgr7 rohitgr7 Apr 29, 2022

yes! good catch! will update

but even if that's the case, do we increment epoch_progress during iteration-based training? not sure. need to check

Contributor Author

Does this make the following 2 lines unnecessary?

maybe... will check. I remember a test was failing: https://github.com/PyTorchLightning/pytorch-lightning/runs/6101153350?check_suite_focus=true

But I guess it shouldn't, since we increment current.completed after every epoch.

Contributor Author

okay looks like done is called within skip, so we need to keep it.

Contributor

Hey @rohitgr7
I don't understand this logic. There is no true iteration-based training in Lightning; we always have epochs. We may restart from a completed epoch or from an incomplete epoch, regardless of how the max_ flags on the trainer are set.

Contributor Author

If a user sets max_steps=7 and restarts from a checkpoint saved at step=4, we need to start from step 5. Although I guess the dataloaders are re-iterated from the beginning in that case.
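
A toy illustration of that scenario with plain counters (values taken from the comment; nothing here is Lightning code):

max_steps = 7
restored_step = 4                   # checkpoint was saved at global_step == 4
remaining = max_steps - restored_step
assert remaining == 3               # the restart should only run steps 5, 6 and 7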

Contributor

@awaelchli awaelchli May 5, 2022

This explanation is not satisfying. It doesn't answer what max_steps has to do with the way we restart.

You could also have max_epochs=1 where the epoch size is 7 (equivalent to max_steps=7) and you would still restore the checkpoint on step 4 the exact same way.

max_steps / max_epochs are the stopping conditions. Why they would affect the way we restart is beyond my understanding.
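
To illustrate the equivalence being argued here, a toy comparison of the two configurations; EPOCH_SIZE and remaining_steps are made-up names for this sketch, not Trainer attributes:

EPOCH_SIZE = 7  # batches per epoch in this hypothetical run


def remaining_steps(restored_step: int, max_steps: int = -1, max_epochs: int = -1) -> int:
    # -1 mirrors the "no limit" default; exactly one limit is assumed to be set.
    total = max_steps if max_steps != -1 else max_epochs * EPOCH_SIZE
    return total - restored_step


# max_steps=7 and max_epochs=1 (with a 7-batch epoch) leave the same amount of
# work after restoring a checkpoint from step 4: the stopping condition differs,
# but the restart itself should not.
assert remaining_steps(4, max_steps=7) == remaining_steps(4, max_epochs=1) == 3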

Contributor Author

You could also have max_epochs=1 where the epoch size is 7 (equivalent to max_steps=7) and you would still restore the checkpoint on step 4 the exact same way.

Yes, it does, but in that case we start from step=4 on an entirely new epoch. The reason we need to separate this a little is that then we don't need to update the current epoch. Although I just noticed that we do update the current_epoch at the end even when the user is training based on max_steps. Do you think we should remove this condition and keep incrementing the current epoch on each restart, even when training is based on max_steps?

Contributor

Whatever the fix is, it will not be conditioned on the max_steps / max_epochs trainer flags at all. I don't know how to solve this problem yet, and we need to brainstorm what should be fixed.

All I know right now is that this PR did something weird 😅 that can't be the fix.

@rohitgr7 rohitgr7 added the priority: 0 (High priority task) label Apr 26, 2022
@rohitgr7 rohitgr7 marked this pull request as ready for review April 26, 2022 10:12
Contributor Author

rohitgr7 commented Apr 26, 2022

The failing CI is on master and unrelated to this PR.

@mergify mergify bot removed the has conflicts label Apr 29, 2022

Base automatically changed from test/add-hook-test-max-epochs to master May 2, 2022 12:41
@mergify mergify bot added the has conflicts label May 2, 2022
@mergify mergify bot removed the has conflicts label May 2, 2022
@rohitgr7 rohitgr7 enabled auto-merge (squash) May 2, 2022 15:13
@mergify mergify bot added the has conflicts label May 2, 2022
@mergify mergify bot added the ready (PRs ready to be merged) label and removed the has conflicts label May 3, 2022
@rohitgr7 rohitgr7 merged commit 46ed9dc into master May 3, 2022
@rohitgr7 rohitgr7 deleted the fix/loop_restart branch May 3, 2022 16:27
carmocca added a commit that referenced this pull request May 3, 2022

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

carmocca added a commit that referenced this pull request May 3, 2022

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

lexierule pushed a commit that referenced this pull request May 3, 2022

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

def on_run_start(self) -> None:  # type: ignore[override]
    """Calls the ``on_train_start`` hook."""
    # update the current_epoch in-case of checkpoint reload
Contributor

in case

ninginthecloud added a commit to ninginthecloud/mmf that referenced this pull request May 27, 2022
…ckpoint - #12821

Summary: patch fix the PR Lightning-AI/pytorch-lightning#12821

Reviewed By: hudeven, rayhou0710

Differential Revision: D36193410

fbshipit-source-id: 0adf2d3e5202ed85d8d0a305906df9be2ee696c3
facebook-github-bot pushed a commit to facebookresearch/mmf that referenced this pull request May 31, 2022
…ckpoint - #12821 (#1249)

Summary:
Pull Request resolved: #1249

patch fix the PR Lightning-AI/pytorch-lightning#12821

Reviewed By: hudeven, rayhou0710

Differential Revision: D36193410

fbshipit-source-id: 6594c4daf8fe5be1eaec72d42c45789f5da36125
@rohitgr7 rohitgr7 mentioned this pull request Jul 1, 2022

Labels

bug (Something isn't working), loops (Related to the Loop API), priority: 0 (High priority task), ready (PRs ready to be merged)


Development

Successfully merging this pull request may close these issues.

PT 1.6.0 could not resume a training with plugins monitoring on metrics

5 participants