
Ensure EMA checkpoints are also deleted when normal checkpoints are #5724

Merged

titu1994 merged 6 commits into main from fix/ema-del-checkpoints on Jan 5, 2023

Conversation

@SeanNaren (Collaborator) commented on Jan 3, 2023

What does this PR do?

Closes #5631

Collection: Common

Changelog

  • Delete EMA checkpoints when normal checkpoints are deleted.
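
As a rough illustration of the change described above (a minimal sketch, not NeMo's actual implementation): a `ModelCheckpoint` subclass can hook Lightning's checkpoint removal so the matching EMA file is deleted too. The `-EMA` filename suffix and the `_ema_format_filepath` helper are assumptions for illustration.

```python
import os

import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint


class EMAAwareModelCheckpoint(ModelCheckpoint):
    """Sketch: delete the matching EMA checkpoint whenever a regular one is pruned."""

    def _ema_format_filepath(self, filepath: str) -> str:
        # Hypothetical helper: "epoch=1.ckpt" -> "epoch=1-EMA.ckpt".
        return filepath.replace(self.FILE_EXTENSION, f'-EMA{self.FILE_EXTENSION}')

    def _remove_checkpoint(self, trainer: "pl.Trainer", filepath: str) -> None:
        # Let Lightning remove the regular checkpoint first ...
        super()._remove_checkpoint(trainer, filepath)
        # ... then remove its EMA counterpart so top-k pruning keeps both in sync.
        # (Assumes local files; the real implementation may go through a filesystem abstraction.)
        ema_filepath = self._ema_format_filepath(filepath)
        if os.path.exists(ema_filepath):
            os.remove(ema_filepath)
```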

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex, etc.)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

Signed-off-by: SeanNaren <snarenthiran@nvidia.com>
@SeanNaren added the bug label on Jan 3, 2023
@SeanNaren self-assigned this on Jan 3, 2023
@@ -1,6 +1,6 @@
 hydra-core>=1.2.0,<1.3
 omegaconf>=2.2,<2.3
-pytorch-lightning>=1.8.3
+pytorch-lightning>=1.8.6
@SeanNaren (Collaborator, Author) commented:

Hope it's ok to upgrade the version here @titu1994

@titu1994 (Collaborator) replied:

Should be fine, r1.15 has been cut already so it affects only main.

Signed-off-by: SeanNaren <snarenthiran@nvidia.com>
@@ -248,6 +248,27 @@ def test_exp_manager_ema_weights(self, tmpdir):
         for saved_weight, ema_weight in zip(duplicate_model.state_dict().values(), ema_weights):
             assert torch.allclose(saved_weight.cpu(), ema_weight.cpu())

+    @pytest.mark.unit
+    def test_exp_manager_ema_weights_topk(self, tmpdir):
@SeanNaren (Collaborator, Author) commented:

Confirmed this test fails when using pytorch-lightning==1.8.3
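
To make the expected on-disk behavior concrete, here is a hedged sketch of the kind of invariant such a top-k test can assert (not the actual test from this PR; the directory layout and `-EMA` naming are assumptions):

```python
import os


def assert_ema_checkpoints_in_sync(checkpoint_dir: str) -> None:
    # After top-k pruning, the set of EMA checkpoints should exactly
    # mirror the set of surviving regular checkpoints.
    files = os.listdir(checkpoint_dir)
    regular = {f for f in files if f.endswith('.ckpt') and not f.endswith('-EMA.ckpt')}
    ema = {f for f in files if f.endswith('-EMA.ckpt')}
    expected = {f[:-len('.ckpt')] + '-EMA.ckpt' for f in regular}
    assert ema == expected, f'Orphaned or missing EMA checkpoints: {sorted(ema ^ expected)}'
```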

Signed-off-by: SeanNaren <snarenthiran@nvidia.com>
github-actions bot added the common label on Jan 5, 2023
Signed-off-by: SeanNaren <snarenthiran@nvidia.com>
Signed-off-by: SeanNaren <snarenthiran@nvidia.com>
@titu1994 (Collaborator) left a comment:

Looks great! Thanks!

@titu1994 merged commit 78a7d8a into main on Jan 5, 2023
@titu1994 deleted the fix/ema-del-checkpoints branch on January 5, 2023 19:27
erastorgueva-nv pushed a commit that referenced this pull request on Jan 12, 2023:

Ensure EMA checkpoints are also deleted when normal checkpoints are (#5724)

* Ensure EMA checkpoints are also deleted when normal checkpoints are

Signed-off-by: SeanNaren <snarenthiran@nvidia.com>

* Simplify test

Signed-off-by: SeanNaren <snarenthiran@nvidia.com>

* Remove comment

Signed-off-by: SeanNaren <snarenthiran@nvidia.com>

* Fix bug where `save_best_model` caused a crash

Signed-off-by: SeanNaren <snarenthiran@nvidia.com>

* Swap to logging only on rank 0

Signed-off-by: SeanNaren <snarenthiran@nvidia.com>

Signed-off-by: SeanNaren <snarenthiran@nvidia.com>
Signed-off-by: Elena Rastorgueva <erastorgueva@nvidia.com>
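
The "Swap to logging only on rank 0" commit above avoids duplicated log lines in multi-GPU runs. A minimal sketch of that pattern using PyTorch Lightning's rank-zero utility (the message text is illustrative, not taken from the PR):

```python
# Sketch: emit a log line once per job instead of once per process.
# rank_zero_info only logs on the global rank-0 process.
from pytorch_lightning.utilities import rank_zero_info

rank_zero_info('Removed EMA checkpoint alongside its regular checkpoint.')
```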
titu1994 pushed a commit to titu1994/NeMo that referenced this pull request on Mar 24, 2023:

Ensure EMA checkpoints are also deleted when normal checkpoints are (NVIDIA#5724)
Labels: bug, common
Projects: None yet

Successfully merging this pull request may close these issues:

EMA Doesn't delete previous EMA ckpts when k > 0 for checkpointing

2 participants