Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent duplicated checkpoints #9015

Merged
merged 15 commits into from
May 8, 2024
Merged

Conversation

mikolajblaz
Copy link
Collaborator

@mikolajblaz mikolajblaz commented Apr 23, 2024

What does this PR do ?

This PR fixes the behavior of encountering an existing distributed checkpoint. Previously the save was skipped based on logic in NLPDDPStrategy.save_checkpoint, which is not a good place for such logic (strategy should just save the checkpoint). This PR reuses the PTL versioning mechanism to adjust the checkpoint name accordingly (add -v1 suffix to ckpt name)

Collection: NLP

Changelog

  • fix the behavior of skipping duplicate checkpoint by reusing PTL versioning mechanism

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this 

Jenkins CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

There's no need to comment jenkins on the PR to trigger Jenkins CI.
The GitHub Actions CI will run automatically when the PR is opened.
To run CI on an untrusted fork, a NeMo user with write access must click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

  • Related to # (issue)

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>
@github-actions github-actions bot added NLP core Changes to NeMo Core labels Apr 23, 2024
mikolajblaz and others added 5 commits April 23, 2024 13:02
Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>
Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>
Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>
Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>
for more information, see https://pre-commit.ci

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>
@mikolajblaz mikolajblaz force-pushed the mblaz/prevent-duplicated-checkpoints branch from b92f3bd to 76df0d4 Compare April 23, 2024 11:02
@mikolajblaz mikolajblaz self-assigned this Apr 23, 2024
@mikolajblaz mikolajblaz marked this pull request as ready for review April 23, 2024 11:06
Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>
Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>
@mikolajblaz mikolajblaz force-pushed the mblaz/prevent-duplicated-checkpoints branch from 51ad5a2 to 10297aa Compare April 23, 2024 12:22
Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>
dimapihtar
dimapihtar previously approved these changes Apr 25, 2024
Copy link
Collaborator

@dimapihtar dimapihtar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thank you!

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>
Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>
@dimapihtar dimapihtar self-requested a review May 1, 2024 11:45
dimapihtar
dimapihtar previously approved these changes May 1, 2024
Copy link
Collaborator

@dimapihtar dimapihtar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thank you!

@dimapihtar dimapihtar merged commit 305392d into main May 8, 2024
134 checks passed
@dimapihtar dimapihtar deleted the mblaz/prevent-duplicated-checkpoints branch May 8, 2024 19:23
BoxiangW pushed a commit to BoxiangW/NeMo that referenced this pull request Jun 5, 2024
* Prevent duplicated checkpoints

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Add versioning to save_to

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Add versioning logic to all .nemo files

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Add versioning test

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Add dist-ckpt test

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Rename existing ckpts instead of using different name

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Add comment

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Run dist-ckpt test on GPU

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Prevent saving last for non-equal val intervals

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Move checkpoint on rank 0

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

---------

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Dmytro Pykhtar <37850217+dimapihtar@users.noreply.github.com>
Signed-off-by: Boxiang Wang <boxiangw@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Changes to NeMo Core NLP Run CICD
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants