Implement async distributed checkpoint save #9028

Merged May 15, 2024 (86 commits)

Conversation

@mikolajblaz mikolajblaz commented Apr 24, 2024

What does this PR do ?

Adds an async distributed checkpoint save implementation.

The base of this PR is set to mblaz/async-dist-ckpt-minimal-base, since this PR requires #9016 and #9015 to be merged first and the base branch contains combined changes from those branches.

Collection: NLP

Changelog

  • Adds AsyncFinalizableCheckpointIO async saving wrapper (along with AsyncFinalizerCallback)
  • Adds temporary PyT Dist async save strategy (to be replaced with MCore v0.7 version)
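
To make the two changelog items more concrete, here is a minimal, self-contained sketch of the general pattern that an async saving wrapper with a separate finalizer follows: the save call returns right away while a background worker writes the data, and a later finalize step (driven by a callback in a training loop) completes the saves whose writes have finished. This is not the NeMo or MCore implementation. `ToyAsyncCheckpointIO`, `maybe_finalize`, `run_demo`, the thread-based worker, and the JSON file format are all assumptions made for illustration.

```python
import json
import os
import tempfile
import threading
from collections import deque


class ToyAsyncCheckpointIO:
    """Hypothetical stand-in for an async-finalizable checkpoint writer."""

    def __init__(self):
        # Pending saves in submission order: (worker thread, finalize function).
        self._pending = deque()

    def save_checkpoint(self, state, path):
        # Snapshot the state first so later training steps cannot mutate it.
        snapshot = json.dumps(state)
        tmp_path = path + ".tmp"

        def write():
            with open(tmp_path, "w") as f:
                f.write(snapshot)

        def finalize():
            # Publish the checkpoint only once the write has fully completed.
            os.replace(tmp_path, path)

        worker = threading.Thread(target=write)
        worker.start()
        self._pending.append((worker, finalize))

    def maybe_finalize(self, blocking=False):
        """Finalize completed saves in order; return how many were finalized."""
        finalized = 0
        while self._pending:
            worker, finalize = self._pending[0]
            if blocking:
                worker.join()
            elif worker.is_alive():
                break  # oldest save still running; try again later
            self._pending.popleft()
            finalize()
            finalized += 1
        return finalized


def run_demo():
    ckpt_io = ToyAsyncCheckpointIO()
    with tempfile.TemporaryDirectory() as tmp_dir:
        paths = [os.path.join(tmp_dir, f"step{i}.json") for i in range(2)]
        for step, path in enumerate(paths):
            ckpt_io.save_checkpoint({"step": step}, path)  # returns immediately
            ckpt_io.maybe_finalize()  # what a per-step callback might do
        ckpt_io.maybe_finalize(blocking=True)  # e.g. at the end of training
        saved = []
        for path in paths:
            with open(path) as f:
                saved.append(json.load(f))
        return saved
```

Keeping finalization separate from the write lets the training loop decide when to pay the synchronization cost. In a real distributed job the decision that a save is complete would presumably have to be agreed on across ranks, which this single-process sketch does not model.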

Usage

  • set config exp_manager.checkpoint_callback_params.async_save=True
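
As a rough illustration of where that flag sits, the sketch below applies the dotted override to a nested config. NeMo configs are Hydra/OmegaConf based; plain dictionaries are used here only to keep the example dependency-free. `apply_override` is a hypothetical helper, not a NeMo function, and the pre-existing `save_top_k` entry is just a sample value.

```python
def apply_override(config, override):
    """Apply a `dotted.key=value` override to a nested dict (hypothetical helper)."""
    dotted_key, raw_value = override.split("=", 1)
    *parents, leaf = dotted_key.split(".")
    node = config
    for name in parents:
        # Create intermediate sections on demand.
        node = node.setdefault(name, {})
    # Interpret booleans only; other values stay as strings in this sketch.
    node[leaf] = {"True": True, "False": False}.get(raw_value, raw_value)
    return config


# Sample starting config; only the nesting matters here.
config = {"exp_manager": {"checkpoint_callback_params": {"save_top_k": 3}}}
apply_override(config, "exp_manager.checkpoint_callback_params.async_save=True")
```

After the call, `config["exp_manager"]["checkpoint_callback_params"]["async_save"]` is `True` and the other checkpoint callback parameters are left untouched.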

mikolajblaz and others added 23 commits April 19, 2024 13:44
@@ -0,0 +1,220 @@
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
@dimapihtar this whole file won't be necessary once the following MR is available in MCore: https://gitlab-master.nvidia.com/ADLR/megatron-lm/-/merge_requests/1380

dimapihtar previously approved these changes May 6, 2024

LGTM. Thank you!

mikolajblaz and others added 2 commits May 13, 2024 21:43
@dimapihtar dimapihtar self-requested a review May 15, 2024 11:56

@dimapihtar dimapihtar left a comment

LGTM. Thank you!

@mikolajblaz mikolajblaz merged commit c2daa91 into main May 15, 2024
131 checks passed
@mikolajblaz mikolajblaz deleted the mblaz/async-dist-ckpt branch May 15, 2024 11:57
mikolajblaz added a commit that referenced this pull request May 15, 2024
* Prevent duplicated checkpoints

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Introduce DistributedCheckpointIO

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Fix DistCkptIO usage

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Use NeMo logger

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* [DCIO] Fix save_to dist ckpt path

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add versioning to save_to

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Add versioning logic to all .nemo files

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Add versioning test

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Add dist-ckpt test

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Rename existing ckpts instead of using different name

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Add comment

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Use dist ckpt flag in all methods

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Improve error msg

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Add dist ckpt unit tests

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix load_checkpoint

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Fix auto-issues

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Fix ckpt_dir var

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Restore skipping behavior

The fix from prevent-duplicated-checkpoints is required to skip the checkpoints

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Fix steps on single-GPU machine

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Run dist-ckpt test on GPU

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Add docs

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Apply black

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Prevent saving last for non-equal val intervals

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Move checkpoint on rank 0

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Fix num steps in tests

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Add async ckpt implementation

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Abstract AsyncFinalizableCheckpointIO away

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Change async_save flag location

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Add debug info

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Apply formatting

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Handle multiple async saves

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Apply formatting

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Move finalization calls to a callback

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Avoid deadlock in teardown

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Adjust to MCore implementation

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Add notes and copyrights

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Apply formatting

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Fix async_request attribute

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Add MCore import guards

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Add async test

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Fix finalize_fn arg

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Add docs

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Remove checkpoints from accurate steps

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix MCore class usage

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Update docs

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Fix logger usage

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Fix rebase

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix code scan issues

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Remove unsused import

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Use dist-ckpt for Bert

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix load checkpoint return val

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Use dist-ckpt based on sharded_state_dict

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Add async logging

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Remove deprecated argument

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Use correct checkpoint_io

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix bad merge

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Improve debug msg

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Run async test on GPU

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Fix async ckpt unit test

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: mikolajblaz <mikolajblaz@users.noreply.github.com>

* Clarify async logs

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Add schema print

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

---------

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>
Signed-off-by: mikolajblaz <mikolajblaz@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
BoxiangW pushed a commit to BoxiangW/NeMo that referenced this pull request Jun 5, 2024
Signed-off-by: Boxiang Wang <boxiangw@nvidia.com>
rohitrango pushed a commit to rohitrango/NeMo that referenced this pull request Jun 25, 2024
@ko3n1g ko3n1g mentioned this pull request Jul 18, 2024
2 tasks
Labels
core Changes to NeMo Core NLP Run CICD
2 participants