Implement async distributed checkpoint save #9028
Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>
@@ -0,0 +1,220 @@
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
@dimapihtar this whole file won't be necessary once the following MR is available in MCore: https://gitlab-master.nvidia.com/ADLR/megatron-lm/-/merge_requests/1380
LGTM. Thank you!
LGTM. Thank you!
* Prevent duplicated checkpoints
* Introduce DistributedCheckpointIO
* Fix DistCkptIO usage
* Use NeMo logger
* [DCIO] Fix save_to dist ckpt path
* [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)
* Add versioning to save_to
* Add versioning logic to all .nemo files
* Add versioning test
* Add dist-ckpt test
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* Rename existing ckpts instead of using different name
* Add comment
* Use dist ckpt flag in all methods
* Improve error msg
* Add dist ckpt unit tests
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* Fix load_checkpoint
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* Fix auto-issues
* Fix ckpt_dir var
* Restore skipping behavior. The fix from prevent-duplicated-checkpoints is required to skip the checkpoints
* Fix steps on single-GPU machine
* Run dist-ckpt test on GPU
* Add docs
* Apply black
* Prevent saving last for non-equal val intervals
* Move checkpoint on rank 0
* Fix num steps in tests
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* Add async ckpt implementation
* Abstract AsyncFinalizableCheckpointIO away
* Change async_save flag location
* Add debug info
* Apply formatting
* Handle multiple async saves
* Apply formatting
* Move finalization calls to a callback
* Avoid deadlock in teardown
* Adjust to MCore implementation
* Add notes and copyrights
* Apply formatting
* Fix async_request attribute
* Add MCore import guards
* Add async test
* Fix finalize_fn arg
* Add docs
* Remove checkpoints from accurate steps
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* Fix MCore class usage
* Update docs
* Fix logger usage
* Fix rebase
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* Fix code scan issues
* Remove unsused import
* Use dist-ckpt for Bert
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* Fix load checkpoint return val
* Use dist-ckpt based on sharded_state_dict
* Add async logging
* Remove deprecated argument
* Use correct checkpoint_io
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* Fix bad merge
* Improve debug msg
* Run async test on GPU
* Fix async ckpt unit test
* Apply isort and black reformatting
* Clarify async logs
* Add schema print
---------
Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>
Signed-off-by: mikolajblaz <mikolajblaz@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
What does this PR do?
Adds an asynchronous distributed checkpoint save implementation.
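To make the idea concrete, here is a minimal standard-library sketch of an asynchronous save with deferred finalization. It is not the NeMo or Megatron-Core implementation: `AsyncSaver`, `save`, and `maybe_finalize` are hypothetical names, `finalize_fn` is used here only as an illustrative callback parameter, a thread stands in for whatever background worker the real code uses, and a JSON file stands in for the sharded checkpoint format. The sketch assumes finalization should happen in submission order and only after the corresponding write has finished.

```python
import json
import os
import tempfile
import threading
from typing import Callable, List, Tuple


class AsyncSaver:
    """Toy async checkpoint writer (hypothetical; not the NeMo/MCore API)."""

    def __init__(self) -> None:
        # Each pending entry pairs a writer thread with its finalize callback.
        self._pending: List[Tuple[threading.Thread, Callable[[], None]]] = []

    def save(self, state: dict, path: str, finalize_fn: Callable[[], None]) -> None:
        # Snapshot synchronously so later training steps cannot mutate the data.
        snapshot = json.dumps(state)

        def _write() -> None:
            tmp_path = path + ".tmp"
            with open(tmp_path, "w") as f:
                f.write(snapshot)
            # Rename only after the write completes, so no partial file is visible.
            os.replace(tmp_path, path)

        thread = threading.Thread(target=_write)
        thread.start()
        self._pending.append((thread, finalize_fn))

    def maybe_finalize(self, blocking: bool = False) -> int:
        """Run finalize callbacks for finished writes, in submission order."""
        finalized = 0
        while self._pending:
            thread, finalize_fn = self._pending[0]
            if blocking:
                thread.join()  # teardown path: wait for the write to finish
            elif thread.is_alive():
                break  # oldest write still running; check again later
            self._pending.pop(0)
            finalize_fn()
            finalized += 1
        return finalized


if __name__ == "__main__":
    saver = AsyncSaver()
    done: List[int] = []
    with tempfile.TemporaryDirectory() as ckpt_dir:
        for step in (10, 20):
            path = os.path.join(ckpt_dir, f"step={step}.json")
            saver.save({"step": step}, path, finalize_fn=lambda s=step: done.append(s))
            saver.maybe_finalize()  # cheap, non-blocking poll between steps
        saver.maybe_finalize(blocking=True)  # end of training: drain the queue
    print(done)  # [10, 20]
```

Keeping the finalize step separate from the write means the training loop pays only for the snapshot and a cheap poll, while work such as removing older checkpoints can wait until the new one is fully on disk.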
The base of this PR is set to mblaz/async-dist-ckpt-minimal-base, since this PR requires #9016 and #9015 to be merged first and the base branch contains the combined changes from those branches.
Collection: NLP
Changelog
Usage
exp_manager.checkpoint_callback_params.async_save=True
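The line above is a dotted config override. As a rough illustration of which nested key it targets, the following standard-library sketch applies such an override to a plain dictionary. This is an assumption-laden toy: real NeMo configs are resolved by the project's own config tooling, `apply_override` is a hypothetical helper, and the `save_top_k` entry is only there to give the example config some existing content.

```python
from typing import Any, Dict


def apply_override(config: Dict[str, Any], override: str) -> None:
    """Apply a 'a.b.c=value' style override to a nested dict (toy parser)."""
    key_path, raw_value = override.split("=", 1)
    *parents, leaf = key_path.split(".")
    node = config
    for key in parents:
        # Create intermediate sections on demand.
        node = node.setdefault(key, {})
    # Only booleans are interpreted; anything else stays a string in this sketch.
    node[leaf] = {"True": True, "False": False}.get(raw_value, raw_value)


if __name__ == "__main__":
    # Illustrative starting config; the existing entry is an assumption.
    cfg: Dict[str, Any] = {"exp_manager": {"checkpoint_callback_params": {"save_top_k": 3}}}
    apply_override(cfg, "exp_manager.checkpoint_callback_params.async_save=True")
    print(cfg["exp_manager"]["checkpoint_callback_params"]["async_save"])  # True
```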
Jenkins CI
The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.
There's no need to comment "jenkins" on the PR to trigger Jenkins CI. The GitHub Actions CI will run automatically when the PR is opened.
To run CI on an untrusted fork, a NeMo user with write access must click "Approve and run".
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items you can still open "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs in various areas.
Additional Information