S3 Dirpath + Async Uploading Support for Default Checkpoints #9045
I added some high-level comments.
How does this PR relate to this one: NVIDIA/Megatron-LM#748?
# If the future is complete, we can remove the temp file since we choose to clear the temp file when uploading.
try:
    self._temp_files.remove(item[2])
except:
Code scanning / CodeQL notice: Except block handles 'BaseException'
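The CodeQL notice above flags the bare `except:`, which also swallows `KeyboardInterrupt` and `SystemExit`. A minimal sketch of a narrower handler follows. It assumes `_temp_files` is a plain list and that `item` is a `(future, checkpoint_path, temp_file_path)` tuple; only the `item[2]` access comes from the snippet under review, and the function name is hypothetical.

```python
from concurrent.futures import Future


def drop_finished_temp_file(temp_files: list, item: tuple) -> bool:
    """Forget the temp file path of a completed upload.

    Assumption: item is (future, checkpoint_path, temp_file_path).
    Returns True only if the path was tracked and has now been removed.
    """
    future, _, temp_path = item
    if not future.done():
        # The upload is still running, so the temp file is still needed.
        return False
    try:
        temp_files.remove(temp_path)
    except ValueError:
        # list.remove raises ValueError when the entry is already gone.
        # Catching only that keeps Ctrl-C and interpreter exit working.
        return False
    return True
```

Catching `ValueError` alone is enough here because that is the only exception `list.remove` raises for a missing entry.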
try:
    import awscrt
    import s3transfer.crt
Code scanning / CodeQL notice: Unused import
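One way to probe for an optional dependency without triggering an unused-import notice is to ask the import system whether the module can be found. This is a sketch only; `has_module` and `CRT_AVAILABLE` are hypothetical names and not part of this PR.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if `name` is importable.

    Top-level modules are located without being imported. For a dotted
    name such as "s3transfer.crt", find_spec imports the parent package
    first, so a missing parent surfaces as ImportError.
    """
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        return False


# Evaluated once at import time; downstream code can branch on the flag.
CRT_AVAILABLE = has_module("awscrt") and has_module("s3transfer.crt")
```

The flag is a plain boolean, so the rest of the module never has to reference the imported names just to keep a linter quiet.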
The linked PR adds S3 checkpointing support for the distributed checkpoint format.
LGTM. Thanks!
LGTM. Thanks!
…9045)
* Add S3 dirpath and asynchronous uploading support for basic checkpointing
* Update megtron_gpt_pretraining config to support S3 checkpointing
* Removed unused imports
* move s3_checkpoint_io into callbacks. consolidate checkpoint_file_utils into s3_utils.py
* Update setup() in nemo_model_checkpoint to broadcast checkpoint path and work with upstreamed implementation of removing unfinished checkpoints
* Add boto3 dependency for testing
* Remove redundant setup() in nemo_model_checkpoint
* Remove comment line from import
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Removed explicit CRT calls since boto[crt] automatically uses CRT for file upload and download
* Style fix
* remove un-used s3transfer import
* add s3 prefix for s3-related checkpointing config
* dummy sleep function lowered from 1 to 0.01 seconds
* Remove local_rank checking for rank, and use is_global_rank_zero.
* Style fix
* Apply isort and black reformatting
* add tenacity dependency
* Apply isort and black reformatting
* Add filtering of unfinished checkpoint to non-s3 checkpoint resuming
* isort black reformatting
* Apply isort and black reformatting
* Remove dependency requirement for checking if dirpath is an s3 path
* Make dependencies fully optional; allow exp_manager to optionally import S3Utils depending on whether dirpath is an S3 address or not
* Add rst doc for s3 checkpointing
* Remove unneeded assert
* Removed dependencies
* Apply isort and black reformatting
* Updated documentation on async save to S3
* Apply isort and black reformatting
* Update S3 checkpointing doc and fix visibility on website. Update the nlp_overrides DDP initializer to properly assign updated checkpoint io to base class.
* Apply isort and black reformatting
* Slight fix in s3 checkpoint doc
---------
Signed-off-by: Alexander Zhang <alxzhang@amazon.com>
Signed-off-by: alxzhang-amazon <166076199+alxzhang-amazon@users.noreply.github.com>
Signed-off-by: alxzhang-amazon <alxzhang-amazon@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: alxzhang-amazon <alxzhang-amazon@users.noreply.github.com>
What does this PR do?
This PR adds a new CheckpointIO that lets users upload checkpoints directly to S3.
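The general idea behind asynchronous uploading can be sketched as follows: stage the checkpoint in a local temp file, hand the upload to a background thread, and let training continue. This is an illustration under stated assumptions, not the `S3CheckpointIO` implementation from this PR. `AsyncUploader` and `upload_fn` are hypothetical names, and `upload_fn` stands in for whatever S3 client call performs the transfer.

```python
import os
import tempfile
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, List, Tuple


class AsyncUploader:
    """Hypothetical helper: write locally, upload in the background."""

    def __init__(self, upload_fn: Callable[[str, str], None], max_workers: int = 2):
        # upload_fn(local_path, destination) is an assumption; in practice
        # it might wrap an S3 client's file-upload call.
        self._upload_fn = upload_fn
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._pending: List[Tuple[Future, str]] = []

    def save(self, payload: bytes, destination: str) -> None:
        # Stage the bytes on local disk so the caller returns quickly.
        fd, temp_path = tempfile.mkstemp(suffix=".ckpt")
        with os.fdopen(fd, "wb") as handle:
            handle.write(payload)
        future = self._pool.submit(self._upload_and_clean, temp_path, destination)
        self._pending.append((future, destination))

    def _upload_and_clean(self, temp_path: str, destination: str) -> None:
        try:
            self._upload_fn(temp_path, destination)
        finally:
            # Remove the staged file whether or not the upload succeeded.
            os.remove(temp_path)

    def wait(self) -> None:
        # Block until every queued upload finishes; result() re-raises
        # any exception that occurred in the worker thread.
        pending, self._pending = self._pending, []
        for future, _ in pending:
            future.result()

    def close(self) -> None:
        self.wait()
        self._pool.shutdown()
```

Calling `wait()` before the process exits, or before resuming from the latest checkpoint, ensures no upload is still in flight.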
Collection: [NLP]
Changelog
- `S3CheckpointIO`: enables the strategy to upload default checkpoints to S3.
- `checkpoint_file_utils`: file that identifies existing paths.
- `S3CheckpointIO`: cleans up existing checkpoints.
- `S3Utils`: class that contains helper methods for interacting with S3.
- `exp_manager` `check_resume`: runs only on rank 0 when using an S3 dirpath (prevents throttling S3 due to `check_resume` operations).
- `NeMoCheckpointConnector` to `exp_manager`: calls `resume_start` using the broadcast checkpoint path.
- `setup()` override in `NeMoModelCheckpoint`: broadcasts `trainer.ckpt_path`, since only rank 0 has it after `check_resume`.
- `requirements_nlp.txt`: includes support for the S3 filesystem as well as CRT support for faster uploading.
Usage
Example from the `megatron_gpt_pretraining.py` example
Config Updates:
Jenkins CI
The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.
There's no need to comment `jenkins` on the PR to trigger Jenkins CI.
The GitHub Actions CI will run automatically when the PR is opened.
To run CI on an untrusted fork, a NeMo user with write access must click "Approve and run".
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
The contributor guidelines list specific people who can review PRs in various areas.
Additional Information