S3 Dirpath + Async Uploading Support for Default Checkpoints #9045

Merged

Conversation

alxzhang-amazon
Contributor

@alxzhang-amazon alxzhang-amazon commented Apr 26, 2024

What does this PR do?

This PR creates a new CheckpointIO which allows users to upload checkpoints directly to S3.

Collection: [NLP]

Changelog

  • Adds a new S3CheckpointIO to enable strategy to upload default checkpoints to S3.
  • Adds a checkpoint_file_utils file that identifies existing checkpoint paths.
    • Used in S3CheckpointIO to clean up existing checkpoints.
  • Adds an S3Utils class that contains helper methods for interacting with S3.
    • Writes files using the CRT support that comes built in when boto3[crt] is installed: reference
  • Updates exp_manager check_resume to run only on rank 0 when using an S3 dirpath (prevents S3 throttling caused by check_resume operations).
  • Adds NeMoCheckpointConnector to exp_manager to call resume_start using the broadcast checkpoint path.
  • Adds a setup() override in NeMoModelCheckpoint to broadcast trainer.ckpt_path, since only rank 0 has it after check_resume.
  • Updates requirements_nlp.txt to include support for the S3 filesystem as well as CRT support for faster uploading.
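
To illustrate the kind of helper the changelog describes, here is a minimal sketch that detects an S3 dirpath and splits it into a bucket and key prefix using only the standard library. The names `is_s3_url` and `parse_s3_url` are hypothetical, chosen for this example; they are not confirmed to match the methods on S3Utils.

```python
from typing import Tuple

S3_PREFIX = "s3://"


def is_s3_url(path) -> bool:
    """Return True if `path` is a string that looks like an S3 address."""
    return isinstance(path, str) and path.startswith(S3_PREFIX)


def parse_s3_url(path: str) -> Tuple[str, str]:
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    if not is_s3_url(path):
        raise ValueError(f"not an S3 address: {path!r}")
    # Everything up to the first '/' after the scheme is the bucket name.
    bucket, _, key = path[len(S3_PREFIX):].partition("/")
    if not bucket:
        raise ValueError(f"missing bucket name in {path!r}")
    return bucket, key
```

A check like this needs no S3 client library, so code such as exp_manager could decide whether a dirpath is remote before importing anything S3-specific.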

Usage

  • To upload checkpoints to S3, add the checkpoint connector to the trainer, and apply the S3CheckpointIO to your existing strategy.
  • Update the dirpath field in your config to an S3 address.

Example from megatron_gpt_pretraining.py:

# In the full script, main() is wrapped by a Hydra runner decorator that supplies cfg.
def main(cfg) -> None:
    logging.info("\n\n************** Experiment configuration ***********")
    logging.info(f'\n{OmegaConf.to_yaml(cfg)}')

    trainer = MegatronTrainerBuilder(cfg).create_trainer()
    # Attach the NeMo connector so that resuming uses the broadcast checkpoint path.
    trainer._checkpoint_connector = NeMoCheckpointConnector(trainer)
    exp_manager(trainer, cfg.exp_manager)

    model = MegatronGPTModel(cfg.model, trainer)
    trainer.fit(model)


if __name__ == '__main__':
    main()

Applying S3CheckpointIO to the strategy (fragment from inside a trainer-builder method, where self.cfg is the experiment config):

async_checkpointing = self.cfg.checkpointing.get('enable_async_checkpointing', False)
dirpath = self.cfg.exp_manager.checkpoint_callback_params.get('dirpath')
s3_checkpoint_io = S3CheckpointIO(dirpath=dirpath, async_checkpointing=async_checkpointing)
return NLPDDPStrategy(
    no_ddp_communication_hook=True,
    checkpoint_io=s3_checkpoint_io,
    gradient_as_bucket_view=self.cfg.model.gradient_as_bucket_view,
    find_unused_parameters=False,
    nccl_communicator_config_path=self.cfg.model.get('nccl_communicator_config_path', None),
    sharp=self.cfg.model.get('sharp', False),
)

Config Updates:

checkpoint_callback_params:
    dirpath: s3://mstar-eks-dev-us-east-2/alxzhang/nemo123/1n/checkpoints
 
checkpointing:
  # write_concurrency * tp * pp * 1.15 (buffer) should be within 3500 S3 TPS limit per partition
  max_write_concurrency: 10
  # read_concurrency * tp * pp * 1.15 (buffer) should be within 5500 S3 TPS limit per partition
  max_read_concurrency: 15
  chunk_size_MB: 64
  # enables asynchronous checkpoint writing to S3
  enable_async_checkpointing: False
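
The two comments in the config give a rule of thumb for sizing the concurrency values: concurrency * tp * pp * 1.15 should stay within the per-partition S3 request limit (3500 for writes, 5500 for reads). The sketch below just turns that arithmetic into code; the function names are hypothetical, and the limits and the 1.15 buffer are taken from the config comments rather than from any API.

```python
import math

S3_WRITE_TPS_LIMIT = 3500  # per-partition write limit quoted in the config comment
S3_READ_TPS_LIMIT = 5500   # per-partition read limit quoted in the config comment
BUFFER = 1.15              # safety margin quoted in the config comment


def estimated_tps(concurrency: int, tp: int, pp: int, buffer: float = BUFFER) -> float:
    """Estimated requests per second across all tensor/pipeline-parallel ranks."""
    return concurrency * tp * pp * buffer


def max_concurrency(tp: int, pp: int, limit: int, buffer: float = BUFFER) -> int:
    """Largest per-rank concurrency whose estimate stays within `limit`."""
    return math.floor(limit / (tp * pp * buffer))


# Example: with tp=8 and pp=4, the values from the config are well within both limits.
write_ok = estimated_tps(10, tp=8, pp=4) <= S3_WRITE_TPS_LIMIT
read_ok = estimated_tps(15, tp=8, pp=4) <= S3_READ_TPS_LIMIT
```

For larger model-parallel sizes the headroom shrinks quickly: under these assumptions, tp=8 and pp=16 leaves room for a write concurrency of at most 23.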

@github-actions github-actions bot added the NLP label Apr 26, 2024
@alxzhang-amazon alxzhang-amazon force-pushed the s3-checkpointing-support-upstream branch 4 times, most recently from 68523e9 to 28077ce Compare April 30, 2024 15:58
alxzhang-amazon and others added 11 commits April 30, 2024 13:03
…ting

Signed-off-by: Alexander Zhang <alxzhang@amazon.com>
Signed-off-by: Alexander Zhang <alxzhang@amazon.com>
Signed-off-by: Alexander Zhang <alxzhang@amazon.com>
…ls into s3_utils.py

Signed-off-by: Alexander Zhang <alxzhang@amazon.com>
…and work with upstreamed implementation of removing unfinished checkpoints

Signed-off-by: Alexander Zhang <alxzhang@amazon.com>
Signed-off-by: Alexander Zhang <alxzhang@amazon.com>
Signed-off-by: Alexander Zhang <alxzhang@amazon.com>
Signed-off-by: Alexander Zhang <alxzhang@amazon.com>
… file upload and download

Signed-off-by: Alexander Zhang <alxzhang@amazon.com>
Signed-off-by: Alexander Zhang <alxzhang@amazon.com>
@alxzhang-amazon alxzhang-amazon force-pushed the s3-checkpointing-support-upstream branch from 28077ce to 4c934fd Compare April 30, 2024 20:03
@alxzhang-amazon alxzhang-amazon marked this pull request as ready for review May 1, 2024 16:40
Signed-off-by: alxzhang-amazon <166076199+alxzhang-amazon@users.noreply.github.com>
Collaborator

@mikolajblaz mikolajblaz left a comment

I added some high-level comments.

How does this PR relate to this one: NVIDIA/Megatron-LM#748?

Review threads: nemo/utils/callbacks/s3_checkpoint_io.py (2), nemo/utils/exp_manager.py (2)
# If the future is complete, we can remove the temp file since we choose to clear the temp file when uploading.
try:
    self._temp_files.remove(item[2])
except:

Check notice

Code scanning / CodeQL

Except block handles 'BaseException' Note

Except block directly handles BaseException.
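
One way to address this notice is to catch only the exception the cleanup can reasonably raise. The sketch below is not the PR's implementation: it assumes the temp-file registry is a plain list (so `remove` raises `ValueError` when the entry is missing), and `fake_upload` and `drain_finished` are hypothetical stand-ins for the asynchronous upload and its bookkeeping.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def fake_upload(path: str) -> str:
    """Stand-in for an S3 upload: reads the file and returns its path."""
    with open(path, "rb") as f:
        f.read()
    return path


def drain_finished(pending, temp_files):
    """Drop finished (future, path) pairs and clean up their temp files."""
    still_pending = []
    for future, path in pending:
        if not future.done():
            still_pending.append((future, path))
            continue
        future.result()  # re-raise any upload error instead of hiding it
        try:
            temp_files.remove(path)
        except ValueError:
            # list.remove raises ValueError when the entry is absent;
            # only that case is ignored, unlike a bare `except:`.
            pass
        if os.path.exists(path):
            os.remove(path)
    return still_pending


tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.write(b"checkpoint bytes")
tmp.close()
temp_files = [tmp.name]
with ThreadPoolExecutor(max_workers=1) as pool:
    pending = [(pool.submit(fake_upload, tmp.name), tmp.name)]
# Leaving the `with` block waits for the upload to finish.
pending = drain_finished(pending, temp_files)
```

A bare `except:` would also swallow `KeyboardInterrupt` and `SystemExit`, which is what the CodeQL note is warning about.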
nemo/utils/s3_utils.py

try:
    import awscrt
    import s3transfer.crt

Check notice

Code scanning / CodeQL

Unused import Note

Import of 's3transfer' is not used.
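
When a module is imported only to test whether it is installed, the unused-import notice can be avoided by probing for it instead of importing it. This is a minimal sketch of that pattern with the standard library; `has_module`, `crt_available`, and `require_s3_dependencies` are hypothetical names for this example and are not taken from the PR.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if `name` can be imported, without importing it."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # find_spec can raise for dotted names with a missing parent package
        # or for modules with a broken __spec__.
        return False


def crt_available() -> bool:
    """True when the AWS CRT bindings appear to be installed."""
    return has_module("awscrt")


def require_s3_dependencies(dirpath: str) -> None:
    """Raise a clear error only when an S3 dirpath is used without boto3."""
    if dirpath.startswith("s3://") and not has_module("boto3"):
        raise ImportError("S3 checkpointing requires boto3; install boto3[crt].")
```

Guarding on the dirpath keeps the S3 packages optional for users who only checkpoint to local storage.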
@alxzhang-amazon
Contributor Author

I added some high-level comments.

How does this PR relate to this one: NVIDIA/Megatron-LM#748?

The linked PR adds S3 checkpointing support for the distributed checkpoint format.
This PR targets the legacy format, but it can also help the linked PR by enabling the exp_manager to handle S3 dirpaths.

alxzhang-amazon and others added 5 commits May 9, 2024 09:57
Signed-off-by: Alexander Zhang <alxzhang@amazon.com>
Signed-off-by: Alexander Zhang <alxzhang@amazon.com>
Signed-off-by: Alexander Zhang <alxzhang@amazon.com>
Signed-off-by: Alexander Zhang <alxzhang@amazon.com>
alxzhang-amazon and others added 6 commits June 12, 2024 09:04
Signed-off-by: Alexander Zhang <alxzhang@amazon.com>
…ort S3Utils depending on whether dirpath is an S3 address or not

Signed-off-by: Alexander Zhang <alxzhang@amazon.com>
Signed-off-by: Alexander Zhang <alxzhang@amazon.com>
Signed-off-by: Alexander Zhang <alxzhang@amazon.com>
Signed-off-by: Alexander Zhang <alxzhang@amazon.com>
Signed-off-by: alxzhang-amazon <alxzhang-amazon@users.noreply.github.com>
import re
import time
from io import BytesIO
from pathlib import Path

Check notice

Code scanning / CodeQL

Unused import Note

Import of 'Path' is not used.
alxzhang-amazon and others added 3 commits June 13, 2024 09:40
Signed-off-by: Alexander Zhang <alxzhang@amazon.com>
Signed-off-by: alxzhang-amazon <alxzhang-amazon@users.noreply.github.com>
ericharper
ericharper previously approved these changes Jun 14, 2024
Collaborator

@ericharper ericharper left a comment

LGTM. Thanks!

… nlp_overrides DDP initializer to properly assign updated checkpoint io to base class.

Signed-off-by: Alexander Zhang <alxzhang@amazon.com>
alxzhang-amazon and others added 3 commits June 14, 2024 23:04
Signed-off-by: alxzhang-amazon <alxzhang-amazon@users.noreply.github.com>
Signed-off-by: Alexander Zhang <alxzhang@amazon.com>
Collaborator

@ericharper ericharper left a comment

LGTM. Thanks!

@ericharper ericharper merged commit 1f31f3b into NVIDIA:main Jun 15, 2024
110 of 208 checks passed
JesusPaz pushed a commit to JesusPaz/NeMo that referenced this pull request Jun 18, 2024
…9045)

* Add S3 dirpath and asynchronous uploading support for basic checkpointing

Signed-off-by: Alexander Zhang <alxzhang@amazon.com>

* Update megtron_gpt_pretraining config to support S3 checkpointing

Signed-off-by: Alexander Zhang <alxzhang@amazon.com>

* Removed unused imports

Signed-off-by: Alexander Zhang <alxzhang@amazon.com>

* move s3_checkpoint_io into callbacks. consolidate checkpoint_file_utils into s3_utils.py

Signed-off-by: Alexander Zhang <alxzhang@amazon.com>

* Update setup() in nemo_model_checkpoint to broadcast checkpoint path and work with upstreamed implementation of removing unfinished checkpoints

Signed-off-by: Alexander Zhang <alxzhang@amazon.com>

* Add boto3 dependency for testing

Signed-off-by: Alexander Zhang <alxzhang@amazon.com>

* Remove redundant setup() in nemo_model_checkpoint

Signed-off-by: Alexander Zhang <alxzhang@amazon.com>

* Remove comment line from import

Signed-off-by: Alexander Zhang <alxzhang@amazon.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Removed explicit CRT calls since boto[crt] automatically uses CRT for file upload and download

Signed-off-by: Alexander Zhang <alxzhang@amazon.com>

* Style fix

Signed-off-by: Alexander Zhang <alxzhang@amazon.com>

* remove un-used s3transfer import

Signed-off-by: Alexander Zhang <alxzhang@amazon.com>

* add s3 prefix for s3-related checkpointing config

Signed-off-by: Alexander Zhang <alxzhang@amazon.com>

* dummy sleep function lowered from 1 to 0.01 seconds

Signed-off-by: Alexander Zhang <alxzhang@amazon.com>

* Remove local_rank checking for rank, and use is_global_rank_zero.

Signed-off-by: Alexander Zhang <alxzhang@amazon.com>

* Style fix

Signed-off-by: Alexander Zhang <alxzhang@amazon.com>

* Apply isort and black reformatting

Signed-off-by: alxzhang-amazon <alxzhang-amazon@users.noreply.github.com>

* add tenacity dependency

Signed-off-by: Alexander Zhang <alxzhang@amazon.com>

* Apply isort and black reformatting

Signed-off-by: alxzhang-amazon <alxzhang-amazon@users.noreply.github.com>

* Add filtering of unfinished checkpoint to non-s3 checkpoint resuming

Signed-off-by: Alexander Zhang <alxzhang@amazon.com>

* isort black reformatting

Signed-off-by: Alexander Zhang <alxzhang@amazon.com>

* Apply isort and black reformatting

Signed-off-by: alxzhang-amazon <alxzhang-amazon@users.noreply.github.com>

* Remove dependency requirement for checking if dirpath is an s3 path

Signed-off-by: Alexander Zhang <alxzhang@amazon.com>

* Make dependencies fully optional; allow exp_manager to optionally import S3Utils depending on whether dirpath is an S3 address or not

Signed-off-by: Alexander Zhang <alxzhang@amazon.com>

* Add rst doc for s3 checkpointing

Signed-off-by: Alexander Zhang <alxzhang@amazon.com>

* Remove unneeded assert

Signed-off-by: Alexander Zhang <alxzhang@amazon.com>

* Removed dependencies

Signed-off-by: Alexander Zhang <alxzhang@amazon.com>

* Apply isort and black reformatting

Signed-off-by: alxzhang-amazon <alxzhang-amazon@users.noreply.github.com>

* Updated documentation on async save to S3

Signed-off-by: Alexander Zhang <alxzhang@amazon.com>

* Apply isort and black reformatting

Signed-off-by: alxzhang-amazon <alxzhang-amazon@users.noreply.github.com>

* Update S3 checkpointing doc and fix visibility on website. Update the nlp_overrides DDP initializer to properly assign updated checkpoint io to base class.

Signed-off-by: Alexander Zhang <alxzhang@amazon.com>

* Apply isort and black reformatting

Signed-off-by: alxzhang-amazon <alxzhang-amazon@users.noreply.github.com>

* Slight fix in s3 checkpoint doc

Signed-off-by: Alexander Zhang <alxzhang@amazon.com>

---------

Signed-off-by: Alexander Zhang <alxzhang@amazon.com>
Signed-off-by: alxzhang-amazon <166076199+alxzhang-amazon@users.noreply.github.com>
Signed-off-by: alxzhang-amazon <alxzhang-amazon@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: alxzhang-amazon <alxzhang-amazon@users.noreply.github.com>
rohitrango pushed a commit to rohitrango/NeMo that referenced this pull request Jun 25, 2024
@ko3n1g ko3n1g mentioned this pull request Jul 18, 2024
2 tasks
3 participants