Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Core] Support pre-extracted nemo checkpoint for restoration #4061

Merged
merged 5 commits into from
Apr 28, 2022

Conversation

titu1994
Copy link
Collaborator

Signed-off-by: smajumdar titu1994@gmail.com

What does this PR do ?

Adds support to SaveRestoreConnector to employ a pre-extracted nemo directory path as a restoration directory.
This is required for extremely large language model restoration, which cannot be extracted partially during multinode inference.

Collection: [Core, Megatron]

Changelog

  • Add specific line by line info of high level changes in this PR.

Note

  • During restore_from, the path of the nemo file is ignored in its entirety since the actual pre-extracted directory is passed.
  • The filepath passed to restore_from in such a case is not preserved, it is overwritten by the value of connector.model_extracted_dir inside AppState.

Usage

# Extract a nemo file apriori to restoration
# Here we will use the SaveRestoreConnector's method to extract the checkpoint to a given directory 

connector = MySaveRestoreConnector()
MySaveRestoreConnector._unpack_nemo_file(nemo_filepath, extracted_directory)

# Then update SaveRestoreConnector with path of the extracted directory
connector.model_extracted_dir = extracted_directory

# Call restore as always
# note, we pass in the "old" nemo_filepath, stored somewhere other than the extracted directory
restored_model = ModelPT.restore_from(nemo_filepath, save_restore_connector=connector)

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

Who can review?

@okuchaiev @ericharper

Signed-off-by: smajumdar <titu1994@gmail.com>
@okuchaiev
Copy link
Member

/blossom-ci

2 similar comments
@okuchaiev
Copy link
Member

/blossom-ci

@okuchaiev
Copy link
Member

/blossom-ci

if save_restore_connector.model_extracted_dir is None:
restore_path = os.path.abspath(os.path.expanduser(restore_path))
else:
restore_path = os.path.abspath(os.path.expanduser(save_restore_connector.model_extracted_dir))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if the user provided absolute path outside of their homedir?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

expanduser() only expands ~ in a user provided path, so if absolute path is given then as long as theres no ~ in it, it will be correct expansion.

@@ -523,3 +541,67 @@ def test_mock_model_model_collision(self):
with pytest.raises(ValueError, match="Creating model config node is forbidden"):
model = MockModel(cfg=cfg.model, trainer=None) # type: MockModel
model = model.to('cpu')

@pytest.mark.unit
def test_restore_from_save_restore_connector_extracted_dir(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for adding unittests!

@titu1994 titu1994 merged commit c9d78c0 into NVIDIA:main Apr 28, 2022
@titu1994 titu1994 deleted the extracted_nemofile branch April 28, 2022 00:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants