Synthetic dataloading for Megatron T5 and GPT models #6181

Merged: 10 commits, Mar 27, 2023


@vysarge vysarge commented Mar 13, 2023

What does this PR do?

Enables testing the functionality and performance of training code with synthetic versions of existing datasets, so no real data files are needed.

Collection: NLP (language_modeling)

Changelog

  • Added SyntheticGPTDataset, which produces output with the same sizes and data types as GPTDataset
  • Added SyntheticT5Dataset, which produces output with the same sizes and data types as T5Dataset
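
To illustrate what "same sizes and data types" could mean for a GPT-style sample, here is a minimal sketch, not the code from this PR. The class name `ToyMockGPTDataset`, the field names (`tokens`, `labels`, `loss_mask`, `position_ids`), and the use of plain Python lists instead of tensors are all assumptions made for illustration.

```python
import random


class ToyMockGPTDataset:
    """Hypothetical stand-in: random token ids shaped like one GPT training sample."""

    def __init__(self, num_samples, seq_length, vocab_size, seed=1234):
        self.num_samples = num_samples
        self.seq_length = seq_length
        self.vocab_size = vocab_size
        self.rng = random.Random(seed)

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        if not 0 <= idx < self.num_samples:
            raise IndexError(idx)
        # Draw seq_length + 1 ids so inputs and next-token labels overlap by one position.
        ids = [self.rng.randrange(self.vocab_size) for _ in range(self.seq_length + 1)]
        return {
            "tokens": ids[:-1],                            # model inputs
            "labels": ids[1:],                             # inputs shifted left by one
            "loss_mask": [1.0] * self.seq_length,          # every position contributes to the loss
            "position_ids": list(range(self.seq_length)),  # 0 .. seq_length - 1
        }
```

The token values are arbitrary; the point is that every field has the length and element type a training loop would expect, so the loop can run without reading any files.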

Usage

HYDRA_FULL_ERROR=1 python NeMo/examples/nlp/language_modeling/megatron_gpt_pretraining.py \
    model.data.data_impl=synthetic \
    model.data.data_prefix=[]

HYDRA_FULL_ERROR=1 python NeMo/examples/nlp/language_modeling/megatron_t5_pretraining.py \
    model.data.data_impl=synthetic \
    model.data.data_prefix=[]

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed the Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs in various areas.

Additional Information

  • Related to # (issue)

@github-actions github-actions bot added the NLP label Mar 13, 2023
@vysarge vysarge force-pushed the synthetic_dataloading branch 2 times, most recently from 1135923 to 2210856 Compare March 13, 2023 19:19
Signed-off-by: Valerie Sarge <vsarge@nvidia.com>
@vysarge vysarge changed the title WIP: Synthetic dataloading for Megatron models Synthetic dataloading for Megatron T5 and GPT models Mar 17, 2023
@vysarge vysarge marked this pull request as ready for review March 17, 2023 20:49
@okuchaiev okuchaiev requested a review from yidong72 March 20, 2023 23:26

@yidong72 yidong72 left a comment

LGTM. Could you add a unit test case or a Jenkins test case?

Signed-off-by: Valerie Sarge <vsarge@nvidia.com>
Signed-off-by: Valerie Sarge <vsarge@nvidia.com>
@github-actions github-actions bot added the CI label Mar 21, 2023
yidong72 previously approved these changes Mar 21, 2023
@yidong72 yidong72 left a comment

LGTM. It looks like the CI build is failing, but the failure is not related to this change. Once CI passes, we are good to go.

Signed-off-by: Valerie Sarge <vsarge@nvidia.com>
        # Files from this location will not be read; mock data will be generated instead.
        logging.warning(f"Requested data_impl={data_impl}, so ignoring data_prefix setting: {data_prefix}")
        if dataset_type == DSET_TYPE_T5:
            from nemo.collections.nlp.data.language_modeling.megatron.t5_dataset import MockT5Dataset

Check notice

Code scanning / CodeQL

Cyclic import

Import of module nemo.collections.nlp.data.language_modeling.megatron.t5_dataset begins an import cycle.
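
The behaviour in the flagged excerpt (warn, ignore `data_prefix`, generate data instead) can be sketched roughly as follows. This is an illustration under stated assumptions, not the NeMo implementation: `resolve_data_source` is a hypothetical helper, the string `"mock"` is assumed from the later "Rename synthetic -> mock" commit, and the returned dictionary is invented for the example.

```python
import logging


def resolve_data_source(data_impl, data_prefix):
    """Hypothetical helper: decide whether to read files or generate mock data."""
    if data_impl == "mock":  # assumed value after the synthetic -> mock rename
        if data_prefix:
            # Files from this location will not be read, so tell the user.
            logging.warning(
                "Requested data_impl=%s, so ignoring data_prefix setting: %s",
                data_impl,
                data_prefix,
            )
        return {"source": "generated", "paths": []}
    if not data_prefix:
        # A file-backed implementation cannot work without at least one path.
        raise ValueError(f"data_impl={data_impl} requires a non-empty data_prefix")
    return {"source": "files", "paths": list(data_prefix)}
```

Unlike the excerpt, this sketch only warns when a non-empty `data_prefix` was actually supplied, a design choice that keeps the log quiet for the `model.data.data_prefix=[]` usage shown in the description.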

@yidong72 yidong72 left a comment

LGTM

@yidong72 yidong72 merged commit ef6b8f0 into NVIDIA:main Mar 27, 2023
hsiehjackson pushed a commit to hsiehjackson/NeMo that referenced this pull request Jun 2, 2023
* Added dataset producing synthetic data in the same shape as GPTDataset

Signed-off-by: Valerie Sarge <vsarge@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Added dataset producing synthetic data in the same shape as T5Dataset

Signed-off-by: Valerie Sarge <vsarge@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Added some config checks, simplified SyntheticGPTDataset, and cleaned up code

Signed-off-by: Valerie Sarge <vsarge@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Rename synthetic -> mock

Signed-off-by: Valerie Sarge <vsarge@nvidia.com>

* Added Jenkins test

Signed-off-by: Valerie Sarge <vsarge@nvidia.com>

* Alter random generation seed based on currently requested idx

Signed-off-by: Valerie Sarge <vsarge@nvidia.com>

---------

Signed-off-by: Valerie Sarge <vsarge@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Yi Dong <43824965+yidong72@users.noreply.github.com>
Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>
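
The commit "Alter random generation seed based on currently requested idx" in the message above suggests that each sample is seeded from its index. A minimal sketch of that idea, assuming a hypothetical `mock_sample` function and a `base_seed + idx` scheme (the actual seeding formula is not shown in this PR page):

```python
import random


def mock_sample(idx, seq_length, vocab_size, base_seed=1234):
    """Hypothetical generator: the seed depends on idx, so each index has its own fixed sample."""
    rng = random.Random(base_seed + idx)  # assumed scheme; a fresh generator per request
    return [rng.randrange(vocab_size) for _ in range(seq_length)]


# Requesting the same index twice gives identical data, regardless of call order.
first = mock_sample(3, seq_length=16, vocab_size=50000)
again = mock_sample(3, seq_length=16, vocab_size=50000)
other = mock_sample(4, seq_length=16, vocab_size=50000)
```

Seeding per index keeps samples reproducible under shuffling or multiple dataloader workers, while still letting different indices yield different data.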
@vysarge vysarge deleted the synthetic_dataloading branch March 12, 2024 22:36