Synthetic dataloading for Megatron T5 and GPT models #6181

vysarge · 2023-03-13T19:07:21Z

What does this PR do?

Enables testing the functionality and performance of the training code with synthetic versions of the existing datasets.

Collection: NLP (language_modeling)

Changelog

Added SyntheticGPTDataset, which produces output of the same size and datatypes as GPTDataset
Added SyntheticT5Dataset, which produces output of the same size and datatypes as T5Dataset
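
As a rough illustration of the changelog entries, a synthetic dataset only has to return samples with the same fields, lengths, and element types as the real one; the contents can be arbitrary. The sketch below is not the code from this PR. The class name `ToyMockGPTDataset`, the field names, and the use of plain Python lists in place of tensors are assumptions made for illustration.

```python
class ToyMockGPTDataset:
    """Hypothetical stand-in for a GPT-style dataset; not the PR's class."""

    def __init__(self, num_samples, seq_length, vocab_size):
        self.num_samples = num_samples
        self.seq_length = seq_length
        self.vocab_size = vocab_size

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        if not 0 <= idx < self.num_samples:
            raise IndexError(idx)
        # Arbitrary but valid token ids; one extra so labels can be shifted by one.
        ids = [(idx + pos) % self.vocab_size for pos in range(self.seq_length + 1)]
        return {
            "tokens": ids[:-1],                            # model inputs (assumed field name)
            "labels": ids[1:],                             # next-token targets (assumed field name)
            "loss_mask": [1.0] * self.seq_length,          # every position contributes to the loss
            "position_ids": list(range(self.seq_length)),
        }


dataset = ToyMockGPTDataset(num_samples=4, seq_length=8, vocab_size=100)
sample = dataset[0]
print({name: len(values) for name, values in sample.items()})
```

Because no files are opened, a dataset like this can feed a training loop when the goal is to measure throughput or exercise the code path rather than to learn anything.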

Usage

HYDRA_FULL_ERROR=1 python NeMo/examples/nlp/language_modeling/megatron_gpt_pretraining.py \
    model.data.data_impl=synthetic \
    model.data.data_prefix=[]

HYDRA_FULL_ERROR=1 python NeMo/examples/nlp/language_modeling/megatron_t5_pretraining.py \
    model.data.data_impl=synthetic \
    model.data.data_prefix=[]

yidong72

LGTM. Could you add a unit test case or a Jenkins test case?

yidong72

LGTM. It looks like the CI build is failing, but the failure is not related to this change. Once CI passes, we are good to go.

nemo/collections/nlp/data/language_modeling/megatron/dataset_utils.py

+            # Files from this location will not be read; mock data will be generated instead.
+            logging.warning(f"Requested data_impl={data_impl}, so ignoring data_prefix setting: {data_prefix}")
+        if dataset_type == DSET_TYPE_T5:
+            from nemo.collections.nlp.data.language_modeling.megatron.t5_dataset import MockT5Dataset
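
The hunk above suggests a dispatch step: when mock data is requested, `data_prefix` is ignored with a warning and a mock dataset class is selected by dataset type. A minimal sketch of that control flow follows. The helper name `choose_dataset_kind`, the `"mock"` value for `data_impl`, the returned strings, and the condition under which the warning fires are all assumptions for illustration, not NeMo's actual code.

```python
import logging

logger = logging.getLogger("mock_data_example")


def choose_dataset_kind(data_impl, data_prefix, dataset_type):
    """Hypothetical helper: report which kind of dataset would be built."""
    if data_impl == "mock":
        if data_prefix:
            # No files are read in this mode, so any configured prefix is ignored.
            logger.warning(
                "Requested data_impl=%s, so ignoring data_prefix setting: %s",
                data_impl,
                data_prefix,
            )
        return f"mock_{dataset_type}"
    if not data_prefix:
        raise ValueError("data_prefix is required unless data_impl is 'mock'")
    return dataset_type


print(choose_dataset_kind("mock", [], "t5"))
print(choose_dataset_kind("mmap", ["/path/to/data"], "gpt"))
```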


yidong72

LGTM

* Added dataset producing synthetic data in the same shape as GPTDataset
  Signed-off-by: Valerie Sarge <vsarge@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* Added dataset producing synthetic data in the same shape as T5Dataset
  Signed-off-by: Valerie Sarge <vsarge@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* Added some config checks, simplified SyntheticGPTDataset, and cleaned up code
  Signed-off-by: Valerie Sarge <vsarge@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* Rename synthetic -> mock
  Signed-off-by: Valerie Sarge <vsarge@nvidia.com>
* Added Jenkins test
  Signed-off-by: Valerie Sarge <vsarge@nvidia.com>
* Alter random generation seed based on currently requested idx
  Signed-off-by: Valerie Sarge <vsarge@nvidia.com>

---------

Signed-off-by: Valerie Sarge <vsarge@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Yi Dong <43824965+yidong72@users.noreply.github.com>
Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>

github-actions bot added the NLP label Mar 13, 2023

vysarge force-pushed the synthetic_dataloading branch 2 times, most recently from 1135923 to 2210856 Compare March 13, 2023 19:19

Added dataset producing synthetic data in the same shape as GPTDataset

28893ee

Signed-off-by: Valerie Sarge <vsarge@nvidia.com>

vysarge force-pushed the synthetic_dataloading branch from 2210856 to 28893ee Compare March 13, 2023 19:20

pre-commit-ci bot and others added 4 commits March 13, 2023 19:24

[pre-commit.ci] auto fixes from pre-commit.com hooks

fa686a3

for more information, see https://pre-commit.ci

Added dataset producing synthetic data in the same shape as T5Dataset

d2b813f

Signed-off-by: Valerie Sarge <vsarge@nvidia.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

fa07a69

for more information, see https://pre-commit.ci

Added some config checks, simplified SyntheticGPTDataset, and cleaned up code

a40b01a

Signed-off-by: Valerie Sarge <vsarge@nvidia.com>

vysarge force-pushed the synthetic_dataloading branch from f4695ee to a40b01a Compare March 17, 2023 01:20

[pre-commit.ci] auto fixes from pre-commit.com hooks

d9c2498

for more information, see https://pre-commit.ci

vysarge changed the title from "WIP: Synthetic dataloading for Megatron models" to "Synthetic dataloading for Megatron T5 and GPT models" Mar 17, 2023

vysarge marked this pull request as ready for review March 17, 2023 20:49

okuchaiev requested a review from yidong72 March 20, 2023 23:26

yidong72 reviewed Mar 21, 2023

vysarge added 2 commits March 21, 2023 12:47

Rename synthetic -> mock

587c2e0

Signed-off-by: Valerie Sarge <vsarge@nvidia.com>

Added Jenkins test

b0476c4

Signed-off-by: Valerie Sarge <vsarge@nvidia.com>

github-actions bot added the CI label Mar 21, 2023

yidong72 previously approved these changes Mar 21, 2023

Alter random generation seed based on currently requested idx

547ad17

Signed-off-by: Valerie Sarge <vsarge@nvidia.com>
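
One plausible reading of this commit is that each sample gets its own generator, seeded from a base seed plus the requested index, so the contents of sample `idx` no longer depend on the order in which samples are requested. The sketch below contrasts that with a single shared generator. The function names and the exact `seed + idx` formula are assumptions for illustration, and Python's standard `random` module stands in for whatever generator the PR uses.

```python
import random


def shared_generator_samples(order, seed=1234, vocab_size=100, seq_length=4):
    """One generator for all samples: contents depend on the order of requests."""
    rng = random.Random(seed)
    return {idx: [rng.randrange(vocab_size) for _ in range(seq_length)] for idx in order}


def per_index_samples(order, seed=1234, vocab_size=100, seq_length=4):
    """A fresh generator seeded with seed + idx: contents depend only on idx."""
    samples = {}
    for idx in order:
        rng = random.Random(seed + idx)
        samples[idx] = [rng.randrange(vocab_size) for _ in range(seq_length)]
    return samples


forward = [0, 1, 2]
backward = [2, 1, 0]

# Per-index seeding gives the same sample for a given idx in either order.
print(per_index_samples(forward) == per_index_samples(backward))

# With a shared generator, whichever idx is requested first gets the first draws,
# so the two orderings will usually disagree.
print(shared_generator_samples(forward) == shared_generator_samples(backward))
```

Order independence matters for a dataloader because shuffling and multiple workers change the request order between runs and between ranks.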

vysarge dismissed yidong72’s stale review via 547ad17 March 21, 2023 22:14

Merge branch 'main' into synthetic_dataloading

4904482

github-advanced-security bot found potential problems Mar 27, 2023

yidong72 approved these changes Mar 27, 2023

yidong72 merged commit ef6b8f0 into NVIDIA:main Mar 27, 2023

vysarge deleted the synthetic_dataloading branch March 12, 2024 22:36
