Synthetic dataloading for Megatron T5 and GPT models #6181
Signed-off-by: Valerie Sarge <vsarge@nvidia.com>
LGTM. Could you add a unit test case or a Jenkins test case?
LGTM. It looks like the CI build is failing, but the failure is not related to this change. Once CI passes, we are good to go.
# Files from this location will not be read; mock data will be generated instead.
logging.warning(f"Requested data_impl={data_impl}, so ignoring data_prefix setting: {data_prefix}")
if dataset_type == DSET_TYPE_T5:
    from nemo.collections.nlp.data.language_modeling.megatron.t5_dataset import MockT5Dataset
Check notice (Code scanning / CodeQL): Cyclic import
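The dispatch in the snippet above can be illustrated with a small standalone sketch. It assumes that a `data_impl` value of `"mock"` selects the mock path, which is inferred from the PR title and the warning text rather than shown in the excerpt. `build_dataset`, `PlaceholderMockDataset`, `DSET_TYPE_GPT`, and the string values of the type constants are hypothetical stand-ins, not NeMo's actual API.

```python
import logging

logger = logging.getLogger("mock_data_sketch")

# Assumed values; only the name DSET_TYPE_T5 appears in the PR excerpt.
DSET_TYPE_T5 = "t5"
DSET_TYPE_GPT = "gpt"  # hypothetical constant for illustration


class PlaceholderMockDataset:
    """Hypothetical stand-in for a mock dataset; stores only its type and size."""

    def __init__(self, kind, num_samples):
        self.kind = kind
        self.num_samples = num_samples

    def __len__(self):
        return self.num_samples


def build_dataset(data_impl, data_prefix, dataset_type, num_samples):
    """Return a mock dataset when data_impl == "mock", ignoring data_prefix."""
    if data_impl == "mock":
        if data_prefix:
            # No files are read in this mode, so warn that the path is unused.
            logger.warning(
                "Requested data_impl=%s, so ignoring data_prefix setting: %s",
                data_impl,
                data_prefix,
            )
        if dataset_type in (DSET_TYPE_T5, DSET_TYPE_GPT):
            return PlaceholderMockDataset(dataset_type, num_samples)
        raise NotImplementedError(f"No mock dataset for type {dataset_type!r}")
    # Real file-backed implementations are outside the scope of this sketch.
    raise NotImplementedError("Only the mock path is sketched here")
```

In the PR's diff the mock dataset class is imported inside the `if` branch instead of at module level. Deferring an import like this is a common way to sidestep import cycles at load time, although static analysis such as CodeQL can still report the cycle.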
LGTM
* Added dataset producing synthetic data in the same shape as GPTDataset
  Signed-off-by: Valerie Sarge <vsarge@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* Added dataset producing synthetic data in the same shape as T5Dataset
  Signed-off-by: Valerie Sarge <vsarge@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* Added some config checks, simplified SyntheticGPTDataset, and cleaned up code
  Signed-off-by: Valerie Sarge <vsarge@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* Rename synthetic -> mock
  Signed-off-by: Valerie Sarge <vsarge@nvidia.com>
* Added Jenkins test
  Signed-off-by: Valerie Sarge <vsarge@nvidia.com>
* Alter random generation seed based on currently requested idx
  Signed-off-by: Valerie Sarge <vsarge@nvidia.com>
---------
Signed-off-by: Valerie Sarge <vsarge@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Yi Dong <43824965+yidong72@users.noreply.github.com>
Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>
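As a rough illustration of the last commit, "Alter random generation seed based on currently requested idx", the following framework-free sketch derives a per-sample random generator from a base seed plus the requested index. The class name `ToyMockGPTDataset`, the field names (`tokens`, `labels`, `loss_mask`), and the default seed are assumptions made for this example. The PR's actual mock datasets may return different fields and use a different generator.

```python
import random


class ToyMockGPTDataset:
    """Hypothetical map-style dataset that fabricates token ids on demand."""

    def __init__(self, vocab_size, seq_length, num_samples, seed=1234):
        self.vocab_size = vocab_size
        self.seq_length = seq_length
        self.num_samples = num_samples
        self.seed = seed  # base seed; the default value is arbitrary

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        if not 0 <= idx < self.num_samples:
            raise IndexError(idx)
        # Seed from (base seed + idx) so each index has its own reproducible sample.
        rng = random.Random(self.seed + idx)
        # Draw one extra token so inputs and shifted labels both have seq_length entries.
        stream = [rng.randrange(self.vocab_size) for _ in range(self.seq_length + 1)]
        return {
            "tokens": stream[:-1],
            "labels": stream[1:],
            "loss_mask": [1.0] * self.seq_length,
        }
```

Because the generator depends only on the base seed and the index, fetching the same index twice yields the same sample regardless of access order, which keeps runs reproducible when samples are shuffled or loaded by several workers.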
What does this PR do?
Enables testing the functionality and performance of training code with synthetic (mock) versions of the existing Megatron GPT and T5 datasets, without reading data files from disk.
Collection: NLP (language_modeling)
Changelog
Usage
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs in various areas.
Additional Information