Text memmap dataset #4068
Signed-off-by: Micha Livne <mlivne@nvidia.com>
This pull request fixes 1 alert when merging 61daef3 into d97e0d3 (LGTM.com).
This pull request introduces 1 alert and fixes 1 when merging 77cbd9b into d97e0d3 (LGTM.com).
This pull request introduces 1 alert and fixes 1 when merging 061beeb into da1b56c (LGTM.com).
This pull request introduces 4 alerts and fixes 1 when merging 535a2f9 into 0d052c8 (LGTM.com).
… dataset-memmap-text
This pull request introduces 7 alerts and fixes 1 when merging 0b09a96 into 0d052c8 (LGTM.com).
This pull request introduces 7 alerts and fixes 1 when merging 0f3810c into 70d9687 (LGTM.com).
This pull request introduces 7 alerts and fixes 1 when merging 865d8c8 into 70d9687 (LGTM.com).
This pull request introduces 7 alerts and fixes 1 when merging fbc977f into 70d9687 (LGTM.com).
This pull request introduces 2 alerts when merging fe85c7b into 1d64497 (LGTM.com).
This pull request introduces 1 alert when merging a014a2f into 1d64497 (LGTM.com).
Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>
This pull request introduces 1 alert when merging f0b0d0e into 3816254 (LGTM.com).
This pull request introduces 3 alerts when merging f4cba69 into 3816254 (LGTM.com).
This pull request introduces 2 alerts when merging f207758 into 3816254 (LGTM.com).
LGTM! Great PR!
This pull request introduces 1 alert when merging c5b4c61 into 3816254 (LGTM.com).
* Initial import.
* Renamed file.
* Removed MegatronDataset.
* Added CSVMemMapDataset.
* Added support in tokenizer.
* Debugging.
* Added index building script.
* Debugging (multiple commits).
* Added support for glob when building indices.
* Debugging (multiple commits).
* Added format version to .idx files.
* 1. Added sanity checks. 2. Improved speed.
* Debugging (multiple commits).
* Fixed style.
* Consolidated MemmapSequenceToSequenceDataset and SequenceToSequenceDataset.
* Debugging.
* Fixed style.
* Debugging.
* Fixed usage of "SequenceToSequenceDataset".
* Debugging (multiple commits).
* Style fix.
* Added control over index suffix files.
* Debugging (multiple commits).
* Fixed style.
* Debugging.
Signed-off-by: Micha Livne <mlivne@nvidia.com>
Signed-off-by: Micha Livne <mlivne@cs.toronto.edu>
Co-authored-by: Micha Livne <mlivne@nvidia.com>
Co-authored-by: Oleksii Kuchaiev <okuchaiev@users.noreply.github.com>
Signed-off-by: Yi Dong <yidong@nvidia.com>
Signed-off-by: Micha Livne mlivne@nvidia.com
What does this PR do ?
Includes a mechanism to retire older index files by updating the internal .idx format version.
Indexing speed of 1443990774 samples in 147 files using 6 workers
Loading speed
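The index-version mechanism mentioned above can be illustrated with a minimal sketch. It is not the code from this PR: the version string, the pickle-based file layout, and the names `INDEX_VERSION`, `build_offsets`, and `load_or_rebuild_index` are all assumptions made for illustration. The idea is only that a cached index written under an older version is ignored and rebuilt.

```python
import pickle
from pathlib import Path

# Hypothetical format version; bump it whenever the index layout changes.
INDEX_VERSION = "0.2"


def build_offsets(data_path):
    """Return the byte offset at which each line of the file ends."""
    offsets = []
    pos = 0
    with open(data_path, "rb") as f:
        for line in f:
            pos += len(line)
            offsets.append(pos)
    return offsets


def load_or_rebuild_index(data_path, index_suffix=".idx"):
    """Load a cached index; rebuild it if missing or written by another version."""
    index_path = Path(str(data_path) + index_suffix)
    if index_path.exists():
        with open(index_path, "rb") as f:
            cached = pickle.load(f)
        if isinstance(cached, dict) and cached.get("version") == INDEX_VERSION:
            return cached["offsets"]
        # Version mismatch: the stale file is retired and overwritten below.
    offsets = build_offsets(data_path)
    with open(index_path, "wb") as f:
        pickle.dump({"version": INDEX_VERSION, "offsets": offsets}, f)
    return offsets
```

With this layout, changing `INDEX_VERSION` is enough to invalidate every previously cached index without deleting files by hand.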
Changelog
TextMemMapDataset
CSVMemMapDataset
MegatronDataset
nemo/collections/nlp/data/language_modeling/text_memmap_dataset.py
nemo/collections/nlp/data/machine_translation/sequence_to_sequence_dataset.py
scripts/nlp_language_modeling/build_index_memmap_data.py to preprocess indices (otherwise indexing happens on the fly at first run)
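As a rough illustration of what a CSV variant of a line-based dataset might do with each raw line, the sketch below parses one record and returns a single column. This is an assumption about the general idea, not the behaviour of `CSVMemMapDataset`; the function name `extract_field` and its parameters are hypothetical.

```python
import csv
import io


def extract_field(raw_line, col=0, delimiter=","):
    """Decode one raw line and return the requested CSV column ("" for an empty line)."""
    text = raw_line.decode("utf-8").rstrip("\r\n")
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    return rows[0][col] if rows else ""
```

Using the `csv` module rather than a plain `split` keeps quoted fields that contain the delimiter intact, at some cost in speed.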
Usage
Example for caching index files:
NeMo/scripts/nlp_language_modeling/build_index_memmap_data.py *.txt
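A pre-build pass over a glob pattern, like the command above, could look roughly as follows. This sketch does not reproduce the script's actual arguments or file format: `build_index_file`, `build_all`, the `.idx` suffix default, and the pickle layout are assumptions, and a thread pool is used here for simplicity where the real script may use processes.

```python
import glob
import pickle
from concurrent.futures import ThreadPoolExecutor


def build_index_file(data_path, index_suffix=".idx"):
    """Write the line-end offsets of one file next to it; return the line count."""
    offsets = []
    pos = 0
    with open(data_path, "rb") as f:
        for line in f:
            pos += len(line)
            offsets.append(pos)
    with open(data_path + index_suffix, "wb") as f:
        pickle.dump(offsets, f)
    return len(offsets)


def build_all(pattern, workers=6):
    """Index every file matching the glob pattern; return {path: line_count}."""
    paths = sorted(glob.glob(pattern))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        counts = list(pool.map(build_index_file, paths))
    return dict(zip(paths, counts))
```

Calling `build_all("*.txt")` would then leave one index file beside each matching text file, so later runs can skip the indexing step.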
Index files will be created when instantiating a memory mapped dataset if missing.
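The fallback described above, where a missing index is created when the dataset is instantiated, can be sketched with the standard `mmap` module. The class name `LazyTextLines`, the pickle index layout, and the `.idx` suffix default are hypothetical and stand in for whatever the PR's dataset classes actually do; the sketch also assumes a non-empty UTF-8 text file.

```python
import mmap
import os
import pickle


class LazyTextLines:
    """Hypothetical memory-mapped line dataset that builds its index on first use."""

    def __init__(self, data_path, index_suffix=".idx"):
        index_path = data_path + index_suffix
        if not os.path.exists(index_path):
            # First run only: scan the file once and cache the line-end offsets.
            self._write_index(data_path, index_path)
        with open(index_path, "rb") as f:
            self._ends = pickle.load(f)
        self._file = open(data_path, "rb")
        # mmap raises ValueError on an empty file, hence the non-empty assumption.
        self._mm = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)

    @staticmethod
    def _write_index(data_path, index_path):
        ends, pos = [], 0
        with open(data_path, "rb") as f:
            for line in f:
                pos += len(line)
                ends.append(pos)
        with open(index_path, "wb") as f:
            pickle.dump(ends, f)

    def __len__(self):
        return len(self._ends)

    def __getitem__(self, i):
        if not 0 <= i < len(self._ends):
            raise IndexError(i)
        start = self._ends[i - 1] if i > 0 else 0
        return self._mm[start:self._ends[i]].decode("utf-8").rstrip("\n")

    def close(self):
        self._mm.close()
        self._file.close()
```

Only the offsets are held in memory; each sample is sliced out of the mapped file on demand, which is what makes the approach practical for very large corpora.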