Binarized memmap dataloader for Megatron NMT, Inference and checkpoint -> nemo #4137

Merged: 112 commits merged into main from nmt_memmap_dataloader on Jun 1, 2022

@MaximumEntropy MaximumEntropy commented May 9, 2022

What does this PR do?

  1. Adds Megatron memory-mapped dataloaders for NMT.
  2. Adds an inference script and config, plus a translate() method.
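
To illustrate the idea behind a binarized memory-mapped dataset, here is a minimal sketch using only the Python standard library. It is not the dataset class added in this PR: the file layout (`.bin` for token ids, `.idx` for offsets) and the names `write_binarized` and `MemmapTokenDataset` are assumptions made for this example.

```python
import mmap
import struct

# Hypothetical on-disk layout for this sketch (not NeMo's actual format):
#   <prefix>.bin  all token ids, concatenated, as little-endian int32
#   <prefix>.idx  n + 1 sentence boundaries, as little-endian int64


def write_binarized(sentences, prefix):
    """Write token-id sequences to <prefix>.bin and boundaries to <prefix>.idx."""
    offsets = [0]
    with open(prefix + ".bin", "wb") as bin_file:
        for ids in sentences:
            bin_file.write(struct.pack(f"<{len(ids)}i", *ids))
            offsets.append(offsets[-1] + len(ids))
    with open(prefix + ".idx", "wb") as idx_file:
        idx_file.write(struct.pack(f"<{len(offsets)}q", *offsets))


class MemmapTokenDataset:
    """Random access to sentences without reading the whole .bin file into RAM."""

    def __init__(self, prefix):
        with open(prefix + ".idx", "rb") as idx_file:
            raw = idx_file.read()
        self._offsets = struct.unpack(f"<{len(raw) // 8}q", raw)
        self._file = open(prefix + ".bin", "rb")
        # Map the file read-only; the OS loads pages lazily on access.
        self._mm = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)

    def __len__(self):
        return len(self._offsets) - 1

    def __getitem__(self, i):
        if not 0 <= i < len(self):
            raise IndexError(i)
        start, end = self._offsets[i], self._offsets[i + 1]
        # Offsets count tokens; each int32 token occupies 4 bytes.
        return list(struct.unpack_from(f"<{end - start}i", self._mm, start * 4))

    def close(self):
        self._mm.close()
        self._file.close()
```

With a layout like this, fetching one sentence only touches the pages that hold it, which is the usual motivation for memory-mapping large training corpora.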

Collection: NLP

Changelog

  • Add a new dataset class for the Megatron memmap dataset.
  • Add an inference script with its associated yaml config.
  • Replace the use_tarred_dataset arg with a generic dataset_type arg that accepts one of: text, tarred, bin_memmap, text_memmap.

Usage

  • Set dataset_type: bin_memmap in the yaml config.
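
As a rough sketch of how a loader might read the dataset_type setting, the snippet below validates the value against the four options listed in the changelog. The function name `resolve_dataset_type`, the use of a plain dict in place of the yaml config, and the fallback to the older use_tarred_dataset flag are all assumptions for illustration, not behaviour confirmed by this PR.

```python
# Allowed values, as listed in the PR changelog.
VALID_DATASET_TYPES = ("text", "tarred", "bin_memmap", "text_memmap")


def resolve_dataset_type(data_cfg):
    """Return the validated dataset type from a config dict.

    `data_cfg` is a plain dict standing in for the yaml data section.
    The fallback to `use_tarred_dataset` is hypothetical.
    """
    if "dataset_type" in data_cfg:
        dataset_type = data_cfg["dataset_type"]
    elif data_cfg.get("use_tarred_dataset", False):
        dataset_type = "tarred"
    else:
        dataset_type = "text"
    if dataset_type not in VALID_DATASET_TYPES:
        raise ValueError(
            f"dataset_type must be one of {VALID_DATASET_TYPES}, got {dataset_type!r}"
        )
    return dataset_type
```

For example, `resolve_dataset_type({"dataset_type": "bin_memmap"})` returns `"bin_memmap"`, while an unknown value raises `ValueError` early instead of failing later during data loading.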

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (e.g. Numba, Pynini, Apex)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs in various areas.

Additional Information

  • Related to # (issue)

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

lgtm-com bot commented May 9, 2022

This pull request fixes 1 alert when merging 590f40e into 470587a - view on LGTM.com

fixed alerts:

  • 1 for Unused import

lgtm-com bot commented May 9, 2022

This pull request fixes 1 alert when merging 968938b into 470587a - view on LGTM.com

fixed alerts:

  • 1 for Unused import

MaximumEntropy and others added 3 commits May 31, 2022 10:51
aklife97 previously approved these changes May 31, 2022

lgtm-com bot commented May 31, 2022

This pull request introduces 5 alerts and fixes 2 when merging fd0b976 into e838862 - view on LGTM.com

new alerts:

  • 2 for Unused import
  • 2 for `__init__` method calls overridden method
  • 1 for Unused local variable

fixed alerts:

  • 1 for Unused import
  • 1 for `__init__` method calls overridden method

lgtm-com bot commented May 31, 2022

This pull request introduces 5 alerts and fixes 2 when merging 09884f3 into e838862 - view on LGTM.com

new alerts:

  • 2 for Unused import
  • 2 for `__init__` method calls overridden method
  • 1 for Unused local variable

fixed alerts:

  • 1 for Unused import
  • 1 for `__init__` method calls overridden method

aklife97 previously approved these changes May 31, 2022

lgtm-com bot commented May 31, 2022

This pull request introduces 5 alerts and fixes 2 when merging ff3df3d into f6936ce - view on LGTM.com

new alerts:

  • 2 for Unused import
  • 2 for `__init__` method calls overridden method
  • 1 for Unused local variable

fixed alerts:

  • 1 for Unused import
  • 1 for `__init__` method calls overridden method

lgtm-com bot commented May 31, 2022

This pull request introduces 5 alerts and fixes 2 when merging a6c9dda into f6936ce - view on LGTM.com

new alerts:

  • 2 for Unused import
  • 2 for `__init__` method calls overridden method
  • 1 for Unused local variable

fixed alerts:

  • 1 for Unused import
  • 1 for `__init__` method calls overridden method

lgtm-com bot commented Jun 1, 2022

This pull request introduces 5 alerts and fixes 2 when merging 2d3ba1a into 2af3786 - view on LGTM.com

new alerts:

  • 2 for Unused import
  • 2 for `__init__` method calls overridden method
  • 1 for Unused local variable

fixed alerts:

  • 1 for Unused import
  • 1 for `__init__` method calls overridden method

lgtm-com bot commented Jun 1, 2022

This pull request introduces 5 alerts and fixes 2 when merging 4214909 into f9d45db - view on LGTM.com

new alerts:

  • 2 for Unused import
  • 2 for `__init__` method calls overridden method
  • 1 for Unused local variable

fixed alerts:

  • 1 for Unused import
  • 1 for `__init__` method calls overridden method

lgtm-com bot commented Jun 1, 2022

This pull request introduces 5 alerts and fixes 2 when merging 37cda3b into ed35577 - view on LGTM.com

new alerts:

  • 2 for Unused import
  • 2 for `__init__` method calls overridden method
  • 1 for Unused local variable

fixed alerts:

  • 1 for Unused import
  • 1 for `__init__` method calls overridden method

lgtm-com bot commented Jun 1, 2022

This pull request introduces 5 alerts and fixes 2 when merging 86137df into 760e628 - view on LGTM.com

new alerts:

  • 2 for Unused import
  • 2 for `__init__` method calls overridden method
  • 1 for Unused local variable

fixed alerts:

  • 1 for Unused import
  • 1 for `__init__` method calls overridden method

@aklife97 aklife97 merged commit 1f4c744 into main Jun 1, 2022
@aklife97 aklife97 deleted the nmt_memmap_dataloader branch June 1, 2022 21:23
gkucsko pushed a commit to gkucsko/NeMo that referenced this pull request Jun 2, 2022
…t -> nemo (NVIDIA#4137)

* Temp
* Add megatron dataset
* Update config and fix global batch fetcher
* Add dataset class
* Update comments
* Style
* Update yaml
* Fix duplicate yaml key
* Translate method and preprocess script for raw text
* Style
* Remove pdb
* Fix arg name
* Fix other arg
* Change sampler back
* Move back to global batch fetcher to use distributed sampler
* Add text memmap data
* Update monitor
* Fixes for PP
* Remove unused import
* Truncate examples in text memmap
* NMT training batch interpolation key
* tarred data fix
* Change dataset type check
* Fix sampler
* Pass dataset cfg to determine type
* Log global step on validation step as well
* Fix NMT model saving with artifacts
* Initialize DDP in decode if not initialized. Needed for inference only mode
* Megatron NMT inference script
* Inference config file
* hardcode max delta temporarily
* Style
* detokenizer if processor is not none
* Sampler config
* Compat with configs without sampler arg
* Style
* Comment for validation dataset type
* Fix tokenizer building
* CI test for megatron nmt
* Fix tokenizer in restore
* Style
* O2 restore from fix
* Remove print
* Change tokenizer model name in config
* Logging
* Set seed for distributed sampler
* Cluster debugging messages
* Style
* Fix max generation delta
* No LM Init
* Use nlp save restore connector
* Remove useless infer args
* Typo
* UTF8 safe print of translation result
* Style
* Add save restore connector back with comment
* Refactor
* Fix CI test
* Add missing args
* Address comments
* Empty to restart
* Fix CI test
* Check for test ds
* set fusion to false

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

Co-authored-by: Micha Livne <michalivne@users.noreply.github.com>
Co-authored-by: Abhinav Khattar <aklife97@gmail.com>
Signed-off-by: Georg Kucsko <gkucsko@gmail.com>