Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LM adapted T5 dataset #3654

Merged
merged 23 commits into from
Feb 17, 2022
Merged

LM adapted T5 dataset #3654

merged 23 commits into from
Feb 17, 2022

Conversation

MaximumEntropy
Copy link
Contributor

@MaximumEntropy MaximumEntropy commented Feb 11, 2022

Signed-off-by: MaximumEntropy sandeep.subramanian.1@umontreal.ca

What does this PR do ?

Adds the ability to use the prefix-LM objective for T5 which reuses the GPT dataset class that already exists.

Collection: NLP

Changelog

  • Adds at T5LMAdaptedDataset class that inherits from GPTDataset that just splits a text sequence into encoder and decoder components
  • Re-uses the existing build_train_valid_test_datasets functions that exist for BERT and T5 for this class.
  • Moves some dataset utils to a new file to prevent circular imports between gpt_dataset.py and dataset_utils.py.

Usage

Add the dataset_type flag to either t5 or t5_prefix_lm in megatron_t5_config.yaml.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

  • Related to # (issue)

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>
Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>
@lgtm-com
Copy link

lgtm-com bot commented Feb 11, 2022

This pull request introduces 2 alerts when merging b255e58 into 058fa38 - view on LGTM.com

new alerts:

  • 1 for Unused import
  • 1 for Unreachable code

@lgtm-com
Copy link

lgtm-com bot commented Feb 14, 2022

This pull request introduces 2 alerts when merging 99fabbb into 461a866 - view on LGTM.com

new alerts:

  • 1 for Unused import
  • 1 for Unreachable code

Copy link
Collaborator

@michalivne michalivne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very useful PR! Looks good. See minor comments below.

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>
Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>
Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>
@lgtm-com
Copy link

lgtm-com bot commented Feb 14, 2022

This pull request introduces 2 alerts when merging 377342c into 6a517f0 - view on LGTM.com

new alerts:

  • 1 for Unused import
  • 1 for Unreachable code

@lgtm-com
Copy link

lgtm-com bot commented Feb 14, 2022

This pull request introduces 2 alerts when merging a826ac7 into 6a517f0 - view on LGTM.com

new alerts:

  • 1 for Unused import
  • 1 for Unreachable code

michalivne
michalivne previously approved these changes Feb 14, 2022
Copy link
Collaborator

@michalivne michalivne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all the updates. LGTM!

@lgtm-com
Copy link

lgtm-com bot commented Feb 14, 2022

This pull request introduces 2 alerts when merging 0f1579d into fbbfb08 - view on LGTM.com

new alerts:

  • 1 for Unused import
  • 1 for Unreachable code

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>
@lgtm-com
Copy link

lgtm-com bot commented Feb 14, 2022

This pull request introduces 2 alerts when merging 9d4c3aa into aeeb0d2 - view on LGTM.com

new alerts:

  • 1 for Unused import
  • 1 for Unreachable code

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>
@lgtm-com
Copy link

lgtm-com bot commented Feb 15, 2022

This pull request introduces 2 alerts when merging fd8c672 into 277b088 - view on LGTM.com

new alerts:

  • 1 for Unused import
  • 1 for Unreachable code

@lgtm-com
Copy link

lgtm-com bot commented Feb 15, 2022

This pull request introduces 2 alerts when merging 69acc37 into b466ebc - view on LGTM.com

new alerts:

  • 1 for Unused import
  • 1 for Unreachable code

@lgtm-com
Copy link

lgtm-com bot commented Feb 16, 2022

This pull request introduces 2 alerts when merging 7d1626f into b5012d0 - view on LGTM.com

new alerts:

  • 1 for Unused import
  • 1 for Unreachable code

@lgtm-com
Copy link

lgtm-com bot commented Feb 16, 2022

This pull request introduces 2 alerts when merging 5da9724 into 7231aca - view on LGTM.com

new alerts:

  • 1 for Unused import
  • 1 for Unreachable code

@@ -106,6 +106,7 @@ model:
dataloader_type: single # cyclic
masked_lm_prob: 0.15
short_seq_prob: 0.1
dataset_type: 't5'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe add possible types here in a comment like # t5, ...

ericharper
ericharper previously approved these changes Feb 16, 2022
Copy link
Collaborator

@ericharper ericharper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks!

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>
@lgtm-com
Copy link

lgtm-com bot commented Feb 16, 2022

This pull request introduces 2 alerts when merging acf4909 into 2ebca22 - view on LGTM.com

new alerts:

  • 1 for Unused import
  • 1 for Unreachable code

@lgtm-com
Copy link

lgtm-com bot commented Feb 16, 2022

This pull request introduces 2 alerts when merging 77ae69f into 8ffc92e - view on LGTM.com

new alerts:

  • 1 for Unused import
  • 1 for Unreachable code

@lgtm-com
Copy link

lgtm-com bot commented Feb 16, 2022

This pull request introduces 3 alerts when merging ed3be60 into 37fe5b4 - view on LGTM.com

new alerts:

  • 1 for Unused local variable
  • 1 for Unused import
  • 1 for Unreachable code

@lgtm-com
Copy link

lgtm-com bot commented Feb 17, 2022

This pull request introduces 3 alerts when merging 8d026b1 into a8f29af - view on LGTM.com

new alerts:

  • 1 for Unused local variable
  • 1 for Unused import
  • 1 for Unreachable code

Copy link
Collaborator

@ericharper ericharper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@MaximumEntropy MaximumEntropy merged commit 49183b7 into main Feb 17, 2022
@MaximumEntropy MaximumEntropy deleted the t5_lm_adaptation branch February 17, 2022 00:32
fayejf pushed a commit that referenced this pull request Mar 2, 2022
* LM adapted T5 dataset

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Style fixes

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* File renaming

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* change assert to raising valueerror

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Style fixes

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Printing changes

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Style fixes

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Comment out ICT dataset

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

Co-authored-by: Micha Livne <michalivne@users.noreply.github.com>
Co-authored-by: Eric Harper <complex451@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants