add support for new mcore ds features #9388

Merged: 17 commits into main from dpykhtar/mcore_ds_features, Jun 11, 2024

dimapihtar
Collaborator

@dimapihtar dimapihtar commented Jun 5, 2024

What does this PR do?

Adds support in NeMo for new mcore (Megatron-Core) dataset features: the validation_drop_last and add_extra_token parameters.

Collection: [Note which collection this PR will affect]

Changelog

  • Add specific line by line info of high level changes in this PR.

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this 
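The PR description leaves the usage snippet unfilled. As a rough sketch only, the mapping below shows how the NeMo-side option names mentioned in this PR's commits (`validation_drop_last`, `no_seqlen_plus_one_input_tokens`) could translate to the mcore-side names from the review thread (`drop_last_partial_validation_sequence`, `add_extra_token_to_sequence`). The helper `mcore_dataset_flags` is hypothetical and is not NeMo's actual code; the defaults are assumed from the discussion, which says both mcore values are True by default.

```python
from typing import Any, Dict, Mapping


def mcore_dataset_flags(data_cfg: Mapping[str, Any]) -> Dict[str, bool]:
    """Translate NeMo-style data options into mcore-style flags (hypothetical helper)."""
    # Assumed default: drop the last partial validation sequence.
    validation_drop_last = data_cfg.get("validation_drop_last", True)
    # Assumed default: keep the extra (seq_length + 1) token unless the user opts out.
    no_plus_one = data_cfg.get("no_seqlen_plus_one_input_tokens", False)
    return {
        "drop_last_partial_validation_sequence": bool(validation_drop_last),
        # The NeMo option is phrased negatively, so it is inverted here.
        "add_extra_token_to_sequence": not no_plus_one,
    }


# With no overrides, both flags come out True.
default_flags = mcore_dataset_flags({})

# Evaluating on the entire validation set means keeping the last partial sequence.
full_eval_flags = mcore_dataset_flags({"validation_drop_last": False})
```

Keeping the inversion in one place avoids the double negative leaking into the rest of the data pipeline.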

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

  • Related to # (issue)

@github-actions github-actions bot added the NLP label Jun 5, 2024
@dimapihtar dimapihtar changed the title from "add validation_drop_last and add_extra_token params support for mcore ds" to "add support for mcore ds new features" Jun 5, 2024
dimapihtar and others added 2 commits June 6, 2024 05:10
Signed-off-by: dimapihtar <dpihtar@gmail.com>
Signed-off-by: dimapihtar <dimapihtar@users.noreply.github.com>
@dimapihtar dimapihtar marked this pull request as ready for review June 6, 2024 13:27
Signed-off-by: dimapihtar <dpihtar@gmail.com>
Signed-off-by: dimapihtar <dpihtar@gmail.com>
Signed-off-by: dimapihtar <dpihtar@gmail.com>
@dimapihtar dimapihtar changed the title from "add support for mcore ds new features" to "add support for new mcore ds features" Jun 6, 2024
@ShriyaPalsamudram
Collaborator

@dimapihtar - can you add drop_last_partial_validation_sequence = True and add_extra_token_to_sequence = True to a GitHub Actions test?

@dimapihtar
Collaborator Author

> @dimapihtar - can you add drop_last_partial_validation_sequence = True and add_extra_token_to_sequence = True to a github actions test?

@ShriyaPalsamudram both values already default to True, so I think all the existing tests run with them enabled.
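For readers unfamiliar with the two flags discussed above, here is a minimal illustration of what they plausibly control when a flat token stream is cut into fixed-length samples. This is a sketch under assumptions, not mcore's implementation: the function `split_into_sequences` and the `pad_id` value are hypothetical, and the padding step only mirrors the "pad samples with dummy tokens only" commit title.

```python
import math
from typing import List


def split_into_sequences(
    tokens: List[int],
    seq_length: int,
    add_extra_token: bool = True,
    drop_last_partial: bool = True,
    pad_id: int = -1,  # hypothetical dummy token id
) -> List[List[int]]:
    """Cut a flat token stream into equal-length samples (illustrative only)."""
    # With the extra token, each sample holds seq_length + 1 tokens so that
    # inputs and shifted labels can both be sliced from it.
    extra = 1 if add_extra_token else 0
    usable = len(tokens) - extra
    if usable <= 0:
        return []
    if drop_last_partial:
        num_samples = usable // seq_length
    else:
        num_samples = math.ceil(usable / seq_length)
    samples = []
    for i in range(num_samples):
        start = i * seq_length
        chunk = tokens[start : start + seq_length + extra]
        # Pad a short final chunk with dummy tokens so all samples match in length.
        chunk = chunk + [pad_id] * (seq_length + extra - len(chunk))
        samples.append(chunk)
    return samples


stream = list(range(10))
dropped = split_into_sequences(stream, seq_length=4)
kept = split_into_sequences(stream, seq_length=4, drop_last_partial=False)
```

Here `dropped` holds two full samples, while `kept` adds a third sample whose tail is padding, which is why downstream loss code has to mask the padded positions.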

@github-actions github-actions bot added the CI label Jun 10, 2024
dimapihtar and others added 3 commits June 10, 2024 09:48
Signed-off-by: dimapihtar <dpihtar@gmail.com>
Signed-off-by: dimapihtar <dpihtar@gmail.com>
@jkamalu

jkamalu commented Jun 10, 2024

One necessary consistency test will be to compare validation results between the legacy and mcore code paths, with all options set so that evaluation covers the entire validation dataset.

You'll have to make sure that the loss and perplexity (ppl) computation can handle partial batches and partial sequences. NeMo already does this, at least for the legacy code path, but you'll want to make sure there aren't any breaking changes.
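One way to make the loss and perplexity robust to partial batches and partial sequences is to accumulate the summed loss and the count of unmasked tokens across all batches, then divide once at the end. The sketch below uses plain Python lists rather than tensors so it runs anywhere; the function name and the batch layout are assumptions for illustration and do not reflect NeMo's actual validation code.

```python
import math
from typing import Iterable, List, Tuple

# One batch: (per-token losses, loss mask), both shaped [sequences][tokens].
Batch = Tuple[List[List[float]], List[List[int]]]


def validation_loss_and_ppl(batches: Iterable[Batch]) -> Tuple[float, float]:
    """Token-weighted mean loss and perplexity over batches of any size."""
    total_loss = 0.0
    total_tokens = 0
    for losses, mask in batches:
        for row_losses, row_mask in zip(losses, mask):
            for loss, keep in zip(row_losses, row_mask):
                if keep:  # skip padded positions of a partial sequence
                    total_loss += loss
                    total_tokens += 1
    if total_tokens == 0:
        raise ValueError("no unmasked tokens to evaluate")
    mean_loss = total_loss / total_tokens
    return mean_loss, math.exp(mean_loss)


# A full batch (2 sequences x 2 tokens) followed by a partial batch holding
# one sequence whose second position is padding.
full_batch = ([[1.0, 1.0], [1.0, 1.0]], [[1, 1], [1, 1]])
partial_batch = ([[2.0, 100.0]], [[1, 0]])
mean_loss, ppl = validation_loss_and_ppl([full_batch, partial_batch])
```

Averaging the per-batch means instead would give (1.0 + 2.0) / 2 = 1.5 here, over-weighting the short batch; the token-weighted result is 6.0 / 5 = 1.2.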

dimapihtar and others added 3 commits June 11, 2024 15:37
…lers.py

Co-authored-by: jbaczek <45043825+jbaczek@users.noreply.github.com>
Signed-off-by: Dmytro Pykhtar <37850217+dimapihtar@users.noreply.github.com>
…el.py

Co-authored-by: jbaczek <45043825+jbaczek@users.noreply.github.com>
Signed-off-by: Dmytro Pykhtar <37850217+dimapihtar@users.noreply.github.com>
Signed-off-by: Dmytro Pykhtar <37850217+dimapihtar@users.noreply.github.com>
@dimapihtar dimapihtar merged commit 91ab412 into main Jun 11, 2024
112 checks passed
@dimapihtar dimapihtar deleted the dpykhtar/mcore_ds_features branch June 11, 2024 15:27
janekl pushed a commit that referenced this pull request Jun 12, 2024
* add validation_drop_last and add_extra_token params support for mcore ds

Signed-off-by: dimapihtar <dpihtar@gmail.com>

* pad samples with dummy tokens only

Signed-off-by: dimapihtar <dpihtar@gmail.com>

* Apply isort and black reformatting

Signed-off-by: dimapihtar <dimapihtar@users.noreply.github.com>

* use no_seqlen_plus_one_input_tokens as mcore's add_extra_token

Signed-off-by: dimapihtar <dpihtar@gmail.com>

* revert config

Signed-off-by: dimapihtar <dpihtar@gmail.com>

* revert config

Signed-off-by: dimapihtar <dpihtar@gmail.com>

* set train_valid_test_num_samples[1] to None

Signed-off-by: dimapihtar <dpihtar@gmail.com>

* add test case when validation_drop_last is False

Signed-off-by: dimapihtar <dpihtar@gmail.com>

* revert config

Signed-off-by: dimapihtar <dpihtar@gmail.com>

* set validation_drop_last as True by default

Signed-off-by: dimapihtar <dpihtar@gmail.com>

* Update nemo/collections/nlp/data/language_modeling/megatron/data_samplers.py

Co-authored-by: jbaczek <45043825+jbaczek@users.noreply.github.com>
Signed-off-by: Dmytro Pykhtar <37850217+dimapihtar@users.noreply.github.com>

* Update nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py

Co-authored-by: jbaczek <45043825+jbaczek@users.noreply.github.com>
Signed-off-by: Dmytro Pykhtar <37850217+dimapihtar@users.noreply.github.com>

---------

Signed-off-by: dimapihtar <dpihtar@gmail.com>
Signed-off-by: dimapihtar <dimapihtar@users.noreply.github.com>
Signed-off-by: Dmytro Pykhtar <37850217+dimapihtar@users.noreply.github.com>
Co-authored-by: dimapihtar <dimapihtar@users.noreply.github.com>
Co-authored-by: jbaczek <45043825+jbaczek@users.noreply.github.com>
Signed-off-by: Jan Lasek <janek.lasek@gmail.com>
galv pushed a commit to galv/NeMo that referenced this pull request Jun 13, 2024
JesusPaz pushed a commit to JesusPaz/NeMo that referenced this pull request Jun 18, 2024
rohitrango pushed a commit to rohitrango/NeMo that referenced this pull request Jun 25, 2024
@ko3n1g ko3n1g mentioned this pull request Jul 18, 2024
2 tasks
4 participants