Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for skipping validation on resume + extend saving last ckpt test #4922

Merged
merged 4 commits into from
Sep 12, 2022

Conversation

SeanNaren
Copy link
Collaborator

@SeanNaren SeanNaren commented Sep 12, 2022

What does this PR do ?

When restarting training from a saved checkpoint during validation, validation is re-run as a first step (as the checkpoint was saved just before validation). This PR introduces a new loop injected via the exp_manager that skips validation if we're restarting.

There was also a mistake with the test definition from #4905 and I extended the test further to check @yaoyu-33s' case.

cc @MaximumEntropy @ericharper @titu1994

Changelog

  • Skip initial validation by default when resuming from checkpoint which saved after a validation pass has occurred.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

…ing training at the last ckpt

Signed-off-by: SeanNaren <snarenthiran@nvidia.com>
@SeanNaren SeanNaren added the bug Something isn't working label Sep 12, 2022
@SeanNaren SeanNaren self-assigned this Sep 12, 2022
@SeanNaren
Copy link
Collaborator Author

SeanNaren commented Sep 12, 2022

Would really appreciate people running various tests to check this; it would also be great if we're able to remove https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/nlp/models/language_modeling/megatron_lm_encoder_decoder_model.py#L646 and see if things work as expected.

Signed-off-by: SeanNaren <snarenthiran@nvidia.com>
Signed-off-by: SeanNaren <snarenthiran@nvidia.com>
Signed-off-by: SeanNaren <snarenthiran@nvidia.com>
@SeanNaren
Copy link
Collaborator Author

@MaximumEntropy just to confirm, I don't see any tests that explicitly check to see if training can resume from a nemo megatron checkpoint?

@MaximumEntropy
Copy link
Contributor

@SeanNaren most megatron CI tests should have the same command twice where the first run saves a checkpoint and the subsequent one should load the saved checkpoint and continue training ex: https://github.com/NVIDIA/NeMo/blob/main/Jenkinsfile#L3261 and https://github.com/NVIDIA/NeMo/blob/main/Jenkinsfile#L3302

Copy link
Collaborator

@ericharper ericharper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks!

@SeanNaren SeanNaren merged commit 4f6aba1 into NVIDIA:main Sep 12, 2022
@SeanNaren SeanNaren deleted the fix/validation_resume branch September 12, 2022 22:27
jubick1337 pushed a commit to jubick1337/NeMo that referenced this pull request Sep 28, 2022
…pt test (NVIDIA#4922)

Signed-off-by: Matvei Novikov <mattyson.so@gmail.com>
jubick1337 pushed a commit to jubick1337/NeMo that referenced this pull request Oct 3, 2022
…pt test (NVIDIA#4922)

Signed-off-by: Matvei Novikov <mattyson.so@gmail.com>
jubick1337 pushed a commit to jubick1337/NeMo that referenced this pull request Oct 3, 2022
…pt test (NVIDIA#4922)

Signed-off-by: Matvei Novikov <mattyson.so@gmail.com>
jubick1337 pushed a commit to jubick1337/NeMo that referenced this pull request Oct 3, 2022
…pt test (NVIDIA#4922)

Signed-off-by: Matvei Novikov <mattyson.so@gmail.com>
jubick1337 pushed a commit to jubick1337/NeMo that referenced this pull request Oct 4, 2022
…pt test (NVIDIA#4922)

Signed-off-by: Matvei Novikov <mattyson.so@gmail.com>
jubick1337 pushed a commit to jubick1337/NeMo that referenced this pull request Oct 4, 2022
…pt test (NVIDIA#4922)

Signed-off-by: Matvei Novikov <mattyson.so@gmail.com>
hainan-xv pushed a commit to hainan-xv/NeMo that referenced this pull request Nov 29, 2022
…pt test (NVIDIA#4922)

Signed-off-by: Hainan Xu <hainanx@nvidia.com>
hainan-xv pushed a commit to hainan-xv/NeMo that referenced this pull request Nov 29, 2022
…pt test (NVIDIA#4922)

Signed-off-by: Hainan Xu <hainanx@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants