Add support for skipping validation on resume + extend saving last ckpt test #4922
Conversation
…ing training at the last ckpt Signed-off-by: SeanNaren <snarenthiran@nvidia.com>
Would really appreciate people running various tests to check this; it would also be great if we could remove https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/nlp/models/language_modeling/megatron_lm_encoder_decoder_model.py#L646 and see if things work as expected.
@MaximumEntropy just to confirm: I don't see any tests that explicitly check whether training can resume from a NeMo Megatron checkpoint?
@SeanNaren most Megatron CI tests run the same command twice: the first run saves a checkpoint, and the subsequent one loads the saved checkpoint and continues training, e.g. https://github.com/NVIDIA/NeMo/blob/main/Jenkinsfile#L3261 and https://github.com/NVIDIA/NeMo/blob/main/Jenkinsfile#L3302
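The save-then-resume CI pattern described above can be illustrated with a toy sketch (this is an assumption-laden stand-in, not NeMo code: the `train` function, `ckpt.json` file name, and step counter are all hypothetical):

```python
# Toy illustration of the CI pattern: the same "command" runs twice;
# the second run finds the checkpoint left by the first and resumes
# from it instead of starting over.
import json
import os
import tempfile


def train(exp_dir: str, max_steps: int) -> int:
    """Run to max_steps, resuming from exp_dir/ckpt.json if it exists."""
    ckpt = os.path.join(exp_dir, "ckpt.json")
    step = 0
    if os.path.exists(ckpt):
        # Second invocation: pick up where the first run stopped.
        with open(ckpt) as f:
            step = json.load(f)["step"]
    while step < max_steps:
        step += 1  # stand-in for one training step
    with open(ckpt, "w") as f:
        json.dump({"step": step}, f)
    return step


with tempfile.TemporaryDirectory() as d:
    first = train(d, max_steps=10)   # fresh run, saves a checkpoint
    second = train(d, max_steps=20)  # resumes from step 10, trains to 20
```

In the real CI jobs the two runs share an experiment directory, so the second invocation's resume logic is exercised exactly as a user restarting training would hit it.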
LGTM. Thanks!
…pt test (NVIDIA#4922) Signed-off-by: Matvei Novikov <mattyson.so@gmail.com>
…pt test (NVIDIA#4922) Signed-off-by: Hainan Xu <hainanx@nvidia.com>
What does this PR do ?
When restarting training from a checkpoint that was saved during validation, validation is re-run as the first step (because the checkpoint was saved just before validation). This PR introduces a new loop, injected via the exp_manager, that skips validation if we're restarting. There was also a mistake with the test definition from #4905, and I extended the test further to check @yaoyu-33's case.
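The skip-on-restart behaviour described above can be sketched as follows (a minimal illustration, not the PR's actual loop: the class name, `resuming_from_ckpt` flag, and `maybe_validate` method are all hypothetical; in the PR this wiring happens inside the exp_manager):

```python
# Illustrative sketch: skip the first validation run when training
# resumes from a checkpoint that was written just before validation,
# so the already-completed validation is not repeated.
class SkipFirstValidationLoop:
    """Toy loop that can skip validation exactly once after a resume."""

    def __init__(self, resuming_from_ckpt: bool):
        # Hypothetical flag; set when training restarts from a checkpoint.
        self._skip_next_validation = resuming_from_ckpt
        self.validations_run = 0

    def maybe_validate(self) -> bool:
        """Return True if validation actually ran."""
        if self._skip_next_validation:
            # The checkpoint was saved right before validation, so
            # re-running it would duplicate work; skip exactly once.
            self._skip_next_validation = False
            return False
        self.validations_run += 1
        return True


loop = SkipFirstValidationLoop(resuming_from_ckpt=True)
first = loop.maybe_validate()   # skipped: resumed mid-validation
second = loop.maybe_validate()  # runs normally from here on
```

The key design point is that the skip applies only to the first validation after resume; subsequent validations proceed unchanged.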
cc @MaximumEntropy @ericharper @titu1994
Changelog
Before your PR is "Ready for review"
Pre checks:
PR Type:
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs to various areas.