Fixes how num_layers relates to pipeline_model_parallel_size in ESM2 #829
Merged
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
1bf1ed5 to d9526dc
pstjohn reviewed Apr 17, 2025
gagank1 commented May 21, 2025
pstjohn reviewed May 21, 2025
pstjohn reviewed May 21, 2025
dorotat-nv suggested changes May 21, 2025
pstjohn approved these changes May 22, 2025
dorotat-nv approved these changes May 22, 2025
/ok to test
@gagank1, there was an error processing your request. See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/
/ok to test 8b2e151
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #829   +/-   ##
=======================================
  Coverage        ?   84.18%
=======================================
  Files           ?      142
  Lines           ?     8926
  Branches        ?        0
=======================================
  Hits            ?     7514
  Misses          ?     1412
  Partials        ?        0
camirr-nv pushed a commit that referenced this pull request Jun 26, 2025
…829)
### Description
Fixes broken ESM2 training for pipeline parallel > 1 case. It was caused by NeMo changing how it handles the case where num_layers is not divisible by pp.
Successful jet pipeline: https://prod.blsm.nvidia.com/bionemo-external-bionemo-fw/job/branch_pipeline_jet/337/
Closes #784
### Type of changes
- [x] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Refactor
- [ ] Documentation update
- [ ] Other (please describe):
### Pre-submit Checklist
<!--- Ensure all items are completed before submitting -->
- [x] I have tested these changes locally
- [ ] I have updated the documentation accordingly
- [ ] I have added/updated tests as needed
- [x] All existing tests pass successfully
---------
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
Signed-off-by: Ubuntu <camirr@nvidia.com>
Description
Fixes ESM2 training, which was broken whenever pipeline_model_parallel_size was greater than 1. The breakage was caused by NeMo changing how it handles the case where num_layers is not divisible by the pipeline parallel size (pp).
Successful jet pipeline: https://prod.blsm.nvidia.com/bionemo-external-bionemo-fw/job/branch_pipeline_jet/337/
Closes #784
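To illustrate the condition the description refers to, here is a minimal sketch of a divisibility check between num_layers and pipeline_model_parallel_size. It is not the code changed in this PR and it does not reproduce NeMo's behavior; the helper names `is_even_split` and `layers_per_stage` are hypothetical, and the layer counts in the usage lines are illustrative values only.

```python
def is_even_split(num_layers: int, pipeline_model_parallel_size: int) -> bool:
    """Return True when the layers divide evenly across pipeline stages.

    Hypothetical helper for illustration; not part of BioNeMo or NeMo.
    """
    return (
        pipeline_model_parallel_size > 0
        and num_layers % pipeline_model_parallel_size == 0
    )


def layers_per_stage(num_layers: int, pipeline_model_parallel_size: int) -> int:
    """Return the number of layers per pipeline stage.

    Raises ValueError when the split is uneven, which is the case the PR
    description says is handled differently by newer NeMo versions.
    """
    if num_layers <= 0 or pipeline_model_parallel_size <= 0:
        raise ValueError("num_layers and pipeline_model_parallel_size must be positive")
    per_stage, remainder = divmod(num_layers, pipeline_model_parallel_size)
    if remainder != 0:
        raise ValueError(
            f"num_layers={num_layers} is not divisible by "
            f"pipeline_model_parallel_size={pipeline_model_parallel_size}"
        )
    return per_stage


if __name__ == "__main__":
    # Illustrative values: 36 layers split evenly over 2 stages, 33 do not.
    print(layers_per_stage(36, 2))
    print(is_even_split(33, 2))
```

A check like this, run before launching training, would surface an uneven split as a configuration error rather than as a failure inside the training job.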
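Type of changes
- [x] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Refactor
- [ ] Documentation update
- [ ] Other (please describe):
Pre-submit Checklist
- [x] I have tested these changes locally
- [ ] I have updated the documentation accordingly
- [ ] I have added/updated tests as needed
- [x] All existing tests pass successfully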