Skip to content

Fixes how num_layers relates to pipeline_model_parallel_size in ESM2#829

Merged
gagank1 merged 18 commits into
mainfrom
gkaushik/BIONEMO-1554
May 22, 2025
Merged

Fixes how num_layers relates to pipeline_model_parallel_size in ESM2#829
gagank1 merged 18 commits into
mainfrom
gkaushik/BIONEMO-1554

Conversation

@gagank1
Copy link
Copy Markdown
Contributor

@gagank1 gagank1 commented Apr 14, 2025

Description

Fixes broken ESM2 training for pipeline parallel > 1 case. It was caused by NeMo changing how it handles the case where num_layers is not divisible by pp.

Successful jet pipeline: https://prod.blsm.nvidia.com/bionemo-external-bionemo-fw/job/branch_pipeline_jet/337/

Closes #784

Type of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Refactor
  • Documentation update
  • Other (please describe):

Pre-submit Checklist

  • I have tested these changes locally
  • I have updated the documentation accordingly
  • I have added/updated tests as needed
  • All existing tests pass successfully

@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented Apr 14, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@gagank1 gagank1 requested a review from trvachov April 15, 2025 05:37
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
@gagank1 gagank1 force-pushed the gkaushik/BIONEMO-1554 branch from 1bf1ed5 to d9526dc Compare April 15, 2025 05:53
Comment thread sub-packages/bionemo-esm2/src/bionemo/esm2/model/model.py Outdated
gagank1 added 13 commits May 12, 2025 10:39
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
Comment thread ci/benchmarks/perf/esm2_pretrain.yaml Outdated
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
@gagank1 gagank1 requested a review from pstjohn May 21, 2025 11:42
Comment thread ci/benchmarks/perf/esm2_pretrain.yaml
Comment thread sub-packages/bionemo-esm2/src/bionemo/esm2/scripts/train_esm2.py
Comment thread ci/benchmarks/perf/esm2_pretrain.yaml
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
@gagank1 gagank1 requested a review from dorotat-nv May 21, 2025 17:13
@gagank1 gagank1 enabled auto-merge May 22, 2025 08:20
@gagank1
Copy link
Copy Markdown
Contributor Author

gagank1 commented May 22, 2025

/ok to test

@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented May 22, 2025

/ok to test

@gagank1, there was an error processing your request: E1

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/

@gagank1
Copy link
Copy Markdown
Contributor Author

gagank1 commented May 22, 2025

/ok to test 8b2e151

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented May 22, 2025

Codecov Report

❌ Patch coverage is 66.66667% with 1 line in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (main@ce85125). Learn more about missing BASE report.
⚠️ Report is 414 commits behind head on main.

Files with missing lines Patch % Lines
...ionemo-esm2/src/bionemo/esm2/scripts/train_esm2.py 66.66% 1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main     #829   +/-   ##
=======================================
  Coverage        ?   84.18%           
=======================================
  Files           ?      142           
  Lines           ?     8926           
  Branches        ?        0           
=======================================
  Hits            ?     7514           
  Misses          ?     1412           
  Partials        ?        0           
Files with missing lines Coverage Δ
...ionemo-esm2/src/bionemo/esm2/scripts/train_esm2.py 93.18% <66.66%> (ø)
🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@gagank1 gagank1 added this pull request to the merge queue May 22, 2025
Merged via the queue into main with commit 2d8923d May 22, 2025
14 checks passed
@gagank1 gagank1 deleted the gkaushik/BIONEMO-1554 branch May 22, 2025 11:06
camirr-nv pushed a commit that referenced this pull request Jun 26, 2025
…829)

### Description

Fixes broken ESM2 training for pipeline parallel > 1 case. It was caused
by NeMo changing how it handles the case where num_layers is not
divisible by pp.

Successful jet pipeline:
https://prod.blsm.nvidia.com/bionemo-external-bionemo-fw/job/branch_pipeline_jet/337/

Closes #784

### Type of changes

- [x]  Bug fix (non-breaking change which fixes an issue)
- [ ]  New feature (non-breaking change which adds functionality)
- [ ]  Refactor
- [ ]  Documentation update
- [ ]  Other (please describe):

### Pre-submit Checklist
<!--- Ensure all items are completed before submitting -->

 - [x] I have tested these changes locally
 - [ ] I have updated the documentation accordingly
 - [ ] I have added/updated tests as needed
 - [x] All existing tests pass successfully

---------

Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
Signed-off-by: Ubuntu <camirr@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working ESM2

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] Broken pipeline parallel for ESM2

4 participants