unify the implementation of early training termination across BioNeMo subpackages and update benchmarks by dorotat-nv · Pull Request #803 · NVIDIA-BioNeMo/bionemo-framework

dorotat-nv · 2025-04-07T13:28:57Z

Description

The current implementations (Evo2 and ESM2) use different approaches to stop training at a specific step while preserving the full learning rate schedule and other training characteristics. This PR attempts to unify them.

Evo2: Uses checkpoint mechanism to stop training after K steps
ESM2: Implements a different solution in train_esm2.py

Evo2: https://github.com/NVIDIA/bionemo-framework/blob/a643d8c073262a542c7b94cc39648173d[…]87c1d8e/sub-packages/bionemo-evo2/src/bionemo/evo2/run/train.py

ESM2: https://github.com/NVIDIA/bionemo-framework/blob/a643d8c073262a542c7b94cc39648173d[…]ub-packages/bionemo-esm2/src/bionemo/esm2/scripts/train_esm2.py
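To illustrate the idea of stopping early without changing the learning rate schedule, here is a minimal, framework-free sketch. It is not the BioNeMo, NeMo, or Lightning API: `cosine_lr`, `run_training`, `schedule_max_steps`, and `early_stop_step` are hypothetical names chosen for this example, and the warmup and learning rate values are arbitrary. The assumption is that the scheduler is always parameterized by the full training length, while the loop itself is cut short at a separate step count.

```python
import math
from typing import List, Optional


def cosine_lr(step: int, schedule_max_steps: int, warmup_steps: int = 10,
              max_lr: float = 1e-3, min_lr: float = 1e-5) -> float:
    """Linear warmup followed by cosine decay over `schedule_max_steps` (hypothetical helper)."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    decay_steps = max(1, schedule_max_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def run_training(schedule_max_steps: int, early_stop_step: Optional[int] = None) -> List[float]:
    """Return the learning rate used at each executed step.

    The schedule is always computed against `schedule_max_steps`; only the
    number of executed steps changes when `early_stop_step` is given.
    """
    stop_at = schedule_max_steps
    if early_stop_step is not None:
        stop_at = min(early_stop_step, schedule_max_steps)
    lrs = []
    for step in range(stop_at):
        # A real loop would run forward/backward and the optimizer step here.
        lrs.append(cosine_lr(step, schedule_max_steps))
    return lrs


full = run_training(schedule_max_steps=1000)
short = run_training(schedule_max_steps=1000, early_stop_step=100)
# For contrast: shrinking the schedule itself compresses the whole decay into 100 steps.
naive = run_training(schedule_max_steps=100)
```

In this sketch `short` follows exactly the first 100 steps of `full`, whereas `naive` has already decayed close to `min_lr` by its last step, which is why a benchmark that only lowers the maximum step count does not reproduce the early phase of a full run.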

Addressing issue: #749

Type of changes

- [ ] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [x] Refactor
- [ ] Documentation update
- [ ] Other (please describe):

CI Pipeline Configuration

Configure CI behavior by applying the relevant labels:

SKIP_CI - Skip all continuous integration tests
INCLUDE_NOTEBOOKS_TESTS - Execute notebook validation tests in pytest
INCLUDE_SLOW_TESTS - Execute tests labelled as slow in pytest for extensive testing

Note

By default, the notebooks validation tests are skipped unless explicitly enabled.

Authorizing CI Runs

We use copy-pr-bot to manage authorization of CI runs on NVIDIA's compute resources.

* If a pull request is opened by a trusted user and contains only trusted changes, the pull request's code will automatically be copied to a pull-request/ prefixed branch in the source repository (e.g. pull-request/123).
* If a pull request is opened by an untrusted user or contains untrusted changes, an NVIDIA org member must leave an /ok to test comment on the pull request to trigger CI. This will need to be done for each new commit.

Usage

TODO: Add code snippet

Pre-submit Checklist

- [ ] I have tested these changes locally
- [ ] I have updated the documentation accordingly
- [ ] I have added/updated tests as needed
- [ ] All existing tests pass successfully

copy-pr-bot · 2025-04-07T13:29:01Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

sichu2023

LGTM

dorotat-nv · 2025-04-09T16:49:51Z

/ok to test

dorotat-nv · 2025-04-10T11:26:36Z

/ok to test

codecov-commenter · 2025-04-10T12:37:45Z

Codecov Report

❌ Patch coverage is 50.00000% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 84.34%. Comparing base (f320920) to head (a25f6a1).
⚠️ Report is 441 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...ackages/bionemo-evo2/src/bionemo/evo2/run/train.py | 0.00% | 1 Missing ⚠️ |

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #803   +/-   ##
=======================================
  Coverage   84.33%   84.34%           
=======================================
  Files         137      137           
  Lines        8627     8626    -1     
=======================================
  Hits         7276     7276           
+ Misses       1351     1350    -1

| Files with missing lines | Coverage Δ | |
| --- | --- | --- |
| ...ionemo-esm2/src/bionemo/esm2/scripts/train_esm2.py | `93.79% <100.00%> (+0.04%)` | ⬆️ |
| ...ackages/bionemo-evo2/src/bionemo/evo2/run/train.py | `16.66% <0.00%> (-0.37%)` | ⬇️ |

dorotat-nv · 2025-04-11T09:03:48Z

/ok to test

… subpackages and update benchmarks (#803)
Signed-off-by: Cory Ye <cye@nvidia.com>

… subpackages and update benchmarks (#803)
Signed-off-by: Farhad Ramezanghorbani <farhadr@nvidia.com>

dorotat-nv added 2 commits April 7, 2025 13:17

added full training max steps to scheduler lr

b7acca5

commented broken part

13dc060

dorotat-nv added the SKIP_CI label Apr 7, 2025

dorotat-nv requested review from malcolmgreaves and pstjohn as code owners April 7, 2025 13:28

pstjohn approved these changes Apr 7, 2025

sichu2023 approved these changes Apr 7, 2025

dorotat-nv added 2 commits April 8, 2025 13:33

added possibility of an early stopping to trainer

f57266b

updated jet configs

42c6376

dorotat-nv requested review from cspades, farhadrgh, jomitchellnv, jstjohn, jwilber, skothenhill-nv and trvachov as code owners April 8, 2025 12:34

dorotat-nv added 2 commits April 8, 2025 13:37

added evo2 perf training

7659a55

added perf benchmarks

c442eef

dorotat-nv changed the title ~~Dorotat/update num steps esm2 benchmarks~~ Dorotat/update early stop num steps for Evo2 and ESM2 benchmarks Apr 8, 2025

dorotat-nv changed the title ~~Dorotat/update early stop num steps for Evo2 and ESM2 benchmarks~~ unify the implementation of early training termination across BioNeMo subpackages and update benchmarks Apr 8, 2025

fixed number of steps for perf logging

016a0c0

dorotat-nv removed the SKIP_CI label Apr 8, 2025

dorotat-nv added 4 commits April 8, 2025 16:19

fixed number of steps for perf logging

881bac4

temporary decreasing wall time

6a66fb0

added labels for tp and pp

ea823dc

removed limit

67436d7

dorotat-nv enabled auto-merge April 9, 2025 17:03

dorotat-nv disabled auto-merge April 9, 2025 17:31

dorotat-nv added 3 commits April 9, 2025 18:54

updated perf benchmarks with issues

4088815

added better wandb run name option

50d4e30

Merge remote-tracking branch 'origin/main' into dorotat/update-num-steps-esm2-benchmarks

1015251

Merge branch 'main' into dorotat/update-num-steps-esm2-benchmarks

a25f6a1

dorotat-nv added the INCLUDE_SLOW_TESTS label Apr 11, 2025

dorotat-nv enabled auto-merge April 11, 2025 09:04

dorotat-nv added this pull request to the merge queue Apr 11, 2025

Merged via the queue into main with commit e7c1089 Apr 11, 2025

dorotat-nv deleted the dorotat/update-num-steps-esm2-benchmarks branch April 11, 2025 11:46

dorotat-nv merged 15 commits into main from dorotat/update-num-steps-esm2-benchmarks
