
unify the implementation of early training termination across BioNeMo subpackages and update benchmarks #803

Merged
dorotat-nv merged 15 commits into main from dorotat/update-num-steps-esm2-benchmarks on Apr 11, 2025

Conversation

@dorotat-nv
Collaborator

@dorotat-nv dorotat-nv commented Apr 7, 2025

Description

The current Evo2 and ESM2 implementations use different approaches to stop training at a specific step while keeping the full learning rate schedule and other run characteristics unchanged. This PR unifies the two.

Evo2: uses a checkpoint mechanism to stop training after K steps
ESM2: implements a different solution in train_esm2.py

Evo2: https://github.com/NVIDIA/bionemo-framework/blob/a643d8c073262a542c7b94cc39648173d[…]87c1d8e/sub-packages/bionemo-evo2/src/bionemo/evo2/run/train.py

ESM2: https://github.com/NVIDIA/bionemo-framework/blob/a643d8c073262a542c7b94cc39648173d[…]ub-packages/bionemo-esm2/src/bionemo/esm2/scripts/train_esm2.py

Addressing issue: #749
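
To illustrate the behavior described above, here is a minimal, framework-free sketch of stopping at step K while the learning rate schedule is still defined over the full run. It is not the BioNeMo implementation: `cosine_lr`, `run_training`, `stop_step`, and all hyperparameter values are hypothetical names and numbers chosen for illustration only.

```python
import math
from typing import List, Optional


def cosine_lr(step: int, max_steps: int, warmup_steps: int,
              peak_lr: float, min_lr: float) -> float:
    """Linear warmup, then cosine decay over the full max_steps horizon."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def run_training(max_steps: int, stop_step: Optional[int] = None,
                 warmup_steps: int = 10, peak_lr: float = 1e-3,
                 min_lr: float = 1e-5) -> List[float]:
    """Return the learning rate used at each executed step.

    The schedule always depends on max_steps; stop_step (hypothetical name)
    only ends the loop early and never reshapes the schedule.
    """
    lrs = []
    for step in range(max_steps):
        if stop_step is not None and step >= stop_step:
            break  # early termination: schedule horizon is untouched
        lrs.append(cosine_lr(step, max_steps, warmup_steps, peak_lr, min_lr))
    return lrs


full_run = run_training(max_steps=1000)
early_stopped = run_training(max_steps=1000, stop_step=100)
shortened = run_training(max_steps=100)  # naive alternative: shrink max_steps

# The early-stopped run follows the first 100 steps of the full schedule exactly.
print(early_stopped == full_run[:100])
# Shrinking max_steps instead compresses the decay, so the final LR differs.
print(early_stopped[-1] > shortened[-1])
```

The comparison with `shortened` shows why simply lowering the total step count is not equivalent: the schedule would decay to its minimum within the shortened run, which changes the benchmark's characteristics.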

Type of changes

  • [ ] Bug fix (non-breaking change which fixes an issue)
  • [ ] New feature (non-breaking change which adds functionality)
  • [x] Refactor
  • [ ] Documentation update
  • [ ] Other (please describe):

CI Pipeline Configuration

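Configure CI behavior by applying the relevant labels:

  • SKIP_CI - Skip all continuous integration tests
  • INCLUDE_NOTEBOOKS_TESTS - Execute notebook validation tests in pytest
  • INCLUDE_SLOW_TESTS - Execute tests labelled as slow in pytest for extensive testing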

Note

By default, the notebooks validation tests are skipped unless explicitly enabled.

Authorizing CI Runs

We use copy-pr-bot to manage authorization of CI
runs on NVIDIA's compute resources.

  • If a pull request is opened by a trusted user and contains only trusted changes, the pull request's code will
    automatically be copied to a pull-request/ prefixed branch in the source repository (e.g. pull-request/123)
  • If a pull request is opened by an untrusted user or contains untrusted changes, an NVIDIA org member must leave an
    /ok to test comment on the pull request to trigger CI. This will need to be done for each new commit.

Usage

TODO: Add code snippet

Pre-submit Checklist

  • I have tested these changes locally
  • I have updated the documentation accordingly
  • I have added/updated tests as needed
  • All existing tests pass successfully

@copy-pr-bot

copy-pr-bot Bot commented Apr 7, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Contributor

@sichu2023 sichu2023 left a comment

LGTM

@dorotat-nv dorotat-nv changed the title from "Dorotat/update num steps esm2 benchmarks" to "Dorotat/update early stop num steps for Evo2 and ESM2 benchmarks" Apr 8, 2025
@dorotat-nv dorotat-nv changed the title from "Dorotat/update early stop num steps for Evo2 and ESM2 benchmarks" to "unify the implementation of early training termination across BioNeMo subpackages and update benchmarks" Apr 8, 2025
@dorotat-nv dorotat-nv removed the SKIP_CI label Apr 8, 2025
@dorotat-nv
Collaborator Author

/ok to test

@dorotat-nv dorotat-nv enabled auto-merge April 9, 2025 17:03
@dorotat-nv dorotat-nv disabled auto-merge April 9, 2025 17:31
@dorotat-nv
Collaborator Author

/ok to test

@codecov-commenter

codecov-commenter commented Apr 10, 2025

Codecov Report

❌ Patch coverage is 50.00000% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 84.34%. Comparing base (f320920) to head (a25f6a1).
⚠️ Report is 441 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...ackages/bionemo-evo2/src/bionemo/evo2/run/train.py | 0.00% | 1 Missing ⚠️ |

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #803   +/-   ##
=======================================
  Coverage   84.33%   84.34%           
=======================================
  Files         137      137           
  Lines        8627     8626    -1     
=======================================
  Hits         7276     7276           
+ Misses       1351     1350    -1     

| Files with missing lines | Coverage Δ |
|---|---|
| ...ionemo-esm2/src/bionemo/esm2/scripts/train_esm2.py | 93.79% <100.00%> (+0.04%) ⬆️ |
| ...ackages/bionemo-evo2/src/bionemo/evo2/run/train.py | 16.66% <0.00%> (-0.37%) ⬇️ |
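
As a quick check of the coverage diff above, the percentages can be recomputed from the reported hits and line counts. The sketch below assumes coverage is hits divided by total lines and that the displayed value is truncated to two decimals rather than rounded; that truncation is inferred from the numbers in this report, not taken from Codecov's documentation, and `coverage_percent` is a hypothetical helper.

```python
def coverage_percent(hits: int, lines: int) -> str:
    """Return hits/lines as a percentage truncated (not rounded) to two decimals."""
    # Work in hundredths of a percent with integer floor division
    # so no floating-point rounding is involved.
    hundredths = hits * 10_000 // lines
    whole, frac = divmod(hundredths, 100)
    return f"{whole}.{frac:02d}%"


base = coverage_percent(hits=7276, lines=8627)  # main (f320920)
head = coverage_percent(hits=7276, lines=8626)  # PR head (a25f6a1)
print(base, head)  # prints "84.33% 84.34%"
```

The hit count is unchanged; the project figure moves only because the PR removes one previously uncovered line (misses go from 1351 to 1350).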

@dorotat-nv
Collaborator Author

/ok to test

@dorotat-nv dorotat-nv enabled auto-merge April 11, 2025 09:04
@dorotat-nv dorotat-nv added this pull request to the merge queue Apr 11, 2025
Merged via the queue into main with commit e7c1089 Apr 11, 2025
@dorotat-nv dorotat-nv deleted the dorotat/update-num-steps-esm2-benchmarks branch April 11, 2025 11:46
cspades pushed a commit that referenced this pull request May 4, 2025
… subpackages and update benchmarks (#803)

### Description
Current implementations (Evo2 and ESM2) use different approaches to stop
training at specific steps while maintaining the full learning rate
schedule or other characteristics. Trying to unify it

Evo2: Uses checkpoint mechanism to stop training after K steps
ESM2: Implements a different solution in train_esm2.py

Evo2:
https://github.com/NVIDIA/bionemo-framework/blob/a643d8c073262a542c7b94cc39648173d[…]87c1d8e/sub-packages/bionemo-evo2/src/bionemo/evo2/run/train.py

ESM2
https://github.com/NVIDIA/bionemo-framework/blob/a643d8c073262a542c7b94cc39648173d[…]ub-packages/bionemo-esm2/src/bionemo/esm2/scripts/train_esm2.py

Addressing issue: #749

### Type of changes
<!-- Mark the relevant option with an [x] -->

- [ ]  Bug fix (non-breaking change which fixes an issue)
- [ ]  New feature (non-breaking change which adds functionality)
- [x]  Refactor
- [ ]  Documentation update
- [ ]  Other (please describe):

### CI Pipeline Configuration
Configure CI behavior by applying the relevant labels:

-
[SKIP_CI](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/user-guide/contributing/contributing.md#skip_ci)
- Skip all continuous integration tests
-
[INCLUDE_NOTEBOOKS_TESTS](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/user-guide/contributing/contributing.md#include_notebooks_tests)
- Execute notebook validation tests in pytest
-
[INCLUDE_SLOW_TESTS](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/user-guide/contributing/contributing.md#include_slow_tests)
- Execute tests labelled as slow in pytest for extensive testing

> [!NOTE]
> By default, the notebooks validation tests are skipped unless
explicitly enabled.

#### Authorizing CI Runs

We use
[copy-pr-bot](https://docs.gha-runners.nvidia.com/apps/copy-pr-bot/#automation)
to manage authorization of CI
runs on NVIDIA's compute resources.

* If a pull request is opened by a trusted user and contains only
trusted changes, the pull request's code will
automatically be copied to a pull-request/ prefixed branch in the source
repository (e.g. pull-request/123)
* If a pull request is opened by an untrusted user or contains untrusted
changes, an NVIDIA org member must leave an
`/ok to test` comment on the pull request to trigger CI. This will need
to be done for each new commit.

### Usage
<!--- How does a user interact with the changed code -->
```python
TODO: Add code snippet
```

### Pre-submit Checklist
<!--- Ensure all items are completed before submitting -->

 - [ ] I have tested these changes locally
 - [ ] I have updated the documentation accordingly
 - [ ] I have added/updated tests as needed
 - [ ] All existing tests pass successfully

Signed-off-by: Cory Ye <cye@nvidia.com>
farhadrgh pushed a commit that referenced this pull request May 5, 2025
… subpackages and update benchmarks (#803)

Signed-off-by: Farhad Ramezanghorbani <farhadr@nvidia.com>