Add ESM2 Finetuning Benchmark Configuration #964

nvmvle · 2025-07-01T08:50:00Z

Description

This PR adds comprehensive benchmark configurations for ESM2 finetuning to support performance testing and validation. The changes introduce two new benchmark configurations (partial-conv and perf) along with enhanced finetuning capabilities including checkpointing control, TensorBoard logging, and TFLOPS measurement callbacks.
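As a rough illustration of what a TFLOPS measurement reports, the sketch below estimates achieved per-GPU TFLOPS from the step time. It assumes the common rule of thumb of about 6 × parameters × tokens FLOPs for one forward and backward pass of a dense transformer. This is not taken from the callback added in this PR; the function name and every example number are hypothetical.

```python
def estimate_tflops_per_gpu(
    num_params: float,
    tokens_per_step: int,
    step_time_s: float,
    num_gpus: int,
) -> float:
    """Approximate achieved TFLOPS per GPU for one training step.

    Assumes FLOPs ~= 6 * params * tokens (forward + backward). The estimate
    ignores attention FLOPs, frozen layers, and activation recomputation,
    so treat the result as a ballpark figure only.
    """
    if step_time_s <= 0 or num_gpus <= 0:
        raise ValueError("step_time_s and num_gpus must be positive")
    flops_per_step = 6.0 * num_params * tokens_per_step
    return flops_per_step / (step_time_s * num_gpus * 1e12)


# Hypothetical run: 650M parameters, 8 GPUs, a global batch of
# 64 sequences x 1024 tokens, and 0.5 s per optimizer step.
tflops = estimate_tflops_per_gpu(650e6, 64 * 1024, 0.5, 8)
print(f"{tflops:.1f} TFLOPS per GPU")
```

When finetuning with frozen layers, the backward pass is cheaper than this rule assumes, so the estimate would overstate the work actually done.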

Key enhancements include:

- Added ESM2 finetuning YAML configurations for partial-conv and performance benchmarks
- Implemented checkpointing control with --disable-checkpointing option for faster benchmark runs
- Added TensorBoard logging support for training metrics visualization
- Introduced TFLOPS callback option to measure and log computational performance
- Enhanced training control parameters including max_steps, early stopping, and batch size configurations
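The options listed above could be exposed on the command line roughly as follows. This is a minimal sketch, not the parser in finetune_esm2.py: only --disable-checkpointing is named in this PR, while --max-steps, --early-stop-step, and --tflops-callback are hypothetical names chosen for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a toy parser for benchmark-related finetuning options."""
    parser = argparse.ArgumentParser(description="Sketch of benchmark flags")
    # Flag named in the PR description: skip checkpoint writes for faster runs.
    parser.add_argument("--disable-checkpointing", action="store_true")
    # Hypothetical names below; the real script may spell these differently.
    parser.add_argument("--max-steps", type=int, default=None)
    parser.add_argument("--early-stop-step", type=int, default=None)
    parser.add_argument("--tflops-callback", action="store_true")
    return parser


def checkpointing_enabled(args: argparse.Namespace) -> bool:
    """Checkpointing stays on unless the user explicitly disables it."""
    return not args.disable_checkpointing


args = build_parser().parse_args(
    ["--disable-checkpointing", "--max-steps", "200", "--tflops-callback"]
)
print(checkpointing_enabled(args), args.max_steps, args.tflops_callback)
```

Using a store_true flag keeps checkpointing enabled by default, so existing invocations would behave as before and only benchmark runs opt out.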

Type of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Refactor
Documentation update
Other (please describe):

CI Pipeline Configuration

Configure CI behavior by applying the relevant labels:

SKIP_CI - Skip all continuous integration tests
INCLUDE_NOTEBOOKS_TESTS - Execute notebook validation tests in pytest
INCLUDE_SLOW_TESTS - Execute tests labelled as slow in pytest for extensive testing

Note

By default, the notebooks validation tests are skipped unless explicitly enabled.

Authorizing CI Runs

We use copy-pr-bot to manage authorization of CI
runs on NVIDIA's compute resources.

- If a pull request is opened by a trusted user and contains only trusted changes, the pull request's code will
  automatically be copied to a pull-request/ prefixed branch in the source repository (e.g. pull-request/123).
- If a pull request is opened by an untrusted user or contains untrusted changes, an NVIDIA org member must leave an
  /ok to test comment on the pull request to trigger CI. This will need to be done for each new commit.
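The two authorization rules above can be summarized as a small decision function. This is only a sketch of the rules as stated here, not copy-pr-bot's implementation; the PullRequest type and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PullRequest:
    number: int
    trusted_user: bool
    trusted_changes_only: bool


def ci_branch(pr: PullRequest) -> str:
    """Name of the pull-request/ prefixed branch that CI runs against."""
    return f"pull-request/{pr.number}"


def needs_ok_to_test(pr: PullRequest) -> bool:
    """True when an org member must comment /ok to test for each new commit."""
    return not (pr.trusted_user and pr.trusted_changes_only)


pr = PullRequest(number=123, trusted_user=True, trusted_changes_only=True)
print(ci_branch(pr), needs_ok_to_test(pr))
```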

Pre-submit Checklist

I have tested these changes locally
I have updated the documentation accordingly
I have added/updated tests as needed
All existing tests pass successfully

Signed-off-by: My Le <mvle@nvidia.com>

copy-pr-bot · 2025-07-01T08:50:04Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

nvmvle · 2025-07-22T07:16:20Z

/ok to test

copy-pr-bot · 2025-07-22T07:16:23Z

/ok to test

@nvmvle, there was an error processing your request: E1

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/

nvmvle · 2025-07-22T07:19:43Z

/ok to test 6a7c952

…ontrol. Increase time limit to 14400 seconds, adjust GPU and node settings, and introduce early stopping functionality via a new argument. Modify related test cases to validate early stopping behavior and ensure proper argument parsing.

…nce benchmarks. Remove unused parameters, streamline script arguments, and ensure consistency in training settings across both configurations. Introduce new parameters for better control over training processes.

yzhang123 · 2025-07-25T15:27:32Z

ci/benchmarks/partial-conv/esm2_finetuning.yaml

+  workspace: /workspace/bionemo2
+  data_base_path: /data/FLIP
+  restore_from_checkpoint_path: /data/esm2_650M_nemo2
+  nodes: [4]


Why do you need so many nodes?

nvmvle · 2025-07-28

Updated to 1 node.

jwilber · 2025-07-29T18:05:47Z

/ok to test 4a32bc1

codecov-commenter · 2025-07-29T19:58:44Z

Codecov Report

❌ Patch coverage is 82.35294% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.30%. Comparing base (612ea21) to head (ef89eb5).
✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...emo-esm2/src/bionemo/esm2/scripts/finetune_esm2.py | 82.35% | 3 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #964      +/-   ##
==========================================
- Coverage   83.32%   83.30%   -0.03%     
==========================================
  Files         148      148              
  Lines        9746     9758      +12     
==========================================
+ Hits         8121     8129       +8     
- Misses       1625     1629       +4

| Files with missing lines | Coverage Δ |
|---|---|
| ...emo-esm2/src/bionemo/esm2/scripts/finetune_esm2.py | `90.90% <82.35%> (-1.17%)` ⬇️ |

... and 1 file with indirect coverage changes
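The percentages in the report can be reproduced from the hit and line counts in the coverage diff. The sketch below is a consistency check only: rounding down to two decimals happens to match the displayed figures, which is an observation rather than documented Codecov behavior, and the 14-of-17 patch split is inferred from "82.35294% with 3 lines missing".

```python
import math


def floor2(value: float) -> float:
    """Round down to two decimal places."""
    return math.floor(value * 100) / 100


# Counts copied from the coverage diff above.
base_hits, base_lines = 8121, 9746
head_hits, head_lines = 8129, 9758

base_pct = base_hits / base_lines * 100
head_pct = head_hits / head_lines * 100

base_shown = floor2(base_pct)              # project coverage on main
head_shown = floor2(head_pct)              # project coverage on the PR head
delta_shown = floor2(head_pct - base_pct)  # change in project coverage

# Inferred: 17 changed lines, of which 3 are uncovered.
patch_pct = (17 - 3) / 17 * 100

print(base_shown, head_shown, delta_shown, round(patch_pct, 5))
print(base_lines - base_hits, head_lines - head_hits)  # miss counts
```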

trvachov · 2025-07-30T02:29:08Z

Bypassing rules to merge to unblock My -- we just released so low risk of screwing up any timelines.

trvachov · 2025-07-30T02:30:02Z

/ok to test ef89eb5

dorotat-nv reviewed Jul 4, 2025

ci/benchmarks/partial-conv/esm2_finetuning.yaml (outdated, resolved)

dorotat-nv reviewed Jul 4, 2025

ci/benchmarks/partial-conv/esm2_finetuning.yaml (outdated, resolved)

dorotat-nv reviewed Jul 4, 2025

ci/benchmarks/partial-conv/esm2_finetuning.yaml (outdated, resolved)

dorotat-nv reviewed Jul 4, 2025

ci/benchmarks/partial-conv/esm2_finetuning.yaml (resolved)

nvmvle force-pushed the mvle/onboarding-esm2-finetuning branch 2 times, most recently from 9ca500f to 3ae0439 on July 15, 2025 04:17

nvmvle force-pushed the mvle/onboarding-esm2-finetuning branch from a94c575 to 6a7c952 on July 22, 2025 07:01

nvmvle changed the title ~~[DRAFT] Add ESM2 Finetuning Benchmark Configuration~~ Add ESM2 Finetuning Benchmark Configuration Jul 22, 2025

nvmvle marked this pull request as ready for review July 22, 2025 07:17

nvmvle requested review from cspades, farhadrgh, jomitchellnv, jstjohn, jwilber, pstjohn, sichu2023, skothenhill-nv and trvachov as code owners July 22, 2025 07:17

pstjohn reviewed Jul 22, 2025

3rdparty/Megatron-LM (resolved)

pstjohn reviewed Jul 22, 2025

sub-packages/bionemo-esm2/src/bionemo/esm2/scripts/finetune_esm2.py (resolved)

jstjohn reviewed Jul 22, 2025

3rdparty/NeMo (resolved)

jstjohn reviewed Jul 22, 2025

sub-packages/bionemo-esm2/src/bionemo/esm2/scripts/finetune_esm2.py (outdated, resolved)

jstjohn reviewed Jul 22, 2025

sub-packages/bionemo-esm2/src/bionemo/esm2/scripts/finetune_esm2.py (resolved)

jwilber reviewed Jul 22, 2025

ci/benchmarks/perf/esm2_finetuning.yaml (resolved)

nvmvle force-pushed the mvle/onboarding-esm2-finetuning branch from e5ce3b9 to 0a8ee44 on July 23, 2025 07:23

jwilber approved these changes Jul 24, 2025

nvmvle added 13 commits July 24, 2025 19:56

Enable tensorboard logger in esm2 finetuning script for partial conv benchmarks to enhance monitoring capabilities during training.

45616db

Enhance tensorboard logging assertions in esm2 finetuning tests to ensure proper event file existence and validity, improving error reporting during training monitoring.

8e1fd7d

Add checkpointing control to esm2 finetuning tests, allowing users to enable or disable checkpoint creation via a new parameter. Update assertions to verify checkpoint behavior based on this parameter, enhancing test coverage and flexibility.

64c3bac

update max_steps

cd1d110

Increase stop_steps in esm2 finetuning configuration from 1500 to 3000 for extended training duration.

e095aa0

Update esm2 finetuning configuration to modify node and batch size settings for improved performance. Adjust training parameters for consistency and enhance experiment naming convention to include tensor parallelism and pipeline parallelism details.

560cfb8

Add tflops callback option to esm2 finetuning configurations and script. Update YAML files to include new argument for tflops calculation, and modify training logic to support this feature. Enhance tests to validate argument parsing for the new callback.

af85741

Fix formatting in esm2 finetuning YAML script by removing trailing semicolon from early stop argument

1fd9257

Update submodule reference for Megatron-LM to the latest commit f8d3e2e4.

c43081c

Update submodule reference for NeMo to the latest commit 164d12b7.

5855e2f

Enhance esm2 finetuning configuration by adding required checkpoint restoration argument and updating YAML files to include this parameter. Adjust time limit for performance benchmarks and modify test cases to ensure proper argument parsing for checkpoint restoration.

c05f863

nvmvle force-pushed the mvle/onboarding-esm2-finetuning branch from 18dd98a to c05f863 on July 25, 2025 02:59

Refactor inference logic in esm2 finetuning tests to conditionally execute based on checkpoint availability. Ensure proper validation of prediction outputs for classification and regression tasks. Update path handling for restored checkpoints.

4fa46b7
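Commit 8e1fd7d describes test assertions on TensorBoard event files. A check of that kind could look like the sketch below. It is not the test code from this PR: the helper names are hypothetical, and "validity" is reduced here to the file being non-empty rather than parsing its records. The events.out.tfevents prefix is the file name TensorBoard writers use by default.

```python
import tempfile
from pathlib import Path


def find_event_files(log_dir: Path) -> list[Path]:
    """Return non-empty TensorBoard event files found anywhere under log_dir."""
    return sorted(
        path
        for path in log_dir.rglob("events.out.tfevents.*")
        if path.is_file() and path.stat().st_size > 0
    )


def assert_tensorboard_logged(log_dir: Path) -> list[Path]:
    """Fail with a descriptive message if no usable event file exists."""
    assert log_dir.is_dir(), f"TensorBoard log directory not found: {log_dir}"
    events = find_event_files(log_dir)
    assert events, f"No non-empty TensorBoard event files under {log_dir}"
    return events


# Usage with a temporary directory standing in for a real training run.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp) / "tb_logs"
    run_dir.mkdir()
    (run_dir / "events.out.tfevents.0.localhost").write_bytes(b"\x00")
    found = assert_tensorboard_logged(run_dir)
    print(len(found))
```

Putting the directory and file checks in separate assertions keeps the failure message specific about which condition was not met.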

yzhang123 reviewed Jul 25, 2025

ci/benchmarks/perf/esm2_finetuning.yaml (resolved)

yzhang123 reviewed Jul 25, 2025

ci/benchmarks/perf/esm2_finetuning.yaml (outdated, resolved)

yzhang123 reviewed Jul 25, 2025

sub-packages/bionemo-esm2/src/bionemo/esm2/scripts/finetune_esm2.py (resolved)

nvmvle and others added 3 commits July 28, 2025 01:53

Update esm2 finetuning configuration to change node count from 4 to 1.

f15ee52

Adjust esm2 finetuning configuration to change accumulate grad batches from 2 to 1 for improved training efficiency.

fbc865e

Merge branch 'main' into mvle/onboarding-esm2-finetuning

4a32bc1

trvachov approved these changes Jul 29, 2025

sub-packages/bionemo-esm2/src/bionemo/esm2/scripts/finetune_esm2.py (resolved)

Merge branch 'main' into mvle/onboarding-esm2-finetuning

ef89eb5

trvachov merged commit 1a1edf0 into main Jul 30, 2025
21 checks passed

trvachov deleted the mvle/onboarding-esm2-finetuning branch July 30, 2025 04:18

9 participants