@nvmvle
Collaborator

@nvmvle nvmvle commented Jul 1, 2025

Description

This PR adds benchmark configurations for ESM2 finetuning to support performance testing and validation. It introduces two new benchmark configurations (partial-conv and perf) and extends the finetuning script with checkpointing control, TensorBoard logging, and a TFLOPS measurement callback.

Key enhancements include:

  • Added ESM2 finetuning YAML configurations for partial-conv and performance benchmarks
  • Implemented checkpointing control with --disable-checkpointing option for faster benchmark runs
  • Added TensorBoard logging support for training metrics visualization
  • Introduced TFLOPS callback option to measure and log computational performance
  • Enhanced training control parameters including max_steps, early stopping, and batch size configurations
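
As a rough illustration of two of the items above (the checkpointing switch and the TFLOPS measurement), here is a minimal sketch. Only the `--disable-checkpointing` flag name comes from this PR; `--max-steps`, `--log-tflops`, the `create_checkpoint_callback` destination, and the `achieved_tflops_per_gpu` helper are hypothetical names chosen for the example and are not taken from `finetune_esm2.py`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser; only --disable-checkpointing is named in this PR."""
    parser = argparse.ArgumentParser(description="ESM2 finetuning benchmark (sketch)")
    # Checkpointing stays on by default; the flag turns it off for benchmark runs.
    parser.add_argument(
        "--disable-checkpointing",
        action="store_false",
        dest="create_checkpoint_callback",
        default=True,
        help="Skip checkpoint creation for faster benchmark runs.",
    )
    # Assumed flag names, for illustration only.
    parser.add_argument("--max-steps", type=int, default=100)
    parser.add_argument("--log-tflops", action="store_true")
    return parser


def achieved_tflops_per_gpu(flops_per_step: float, step_seconds: float, num_gpus: int) -> float:
    """Convert a per-step FLOP count and wall-clock step time into TFLOPS per GPU."""
    if step_seconds <= 0 or num_gpus <= 0:
        raise ValueError("step_seconds and num_gpus must be positive")
    return flops_per_step / step_seconds / num_gpus / 1e12


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(["--disable-checkpointing", "--log-tflops"])
print(args.create_checkpoint_callback)  # False
if args.log_tflops:
    # Made-up numbers: 8e14 FLOPs per step, 2 s per step, 4 GPUs.
    print(achieved_tflops_per_gpu(flops_per_step=8e14, step_seconds=2.0, num_gpus=4))  # 100.0
```

Using `store_false` keeps checkpointing enabled unless the flag is passed, so existing invocations would behave as before; the actual script may wire this differently.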

Type of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Refactor
  • Documentation update
  • Other (please describe):

CI Pipeline Configuration

Configure CI behavior by applying the relevant labels:

Note: By default, the notebook validation tests are skipped unless explicitly enabled.

Authorizing CI Runs

We use copy-pr-bot to manage authorization of CI runs on NVIDIA's compute resources.

  • If a pull request is opened by a trusted user and contains only trusted changes, the pull request's code will
    automatically be copied to a pull-request/ prefixed branch in the source repository (e.g. pull-request/123).
  • If a pull request is opened by an untrusted user or contains untrusted changes, an NVIDIA org member must leave an
    /ok to test comment on the pull request to trigger CI. This must be done for each new commit.

Pre-submit Checklist

  • I have tested these changes locally
  • I have updated the documentation accordingly
  • I have added/updated tests as needed
  • All existing tests pass successfully

Signed-off-by: My Le <mvle@nvidia.com>

@copy-pr-bot

copy-pr-bot bot commented Jul 1, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@nvmvle nvmvle force-pushed the mvle/onboarding-esm2-finetuning branch 2 times, most recently from 9ca500f to 3ae0439 on July 15, 2025 04:17
@nvmvle nvmvle force-pushed the mvle/onboarding-esm2-finetuning branch from a94c575 to 6a7c952 on July 22, 2025 07:01
@nvmvle nvmvle changed the title from "[DRAFT] Add ESM2 Finetuning Benchmark Configuration" to "Add ESM2 Finetuning Benchmark Configuration" on Jul 22, 2025
@nvmvle
Collaborator Author

nvmvle commented Jul 22, 2025

/ok to test

@copy-pr-bot

copy-pr-bot bot commented Jul 22, 2025

/ok to test

@nvmvle, there was an error processing your request: E1

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/

@nvmvle nvmvle marked this pull request as ready for review July 22, 2025 07:17
@nvmvle
Collaborator Author

nvmvle commented Jul 22, 2025

/ok to test 6a7c952

@nvmvle nvmvle force-pushed the mvle/onboarding-esm2-finetuning branch from e5ce3b9 to 0a8ee44 on July 23, 2025 07:23
nvmvle added 13 commits July 24, 2025 19:56
…benchmarks to enhance monitoring capabilities during training.
…sure proper event file existence and validity, improving error reporting during training monitoring.
… enable or disable checkpoint creation via a new parameter. Update assertions to verify checkpoint behavior based on this parameter, enhancing test coverage and flexibility.
…ontrol. Increase time limit to 14400 seconds, adjust GPU and node settings, and introduce early stopping functionality via a new argument. Modify related test cases to validate early stopping behavior and ensure proper argument parsing.
…nce benchmarks. Remove unused parameters, streamline script arguments, and ensure consistency in training settings across both configurations. Introduce new parameters for better control over training processes.
…ttings for improved performance. Adjust training parameters for consistency and enhance experiment naming convention to include tensor parallelism and pipeline parallelism details.
…pt. Update YAML files to include new argument for tflops calculation, and modify training logic to support this feature. Enhance tests to validate argument parsing for the new callback.
…estoration argument and updating YAML files to include this parameter. Adjust time limit for performance benchmarks and modify test cases to ensure proper argument parsing for checkpoint restoration.
@nvmvle nvmvle force-pushed the mvle/onboarding-esm2-finetuning branch from 18dd98a to c05f863 on July 25, 2025 02:59
…ecute based on checkpoint availability. Ensure proper validation of prediction outputs for classification and regression tasks. Update path handling for restored checkpoints.
workspace: /workspace/bionemo2
data_base_path: /data/FLIP
restore_from_checkpoint_path: /data/esm2_650M_nemo2
nodes: [4]
Collaborator

why do you need so many nodes?

Collaborator Author

updated to 1 node

@jwilber
Collaborator

jwilber commented Jul 29, 2025

/ok to test 4a32bc1

@codecov-commenter

codecov-commenter commented Jul 29, 2025

Codecov Report

❌ Patch coverage is 82.35294% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.30%. Comparing base (612ea21) to head (ef89eb5).
✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...emo-esm2/src/bionemo/esm2/scripts/finetune_esm2.py | 82.35% | 3 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #964      +/-   ##
==========================================
- Coverage   83.32%   83.30%   -0.03%     
==========================================
  Files         148      148              
  Lines        9746     9758      +12     
==========================================
+ Hits         8121     8129       +8     
- Misses       1625     1629       +4     

| Files with missing lines | Coverage Δ |
|---|---|
| ...emo-esm2/src/bionemo/esm2/scripts/finetune_esm2.py | 90.90% <82.35%> (-1.17%) ⬇️ |

... and 1 file with indirect coverage changes
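
The headline percentages in the report can be re-derived from the Hits and Lines rows of the coverage diff. The sketch below assumes percentages are truncated (not rounded) to two decimals, which is what the reported 83.32% base figure suggests. The 14-of-17 patch split is inferred from "82.35294% with 3 lines missing" and is not stated in the report; `coverage_percent` is a hypothetical helper name.

```python
import math


def coverage_percent(hits: int, lines: int, precision: int = 2) -> float:
    """Return hits/lines as a percentage, truncated to `precision` decimals."""
    scale = 10 ** precision
    return math.floor(hits / lines * 100 * scale) / scale


# Hits and Lines taken from the coverage diff above.
base = coverage_percent(8121, 9746)
head = coverage_percent(8129, 9758)
# Inferred: 3 missing lines at 82.35294% implies 17 patch lines, 14 covered.
patch = coverage_percent(14, 17)

print(base, head, patch)  # 83.32 83.3 82.35

# Misses are simply Lines minus Hits.
print(9746 - 8121, 9758 - 8129)  # 1625 1629
```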

@trvachov
Collaborator

Bypassing rules to merge to unblock My -- we just released so low risk of screwing up any timelines.

@trvachov
Collaborator

/ok to test ef89eb5

@trvachov trvachov merged commit 1a1edf0 into main Jul 30, 2025
21 checks passed
@trvachov trvachov deleted the mvle/onboarding-esm2-finetuning branch July 30, 2025 04:18
9 participants