Add ESM2 Finetuning Benchmark Configuration #964
9ca500f to 3ae0439
a94c575 to 6a7c952
/ok to test
@nvmvle, there was an error processing your request: See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/
/ok to test 6a7c952
sub-packages/bionemo-esm2/src/bionemo/esm2/scripts/finetune_esm2.py
e5ce3b9 to 0a8ee44
…benchmarks to enhance monitoring capabilities during training.
…sure proper event file existence and validity, improving error reporting during training monitoring.
… enable or disable checkpoint creation via a new parameter. Update assertions to verify checkpoint behavior based on this parameter, enhancing test coverage and flexibility.
…ontrol. Increase time limit to 14400 seconds, adjust GPU and node settings, and introduce early stopping functionality via a new argument. Modify related test cases to validate early stopping behavior and ensure proper argument parsing.
…0 for extended training duration.
…nce benchmarks. Remove unused parameters, streamline script arguments, and ensure consistency in training settings across both configurations. Introduce new parameters for better control over training processes.
…ttings for improved performance. Adjust training parameters for consistency and enhance experiment naming convention to include tensor parallelism and pipeline parallelism details.
…pt. Update YAML files to include new argument for tflops calculation, and modify training logic to support this feature. Enhance tests to validate argument parsing for the new callback.
…micolon from early stop argument
…estoration argument and updating YAML files to include this parameter. Adjust time limit for performance benchmarks and modify test cases to ensure proper argument parsing for checkpoint restoration.
18dd98a to c05f863
…ecute based on checkpoint availability. Ensure proper validation of prediction outputs for classification and regression tasks. Update path handling for restored checkpoints.
workspace: /workspace/bionemo2
data_base_path: /data/FLIP
restore_from_checkpoint_path: /data/esm2_650M_nemo2
nodes: [4]
why do you need so many nodes?
updated to 1 node
/ok to test 4a32bc1
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #964      +/-   ##
==========================================
- Coverage   83.32%   83.30%   -0.03%
==========================================
  Files         148      148
  Lines        9746     9758      +12
==========================================
+ Hits         8121     8129       +8
- Misses       1625     1629       +4
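For reference, the project coverage figures in the diff above follow from the hit and line counts. This is a minimal sketch of that arithmetic; the function name `coverage_percent` is hypothetical and not part of Codecov or this repository, and the unrounded ratios may differ in the last digit from the values the report displays.

```python
def coverage_percent(hits: int, lines: int) -> float:
    """Return covered lines as a percentage of all tracked lines."""
    if lines <= 0:
        raise ValueError("lines must be positive")
    return 100.0 * hits / lines


# Hit and line counts copied from the coverage diff above.
main_cov = coverage_percent(hits=8121, lines=9746)
pr_cov = coverage_percent(hits=8129, lines=9758)

print(f"main: {main_cov:.4f}%")
print(f"#964: {pr_cov:.4f}%")
print(f"delta: {pr_cov - main_cov:+.4f} percentage points")
```

The drop comes from 12 new tracked lines, of which 8 are hit and 4 are missed.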
Bypassing rules to merge to unblock My -- we just released so low risk of screwing up any timelines.
/ok to test ef89eb5
Description
This PR adds comprehensive benchmark configurations for ESM2 finetuning to support performance testing and validation. The changes introduce two new benchmark configurations (partial-conv and perf) along with enhanced finetuning capabilities including checkpointing control, TensorBoard logging, and TFLOPS measurement callbacks.
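As an illustration of the checkpointing control described above, a boolean flag such as the `--disable-checkpointing` option could be wired up roughly as follows. This is a sketch under assumptions, not the parser in `finetune_esm2.py`: the helper names `build_parser` and `checkpointing_enabled` are hypothetical, and only the flag name comes from this PR.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser for illustration only; the real script's
    # arguments and defaults may differ.
    parser = argparse.ArgumentParser(
        description="Sketch of a benchmark-oriented finetuning CLI"
    )
    parser.add_argument(
        "--disable-checkpointing",
        action="store_true",  # flag is False unless passed on the command line
        help="Skip checkpoint creation, e.g. for faster benchmark runs.",
    )
    return parser


def checkpointing_enabled(argv: list[str]) -> bool:
    """Return True when checkpoints should be written for this run."""
    args = build_parser().parse_args(argv)
    return not args.disable_checkpointing


print(checkpointing_enabled([]))
print(checkpointing_enabled(["--disable-checkpointing"]))
```

Using `store_true` keeps checkpointing on by default, so existing invocations would behave as before and only benchmark configurations need to opt out.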
Key enhancements include:
--disable-checkpointing option for faster benchmark runs
Type of changes
CI Pipeline Configuration
Configure CI behavior by applying the relevant labels:
Note
By default, the notebooks validation tests are skipped unless explicitly enabled.
Authorizing CI Runs
We use copy-pr-bot to manage authorization of CI runs on NVIDIA's compute resources.
… automatically be copied to a pull-request/ prefixed branch in the source repository (e.g. pull-request/123)
… /ok to test comment on the pull request to trigger CI. This will need to be done for each new commit.
Pre-submit Checklist
Signed-off-by: My Le <mvle@nvidia.com>