feat: added lora to evo2 by gabenavarro · Pull Request #980 · NVIDIA-BioNeMo/bionemo-framework

gabenavarro · 2025-07-07T22:54:13Z

Description

This PR adds Low-Rank Adaptation (LoRA) support to Evo2, enabling parameter-efficient fine-tuning of Evo2/Hyena models for protein sequence modeling and generative tasks. Related to issue #884.

Key highlights:

Adds Evo2LoRA class in peft.py:
- Targeted LoRA injection into Evo2-specific modules (linear_qkv, hyena_filter, short_filter, etc.)
- Selective freezing of encoder and embedding layers while allowing adapters and output layers to remain trainable.
- Logging and debug summaries for parameter counts and adapter coverage.
Integrates LoRA with Evo2 training via:
- --lora-finetune flag to enable LoRA-based fine-tuning.
- Optional --lora-checkpoint-path to resume from a pre-trained LoRA checkpoint.
- Automatic ModelTransform callback handling for training when LoRA is active.
Supports advanced, memory-efficient fine-tuning workflows on large Evo2 models with reduced GPU memory consumption.
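
To make the parameter savings concrete, here is a minimal, framework-free sketch of two ideas from the list above: selecting modules by name and counting the parameters a rank-r adapter adds. It is not the Evo2LoRA code from peft.py. The helper names (`is_lora_target`, `lora_param_count`, `full_param_count`), the example module paths, the 1920-wide layer, and the rank of 16 are all hypothetical; only the substrings `linear_qkv`, `hyena_filter`, and `short_filter` come from this description.

```python
# Illustrative only: not the Evo2LoRA implementation from peft.py.
TARGET_SUBSTRINGS = ("linear_qkv", "hyena_filter", "short_filter")  # named in the PR description


def is_lora_target(module_name: str) -> bool:
    """Return True if a module path contains one of the targeted substrings."""
    return any(key in module_name for key in TARGET_SUBSTRINGS)


def lora_param_count(in_features: int, out_features: int, rank: int) -> int:
    """Parameters added by one adapter: A is (rank x in), B is (out x rank)."""
    return rank * in_features + out_features * rank


def full_param_count(in_features: int, out_features: int) -> int:
    """Parameters in the frozen dense weight that the adapter sits beside."""
    return in_features * out_features


if __name__ == "__main__":
    # Hypothetical module paths, for illustration only.
    names = ["decoder.layers.0.linear_qkv", "embedding.word_embeddings"]
    print([n for n in names if is_lora_target(n)])
    # A hypothetical 1920 x 1920 projection with an assumed rank of 16.
    print(lora_param_count(1920, 1920, 16), full_param_count(1920, 1920))
```

Because the adapter cost grows with `rank * (in + out)` rather than `in * out`, the trainable share stays small as long as the rank is much smaller than the layer width.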

Type of changes

New feature (non-breaking change which adds functionality)

CI Pipeline Configuration

Please add:

[SKIP_CI] Skip all continuous integration tests

Usage

Example:

train_evo2 \
    --experiment-name lora_test \
    --model-size 1b \
    --devices 8 \
    --num-nodes 1 \
    --seq-length 8192 \
    --micro-batch-size 2 \
    --lr 0.000015 \
    --min-lr 0.0000149 \
    --warmup-steps 500 \
    --grad-acc-batches 4 \
    --max-steps 10000 \
    --ckpt-dir <YOUR>/<CKPT>/<DIR> \
    --clip-grad 5 \
    --wd 0.001 \
    --attention-dropout 0.01 \
    --hidden-dropout 0.01 \
    --val-check-interval 50 \
    --activation-checkpoint-recompute-num-layers 5 \
    --result-dir <YOUR>/<RESULT>/<DIR> \
    --ckpt-async-save \
    --lora-finetune \
    --mock-data
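
For orientation, the effective global batch size implied by these flags can be estimated as below. This is a sketch that assumes the usual Megatron-style relation (micro-batch size times gradient-accumulation batches times data-parallel size) and assumes tensor, pipeline, and context parallel sizes of 1, since the command does not set them. `global_batch_size` is a hypothetical helper written for this example, not the function train.py uses.

```python
def global_batch_size(
    devices: int,
    num_nodes: int,
    micro_batch_size: int,
    grad_acc_batches: int,
    tensor_parallel: int = 1,
    pipeline_parallel: int = 1,
    context_parallel: int = 1,
) -> int:
    """Estimate samples per optimizer step from the CLI values."""
    world_size = devices * num_nodes
    model_parallel = tensor_parallel * pipeline_parallel * context_parallel
    if world_size % model_parallel != 0:
        raise ValueError("world size must be divisible by the model-parallel size")
    # Ranks left over after model parallelism each process their own micro-batches.
    data_parallel = world_size // model_parallel
    return micro_batch_size * grad_acc_batches * data_parallel


if __name__ == "__main__":
    # Values from the example command: 8 devices, 1 node, micro-batch 2, 4 accumulation batches.
    gbs = global_batch_size(devices=8, num_nodes=1, micro_batch_size=2, grad_acc_batches=4)
    print(gbs, gbs * 8192)  # samples and tokens per optimizer step at --seq-length 8192
```

Under those assumptions the command processes 64 sequences (524,288 tokens) per optimizer step; any model parallelism configured elsewhere would lower that number.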

Pre-submit Checklist

I have tested these changes locally
I have updated the documentation accordingly (docstrings + CLI flags)
I have added/updated tests as needed
All existing tests pass successfully

copy-pr-bot · 2025-07-07T22:54:16Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

gabenavarro · 2025-07-07T22:55:25Z

@jstjohn, tagging you on this PR.

jwilber · 2025-07-10T19:39:56Z

Thanks for this! Will take some time to review but it is on our radar!

trvachov · 2025-07-11T21:20:41Z

/ok to test 7dd28a2

trvachov · 2025-07-11T21:33:10Z

@gabenavarro could you set up and run the pre-commit hooks on your branch, then I can re-run CI? It's just failing on the linter at the moment.

Thanks for the contribution!

gabenavarro · 2025-07-14T19:12:34Z

@gabenavarro could you setup and run the pre-commit hooks on your branch, then I can re-run CI? It's just failing on the linter at the moment.

@trvachov, yeah, I can get it this week. Glad to contribute!

gabenavarro · 2025-07-17T01:28:15Z

@trvachov, just got around to setting up and running the pre-commit hooks on my branch. Can you please re-run CI? Apologies for not setting them up earlier!

trvachov · 2025-07-17T01:51:32Z

/ok to test 1e81b3d

Signed-off-by: Gabriel Navarro <gchinonavarro@gmail.com>

gabenavarro · 2025-07-17T04:19:40Z

@trvachov, I think I disrupted the testing with signing off on the code. Do you mind restarting tests? Thank you!

trvachov · 2025-07-17T16:00:38Z

/ok to test ba748e0

trvachov · 2025-07-18T02:19:10Z

Not sure why CI failed -- will update branch and rerun.

trvachov · 2025-07-18T02:19:38Z

/ok to test 6c47582

jstjohn · 2025-07-21T17:22:52Z

FYI there were some pretty significant refactors to train.py to support the new mamba variant of evo2. Let me know if you have any issues resolving conflicts there. There are no other major changes to train.py planned in the near term.

jstjohn

Approved with comments on the merge.

trvachov · 2025-07-23T16:28:18Z

I'm also reviewing this -- ETA 07/25

trvachov · 2025-07-24T16:04:25Z

@gabenavarro I've reviewed the PR -- looks good! I had to do some conflict resolution due to a recent change from @jstjohn, but it's not too bad.

Here's the diff you'll need to apply to train.py:

diff --git a/sub-packages/bionemo-evo2/src/bionemo/evo2/run/train.py b/sub-packages/bionemo-evo2/src/bionemo/evo2/run/train.py
index ddb79d63..7b2fb238 100644
--- a/sub-packages/bionemo-evo2/src/bionemo/evo2/run/train.py
+++ b/sub-packages/bionemo-evo2/src/bionemo/evo2/run/train.py
@@ -38,6 +38,7 @@ from nemo.collections.llm.recipes.tp_overlap_configs.userbuffers import (
 )
 from nemo.collections.nlp.modules.common.tokenizer_utils import get_nmt_tokenizer
 from nemo.lightning.pytorch import callbacks as nl_callbacks
+from nemo.lightning.pytorch.callbacks import ModelTransform
 from nemo.lightning.pytorch.callbacks.flops_callback import FLOPsMeasurementCallback
 from nemo.lightning.pytorch.callbacks.megatron_comm_overlap import MegatronCommOverlapCallback
 from nemo.lightning.pytorch.optim import CosineAnnealingScheduler
@@ -48,6 +49,9 @@ from nemo.utils.exp_manager import TimingCallback
 # Add import for Mamba models
 from bionemo.evo2.models.mamba import MAMBA_MODEL_OPTIONS, MambaModel, mamba_no_weight_decay_cond_with_embeddings
 from bionemo.evo2.utils.logging.callbacks import TEVCallback
+
+from bionemo.evo2.run.peft import Evo2LoRA
+
 from bionemo.llm.utils.datamodule_utils import infer_global_batch_size
 from bionemo.llm.utils.logger_utils import WandbConfig, setup_nemo_lightning_logger

@@ -476,6 +480,8 @@ def parse_args(args: Optional[List[str]] = None) -> argparse.Namespace:
         default=True,
         help="Disable saving the last checkpoint.",
     )
+    parser.add_argument("--lora-finetune", action="store_true", help="Use LoRA fine-tuning", default=False)
+    parser.add_argument("--lora-checkpoint-path", type=Path, default=None, help="LoRA checkpoint path")
     recompute_group = parser.add_mutually_exclusive_group(required=False)
     recompute_group.add_argument("--no-activation-checkpointing", action="store_true", default=False)
     recompute_group.add_argument("--selective-activation-checkpointing", action="store_true", default=False)
@@ -579,7 +585,12 @@ def train(args: argparse.Namespace) -> nl.Trainer:
         if args.model_size not in HYENA_MODEL_OPTIONS:
             raise ValueError(f"Invalid model size for Hyena: {args.model_size}")
         model_config = HYENA_MODEL_OPTIONS[args.model_size](**config_modifiers_init)
-        model = llm.HyenaModel(model_config, tokenizer=data_module.tokenizer)
+        # Lora adaptors configuration
+        lora_transform = None
+        if args.lora_finetune:
+            lora_transform = Evo2LoRA(peft_ckpt_path=args.lora_checkpoint_path)
+
+        model = llm.HyenaModel(model_config, tokenizer=data_module.tokenizer, model_transform=lora_transform)
     else:  # mamba
         if args.no_weight_decay_embeddings:
             config_modifiers_init["hyena_no_weight_decay_cond_fn"] = mamba_no_weight_decay_cond_with_embeddings
@@ -601,6 +612,8 @@ def train(args: argparse.Namespace) -> nl.Trainer:
         TEVCallback(),
     ]

+    if args.lora_finetune:
+        callbacks.append(ModelTransform())
     if args.enable_preemption:
         callbacks.append(nl_callbacks.PreemptionCallback())
     if args.debug_ddp_parity_freq > 0:

Pending that, I think this is good to merge.
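
The two new flags can be exercised on their own with the standalone sketch below. The `add_argument` calls mirror the diff; `parse_lora_args` and `describe` are hypothetical helpers added for illustration, and the claim that a missing checkpoint path means newly initialized adapters is an assumption, not something the diff states.

```python
import argparse
from pathlib import Path
from typing import List, Optional


def parse_lora_args(args: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse only the two LoRA flags added in the diff."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--lora-finetune", action="store_true", default=False, help="Use LoRA fine-tuning")
    parser.add_argument("--lora-checkpoint-path", type=Path, default=None, help="LoRA checkpoint path")
    return parser.parse_args(args)


def describe(ns: argparse.Namespace) -> str:
    """Summarize the branch the diff's logic would take for these flags."""
    if not ns.lora_finetune:
        # In the diff, the transform is only built when --lora-finetune is set,
        # so a checkpoint path on its own has no effect.
        return "full fine-tuning"
    if ns.lora_checkpoint_path is None:
        # Assumption: no checkpoint path means adapters start from initialization.
        return "LoRA with newly initialized adapters"
    return f"LoRA resuming from {ns.lora_checkpoint_path}"


if __name__ == "__main__":
    print(describe(parse_lora_args([])))
    print(describe(parse_lora_args(["--lora-finetune"])))
    print(describe(parse_lora_args(["--lora-finetune", "--lora-checkpoint-path", "lora_ckpt"])))
```

One consequence worth noting: as the diff is written, passing `--lora-checkpoint-path` without `--lora-finetune` is silently ignored, because the path is only read inside the `if args.lora_finetune:` branch.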

Some high-level before-after numbers:

No LoRA finetuning of 1B model:

Training epoch 0, iteration 21/99 | lr: 3.15e-06 | global_batch_size: 8 | global_step: 21 | reduced_train_loss: 1.071 | train_step_timing in s: 2.453 | consumed_samples: 176
Wed Jul 23 13:48:26 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.216.03             Driver Version: 535.216.03   CUDA Version: 12.9     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA H100 80GB HBM3          On  | 00000000:19:00.0 Off |                    0 |
| N/A   60C    P0             635W / 700W |  34005MiB / 81559MiB |    100%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

With your LoRA change:

[NeMo I 2025-07-23 13:52:55 nemo_logging:393] LoRA Summary:
[NeMo I 2025-07-23 13:52:55 nemo_logging:393]   Total parameters: 1,125,039,360
[NeMo I 2025-07-23 13:52:55 nemo_logging:393]   Trainable parameters: 16,840,320
[NeMo I 2025-07-23 13:52:55 nemo_logging:393]   Adapter parameters: 16,834,560
[NeMo I 2025-07-23 13:52:55 nemo_logging:393]   Percentage trainable: 1.50%
[NeMo I 2025-07-23 13:52:55 nemo_logging:393]   Percentage adapters: 1.50%

Training epoch 0, iteration 49/99 | lr: 7.35e-06 | global_batch_size: 8 | global_step: 49 | reduced_train_loss: 1.112 | train_step_timing in s: 0.7389 | consumed_samples: 400

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.216.03             Driver Version: 535.216.03   CUDA Version: 12.9     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA H100 80GB HBM3          On  | 00000000:19:00.0 Off |                    0 |
| N/A   53C    P0             605W / 700W |  25335MiB / 81559MiB |    100%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
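
As a rough reading of the two runs above, the short script below turns the logged figures into ratios. The constants are copied from the logs and nvidia-smi tables; the helper names are hypothetical. The two log lines come from different iterations (21 and 49) of single runs, so the results are indicative rather than a benchmark.

```python
# Figures copied from the two runs reported above (one H100, 1B model).
FULL = {"memory_mib": 34005, "step_s": 2.453}
LORA = {"memory_mib": 25335, "step_s": 0.7389}
TOTAL_PARAMS = 1_125_039_360
TRAINABLE_PARAMS = 16_840_320
ADAPTER_PARAMS = 16_834_560


def percent_reduction(before: float, after: float) -> float:
    """Relative drop from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


def speedup(before: float, after: float) -> float:
    """How many times shorter `after` is than `before`."""
    return before / after


if __name__ == "__main__":
    mem = percent_reduction(FULL["memory_mib"], LORA["memory_mib"])
    fast = speedup(FULL["step_s"], LORA["step_s"])
    share = 100.0 * TRAINABLE_PARAMS / TOTAL_PARAMS
    print(f"GPU memory reduction: {mem:.1f}%")
    print(f"Step-time speedup: {fast:.2f}x")
    print(f"Trainable share: {share:.2f}%")
    print(f"Trainable non-adapter parameters: {TRAINABLE_PARAMS - ADAPTER_PARAMS:,}")
```

On these figures, memory use drops by about a quarter and the step time by a factor of roughly 3.3, with 5,760 trainable parameters sitting outside the adapters.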

Note, we will want to test Evo2 LoRA functionality (possibly taking inspiration from @yzhang123's ESM2 testing in #996), but this is non-blocking for merge of this PR -- I think this is fine as is. We'll just note it as "experimental" LoRA finetuning of Evo2 in the release notes.

Thanks again!

Signed-off-by: gabenavarro <40647204+gabenavarro@users.noreply.github.com>

trvachov · 2025-07-25T19:03:28Z

/ok to test 61e4e57

trvachov · 2025-08-01T14:39:43Z

@gabenavarro This is ready to merge -- can you just resolve the merge conflict (my prior comment) and make it one commit relative to top-of-tree main?

I can also do all this myself and force merge but I'd like you to get the authorship credit into the git history 😄

PS. Don't forget to run the pre-commit hooks (see .pre-commit-config.yaml) so that the tests don't trivially fail on linter errors.

Signed-off-by: gabenavarro <40647204+gabenavarro@users.noreply.github.com>

Signed-off-by: Gabriel Navarro <gchinonavarro@gmail.com>

gabenavarro · 2025-08-01T16:14:57Z

@trvachov, thanks for the reminder. Should be done now!

trvachov · 2025-08-01T17:05:31Z

/ok to test 2a3221d

gabenavarro requested review from cspades, dorotat-nv, jomitchellnv, jstjohn, jwilber, pstjohn, sichu2023, skothenhill-nv and trvachov as code owners July 7, 2025 22:54

gabenavarro mentioned this pull request Jul 7, 2025

[Feature] Enable LoRA Fine-Tuning for Evo2 to Optimize Training Efficiency on Long-Context Models #884

Closed

gabenavarro added 2 commits July 16, 2025 19:10

feat: added lora to evo2

21d6d7b

Signed-off-by: Gabriel Navarro <gchinonavarro@gmail.com>

Fix lint/format issues

ba748e0

Signed-off-by: Gabriel Navarro <gchinonavarro@gmail.com>

gabenavarro force-pushed the gabriel/lora-evo2 branch from 1e81b3d to ba748e0 Compare July 17, 2025 02:11

Merge branch 'main' into gabriel/lora-evo2

6c47582

jstjohn reviewed Jul 21, 2025

Comment thread sub-packages/bionemo-evo2/src/bionemo/evo2/run/train.py Outdated

jstjohn reviewed Jul 21, 2025

Comment thread sub-packages/bionemo-evo2/src/bionemo/evo2/run/peft.py

jstjohn approved these changes Jul 21, 2025

trvachov approved these changes Jul 24, 2025

Merge branch 'main' into gabriel/lora-evo2

61e4e57

Signed-off-by: gabenavarro <40647204+gabenavarro@users.noreply.github.com>

Merge branch 'main' into gabriel/lora-evo2

926a065

Signed-off-by: gabenavarro <40647204+gabenavarro@users.noreply.github.com>

gabenavarro requested review from broland-hat, polinabinder1 and yzhang123 as code owners August 1, 2025 15:54

gabenavarro added 2 commits August 1, 2025 08:59

Fix lint/format issues

c08d987

Sign-off: Gabriel Navarro

2a3221d

Signed-off-by: Gabriel Navarro <gchinonavarro@gmail.com>

trvachov enabled auto-merge August 1, 2025 20:50

trvachov added this pull request to the merge queue Aug 1, 2025

Merged via the queue into NVIDIA-BioNeMo:main with commit 824455d Aug 1, 2025
10 checks passed

dorotat-nv mentioned this pull request Sep 9, 2025

[BUG] Evo2 finetune with LoRA does not work #1136

Open

4 participants
