feat: added lora to evo2 #980

Merged
trvachov merged 7 commits into NVIDIA-BioNeMo:main from gabenavarro:gabriel/lora-evo2
Aug 1, 2025

Conversation

@gabenavarro
Contributor

Description

This PR adds Low-Rank Adaptation (LoRA) support to Evo2, enabling parameter-efficient fine-tuning of Evo2/Hyena models for protein sequence modeling and generative tasks. Related to issue #884.

Key highlights:

  • Adds Evo2LoRA class in peft.py:

    • Targeted LoRA injection into Evo2-specific modules (linear_qkv, hyena_filter, short_filter, etc.)
    • Selective freezing of encoder and embedding layers while allowing adapters and output layers to remain trainable.
    • Logging and debug summaries for parameter counts and adapter coverage.
  • Integrates LoRA with Evo2 training via:

    • --lora-finetune flag to enable LoRA-based fine-tuning.
    • Optional --lora-checkpoint-path to resume from a pre-trained LoRA checkpoint.
    • Automatic ModelTransform callback handling for training when LoRA is active.
  • Supports advanced, memory-efficient fine-tuning workflows on large Evo2 models with reduced GPU memory consumption.
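
For readers new to the technique, the idea behind the adapters listed above can be sketched in a few lines of dependency-free Python. This is a generic illustration of LoRA, not the Evo2LoRA class from peft.py: the class name `LoRALinearSketch`, the helper `matvec`, and the toy dimensions are all hypothetical. It shows a frozen weight `W` plus a trainable low-rank update `(alpha / rank) * B @ A`, with `B` initialised to zero so the adapted layer starts out identical to the base layer.

```python
import random


def matvec(matrix, vector):
    """Multiply a (rows x cols) nested-list matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LoRALinearSketch:
    """Hypothetical toy layer: frozen W plus a low-rank update (alpha / rank) * B @ A."""

    def __init__(self, weight, rank, alpha, seed=0):
        rng = random.Random(seed)
        out_dim, in_dim = len(weight), len(weight[0])
        self.weight = weight  # frozen base weight, shape (out_dim, in_dim)
        # A starts small and random, B starts at zero, so the update is a no-op at first.
        self.lora_a = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)] for _ in range(rank)]
        self.lora_b = [[0.0] * rank for _ in range(out_dim)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.lora_b, matvec(self.lora_a, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def adapter_param_count(self):
        # Only A and B would be trained: rank * (in_dim + out_dim) values in total.
        return sum(len(row) for row in self.lora_a) + sum(len(row) for row in self.lora_b)


weight = [[1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 0.0, 2.0], [0.0, 0.0, 1.0, 1.0]]
layer = LoRALinearSketch(weight, rank=2, alpha=4)
x = [1.0, 0.0, -1.0, 2.0]
print(layer.forward(x))             # equals the base output while B is all zeros
print(layer.adapter_param_count())  # 2 * (4 + 3) = 14, versus 12 frozen base weights
```

At this toy size the adapter is larger than the base weight; the savings only appear when `rank` is much smaller than the layer dimensions, which is the regime the PR targets.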


Type of changes

  • New feature (non-breaking change which adds functionality)

CI Pipeline Configuration

Please add:

  • [SKIP_CI] Skip all continuous integration tests

Usage

Example:

train_evo2 \
    --experiment-name lora_test \
    --model-size 1b \
    --devices 8 \
    --num-nodes 1 \
    --seq-length 8192 \
    --micro-batch-size 2 \
    --lr 0.000015 \
    --min-lr 0.0000149 \
    --warmup-steps 500 \
    --grad-acc-batches 4 \
    --max-steps 10000 \
    --ckpt-dir <YOUR>/<CKPT>/<DIR> \
    --clip-grad 5 \
    --wd 0.001 \
    --attention-dropout 0.01 \
    --hidden-dropout 0.01 \
    --val-check-interval 50 \
    --activation-checkpoint-recompute-num-layers 5 \
    --result-dir <YOUR>/<RESULT>/<DIR> \
    --ckpt-async-save \
    --lora-finetune \
    --mock-data
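
As a sanity check on the flags above, the effective batch size can be worked out by hand. The sketch below assumes the usual Megatron-style relationship (micro batch × data-parallel ranks × gradient-accumulation steps) and assumes that tensor, pipeline and context parallelism are all 1, since the command does not set them; the function name `global_batch_size` is hypothetical and is not the helper that train.py imports.

```python
def global_batch_size(micro_batch_size, devices, num_nodes, grad_acc_batches,
                      tensor_parallel=1, pipeline_parallel=1, context_parallel=1):
    """Hypothetical helper: samples consumed per optimizer step."""
    world_size = devices * num_nodes
    model_parallel = tensor_parallel * pipeline_parallel * context_parallel
    if world_size % model_parallel != 0:
        raise ValueError("world size must be divisible by the model-parallel size")
    # Ranks left over after model parallelism each process their own micro batches.
    data_parallel = world_size // model_parallel
    return micro_batch_size * data_parallel * grad_acc_batches


# Values taken from the example command: --micro-batch-size 2 --devices 8
# --num-nodes 1 --grad-acc-batches 4 --seq-length 8192
gbs = global_batch_size(micro_batch_size=2, devices=8, num_nodes=1, grad_acc_batches=4)
tokens_per_step = gbs * 8192
print(gbs)              # 64
print(tokens_per_step)  # 524288
```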

Pre-submit Checklist

  • I have tested these changes locally
  • I have updated the documentation accordingly (docstrings + CLI flags)
  • I have added/updated tests as needed
  • All existing tests pass successfully

@copy-pr-bot

copy-pr-bot Bot commented Jul 7, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@gabenavarro
Contributor Author

@jstjohn, tagging you on this PR.

@jwilber
Collaborator

jwilber commented Jul 10, 2025

Thanks for this! Will take some time to review but it is on our radar!

@trvachov
Collaborator

/ok to test 7dd28a2

@trvachov
Collaborator

@gabenavarro could you set up and run the pre-commit hooks on your branch, then I can re-run CI? It's just failing on the linter at the moment.

Thanks for the contribution!

@gabenavarro
Contributor Author

@gabenavarro could you set up and run the pre-commit hooks on your branch, then I can re-run CI? It's just failing on the linter at the moment.

@trvachov, yeah, I can get it this week. Glad to contribute!

@gabenavarro
Contributor Author

@trvachov, just got around to setting up and running the pre-commit hooks on my branch. Can you please re-run CI? Apologies for not setting them up earlier!

@trvachov
Collaborator

/ok to test 1e81b3d

Signed-off-by: Gabriel Navarro <gchinonavarro@gmail.com>
Signed-off-by: Gabriel Navarro <gchinonavarro@gmail.com>
@gabenavarro
Contributor Author

@trvachov, I think I disrupted the testing with signing off on the code. Do you mind restarting tests? Thank you!

@trvachov
Collaborator

/ok to test ba748e0

@trvachov
Collaborator

Not sure why CI failed -- will update branch and rerun.

@trvachov
Collaborator

/ok to test 6c47582

@jstjohn
Collaborator

jstjohn commented Jul 21, 2025

FYI there were some pretty significant refactors to train.py to support the new mamba variant of evo2, let me know if you have any issues resolving conflicts there. There are no other major changes to train.py planned in the near term.

Comment thread sub-packages/bionemo-evo2/src/bionemo/evo2/run/train.py Outdated
Comment thread sub-packages/bionemo-evo2/src/bionemo/evo2/run/peft.py
Collaborator

@jstjohn jstjohn left a comment

Approved with comments on the merge.

@trvachov
Collaborator

I'm also reviewing this -- ETA 07/25

@trvachov
Collaborator

trvachov commented Jul 24, 2025

@gabenavarro I've reviewed the PR -- looks good! I had to do some conflict resolution due to a recent change from @jstjohn, but it's not too bad.

Here's the diff you'll need to apply to train.py:

diff --git a/sub-packages/bionemo-evo2/src/bionemo/evo2/run/train.py b/sub-packages/bionemo-evo2/src/bionemo/evo2/run/train.py
index ddb79d63..7b2fb238 100644
--- a/sub-packages/bionemo-evo2/src/bionemo/evo2/run/train.py
+++ b/sub-packages/bionemo-evo2/src/bionemo/evo2/run/train.py
@@ -38,6 +38,7 @@ from nemo.collections.llm.recipes.tp_overlap_configs.userbuffers import (
 )
 from nemo.collections.nlp.modules.common.tokenizer_utils import get_nmt_tokenizer
 from nemo.lightning.pytorch import callbacks as nl_callbacks
+from nemo.lightning.pytorch.callbacks import ModelTransform
 from nemo.lightning.pytorch.callbacks.flops_callback import FLOPsMeasurementCallback
 from nemo.lightning.pytorch.callbacks.megatron_comm_overlap import MegatronCommOverlapCallback
 from nemo.lightning.pytorch.optim import CosineAnnealingScheduler
@@ -48,6 +49,9 @@ from nemo.utils.exp_manager import TimingCallback
 # Add import for Mamba models
 from bionemo.evo2.models.mamba import MAMBA_MODEL_OPTIONS, MambaModel, mamba_no_weight_decay_cond_with_embeddings
 from bionemo.evo2.utils.logging.callbacks import TEVCallback
+
+from bionemo.evo2.run.peft import Evo2LoRA
+
 from bionemo.llm.utils.datamodule_utils import infer_global_batch_size
 from bionemo.llm.utils.logger_utils import WandbConfig, setup_nemo_lightning_logger

@@ -476,6 +480,8 @@ def parse_args(args: Optional[List[str]] = None) -> argparse.Namespace:
         default=True,
         help="Disable saving the last checkpoint.",
     )
+    parser.add_argument("--lora-finetune", action="store_true", help="Use LoRA fine-tuning", default=False)
+    parser.add_argument("--lora-checkpoint-path", type=Path, default=None, help="LoRA checkpoint path")
     recompute_group = parser.add_mutually_exclusive_group(required=False)
     recompute_group.add_argument("--no-activation-checkpointing", action="store_true", default=False)
     recompute_group.add_argument("--selective-activation-checkpointing", action="store_true", default=False)
@@ -579,7 +585,12 @@ def train(args: argparse.Namespace) -> nl.Trainer:
         if args.model_size not in HYENA_MODEL_OPTIONS:
             raise ValueError(f"Invalid model size for Hyena: {args.model_size}")
         model_config = HYENA_MODEL_OPTIONS[args.model_size](**config_modifiers_init)
-        model = llm.HyenaModel(model_config, tokenizer=data_module.tokenizer)
+        # Lora adaptors configuration
+        lora_transform = None
+        if args.lora_finetune:
+            lora_transform = Evo2LoRA(peft_ckpt_path=args.lora_checkpoint_path)
+
+        model = llm.HyenaModel(model_config, tokenizer=data_module.tokenizer, model_transform=lora_transform)
     else:  # mamba
         if args.no_weight_decay_embeddings:
             config_modifiers_init["hyena_no_weight_decay_cond_fn"] = mamba_no_weight_decay_cond_with_embeddings
@@ -601,6 +612,8 @@ def train(args: argparse.Namespace) -> nl.Trainer:
         TEVCallback(),
     ]

+    if args.lora_finetune:
+        callbacks.append(ModelTransform())
     if args.enable_preemption:
         callbacks.append(nl_callbacks.PreemptionCallback())
     if args.debug_ddp_parity_freq > 0:

Pending that, I think this is good to merge.
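
The wiring in the diff above can be exercised in isolation with a small stand-alone sketch. The two `add_argument` calls are copied from the diff; everything else is an assumption made for illustration: `StubLoRA` is a hypothetical stand-in for `Evo2LoRA` (it only records the checkpoint path), and callbacks are represented by plain strings rather than real NeMo objects.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Sketch of the LoRA flags only")
    parser.add_argument("--lora-finetune", action="store_true", help="Use LoRA fine-tuning", default=False)
    parser.add_argument("--lora-checkpoint-path", type=Path, default=None, help="LoRA checkpoint path")
    return parser.parse_args(argv)


class StubLoRA:
    """Hypothetical stand-in for Evo2LoRA; it only remembers the checkpoint path."""

    def __init__(self, peft_ckpt_path=None):
        self.peft_ckpt_path = peft_ckpt_path


def build_transform_and_callbacks(args):
    # Mirrors the diff: the transform exists only when --lora-finetune is set.
    lora_transform = None
    if args.lora_finetune:
        lora_transform = StubLoRA(peft_ckpt_path=args.lora_checkpoint_path)

    callbacks = ["TEVCallback"]  # placeholder strings instead of real callback objects
    if args.lora_finetune:
        callbacks.append("ModelTransform")
    return lora_transform, callbacks


plain = build_transform_and_callbacks(parse_args([]))
resumed = build_transform_and_callbacks(
    parse_args(["--lora-finetune", "--lora-checkpoint-path", "adapters/step_100"])
)
print(plain)                      # (None, ['TEVCallback'])
print(resumed[0].peft_ckpt_path)  # the path given on the command line
print(resumed[1])                 # ['TEVCallback', 'ModelTransform']
```

One consequence visible in this sketch: with this wiring, passing --lora-checkpoint-path without --lora-finetune is silently ignored, because the path is only read inside the `if args.lora_finetune:` branch.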

Some high-level before/after numbers:

No LoRA finetuning of 1B model:

Training epoch 0, iteration 21/99 | lr: 3.15e-06 | global_batch_size: 8 | global_step: 21 | reduced_train_loss: 1.071 | train_step_timing in s: 2.453 | consumed_samples: 176
Wed Jul 23 13:48:26 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.216.03             Driver Version: 535.216.03   CUDA Version: 12.9     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA H100 80GB HBM3          On  | 00000000:19:00.0 Off |                    0 |
| N/A   60C    P0             635W / 700W |  34005MiB / 81559MiB |    100%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

With your LoRA change:

[NeMo I 2025-07-23 13:52:55 nemo_logging:393] LoRA Summary:
[NeMo I 2025-07-23 13:52:55 nemo_logging:393]   Total parameters: 1,125,039,360
[NeMo I 2025-07-23 13:52:55 nemo_logging:393]   Trainable parameters: 16,840,320
[NeMo I 2025-07-23 13:52:55 nemo_logging:393]   Adapter parameters: 16,834,560
[NeMo I 2025-07-23 13:52:55 nemo_logging:393]   Percentage trainable: 1.50%
[NeMo I 2025-07-23 13:52:55 nemo_logging:393]   Percentage adapters: 1.50%

Training epoch 0, iteration 49/99 | lr: 7.35e-06 | global_batch_size: 8 | global_step: 49 | reduced_train_loss: 1.112 | train_step_timing in s: 0.7389 | consumed_samples: 400

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.216.03             Driver Version: 535.216.03   CUDA Version: 12.9     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA H100 80GB HBM3          On  | 00000000:19:00.0 Off |                    0 |
| N/A   53C    P0             605W / 700W |  25335MiB / 81559MiB |    100%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
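
The two snapshots above can be reduced to a few ratios. The sketch below only re-uses numbers that appear in the logs; note that the readings come from different iterations (21 vs. 49) and from single nvidia-smi samples on GPU 0, so the results should be read as indicative rather than as a benchmark.

```python
# Figures copied from the log lines and nvidia-smi tables above.
baseline = {"memory_mib": 34005, "step_s": 2.453}
with_lora = {"memory_mib": 25335, "step_s": 0.7389}
total_params = 1_125_039_360
trainable_params = 16_840_320
adapter_params = 16_834_560

memory_saved_mib = baseline["memory_mib"] - with_lora["memory_mib"]
memory_saved_pct = 100 * memory_saved_mib / baseline["memory_mib"]
speedup = baseline["step_s"] / with_lora["step_s"]
trainable_pct = 100 * trainable_params / total_params
# Trainable parameters that are not part of an adapter.
non_adapter_trainable = trainable_params - adapter_params

print(f"memory saved: {memory_saved_mib} MiB ({memory_saved_pct:.1f}%)")
print(f"step-time speedup: {speedup:.2f}x")
print(f"trainable: {trainable_pct:.2f}% of parameters")
print(f"non-adapter trainable parameters: {non_adapter_trainable}")
```

On these numbers LoRA uses roughly a quarter less GPU memory and runs each step a little over three times faster, and the trainable percentage matches the 1.50% reported in the LoRA summary.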

Note: we will want to test Evo2 LoRA functionality (possibly taking inspiration from @yzhang123's ESM2 testing in #996), but this is non-blocking for merging this PR -- I think this is fine as is. We'll just note it as "experimental" LoRA finetuning of Evo2 in the release notes.

Thanks again!

Signed-off-by: gabenavarro <40647204+gabenavarro@users.noreply.github.com>
@trvachov
Collaborator

/ok to test 61e4e57

@trvachov
Collaborator

trvachov commented Aug 1, 2025

@gabenavarro This is ready to merge -- can you just resolve the merge conflict (my prior comment) and make it one commit relative to top-of-tree main?

I can also do all this myself and force merge but I'd like you to get the authorship credit into the git history 😄

PS. Don't forget to run pre-commit (see .pre-commit-config.yaml) so that the tests don't trivially fail on linter errors.

Signed-off-by: gabenavarro <40647204+gabenavarro@users.noreply.github.com>
Signed-off-by: Gabriel Navarro <gchinonavarro@gmail.com>
@gabenavarro
Contributor Author

@trvachov, thanks for the reminder. Should be done now!

@trvachov
Collaborator

trvachov commented Aug 1, 2025

/ok to test 2a3221d

@trvachov trvachov enabled auto-merge August 1, 2025 20:50
@trvachov trvachov added this pull request to the merge queue Aug 1, 2025
Merged via the queue into NVIDIA-BioNeMo:main with commit 824455d Aug 1, 2025
10 checks passed
4 participants