Support subquadratic-ops kernels in evo2 autoregressive inference by farhadrgh · Pull Request #1565 · NVIDIA-BioNeMo/bionemo-framework

farhadrgh · 2026-04-30T00:01:21Z

Description

Closes the gap noted in hyena_mixer.py (# todo: support inference_context for b2b_kernel) and the README caveat that --use-subquadratic-ops "does not apply to autoregressive inference (infer_evo2)". After this PR, the same fused kernels that accelerate training and batch prediction also accelerate the prefill phase of autoregressive inference.

Summary of change:

engine.parallel_fir now accepts use_subquadratic_ops and routes to fft_causal_conv1d (filters ≥ 128) or causal_conv1d (short filters), wired through both call sites in hyena_utils.py.
HyenaMixer.forward detects prefill (no FIR cache yet) and runs b2b_causal_conv1d for the fused proj+mixer convolution. The kernel doesn't expose its intermediate (the proj-conv output), so we run a small windowed proj-conv on the last K_proj + K_mixer − 2 input positions to materialize the (x2*v) tail and seed the mixer's FIR cache: the cache holds K_mixer − 1 positions, and each of those needs K_proj − 1 positions of input history. Works for both hyena_short_conv and hyena_medium_conv.
Removed the del self._parameters["short_conv_weight"] micro-optimization in ParallelCausalDepthwiseConv1dWithState._get_weight() — B2BCausalConv1dModule reads that raw param on every prefill, so deleting it after first decode broke multi-prompt inference. Memory cost is ~4 MB for a 1B model.
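The window size used to seed the mixer's FIR cache can be checked with a toy example. The sketch below is a single-channel, pure-Python illustration, not the repository's code: `causal_conv`, the filter values, and the lengths K_proj = 3 and K_mixer = 7 are assumptions chosen for the demonstration, and the x2*v gating is left out because it is elementwise and does not change how much input history is needed.

```python
def causal_conv(x, w):
    """Causal FIR with zero left-padding: y[t] = sum_k w[k] * x[t - k]."""
    return [
        sum(w[k] * x[t - k] for k in range(len(w)) if t - k >= 0)
        for t in range(len(x))
    ]


def tail_from_full_prompt(u, w_proj, k_mixer):
    """Reference: convolve the whole prompt, keep the last K_mixer - 1 outputs."""
    y = causal_conv(u, w_proj)
    return y[len(y) - (k_mixer - 1):]


def tail_from_window(u, w_proj, k_mixer):
    """Convolve only the last K_proj + K_mixer - 2 inputs, keep the same tail."""
    window = len(w_proj) + k_mixer - 2
    y = causal_conv(u[-window:], w_proj)
    # The first K_proj - 1 outputs of the window lack history and are discarded;
    # exactly K_mixer - 1 correct outputs remain.
    return y[len(y) - (k_mixer - 1):]


u = list(range(1, 21))   # illustrative prompt activations (one channel)
w_proj = [1, 2, 3]       # illustrative proj filter, K_proj = 3
k_mixer = 7              # illustrative mixer filter length

full_tail = tail_from_full_prompt(u, w_proj, k_mixer)
window_tail = tail_from_window(u, w_proj, k_mixer)
```

Here the window is 3 + 7 − 2 = 8 positions, and both functions return the same six values, so the short windowed pass is enough to recover the tail regardless of prompt length.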

infer_evo2 gets a --use-subquadratic-ops flag.
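As a rough picture of the routing described for engine.parallel_fir, the sketch below picks a kernel name from the filter length. `pick_fir_backend` and the `"baseline"` label are hypothetical; only the two kernel names and the 128 threshold come from the description above, and the real function invokes the kernels rather than returning strings.

```python
FFT_FILTER_THRESHOLD = 128  # from the PR description: filters >= 128 take the FFT kernel


def pick_fir_backend(filter_len, use_subquadratic_ops):
    """Hypothetical dispatcher mirroring the routing rule in the description."""
    if not use_subquadratic_ops:
        return "baseline"  # placeholder for the existing non-subq path
    if filter_len >= FFT_FILTER_THRESHOLD:
        return "fft_causal_conv1d"
    return "causal_conv1d"


short_choice = pick_fir_backend(7, use_subquadratic_ops=True)
long_choice = pick_fir_backend(128, use_subquadratic_ops=True)
disabled_choice = pick_fir_backend(128, use_subquadratic_ops=False)
```

With the flag off, nothing changes; with it on, the filter length alone decides which fused kernel runs.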

Testing

New parametrization test_forward_manual[1b-8k-bf16-subquadratic-ops-flash] covers the (flash_decode=True, subquadratic_ops=True) combination that was previously skipped.
New test_subquadratic_ops_matches_baseline runs greedy autoregressive generation with and without --use-subquadratic-ops and asserts identical output — this is the strict check that Phase 2 state population is correct (a wrong cache would diverge during decode).
Existing kernel comparison tests (test_hyena_mixer_kernel.py) and inference-context unit tests pass unchanged.
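To illustrate why comparing greedy output is a strict check on the seeded state, here is a toy single-channel FIR decode loop. It is a sketch under assumptions, not the model's code: `causal_conv`, `decode`, the filter, and the inputs are all made up for the example.

```python
def causal_conv(x, w):
    """Causal FIR over a full sequence: y[t] = sum_k w[k] * x[t - k]."""
    return [
        sum(w[k] * x[t - k] for k in range(len(w)) if t - k >= 0)
        for t in range(len(x))
    ]


def decode(cache, new_inputs, w):
    """Step-by-step decode; `cache` holds the last len(w) - 1 inputs, oldest first."""
    outputs = []
    for x_new in new_inputs:
        history = cache + [x_new]  # the len(w) most recent inputs
        outputs.append(sum(w[k] * history[-1 - k] for k in range(len(w))))
        cache = history[1:]        # slide the window forward by one
    return outputs


w = [1, 2, 3, 4]               # illustrative filter, K = 4
prompt = list(range(1, 11))    # inputs seen during prefill
new_inputs = [11, 12, 13]      # inputs seen during decode

reference = causal_conv(prompt + new_inputs, w)[-len(new_inputs):]
good = decode(prompt[-(len(w) - 1):], new_inputs, w)  # cache seeded from the prompt tail
bad = decode([0] * (len(w) - 1), new_inputs, w)       # cache never seeded
```

The correctly seeded cache reproduces the full-sequence result, while the unseeded one differs from the first decode step onward, which is the kind of mismatch an exact-output comparison would surface.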

Performance

infer_evo2, evo2/1b-8k-bf16, single A6000, multiple identical prompts in one process to amortize the one-time JIT compile cost (~15 s the first time each subq-ops kernel sees a new shape). Steady-state numbers from batches 3+:

| Prompt | Generation | Baseline | Subq-ops | Speedup |
|---|---|---|---|---|
| 4 096 tokens | 5 tokens | 0.57 s | 0.51 s | ~10% |
| 8 000 tokens | 1 token | 1.02 s | 0.87 s | ~15% |

The speedup is concentrated in prefill. The relative improvement grows with prompt length and shrinks as more decode tokens are amortized in.
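The percentages can be reproduced from the timings if the Speedup column is read as wall-time reduction, which is an interpretation on my part but the one that best matches the rounded figures. The second helper is a simple dilution model with made-up prefill and per-token decode times, included only to show the trend in the sentence above.

```python
def time_reduction(baseline_s, subq_s):
    """Fraction of wall time saved relative to baseline."""
    return (baseline_s - subq_s) / baseline_s


def overall_reduction(prefill_s, decode_s_per_token, n_tokens, prefill_reduction):
    """Whole-run reduction when only prefill gets faster and decode is unchanged."""
    total = prefill_s + n_tokens * decode_s_per_token
    return prefill_reduction * prefill_s / total


row_4k = time_reduction(0.57, 0.51)  # timings from the 4 096-token row
row_8k = time_reduction(1.02, 0.87)  # timings from the 8 000-token row

# Hypothetical numbers: 1.0 s prefill, 0.02 s per decode token, 15% faster prefill.
few_tokens = overall_reduction(1.0, 0.02, 1, 0.15)
many_tokens = overall_reduction(1.0, 0.02, 100, 0.15)
```

Under this model the prefill gain is fixed, so each additional decode token adds unchanged time to the denominator and the overall percentage shrinks.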

Type of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Refactor
Documentation update
Other (please describe):

CI Pipeline Configuration

Configure CI behavior by applying the relevant labels. By default, only basic unit tests are run.

ciflow:skip - Skip all CI tests for this PR
ciflow:notebooks - Run Jupyter notebooks execution tests
ciflow:slow - Run slow single GPU integration tests marked as @pytest.mark.slow
ciflow:all - Run all tests (unit tests, slow tests, and notebooks). This label can be used to enforce running all framework tests.
ciflow:all-recipes - Run tests for all recipes (under bionemo-recipes). This label can be used to enforce running tests for all recipes.

Unit tests marked as @pytest.mark.multi_gpu or @pytest.mark.distributed are not run in the PR pipeline.

For more details, see CONTRIBUTING

Note

By default, only basic unit tests are run. Add the appropriate labels to enable additional test coverage.

Authorizing CI Runs

We use copy-pr-bot to manage authorization of CI
runs on NVIDIA's compute resources.

If a pull request is opened by a trusted user and contains only trusted changes, the pull request's code will
automatically be copied to a pull-request/ prefixed branch in the source repository (e.g. pull-request/123).
If a pull request is opened by an untrusted user or contains untrusted changes, an NVIDIA org member must leave an
/ok to test comment on the pull request to trigger CI. This will need to be done for each new commit.

Triggering Code Rabbit AI Review

To trigger a code review from code rabbit, comment on a pull request with one of these commands:

@coderabbitai review - Triggers a standard review
@coderabbitai full review - Triggers a comprehensive review

See https://docs.coderabbit.ai/reference/review-commands for a full list of commands.

Pre-submit Checklist

I have tested these changes locally
I have updated the documentation accordingly
I have added/updated tests as needed
All existing tests pass successfully

Summary by CodeRabbit

New Features
- Added --use-subquadratic-ops CLI option to optimize prompt/prefill processing during inference while leaving per-token decode unchanged.
Documentation
- Clarified subquadratic-ops kernel behavior and performance impact on prefill throughput.
Tests
- Added end-to-end test confirming that subquadratic-ops produces the same inference results as the baseline.

Signed-off-by: Farhad Ramezanghorbani <farhadr@nvidia.com>

copy-pr-bot · 2026-04-30T00:01:24Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2026-04-30T00:01:29Z

Walkthrough

This pull request introduces use_subquadratic_ops support throughout the Evo2 Megatron inference pipeline. Changes include updating documentation to clarify prefill behavior, modifying the FIR convolution engine to conditionally route to optimized subquadratic kernels, updating the HyenaMixer to handle prefill-specific state computation, and threading the new flag through CLI and setup functions with corresponding tests.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Documentation `README.md` | Updated `--use-subquadratic-ops` guidance to specify exact kernel usage across prefill, batch scoring, and training; clarifies flag affects only prefill portion of autoregressive inference. |
| Core FIR Engine `src/bionemo/evo2/models/megatron/hyena/engine.py` | Adds `use_subquadratic_ops` parameter to `parallel_fir()` with conditional routing to `subquadratic_ops_torch` implementations; handles FFT and non-FFT cases separately with explicit bias handling and dtype casting. |
| Hyena Mixer `src/bionemo/evo2/models/megatron/hyena/hyena_mixer.py` | Extends `b2b_kernel` execution to prefill phase; introduces `_populate_b2b_inference_state()` helper to compute and store FIR tails in inference context for both projection convolution and mixer components. |
| Hyena Utilities `src/bionemo/evo2/models/megatron/hyena/hyena_utils.py` | Propagates `use_subquadratic_ops` into `parallel_fir()` calls; preserves `short_conv_weight` parameter instead of deleting after caching. |
| Inference Interface `src/bionemo/evo2/run/infer.py` | Adds `use_subquadratic_ops` boolean option to CLI and function signatures; threads parameter through `main()` → `infer()` → `setup_inference_engine()` to `model_provider`. |
| Test Infrastructure `tests/bionemo/evo2/run/test_infer.py` | Adds `use_subquadratic_ops` parameter to subprocess helper; introduces end-to-end test comparing greedy generation output between baseline and subquadratic-ops paths. |
| Forward Test `tests/bionemo/evo2/test_evo2.py` | Removes conditional skip for combined `subquadratic_ops=True` and `flash_decode=True` settings; now executes this previously-skipped parameterized test configuration. |

Sequence Diagram(s)

sequenceDiagram
    participant User as User/CLI
    participant Infer as infer()
    participant Setup as setup_inference_engine()
    participant Provider as ModelProvider
    participant Mixer as HyenaMixer.forward()
    participant Engine as parallel_fir()
    participant SubqOps as subquadratic_ops_torch
    participant BaseOps as F.conv1d/FFTConv

    User->>Infer: --use-subquadratic-ops flag
    Infer->>Setup: use_subquadratic_ops=True
    Setup->>Provider: set use_subquadratic_ops
    
    rect rgb(100, 150, 200, 0.5)
    Note over Infer,Provider: Prefill Phase
    Infer->>Mixer: forward(prefill_tokens)
    Mixer->>Engine: parallel_fir(..., use_subquadratic_ops=True)
    
    alt use_subquadratic_ops enabled
        Engine->>SubqOps: _subq_fft_causal_conv1d (fir_len ≥ 128)
        SubqOps-->>Engine: optimized output
    else standard path
        Engine->>BaseOps: fftconv_func/F.conv1d
        BaseOps-->>Engine: output
    end
    
    Engine-->>Mixer: FIR result + inference_state
    Mixer->>Mixer: _populate_b2b_inference_state()
    Mixer-->>Infer: prefill hidden states
    end
    
    rect rgb(200, 150, 100, 0.5)
    Note over Infer,Mixer: Decode Phase (unchanged)
    Infer->>Mixer: forward(next_token)
    Mixer->>Mixer: use cached inference_state
    Mixer-->>Infer: decoded token
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 Subquadratic speeds in prefill's grace,
Through kernels fused we quicken pace,
While decode hops untouched remain,
FFT and conv unite their reign,
Hop faster still, the inference flight! ✨

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and specifically summarizes the main change: adding support for subquadratic-ops kernels in evo2 autoregressive inference, which is the central objective of the PR. |
| Description check | ✅ Passed | The description comprehensively covers the main changes (engine.parallel_fir routing, HyenaMixer.forward prefill detection, parameter deletion removal, CLI flag addition), includes testing strategy with specific test names, provides performance metrics, and completes all relevant pre-submit checklist items. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

farhadrgh · 2026-04-30T00:04:56Z

@coderabbitai review

coderabbitai · 2026-04-30T00:05:01Z

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (2)

bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/run/test_infer.py (2)

524-562: ⚡ Quick win

Cover the prompt-segmentation handoff too.

This only asserts equivalence for a single-shot prefill. The new cache-seeding logic is most likely to regress when --prompt-segmentation-threshold forces a prefill→decode handoff inside the prompt, so I'd add one companion case with a threshold below len(PROMPT_1) and require the same equality there.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/run/test_infer.py`
around lines 524 - 562, Add a companion assertion that verifies subquadratic ops
match baseline when prompt segmentation forces a prefill→decode handoff: call
run_infer_subprocess twice (once with use_subquadratic_ops=False and once with
True) using the same mbridge_checkpoint_path, PROMPT_1, seed, top_k=1, etc., but
pass prompt_segmentation_threshold set to a value smaller than len(PROMPT_1)
(e.g., 1 less than the prompt length) and the same other args (max_new_tokens,
temperature), then assert both outputs are non-empty and equal (use the same
failure message pattern); reference run_infer_subprocess, PROMPT_1, and
prompt_segmentation_threshold to locate where to add this companion case.

306-334: ⚡ Quick win

Point this helper at the source infer.py.

run_infer_subprocess_parallel() already shells out to _infer_script_path() with src/ prepended to PYTHONPATH, but this helper still uses -m bionemo.evo2.run.infer. That means the new subquadratic regression test can validate an older installed package instead of the code under review.

Suggested fix

-        "-m",
-        "bionemo.evo2.run.infer",
+        str(_infer_script_path()),
         "--ckpt-dir",
         str(mbridge_checkpoint_path),
         "--prompt",
         prompt,
@@
-    env = copy.deepcopy(PRETEST_ENV)
+    env = copy.deepcopy(PRETEST_ENV)
+    src_dir = str(_recipe_root() / "src")
+    env["PYTHONPATH"] = src_dir + os.pathsep + env.get("PYTHONPATH", "")

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/run/test_infer.py`
around lines 306 - 334, The helper builds cmd using "-m bionemo.evo2.run.infer",
which lets the installed package be executed instead of the local source; change
the command construction in run_infer_subprocess() to invoke the actual source
script returned by _infer_script_path(), as run_infer_subprocess_parallel()
already does (replace the "-m bionemo.evo2.run.infer" element with
str(_infer_script_path()) and prepend "src/" to PYTHONPATH in the subprocess
environment), leaving the rest of the args (ckpt-dir, prompt, etc.) unchanged.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/models/megatron/hyena/hyena_mixer.py`:
- Around line 381-385: The mixer FIR cache is seeded incorrectly: the helper
always stores x2 * v in mixer_input_tail even though
ParallelShortHyenaOperator.forward uses v (not x2*v) when self.mixer.pregate is
False. Update the logic that sets mixer_input_tail (currently "mixer_input_tail
= (x2 * v).contiguous()") to check self.mixer.pregate and store v.contiguous()
when pregate is False, otherwise keep (x2 * v).contiguous(); reference
ParallelShortHyenaOperator.forward, the mixer.pregate flag, and variables x2, v,
and mixer_input_tail to locate the change.
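
The suggested fix amounts to a small conditional. The sketch below shows its shape on plain Python lists; `select_mixer_input_tail` is a hypothetical helper named for this example, and the pregate behavior is taken from the review comment as written rather than checked against the repository.

```python
def select_mixer_input_tail(x2, v, pregate):
    """Hypothetical helper: choose the values that seed the mixer FIR cache.

    Per the review comment, the mixer convolves x2 * v when pregate is True
    and v alone when pregate is False, so the cached tail must follow suit.
    """
    if pregate:
        return [a * b for a, b in zip(x2, v)]
    return list(v)


x2_tail = [2, 3, 4]   # illustrative values
v_tail = [5, 6, 7]

gated = select_mixer_input_tail(x2_tail, v_tail, pregate=True)
ungated = select_mixer_input_tail(x2_tail, v_tail, pregate=False)
```

In the real code the operands would be tensors and the result would be made contiguous, as in the line quoted by the comment.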

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 0e318f54-a8af-4548-aa42-88bf6f61e018

📥 Commits

Reviewing files that changed from the base of the PR and between 85d0c16 and 05e1356.

📒 Files selected for processing (7)

bionemo-recipes/recipes/evo2_megatron/README.md
bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/models/megatron/hyena/engine.py
bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/models/megatron/hyena/hyena_mixer.py
bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/models/megatron/hyena/hyena_utils.py
bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/run/infer.py
bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/run/test_infer.py
bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/test_evo2.py

jstjohn

🚢

farhadrgh · 2026-04-30T14:37:26Z

/ok to test 05e1356

copy-pr-bot · 2026-04-30T14:37:30Z

/ok to test

@farhadrgh, there was an error processing your request: E1

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/

farhadrgh · 2026-04-30T14:39:38Z

/ok to test 05e1356

farhadrgh · 2026-04-30T17:44:49Z

/ok to test 05e1356

moradza · 2026-04-30T20:40:19Z

LGTM @farhadrgh

farhadrgh added 5 commits April 29, 2026 23:39

route engine.parallel_fir through subq-ops kernels in inference

1817c1e

Signed-off-by: Farhad Ramezanghorbani <farhadr@nvidia.com>

use b2b_causal_conv1d during inference prefill

17c42cf

Signed-off-by: Farhad Ramezanghorbani <farhadr@nvidia.com>

add infer_evo2 --use-subquadratic-ops flag with test for matching bas…

8b4bd11

…eline Signed-off-by: Farhad Ramezanghorbani <farhadr@nvidia.com>

update test

c0a3e30

Signed-off-by: Farhad Ramezanghorbani <farhadr@nvidia.com>

update README

05e1356

Signed-off-by: Farhad Ramezanghorbani <farhadr@nvidia.com>

farhadrgh requested review from cspades, dorotat-nv, jomitchellnv, jstjohn, jwilber, pstjohn, savitha-eng and trvachov as code owners April 30, 2026 00:01

coderabbitai Bot reviewed Apr 30, 2026

Comment thread bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/models/megatron/hyena/hyena_mixer.py

jstjohn approved these changes Apr 30, 2026

farhadrgh added the ciflow:all Run all tests (unit tests, slow tests, and notebooks) for bionemo2 or enforce running all tests label Apr 30, 2026

moradza approved these changes Apr 30, 2026

farhadrgh added this pull request to the merge queue Apr 30, 2026

Merged via the queue into main with commit 5497faa Apr 30, 2026
22 checks passed

farhadrgh deleted the farhadr/infer_subq_ops branch April 30, 2026 21:36

farhadrgh mentioned this pull request May 22, 2026

Short sequence prefix-invariant evo2 implementation #1580

Open
