Support subquadratic-ops kernels in evo2 autoregressive inference #1565

Merged
farhadrgh merged 5 commits into main from farhadr/infer_subq_ops
Apr 30, 2026

@farhadrgh
Collaborator

@farhadrgh farhadrgh commented Apr 30, 2026

Description

Closes the gap noted in hyena_mixer.py (# todo: support inference_context for b2b_kernel) and the README caveat that --use-subquadratic-ops "does not apply to autoregressive inference (infer_evo2)". After this PR, the same fused kernels that accelerate training and batch prediction also accelerate the prefill phase of autoregressive inference.

Summary of change:

  1. engine.parallel_fir now accepts use_subquadratic_ops and routes to fft_causal_conv1d (filters ≥ 128) or causal_conv1d (short filters), wired through both call sites in hyena_utils.py.
  2. HyenaMixer.forward detects prefill (no FIR cache yet) and runs b2b_causal_conv1d for the fused proj+mixer convolution. The kernel doesn't expose its intermediate, so we run a tiny windowed proj-conv on the last K_proj + K_mixer − 2 input positions to materialize the (x2*v) tail and seed the mixer's FIR cache. Works for both hyena_short_conv and hyena_medium_conv.
  3. Removed the del self._parameters["short_conv_weight"] micro-optimization in ParallelCausalDepthwiseConv1dWithState._get_weight(). B2BCausalConv1dModule reads that raw parameter on every prefill, so deleting it after the first decode broke multi-prompt inference. Keeping it costs ~4 MB of memory for a 1B model.
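
The window size in item 2 follows from causal-convolution arithmetic: the mixer's FIR cache needs the last K_mixer − 1 mixer inputs, and each of those depends on the K_proj inputs ending at its position, so the last K_proj + K_mixer − 2 raw positions are enough. A minimal pure-Python sketch of that argument, assuming a single-channel causal FIR with zero left padding and omitting the x2 gate; the helpers `causal_fir` and `windowed_tail` are hypothetical names for illustration and are not taken from the PR:

```python
import random


def causal_fir(x, h):
    """Causal FIR: y[t] = sum_k h[k] * x[t - k], treating x[t] as 0 for t < 0."""
    return [
        sum(h[k] * x[t - k] for k in range(len(h)) if t - k >= 0)
        for t in range(len(x))
    ]


def windowed_tail(x, h_proj, k_mixer):
    """Last k_mixer - 1 proj-conv outputs, computed from a short input window."""
    if k_mixer < 2:
        return []  # a length-1 mixer filter needs no cached history
    window = len(h_proj) + k_mixer - 2  # K_proj + K_mixer - 2 positions
    y_win = causal_fir(x[-window:], h_proj)
    # The first K_proj - 1 window outputs see artificial zero padding,
    # so only the last K_mixer - 1 outputs are kept.
    return y_win[-(k_mixer - 1):]


random.seed(0)
x = [random.gauss(0.0, 1.0) for _ in range(64)]
h_proj = [0.5, -0.25, 0.125]  # K_proj = 3 (illustrative values)
k_mixer = 7                   # illustrative mixer filter length

full_tail = causal_fir(x, h_proj)[-(k_mixer - 1):]   # reference: convolve everything
short_tail = windowed_tail(x, h_proj, k_mixer)       # windowed: 8 positions only
assert all(abs(a - b) < 1e-12 for a, b in zip(full_tail, short_tail))
```

The windowed result agrees with the full convolution on the tail, which is why a small extra convolution can seed the cache without the fused kernel exposing its intermediate.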

infer_evo2 gets a --use-subquadratic-ops flag.

Testing

  • New parametrization test_forward_manual[1b-8k-bf16-subquadratic-ops-flash] covers the (flash_decode=True, subquadratic_ops=True) combination that was previously skipped.
  • New test_subquadratic_ops_matches_baseline runs greedy autoregressive generation with and without --use-subquadratic-ops and asserts identical output — this is the strict check that Phase 2 state population is correct (a wrong cache would diverge during decode).
  • Existing kernel comparison tests (test_hyena_mixer_kernel.py) and inference-context unit tests pass unchanged.
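
To illustrate why identical greedy output is a strict check on cache population, here is a toy sketch, not the Evo2 model: a single causal FIR stands in for the mixer, and each rounded output is fed back as the next input. The functions `fir_step` and `greedy_decode` and all numeric values are hypothetical. A correctly seeded cache and a zero-filled cache produce different sequences from the first decode step onward:

```python
def fir_step(cache, x_t, h):
    """One decode step. `cache` holds the last len(h) - 1 inputs, oldest first."""
    hist = cache + [x_t]
    y_t = sum(h[k] * hist[-1 - k] for k in range(len(h)))
    return y_t, hist[1:]  # slide the window forward by one position


def greedy_decode(cache, first_input, h, steps):
    """Toy autoregressive loop: each rounded output becomes the next input."""
    out, x_t = [], first_input
    for _ in range(steps):
        y_t, cache = fir_step(cache, x_t, h)
        x_t = round(y_t, 3)
        out.append(x_t)
    return out


h = [0.5, 0.3, 0.2]              # filter length K = 3
prompt = [1.0, 2.0, 3.0, 4.0]
correct_cache = prompt[-2:]      # last K - 1 prompt inputs
wrong_cache = [0.0, 0.0]         # what an unpopulated cache would look like

good = greedy_decode(correct_cache, 1.0, h, steps=4)
bad = greedy_decode(wrong_cache, 1.0, h, steps=4)
assert good != bad               # the error shows up immediately in decode
```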

Performance

infer_evo2, evo2/1b-8k-bf16, single A6000, multiple identical prompts in one process to amortize the one-time JIT compile cost (~15 s the first time each subq-ops kernel sees a new shape). Steady-state numbers from batches 3+:

| Prompt | Generation | Baseline | Subq-ops | Speedup |
|---|---|---|---|---|
| 4 096 tokens | 5 tokens | 0.57 s | 0.51 s | ~10% |
| 8 000 tokens | 1 token | 1.02 s | 0.87 s | ~15% |

The speedup is concentrated in prefill. The relative improvement grows with prompt length and shrinks as more decode tokens are amortized in.
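
The rounded percentages can be reproduced from the timings, assuming the Speedup column is the relative reduction in wall-clock time (an interpretation that matches the reported figures; the function name below is hypothetical):

```python
def time_reduction_pct(baseline_s, optimized_s):
    """Percentage of wall-clock time saved relative to the baseline."""
    return 100.0 * (baseline_s - optimized_s) / baseline_s


rows = [
    ("4 096-token prompt, 5 generated", 0.57, 0.51),
    ("8 000-token prompt, 1 generated", 1.02, 0.87),
]
for label, base, subq in rows:
    print(f"{label}: {time_reduction_pct(base, subq):.1f}% less time")
# prints 10.5% for the first row and 14.7% for the second
```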

Type of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Refactor
  • Documentation update
  • Other (please describe):

CI Pipeline Configuration

Configure CI behavior by applying the relevant labels. By default, only basic unit tests are run.

  • ciflow:skip - Skip all CI tests for this PR
  • ciflow:notebooks - Run Jupyter notebooks execution tests
  • ciflow:slow - Run slow single GPU integration tests marked as @pytest.mark.slow
  • ciflow:all - Run all tests (unit tests, slow tests, and notebooks). This label can be used to enforce running all framework tests.
  • ciflow:all-recipes - Run tests for all recipes (under bionemo-recipes). This label can be used to enforce running tests for all recipes.

Unit tests marked as @pytest.mark.multi_gpu or @pytest.mark.distributed are not run in the PR pipeline.

For more details, see CONTRIBUTING

Note

By default, only basic unit tests are run. Add the appropriate labels to enable additional test coverage.

Authorizing CI Runs

We use copy-pr-bot to manage authorization of CI
runs on NVIDIA's compute resources.

  • If a pull request is opened by a trusted user and contains only trusted changes, the pull request's code will
    automatically be copied to a pull-request/ prefixed branch in the source repository (e.g. pull-request/123)
  • If a pull request is opened by an untrusted user or contains untrusted changes, an NVIDIA org member must leave an
    /ok to test comment on the pull request to trigger CI. This will need to be done for each new commit.

Triggering Code Rabbit AI Review

To trigger a code review from code rabbit, comment on a pull request with one of these commands:

See https://docs.coderabbit.ai/reference/review-commands for a full list of commands.

Pre-submit Checklist

  • I have tested these changes locally
  • I have updated the documentation accordingly
  • I have added/updated tests as needed
  • All existing tests pass successfully

Summary by CodeRabbit

  • New Features

    • Added --use-subquadratic-ops CLI option to optimize prompt/prefill processing during inference while leaving per-token decode unchanged.
  • Documentation

    • Clarified subquadratic-ops kernel behavior and performance impact on prefill throughput.
  • Tests

    • Added end-to-end test confirming that subquadratic-ops produces the same inference results as the baseline.

Signed-off-by: Farhad Ramezanghorbani <farhadr@nvidia.com>
Signed-off-by: Farhad Ramezanghorbani <farhadr@nvidia.com>
…eline

Signed-off-by: Farhad Ramezanghorbani <farhadr@nvidia.com>
Signed-off-by: Farhad Ramezanghorbani <farhadr@nvidia.com>
Signed-off-by: Farhad Ramezanghorbani <farhadr@nvidia.com>
@copy-pr-bot

copy-pr-bot Bot commented Apr 30, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai
Contributor

coderabbitai Bot commented Apr 30, 2026

📝 Walkthrough

This pull request introduces use_subquadratic_ops support throughout the Evo2 Megatron inference pipeline. Changes include updating documentation to clarify prefill behavior, modifying the FIR convolution engine to conditionally route to optimized subquadratic kernels, updating the HyenaMixer to handle prefill-specific state computation, and threading the new flag through CLI and setup functions with corresponding tests.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Documentation: README.md | Updated --use-subquadratic-ops guidance to specify exact kernel usage across prefill, batch scoring, and training; clarifies flag affects only prefill portion of autoregressive inference. |
| Core FIR Engine: src/bionemo/evo2/models/megatron/hyena/engine.py | Adds use_subquadratic_ops parameter to parallel_fir() with conditional routing to subquadratic_ops_torch implementations; handles FFT and non-FFT cases separately with explicit bias handling and dtype casting. |
| Hyena Mixer: src/bionemo/evo2/models/megatron/hyena/hyena_mixer.py | Extends b2b_kernel execution to prefill phase; introduces _populate_b2b_inference_state() helper to compute and store FIR tails in inference context for both projection convolution and mixer components. |
| Hyena Utilities: src/bionemo/evo2/models/megatron/hyena/hyena_utils.py | Propagates use_subquadratic_ops into parallel_fir() calls; preserves short_conv_weight parameter instead of deleting after caching. |
| Inference Interface: src/bionemo/evo2/run/infer.py | Adds use_subquadratic_ops boolean option to CLI and function signatures; threads parameter through main() → infer() → setup_inference_engine() to model_provider. |
| Test Infrastructure: tests/bionemo/evo2/run/test_infer.py | Adds use_subquadratic_ops parameter to subprocess helper; introduces end-to-end test comparing greedy generation output between baseline and subquadratic-ops paths. |
| Forward Test: tests/bionemo/evo2/test_evo2.py | Removes conditional skip for combined subquadratic_ops=True and flash_decode=True settings; now executes this previously-skipped parameterized test configuration. |

Sequence Diagram(s)

sequenceDiagram
    participant User as User/CLI
    participant Infer as infer()
    participant Setup as setup_inference_engine()
    participant Provider as ModelProvider
    participant Mixer as HyenaMixer.forward()
    participant Engine as parallel_fir()
    participant SubqOps as subquadratic_ops_torch
    participant BaseOps as F.conv1d/FFTConv

    User->>Infer: --use-subquadratic-ops flag
    Infer->>Setup: use_subquadratic_ops=True
    Setup->>Provider: set use_subquadratic_ops
    
    rect rgb(100, 150, 200, 0.5)
    Note over Infer,Provider: Prefill Phase
    Infer->>Mixer: forward(prefill_tokens)
    Mixer->>Engine: parallel_fir(..., use_subquadratic_ops=True)
    
    alt use_subquadratic_ops enabled
        Engine->>SubqOps: _subq_fft_causal_conv1d (fir_len ≥ 128)
        SubqOps-->>Engine: optimized output
    else standard path
        Engine->>BaseOps: fftconv_func/F.conv1d
        BaseOps-->>Engine: output
    end
    
    Engine-->>Mixer: FIR result + inference_state
    Mixer->>Mixer: _populate_b2b_inference_state()
    Mixer-->>Infer: prefill hidden states
    end
    
    rect rgb(200, 150, 100, 0.5)
    Note over Infer,Mixer: Decode Phase (unchanged)
    Infer->>Mixer: forward(next_token)
    Mixer->>Mixer: use cached inference_state
    Mixer-->>Infer: decoded token
    end
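
The routing branch in the diagram can be sketched as a small dispatcher. This is an illustration only: the 128 threshold and the kernel names come from the PR description, while `select_fir_kernel`, its string return values, and the constant name are hypothetical and do not reflect the actual signature of parallel_fir():

```python
FFT_FILTER_THRESHOLD = 128  # from the PR description: "filters ≥ 128"


def select_fir_kernel(filter_len: int, use_subquadratic_ops: bool) -> str:
    """Return the name of the kernel family a filter of this length would use."""
    if not use_subquadratic_ops:
        # Standard path in the diagram: fftconv_func or F.conv1d.
        return "baseline"
    if filter_len >= FFT_FILTER_THRESHOLD:
        return "fft_causal_conv1d"   # long filters go through the FFT kernel
    return "causal_conv1d"           # short filters use the direct kernel


for n in (7, 128, 8192):
    print(n, select_fir_kernel(n, use_subquadratic_ops=True))
```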

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 Subquadratic speeds in prefill's grace,
Through kernels fused we quicken pace,
While decode hops untouched remain,
FFT and conv unite their reign,
Hop faster still, the inference flight! ✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and specifically summarizes the main change: adding support for subquadratic-ops kernels in evo2 autoregressive inference, which is the central objective of the PR. |
| Description check | ✅ Passed | The description comprehensively covers the main changes (engine.parallel_fir routing, HyenaMixer.forward prefill detection, parameter deletion removal, CLI flag addition), includes testing strategy with specific test names, provides performance metrics, and completes all relevant pre-submit checklist items. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@farhadrgh
Collaborator Author

@coderabbitai review

@coderabbitai
Contributor

coderabbitai Bot commented Apr 30, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/run/test_infer.py (2)

524-562: ⚡ Quick win

Cover the prompt-segmentation handoff too.

This only asserts equivalence for a single-shot prefill. The new cache-seeding logic is most likely to regress when --prompt-segmentation-threshold forces a prefill→decode handoff inside the prompt, so I'd add one companion case with a threshold below len(PROMPT_1) and require the same equality there.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/run/test_infer.py`
around lines 524 - 562, Add a companion assertion that verifies subquadratic ops
match baseline when prompt segmentation forces a prefill→decode handoff: call
run_infer_subprocess twice (once with use_subquadratic_ops=False and once with
True) using the same mbridge_checkpoint_path, PROMPT_1, seed, top_k=1, etc., but
pass prompt_segmentation_threshold set to a value smaller than len(PROMPT_1)
(e.g., 1 less than the prompt length) and the same other args (max_new_tokens,
temperature), then assert both outputs are non-empty and equal (use the same
failure message pattern); reference run_infer_subprocess, PROMPT_1, and
prompt_segmentation_threshold to locate where to add this companion case.
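
One way the suggested companion case could be structured is a shared equality helper that is called once per configuration. This is a sketch under assumptions: the real run_infer_subprocess signature is not shown in this thread, so the runner is passed in as a callable, and `assert_subq_matches_baseline`, `fake_runner`, and `broken_runner` are hypothetical stand-ins. Only the keyword names use_subquadratic_ops and prompt_segmentation_threshold are taken from the comment above:

```python
def assert_subq_matches_baseline(run_fn, prompt, **common_kwargs):
    """Run the same request with and without subquadratic ops and compare."""
    baseline = run_fn(prompt=prompt, use_subquadratic_ops=False, **common_kwargs)
    subq = run_fn(prompt=prompt, use_subquadratic_ops=True, **common_kwargs)
    assert baseline, "baseline produced no output"
    assert subq, "subquadratic-ops produced no output"
    assert baseline == subq, f"outputs diverged: {baseline!r} vs {subq!r}"


def fake_runner(prompt, use_subquadratic_ops, **kwargs):
    """Stand-in for the subprocess helper: output ignores the flag."""
    return prompt + "AC"


def broken_runner(prompt, use_subquadratic_ops, **kwargs):
    """Stand-in that simulates a bad cache: output depends on the flag."""
    return prompt + ("AG" if use_subquadratic_ops else "AC")


PROMPT = "ACGTACGT"  # placeholder; the real test uses PROMPT_1
# Single-shot prefill, then a threshold below the prompt length to force
# a prefill-to-decode handoff inside the prompt.
assert_subq_matches_baseline(fake_runner, PROMPT)
assert_subq_matches_baseline(
    fake_runner, PROMPT, prompt_segmentation_threshold=len(PROMPT) - 1
)
```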

306-334: ⚡ Quick win

Point this helper at the source infer.py.

run_infer_subprocess_parallel() already shells out to _infer_script_path() with src/ prepended to PYTHONPATH, but this helper still uses -m bionemo.evo2.run.infer. That means the new subquadratic regression test can validate an older installed package instead of the code under review.

Suggested fix
-        "-m",
-        "bionemo.evo2.run.infer",
+        str(_infer_script_path()),
         "--ckpt-dir",
         str(mbridge_checkpoint_path),
         "--prompt",
         prompt,
@@
-    env = copy.deepcopy(PRETEST_ENV)
+    env = copy.deepcopy(PRETEST_ENV)
+    src_dir = str(_recipe_root() / "src")
+    env["PYTHONPATH"] = src_dir + os.pathsep + env.get("PYTHONPATH", "")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/run/test_infer.py`
around lines 306 - 334, The helper builds cmd using "-m bionemo.evo2.run.infer",
which lets the installed package be executed instead of the local source; change
the command construction in this helper (run_infer_subprocess(), not
run_infer_subprocess_parallel(), which already does this) to invoke the actual
source script returned by _infer_script_path() (replace the "-m
bionemo.evo2.run.infer" element with str(_infer_script_path()) and ensure the
subprocess uses the same Python executable so the prepended "src/" PYTHONPATH is
applied), leaving the rest of the args (ckpt-dir, prompt, etc.) unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/models/megatron/hyena/hyena_mixer.py`:
- Around line 381-385: The mixer FIR cache is seeded incorrectly: the helper
always stores x2 * v in mixer_input_tail even though
ParallelShortHyenaOperator.forward uses v (not x2*v) when self.mixer.pregate is
False. Update the logic that sets mixer_input_tail (currently "mixer_input_tail
= (x2 * v).contiguous()") to check self.mixer.pregate and store v.contiguous()
when pregate is False, otherwise keep (x2 * v).contiguous(); reference
ParallelShortHyenaOperator.forward, the mixer.pregate flag, and variables x2, v,
and mixer_input_tail to locate the change.
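
The selection described in this inline comment can be sketched as follows. Plain Python lists stand in for tensors, and the function name and signature are hypothetical; whether the merged code adopted this suggestion is not shown in this thread:

```python
def mixer_input_tail(x2_tail, v_tail, pregate):
    """Values to seed the mixer FIR cache with, per the review suggestion."""
    if pregate:
        # Pregated operator: the mixer convolves the gated signal x2 * v.
        return [a * b for a, b in zip(x2_tail, v_tail)]
    # Without pregate the mixer convolves v directly, so cache v unchanged.
    return list(v_tail)


x2 = [2.0, 3.0]
v = [4.0, 5.0]
print(mixer_input_tail(x2, v, pregate=True))   # [8.0, 15.0]
print(mixer_input_tail(x2, v, pregate=False))  # [4.0, 5.0]
```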

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 0e318f54-a8af-4548-aa42-88bf6f61e018

📥 Commits

Reviewing files that changed from the base of the PR and between 85d0c16 and 05e1356.

📒 Files selected for processing (7)
  • bionemo-recipes/recipes/evo2_megatron/README.md
  • bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/models/megatron/hyena/engine.py
  • bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/models/megatron/hyena/hyena_mixer.py
  • bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/models/megatron/hyena/hyena_utils.py
  • bionemo-recipes/recipes/evo2_megatron/src/bionemo/evo2/run/infer.py
  • bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/run/test_infer.py
  • bionemo-recipes/recipes/evo2_megatron/tests/bionemo/evo2/test_evo2.py

Collaborator

@jstjohn jstjohn left a comment

🚢

@farhadrgh
Collaborator Author

farhadrgh commented Apr 30, 2026

/ok to test 05e1356

@copy-pr-bot

copy-pr-bot Bot commented Apr 30, 2026

/ok to test

@farhadrgh, there was an error processing your request: E1

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/

@farhadrgh
Collaborator Author

/ok to test 05e1356

@farhadrgh farhadrgh added the ciflow:all Run all tests (unit tests, slow tests, and notebooks) for bionemo2 or enforce running all tests label Apr 30, 2026
@farhadrgh
Collaborator Author

/ok to test 05e1356

@moradza
Collaborator

moradza commented Apr 30, 2026

LGTM @farhadrgh

@farhadrgh farhadrgh added this pull request to the merge queue Apr 30, 2026
Merged via the queue into main with commit 5497faa Apr 30, 2026
22 checks passed
@farhadrgh farhadrgh deleted the farhadr/infer_subq_ops branch April 30, 2026 21:36
Labels

ciflow:all Run all tests (unit tests, slow tests, and notebooks) for bionemo2 or enforce running all tests

3 participants