[#13580][fix] AutoDeploy: Support Gemma3n/4 E2B variants by bmarimuthu-nv · Pull Request #13630 · NVIDIA/TensorRT-LLM

bmarimuthu-nv · 2026-04-30T01:39:58Z

Summary by CodeRabbit

Release Notes

New Features
- Added support for Gemma 3n and Gemma 4 models with auto-deploy configurations.
- Implemented shared KV attention for improved inference efficiency.
- Added per-layer input handling capabilities in Gemma 4.
- Enhanced Triton paged attention with improved per-sequence KV length handling and memory optimization.
- Extended CUDA graph compilation with resource-input awareness for better capture and replay.

Improvements
- Fixed KV dtype handling in flash decode stage.

Documentation
- Added accuracy benchmarks for Gemma 3n and Gemma 4 on standard evaluation datasets.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

- PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
- PR follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
- Test cases are provided for new code paths (see test instructions).
- Any new dependencies have been scanned for license and vulnerabilities.
- CODEOWNERS updated if ownership changes.
- Documentation updated as needed.
- Update tava architecture diagram if there is a significant design change in PR.
- The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

bmarimuthu-nv · 2026-04-30T01:43:26Z

@coderabbitai summary

coderabbitai · 2026-04-30T01:43:31Z

✅ Actions performed

Summary regeneration triggered.

bmarimuthu-nv · 2026-04-30T01:53:49Z

Gemma4 E2B Status

Latest E2E Run

Command:

bash -ic "f4 && python examples/auto_deploy/build_and_run_ad.py --model google/gemma-4-E2B-it --args.yaml-extra examples/auto_deploy/model_registry/configs/gemma4_e2b.yaml"

Result: passed with exit code 0.

Key signals:

Used attn_backend: triton_paged.
Used compile_backend: torch-cudagraph.
Captured monolithic CUDA graphs successfully.
Captured piecewise CUDA graphs successfully.
Ran all 10/10 example prompts.
Generations were coherent; the prior multilingual/gibberish failure was not reproduced.

Workspace after the run had no tracked changes. Only local untracked artifacts remained: .maglev-review/ and agentlog/.

Example Prompt Responses

Prompt: How big is the universe?

Response: That is one of the biggest and most fascinating questions in science! The short answer is: We don't know for sure, but current scientific measurements suggest the observable universe is staggeringly vast. The answer depends entirely on what you mean by "universe": are you talking about the observable universe, what we can currently see with our telescopes, or the entire universe, including dark matter, dark energy, and everything else?

Prompt: In simple words and a single sentence, explain the concept of gravity:

Response: Gravity is the force that pulls objects with mass toward each other.

Prompt: How to fix slicing in golf?

Response: Slicing in golf is a common and frustrating issue where the clubface is oriented in a way that causes the ball to fly off to the right, for a right-handed golfer, or left, for a left-handed golfer, usually resulting in a severe slice. Fixing a slice is a combination of diagnosing the root cause, which is often mechanical, and implementing specific drills and adjustments.

Prompt: Where is the capital of Iceland?

Response: The capital of Iceland is Reykjavik.

Prompt: What are the three laws of thermodynamics?

Response: The three laws of thermodynamics are fundamental principles that describe how energy and heat behave in physical systems. The response began by explaining the Zeroth Law: if two thermodynamic systems are each in thermal equilibrium with a third system, then they are in thermal equilibrium with each other, establishing the concept of temperature.

Prompt: Summarize the plot of Romeo and Juliet in two sentences:

Response: In Verona, two young lovers from feuding families fall in love despite their parents' opposition. They secretly marry, but a tragic misunderstanding leads to both Romeo and Juliet to die, ultimately reconciling their families.

Prompt: Write a Python function that checks if a number is prime.

Response:

def is_prime(n):
    """
    Checks if a given integer is a prime number.

    A prime number is a natural number greater than 1 that has no
    positive divisors other than 1 and itself.

    Args:
        n (int): The number to check.

    Returns:
        bool: True if the number is prime, False otherwise.
    """
    # 1. Handle edge cases: Numbers

Prompt: Explain the difference between a compiler and an interpreter:

Response: The fundamental difference between a compiler and an interpreter lies in how they translate and execute source code written in a high-level programming language into machine-readable code. The response began a detailed breakdown of a compiler as a program that reads the entire source code before the program is executed.

Prompt: What causes the northern lights?

Response: The Northern Lights, also known as the Aurora Borealis in the North or Aurora Australis in the South, are a spectacular natural light display caused by interactions between charged particles from the sun and the Earth's magnetic field. The response explained that the process begins with charged particles emitted by the Sun.

Prompt: What are the health benefits of drinking green tea?

Response: Green tea is a popular beverage enjoyed worldwide for its potential health benefits, largely due to its rich content of antioxidants, particularly catechins like EGCG. These compounds help combat oxidative stress in the body, which is linked to aging and many chronic diseases.

bmarimuthu-nv · 2026-04-30T01:53:54Z

Gemma3n E2B Status

Latest E2E Run

Command:

bash -ic "f4 && python examples/auto_deploy/build_and_run_ad.py --model google/gemma-3n-E2B-it --args.yaml-extra examples/auto_deploy/model_registry/configs/gemma3n_e2b_it.yaml"

Result: passed with exit code 0.

Key signals:

Used attn_backend: flashinfer.
Used compile_backend: torch-cudagraph.
Used custom model implementation Gemma3nForConditionalGeneration.
Loaded all 3 checkpoint shards.
Captured CUDA graphs for batch sizes 8 through 1.
Ran all 10/10 example prompts.
Generations were coherent; no multilingual/gibberish failure was observed.

Workspace after the run had no tracked changes. Only local untracked artifacts remained: .maglev-review/, agentlog/, and gemma4-e2b-status.md.

Example Prompt Responses

Prompt: How big is the universe?

Response: Okay, this is a HUGE question! And the answer is: we don't know for sure, but we have the best estimates we have. The response explained the observable universe as the portion of the universe from which light has had time to reach us since the Big Bang, limited by the speed of light.

Prompt: In simple words and a single sentence, explain the concept of gravity:

Response: Gravity is the force that pulls everything with mass towards each other, keeping us on the ground and planets orbiting the sun.

Prompt: How to fix slicing in golf?

Response: The response explained that slicing is when the ball curves sharply to the right for a right-handed golfer, or left for a left-handed golfer. It then began a comprehensive guide covering causes, solutions, and drills.

Prompt: Where is the capital of Iceland?

Response: The capital of Iceland is Reykjavik.

Prompt: What are the three laws of thermodynamics?

Response: The response described the three laws of thermodynamics as fundamental principles governing energy and heat transfer. It began with the Zeroth Law: if two systems are each in thermal equilibrium with a third system, then they are in thermal equilibrium with each other.

Prompt: Summarize the plot of Romeo and Juliet in two sentences:

Response: In Verona, the Montague and Capulet families are locked in a bitter feud, and the young Romeo Montague and Juliet Capulet fall deeply in love at first sight. They secretly marry, but their happiness is tragically cut short when a series of unfortunate events and family loyalties lead to a devastating double suicide.

Prompt: Write a Python function that checks if a number is prime.

Response:

def is_prime(number):
  """
  Checks if a given number is prime.

  Args:
    number: An integer to check for primality.

  Returns:
    True if the number is prime, False otherwise.
    Returns False for numbers less than 2.
  """

  if number < 2:
    return False  # Numbers less than 2 are not prime

  #Optimization: Only check divis

Prompt: Explain the difference between a compiler and an interpreter:

Response: The response explained that compilers and interpreters are both tools for translating source code into executable instructions, but they operate differently. A compiler takes the entire source code as input and translates it into machine code, or sometimes an intermediate language, before the program is run.

Prompt: What causes the northern lights?

Response: The Northern Lights, also known as the Aurora Borealis, are caused by solar wind and charged particles from the Sun interacting with Earth's magnetic field and atmosphere. The response began by explaining solar flares, coronal mass ejections, and charged particles such as electrons and protons.

Prompt: What are the health benefits of drinking green tea?

Response: The response described green tea as a beverage derived from Camellia sinensis and explained that it is rich in antioxidants. It highlighted polyphenols and catechins, especially EGCG, as key beneficial compounds.

coderabbitai · 2026-04-30T01:54:04Z

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f4c0b023-6054-42c7-a89c-3ca31655d44b

📥 Commits

Reviewing files that changed from the base of the PR and between 28b3471 and a2a7d5d.

📒 Files selected for processing (24)

examples/auto_deploy/model_registry/configs/gemma3n_e2b_it.yaml
examples/auto_deploy/model_registry/configs/gemma4_e2b.yaml
tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py
tensorrt_llm/_torch/auto_deploy/custom_ops/attention/triton_paged_attention.py
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_gemma3n.py
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_gemma4.py
tensorrt_llm/_torch/auto_deploy/shim/interface.py
tensorrt_llm/_torch/auto_deploy/transform/library/compile_model.py
tests/integration/defs/accuracy/references/gsm8k.yaml
tests/integration/defs/accuracy/references/mmlu.yaml
tests/integration/defs/accuracy/test_llm_api_autodeploy.py
tests/integration/test_lists/test-db/l0_h100.yml
tests/unittest/auto_deploy/singlegpu/compile/test_captured_graph.py
tests/unittest/auto_deploy/singlegpu/custom_ops/attention/test_triton_paged_attention.py
tests/unittest/auto_deploy/singlegpu/models/test_gemma3n_modeling.py
tests/unittest/auto_deploy/singlegpu/models/test_gemma4_modeling.py
tests/unittest/auto_deploy/singlegpu/models/test_minimax_m2_modeling.py
tests/unittest/auto_deploy/singlegpu/models/test_mistral3_modeling.py
tests/unittest/auto_deploy/singlegpu/models/test_mla_rope_utils.py
tests/unittest/auto_deploy/singlegpu/test_graph_canonicalize.py
tests/unittest/auto_deploy/singlegpu/test_hf_export_info.py
tests/unittest/auto_deploy/singlegpu/test_mistral_small_4_tokenizer_bridge.py
tests/unittest/auto_deploy/singlegpu/test_pattern_matcher.py
tests/unittest/auto_deploy/singlegpu/transformations/library/test_shared_kv_attention.py

📝 Walkthrough

This PR introduces Gemma3n and Gemma4 AutoDeploy support with shared-KV attention, per-layer inputs, and dynamic MLP scaling. It enhances the Triton paged attention backend to handle per-sequence cache metadata and read-only cache access, strengthens CUDA graph compilation with resource-input awareness, and provides comprehensive test coverage and accuracy benchmarks for the new models.
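To make the "per-sequence cache metadata" idea concrete, here is a minimal pure-Python sketch of a validity mask built from per-sequence KV lengths. The function name `build_kv_validity_mask` and the padded list-of-lists layout are illustrative assumptions; this is not the Triton kernel or the SDPA path from the PR.

```python
from typing import List


def build_kv_validity_mask(kv_lens: List[int]) -> List[List[bool]]:
    """Return one row per sequence; True marks a valid (non-padded) KV slot.

    Hypothetical helper for illustration only. Rows are padded to the
    longest sequence in the batch, mirroring a gathered [batch, max_kv] layout.
    """
    if any(n < 0 for n in kv_lens):
        raise ValueError("KV lengths must be non-negative")
    max_kv = max(kv_lens, default=0)
    return [[pos < n for pos in range(max_kv)] for n in kv_lens]


# Three sequences with 3, 1, and 2 cached tokens respectively.
mask = build_kv_validity_mask([3, 1, 2])
```

Padding every row to the batch maximum keeps the layout rectangular, while the mask keeps shorter sequences from attending to slots that belong to no token.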

Changes

Gemma Model Support and Attention Optimization

| Layer / File(s) | Summary |
| --- | --- |
| Model Registry & Config `examples/auto_deploy/model_registry/configs/gemma3n_e2b_it.yaml`, `examples/auto_deploy/model_registry/configs/gemma4_e2b.yaml` | Registers Gemma3n and Gemma4 E2B AutoDeploy configurations with model factories, tokenizers, and CUDA graph settings. |
| Gemma3n Tokenizer Wrapper `tensorrt_llm/_torch/auto_deploy/models/custom/modeling_gemma3n.py` | Adds `ADGemma3nTokenizer` and `Gemma3nForConditionalGenerationFactory` to handle custom tokenizer loading and chat template integration. |
| Gemma4 Core Modeling `tensorrt_llm/_torch/auto_deploy/models/custom/modeling_gemma4.py` | Extends `Gemma4TextMLP` with per-layer configuration and dynamic intermediate sizing; implements shared-KV attention in `Gemma4TextAttention`; adds per-layer input injection in `Gemma4TextDecoderLayer`; routes per-layer inputs through `Gemma4TextModel` with embedding and projection paths; propagates per-layer inputs in causal and conditional generation wrappers and export info. |
| Triton Paged Attention Backend `tensorrt_llm/_torch/auto_deploy/custom_ops/attention/triton_paged_attention.py` | Enhances flash decode stage 1 to cast V loads to K/V dtype; extends SDPA path with per-sequence KV pointers and per-sequence max KV length computation; adds per-sequence masking and token validity; introduces `read_cache_only` flag to skip KV cache updates; adds `supports_shared_kv()` method returning `True`; extends `get_constants()` to return shared-KV source layer availability. |
| CUDA Graph Compilation `tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py` | Adds `_is_resource_input()` and `_order_kwargs_runtime_then_resources()` helpers; extends `CapturedGraph.__init__()` with `resource_input_names`; introduces `_normalize_args_kwargs()` and `_resolve_num_batched_inputs()` methods for stable argument ordering; updates `TorchCudagraphCompiler` to thread `resource_input_names` through monolithic and piecewise capture paths; adjusts inner-kwargs capture to preserve resource input ordering. |
| Interface & Config Updates `tensorrt_llm/_torch/auto_deploy/shim/interface.py`, `tensorrt_llm/_torch/auto_deploy/transform/library/compile_model.py` | Adds `resource_names` property to `CachedSequenceInterface`; changes `CompileModelConfig.num_batched_inputs` from `int` (default 2) to `Optional[int]` (default `None`, `ge=1`); populates `extra_kwargs` with `resource_input_names` in compile transform. |
| Unit Tests `tests/unittest/auto_deploy/singlegpu/custom_ops/attention/test_triton_paged_attention.py`, `tests/unittest/auto_deploy/singlegpu/compile/test_captured_graph.py`, `tests/unittest/auto_deploy/singlegpu/models/test_gemma4_modeling.py`, `tests/unittest/auto_deploy/singlegpu/transformations/library/test_shared_kv_attention.py` | Adds tests for Triton paged attention out-buffer and cache-prefix handling; validates CUDA graph resource handling and batching inference; introduces 10 tests covering shared-KV metadata, export behavior, per-layer input paths, CUDA graph capturability, and state dict compatibility; adds 3 tests for Triton paged backend support in shared-KV scenarios. |
| Integration Tests & Benchmarks `tests/integration/defs/accuracy/test_llm_api_autodeploy.py`, `tests/integration/test_lists/test-db/l0_h100.yml`, `tests/integration/defs/accuracy/references/gsm8k.yaml`, `tests/integration/defs/accuracy/references/mmlu.yaml` | Adds `TestGemmaE2B` class with end-to-end tests for Gemma3n and Gemma4 E2B; introduces IR sharding path tests for Nemotron Super V3 and Qwen3.5 MoE; registers new test entries in test list; adds accuracy benchmarks (Gemma3n E2B: GSM8K 72.176, MMLU 59.527; Gemma4 E2B: GSM8K 85.709, MMLU 56.823). |
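
The "runtime inputs first, then resources" ordering named in the CUDA Graph Compilation row can be sketched as below. This is one plausible reading of the summary, not the PR's helper: `order_runtime_then_resources` and the example keys (`kv_cache_0`, `input_ids`, `position_ids`) are hypothetical names chosen for illustration.

```python
from typing import Any, Dict, Iterable


def order_runtime_then_resources(
    kwargs: Dict[str, Any], resource_input_names: Iterable[str]
) -> Dict[str, Any]:
    """Reorder kwargs so runtime inputs precede resource inputs.

    Illustrative stand-in, not the helper from the PR. Relative order
    within each group is preserved because dicts keep insertion order.
    """
    resources = set(resource_input_names)
    runtime = {k: v for k, v in kwargs.items() if k not in resources}
    resource = {k: v for k, v in kwargs.items() if k in resources}
    return {**runtime, **resource}


# A cache-like resource passed first still ends up after the runtime inputs.
ordered = order_runtime_then_resources(
    {"kv_cache_0": "cache", "input_ids": [1, 2], "position_ids": [0, 1]},
    resource_input_names=["kv_cache_0"],
)
```

A deterministic order of this kind would let a capture step treat the leading arguments as per-batch inputs and the trailing ones as long-lived buffers, regardless of how the caller ordered its keyword arguments.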

Sequence Diagram

sequenceDiagram
    participant Client
    participant Gemma4ForCausalLM
    participant Gemma4TextModel
    participant Gemma4TextDecoderLayer
    participant Gemma4TextAttention
    participant TritonPagedAttention

    Client->>Gemma4ForCausalLM: forward(input_ids, per_layer_inputs)
    Gemma4ForCausalLM->>Gemma4TextModel: get_per_layer_inputs(input_ids)
    Gemma4TextModel-->>Gemma4ForCausalLM: per_layer_inputs (projected)
    Gemma4ForCausalLM->>Gemma4TextModel: forward(per_layer_inputs=...)
    
    Gemma4TextModel->>Gemma4TextModel: embed tokens & compute per-layer contributions
    
    loop for each decoder layer
        Gemma4TextModel->>Gemma4TextDecoderLayer: forward(hidden_states, per_layer_input, shared_kv_states)
        
        Gemma4TextDecoderLayer->>Gemma4TextAttention: forward(hidden_states, shared_kv_states)
        alt is KV-shared layer
            Gemma4TextAttention->>Gemma4TextAttention: fetch (k, v) from shared_kv_states
        else compute new KV
            Gemma4TextAttention->>Gemma4TextAttention: compute (k, v) with RoPE
            Gemma4TextAttention->>Gemma4TextAttention: store (k, v) in shared_kv_states
        end
        
        Gemma4TextAttention->>TritonPagedAttention: triton_paged_mha_with_cache(...)
        TritonPagedAttention-->>Gemma4TextAttention: attention output
        
        Gemma4TextDecoderLayer->>Gemma4TextDecoderLayer: inject per_layer_input via gated/projection/norm
        Gemma4TextDecoderLayer-->>Gemma4TextModel: updated hidden_states
    end
    
    Gemma4TextModel-->>Gemma4ForCausalLM: final hidden_states
    Gemma4ForCausalLM-->>Client: logits
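
The `alt is KV-shared layer` branch in the diagram can be sketched in plain Python as follows. This is a toy model of the control flow only: `attention_kv`, `kv_source_layer`, and the arithmetic used in place of the K/V projections and RoPE are all assumptions made for illustration, not code from `modeling_gemma4.py`.

```python
from typing import Dict, List, Optional, Tuple

KV = Tuple[List[float], List[float]]


def attention_kv(
    layer_idx: int,
    hidden: List[float],
    shared_kv_states: Dict[int, KV],
    kv_source_layer: Optional[int],
) -> KV:
    """Return (k, v) for a layer, reusing a source layer's KV when shared.

    kv_source_layer is None for layers that compute their own KV.
    The arithmetic below is a placeholder, not a real projection or RoPE.
    """
    if kv_source_layer is not None:
        # KV-shared layer: fetch the stored pair instead of recomputing it.
        return shared_kv_states[kv_source_layer]
    k = [x * 2.0 for x in hidden]  # placeholder for k_proj + RoPE
    v = [x + 1.0 for x in hidden]  # placeholder for v_proj
    shared_kv_states[layer_idx] = (k, v)
    return k, v


states: Dict[int, KV] = {}
own = attention_kv(0, [1.0, 2.0], states, kv_source_layer=None)
reused = attention_kv(1, [9.0, 9.0], states, kv_source_layer=0)
```

In this sketch layer 1 ignores its own hidden states for K/V and returns exactly what layer 0 stored, which is the behavior the diagram describes for KV-shared layers.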

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description check | ⚠️ Warning | The PR description is almost entirely a template with only the checklist item checked; the Description and Test Coverage sections are empty, lacking actual explanation of the issue, solution, and test details. | Fill in the Description section explaining what was broken for Gemma3n/4 E2B and how it was fixed, and populate the Test Coverage section with relevant test files that validate the changes. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 32.35% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title '[`#13580`][fix] AutoDeploy: Fix Gemma3n/4 E2B variants' directly relates to the main changes in the PR, which involve fixing AutoDeploy for Gemma3n and Gemma4 E2B variants. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (2)

examples/auto_deploy/model_registry/configs/gemma4_e2b.yaml (1)
7-30: 🏗️ Heavy lift

Add a registry-level smoke test for the Gemma E2B configs.

This PR changes deployment entry points, but the added coverage only exercises lower-level model/attention pieces. A small AutoDeploy smoke that resolves both examples/auto_deploy/model_registry/configs/gemma4_e2b.yaml and examples/auto_deploy/model_registry/configs/gemma3n_e2b_it.yaml through the model registry would catch config/factory/tokenizer wiring regressions before release. If that test lives under tests/integration/defs/, please also register it in the appropriate QA functional list.

As per coding guidelines, “Coverage expectations: Assess whether new/changed tests cover happy path...” and “If the Gemma3n/Gemma4 AutoDeploy fixes require end-to-end functional coverage ... add the corresponding new/updated test cases here ... so they execute in the scheduled GPU functional QA run.”

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/auto_deploy/model_registry/configs/gemma4_e2b.yaml` around lines 7 -
30, Add a registry-level smoke test that resolves the Gemma E2B configs via the
model registry: create a new integration test under tests/integration/defs
(e.g., test_gemma_e2b_registry_smoke) that loads the model registry and attempts
to resolve both examples/auto_deploy/model_registry/configs/gemma4_e2b.yaml and
examples/auto_deploy/model_registry/configs/gemma3n_e2b_it.yaml, asserting
successful factory lookup and tokenizer resolution for model_factory
Gemma4ForConditionalGeneration and the tokenizer google/gemma-4-E2B-it; register
this test in the QA functional list so it runs in scheduled GPU functional QA.

tensorrt_llm/_torch/auto_deploy/custom_ops/attention/triton_paged_attention.py (1)
955-1029: 🏗️ Heavy lift

Add perf coverage for the SDPA/shared-KV dispatch rewrite.

This path changes gather shape, masking, and the read-only shared-KV execution flow in a latency-sensitive attention kernel, but the PR only adds functional unit coverage. Please add or update a perf sanity case and wire it into tests/integration/test_lists/test-db/l0_perf.yml; add a QA llm_perf_* entry as well if this should run in scheduled coverage.

As per coding guidelines, “If the PR touches performance-sensitive paths ... check whether a perf test entry is present or updated in: (a) tests/integration/test_lists/test-db/l0_perf.yml ... and (b) tests/integration/test_lists/qa/llm_perf_*.yml ...”

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@tensorrt_llm/_torch/auto_deploy/custom_ops/attention/triton_paged_attention.py`
around lines 955 - 1029, The SDPA/shared-KV dispatch added in
tensorrt_llm/_torch/auto_deploy/custom_ops/attention/triton_paged_attention.py
(look for use_sdpa, _fast_gather_sdpa_kernel, k_sdpa/v_sdpa and the
scaled_dot_product_attention SDPA path) needs perf test coverage: add or update
a perf sanity case exercising the new SDPA/shared-KV path and its altered
gather/mask behavior, then wire that test into
tests/integration/test_lists/test-db/l0_perf.yml and, if this should run in
scheduled QA, add a corresponding entry under
tests/integration/test_lists/qa/llm_perf_*.yml (use a descriptive name like
llm_perf_sdpa_shared_kv) so the latency-sensitive path is included in perf runs.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tensorrt_llm/_torch/auto_deploy/models/custom/modeling_gemma3n.py`:
- Around line 886-923: The code only sets tokenizer.chat_template from a
chat_template.jinja file via cached_file(_CHAT_TEMPLATE_FILE); add a fallback to
read the chat template from the already-loaded tokenizer config (config) when
template_path is None: after the existing template_path check, if template_path
is None and config.get("chat_template") is truthy, set tokenizer.chat_template =
config["chat_template"]; ensure this logic lives alongside the existing
cached_file call (referencing cached_file, _CHAT_TEMPLATE_FILE, config, cls, and
tokenizer.chat_template) so file-based template still takes precedence over
tokenizer_config.json.
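
The precedence this comment asks for (file-based template first, then the `chat_template` entry from the tokenizer config) could look roughly like the sketch below. `resolve_chat_template` is a hypothetical helper; its `template_path` argument stands in for whatever the `cached_file(...)` lookup returns, and nothing here is taken from `modeling_gemma3n.py`.

```python
from typing import Any, Mapping, Optional


def resolve_chat_template(
    template_path: Optional[str], config: Mapping[str, Any]
) -> Optional[str]:
    """Pick a chat template: file first, then tokenizer config, else None.

    Hypothetical helper for illustration. `template_path` is assumed to be
    None when no chat_template.jinja file could be located.
    """
    if template_path is not None:
        # A template file, when present, takes precedence over the config.
        with open(template_path, encoding="utf-8") as f:
            return f.read()
    # Fallback: template embedded in the already-loaded tokenizer config.
    return config.get("chat_template") or None


# No template file found, so the config entry is used.
template = resolve_chat_template(None, {"chat_template": "{{ messages }}"})
```

Returning `None` when neither source exists leaves the tokenizer's existing template untouched, so a caller would only assign `tokenizer.chat_template` when the result is not `None`.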

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f6647ce2-c5a7-4e0c-9e25-7412170a6dcd

📥 Commits

Reviewing files that changed from the base of the PR and between 5a985d5 and 9616eb3.

📒 Files selected for processing (8)

examples/auto_deploy/model_registry/configs/gemma3n_e2b_it.yaml
examples/auto_deploy/model_registry/configs/gemma4_e2b.yaml
tensorrt_llm/_torch/auto_deploy/custom_ops/attention/triton_paged_attention.py
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_gemma3n.py
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_gemma4.py
tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_gemma4_modeling.py
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_shared_kv_attention.py
tests/unittest/auto_deploy/singlegpu/custom_ops/attention/test_triton_paged_attention.py

bmarimuthu-nv · 2026-05-05T23:37:30Z

/bot run --stage-list "A10-Build_Docs, A10-PackageSanityCheck-PY310-UB2204, A100X-PackageSanityCheck-PY312-UB2404, A30-AutoDeploy-1, H100_PCIe-AutoDeploy-1, DGX_B200-AutoDeploy-1, A100X-PyTorch-1, DGX_H100-4_GPUs-AutoDeploy-1, DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-Post-Merge-1, DGX_B200-8_GPUs-AutoDeploy-Post-Merge-1" --disable-fail-fast

bmarimuthu-nv · 2026-05-05T23:51:57Z

/bot run --stage-list "A10-Build_Docs, A10-PackageSanityCheck-PY310-UB2204, A100X-PackageSanityCheck-PY312-UB2404, A30-AutoDeploy-1, H100_PCIe-AutoDeploy-1, DGX_B200-AutoDeploy-1, A100X-PyTorch-1, DGX_H100-4_GPUs-AutoDeploy-1, DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-Post-Merge-1, DGX_B200-8_GPUs-AutoDeploy-Post-Merge-1" --disable-fail-fast

bmarimuthu-nv · 2026-05-06T04:00:13Z

/bot run --stage-list "A10-Build_Docs, A10-PackageSanityCheck-PY310-UB2204, A100X-PackageSanityCheck-PY312-UB2404, A30-AutoDeploy-1, H100_PCIe-AutoDeploy-1, DGX_B200-AutoDeploy-1, A100X-PyTorch-1, DGX_H100-4_GPUs-AutoDeploy-1, DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-Post-Merge-1, DGX_B200-8_GPUs-AutoDeploy-Post-Merge-1" --disable-fail-fast

tensorrt-cicd · 2026-05-06T04:06:33Z

PR_Github #46908 [ run ] triggered by Bot. Commit: ad48358 Link to invocation

bmarimuthu-nv · 2026-05-06T04:39:40Z

/bot run --stage-list "A10-Build_Docs, A10-PackageSanityCheck-PY310-UB2204, A100X-PackageSanityCheck-PY312-UB2404, A30-AutoDeploy-1, H100_PCIe-AutoDeploy-1, DGX_B200-AutoDeploy-1, A100X-PyTorch-1, DGX_H100-4_GPUs-AutoDeploy-1, DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-Post-Merge-1, DGX_B200-8_GPUs-AutoDeploy-Post-Merge-1" --disable-fail-fast

bmarimuthu-nv · 2026-05-06T04:40:20Z

@coderabbitai summary

coderabbitai · 2026-05-06T04:40:27Z

✅ Actions performed

Summary regeneration triggered.

tensorrt-cicd · 2026-05-06T04:45:08Z

PR_Github #46917 [ run ] triggered by Bot. Commit: a2a7d5d Link to invocation

tensorrt-cicd · 2026-05-06T04:45:12Z

PR_Github #46908 [ run ] completed with state ABORTED. Commit: ad48358

Link to invocation

tensorrt-cicd · 2026-05-06T10:27:17Z

PR_Github #46917 [ run ] completed with state SUCCESS. Commit: a2a7d5d
/LLM/main/L0_MergeRequest_PR pipeline #36923 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

Signed-off-by: Balamurugan Marimuthu <246387390+bmarimuthu-nv@users.noreply.github.com>

bmarimuthu-nv · 2026-05-12T18:13:12Z

/bot run

tensorrt-cicd · 2026-05-12T18:19:52Z

PR_Github #48006 [ run ] triggered by Bot. Commit: 81ab8d4 Link to invocation

tensorrt-cicd · 2026-05-12T20:41:52Z

PR_Github #48006 [ run ] completed with state SUCCESS. Commit: 81ab8d4
/LLM/main/L0_MergeRequest_PR pipeline #37842 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

bmarimuthu-nv · 2026-05-12T23:15:16Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-12T23:21:00Z

PR_Github #48044 [ run ] triggered by Bot. Commit: 81ab8d4 Link to invocation

tensorrt-cicd · 2026-05-13T07:53:39Z

PR_Github #48044 [ run ] completed with state SUCCESS. Commit: 81ab8d4
/LLM/main/L0_MergeRequest_PR pipeline #37878 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

Signed-off-by: Balamurugan Marimuthu <246387390+bmarimuthu-nv@users.noreply.github.com>

bmarimuthu-nv · 2026-05-14T19:53:59Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-14T19:59:54Z

PR_Github #48424 [ run ] triggered by Bot. Commit: c8d88b5 Link to invocation

tensorrt-cicd · 2026-05-14T20:00:59Z

PR_Github #48424 [ run ] completed with state FAILURE. Commit: c8d88b5

Link to invocation

bmarimuthu-nv · 2026-05-14T20:11:13Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-14T20:17:12Z

PR_Github #48429 [ run ] triggered by Bot. Commit: c8d88b5 Link to invocation

tensorrt-cicd · 2026-05-15T06:24:40Z

PR_Github #48429 [ run ] completed with state SUCCESS. Commit: c8d88b5
/LLM/main/L0_MergeRequest_PR pipeline #38229 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

bmarimuthu-nv · 2026-05-15T18:19:57Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-15T18:26:15Z

PR_Github #48619 [ run ] triggered by Bot. Commit: c8d88b5 Link to invocation

tensorrt-cicd · 2026-05-15T23:11:01Z

PR_Github #48619 [ run ] completed with state SUCCESS. Commit: c8d88b5
/LLM/main/L0_MergeRequest_PR pipeline #38402 completed with status: 'SUCCESS'

CI Report

Link to invocation

Signed-off-by: Balamurugan Marimuthu <246387390+bmarimuthu-nv@users.noreply.github.com>

bmarimuthu-nv · 2026-05-18T23:38:28Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-18T23:45:05Z

PR_Github #48997 [ run ] triggered by Bot. Commit: 1bdba70 Link to invocation

tensorrt-cicd · 2026-05-19T16:13:45Z

PR_Github #48997 [ run ] completed with state SUCCESS. Commit: 1bdba70
/LLM/main/L0_MergeRequest_PR pipeline #38736 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

bmarimuthu-nv · 2026-05-19T17:51:23Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-19T17:59:06Z

PR_Github #49247 [ run ] triggered by Bot. Commit: 1bdba70 Link to invocation

tensorrt-cicd · 2026-05-19T19:26:31Z

PR_Github #49247 [ run ] completed with state SUCCESS. Commit: 1bdba70
/LLM/main/L0_MergeRequest_PR pipeline #38916 completed with status: 'SUCCESS'

CI Report

Link to invocation

github-actions Bot assigned bmarimuthu-nv Apr 30, 2026

bmarimuthu-nv changed the title from "[None][fix] AutoDeploy: Fix Gemma3n/4 E2B variants" to "[13580][fix] AutoDeploy: Fix Gemma3n/4 E2B variants" Apr 30, 2026

bmarimuthu-nv marked this pull request as ready for review April 30, 2026 01:42

bmarimuthu-nv requested a review from a team as a code owner April 30, 2026 01:42

bmarimuthu-nv requested review from nvchenghaoz and taylor-yb-lee April 30, 2026 01:42

bmarimuthu-nv changed the title from "[13580][fix] AutoDeploy: Fix Gemma3n/4 E2B variants" to "[#13580][fix] AutoDeploy: Fix Gemma3n/4 E2B variants" Apr 30, 2026

coderabbitai Bot reviewed Apr 30, 2026

Comment thread tensorrt_llm/_torch/auto_deploy/models/custom/modeling_gemma3n.py Outdated

bmarimuthu-nv requested a review from a team as a code owner May 1, 2026 03:36

bmarimuthu-nv force-pushed the bala/fix-gemma2b branch 2 times, most recently from 7476a5c to ad48358 May 5, 2026 23:30

bmarimuthu-nv mentioned this pull request May 6, 2026

Move model tests to correct location #13626

Open

galagam reviewed May 6, 2026

Comment thread examples/auto_deploy/model_registry/configs/gemma4_e2b.yaml

galagam reviewed May 6, 2026

Comment thread tests/integration/defs/accuracy/test_llm_api_autodeploy.py

StanleySun639 approved these changes May 6, 2026

bmarimuthu-nv added 3 commits May 11, 2026 15:06

CI and review fixes

5ad4a21

Signed-off-by: Balamurugan Marimuthu <246387390+bmarimuthu-nv@users.noreply.github.com>

lint fixes

65ad2c1

Signed-off-by: Balamurugan Marimuthu <246387390+bmarimuthu-nv@users.noreply.github.com>

[None][fix] use native Gemma tokenizers in AutoDeploy

81ab8d4

Signed-off-by: Balamurugan Marimuthu <246387390+bmarimuthu-nv@users.noreply.github.com>

bmarimuthu-nv force-pushed the bala/fix-gemma2b branch from f27c805 to 81ab8d4 May 12, 2026 02:39

fix cudagraph extent

c8d88b5

Signed-off-by: Balamurugan Marimuthu <246387390+bmarimuthu-nv@users.noreply.github.com>

nvchenghaoz approved these changes May 18, 2026

Comment thread tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py Outdated

Comment thread tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py Outdated

address review comments

1bdba70

Signed-off-by: Balamurugan Marimuthu <246387390+bmarimuthu-nv@users.noreply.github.com>

bmarimuthu-nv merged commit b892451 into NVIDIA:main May 19, 2026
7 checks passed

coderabbitai Bot mentioned this pull request May 20, 2026

[#14173][tests] move autodeploy accuracy tests to post merge and use model registry #14352

Open

1 task
