
[#13580][fix] AutoDeploy: Support Gemma3n/4 E2B variants #13630

Merged
bmarimuthu-nv merged 16 commits into NVIDIA:main from nv-auto-deploy:bala/fix-gemma2b on May 19, 2026

Conversation

@bmarimuthu-nv
Collaborator

@bmarimuthu-nv bmarimuthu-nv commented Apr 30, 2026

Summary by CodeRabbit

Release Notes

  • New Features

    • Added support for Gemma 3n and Gemma 4 models with auto-deploy configurations.
    • Implemented shared KV attention for improved inference efficiency.
    • Added per-layer input handling capabilities in Gemma 4.
    • Enhanced Triton paged attention with improved per-sequence KV length handling and memory optimization.
    • Extended CUDA graph compilation with resource-input awareness for better capture and replay.
  • Improvements

    • Fixed KV dtype handling in flash decode stage.
  • Documentation

    • Added accuracy benchmarks for Gemma 3n and Gemma 4 on standard evaluation datasets.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@bmarimuthu-nv bmarimuthu-nv changed the title from "[None][fix] AutoDeploy: Fix Gemma3n/4 E2B variants" to "[13580][fix] AutoDeploy: Fix Gemma3n/4 E2B variants" on Apr 30, 2026
@bmarimuthu-nv bmarimuthu-nv marked this pull request as ready for review April 30, 2026 01:42
@bmarimuthu-nv bmarimuthu-nv requested a review from a team as a code owner April 30, 2026 01:42
@bmarimuthu-nv
Collaborator Author

@coderabbitai summary

@coderabbitai
Contributor

coderabbitai Bot commented Apr 30, 2026

✅ Actions performed

Summary regeneration triggered.

@bmarimuthu-nv bmarimuthu-nv changed the title from "[13580][fix] AutoDeploy: Fix Gemma3n/4 E2B variants" to "[#13580][fix] AutoDeploy: Fix Gemma3n/4 E2B variants" on Apr 30, 2026
@bmarimuthu-nv
Collaborator Author

Gemma4 E2B Status

Latest E2E Run

Command:

bash -ic "f4 && python examples/auto_deploy/build_and_run_ad.py --model google/gemma-4-E2B-it --args.yaml-extra examples/auto_deploy/model_registry/configs/gemma4_e2b.yaml"

Result: passed with exit code 0.

Key signals:

  • Used attn_backend: triton_paged.
  • Used compile_backend: torch-cudagraph.
  • Captured monolithic CUDA graphs successfully.
  • Captured piecewise CUDA graphs successfully.
  • Ran all 10/10 example prompts.
  • Generations were coherent; the prior multilingual/gibberish failure was not reproduced.

Workspace after the run had no tracked changes. Only local untracked artifacts remained: .maglev-review/ and agentlog/.

Example Prompt Responses

  1. Prompt: How big is the universe?

    Response: That is one of the biggest and most fascinating questions in science! The short answer is: We don't know for sure, but current scientific measurements suggest the observable universe is staggeringly vast. The answer depends entirely on what you mean by "universe": are you talking about the observable universe, what we can currently see with our telescopes, or the entire universe, including dark matter, dark energy, and everything else?

  2. Prompt: In simple words and a single sentence, explain the concept of gravity:

    Response: Gravity is the force that pulls objects with mass toward each other.

  3. Prompt: How to fix slicing in golf?

    Response: Slicing in golf is a common and frustrating issue where the clubface is oriented in a way that causes the ball to fly off to the right, for a right-handed golfer, or left, for a left-handed golfer, usually resulting in a severe slice. Fixing a slice is a combination of diagnosing the root cause, which is often mechanical, and implementing specific drills and adjustments.

  4. Prompt: Where is the capital of Iceland?

    Response: The capital of Iceland is Reykjavik.

  5. Prompt: What are the three laws of thermodynamics?

    Response: The three laws of thermodynamics are fundamental principles that describe how energy and heat behave in physical systems. The response began by explaining the Zeroth Law: if two thermodynamic systems are each in thermal equilibrium with a third system, then they are in thermal equilibrium with each other, establishing the concept of temperature.

  6. Prompt: Summarize the plot of Romeo and Juliet in two sentences:

    Response: In Verona, two young lovers from feuding families fall in love despite their parents' opposition. They secretly marry, but a tragic misunderstanding leads to both Romeo and Juliet to die, ultimately reconciling their families.

  7. Prompt: Write a Python function that checks if a number is prime.

    Response:

    def is_prime(n):
        """
        Checks if a given integer is a prime number.
    
        A prime number is a natural number greater than 1 that has no
        positive divisors other than 1 and itself.
    
        Args:
            n (int): The number to check.
    
        Returns:
            bool: True if the number is prime, False otherwise.
        """
        # 1. Handle edge cases: Numbers

  8. Prompt: Explain the difference between a compiler and an interpreter:

    Response: The fundamental difference between a compiler and an interpreter lies in how they translate and execute source code written in a high-level programming language into machine-readable code. The response began a detailed breakdown of a compiler as a program that reads the entire source code before the program is executed.

  9. Prompt: What causes the northern lights?

    Response: The Northern Lights, also known as the Aurora Borealis in the North or Aurora Australis in the South, are a spectacular natural light display caused by interactions between charged particles from the sun and the Earth's magnetic field. The response explained that the process begins with charged particles emitted by the Sun.

  10. Prompt: What are the health benefits of drinking green tea?

    Response: Green tea is a popular beverage enjoyed worldwide for its potential health benefits, largely due to its rich content of antioxidants, particularly catechins like EGCG. These compounds help combat oxidative stress in the body, which is linked to aging and many chronic diseases.

@bmarimuthu-nv
Collaborator Author

Gemma3n E2B Status

Latest E2E Run

Command:

bash -ic "f4 && python examples/auto_deploy/build_and_run_ad.py --model google/gemma-3n-E2B-it --args.yaml-extra examples/auto_deploy/model_registry/configs/gemma3n_e2b_it.yaml"

Result: passed with exit code 0.

Key signals:

  • Used attn_backend: flashinfer.
  • Used compile_backend: torch-cudagraph.
  • Used custom model implementation Gemma3nForConditionalGeneration.
  • Loaded all 3 checkpoint shards.
  • Captured CUDA graphs for batch sizes 8 through 1.
  • Ran all 10/10 example prompts.
  • Generations were coherent; no multilingual/gibberish failure was observed.

Workspace after the run had no tracked changes. Only local untracked artifacts remained: .maglev-review/, agentlog/, and gemma4-e2b-status.md.

Example Prompt Responses

  1. Prompt: How big is the universe?

    Response: Okay, this is a HUGE question! And the answer is: we don't know for sure, but we have the best estimates we have. The response explained the observable universe as the portion of the universe from which light has had time to reach us since the Big Bang, limited by the speed of light.

  2. Prompt: In simple words and a single sentence, explain the concept of gravity:

    Response: Gravity is the force that pulls everything with mass towards each other, keeping us on the ground and planets orbiting the sun.

  3. Prompt: How to fix slicing in golf?

    Response: The response explained that slicing is when the ball curves sharply to the right for a right-handed golfer, or left for a left-handed golfer. It then began a comprehensive guide covering causes, solutions, and drills.

  4. Prompt: Where is the capital of Iceland?

    Response: The capital of Iceland is Reykjavik.

  5. Prompt: What are the three laws of thermodynamics?

    Response: The response described the three laws of thermodynamics as fundamental principles governing energy and heat transfer. It began with the Zeroth Law: if two systems are each in thermal equilibrium with a third system, then they are in thermal equilibrium with each other.

  6. Prompt: Summarize the plot of Romeo and Juliet in two sentences:

    Response: In Verona, the Montague and Capulet families are locked in a bitter feud, and the young Romeo Montague and Juliet Capulet fall deeply in love at first sight. They secretly marry, but their happiness is tragically cut short when a series of unfortunate events and family loyalties lead to a devastating double suicide.

  7. Prompt: Write a Python function that checks if a number is prime.

    Response:

    def is_prime(number):
      """
      Checks if a given number is prime.
    
      Args:
        number: An integer to check for primality.
    
      Returns:
        True if the number is prime, False otherwise.
        Returns False for numbers less than 2.
      """
    
      if number < 2:
        return False  # Numbers less than 2 are not prime
    
      #Optimization: Only check divis

  8. Prompt: Explain the difference between a compiler and an interpreter:

    Response: The response explained that compilers and interpreters are both tools for translating source code into executable instructions, but they operate differently. A compiler takes the entire source code as input and translates it into machine code, or sometimes an intermediate language, before the program is run.

  9. Prompt: What causes the northern lights?

    Response: The Northern Lights, also known as the Aurora Borealis, are caused by solar wind and charged particles from the Sun interacting with Earth's magnetic field and atmosphere. The response began by explaining solar flares, coronal mass ejections, and charged particles such as electrons and protons.

  10. Prompt: What are the health benefits of drinking green tea?

    Response: The response described green tea as a beverage derived from Camellia sinensis and explained that it is rich in antioxidants. It highlighted polyphenols and catechins, especially EGCG, as key beneficial compounds.

@coderabbitai
Contributor

coderabbitai Bot commented Apr 30, 2026

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f4c0b023-6054-42c7-a89c-3ca31655d44b

📥 Commits

Reviewing files that changed from the base of the PR and between 28b3471 and a2a7d5d.

📒 Files selected for processing (24)
  • examples/auto_deploy/model_registry/configs/gemma3n_e2b_it.yaml
  • examples/auto_deploy/model_registry/configs/gemma4_e2b.yaml
  • tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/attention/triton_paged_attention.py
  • tensorrt_llm/_torch/auto_deploy/models/custom/modeling_gemma3n.py
  • tensorrt_llm/_torch/auto_deploy/models/custom/modeling_gemma4.py
  • tensorrt_llm/_torch/auto_deploy/shim/interface.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/compile_model.py
  • tests/integration/defs/accuracy/references/gsm8k.yaml
  • tests/integration/defs/accuracy/references/mmlu.yaml
  • tests/integration/defs/accuracy/test_llm_api_autodeploy.py
  • tests/integration/test_lists/test-db/l0_h100.yml
  • tests/unittest/auto_deploy/singlegpu/compile/test_captured_graph.py
  • tests/unittest/auto_deploy/singlegpu/custom_ops/attention/test_triton_paged_attention.py
  • tests/unittest/auto_deploy/singlegpu/models/test_gemma3n_modeling.py
  • tests/unittest/auto_deploy/singlegpu/models/test_gemma4_modeling.py
  • tests/unittest/auto_deploy/singlegpu/models/test_minimax_m2_modeling.py
  • tests/unittest/auto_deploy/singlegpu/models/test_mistral3_modeling.py
  • tests/unittest/auto_deploy/singlegpu/models/test_mla_rope_utils.py
  • tests/unittest/auto_deploy/singlegpu/test_graph_canonicalize.py
  • tests/unittest/auto_deploy/singlegpu/test_hf_export_info.py
  • tests/unittest/auto_deploy/singlegpu/test_mistral_small_4_tokenizer_bridge.py
  • tests/unittest/auto_deploy/singlegpu/test_pattern_matcher.py
  • tests/unittest/auto_deploy/singlegpu/transformations/library/test_shared_kv_attention.py

📝 Walkthrough

Walkthrough

This PR introduces Gemma3n and Gemma4 AutoDeploy support with shared-KV attention, per-layer inputs, and dynamic MLP scaling. It enhances the Triton paged attention backend to handle per-sequence cache metadata and read-only cache access, strengthens CUDA graph compilation with resource-input awareness, and provides comprehensive test coverage and accuracy benchmarks for the new models.

Changes

Gemma Model Support and Attention Optimization

| Layer | File(s) | Summary |
| --- | --- | --- |
| Model Registry & Config | examples/auto_deploy/model_registry/configs/gemma3n_e2b_it.yaml, examples/auto_deploy/model_registry/configs/gemma4_e2b.yaml | Registers Gemma3n and Gemma4 E2B AutoDeploy configurations with model factories, tokenizers, and CUDA graph settings. |
| Gemma3n Tokenizer Wrapper | tensorrt_llm/_torch/auto_deploy/models/custom/modeling_gemma3n.py | Adds `ADGemma3nTokenizer` and `Gemma3nForConditionalGenerationFactory` to handle custom tokenizer loading and chat template integration. |
| Gemma4 Core Modeling | tensorrt_llm/_torch/auto_deploy/models/custom/modeling_gemma4.py | Extends `Gemma4TextMLP` with per-layer configuration and dynamic intermediate sizing; implements shared-KV attention in `Gemma4TextAttention`; adds per-layer input injection in `Gemma4TextDecoderLayer`; routes per-layer inputs through `Gemma4TextModel` with embedding and projection paths; propagates per-layer inputs in causal and conditional generation wrappers and export info. |
| Triton Paged Attention Backend | tensorrt_llm/_torch/auto_deploy/custom_ops/attention/triton_paged_attention.py | Enhances flash decode stage 1 to cast V loads to K/V dtype; extends SDPA path with per-sequence KV pointers and per-sequence max KV length computation; adds per-sequence masking and token validity; introduces `read_cache_only` flag to skip KV cache updates; adds `supports_shared_kv()` method returning True; extends `get_constants()` to return shared-KV source layer availability. |
| CUDA Graph Compilation | tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py | Adds `_is_resource_input()` and `_order_kwargs_runtime_then_resources()` helpers; extends `CapturedGraph.__init__()` with `resource_input_names`; introduces `_normalize_args_kwargs()` and `_resolve_num_batched_inputs()` methods for stable argument ordering; updates `TorchCudagraphCompiler` to thread `resource_input_names` through monolithic and piecewise capture paths; adjusts inner-kwargs capture to preserve resource input ordering. |
| Interface & Config Updates | tensorrt_llm/_torch/auto_deploy/shim/interface.py, tensorrt_llm/_torch/auto_deploy/transform/library/compile_model.py | Adds `resource_names` property to `CachedSequenceInterface`; changes `CompileModelConfig.num_batched_inputs` from int (default 2) to Optional[int] (default None, ge=1); populates `extra_kwargs` with `resource_input_names` in compile transform. |
| Unit Tests | tests/unittest/auto_deploy/singlegpu/custom_ops/attention/test_triton_paged_attention.py, tests/unittest/auto_deploy/singlegpu/compile/test_captured_graph.py, tests/unittest/auto_deploy/singlegpu/models/test_gemma4_modeling.py, tests/unittest/auto_deploy/singlegpu/transformations/library/test_shared_kv_attention.py | Adds tests for Triton paged attention out-buffer and cache-prefix handling; validates CUDA graph resource handling and batching inference; introduces 10 tests covering shared-KV metadata, export behavior, per-layer input paths, CUDA graph capturability, and state dict compatibility; adds 3 tests for Triton paged backend support in shared-KV scenarios. |
| Integration Tests & Benchmarks | tests/integration/defs/accuracy/test_llm_api_autodeploy.py, tests/integration/test_lists/test-db/l0_h100.yml, tests/integration/defs/accuracy/references/gsm8k.yaml, tests/integration/defs/accuracy/references/mmlu.yaml | Adds `TestGemmaE2B` class with end-to-end tests for Gemma3n and Gemma4 E2B; introduces IR sharding path tests for Nemotron Super V3 and Qwen3.5 MoE; registers new test entries in test list; adds accuracy benchmarks (Gemma3n E2B: GSM8K 72.176, MMLU 59.527; Gemma4 E2B: GSM8K 85.709, MMLU 56.823). |
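
The CUDA graph and config entries above describe two ideas: ordering keyword arguments so runtime inputs come before resource inputs (such as caches), and resolving the batched-input count when `num_batched_inputs` is left as `None`. The following is a minimal sketch of those ideas in plain Python, not the PR's code. The function names, the example argument names (`kv_cache_0`, `input_ids`, `position_ids`), and the inference rule (count the non-resource inputs) are assumptions made for illustration.

```python
from typing import Any, Dict, Iterable, List, Optional, Tuple


def order_runtime_then_resources(
    kwargs: Dict[str, Any], resource_input_names: Iterable[str]
) -> List[Tuple[str, Any]]:
    """Return kwargs as a list with runtime inputs first and resources last.

    Relative order inside each group is preserved, so the result is stable
    for calls that pass the same keys in the same order.
    """
    resources = set(resource_input_names)
    runtime = [(k, v) for k, v in kwargs.items() if k not in resources]
    resource = [(k, v) for k, v in kwargs.items() if k in resources]
    return runtime + resource


def resolve_num_batched_inputs(
    configured: Optional[int],
    kwargs: Dict[str, Any],
    resource_input_names: Iterable[str],
) -> int:
    """Use the configured value if set; otherwise infer it (assumed rule)."""
    if configured is None:
        # Assumption: every non-resource input is a batched runtime input.
        resources = set(resource_input_names)
        configured = sum(1 for k in kwargs if k not in resources)
    if configured < 1:
        raise ValueError("num_batched_inputs must be >= 1")
    return configured


# Hypothetical call: one cache resource mixed in with two runtime inputs.
call_kwargs = {"kv_cache_0": "cache", "input_ids": [1, 2], "position_ids": [0, 1]}
ordered = order_runtime_then_resources(call_kwargs, ["kv_cache_0"])
print([name for name, _ in ordered])  # ['input_ids', 'position_ids', 'kv_cache_0']
print(resolve_num_batched_inputs(None, call_kwargs, ["kv_cache_0"]))  # 2
```

Keeping resources at the end means the leading positions always hold the inputs whose batch dimension varies between captures, which is one plausible reason for a stable ordering of this kind.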

Sequence Diagram

sequenceDiagram
    participant Client
    participant Gemma4ForCausalLM
    participant Gemma4TextModel
    participant Gemma4TextDecoderLayer
    participant Gemma4TextAttention
    participant TritonPagedAttention

    Client->>Gemma4ForCausalLM: forward(input_ids, per_layer_inputs)
    Gemma4ForCausalLM->>Gemma4TextModel: get_per_layer_inputs(input_ids)
    Gemma4TextModel-->>Gemma4ForCausalLM: per_layer_inputs (projected)
    Gemma4ForCausalLM->>Gemma4TextModel: forward(per_layer_inputs=...)
    
    Gemma4TextModel->>Gemma4TextModel: embed tokens & compute per-layer contributions
    
    loop for each decoder layer
        Gemma4TextModel->>Gemma4TextDecoderLayer: forward(hidden_states, per_layer_input, shared_kv_states)
        
        Gemma4TextDecoderLayer->>Gemma4TextAttention: forward(hidden_states, shared_kv_states)
        alt is KV-shared layer
            Gemma4TextAttention->>Gemma4TextAttention: fetch (k, v) from shared_kv_states
        else compute new KV
            Gemma4TextAttention->>Gemma4TextAttention: compute (k, v) with RoPE
            Gemma4TextAttention->>Gemma4TextAttention: store (k, v) in shared_kv_states
        end
        
        Gemma4TextAttention->>TritonPagedAttention: triton_paged_mha_with_cache(...)
        TritonPagedAttention-->>Gemma4TextAttention: attention output
        
        Gemma4TextDecoderLayer->>Gemma4TextDecoderLayer: inject per_layer_input via gated/projection/norm
        Gemma4TextDecoderLayer-->>Gemma4TextModel: updated hidden_states
    end
    
    Gemma4TextModel-->>Gemma4ForCausalLM: final hidden_states
    Gemma4ForCausalLM-->>Client: logits
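
The shared-KV branch in the diagram can be sketched with a toy model. This is an illustration under stated assumptions, not the `Gemma4TextAttention` implementation: `ToyAttentionLayer`, `kv_source_layer`, and the arithmetic standing in for projections and attention are all hypothetical. Only the control flow (reuse `(k, v)` from `shared_kv_states` on a KV-shared layer, otherwise compute and store it) follows the diagram.

```python
from typing import Dict, List, Optional, Tuple

Vector = List[float]
KV = Tuple[Vector, Vector]


class ToyAttentionLayer:
    """Toy layer that either projects its own (k, v) or reuses another layer's."""

    def __init__(self, layer_idx: int, kv_source_layer: Optional[int] = None):
        self.layer_idx = layer_idx
        # None means "compute my own K/V"; an int names the layer to borrow from.
        self.kv_source_layer = kv_source_layer
        self.kv_projections_run = 0

    def _project_kv(self, hidden: Vector) -> KV:
        # Stand-in for the real K/V projections (and RoPE on K).
        self.kv_projections_run += 1
        scale = float(self.layer_idx + 1)
        return [scale * h for h in hidden], [h + scale for h in hidden]

    def forward(self, hidden: Vector, shared_kv_states: Dict[int, KV]) -> Vector:
        if self.kv_source_layer is not None:
            # KV-shared layer: fetch (k, v), no projection is run.
            k, v = shared_kv_states[self.kv_source_layer]
        else:
            k, v = self._project_kv(hidden)
            shared_kv_states[self.layer_idx] = (k, v)  # publish for later layers
        # Stand-in for attention: a residual add of v (k is unused in this toy).
        return [h + vv for h, vv in zip(hidden, v)]


# Layer 2 shares the K/V computed by layer 1.
layers = [ToyAttentionLayer(0), ToyAttentionLayer(1), ToyAttentionLayer(2, kv_source_layer=1)]
shared: Dict[int, KV] = {}
hidden = [1.0, 2.0]
for layer in layers:
    hidden = layer.forward(hidden, shared)

print([layer.kv_projections_run for layer in layers])  # [1, 1, 0]
```

The shared layer runs no K/V projection of its own, which is where the saving in a shared-KV design comes from.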

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description check | ⚠️ Warning | The PR description is almost entirely a template with only the checklist item checked; the Description and Test Coverage sections are empty, lacking actual explanation of the issue, solution, and test details. | Fill in the Description section explaining what was broken for Gemma3n/4 E2B and how it was fixed, and populate the Test Coverage section with relevant test files that validate the changes. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 32.35% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title '[#13580][fix] AutoDeploy: Fix Gemma3n/4 E2B variants' directly relates to the main changes in the PR, which involve fixing AutoDeploy for Gemma3n and Gemma4 E2B variants. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
examples/auto_deploy/model_registry/configs/gemma4_e2b.yaml (1)

7-30: 🏗️ Heavy lift

Add a registry-level smoke test for the Gemma E2B configs.

This PR changes deployment entry points, but the added coverage only exercises lower-level model/attention pieces. A small AutoDeploy smoke that resolves both examples/auto_deploy/model_registry/configs/gemma4_e2b.yaml and examples/auto_deploy/model_registry/configs/gemma3n_e2b_it.yaml through the model registry would catch config/factory/tokenizer wiring regressions before release. If that test lives under tests/integration/defs/, please also register it in the appropriate QA functional list.

As per coding guidelines, “Coverage expectations: Assess whether new/changed tests cover happy path...” and “If the Gemma3n/Gemma4 AutoDeploy fixes require end-to-end functional coverage ... add the corresponding new/updated test cases here ... so they execute in the scheduled GPU functional QA run.”

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/auto_deploy/model_registry/configs/gemma4_e2b.yaml` around lines 7 -
30, Add a registry-level smoke test that resolves the Gemma E2B configs via the
model registry: create a new integration test under tests/integration/defs
(e.g., test_gemma_e2b_registry_smoke) that loads the model registry and attempts
to resolve both examples/auto_deploy/model_registry/configs/gemma4_e2b.yaml and
examples/auto_deploy/model_registry/configs/gemma3n_e2b_it.yaml, asserting
successful factory lookup and tokenizer resolution for model_factory
Gemma4ForConditionalGeneration and the tokenizer google/gemma-4-E2B-it; register
this test in the QA functional list so it runs in scheduled GPU functional QA.
tensorrt_llm/_torch/auto_deploy/custom_ops/attention/triton_paged_attention.py (1)

955-1029: 🏗️ Heavy lift

Add perf coverage for the SDPA/shared-KV dispatch rewrite.

This path changes gather shape, masking, and the read-only shared-KV execution flow in a latency-sensitive attention kernel, but the PR only adds functional unit coverage. Please add or update a perf sanity case and wire it into tests/integration/test_lists/test-db/l0_perf.yml; add a QA llm_perf_* entry as well if this should run in scheduled coverage.

As per coding guidelines, “If the PR touches performance-sensitive paths ... check whether a perf test entry is present or updated in: (a) tests/integration/test_lists/test-db/l0_perf.yml ... and (b) tests/integration/test_lists/qa/llm_perf_*.yml ...”

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@tensorrt_llm/_torch/auto_deploy/custom_ops/attention/triton_paged_attention.py`
around lines 955 - 1029, The SDPA/shared-KV dispatch added in
tensorrt_llm/_torch/auto_deploy/custom_ops/attention/triton_paged_attention.py
(look for use_sdpa, _fast_gather_sdpa_kernel, k_sdpa/v_sdpa and the
scaled_dot_product_attention SDPA path) needs perf test coverage: add or update
a perf sanity case exercising the new SDPA/shared-KV path and its altered
gather/mask behavior, then wire that test into
tests/integration/test_lists/test-db/l0_perf.yml and, if this should run in
scheduled QA, add a corresponding entry under
tests/integration/test_lists/qa/llm_perf_*.yml (use a descriptive name like
llm_perf_sdpa_shared_kv) so the latency-sensitive path is included in perf runs.
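
For readers unfamiliar with the behavior this comment refers to (gather shape, per-sequence masking, and the read-only shared-KV flow), here is a small sketch in plain Python. It is a toy model of a paged KV cache, not the Triton kernel: `PAGE_SIZE`, `gather_kv`, `batch_gather`, and `append_token` are hypothetical names, and the statement that a shared-KV consumer skips the write because the source layer already filled the slot is an assumption about why a `read_cache_only` flag would exist.

```python
from typing import List, Tuple

PAGE_SIZE = 2  # tokens per page (tiny, for illustration)


def gather_kv(pages: List[List[int]], page_table: List[int], kv_len: int) -> List[int]:
    """Collect the first kv_len cached tokens of one sequence from its pages."""
    tokens: List[int] = []
    for page_id in page_table:
        tokens.extend(pages[page_id])
    return tokens[:kv_len]


def batch_gather(
    pages: List[List[int]], page_tables: List[List[int]], kv_lens: List[int]
) -> Tuple[List[List[int]], List[List[bool]]]:
    """Pad each sequence to the batch's max KV length and build a validity mask."""
    max_len = max(kv_lens)
    padded: List[List[int]] = []
    mask: List[List[bool]] = []
    for table, n in zip(page_tables, kv_lens):
        row = gather_kv(pages, table, n)
        padded.append(row + [0] * (max_len - n))
        mask.append([True] * n + [False] * (max_len - n))
    return padded, mask


def append_token(
    pages: List[List[int]],
    page_table: List[int],
    kv_len: int,
    token: int,
    read_cache_only: bool = False,
) -> int:
    """Write one token into the cache unless the caller only reads it."""
    if read_cache_only:
        # Assumed shared-KV consumer: leave the cache untouched.
        return kv_len
    pages[page_table[kv_len // PAGE_SIZE]][kv_len % PAGE_SIZE] = token
    return kv_len + 1


# Two sequences with different KV lengths: 3 tokens and 1 token.
pages = [[11, 12], [13, 0], [21, 0]]
page_tables = [[0, 1], [2]]
kv_lens = [3, 1]
padded, mask = batch_gather(pages, page_tables, kv_lens)
print(padded)  # [[11, 12, 13], [21, 0, 0]]
print(mask)    # [[True, True, True], [True, False, False]]
```

A perf sanity case for this kind of path would typically vary the spread of `kv_lens` within a batch, since padding to the batch maximum is where uneven lengths cost the most.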
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tensorrt_llm/_torch/auto_deploy/models/custom/modeling_gemma3n.py`:
- Around line 886-923: The code only sets tokenizer.chat_template from a
chat_template.jinja file via cached_file(_CHAT_TEMPLATE_FILE); add a fallback to
read the chat template from the already-loaded tokenizer config (config) when
template_path is None: after the existing template_path check, if template_path
is None and config.get("chat_template") is truthy, set tokenizer.chat_template =
config["chat_template"]; ensure this logic lives alongside the existing
cached_file call (referencing cached_file, _CHAT_TEMPLATE_FILE, config, cls, and
tokenizer.chat_template) so file-based template still takes precedence over
tokenizer_config.json.
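
The precedence requested in this comment can be sketched as follows. This is an illustration under assumptions, not the code in `modeling_gemma3n.py`: `resolve_chat_template` is a hypothetical helper, it receives the already-resolved `template_path` as a parameter instead of calling `cached_file`, and `tokenizer_config` stands in for the loaded tokenizer config dictionary.

```python
import os
import tempfile
from pathlib import Path
from typing import Any, Dict, Optional


def resolve_chat_template(
    template_path: Optional[str], tokenizer_config: Dict[str, Any]
) -> Optional[str]:
    """Pick a chat template: the file wins, the tokenizer config is the fallback."""
    if template_path is not None:
        return Path(template_path).read_text(encoding="utf-8")
    template = tokenizer_config.get("chat_template")
    return template if template else None


# File present: it takes precedence over the config entry.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "chat_template.jinja")
    Path(path).write_text("template from file", encoding="utf-8")
    from_file = resolve_chat_template(path, {"chat_template": "template from config"})

# No file: fall back to the config entry, or to None when it is missing.
from_config = resolve_chat_template(None, {"chat_template": "template from config"})
missing = resolve_chat_template(None, {})

print(from_file)    # template from file
print(from_config)  # template from config
print(missing)      # None
```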


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f6647ce2-c5a7-4e0c-9e25-7412170a6dcd

📥 Commits

Reviewing files that changed from the base of the PR and between 5a985d5 and 9616eb3.

📒 Files selected for processing (8)
  • examples/auto_deploy/model_registry/configs/gemma3n_e2b_it.yaml
  • examples/auto_deploy/model_registry/configs/gemma4_e2b.yaml
  • tensorrt_llm/_torch/auto_deploy/custom_ops/attention/triton_paged_attention.py
  • tensorrt_llm/_torch/auto_deploy/models/custom/modeling_gemma3n.py
  • tensorrt_llm/_torch/auto_deploy/models/custom/modeling_gemma4.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_gemma4_modeling.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_shared_kv_attention.py
  • tests/unittest/auto_deploy/singlegpu/custom_ops/attention/test_triton_paged_attention.py

Comment thread tensorrt_llm/_torch/auto_deploy/models/custom/modeling_gemma3n.py Outdated
@bmarimuthu-nv bmarimuthu-nv requested a review from a team as a code owner May 1, 2026 03:36
@bmarimuthu-nv bmarimuthu-nv force-pushed the bala/fix-gemma2b branch 2 times, most recently from 7476a5c to ad48358 on May 5, 2026 23:30
@bmarimuthu-nv
Collaborator Author

/bot run --stage-list "A10-Build_Docs, A10-PackageSanityCheck-PY310-UB2204, A100X-PackageSanityCheck-PY312-UB2404, A30-AutoDeploy-1, H100_PCIe-AutoDeploy-1, DGX_B200-AutoDeploy-1, A100X-PyTorch-1, DGX_H100-4_GPUs-AutoDeploy-1, DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-Post-Merge-1, DGX_B200-8_GPUs-AutoDeploy-Post-Merge-1" --disable-fail-fast


@tensorrt-cicd
Collaborator

PR_Github #46908 [ run ] triggered by Bot. Commit: ad48358 Link to invocation

@bmarimuthu-nv
Collaborator Author

/bot run --stage-list "A10-Build_Docs, A10-PackageSanityCheck-PY310-UB2204, A100X-PackageSanityCheck-PY312-UB2404, A30-AutoDeploy-1, H100_PCIe-AutoDeploy-1, DGX_B200-AutoDeploy-1, A100X-PyTorch-1, DGX_H100-4_GPUs-AutoDeploy-1, DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-Post-Merge-1, DGX_B200-8_GPUs-AutoDeploy-Post-Merge-1" --disable-fail-fast

@bmarimuthu-nv
Collaborator Author

@coderabbitai summary

@coderabbitai
Contributor

coderabbitai Bot commented May 6, 2026

✅ Actions performed

Summary regeneration triggered.

@tensorrt-cicd
Collaborator

PR_Github #46917 [ run ] triggered by Bot. Commit: a2a7d5d Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #46908 [ run ] completed with state ABORTED. Commit: ad48358

Link to invocation

Comment thread examples/auto_deploy/model_registry/configs/gemma4_e2b.yaml
Comment thread tests/integration/defs/accuracy/test_llm_api_autodeploy.py
@tensorrt-cicd
Collaborator

PR_Github #46917 [ run ] completed with state SUCCESS. Commit: a2a7d5d
/LLM/main/L0_MergeRequest_PR pipeline #36923 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

Signed-off-by: Balamurugan Marimuthu <246387390+bmarimuthu-nv@users.noreply.github.com>
Signed-off-by: Balamurugan Marimuthu <246387390+bmarimuthu-nv@users.noreply.github.com>
Signed-off-by: Balamurugan Marimuthu <246387390+bmarimuthu-nv@users.noreply.github.com>
@bmarimuthu-nv
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #48006 [ run ] triggered by Bot. Commit: 81ab8d4 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #48006 [ run ] completed with state SUCCESS. Commit: 81ab8d4
/LLM/main/L0_MergeRequest_PR pipeline #37842 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@bmarimuthu-nv
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #48044 [ run ] triggered by Bot. Commit: 81ab8d4 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #48044 [ run ] completed with state SUCCESS. Commit: 81ab8d4
/LLM/main/L0_MergeRequest_PR pipeline #37878 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

Signed-off-by: Balamurugan Marimuthu <246387390+bmarimuthu-nv@users.noreply.github.com>
@bmarimuthu-nv
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #48424 [ run ] triggered by Bot. Commit: c8d88b5 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #48424 [ run ] completed with state FAILURE. Commit: c8d88b5

Link to invocation

@bmarimuthu-nv
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #48429 [ run ] triggered by Bot. Commit: c8d88b5 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #48429 [ run ] completed with state SUCCESS. Commit: c8d88b5
/LLM/main/L0_MergeRequest_PR pipeline #38229 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@bmarimuthu-nv
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #48619 [ run ] triggered by Bot. Commit: c8d88b5 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #48619 [ run ] completed with state SUCCESS. Commit: c8d88b5
/LLM/main/L0_MergeRequest_PR pipeline #38402 completed with status: 'SUCCESS'

CI Report

Link to invocation

Comment thread tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py Outdated
Comment thread tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py Outdated
Signed-off-by: Balamurugan Marimuthu <246387390+bmarimuthu-nv@users.noreply.github.com>
@bmarimuthu-nv
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #48997 [ run ] triggered by Bot. Commit: 1bdba70 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #48997 [ run ] completed with state SUCCESS. Commit: 1bdba70
/LLM/main/L0_MergeRequest_PR pipeline #38736 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@bmarimuthu-nv
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #49247 [ run ] triggered by Bot. Commit: 1bdba70 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #49247 [ run ] completed with state SUCCESS. Commit: 1bdba70
/LLM/main/L0_MergeRequest_PR pipeline #38916 completed with status: 'SUCCESS'

CI Report

Link to invocation

@bmarimuthu-nv bmarimuthu-nv merged commit b892451 into NVIDIA:main May 19, 2026
7 checks passed


5 participants