[None][feat] AutoDeploy: Onboard google/gemma-4-31B-it dense model, including nvfp4 (#12866)
Conversation
📝 Walkthrough: Added support for the Gemma 4 dense (31B) model variant through configuration, a model registry entry, and comprehensive unit tests validating decoder layers, full model outputs, the conditional generation wrapper, and export functionality.
Force-pushed from de4ee96 to 3e1986f.
1. Onboard google/gemma-4-31B-it dense model:
- Add dense-variant unit tests (decoder layer, full model,
ConditionalGeneration wrapper, torch.export) to test_gemma4_modeling.py
- Create gemma4_dense.yaml registry config (triton_paged, world_size=2)
- Add google/gemma-4-31B-it entry to models.yaml
2. Fix cudaErrorMisalignedAddress in Triton paged attention for head_dim>256:
- Force SDPA (cuDNN) path for layers with head_dim>256 by bypassing
num_seq and max_q_len thresholds
- Replace tl.make_block_ptr with masked loads to avoid TMA alignment issues
- Fix pages_uniform check to use kv_indptr instead of oversized kv_indices
3. Onboard nvidia/Gemma-4-31B-IT-NVFP4 model:
- Add FP8 KV cache support to triton_paged attention: cast k/v from fp8
to query dtype before tl.dot
- Fix SDPA gather path to cast k_sdpa/v_sdpa when kv cache dtype differs
- Fix *.embed_tokens exclusion in ModelOPTQuantConfigReader._ALWAYS_EXCLUDE
so tied lm_head weights are not accidentally NVFP4-quantized
Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
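The two guards described in item 2 can be sketched in plain Python. This is a minimal illustration, not the actual TensorRT-LLM kernel code: the helper names (`use_sdpa`, `pages_uniform`) and the exact threshold mechanics are assumptions, and `kv_indptr` is taken to be a CSR-style page-offset array of length `num_seq + 1`, as is common for paged KV caches.

```python
def use_sdpa(head_dim: int, meets_batch_thresholds: bool) -> bool:
    """Decide whether to take the SDPA (cuDNN) path.

    SDPA is normally gated on batch-shape thresholds (num_seq, max_q_len);
    head_dim > 256 bypasses those thresholds entirely, because the Triton
    paged kernel hits cudaErrorMisalignedAddress at that size.
    """
    if head_dim > 256:
        return True  # forced, thresholds ignored
    return meets_batch_thresholds


def pages_uniform(kv_indptr: list[int]) -> bool:
    """True iff every sequence occupies the same number of KV pages.

    Computed from kv_indptr (length num_seq + 1) rather than from the
    kv_indices buffer, which may be over-allocated past the live entries
    and would therefore give a wrong per-sequence page count.
    """
    pages_per_seq = [b - a for a, b in zip(kv_indptr, kv_indptr[1:])]
    return len(set(pages_per_seq)) <= 1
```

The FP8 KV-cache piece of item 3 follows the same shape: when the cache dtype differs from the query dtype, k/v blocks are cast to the query dtype before the dot product, since matmul operands must match.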
Force-pushed from 3e1986f to 349921c.
/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"

PR_Github #42495 [ run ] triggered by Bot. Commit:

PR_Github #42495 [ run ] completed with state
…nces in triton paged attention

The SDPA fast-path in triton_paged_context reshapes q from [total_tokens, n_heads, head_dim] to [num_seq, max_q_len, ...], which requires all sequences to have the same length. When multiple prefill sequences have different lengths (e.g., [924, 910, 923]), the reshape fails with "shape is invalid for input of size". Add an all_same_q_len guard (a zero-overhead CPU check) so variable-length batches fall through to the paged Triton kernel.

Also add unit tests covering: large-head_dim SDPA forcing, variable-length fallback, oversized kv_indices buffers, uniform-sequence SDPA path verification, and FP8 KV cache dtype casting.

Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
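The all_same_q_len guard described above can be sketched in plain Python. The name `cu_seqlens` (the cumulative sequence-length array, length num_seq + 1) is an assumption for illustration; the check is effectively free because it runs on metadata that is already host-resident.

```python
def all_same_q_len(cu_seqlens: list[int]) -> bool:
    """True iff all prefill sequences in the batch have equal length.

    Only then is it valid to reshape q from [total_tokens, n_heads, head_dim]
    to [num_seq, max_q_len, n_heads, head_dim] for the SDPA fast-path;
    otherwise the batch must fall through to the paged Triton kernel.
    """
    lens = [b - a for a, b in zip(cu_seqlens, cu_seqlens[1:])]
    return len(set(lens)) <= 1
```

For the failing example from the commit message, lengths [924, 910, 923] give cu_seqlens [0, 924, 1834, 2757], and the guard correctly rejects the reshape.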
@bmarimuthu-nv, @arysef PTAL

/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"

PR_Github #42563 [ run ] triggered by Bot. Commit:

PR_Github #42563 [ run ] completed with state
… support

Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"

PR_Github #42611 [ run ] triggered by Bot. Commit:

PR_Github #42611 [ run ] completed with state
/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"

PR_Github #42664 [ run ] triggered by Bot. Commit:

PR_Github #42664 [ run ] completed with state
/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"

PR_Github #42707 [ run ] triggered by Bot. Commit:

PR_Github #42707 [ run ] completed with state
/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"

PR_Github #42750 [ run ] triggered by Bot. Commit:

PR_Github #42750 [ run ] completed with state