
[None][feat] AutoDeploy: Onboard google/gemma-4-31B-it dense model, including nvfp4 #12866

Merged
bmarimuthu-nv merged 9 commits into NVIDIA:main from nv-auto-deploy:sg/gemma4-31B
Apr 11, 2026

Conversation

@suyoggupta
Collaborator

@suyoggupta suyoggupta commented Apr 9, 2026

  1. Onboard google/gemma-4-31B-it dense model:

    • Add dense-variant unit tests (decoder layer, full model,
      ConditionalGeneration wrapper, torch.export) to test_gemma4_modeling.py
    • Create gemma4_dense.yaml registry config (triton_paged, world_size=2)
    • Add google/gemma-4-31B-it entry to models.yaml
  2. Fix cudaErrorMisalignedAddress in Triton paged attention for head_dim>256:

    • Force SDPA (cuDNN) path for layers with head_dim>256 by bypassing
      num_seq and max_q_len thresholds
    • Replace tl.make_block_ptr with masked loads to avoid TMA alignment issues
    • Fix pages_uniform check to use kv_indptr instead of oversized kv_indices
  3. Onboard nvidia/Gemma-4-31B-IT-NVFP4 model:

    • Add FP8 KV cache support to triton_paged attention: cast k/v from fp8
      to query dtype before tl.dot
    • Fix SDPA gather path to cast k_sdpa/v_sdpa when kv cache dtype differs
    • Fix *.embed_tokens exclusion in ModelOPTQuantConfigReader._ALWAYS_EXCLUDE
      so tied lm_head weights are not accidentally NVFP4-quantized
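
The FP8 KV-cache change in item 3 can be sketched outside the Triton kernel. This is a minimal illustration with assumed names (`attention_scores`, `k_cache` are not the actual kernel symbols): cached keys/values stored in a narrower dtype are cast back to the query dtype before the attention dot product, the host-side equivalent of casting k/v before `tl.dot`.

```python
import torch

def attention_scores(q: torch.Tensor, k_cache: torch.Tensor) -> torch.Tensor:
    # Cast cached keys to the query dtype before the dot product, mirroring
    # the kernel-side cast of k/v before tl.dot when the KV cache is FP8.
    k = k_cache.to(q.dtype)
    return q @ k.transpose(-1, -2)

q = torch.randn(2, 8, 64)  # float32 queries for illustration
# Stand-in for an FP8 KV cache; fall back to bfloat16 if fp8 is unavailable.
cache_dtype = getattr(torch, "float8_e4m3fn", torch.bfloat16)
k_cache = torch.randn(2, 8, 64).to(cache_dtype)

scores = attention_scores(q, k_cache)
assert scores.dtype == q.dtype and scores.shape == (2, 8, 8)
```

Without the cast, the mixed-dtype matmul would fail; the same principle applies to the SDPA gather path fix, which casts `k_sdpa`/`v_sdpa` when the cache dtype differs from the query dtype.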

@suyoggupta suyoggupta requested a review from a team as a code owner April 9, 2026 00:53
@suyoggupta suyoggupta requested a review from taylor-yb-lee April 9, 2026 00:53
@coderabbitai
Contributor

coderabbitai bot commented Apr 9, 2026

📝 Walkthrough


Added support for Gemma 4 dense (31B) model variant through configuration, model registry entry, and comprehensive unit tests validating decoder layers, full model outputs, conditional generation wrapper, and export functionality.

Changes

  • Configuration — examples/auto_deploy/model_registry/configs/gemma4_dense.yaml: New Gemma 4 dense configuration file specifying model factory, tokenizer, Triton paged attention backend, CUDA graph compilation, inference limits, KV cache settings, and model transformation flags.
  • Model Registry — examples/auto_deploy/model_registry/models.yaml: Added model entry for google/gemma-4-31B-it with references to dashboard defaults, world size, and dense configuration files.
  • Dense Model Tests — tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_gemma4_modeling.py: Added a dense-only text config helper function and four new test functions validating decoder layer equivalence, full model outputs, the conditional generation wrapper, and model export with dynamic shapes.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 3 passed
  • Docstring Coverage — ✅ Passed: No functions found in the changed files to evaluate docstring coverage; skipping the check.
  • Description check — ✅ Passed: The pull request description provides a clear, structured overview of three main objectives with specific implementation details and test coverage information.
  • Title check — ✅ Passed: The title accurately describes the primary change: onboarding the google/gemma-4-31B-it dense model to AutoDeploy with supporting configurations and tests.



@suyoggupta suyoggupta requested a review from a team as a code owner April 9, 2026 06:46
@suyoggupta suyoggupta requested review from QiJune, arysef and kaiyux April 9, 2026 06:46
@suyoggupta suyoggupta changed the title from "[None][feat] AutoDeploy: Onboard google/gemma-4-31B-it dense model" to "[None][feat] AutoDeploy: Onboard google/gemma-4-31B-it dense model, including nvfp4" on Apr 9, 2026
@suyoggupta
Collaborator Author

/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"

@tensorrt-cicd
Collaborator

PR_Github #42495 [ run ] triggered by Bot. Commit: 10f2774

@tensorrt-cicd
Collaborator

PR_Github #42495 [ run ] completed with state SUCCESS. Commit: 10f2774
/LLM/main/L0_MergeRequest_PR pipeline #33243 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again


…nces in triton paged attention

The SDPA fast-path in triton_paged_context reshapes q from
[total_tokens, n_heads, head_dim] to [num_seq, max_q_len, ...], which
requires all sequences to have the same length. When multiple prefill
sequences have different lengths (e.g., [924, 910, 923]), the reshape
fails with "shape is invalid for input of size". Add an all_same_q_len
guard (zero-overhead CPU check) so variable-length batches fall through
to the paged Triton kernel. Also add unit tests covering: large head_dim
SDPA forcing, variable-length fallback, oversized kv_indices buffers,
uniform-sequence SDPA path verification, and FP8 KV cache dtype casting.

Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
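
The guard described in this commit message can be sketched as follows. This is an illustrative host-side sketch with assumed variable names (`q_lens`, `all_same_q_len` as a plain Python check), not the actual kernel dispatch code:

```python
import torch

# Per-sequence prefill lengths from the example in the commit message.
q_lens = [924, 910, 923]
n_heads, head_dim = 8, 128
q = torch.randn(sum(q_lens), n_heads, head_dim)

num_seq, max_q_len = len(q_lens), max(q_lens)
# Zero-overhead CPU check: the SDPA fast-path reshape from
# [total_tokens, n_heads, head_dim] to [num_seq, max_q_len, ...] is only
# valid when every prefill sequence has the same query length.
all_same_q_len = len(set(q_lens)) == 1

if all_same_q_len:
    q_sdpa = q.view(num_seq, max_q_len, n_heads, head_dim)
else:
    # Fall through to the paged Triton kernel; calling view() here would
    # raise "shape ... is invalid for input of size ...".
    q_sdpa = None
```

With uniform lengths (e.g. three sequences of 924 tokens) the reshape succeeds and the SDPA path is taken; with the mixed lengths above it is skipped.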
@suyoggupta
Collaborator Author

@bmarimuthu-nv , @arysef PTAL

@suyoggupta
Collaborator Author

/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"

Collaborator

@bmarimuthu-nv bmarimuthu-nv left a comment

LGTM, thanks!

@tensorrt-cicd
Collaborator

PR_Github #42563 [ run ] triggered by Bot. Commit: b43fd0d

@tensorrt-cicd
Collaborator

PR_Github #42563 [ run ] completed with state SUCCESS. Commit: b43fd0d
/LLM/main/L0_MergeRequest_PR pipeline #33297 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again


… support

Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
@suyoggupta
Collaborator Author

/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"

@tensorrt-cicd
Collaborator

PR_Github #42611 [ run ] triggered by Bot. Commit: b1206c8

@tensorrt-cicd
Collaborator

PR_Github #42611 [ run ] completed with state SUCCESS. Commit: b1206c8
/LLM/main/L0_MergeRequest_PR pipeline #33332 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again


@suyoggupta
Collaborator Author

/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"

@tensorrt-cicd
Collaborator

PR_Github #42664 [ run ] triggered by Bot. Commit: badeab5

@tensorrt-cicd
Collaborator

PR_Github #42664 [ run ] completed with state SUCCESS. Commit: badeab5
/LLM/main/L0_MergeRequest_PR pipeline #33373 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again


@suyoggupta
Collaborator Author

/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"

@tensorrt-cicd
Collaborator

PR_Github #42707 [ run ] triggered by Bot. Commit: f812ade

@tensorrt-cicd
Collaborator

PR_Github #42707 [ run ] completed with state SUCCESS. Commit: f812ade
/LLM/main/L0_MergeRequest_PR pipeline #33402 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again


@suyoggupta
Collaborator Author

/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"

@tensorrt-cicd
Collaborator

PR_Github #42750 [ run ] triggered by Bot. Commit: 986cff1

@tensorrt-cicd
Collaborator

PR_Github #42750 [ run ] completed with state SUCCESS. Commit: 986cff1
/LLM/main/L0_MergeRequest_PR pipeline #33428 completed with status: 'SUCCESS'

CI Report


@bmarimuthu-nv bmarimuthu-nv merged commit 2f02816 into NVIDIA:main Apr 11, 2026
5 checks passed