[None][feat] AutoDeploy: Onboard google/gemma-4-31B-it dense model, including nvfp4 (#12866)
Conversation
📝 Walkthrough: Added support for the Gemma 4 dense (31B) model variant through configuration, a model registry entry, and comprehensive unit tests validating decoder layers, full model outputs, the conditional generation wrapper, and export functionality.
Force-pushed from de4ee96 to 3e1986f.
1. Onboard google/gemma-4-31B-it dense model:
- Add dense-variant unit tests (decoder layer, full model,
ConditionalGeneration wrapper, torch.export) to test_gemma4_modeling.py
- Create gemma4_dense.yaml registry config (triton_paged, world_size=2)
- Add google/gemma-4-31B-it entry to models.yaml
2. Fix cudaErrorMisalignedAddress in Triton paged attention for head_dim>256:
- Force SDPA (cuDNN) path for layers with head_dim>256 by bypassing
num_seq and max_q_len thresholds
- Replace tl.make_block_ptr with masked loads to avoid TMA alignment issues
- Fix pages_uniform check to use kv_indptr instead of oversized kv_indices
3. Onboard nvidia/Gemma-4-31B-IT-NVFP4 model:
- Add FP8 KV cache support to triton_paged attention: cast k/v from fp8
to query dtype before tl.dot
- Fix SDPA gather path to cast k_sdpa/v_sdpa when kv cache dtype differs
- Fix *.embed_tokens exclusion in ModelOPTQuantConfigReader._ALWAYS_EXCLUDE
so tied lm_head weights are not accidentally NVFP4-quantized
Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
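The two guards described in item 2 can be sketched in plain Python. This is a minimal illustration, not the actual TensorRT-LLM kernel code: the helper names (`use_sdpa`, `pages_uniform`) and the exact threshold mechanics are assumptions, and `kv_indptr` is taken to be a CSR-style page-offset array of length `num_seq + 1`, as is common for paged KV caches.

```python
def use_sdpa(head_dim: int, meets_batch_thresholds: bool) -> bool:
    """Decide whether to take the SDPA (cuDNN) path.

    SDPA is normally gated on batch-shape thresholds (num_seq, max_q_len);
    head_dim > 256 bypasses those thresholds entirely, because the Triton
    paged kernel hits cudaErrorMisalignedAddress at that size.
    """
    if head_dim > 256:
        return True  # forced, thresholds ignored
    return meets_batch_thresholds


def pages_uniform(kv_indptr: list[int]) -> bool:
    """True iff every sequence occupies the same number of KV pages.

    Computed from kv_indptr (length num_seq + 1) rather than from the
    kv_indices buffer, which may be over-allocated past the live entries
    and would therefore give a wrong per-sequence page count.
    """
    pages_per_seq = [b - a for a, b in zip(kv_indptr, kv_indptr[1:])]
    return len(set(pages_per_seq)) <= 1
```

The FP8 KV-cache piece of item 3 follows the same shape: when the cache dtype differs from the query dtype, k/v blocks are cast to the query dtype before the dot product, since matmul operands must match.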
Force-pushed from 3e1986f to 349921c.
/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"

PR_Github #42495 [ run ] triggered by Bot. Commit:

PR_Github #42495 [ run ] completed with state
…nces in triton paged attention

The SDPA fast-path in triton_paged_context reshapes q from [total_tokens, n_heads, head_dim] to [num_seq, max_q_len, ...], which requires all sequences to have the same length. When multiple prefill sequences have different lengths (e.g., [924, 910, 923]), the reshape fails with "shape is invalid for input of size". Add an all_same_q_len guard (a zero-overhead CPU check) so variable-length batches fall through to the paged Triton kernel.

Also add unit tests covering: large-head_dim SDPA forcing, variable-length fallback, oversized kv_indices buffers, uniform-sequence SDPA path verification, and FP8 KV cache dtype casting.

Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
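The all_same_q_len guard described above can be sketched in plain Python. The name `cu_seqlens` (the cumulative sequence-length array, length num_seq + 1) is an assumption for illustration; the check is effectively free because it runs on metadata that is already host-resident.

```python
def all_same_q_len(cu_seqlens: list[int]) -> bool:
    """True iff all prefill sequences in the batch have equal length.

    Only then is it valid to reshape q from [total_tokens, n_heads, head_dim]
    to [num_seq, max_q_len, n_heads, head_dim] for the SDPA fast-path;
    otherwise the batch must fall through to the paged Triton kernel.
    """
    lens = [b - a for a, b in zip(cu_seqlens, cu_seqlens[1:])]
    return len(set(lens)) <= 1
```

For the failing example from the commit message, lengths [924, 910, 923] give cu_seqlens [0, 924, 1834, 2757], and the guard correctly rejects the reshape.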
@bmarimuthu-nv, @arysef PTAL

/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"

PR_Github #42563 [ run ] triggered by Bot. Commit:

PR_Github #42563 [ run ] completed with state
… support

Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"

PR_Github #42611 [ run ] triggered by Bot. Commit:

PR_Github #42611 [ run ] completed with state
/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"

PR_Github #42664 [ run ] triggered by Bot. Commit:

PR_Github #42664 [ run ] completed with state
/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"

PR_Github #42707 [ run ] triggered by Bot. Commit:

PR_Github #42707 [ run ] completed with state
/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"

PR_Github #42750 [ run ] triggered by Bot. Commit:

PR_Github #42750 [ run ] completed with state