[https://nvbugs/6185173][fix] Set mamba ssm cache to fp32 for NemotronV2 #14448
…Nano-9B-v2
Nemotron-Nano-9B-v2 has mamba_head_dim=80, which is unsupported by the
FlashInfer SSM kernel (only {64, 128}), so AutoDeploy falls back to the
Triton SSM backend. On H20/Hopper, accumulating bf16 state across an
entire prefill in a single forward pass produces a catastrophic accuracy
drop (0% MMLU for full prefill, ~65% for chunked prefill).
Pinning the SSM state cache to float32 keeps the recurrence numerically
stable. This is the same precision/perf trade-off that Nano-V3's config
documents ("use float32 for accuracy and default (auto) for speed").
Signed-off-by: tensorrt-cicd <90828364+tensorrt-cicd@users.noreply.github.com>
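The precision argument in the commit message can be illustrated with a toy scalar recurrence. This is only a sketch under stated assumptions: it is not the Triton SSM kernel, the decay and input values are arbitrary, and bf16 is emulated in pure Python by rounding a float32 bit pattern to its upper 16 bits. All function names are hypothetical.

```python
import struct


def to_fp32(x: float) -> float:
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x: float) -> float:
    """Emulate bf16 storage: keep the top 16 bits of the float32 pattern,
    rounding to nearest even (no NaN/overflow handling; demo only)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (((bits + rounding_bias) >> 16) << 16) & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def run_recurrence(steps: int, decay: float, inp: float, store) -> float:
    """Iterate h <- store(decay * h + inp), where `store` models the
    precision in which the running state is kept between steps."""
    h = 0.0
    for _ in range(steps):
        h = store(decay * h + inp)
    return h


# Arbitrary toy values: a long "prefill" with a slowly decaying state.
steps, decay, inp = 4096, 0.999, 0.01

ref = run_recurrence(steps, decay, inp, lambda v: v)  # float64 reference
err_fp32 = abs(run_recurrence(steps, decay, inp, to_fp32) - ref)
err_bf16 = abs(run_recurrence(steps, decay, inp, to_bf16) - ref)

print(f"reference state : {ref:.6f}")
print(f"fp32 state error: {err_fp32:.3e}")
print(f"bf16 state error: {err_bf16:.3e}")
```

In this toy setup the bf16 state stops growing once each per-step update becomes smaller than half the spacing between adjacent bf16 values, so it ends far from the reference, while the float32 state tracks it closely. The real kernel's behavior depends on details not shown here.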
📝 Walkthrough
The Nemotron Nano 9B v2 model configuration is updated to specify kv_cache_config.mamba_ssm_cache_dtype: float32.
Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
nvchenghaoz left a comment:
Confirmed that the fix in this PR resolves the issue. Tested on an H20 from computelab.
/bot run
PR_Github #50636 [ run ] triggered by Bot. Commit:
PR_Github #50636 [ run ] completed with state
Summary
Set kv_cache_config.mamba_ssm_cache_dtype: float32 in the model registry YAML for Nemotron-Nano-9B-v2 and remove the now-stale H20 waivers for the three TestNemotronV2 cases.
Test plan