
[https://nvbugs/6185173][fix] Set mamba ssm cache to fp32 for NemotronV2#14448

Merged
nvchenghaoz merged 2 commits into NVIDIA:main from tensorrt-cicd:repair-bot-bug6185173
May 28, 2026

Conversation

@tensorrt-cicd
Collaborator

@tensorrt-cicd tensorrt-cicd commented May 22, 2026

Summary

  • Root cause: Nemotron-Nano-9B-v2 has mamba_head_dim=80, unsupported by FlashInfer SSM kernel (only {64, 128}), forcing fallback to Triton SSM. With the default bf16 SSM state cache on H20/Hopper, accumulating recurrent state across an entire prefill in a single forward pass produces catastrophic numerical drift (0% MMLU on full prefill, ~65% on chunked prefill).
  • Fix: Set kv_cache_config.mamba_ssm_cache_dtype: float32 in the model registry YAML for Nemotron-Nano-9B-v2 and remove the now-stale H20 waivers for the three TestNemotronV2 cases.
  • Automated fix generated by repair-bot
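
The loss of precision described in the root cause can be illustrated with a toy scalar recurrence. This is only a sketch under stated assumptions: the recurrence `state = decay * state + update`, the constants, and the helper names are hypothetical and are not taken from the Triton SSM kernel. It only shows how rounding a recurrent state to bf16 at every step can discard small updates that fp32 keeps.

```python
import struct


def round_to_bf16(x: float) -> float:
    """Round to the nearest bfloat16 value (via float32, round-to-nearest-even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # rounding bias for the 16 dropped bits
    bits &= 0xFFFF0000                   # keep sign, exponent, 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def round_to_fp32(x: float) -> float:
    """Round a Python float (double) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def run_recurrence(store, steps=10_000, decay=0.999, update=0.01):
    """Toy recurrence (hypothetical): the state is re-stored in the cache dtype each step."""
    state = 0.0
    for _ in range(steps):
        state = store(decay * state + update)
    return state


fp32_state = run_recurrence(round_to_fp32)
bf16_state = run_recurrence(round_to_bf16)
print(f"fp32 state: {fp32_state:.4f}")
print(f"bf16 state: {bf16_state:.4f}")
```

In exact arithmetic this recurrence converges to update / (1 - decay) = 10. The fp32 state gets close to that value, while the bf16 state stalls far below it once the per-step change falls under half the spacing between adjacent bf16 values.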

Test plan

  • Verify fix on the same GPU type as the original failure
  • Check for regressions in related tests

Summary by CodeRabbit

  • Improvements
    • Updated Nemotron Nano 9B v2 model configuration settings.
    • Improved test reliability with removal of test waivers.

…Nano-9B-v2

Nemotron-Nano-9B-v2 has mamba_head_dim=80, which is unsupported by the
FlashInfer SSM kernel (only {64, 128}), so AutoDeploy falls back to the
Triton SSM backend. On H20/Hopper, accumulating bf16 state across an
entire prefill in a single forward pass produces catastrophic accuracy
drop (0% MMLU for full prefill, ~65% for chunked prefill).

Pinning the SSM state cache to float32 keeps the recurrence numerically
stable. This is the same precision/perf trade-off that Nano-V3's config
documents ("use float32 for accuracy and default (auto) for speed").

Signed-off-by: tensorrt-cicd <90828364+tensorrt-cicd@users.noreply.github.com>
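
The backend fallback described in the commit message can be sketched as a simple lookup. This is an assumption-laden illustration: `select_ssm_backend` and `FLASHINFER_SUPPORTED_HEAD_DIMS` are hypothetical names, not AutoDeploy APIs. Only the supported head dimensions {64, 128} and the Triton fallback come from the PR text.

```python
# Head dimensions the PR description lists as supported by the FlashInfer SSM kernel.
FLASHINFER_SUPPORTED_HEAD_DIMS = frozenset({64, 128})


def select_ssm_backend(mamba_head_dim: int) -> str:
    """Hypothetical selector: use FlashInfer when the head dim is supported, else Triton."""
    if mamba_head_dim in FLASHINFER_SUPPORTED_HEAD_DIMS:
        return "flashinfer"
    return "triton"


# Nemotron-Nano-9B-v2 has mamba_head_dim=80, so it takes the fallback path.
print(select_ssm_backend(80))
```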
@coderabbitai
Contributor

coderabbitai Bot commented May 22, 2026

Walkthrough

The Nemotron Nano 9B v2 model configuration is updated to specify float32 as the dtype for the Mamba SSM state cache, with documentation explaining kernel compatibility and long-context behavior requirements. Three corresponding test waivers are removed to enable previously skipped accuracy tests.

Changes

Nemotron SSM Cache Configuration

| Layer / File(s) | Summary |
|---|---|
| SSM cache dtype configuration and test waiver removal (examples/auto_deploy/model_registry/configs/nemotron-nano-9b-v2.yaml, tests/integration/test_lists/waives.txt) | mamba_ssm_cache_dtype: float32 is added to the model's kv_cache_config with comments explaining the Triton kernel constraint and avoidance of bf16 underflow in long-context cases. Three test waivers for TestNemotronV2 variants (test_auto_dtype[False], test_auto_dtype[True], test_fp8[True]) are removed, enabling the tests to run. |
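
For readers unfamiliar with the dotted form kv_cache_config.mamba_ssm_cache_dtype, the sketch below expands it into the nested mapping it denotes. The helper `expand_dotted_key` is hypothetical and is not part of TensorRT-LLM. The rest of the YAML file is not shown in this PR page, so only the key path and the value float32 are taken from the source.

```python
def expand_dotted_key(dotted_key: str, value):
    """Hypothetical helper: turn 'a.b' and a value into {'a': {'b': value}}."""
    nested = value
    for part in reversed(dotted_key.split(".")):
        nested = {part: nested}
    return nested


# Key path and value as named in the PR summary.
override = expand_dotted_key("kv_cache_config.mamba_ssm_cache_dtype", "float32")
print(override)  # {'kv_cache_config': {'mamba_ssm_cache_dtype': 'float32'}}
```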

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Suggested reviewers

  • crazydemo
  • mikeiovine
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description check | ✅ Passed | The description includes root cause analysis, the fix applied, test verification, and relevant bug links, meeting the essential information requirements despite missing some template sections. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Title check | ✅ Passed | The title clearly and specifically refers to the main change: setting the mamba SSM cache to fp32 for NemotronV2 in the model registry configuration. |

Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
Collaborator

@nvchenghaoz nvchenghaoz left a comment

Confirmed that the fix in this PR resolves the issue. Tested on an H20 from computelab.

@nvchenghaoz nvchenghaoz changed the title from "[https://nvbugs/6185173][fix] Set kv_cache_config.mamba_ssm_cache_dtype: float32 in the model registry YAML" to "[https://nvbugs/6185173][fix] Set mamba ssm cache to fp32 for NemotronV2" on May 27, 2026
@nvchenghaoz
Collaborator

/bot run

@nvchenghaoz nvchenghaoz enabled auto-merge (squash) May 27, 2026 23:53
@tensorrt-cicd
Collaborator Author

PR_Github #50636 [ run ] triggered by Bot. Commit: d3e52b9 Link to invocation

@tensorrt-cicd
Collaborator Author

PR_Github #50636 [ run ] completed with state SUCCESS. Commit: d3e52b9
/LLM/main/L0_MergeRequest_PR pipeline #40129 completed with status: 'SUCCESS'

CI Report

Link to invocation

@nvchenghaoz nvchenghaoz merged commit fc3b69e into NVIDIA:main May 28, 2026
12 of 15 checks passed
3 participants