[https://nvbugs/6094108][fix] Fix Qwen3-30B-A3B NVFP4 tep4 CUTLASS MoE test OOM on B300 #13349
…OM on B300

The test_nvfp4[tep4_latency_moe_cutlass] variant OOMs on B300 GPUs with the default KV cache memory fraction of 0.9, because the CUTLASS MoE NVFP4 backend with EP4 + CUDA graphs requires significant GPU memory for MoE workspaces, NCCL buffers, and CUDA graph captures, leaving insufficient headroom.

Add KvCacheConfig(free_gpu_memory_fraction=0.8) to reduce KV cache allocation and prevent OOM. This matches the pattern used by other multi-GPU MoE tests in the same file.

Verified: MMLU accuracy 79.459 (threshold 77.713) and GSM8K accuracy 85.52 (threshold 80.227) both pass on B300 4-GPU configuration.

Signed-off-by: tensorrt-cicd <90828364+tensorrt-cicd@users.noreply.github.com>
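To make the headroom argument in the commit message concrete, here is a minimal sketch of the arithmetic, followed by one way the setting might be passed to the LLM API. The free-memory figure (100 GiB) and the helper names `split_free_memory` and `build_llm` are hypothetical and do not come from this PR; only `KvCacheConfig(free_gpu_memory_fraction=0.8)` is taken from the commit message. The actual memory needed by MoE workspaces, NCCL buffers, and CUDA graph captures is not stated here, so the numbers only illustrate how the split changes.

```python
def split_free_memory(free_mib: int, kv_fraction: float) -> tuple[int, int]:
    """Split free GPU memory into (KV cache budget, remaining headroom), in MiB.

    kv_fraction plays the role of free_gpu_memory_fraction: the share of
    currently free GPU memory that is handed to the KV cache.
    """
    if not 0.0 < kv_fraction <= 1.0:
        raise ValueError("kv_fraction must be in (0, 1]")
    kv_budget = round(free_mib * kv_fraction)
    return kv_budget, free_mib - kv_budget


def build_llm(model_path: str, kv_fraction: float = 0.8):
    """Hypothetical helper: construct an LLM with an explicit KV cache fraction.

    Assumes TensorRT-LLM is installed. The imports are deferred so the
    arithmetic above can be used without it.
    """
    from tensorrt_llm import LLM
    from tensorrt_llm.llmapi import KvCacheConfig

    kv_cache_config = KvCacheConfig(free_gpu_memory_fraction=kv_fraction)
    return LLM(model=model_path, kv_cache_config=kv_cache_config)


if __name__ == "__main__":
    # Assumed figure for illustration only, not a measured B300 value.
    hypothetical_free_mib = 100 * 1024
    for fraction in (0.9, 0.8):
        kv_budget, headroom = split_free_memory(hypothetical_free_mib, fraction)
        print(f"fraction={fraction}: KV cache {kv_budget} MiB, headroom {headroom} MiB")
```

Under these assumed numbers, lowering the fraction from 0.9 to 0.8 doubles the memory left outside the KV cache, which is the kind of extra room the commit message says the CUTLASS MoE path needs.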
No actionable comments were generated in the recent review.
Walkthrough: An NVFP4 accuracy test is updated to explicitly configure KV cache settings on LLM construction, replacing the prior default behavior with a …
Estimated code review effort: 1 (Trivial), ~3 minutes
Pre-merge checks: 4 passed, 1 failed (1 warning)
/bot run
PR_Github #47345 [ run ] triggered by Bot. Commit:
/bot kill
/bot run --stage-list "DGX_B200-4_GPUs-PyTorch-Post-Merge-1,DGX_B200-4_GPUs-PyTorch-Post-Merge-2"
PR_Github #47348 [ kill ] triggered by Bot. Commit:
PR_Github #47349 [ run ] triggered by Bot. Commit:
PR_Github #47348 [ kill ] completed with state
PR_Github #47349 [ run ] completed with state
@StanleySun639 The CI stages I do not have the permission to merge so I will leave the judgement to you. Thanks!
/bot run
PR_Github #48847 [ run ] triggered by Bot. Commit:
PR_Github #48847 [ run ] completed with state
/bot run
PR_Github #48881 [ run ] triggered by Bot. Commit:
PR_Github #48881 [ run ] completed with state
/bot run
PR_Github #49088 [ run ] triggered by Bot. Commit:
PR_Github #49088 [ run ] completed with state
/bot run
PR_Github #49210 [ run ] triggered by Bot. Commit:
PR_Github #49210 [ run ] completed with state
Summary
Add KvCacheConfig(free_gpu_memory_fraction=0.8) to cap KV cache memory usage at 80% of free GPU memory, resolving the OOM without reducing max_batch_size from 32. A previous repair attempt had halved max_batch_size to 16 alongside the memory fraction fix, which inadvertently changed scheduler batching behavior and degraded GSM8K accuracy from 85.52 to 75.25; this fix preserves the original batch size to maintain accuracy.
Test plan
Links