[https://nvbugs/6094108][fix] Fix Qwen3-30B-A3B NVFP4 tep4 CUTLASS MoE test OOM on B300 #13349

Merged: StanleySun639 merged 1 commit into NVIDIA:main from tensorrt-cicd:repair-bot-bug6094108 on May 20, 2026

@tensorrt-cicd
Collaborator

tensorrt-cicd commented Apr 22, 2026

Summary

  • Fix for NVBugs 6094108: [TensorRT-LLM][main]: TestQwen3_30B_A3B::test_nvfp4[tep4_latency_moe_cutlass-torch_compile=False] is failure
  • Root cause: The Qwen3-30B-A3B nvfp4 test with tp=4/ep=4 CUTLASS MoE ran out of GPU memory on GB300, causing a segfault during MPI finalize. The default KV cache allocation consumed too much GPU memory, leaving insufficient space for model weights and activation buffers during execution.
  • Fix: Added an explicit KvCacheConfig(free_gpu_memory_fraction=0.8) to cap KV cache memory usage at 80% of free GPU memory, resolving the OOM without reducing max_batch_size from 32. A previous repair attempt had halved max_batch_size to 16 alongside the memory fraction fix, which inadvertently changed scheduler batching behavior and degraded GSM8K accuracy from 85.52 to 75.25; this fix preserves the original batch size to maintain accuracy.
  • Automated fix generated by repair-bot
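The effect of lowering the fraction can be sketched with simple arithmetic. This is a rough illustration, assuming the KV cache pool is sized as a fraction of the memory that is free when the pool is allocated. Only the 0.8 override comes from this PR; the 0.9 baseline, the memory figures, and the helper names are hypothetical placeholders, not measurements from the failing run.

```python
def kv_cache_budget_gib(free_gib: float, fraction: float) -> float:
    """Memory reserved for the KV cache pool, given free memory and the fraction."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    return free_gib * fraction


def headroom_gib(free_gib: float, fraction: float) -> float:
    """Memory left for everything else once the KV cache pool is reserved."""
    return free_gib - kv_cache_budget_gib(free_gib, fraction)


# Hypothetical numbers for illustration only; not measured on the test machine.
FREE_AFTER_WEIGHTS_GIB = 200.0  # assumed free memory once weights are loaded
RUNTIME_OVERHEAD_GIB = 30.0     # assumed activation buffers and other runtime usage

for fraction in (0.9, 0.8):
    left = headroom_gib(FREE_AFTER_WEIGHTS_GIB, fraction)
    fits = left >= RUNTIME_OVERHEAD_GIB
    print(f"fraction={fraction}: headroom={left:.1f} GiB, fits={fits}")
```

With these assumed figures the larger fraction leaves less headroom than the runtime needs, while 0.8 leaves enough, which is the shape of the failure described above.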

Test plan

  • Verify fix on the same GPU type as the original failure
  • Check for regressions in related tests

Links

Summary by CodeRabbit

  • Tests
    • Updated KV cache configuration in LLM accuracy tests to optimize memory utilization during testing.

…OM on B300

The test_nvfp4[tep4_latency_moe_cutlass] variant OOMs on B300 GPUs
with the default KV cache memory fraction of 0.9, because the CUTLASS
MoE NVFP4 backend with EP4 + CUDA graphs requires significant GPU
memory for MoE workspaces, NCCL buffers, and CUDA graph captures,
leaving insufficient headroom.

Add KvCacheConfig(free_gpu_memory_fraction=0.8) to reduce KV cache
allocation and prevent OOM. This matches the pattern used by other
multi-GPU MoE tests in the same file.

Verified: MMLU accuracy 79.459 (threshold 77.713) and GSM8K accuracy
85.52 (threshold 80.227) both pass on B300 4-GPU configuration.
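
The pass/fail outcome can be checked against the quoted numbers. This is a minimal sketch, assuming a benchmark passes when its score is at or above the threshold; the accuracy harness's real comparison is not shown on this page, and `failing_benchmarks` is a hypothetical helper. The 75.25 figure is the GSM8K score the PR description reports for the earlier attempt that halved max_batch_size.

```python
# Thresholds and scores as quoted in the PR text.
THRESHOLDS = {"MMLU": 77.713, "GSM8K": 80.227}


def failing_benchmarks(scores: dict) -> list:
    """Return the benchmarks whose score falls below its threshold."""
    return [name for name, score in scores.items() if score < THRESHOLDS[name]]


this_fix = {"MMLU": 79.459, "GSM8K": 85.52}
earlier_attempt = {"GSM8K": 75.25}  # max_batch_size halved to 16

print("this fix fails:", failing_benchmarks(this_fix))
print("earlier attempt fails:", failing_benchmarks(earlier_attempt))
```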

Signed-off-by: tensorrt-cicd <90828364+tensorrt-cicd@users.noreply.github.com>
@tensorrt-cicd tensorrt-cicd requested a review from a team as a code owner April 22, 2026 20:54
@coderabbitai
Contributor

coderabbitai Bot commented Apr 22, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 1e9bb79a-0178-4869-acc4-7d5e473d83c2

📥 Commits

Reviewing files that changed from the base of the PR and between 7a8bd87 and d1c9c7b.

📒 Files selected for processing (1)
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py

📝 Walkthrough

An NVFP4 accuracy test is updated to explicitly configure KV cache settings on LLM construction, replacing the prior default behavior with a KvCacheConfig specifying free_gpu_memory_fraction=0.8.

Changes

Cohort / File(s) | Summary
Test KV Cache Configuration: tests/integration/defs/accuracy/test_llm_api_pytorch.py | Updated LLM initialization to explicitly pass KvCacheConfig(free_gpu_memory_fraction=0.8) instead of relying on default KV cache behavior.
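
The shape of the change can be sketched as follows. The diff itself is not shown on this page, so this is an illustration under stated assumptions rather than the test's actual code: `build_llm_kwargs`, `make_llm`, the `kv_cache_fraction` key, and the model path argument are hypothetical, and the real test passes further options that are omitted here. Only free_gpu_memory_fraction=0.8, tp=4/ep=4, and max_batch_size=32 come from the PR text.

```python
from typing import Any, Dict


def build_llm_kwargs(model_path: str) -> Dict[str, Any]:
    """Collect the constructor arguments described in the PR (hypothetical helper)."""
    return {
        "model": model_path,
        "tensor_parallel_size": 4,      # tp=4
        "moe_expert_parallel_size": 4,  # ep=4
        "max_batch_size": 32,           # kept at 32 to preserve accuracy
        "kv_cache_fraction": 0.8,       # cap KV cache at 80% of free GPU memory
    }


def make_llm(model_path: str):
    """Build the LLM; requires TensorRT-LLM and four GPUs to actually run."""
    # Imported lazily so the helper above works without TensorRT-LLM installed.
    from tensorrt_llm import LLM
    from tensorrt_llm.llmapi import KvCacheConfig

    kwargs = build_llm_kwargs(model_path)
    fraction = kwargs.pop("kv_cache_fraction")
    kv_cache_config = KvCacheConfig(free_gpu_memory_fraction=fraction)
    return LLM(kv_cache_config=kv_cache_config, **kwargs)
```

Keeping the fraction explicit at the call site makes the memory cap visible in the test itself instead of depending on the library default.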

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name | Status | Explanation | Resolution
Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name | Status | Explanation
Title check | ✅ Passed | The title clearly identifies the specific issue (Qwen3-30B-A3B NVFP4 test OOM on B300), includes a proper NVBugs reference, and uses the correct '[fix]' type tag.
Description check | ✅ Passed | The PR description follows the template structure with clear Summary, Test plan, and Links sections. It explains the root cause, the fix implementation, and why a previous approach was avoided.
Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request.


@rosenrodt
Collaborator

/bot run

@tensorrt-cicd
Collaborator Author

PR_Github #47345 [ run ] triggered by Bot. Commit: d1c9c7b Link to invocation

@rosenrodt
Collaborator

/bot kill

@rosenrodt
Collaborator

/bot run --stage-list "DGX_B200-4_GPUs-PyTorch-Post-Merge-1,DGX_B200-4_GPUs-PyTorch-Post-Merge-2"

@tensorrt-cicd
Collaborator Author

PR_Github #47348 [ kill ] triggered by Bot. Commit: d1c9c7b Link to invocation

@tensorrt-cicd
Collaborator Author

PR_Github #47349 [ run ] triggered by Bot. Commit: d1c9c7b Link to invocation

@tensorrt-cicd
Collaborator Author

PR_Github #47348 [ kill ] completed with state ABORTED. Commit: d1c9c7b

Link to invocation

@tensorrt-cicd
Collaborator Author

PR_Github #47349 [ run ] completed with state SUCCESS. Commit: d1c9c7b
/LLM/main/L0_MergeRequest_PR pipeline #37285 (Partly Tested) completed with status: 'SUCCESS'

CI Report

Link to invocation

@rosenrodt
Collaborator

@StanleySun639 The CI stages DGX_B200-4_GPUs-PyTorch-Post-Merge-1 and DGX_B200-4_GPUs-PyTorch-Post-Merge-2 passed. Since this PR affects only those stages, I think we can skip the rest and merge.

I do not have permission to merge, so I will leave the decision to you. Thanks!

@rosenrodt
Collaborator

/bot run

@tensorrt-cicd
Collaborator Author

PR_Github #48847 [ run ] triggered by Bot. Commit: d1c9c7b Link to invocation

@tensorrt-cicd
Collaborator Author

PR_Github #48847 [ run ] completed with state SUCCESS. Commit: d1c9c7b
/LLM/main/L0_MergeRequest_PR pipeline #38601 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@rosenrodt
Collaborator

/bot run

@tensorrt-cicd
Collaborator Author

PR_Github #48881 [ run ] triggered by Bot. Commit: d1c9c7b Link to invocation

@tensorrt-cicd
Collaborator Author

PR_Github #48881 [ run ] completed with state SUCCESS. Commit: d1c9c7b
/LLM/main/L0_MergeRequest_PR pipeline #38631 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@rosenrodt
Collaborator

/bot run

@tensorrt-cicd
Collaborator Author

PR_Github #49088 [ run ] triggered by Bot. Commit: d1c9c7b Link to invocation

@tensorrt-cicd
Collaborator Author

PR_Github #49088 [ run ] completed with state SUCCESS. Commit: d1c9c7b
/LLM/main/L0_MergeRequest_PR pipeline #38806 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@rosenrodt
Collaborator

/bot run

@tensorrt-cicd
Collaborator Author

PR_Github #49210 [ run ] triggered by Bot. Commit: d1c9c7b Link to invocation

@tensorrt-cicd
Collaborator Author

PR_Github #49210 [ run ] completed with state SUCCESS. Commit: d1c9c7b
/LLM/main/L0_MergeRequest_PR pipeline #38884 completed with status: 'SUCCESS'

CI Report

Link to invocation

@StanleySun639 StanleySun639 merged commit d724c68 into NVIDIA:main May 20, 2026
8 checks passed
3 participants