[https://nvbugs/6168136][fix] Unwaive GPT-OSS test_w4_4gpus dp4-trtllm-fp8 #14118
…_cache-dp4-trtllm-fp8]

The test was waived after a single round-4 pytest-timeout in pre-merge build #37652 (PR NVIDIA#13910). Investigation shows that's a false-positive auto-file:

- Same build's rounds 1-3 all PASSED this test (14:43 / 16:21 / 19:01) but their results.xml were dropped at SLURM TimeLimit with the Jenkins log line 'Stage is interrupted, skip to upload test result.' Only round 4's results.xml reached the bot, and on round 4 the test happened to hit pytest's per-test --timeout=3600 instead of finishing in the usual ~10-20 min.
- Reproduced on dlcluster 4x GB200: 11/11 PASS (job 1003130: 1 PASS at 13:29; job 1003278: 10-iter loop, all PASS, 8:42-16:53, mean 10:44).
- PR NVIDIA#13910 is a 2-file gc-threshold refactor with no GPT-OSS / MoE / KV-cache / test-infra code touched, so it cannot be the cause; it was approved and merged to main ~10h after this NVBug was filed.

Across all known executions of this exact parameterization on real GB200, 14/15 are PASS; the one cliff event was on nvl72129-T18, a node not reachable from the lease used for the repro.

Signed-off-by: Xiao Wang <24860335+xwang233@users.noreply.github.com>
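The 14/15 figure can be re-derived from the counts quoted in the commit message. The snippet below is a minimal sketch and not part of the PR: the dictionary keys and variable names are illustrative, and only the counts (3 passing CI rounds, 1 timed-out round, 11 passing repro runs), the 3600 s timeout, and the 16:53 slowest repro time come from the text above. It assumes 16:53 is a minutes:seconds duration.

```python
# Illustrative tally of the executions cited in the commit message.
# All names here are hypothetical; only the numbers come from the text.
runs = {
    "pre-merge #37652, rounds 1-3": {"passed": 3, "failed": 0},
    "pre-merge #37652, round 4 (pytest timeout)": {"passed": 0, "failed": 1},
    "dlcluster job 1003130": {"passed": 1, "failed": 0},
    "dlcluster job 1003278 (10-iter loop)": {"passed": 10, "failed": 0},
}

passed = sum(r["passed"] for r in runs.values())
total = sum(r["passed"] + r["failed"] for r in runs.values())
pass_rate = passed / total

# Compare the per-test timeout with the slowest reproduced run,
# assuming 16:53 means 16 minutes 53 seconds.
timeout_s = 3600
slowest_repro_s = 16 * 60 + 53
slowdown_floor = timeout_s / slowest_repro_s

print(f"{passed}/{total} PASS ({pass_rate:.1%})")
print(f"timed-out run was at least {slowdown_floor:.1f}x slower than the slowest repro")
```

Under these assumptions the tally matches the 14/15 quoted above, and the single timeout is several times slower than any reproduced run, which is consistent with describing it as a one-off cliff event.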
5e752d1 to 1aa55e7
|
No actionable comments were generated in the recent review.

Recent review info: configuration path .coderabbit.yaml; review profile CHILL; plan Enterprise. Files selected for processing: 1. Files with no reviewable changes: 1.

Walkthrough: This PR removes a single test waiver (Changes: Test Waiver Update).

Estimated code review effort: 1 (Trivial), ~2 minutes.

Pre-merge checks: 5 passed.
|
/bot run --stage-list "GB200-4_GPUs-PyTorch-2" |
|
PR_Github #48303 [ run ] triggered by Bot. Commit: |
|
PR_Github #48303 [ run ] completed with state |
|
/bot reuse-pipeline |
|
PR_Github #48521 [ reuse-pipeline ] triggered by Bot. Commit: |
|
PR_Github #48521 [ reuse-pipeline ] completed with state |
Summary
- Remove TestGPTOSS::test_w4_4gpus[v1_kv_cache-dp4-trtllm-fp8] from tests/integration/test_lists/waives.txt.
- Can't Reproduce.
Test plan
Trigger pre-merge CI with /bot run --stage-list "GB200-4_GPUs-PyTorch-2" to exercise the un-waived test.
Summary by CodeRabbit