[None][test] promote DeepSeek-V4-Flash to MoE CI config subset#13964
Swap DeepSeek-V3 and DeepSeek-V4-Flash in the MoE module test config lists in tests/unittest/_torch/modules/moe/test_moe_module.py. DeepSeek-V4-Flash now runs in the default CI subset (TRTLLM_TEST_MOE_CI=1), while DeepSeek-V3 is exercised only in the local full matrix (TRTLLM_TEST_MOE_CI=0). Signed-off-by: xxi <xxi@nvidia.com>
Walkthrough
Test model configuration matrices for MoE modules are rebalanced by swapping two expert/shape variants: CI runs now test DeepSeek-V4-Flash while local runs test DeepSeek-V3 instead.
Changes: MoE Test Configuration
Estimated code review effort: 1 (Trivial) | ~2 minutes
Pre-merge checks: 5 passed
/bot run --disable-fail-fast
PR_Github #47637 [ run ] triggered by Bot. Commit:
PR_Github #47637 [ run ] completed with state
b14dc8b to 5629e0d
5629e0d to 4ceda0a
/bot run --disable-fail-fast
PR_Github #48131 [ run ] triggered by Bot. Commit:
…RTLLMGen on B300

Two test-side fixes for failures from PR 13964's first CI run (L0_MergeRequest_PR/37541 -> child L0_Test-x86_64-Single-GPU/899):

1) DGX_B200-PyTorch-2 stage TIMEOUT in test_unittests.py::test_unittests_v2[unittest/_torch/modules/moe/test_moe_module.py::test_configurable_moe_single_gpu -k "CUTLASS"]:

Promoting DeepSeek-V4-Flash (256, 6, 4096, 2048) into CI_MOE_MODEL_CONFIGS caused the CI sub-test matrix to explode for the e256 path because moe_test_utils.should_skip_to_accelerate_ci() gated Rule-1 minimal coverage on hidden_size >= 7168 in addition to num_experts >= 256. V4-Flash has hidden_size=4096 < 7168, so it escaped Rule-1 and ran the full dtype x seq_len x swiglu x routing matrix (~60 CUTLASS sub-tests vs ~4 under Rule-1 for DeepSeek-V3), busting the per-stage Slurm wall-clock budget on B200.

Drop the hidden_size threshold so any e256-class model triggers Rule-1 minimal coverage. V4-Flash stays in CI as the e256 signal, with the same minimal coverage envelope (DeepSeekV3 routing, bfloat16, seq=1, non-gptoss SwiGLU) that DeepSeek-V3 had before.

2) B300-PyTorch-1 stage FAILURE in test_unittests_v2[test_moe_backend.py::test_moe_backend -k "TRTLLM"]:

TRTLLMGen MoE on B300 (SM103) hits an illegal memory access during tactic autotune. PR head commit 5629e0d partially mitigates by blacklisting tactic [tileN=32, configIndex=5] in cpp/.../trtllmGenKernels/blockScaleMoe/runner.h, but full coverage of other potentially failing tactics is not yet validated end-to-end. Skip all TRTLLMGen tests on B300 (SM103) via should_skip_trtllm() with a [Bug] marker until the fix is verified.

Signed-off-by: xxi <xxi@nvidia.com>
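The Rule-1 gate change described in item 1 can be sketched as follows. This is a minimal illustration, not the code in moe_test_utils: the `MoeModelConfig` stand-in and both function names are hypothetical, and only the two thresholds and the two model shapes come from the commit message.

```python
from typing import NamedTuple


class MoeModelConfig(NamedTuple):
    """Stand-in for the test config tuple (field order taken from the PR text)."""
    num_experts: int
    top_k: int
    hidden_size: int
    intermediate_size: int


DEEPSEEK_V3 = MoeModelConfig(256, 8, 7168, 2048)
DEEPSEEK_V4_FLASH = MoeModelConfig(256, 6, 4096, 2048)


def rule1_applies_before(cfg: MoeModelConfig) -> bool:
    # Hypothetical name. Old gate: both thresholds had to hold.
    return cfg.num_experts >= 256 and cfg.hidden_size >= 7168


def rule1_applies_after(cfg: MoeModelConfig) -> bool:
    # Hypothetical name. New gate: any e256-class model gets minimal coverage.
    return cfg.num_experts >= 256


for name, cfg in [("DeepSeek-V3", DEEPSEEK_V3), ("DeepSeek-V4-Flash", DEEPSEEK_V4_FLASH)]:
    print(name, rule1_applies_before(cfg), rule1_applies_after(cfg))
```

Under the old gate V4-Flash falls outside Rule-1 because its hidden_size is 4096, which is why its full sub-test matrix ran; under the new gate both models are reduced to the minimal envelope.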
Address review feedback on the prior commit's overly broad B300 skip: should_skip_trtllm() was skipping every TRTLLMGen test on SM103, which masks more cases than the actual bug.

Code evidence narrows the failing case to W4A16_MXFP4 + bf16 activation:

* PR NVIDIA#13964 head commit 5629e0d's Python-side diff only touches Bf16MxE2m1BlockScaleMoERunner.get_valid_tactics (arg order: (num_experts, num_tokens) was swapped vs the C++ getValidConfigIndices signature). That runner is reached only via bf16_mxe2m1_block_scale_moe_runner in moe_op_backend.py, i.e. the W4A16_MXFP4 path. TRTLLMGenFusedMoE.can_implement() hard-requires bf16 activation, so dtype is implicit.
* The C++ side of 5629e0d (isKnownInvalidBlockScaleMoeTactic in runner.h, wired into fp4/fp8/mxFp4 BlockScaleMoe runners) already blacklists tactic [tileN=32, configIndex=5] for SM103 across all TRTLLMGen quant_algos. So other TRTLLMGen quants on B300 should no longer expose the IMA.

Tighten the skip predicate to ``get_sm_version() == 103 and quant_algo == QuantAlgo.W4A16_MXFP4``. This keeps NVFP4 / FP8_BLOCK_SCALES / W4A8_MXFP4_MXFP8 / etc. running on B300 (where the C++ blacklist is now the only mitigation), and only mutes the W4A16_MXFP4 path that the Python fix targets, until the end-to-end fix is verified.

Signed-off-by: xxi <xxi@nvidia.com>
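As a rough sketch of the tightened predicate in this commit: the `QuantAlgo` enum below is a local stand-in listing only members named in the message, and `should_skip_on_b300` is a hypothetical name that takes the SM version as an argument instead of calling `get_sm_version()`.

```python
from enum import Enum


class QuantAlgo(Enum):
    """Local stand-in; only the members named in the commit message."""
    W4A16_MXFP4 = "W4A16_MXFP4"
    NVFP4 = "NVFP4"
    FP8_BLOCK_SCALES = "FP8_BLOCK_SCALES"
    W4A8_MXFP4_MXFP8 = "W4A8_MXFP4_MXFP8"


def should_skip_on_b300(sm_version: int, quant_algo: QuantAlgo) -> bool:
    # Hypothetical helper: skip only the W4A16_MXFP4 path on SM103 (B300).
    return sm_version == 103 and quant_algo == QuantAlgo.W4A16_MXFP4


for algo in QuantAlgo:
    print(algo.name, should_skip_on_b300(103, algo))
```

Every other quantization mode keeps running on SM103, so a regression in the C++ blacklist would still surface there.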
Correct the prior skip predicate. Jenkins inner pytest stdout from build
L0_Test-x86_64-Single-GPU/899, stage B300-PyTorch-1, identifies the
deterministically failing sub-test as
test_moe_backend[e8_k1_h512_i512-seq=8-dtype=torch.bfloat16-backend=TRTLLM-quant=FP8_BLOCK_SCALES-routing=Renormalize]
It fails on both the first run (at 47% of the shard) and the retry,
while:
* e8_k1_h512_i512 + seq=1 + FP8_BLOCK_SCALES PASSES
* e8_k1_h512_i512 + seq=1 + W4A16_MXFP4 PASSES
* e8_k1_h512_i512 + seq=8 + W4A16_MXFP4 PASSES on retry
* e8_k1_h512_i512 + seq=8 + W4A8_NVFP4_FP8 PASSES on retry
so the W4A16_MXFP4 / W4A8_NVFP4_FP8 first-run failures were cascading
IMA errors from the FP8_BLOCK_SCALES tactic that corrupted the CUDA
context. The Bf16MxE2m1BlockScaleMoERunner arg-order bug that PR head
commit 5629e0d also fixes is a separate latent issue, not the
trigger for the B300 stage failure.
Note: test_moe_backend.py's CI_MOE_MODEL_CONFIGS does not include
DeepSeek-V4-Flash (256, 6, 4096, 2048) -- V4-Flash is in LOCAL only,
so this failure is unrelated to the CI promotion of V4-Flash.
Replace the prior W4A16_MXFP4 skip predicate with the exact tuple
SM103 + TRTLLM + FP8_BLOCK_SCALES
+ MoeModelConfig(num_experts=8, top_k=1,
hidden_size=512, intermediate_size=512)
+ seq_len=8
matching the only deterministically failing sub-test. The C++ blacklist
in 5629e0d (isKnownInvalidBlockScaleMoeTactic in
fp8BlockScaleMoe.cpp for SM103 + tactic [tileN=32, configIndex=5])
should resolve this end-to-end; the skip stays only until that fix is
verified on B300.
Signed-off-by: xxi <xxi@nvidia.com>
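The exact-tuple skip from this commit could look roughly like the sketch below. The `MoeModelConfig` stand-in, the string-typed backend and quant arguments, and both function names are assumptions for illustration; the real predicate lives in the test utilities and may be shaped differently.

```python
from typing import NamedTuple


class MoeModelConfig(NamedTuple):
    """Stand-in mirroring the keyword names used in the commit message."""
    num_experts: int
    top_k: int
    hidden_size: int
    intermediate_size: int


KNOWN_FAILING_CONFIG = MoeModelConfig(
    num_experts=8, top_k=1, hidden_size=512, intermediate_size=512
)


def config_id(cfg: MoeModelConfig) -> str:
    # Hypothetical helper reproducing the e/k/h/i prefix seen in the test IDs.
    return f"e{cfg.num_experts}_k{cfg.top_k}_h{cfg.hidden_size}_i{cfg.intermediate_size}"


def is_known_b300_failure(sm_version, backend, quant, cfg, seq_len) -> bool:
    # Hypothetical predicate: every component must match before the test is skipped.
    return (
        sm_version == 103
        and backend == "TRTLLM"
        and quant == "FP8_BLOCK_SCALES"
        and cfg == KNOWN_FAILING_CONFIG
        and seq_len == 8
    )


print(config_id(KNOWN_FAILING_CONFIG))
print(is_known_b300_failure(103, "TRTLLM", "FP8_BLOCK_SCALES", KNOWN_FAILING_CONFIG, 8))
print(is_known_b300_failure(103, "TRTLLM", "FP8_BLOCK_SCALES", KNOWN_FAILING_CONFIG, 1))
```

Matching on the full tuple keeps the seq=1 variant and the other quantization modes in the run, in line with the pass/fail list above.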
PR_Github #48131 [ run ] completed with state
…4 case

PR HEAD 5629e0d already covers the original FP8_BLOCK_SCALES failure reported by Jenkins L0_MergeRequest_PR/37541 (the FP8 case was a cascade victim, not the IMA origin).

Empirically reproduced on a B300 (SM103) node that the actual first deterministic IMA on PR HEAD is `alpha=1.702_beta=1.0_limit=7.0-e128_k4_h2880_i2880-seq=8-W4A16_MXFP4` via the Bf16MxE2m1BlockScaleMoERunner path. The PR's C++ blacklist `tileN==32 && configIndex==5` is necessary but insufficient: extending it to `tileN==32` (all configIndex) on SM103 lets all 34 -k "TRTLLM" sub-tests pass, while the PR HEAD blacklist alone still cascades.

Replace the over-skip of the FP8 case with a precise skip of the MXFP4 case and rewrite the comment with the matrix of empirical results and step-by-step reproduction so the next person can verify when the underlying kernel/blacklist is fixed.

Signed-off-by: xxi <xxi@nvidia.com>
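The difference between the two blacklist variants can be modeled in a few lines of Python. This is only a model of the C++ logic described above: the function names and the sample tactic list are made up for illustration, and real tactics come from the kernel runner.

```python
def blacklisted_pr_head(sm_version: int, tile_n: int, config_index: int) -> bool:
    # Model of the PR HEAD rule: a single (tileN, configIndex) pair on SM103.
    return sm_version == 103 and tile_n == 32 and config_index == 5


def blacklisted_extended(sm_version: int, tile_n: int, config_index: int) -> bool:
    # Model of the extended rule: every tileN == 32 tactic on SM103.
    return sm_version == 103 and tile_n == 32


def filter_tactics(tactics, sm_version, is_blacklisted):
    """Keep only the (tile_n, config_index) pairs the predicate does not reject."""
    return [t for t in tactics if not is_blacklisted(sm_version, *t)]


# Illustrative (tile_n, config_index) pairs, not real autotuner output.
sample_tactics = [(16, 0), (32, 3), (32, 5), (64, 5)]

print(filter_tactics(sample_tactics, 103, blacklisted_pr_head))
print(filter_tactics(sample_tactics, 103, blacklisted_extended))
```

The PR HEAD rule still lets other tileN=32 tactics through on SM103, which matches the observation that it is necessary but insufficient.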
/bot run --disable-fail-fast
PR_Github #48179 [ run ] triggered by Bot. Commit:
PR_Github #48179 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #48246 [ run ] triggered by Bot. Commit:
PR_Github #48246 [ run ] completed with state
Description
Swap DeepSeek-V3 and DeepSeek-V4-Flash entries between the CI and local MoE module test configuration lists in tests/unittest/_torch/modules/moe/test_moe_module.py.
After this change:
* CI_MOE_MODEL_CONFIGS (default, TRTLLM_TEST_MOE_CI=1) covers DeepSeek-V4-Flash (256, 6, 4096, 2048) instead of DeepSeek-V3.
* LOCAL_MOE_MODEL_CONFIGS (full local matrix, TRTLLM_TEST_MOE_CI=0) now exercises DeepSeek-V3 (256, 8, 7168, 2048) together with the existing local-only configs.
Net diff is a 2-line swap; no test logic, fixtures, or other configs are touched.
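A minimal sketch of how the two lists and the environment switch might fit together after the swap. The list contents are reduced to the two swapped entries, `MoeModelConfig` is a stand-in, `select_moe_model_configs` is a hypothetical helper, and the assumption that the full local matrix is the CI list plus the local-only list is not confirmed by this PR.

```python
import os
from typing import NamedTuple


class MoeModelConfig(NamedTuple):
    """Stand-in: (num_experts, top_k, hidden_size, intermediate_size)."""
    num_experts: int
    top_k: int
    hidden_size: int
    intermediate_size: int


# Only the swapped entries are shown; the other entries in each list are omitted.
CI_MOE_MODEL_CONFIGS = [MoeModelConfig(256, 6, 4096, 2048)]     # DeepSeek-V4-Flash
LOCAL_MOE_MODEL_CONFIGS = [MoeModelConfig(256, 8, 7168, 2048)]  # DeepSeek-V3


def select_moe_model_configs(env=None):
    """Hypothetical selector keyed on TRTLLM_TEST_MOE_CI (defaults to "1")."""
    env = os.environ if env is None else env
    if env.get("TRTLLM_TEST_MOE_CI", "1") == "1":
        return list(CI_MOE_MODEL_CONFIGS)
    # Assumption: the local run covers the CI subset plus the local-only configs.
    return list(CI_MOE_MODEL_CONFIGS) + list(LOCAL_MOE_MODEL_CONFIGS)


print(select_moe_model_configs({"TRTLLM_TEST_MOE_CI": "1"}))
print(select_moe_model_configs({"TRTLLM_TEST_MOE_CI": "0"}))
```

Passing the environment as a dictionary keeps the sketch easy to check without mutating `os.environ`.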
Test Coverage
This PR only changes which model configurations are exercised by the existing MoE module tests in tests/unittest/_torch/modules/moe/test_moe_module.py. The same test functions (parametrized over MOE_MODEL_CONFIGS) run unchanged and continue to provide coverage for both configs across the CI and local matrices.