[None][test] promote DeepSeek-V4-Flash to MoE CI config subset by xxi-nv · Pull Request #13964 · NVIDIA/TensorRT-LLM

xxi-nv · 2026-05-11T01:34:44Z

Summary by CodeRabbit

Tests
- Updated test model configurations for MoE modules, adjusting which model variants run in CI versus local testing environments for optimized test coverage and execution.

Description

Swap DeepSeek-V3 and DeepSeek-V4-Flash entries between the CI and local
MoE module test configuration lists in
tests/unittest/_torch/modules/moe/test_moe_module.py.

After this change:

CI_MOE_MODEL_CONFIGS (default, TRTLLM_TEST_MOE_CI=1) covers
DeepSeek-V4-Flash (256, 6, 4096, 2048) instead of DeepSeek-V3.
LOCAL_MOE_MODEL_CONFIGS (full local matrix,
TRTLLM_TEST_MOE_CI=0) now exercises DeepSeek-V3
(256, 8, 7168, 2048) together with the existing local-only configs.

Net diff is a 2-line swap; no test logic, fixtures, or other configs
are touched.

Test Coverage

This PR only changes which model configurations are exercised by the
existing MoE module tests in
tests/unittest/_torch/modules/moe/test_moe_module.py. The same test
functions (parametrized over MOE_MODEL_CONFIGS) run unchanged and
continue to provide coverage for both configs across the CI and local
matrices.

PR Checklist

Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Swap DeepSeek-V3 and DeepSeek-V4-Flash in the MoE module test config lists in tests/unittest/_torch/modules/moe/test_moe_module.py. DeepSeek-V4-Flash now runs in the default CI subset (TRTLLM_TEST_MOE_CI=1), while DeepSeek-V3 is exercised only in the local full matrix (TRTLLM_TEST_MOE_CI=0). Signed-off-by: xxi <xxi@nvidia.com>

coderabbitai · 2026-05-11T01:36:31Z

📝 Walkthrough

Walkthrough

Test model configuration matrices for MoE modules are rebalanced by swapping two expert/shape variants: CI runs now test DeepSeek-V4-Flash while local runs test DeepSeek-V3 instead.

Changes

MoE Test Configuration

Layer / File(s)	Summary
Test Model Config Matrix `tests/unittest/_torch/modules/moe/test_moe_module.py`	`CI_MOE_MODEL_CONFIGS` now includes `MoeModelConfig(256, 6, 4096, 2048)` (DeepSeek-V4-Flash). `LOCAL_MOE_MODEL_CONFIGS` now includes `MoeModelConfig(256, 8, 7168, 2048)` (DeepSeek-V3), swapping the previously distributed configuration coverage.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~2 minutes

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and specifically describes the main change: promoting DeepSeek-V4-Flash to the CI config subset, which matches the core objective of swapping test configurations.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description check	✅ Passed	The PR description follows the template structure with clear sections for Description, Test Coverage, and PR Checklist completed.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

Generate code and open pull requests
Plan features and break down work
Investigate incidents and troubleshoot customer tickets together
Automate recurring tasks and respond to alerts with triggers
Summarize progress and report instantly

Built for teams:

Shared memory across your entire org—no repeating context
Per-thread sandboxes to safely plan and execute work
Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

xxi-nv · 2026-05-11T01:38:54Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-11T01:44:25Z

PR_Github #47637 [ run ] triggered by Bot. Commit: 4ceda0a Link to invocation

tensorrt-cicd · 2026-05-11T07:11:46Z

PR_Github #47637 [ run ] completed with state SUCCESS. Commit: 4ceda0a
/LLM/main/L0_MergeRequest_PR pipeline #37541 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

xxi-nv · 2026-05-13T07:15:35Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-13T07:21:54Z

PR_Github #48131 [ run ] triggered by Bot. Commit: 2e279c8 Link to invocation

…RTLLMGen on B300 Two test-side fixes for failures from PR 13964's first CI run (L0_MergeRequest_PR/37541 -> child L0_Test-x86_64-Single-GPU/899): 1) DGX_B200-PyTorch-2 stage TIMEOUT in test_unittests.py::test_unittests_v2[unittest/_torch/modules/moe/ test_moe_module.py::test_configurable_moe_single_gpu -k "CUTLASS"]: Promoting DeepSeek-V4-Flash (256, 6, 4096, 2048) into CI_MOE_MODEL_CONFIGS caused the CI sub-test matrix to explode for the e256 path because moe_test_utils.should_skip_to_accelerate_ci() gated Rule-1 minimal coverage on hidden_size >= 7168 in addition to num_experts >= 256. V4-Flash has hidden_size=4096 < 7168, so it escaped Rule-1 and ran the full dtype x seq_len x swiglu x routing matrix (~60 CUTLASS sub-tests vs ~4 under Rule-1 for DeepSeek-V3), busting the per-stage Slurm wall-clock budget on B200. Drop the hidden_size threshold so any e256-class model triggers Rule-1 minimal coverage. V4-Flash stays in CI as the e256 signal, with the same minimal coverage envelope (DeepSeekV3 routing, bfloat16, seq=1, non-gptoss SwiGLU) that DeepSeek-V3 had before. 2) B300-PyTorch-1 stage FAILURE in test_unittests_v2[test_moe_backend.py::test_moe_backend -k "TRTLLM"]: TRTLLMGen MoE on B300 (SM103) hits an illegal memory access during tactic autotune. PR head commit 5629e0d partially mitigates by blacklisting tactic [tileN=32, configIndex=5] in cpp/.../trtllmGenKernels/blockScaleMoe/runner.h, but full coverage of other potentially failing tactics is not yet validated end-to-end. Skip all TRTLLMGen tests on B300 (SM103) via should_skip_trtllm() with a [Bug] marker until the fix is verified. Signed-off-by: xxi <xxi@nvidia.com>

Address review feedback on the prior commit's overly broad B300 skip: should_skip_trtllm() was skipping every TRTLLMGen test on SM103, which masks more cases than the actual bug. Code evidence narrows the failing case to W4A16_MXFP4 + bf16 activation: * PR NVIDIA#13964 head commit 5629e0d's Python-side diff only touches Bf16MxE2m1BlockScaleMoERunner.get_valid_tactics (arg order: (num_experts, num_tokens) was swapped vs the C++ getValidConfigIndices signature). That runner is reached only via bf16_mxe2m1_block_scale_moe_runner in moe_op_backend.py, i.e. the W4A16_MXFP4 path. TRTLLMGenFusedMoE.can_implement() hard-requires bf16 activation, so dtype is implicit. * The C++ side of 5629e0d (isKnownInvalidBlockScaleMoeTactic in runner.h, wired into fp4/fp8/mxFp4 BlockScaleMoe runners) already blacklists tactic [tileN=32, configIndex=5] for SM103 across all TRTLLMGen quant_algos. So other TRTLLMGen quants on B300 should no longer expose the IMA. Tighten the skip predicate to ``get_sm_version() == 103 and quant_algo == QuantAlgo.W4A16_MXFP4``. This keeps NVFP4 / FP8_BLOCK_SCALES / W4A8_MXFP4_MXFP8 / etc. running on B300 (where the C++ blacklist is now the only mitigation), and only mutes the W4A16_MXFP4 path that the Python fix targets, until the end-to-end fix is verified. Signed-off-by: xxi <xxi@nvidia.com>

Correct the prior skip predicate. Jenkins inner pytest stdout from build L0_Test-x86_64-Single-GPU/899, stage B300-PyTorch-1, identifies the deterministically failing sub-test as test_moe_backend[e8_k1_h512_i512-seq=8-dtype=torch.bfloat16- backend=TRTLLM-quant=FP8_BLOCK_SCALES-routing=Renormalize] It fails on both the first run (at 47% of the shard) and the retry, while: * e8_k1_h512_i512 + seq=1 + FP8_BLOCK_SCALES PASSES * e8_k1_h512_i512 + seq=1 + W4A16_MXFP4 PASSES * e8_k1_h512_i512 + seq=8 + W4A16_MXFP4 PASSES on retry * e8_k1_h512_i512 + seq=8 + W4A8_NVFP4_FP8 PASSES on retry so the W4A16_MXFP4 / W4A8_NVFP4_FP8 first-run failures were cascading IMA errors from the FP8_BLOCK_SCALES tactic that corrupted the CUDA context. The Bf16MxE2m1BlockScaleMoERunner arg-order bug that PR head commit 5629e0d also fixes is a separate latent issue, not the trigger for the B300 stage failure. Note: test_moe_backend.py's CI_MOE_MODEL_CONFIGS does not include DeepSeek-V4-Flash (256, 6, 4096, 2048) -- V4-Flash is in LOCAL only, so this failure is unrelated to the CI promotion of V4-Flash. Replace the prior W4A16_MXFP4 skip predicate with the exact tuple SM103 + TRTLLM + FP8_BLOCK_SCALES + MoeModelConfig(num_experts=8, top_k=1, hidden_size=512, intermediate_size=512) + seq_len=8 matching the only deterministically failing sub-test. The C++ blacklist in 5629e0d (isKnownInvalidBlockScaleMoeTactic in fp8BlockScaleMoe.cpp for SM103 + tactic [tileN=32, configIndex=5]) should resolve this end-to-end; the skip stays only until that fix is verified on B300. Signed-off-by: xxi <xxi@nvidia.com>

tensorrt-cicd · 2026-05-13T12:06:33Z

PR_Github #48131 [ run ] completed with state SUCCESS. Commit: 2e279c8
/LLM/main/L0_MergeRequest_PR pipeline #37957 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

…4 case PR HEAD 5629e0d already covers the original FP8_BLOCK_SCALES failure reported by Jenkins L0_MergeRequest_PR/37541 (the FP8 case was a cascade victim, not the IMA origin). Empirically reproduced on a B300 (SM103) node that the actual first deterministic IMA on PR HEAD is `alpha=1.702_beta=1.0_limit=7.0-e128_k4_h2880_i2880-seq=8-W4A16_MXFP4` via the Bf16MxE2m1BlockScaleMoERunner path. The PR's C++ blacklist `tileN==32 && configIndex==5` is necessary but insufficient: extending it to `tileN==32` (all configIndex) on SM103 lets all 34 -k "TRTLLM" sub-tests pass, while the PR HEAD blacklist alone still cascades. Replace the over-skip of the FP8 case with a precise skip of the MXFP4 case and rewrite the comment with the matrix of empirical results and step-by-step reproduction so the next person can verify when the underlying kernel/blacklist is fixed. Signed-off-by: xxi <xxi@nvidia.com>

xxi-nv · 2026-05-13T13:11:06Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-13T13:17:11Z

PR_Github #48179 [ run ] triggered by Bot. Commit: 8427556 Link to invocation

tensorrt-cicd · 2026-05-13T20:49:46Z

PR_Github #48179 [ run ] completed with state SUCCESS. Commit: 8427556
/LLM/main/L0_MergeRequest_PR pipeline #37999 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

xxi-nv · 2026-05-13T23:34:26Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-13T23:40:52Z

PR_Github #48246 [ run ] triggered by Bot. Commit: 8427556 Link to invocation

tensorrt-cicd · 2026-05-14T01:35:49Z

PR_Github #48246 [ run ] completed with state SUCCESS. Commit: 8427556
/LLM/main/L0_MergeRequest_PR pipeline #38063 completed with status: 'SUCCESS'

CI Report

Link to invocation

github-actions Bot assigned xxi-nv May 11, 2026

xxi-nv requested a review from leslie-fang25 May 11, 2026 01:39

leslie-fang25 approved these changes May 11, 2026

View reviewed changes

xxi-nv enabled auto-merge (squash) May 11, 2026 03:00

xxi-nv force-pushed the xxi/test-moe-promote-deepseek-v4-flash-to-ci branch from b14dc8b to 5629e0d Compare May 13, 2026 06:53

xxi-nv requested a review from a team as a code owner May 13, 2026 06:53

xxi-nv requested a review from yizhang-nv May 13, 2026 06:53

xxi-nv force-pushed the xxi/test-moe-promote-deepseek-v4-flash-to-ci branch from 5629e0d to 4ceda0a Compare May 13, 2026 07:05

xxi-nv removed request for a team and yizhang-nv May 13, 2026 07:15

Merge branch 'main' into xxi/test-moe-promote-deepseek-v4-flash-to-ci

2e279c8

xxi-nv added 3 commits May 13, 2026 00:58

xxi-nv merged commit fca29e8 into NVIDIA:main May 14, 2026
6 checks passed

coderabbitai Bot mentioned this pull request May 21, 2026

[https://nvbugs/6185212][fix] Fix B300 MoE test list ids #14401

Open

Conversation

xxi-nv commented May 11, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Description

Test Coverage

PR Checklist

GitHub Bot Help

Uh oh!

coderabbitai Bot commented May 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Uh oh!

xxi-nv commented May 11, 2026

Uh oh!

tensorrt-cicd commented May 11, 2026

Uh oh!

tensorrt-cicd commented May 11, 2026

Uh oh!

xxi-nv commented May 13, 2026

Uh oh!

tensorrt-cicd commented May 13, 2026

Uh oh!

tensorrt-cicd commented May 13, 2026

Uh oh!

xxi-nv commented May 13, 2026

Uh oh!

tensorrt-cicd commented May 13, 2026

Uh oh!

tensorrt-cicd commented May 13, 2026

Uh oh!

xxi-nv commented May 13, 2026

Uh oh!

tensorrt-cicd commented May 13, 2026

Uh oh!

tensorrt-cicd commented May 14, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

xxi-nv commented May 11, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 11, 2026 •

edited

Loading