
[None][test] promote DeepSeek-V4-Flash to MoE CI config subset #13964

Merged
xxi-nv merged 6 commits into NVIDIA:main from xxi-nv:xxi/test-moe-promote-deepseek-v4-flash-to-ci
May 14, 2026

Conversation

xxi-nv (Collaborator) commented May 11, 2026

Summary by CodeRabbit

  • Tests
    • Updated test model configurations for MoE modules, adjusting which model variants run in CI versus local testing environments for optimized test coverage and execution.


Description

Swap DeepSeek-V3 and DeepSeek-V4-Flash entries between the CI and local
MoE module test configuration lists in
tests/unittest/_torch/modules/moe/test_moe_module.py.

After this change:

  • CI_MOE_MODEL_CONFIGS (default, TRTLLM_TEST_MOE_CI=1) covers
    DeepSeek-V4-Flash (256, 6, 4096, 2048) instead of DeepSeek-V3.
  • LOCAL_MOE_MODEL_CONFIGS (full local matrix,
    TRTLLM_TEST_MOE_CI=0) now exercises DeepSeek-V3
    (256, 8, 7168, 2048) together with the existing local-only configs.
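A rough sketch of how the two lists and the TRTLLM_TEST_MOE_CI switch fit together. It assumes MoeModelConfig is a simple record with fields in the order (num_experts, top_k, hidden_size, intermediate_size), matching the keyword form quoted later in this PR, and that local mode runs the CI entries plus the local-only ones; the selector function name is hypothetical and the real file may be organized differently.

```python
import os
from typing import List, Mapping, NamedTuple, Optional


class MoeModelConfig(NamedTuple):
    # Field order assumed from MoeModelConfig(num_experts=8, top_k=1,
    # hidden_size=512, intermediate_size=512) quoted later in this PR.
    num_experts: int
    top_k: int
    hidden_size: int
    intermediate_size: int


# CI subset after this PR (other CI entries, if any, omitted).
CI_MOE_MODEL_CONFIGS = [
    MoeModelConfig(256, 6, 4096, 2048),  # DeepSeek-V4-Flash
]

# Local-only list after this PR (existing local-only entries omitted).
LOCAL_MOE_MODEL_CONFIGS = [
    MoeModelConfig(256, 8, 7168, 2048),  # DeepSeek-V3
]


def select_moe_model_configs(
        env: Optional[Mapping[str, str]] = None) -> List[MoeModelConfig]:
    """Choose the config matrix; TRTLLM_TEST_MOE_CI defaults to "1" (CI)."""
    env = os.environ if env is None else env
    if env.get("TRTLLM_TEST_MOE_CI", "1") == "1":
        return list(CI_MOE_MODEL_CONFIGS)
    # Assumption: the full local matrix also keeps the CI entries.
    return CI_MOE_MODEL_CONFIGS + LOCAL_MOE_MODEL_CONFIGS


MOE_MODEL_CONFIGS = select_moe_model_configs()
```

Tests parametrized over MOE_MODEL_CONFIGS then pick up whichever list is active, so moving an entry between the two lists changes coverage without touching any test logic.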

Net diff is a 2-line swap; no test logic, fixtures, or other configs
are touched.

Test Coverage

This PR only changes which model configurations are exercised by the
existing MoE module tests in
tests/unittest/_torch/modules/moe/test_moe_module.py. The same test
functions (parametrized over MOE_MODEL_CONFIGS) run unchanged and
continue to provide coverage for both configs across the CI and local
matrices.

PR Checklist

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Swap DeepSeek-V3 and DeepSeek-V4-Flash in the MoE module test config
lists in tests/unittest/_torch/modules/moe/test_moe_module.py.
DeepSeek-V4-Flash now runs in the default CI subset
(TRTLLM_TEST_MOE_CI=1), while DeepSeek-V3 is exercised only in the
local full matrix (TRTLLM_TEST_MOE_CI=0).

Signed-off-by: xxi <xxi@nvidia.com>
coderabbitai (Contributor, bot) commented May 11, 2026

📝 Walkthrough


Test model configuration matrices for MoE modules are rebalanced by swapping two expert/shape variants: CI runs now test DeepSeek-V4-Flash while local runs test DeepSeek-V3 instead.

Changes

MoE Test Configuration

Layer | File(s) | Summary
Test Model Config Matrix | tests/unittest/_torch/modules/moe/test_moe_module.py | CI_MOE_MODEL_CONFIGS now includes MoeModelConfig(256, 6, 4096, 2048) (DeepSeek-V4-Flash), and LOCAL_MOE_MODEL_CONFIGS now includes MoeModelConfig(256, 8, 7168, 2048) (DeepSeek-V3); the two configs trade places between the CI and local lists.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~2 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name | Status | Explanation
Title check | ✅ Passed | The title clearly and specifically describes the main change: promoting DeepSeek-V4-Flash to the CI config subset, which matches the core objective of swapping test configurations.
Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Description check | ✅ Passed | The PR description follows the template structure with clear sections for Description, Test Coverage, and PR Checklist completed.


xxi-nv (Collaborator, Author) commented May 11, 2026

/bot run --disable-fail-fast

xxi-nv requested a review from leslie-fang25 May 11, 2026 01:39
tensorrt-cicd (Collaborator)

PR_Github #47637 [ run ] triggered by Bot. Commit: 4ceda0a

xxi-nv enabled auto-merge (squash) May 11, 2026 03:00
tensorrt-cicd (Collaborator)

PR_Github #47637 [ run ] completed with state SUCCESS. Commit: 4ceda0a
/LLM/main/L0_MergeRequest_PR pipeline #37541 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again


xxi-nv force-pushed the xxi/test-moe-promote-deepseek-v4-flash-to-ci branch from b14dc8b to 5629e0d May 13, 2026 06:53
xxi-nv requested a review from a team as a code owner May 13, 2026 06:53
xxi-nv requested a review from yizhang-nv May 13, 2026 06:53
xxi-nv force-pushed the xxi/test-moe-promote-deepseek-v4-flash-to-ci branch from 5629e0d to 4ceda0a May 13, 2026 07:05
xxi-nv removed request for a team and yizhang-nv May 13, 2026 07:15
xxi-nv (Collaborator, Author) commented May 13, 2026

/bot run --disable-fail-fast

tensorrt-cicd (Collaborator)

PR_Github #48131 [ run ] triggered by Bot. Commit: 2e279c8

xxi-nv added 3 commits May 13, 2026 00:58
…RTLLMGen on B300

Two test-side fixes for failures from PR 13964's first CI run
(L0_MergeRequest_PR/37541 -> child L0_Test-x86_64-Single-GPU/899):

1) DGX_B200-PyTorch-2 stage TIMEOUT in
   test_unittests.py::test_unittests_v2[unittest/_torch/modules/moe/
   test_moe_module.py::test_configurable_moe_single_gpu -k "CUTLASS"]:

   Promoting DeepSeek-V4-Flash (256, 6, 4096, 2048) into
   CI_MOE_MODEL_CONFIGS blew up the CI sub-test matrix for the e256
   path, because moe_test_utils.should_skip_to_accelerate_ci() gated
   Rule-1 minimal coverage on hidden_size >= 7168 in addition to
   num_experts >= 256. V4-Flash has hidden_size=4096 < 7168, so it
   escaped Rule-1 and ran the full dtype x seq_len x swiglu x routing
   matrix (~60 CUTLASS sub-tests vs. ~4 under Rule-1 for DeepSeek-V3),
   which exceeded the per-stage Slurm wall-clock budget on B200.

   Drop the hidden_size threshold so any e256-class model triggers
   Rule-1 minimal coverage. V4-Flash stays in CI as the e256 signal,
   with the same minimal coverage envelope (DeepSeekV3 routing,
   bfloat16, seq=1, non-gptoss SwiGLU) that DeepSeek-V3 had before.

2) B300-PyTorch-1 stage FAILURE in
   test_unittests_v2[test_moe_backend.py::test_moe_backend -k "TRTLLM"]:

   TRTLLMGen MoE on B300 (SM103) hits an illegal memory access (IMA)
   during tactic autotune. PR head commit 5629e0d partially mitigates
   this by blacklisting tactic [tileN=32, configIndex=5] in
   cpp/.../trtllmGenKernels/blockScaleMoe/runner.h, but it has not yet
   been validated end-to-end that no other tactic fails. Skip all
   TRTLLMGen tests on B300 (SM103) via should_skip_trtllm() with a
   [Bug] marker until the fix is verified.

Signed-off-by: xxi <xxi@nvidia.com>
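Item 1) above boils down to dropping one condition from the Rule-1 gate. A minimal sketch follows; the function names, the string labels, and the envelope check are hypothetical stand-ins for moe_test_utils.should_skip_to_accelerate_ci(), whose real signature is not shown in this PR. The thresholds and the envelope values are the ones quoted in the commit message.

```python
from typing import NamedTuple


class MoeModelConfig(NamedTuple):
    num_experts: int
    top_k: int
    hidden_size: int
    intermediate_size: int


DEEPSEEK_V3 = MoeModelConfig(256, 8, 7168, 2048)
DEEPSEEK_V4_FLASH = MoeModelConfig(256, 6, 4096, 2048)


def rule1_applies_old(cfg: MoeModelConfig) -> bool:
    # Before: Rule-1 required BOTH many experts and a large hidden size.
    return cfg.num_experts >= 256 and cfg.hidden_size >= 7168


def rule1_applies_new(cfg: MoeModelConfig) -> bool:
    # After: any e256-class model gets minimal coverage.
    return cfg.num_experts >= 256


def in_minimal_envelope(routing: str, dtype: str, seq_len: int,
                        swiglu: str) -> bool:
    # Envelope from the commit message: DeepSeekV3 routing, bfloat16,
    # seq=1, non-gptoss SwiGLU. The string labels are hypothetical.
    return (routing == "DeepSeekV3" and dtype == "bfloat16"
            and seq_len == 1 and swiglu != "gptoss")


def should_skip_in_ci(cfg: MoeModelConfig, routing: str, dtype: str,
                      seq_len: int, swiglu: str) -> bool:
    """Skip a CI sub-test when Rule-1 applies and it is outside the envelope."""
    return rule1_applies_new(cfg) and not in_minimal_envelope(
        routing, dtype, seq_len, swiglu)
```

Under the old gate V4-Flash slips through and runs the full matrix; under the new gate it is trimmed to the same envelope DeepSeek-V3 used, while non-e256 models are unaffected.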
Address review feedback on the prior commit's overly broad B300 skip:
should_skip_trtllm() was skipping every TRTLLMGen test on SM103,
which masks more cases than the actual bug.

Code evidence narrows the failing case to W4A16_MXFP4 + bf16 activation:

* PR NVIDIA#13964 head commit 5629e0d's Python-side diff only touches
  Bf16MxE2m1BlockScaleMoERunner.get_valid_tactics (arg order:
  (num_experts, num_tokens) was swapped vs the C++
  getValidConfigIndices signature). That runner is reached only via
  bf16_mxe2m1_block_scale_moe_runner in moe_op_backend.py, i.e. the
  W4A16_MXFP4 path. TRTLLMGenFusedMoE.can_implement() hard-requires
  bf16 activation, so dtype is implicit.

* The C++ side of 5629e0d (isKnownInvalidBlockScaleMoeTactic in
  runner.h, wired into fp4/fp8/mxFp4 BlockScaleMoe runners) already
  blacklists tactic [tileN=32, configIndex=5] for SM103 across all
  TRTLLMGen quant_algos. So other TRTLLMGen quants on B300 should
  no longer expose the IMA.

Tighten the skip predicate to
``get_sm_version() == 103 and quant_algo == QuantAlgo.W4A16_MXFP4``.

This keeps NVFP4 / FP8_BLOCK_SCALES / W4A8_MXFP4_MXFP8 / etc. running
on B300 (where the C++ blacklist is now the only mitigation), and
only mutes the W4A16_MXFP4 path that the Python fix targets, until
the end-to-end fix is verified.

Signed-off-by: xxi <xxi@nvidia.com>
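The narrowing from the blanket SM103 skip to the W4A16_MXFP4-only predicate can be sketched as two small functions. QuantAlgo here is a local stand-in enum holding only the members named above, not the real TensorRT-LLM enum, and the SM version is passed in rather than read from get_sm_version() so the sketch runs without a GPU; both function names are hypothetical.

```python
from enum import Enum, auto


class QuantAlgo(Enum):
    # Stand-in enum: only the members named in this commit message.
    NVFP4 = auto()
    FP8_BLOCK_SCALES = auto()
    W4A8_MXFP4_MXFP8 = auto()
    W4A16_MXFP4 = auto()


def should_skip_trtllm_broad(sm_version: int, quant_algo: QuantAlgo) -> bool:
    # Previous commit: mute every TRTLLMGen test on SM103 (B300).
    return sm_version == 103


def should_skip_trtllm_narrow(sm_version: int, quant_algo: QuantAlgo) -> bool:
    # This commit: mute only the W4A16_MXFP4 path on SM103; other quants
    # keep running and rely on the C++ tactic blacklist.
    return sm_version == 103 and quant_algo == QuantAlgo.W4A16_MXFP4
```

The narrow form keeps NVFP4, FP8_BLOCK_SCALES, and W4A8_MXFP4_MXFP8 exercised on B300, so a regression in the C++ blacklist would still surface there.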
Correct the prior skip predicate. Jenkins inner pytest stdout from build
L0_Test-x86_64-Single-GPU/899, stage B300-PyTorch-1, identifies the
deterministically failing sub-test as

  test_moe_backend[e8_k1_h512_i512-seq=8-dtype=torch.bfloat16-
                   backend=TRTLLM-quant=FP8_BLOCK_SCALES-routing=Renormalize]

It fails on both the first run (at 47% of the shard) and the retry,
while:

* e8_k1_h512_i512 + seq=1 + FP8_BLOCK_SCALES  PASSES
* e8_k1_h512_i512 + seq=1 + W4A16_MXFP4       PASSES
* e8_k1_h512_i512 + seq=8 + W4A16_MXFP4       PASSES on retry
* e8_k1_h512_i512 + seq=8 + W4A8_NVFP4_FP8    PASSES on retry

so the W4A16_MXFP4 / W4A8_NVFP4_FP8 first-run failures were cascading
IMA errors from the FP8_BLOCK_SCALES tactic that corrupted the CUDA
context. The Bf16MxE2m1BlockScaleMoERunner arg-order bug that PR head
commit 5629e0d also fixes is a separate latent issue, not the
trigger for the B300 stage failure.
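The triage above (fails on both runs versus fails only on the first run) can be written as a small helper over the reported outcomes. This is a hypothetical sketch, not part of the test suite; the short sub-test labels abbreviate the e8_k1_h512_i512 ids listed above. The heuristic only flags candidates: as the next commit found, a sub-test that fails on both runs can still be a cascade victim of an earlier IMA, so isolated reproduction is still needed to confirm the origin.

```python
from typing import Dict, Optional, Tuple


def classify(first_run_passed: bool, retry_passed: Optional[bool]) -> str:
    """Rough triage of one sub-test from a first run plus an optional retry."""
    if first_run_passed:
        return "pass"
    if retry_passed is None:
        return "unknown"          # failed once, no retry data
    if retry_passed:
        return "cascade-suspect"  # failed only in the first (poisoned) run
    return "deterministic"        # failed on both runs


# Outcomes for the e8_k1_h512_i512 sub-tests listed above:
# (first_run_passed, retry_passed); None means no retry was reported.
RESULTS: Dict[str, Tuple[bool, Optional[bool]]] = {
    "seq=8-FP8_BLOCK_SCALES": (False, False),
    "seq=1-FP8_BLOCK_SCALES": (True, None),
    "seq=1-W4A16_MXFP4": (True, None),
    "seq=8-W4A16_MXFP4": (False, True),
    "seq=8-W4A8_NVFP4_FP8": (False, True),
}

deterministic = [name for name, (first, retry) in RESULTS.items()
                 if classify(first, retry) == "deterministic"]
```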

Note: test_moe_backend.py's CI_MOE_MODEL_CONFIGS does not include
DeepSeek-V4-Flash (256, 6, 4096, 2048) -- V4-Flash is in LOCAL only,
so this failure is unrelated to the CI promotion of V4-Flash.

Replace the prior W4A16_MXFP4 skip predicate with the exact tuple

  SM103 + TRTLLM + FP8_BLOCK_SCALES
        + MoeModelConfig(num_experts=8, top_k=1,
                         hidden_size=512, intermediate_size=512)
        + seq_len=8

matching the only deterministically failing sub-test. The C++ blacklist
in 5629e0d (isKnownInvalidBlockScaleMoeTactic in
fp8BlockScaleMoe.cpp for SM103 + tactic [tileN=32, configIndex=5])
should resolve this end-to-end; the skip stays only until that fix is
verified on B300.

Signed-off-by: xxi <xxi@nvidia.com>
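The exact-tuple predicate in this commit, together with the e{num_experts}_k{top_k}_h{hidden_size}_i{intermediate_size} naming used in the pytest ids, can be sketched as follows. The helper names and the string-typed backend and quant arguments are hypothetical; MoeModelConfig is a local stand-in using the field names quoted above.

```python
import re
from typing import NamedTuple


class MoeModelConfig(NamedTuple):
    num_experts: int
    top_k: int
    hidden_size: int
    intermediate_size: int


_MODEL_ID_RE = re.compile(r"e(\d+)_k(\d+)_h(\d+)_i(\d+)")


def parse_model_id(fragment: str) -> MoeModelConfig:
    """Parse an 'e8_k1_h512_i512'-style id fragment into a MoeModelConfig."""
    m = _MODEL_ID_RE.fullmatch(fragment)
    if m is None:
        raise ValueError(f"unrecognized model id: {fragment!r}")
    return MoeModelConfig(*map(int, m.groups()))


FAILING_CFG = MoeModelConfig(num_experts=8, top_k=1,
                             hidden_size=512, intermediate_size=512)


def should_skip_fp8_case(sm_version: int, backend: str, quant: str,
                         cfg: MoeModelConfig, seq_len: int) -> bool:
    # Matches only SM103 + TRTLLM + FP8_BLOCK_SCALES + e8_k1_h512_i512 + seq=8.
    return (sm_version == 103 and backend == "TRTLLM"
            and quant == "FP8_BLOCK_SCALES"
            and cfg == FAILING_CFG and seq_len == 8)
```

Because every field must match, neighbouring cases such as seq=1 or a different SM version keep running.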
tensorrt-cicd (Collaborator)

PR_Github #48131 [ run ] completed with state SUCCESS. Commit: 2e279c8
/LLM/main/L0_MergeRequest_PR pipeline #37957 completed with status: 'FAILURE'


…4 case

PR HEAD 5629e0d already covers the original FP8_BLOCK_SCALES failure
reported by Jenkins L0_MergeRequest_PR/37541 (the FP8 case was a cascade
victim, not the IMA origin). Reproducing on a B300 (SM103) node showed
that the actual first deterministic IMA on PR HEAD is
`alpha=1.702_beta=1.0_limit=7.0-e128_k4_h2880_i2880-seq=8-W4A16_MXFP4`
via the Bf16MxE2m1BlockScaleMoERunner path. The PR's C++ blacklist
`tileN==32 && configIndex==5` is necessary but insufficient: extending
it to `tileN==32` (all configIndex) on SM103 lets all 34 -k "TRTLLM"
sub-tests pass, while the PR HEAD blacklist alone still cascades.
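The two SM103 blacklist rules compared above can be sketched as filters over candidate tactics. The (tileN, configIndex) tuple encoding, the helper names, and the example tile sizes are hypothetical; the real check is the C++ isKnownInvalidBlockScaleMoeTactic, which this Python sketch does not reproduce.

```python
from itertools import product
from typing import Callable, List, Tuple

Tactic = Tuple[int, int]  # hypothetical (tileN, configIndex) encoding


def pr_head_blacklisted(sm_version: int, tile_n: int, config_index: int) -> bool:
    # Rule from 5629e0d: only [tileN=32, configIndex=5] on SM103.
    return sm_version == 103 and tile_n == 32 and config_index == 5


def extended_blacklisted(sm_version: int, tile_n: int, config_index: int) -> bool:
    # Experimental rule from the reproduction: every configIndex for tileN=32.
    return sm_version == 103 and tile_n == 32


def filter_tactics(sm_version: int, tactics: List[Tactic],
                   is_blacklisted: Callable[[int, int, int], bool]) -> List[Tactic]:
    """Drop tactics that the given rule marks as known-invalid on this SM."""
    return [t for t in tactics if not is_blacklisted(sm_version, *t)]


# Hypothetical tactic grid: three tile sizes x six config indices.
EXAMPLE_TACTICS: List[Tactic] = list(product([16, 32, 64], range(6)))
```

On this example grid the PR HEAD rule drops a single tactic on SM103, the extended rule drops every tileN=32 entry, and neither filters anything on other SM versions.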

Replace the over-skip of the FP8 case with a precise skip of the
MXFP4 case and rewrite the comment with the matrix of empirical
results and step-by-step reproduction so the next person can verify
when the underlying kernel/blacklist is fixed.

Signed-off-by: xxi <xxi@nvidia.com>
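A sketch of the precise MXFP4 skip described in this commit, matching only the reproduced case `alpha=1.702_beta=1.0_limit=7.0-e128_k4_h2880_i2880-seq=8-W4A16_MXFP4`. The helper name, the SwigluParams record, and passing the SM version, backend, and quant as plain arguments are hypothetical choices that let the predicate run without a GPU; the earlier predicate in this PR queried the SM version with get_sm_version(). Restricting it to the TRTLLM backend is an assumption based on the Bf16MxE2m1BlockScaleMoERunner path named above.

```python
from typing import NamedTuple


class MoeModelConfig(NamedTuple):
    num_experts: int
    top_k: int
    hidden_size: int
    intermediate_size: int


class SwigluParams(NamedTuple):
    alpha: float
    beta: float
    limit: float


MXFP4_FAILING_CFG = MoeModelConfig(128, 4, 2880, 2880)
MXFP4_FAILING_SWIGLU = SwigluParams(alpha=1.702, beta=1.0, limit=7.0)


def should_skip_mxfp4_case(sm_version: int, backend: str, quant: str,
                           cfg: MoeModelConfig, seq_len: int,
                           swiglu: SwigluParams) -> bool:
    # Mute only the reproduced IMA case on SM103 (B300); every other
    # TRTLLM sub-test keeps running so a blacklist fix can be verified.
    return (sm_version == 103 and backend == "TRTLLM"
            and quant == "W4A16_MXFP4"
            and cfg == MXFP4_FAILING_CFG and seq_len == 8
            and swiglu == MXFP4_FAILING_SWIGLU)
```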
xxi-nv (Collaborator, Author) commented May 13, 2026

/bot run --disable-fail-fast

tensorrt-cicd (Collaborator)

PR_Github #48179 [ run ] triggered by Bot. Commit: 8427556

tensorrt-cicd (Collaborator)

PR_Github #48179 [ run ] completed with state SUCCESS. Commit: 8427556
/LLM/main/L0_MergeRequest_PR pipeline #37999 completed with status: 'FAILURE'


xxi-nv (Collaborator, Author) commented May 13, 2026

/bot run --disable-fail-fast

tensorrt-cicd (Collaborator)

PR_Github #48246 [ run ] triggered by Bot. Commit: 8427556

tensorrt-cicd (Collaborator)

PR_Github #48246 [ run ] completed with state SUCCESS. Commit: 8427556
/LLM/main/L0_MergeRequest_PR pipeline #38063 completed with status: 'SUCCESS'


xxi-nv merged commit fca29e8 into NVIDIA:main May 14, 2026
6 checks passed