[None][test] Add DeepSeek-V4 CI coverage #13653
Merged
Barry-Delaney merged 2 commits on Apr 30, 2026
Layered V4-specific CI on top of Emma's per-platform _ds.yml structure.
YAML wiring:
* l0_b200_ds.yml (1x B200, pre_merge): add V4 sparse-attention unit tests
(sparse_mla, cache_manager, indices_transform, o_proj, compressor_kernel,
compressor_module, compressor_tf32) + tokenizer chat-template test.
* l0_dgx_b200_ds.yml (4x B200, pre_merge): add V4-Flash aggregate smoke
(TP=4 NVFP4, TRTLLM backend, 1-sample MMLU) alongside Emma's
test_modeling_deepseekv4.py.
* l0_dgx_b300_ds.yml (4x B300, pre_merge): merge V4 entries into Emma's
existing pre_merge block (added orchestrator: mpi):
- TestDeepSeekV4FlashBase::test_auto_dtype (FP8 aggregate smoke, TP=4)
- TestDeepSeekV4Flash::test_nvfp4_4gpus_static_eplb[moe_backend=TRTLLM]
B300's larger per-GPU memory (~275 GB) lets V4 EPLB and FP8 aggregate
fit at TP=4. Known-issues comment lists 4 currently-blocked entries
(FP8 EPLB cublas + 3 online-EPLB gdrcopy blockers).
* l0_b200.yml / l0_dgx_b200.yml: remove obsolete V4 entries that are now
homed under the _ds.yml files.
Test code (tests/integration/defs/accuracy/test_llm_api_pytorch.py):
* TestDeepSeekV4Flash::test_auto_dtype: aggregate (non-disagg, non-EPLB)
smoke at TP=4 NVFP4 with TRTLLM backend.
* TestDeepSeekV4FlashBase::test_auto_dtype: aggregate smoke at TP=4 FP8
with WIDEEP backend (the CUTLASS path is Hopper-only, so it is not
available on Blackwell).
* _run_deepseekv4_eplb: add tensor_parallel_size=4 parameter (default 4
for B300 sizing).
* Rename _8gpus_*_eplb -> _4gpus_*_eplb (both Flash and Flash-Base, both
static and online); update skip_less_mpi_world_size to 4. WIDEEP
variants of V4-Flash NVFP4 EPLB tests are pytest.param(..., marks=skip)
with a reason pointing at the MXFP4 gap in fused_moe_wide_ep.py
_get_quant_method.
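The skip pattern described for the WIDEEP variants can be sketched as below. This is a minimal illustration, not the code from test_llm_api_pytorch.py: the reason string, the `MOE_BACKENDS` list and the test function name are hypothetical, and the test body is a stand-in for the real EPLB run.

```python
import pytest

# Hypothetical reason text; the real one points at the MXFP4 gap in
# fused_moe_wide_ep.py:_get_quant_method.
WIDEEP_SKIP_REASON = "WIDEEP lacks an MXFP4 branch in _get_quant_method"

# TRTLLM runs normally; WIDEEP stays listed but is skipped, so it is easy
# to re-enable once the backend gap is closed.
MOE_BACKENDS = [
    "TRTLLM",
    pytest.param("WIDEEP", marks=pytest.mark.skip(reason=WIDEEP_SKIP_REASON)),
]


@pytest.mark.parametrize("moe_backend", MOE_BACKENDS)
def test_static_eplb_sketch(moe_backend):
    # Stand-in body: only the non-skipped backend ever reaches this line.
    assert moe_backend == "TRTLLM"
```

Keeping the skipped variant in the parameter list means it still shows up as skipped in the test report, with the reason attached.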
Reference accuracy (tests/integration/defs/accuracy/references/gsm8k.yaml):
* deepseek-ai/DeepSeek-V4-Flash + quant_algo: FP8_BLOCK_SCALES = 95.11
(V4-Flash declares quant_method=fp8 globally even though routed MoE
experts are MXFP4; AccuracyTask picks up FP8_BLOCK_SCALES). Comment
records both measurements: 95.11 on 8x B200/TP=8 with tightened config
and 95.38 on 4x B300/TP=4 with default config (drift +0.27 within
hypothesis-test sigma).
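As a rough sanity check on the drift figure, the arithmetic can be sketched as follows. This assumes the evaluation uses the full public GSM8K test split (1319 problems) and approximates sigma with a plain binomial standard error; the actual AccuracyTask hypothesis test may compute its threshold differently, and the helper name is hypothetical.

```python
import math

# Assumption: full public GSM8K test split.
GSM8K_TEST_SIZE = 1319


def binomial_sigma_pct(accuracy_pct: float, num_samples: int) -> float:
    """Standard error of an accuracy, in percentage points (binomial model)."""
    p = accuracy_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / num_samples)


ref_b200 = 95.11  # 8x B200 / TP=8, tightened config (from the commit message)
ref_b300 = 95.38  # 4x B300 / TP=4, default config (from the commit message)

drift = ref_b300 - ref_b200
sigma = binomial_sigma_pct(ref_b200, GSM8K_TEST_SIZE)

print(f"drift = {drift:+.2f} points, one-sigma = {sigma:.2f} points")
```

Under these assumptions the drift is well below one standard error of a single run, which is consistent with treating the two measurements as the same reference.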
Notes on currently-blocked entries (commented in YAML):
* V4-Flash-Base FP8 + EPLB: hits CUBLAS_STATUS_EXECUTION_FAILED in
forward pass with WIDEEP backend on B200 (memory-pressure related).
Aggregate FP8 path works on B300 — EPLB variant may also work; left
commented pending validation.
* Online EPLB (layer_updates_per_iter>0) requires gdrcopy/gdrdrv for
HostAccessibleDeviceAllocator (moeLoadBalancer.cpp:846); without
gdrcopy the executor crashes during initialization. Static EPLB does
not hit this path. Uncomment when CI nodes have gdrcopy installed.
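One way to decide when the online-EPLB entries can be uncommented is to probe a CI node for the gdrcopy kernel driver. The sketch below assumes the driver exposes its device node at /dev/gdrdrv, which is the usual gdrcopy location but is not stated in this PR; the function name is hypothetical.

```python
import os

# Assumption: gdrcopy's gdrdrv kernel module creates this device node.
GDRDRV_DEVICE = "/dev/gdrdrv"


def gdrcopy_available(device_path: str = GDRDRV_DEVICE) -> bool:
    """Return True if the gdrdrv device node exists on this machine."""
    return os.path.exists(device_path)


if __name__ == "__main__":
    state = "present" if gdrcopy_available() else "missing"
    print(f"gdrdrv device node is {state}")
```

A check like this only confirms that the driver is loaded; it does not prove that the executor can actually use it.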
Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
Two pre-existing failures in test_modeling_deepseekv4.py surfaced once the file was wired into pre_merge CI via the DGX_B200-4_GPUs-PyTorch-DS-1 stage:
* test_deepseek_v4_moe_swiglu_limit_applies_to_routed_and_shared_experts: removed. The test was an `inspect.getsource()` string-match check that asserted literal source text (`"supports_swiglu_limit = False"`, `"mode.has_w4a8_mxfp4_mxfp8()"`) which no longer exists in DeepseekV4MoE.__init__. The implementation was simplified to a runtime tuple-membership test (`moe_cls in (CutlassFusedMoE, ...)`) and dropped the quant-mode-specific gate. Brittle source-string matching is the wrong tool for verifying the swiglu_limit dispatch path; runtime behavior is exercised by any V4 test that constructs a model with `swiglu_limit` set in config.
* test_deepseek_v4_sparse_ratios_prefer_checkpoint_defaults: adjusted. The setup passed `sliding_window=256` and the assertion expected that to propagate to `sparse_attention_config.window_size`. But V4 sparse MLA hardcodes window_size==128 (FMHA kernel TileSizeKV; see the runtime assertion in DeepseekV4TrtllmAttentionMetadata.__post_init__), so 256 is rejected at runtime regardless of the resolution chain. Dropped the misleading `sliding_window=256` from the setup and changed the assertion to `== 128` to reflect the kernel constraint. The test's primary purpose (compress_ratios resolution from checkpoint) is unchanged and still passes.
Also drops the now-unused `DeepseekV4MoE` import (autoflake/ruff).
Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
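The window_size constraint behind the adjusted test can be illustrated with a small stand-in. This is a sketch only: the class and helper names are hypothetical and do not reproduce DeepseekV4TrtllmAttentionMetadata; the one fact taken from the commit message is that a window size other than 128 is rejected in `__post_init__`.

```python
from dataclasses import dataclass

# From the commit message: V4 sparse MLA hardcodes window_size == 128.
KERNEL_WINDOW_SIZE = 128


@dataclass
class SparseAttentionMetadataSketch:
    """Hypothetical stand-in for the real attention metadata class."""

    window_size: int = KERNEL_WINDOW_SIZE

    def __post_init__(self) -> None:
        # Raised explicitly so the check also holds under `python -O`.
        if self.window_size != KERNEL_WINDOW_SIZE:
            raise AssertionError(
                f"window_size must be {KERNEL_WINDOW_SIZE}, got {self.window_size}"
            )


def is_rejected(window_size: int) -> bool:
    """Return True if constructing the metadata fails for this window size."""
    try:
        SparseAttentionMetadataSketch(window_size=window_size)
    except AssertionError:
        return True
    return False
```

With a check like this in place, a test that configures 256 and expects it to propagate cannot pass, which is why the assertion was changed to compare against 128.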
120e606 to bfcb2c6
lfr-0531 pushed a commit that referenced this pull request on May 7, 2026
Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
lfr-0531 pushed a commit that referenced this pull request on May 14, 2026
Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
lfr-0531 pushed a commit to lfr-0531/TensorRT-LLM that referenced this pull request on May 29, 2026
Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
(cherry picked from commit fa1e55e)
Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
Summary
Layered V4-specific CI on top of the per-platform _ds.yml structure introduced in #13604. Adds unit-test coverage for V4 sparse-attention kernels + tokenizer (single B200), an aggregate forward smoke for V4-Flash NVFP4 (4 B200), and aggregate smoke + static EPLB sanity for V4-Flash-Base FP8 / V4-Flash NVFP4 EPLB (4 B300).
Built on top of #13604 (emma/update_ds4_ci).
Changes
YAML wiring
* tests/integration/test_lists/test-db/l0_b200_ds.yml: add V4 sparse-attention unit tests (test_deepseek_v4_sparse_mla, test_deepseek_v4_cache_manager, test_deepseek_v4_indices_transform, test_deepseek_v4_o_proj, test_compressor_kernel, test_compressor_module, test_compressor_tf32) + tokenizer chat-template tests. Single B200, pre_merge.
* tests/integration/test_lists/test-db/l0_dgx_b200_ds.yml: add TestDeepSeekV4Flash::test_auto_dtype (NVFP4 aggregate smoke, TP=4) to the existing 4-GPU pre_merge block alongside test_modeling_deepseekv4.py.
* tests/integration/test_lists/test-db/l0_dgx_b300_ds.yml: merge V4 entries into the existing pre_merge block (added orchestrator: mpi). New entries: TestDeepSeekV4FlashBase::test_auto_dtype (FP8 aggregate smoke, TP=4) + TestDeepSeekV4Flash::test_nvfp4_4gpus_static_eplb[moe_backend=TRTLLM] (full GSM8K eval). Known-issues comment lists 4 currently-blocked entries (FP8 EPLB cublas + 3 online-EPLB gdrcopy blockers) for future unblocking.
* tests/integration/test_lists/test-db/l0_b200.yml: remove the test_modeling_deepseekv4.py entry (now homed in the l0_dgx_b200_ds.yml 4-GPU block per #13604).
* tests/integration/test_lists/test-db/l0_dgx_b200.yml: remove obsolete V4 entries that are now homed under the _ds.yml files.
Test code
tests/integration/defs/accuracy/test_llm_api_pytorch.py (+103/-26):
* TestDeepSeekV4Flash::test_auto_dtype: aggregate (non-disagg, non-EPLB) smoke at TP=4 NVFP4 + TRTLLM backend + attention DP. 1-sample MMLU via is_integration_test=True (no GSM8K reference required).
* TestDeepSeekV4FlashBase::test_auto_dtype: aggregate smoke at TP=4 FP8 + WIDEEP backend (CUTLASS path is Hopper-only on Blackwell). 1-sample MMLU smoke.
* _run_deepseekv4_eplb: add tensor_parallel_size=4 parameter (default 4 for B300 sizing). The earlier B200-specific tightening (kv_cache fraction=0.15, cuda_graph=None, max_num_tokens=2048) was reverted because B300's larger per-GPU memory absorbs the EPLB redundancy + DeepGemm MoE workspace pressure at default settings.
* Rename _8gpus_*_eplb → _4gpus_*_eplb (both Flash and Flash-Base, both static and online); update skip_less_mpi_world_size to 4. WIDEEP variants of V4-Flash NVFP4 EPLB tests are wrapped in pytest.param(..., marks=pytest.mark.skip) with a reason pointing at the MXFP4 gap in fused_moe_wide_ep.py:_get_quant_method.
Reference accuracy
tests/integration/defs/accuracy/references/gsm8k.yaml (+10):
V4-Flash declares quant_method=fp8 globally even though routed MoE experts are MXFP4; AccuracyTask picks up FP8_BLOCK_SCALES from the top-level config.
Stage layout
* DGX_B200-PyTorch-DS-1 (1× B200): l0_b200_ds.yml
* DGX_B200-4_GPUs-PyTorch-DS-1 (4× B200): l0_dgx_b200_ds.yml, test_modeling_deepseekv4.py + V4-Flash NVFP4 aggregate smoke
* DGX_B300-4_GPUs-PyTorch-DS-1 (4× B300): l0_dgx_b300_ds.yml
Validation
Verified end-to-end on 4× B300:
* TestDeepSeekV4FlashBase::test_auto_dtype (FP8 agg, TP=4)
* TestDeepSeekV4Flash::test_nvfp4_4gpus_static_eplb[TRTLLM] (full GSM8K)
Earlier validation runs on 8× B200 confirmed:
* ... test_modeling_deepseekv4.py that pre-date this PR (not regressions).
Known issues (commented out in YAMLs)
These entries are listed in YAML comments so they can be uncommented when the underlying issue is fixed:
* TestDeepSeekV4FlashBase::test_fp8_4gpus_static_eplb[WIDEEP]: V4-Flash-Base FP8 + EPLB hits CUBLAS_STATUS_EXECUTION_FAILED in the forward pass with the WIDEEP backend on B200 (memory-pressure related). The aggregate FP8 path works on B300; the EPLB variant likely also works there but was not validated in this PR.
* ... mtp_nextn=0,1) + V4-Flash-Base online EPLB: online EPLB (layer_updates_per_iter > 0) requires gdrcopy/gdrdrv for HostAccessibleDeviceAllocator (moeLoadBalancer.cpp:846). Without gdrcopy the executor crashes during initialization. Static EPLB does not hit this path.
* ... fused_moe_wide_ep.py:_get_quant_method lacks an MXFP4 branch, so it raises ValueError: Unsupported quantization mode: [65536] for V4-Flash routed experts. Skipped at the test level via pytest.param(..., marks=pytest.mark.skip). The TRTLLM backend works.
Test plan
After this PR merges, the following pre_merge stages will exercise V4 on every PR touching the relevant code:
* DGX_B200-PyTorch-DS-1 runs the 8 V4 unit-test files (~10-20 min)
* DGX_B200-4_GPUs-PyTorch-DS-1 runs test_modeling_deepseekv4.py + V4-Flash agg smoke (~15-25 min)
* DGX_B300-4_GPUs-PyTorch-DS-1 runs V4-Flash-Base agg smoke + V4-Flash NVFP4 static EPLB with full GSM8K eval (~40-60 min)
* gsm8k.yaml accuracy threshold check (95.11 ± hypothesis-test sigma) catches V4-Flash NVFP4 EPLB regressions
Follow-ups
* ... fused_moe_wide_ep.py:_get_quant_method MXFP4 gap so WIDEEP backend can also serve V4-Flash; then enable the WIDEEP-variant tests.
* ... mtp_nextn=1 (MTP variant) since it's a separate spec_dec_algo: MTP reference key.