[None][test] Add DeepSeek-V4 CI coverage by Barry-Delaney · Pull Request #13653 · NVIDIA/TensorRT-LLM

Barry-Delaney · 2026-04-30T06:40:52Z

Summary

Layers V4-specific CI on top of the per-platform `_ds.yml` structure introduced in #13604. Adds unit-test coverage for the V4 sparse-attention kernels and tokenizer (1× B200), an aggregate forward smoke test for V4-Flash NVFP4 (4× B200), and an aggregate smoke test for V4-Flash-Base FP8 plus a static EPLB sanity check for V4-Flash NVFP4 (4× B300).

Built on top of #13604 (emma/update_ds4_ci).

Changes

YAML wiring

| File | Change |
|---|---|
| `tests/integration/test_lists/test-db/l0_b200_ds.yml` | +12 lines. Append 7 V4 sparse-attention unit tests (`test_deepseek_v4_sparse_mla`, `test_deepseek_v4_cache_manager`, `test_deepseek_v4_indices_transform`, `test_deepseek_v4_o_proj`, `test_compressor_kernel`, `test_compressor_module`, `test_compressor_tf32`) + tokenizer chat-template tests. Single B200, pre_merge. |
| `tests/integration/test_lists/test-db/l0_dgx_b200_ds.yml` | +4 lines. Add `TestDeepSeekV4Flash::test_auto_dtype` (NVFP4 aggregate smoke, TP=4) to the existing 4-GPU pre_merge block alongside `test_modeling_deepseekv4.py`. |
| `tests/integration/test_lists/test-db/l0_dgx_b300_ds.yml` | +25 lines. Merge V4 entries into Emma's existing pre_merge block (added `orchestrator: mpi`). New entries: `TestDeepSeekV4FlashBase::test_auto_dtype` (FP8 aggregate smoke, TP=4) + `TestDeepSeekV4Flash::test_nvfp4_4gpus_static_eplb[moe_backend=TRTLLM]` (full GSM8K eval). Known-issues comment lists 4 currently-blocked entries (FP8 EPLB cublas + 3 online-EPLB gdrcopy blockers) for future unblocking. |
| `tests/integration/test_lists/test-db/l0_b200.yml` | -1 line. Remove duplicate `test_modeling_deepseekv4.py` entry (now homed in `l0_dgx_b200_ds.yml` 4-GPU block per #13604). |
| `tests/integration/test_lists/test-db/l0_dgx_b200.yml` | -4 lines. Remove obsolete commented V4 EPLB entries that are superseded by the new `_ds.yml` files. |

Test code

tests/integration/defs/accuracy/test_llm_api_pytorch.py (+103/-26):

TestDeepSeekV4Flash::test_auto_dtype: aggregate (non-disagg, non-EPLB) smoke at TP=4 NVFP4 + TRTLLM backend + attention DP. 1-sample MMLU via is_integration_test=True (no GSM8K reference required).
TestDeepSeekV4FlashBase::test_auto_dtype: aggregate smoke at TP=4 FP8 + WIDEEP backend (the CUTLASS path is Hopper-only, so it cannot be used on Blackwell). 1-sample MMLU smoke.
_run_deepseekv4_eplb: add tensor_parallel_size=4 parameter (default 4 for B300 sizing). The earlier B200-specific tightening (kv_cache fraction=0.15, cuda_graph=None, max_num_tokens=2048) was reverted because B300's larger per-GPU memory absorbs the EPLB redundancy + DeepGemm MoE workspace pressure at default settings.
Rename _8gpus_*_eplb → _4gpus_*_eplb (both Flash and Flash-Base, both static and online); update skip_less_mpi_world_size to 4. WIDEEP variants of V4-Flash NVFP4 EPLB tests are wrapped in pytest.param(..., marks=pytest.mark.skip) with a reason pointing at the MXFP4 gap in fused_moe_wide_ep.py:_get_quant_method.
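
For readers unfamiliar with the pattern, the skip wrapping described above looks roughly like the sketch below. It is illustrative only: the test name, the parameter ids, and the reason text are hypothetical stand-ins, not the code in `test_llm_api_pytorch.py`.

```python
import pytest

# Hypothetical reason text; the real test points at the MXFP4 gap in
# fused_moe_wide_ep.py:_get_quant_method.
WIDEEP_SKIP_REASON = "WIDEEP has no MXFP4 branch in _get_quant_method"

# One parameter set per MoE backend. The WIDEEP case stays in the matrix
# but carries a skip mark, so it shows up as SKIPPED instead of vanishing.
MOE_BACKENDS = [
    pytest.param("TRTLLM", id="moe_backend=TRTLLM"),
    pytest.param(
        "WIDEEP",
        id="moe_backend=WIDEEP",
        marks=pytest.mark.skip(reason=WIDEEP_SKIP_REASON),
    ),
]


@pytest.mark.parametrize("moe_backend", MOE_BACKENDS)
def test_static_eplb_sketch(moe_backend):
    # Placeholder body: the real test builds the model and runs GSM8K.
    assert moe_backend in ("TRTLLM", "WIDEEP")
```

Keeping the skipped variant in the parameter list means re-enabling it later is a one-line change (drop the `marks=` argument) once the backend gap is closed.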

Reference accuracy

tests/integration/defs/accuracy/references/gsm8k.yaml (+10):

deepseek-ai/DeepSeek-V4-Flash:
  # GSM8K measurements:
  # * 95.11 on 8x B200 178GB at TP=8 with tightened config (fraction=0.15,
  #   cuda_graph=None, max_num_tokens=2048) — original measurement.
  # * 95.38 on 4x B300 275GB at TP=4 with default config (fraction=0.5,
  #   CudaGraphConfig()) — current CI path (test_nvfp4_4gpus_static_eplb
  #   in l0_dgx_b300_ds.yml). Drift +0.27 from 95.11 is within sigma; the
  #   95.11 reference still holds for the hypothesis test.
  - quant_algo: FP8_BLOCK_SCALES
    accuracy: 95.11

V4-Flash declares quant_method=fp8 globally even though routed MoE experts are MXFP4; AccuracyTask picks up FP8_BLOCK_SCALES from the top-level config.
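
The "within sigma" claim in the YAML comment can be sanity-checked with a back-of-the-envelope binomial estimate. The sketch below assumes the full GSM8K test split of 1,319 problems and treats each answer as an independent Bernoulli trial. It is not the hypothesis test that AccuracyTask runs, whose parameters are not shown in this PR; the function name is made up for illustration.

```python
import math

# Assumption: the "full GSM8K eval" uses the 1,319-problem test split.
GSM8K_TEST_SIZE = 1319


def accuracy_sigma_pct(accuracy_pct: float, num_samples: int) -> float:
    """Standard error of a measured accuracy, in percentage points,
    under a simple binomial model (independent per-sample outcomes)."""
    p = accuracy_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / num_samples)


reference = 95.11  # 8x B200, TP=8, tightened config
measured = 95.38   # 4x B300, TP=4, default config

sigma = accuracy_sigma_pct(reference, GSM8K_TEST_SIZE)
drift = measured - reference
drift_in_sigmas = drift / sigma

print(f"sigma ~ {sigma:.2f} pts, drift = {drift:+.2f} pts "
      f"({drift_in_sigmas:.2f} sigma)")
```

Under these assumptions one standard error is roughly 0.6 points, so a +0.27 drift is well under one sigma, which is consistent with keeping 95.11 as the reference.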

Stage layout

| Jenkins stage | YAML file | V4 entries |
|---|---|---|
| `DGX_B200-PyTorch-DS-1` (1× B200) | `l0_b200_ds.yml` | 7 sparse-attention unit tests + tokenizer test |
| `DGX_B200-4_GPUs-PyTorch-DS-1` (4× B200) | `l0_dgx_b200_ds.yml` | `test_modeling_deepseekv4.py` + V4-Flash NVFP4 aggregate smoke |
| `DGX_B300-4_GPUs-PyTorch-DS-1` (4× B300) | `l0_dgx_b300_ds.yml` | V4-Flash-Base FP8 aggregate smoke + V4-Flash NVFP4 static EPLB sanity |

Validation

Verified end-to-end on 4× B300:

| Test | Result | Wall-clock (mm:ss) |
|---|---|---|
| `TestDeepSeekV4FlashBase::test_auto_dtype` (FP8 agg, TP=4) | ✅ PASSED | 20:40 |
| `TestDeepSeekV4Flash::test_nvfp4_4gpus_static_eplb[TRTLLM]` (full GSM8K) | ✅ PASSED (GSM8K 95.38) | 16:50 |

Earlier validation runs on 8× B200 confirmed:

7 sparse-attention unit tests + tokenizer + 25 of 28 modeling-test functions pass on a single B200. The 3 modeling failures are stale string-match assertions in test_modeling_deepseekv4.py that pre-date this PR — not regressions.
V4-Flash NVFP4 EPLB on TRTLLM backend at TP=8 measured GSM8K = 95.11 (the reference baseline; B300/TP=4 drift to 95.38 is within sigma).

Known issues (commented out in YAMLs)

These entries are listed in YAML comments so they can be uncommented when the underlying issue is fixed:

TestDeepSeekV4FlashBase::test_fp8_4gpus_static_eplb[WIDEEP]: V4-Flash-Base FP8 + EPLB hits CUBLAS_STATUS_EXECUTION_FAILED in the forward pass with the WIDEEP backend on B200 (memory-pressure related). The aggregate FP8 path works on B300; the EPLB variant likely also works there but was not validated in this PR.
V4-Flash online EPLB (mtp_nextn=0,1) + V4-Flash-Base online EPLB — Online EPLB (layer_updates_per_iter > 0) requires gdrcopy/gdrdrv for HostAccessibleDeviceAllocator (moeLoadBalancer.cpp:846). Without gdrcopy the executor crashes during initialization. Static EPLB does not hit this path.
WIDEEP variant of V4-Flash NVFP4 EPLB — fused_moe_wide_ep.py:_get_quant_method lacks an MXFP4 branch, so it raises ValueError: Unsupported quantization mode: [65536] for V4-Flash routed experts. Skipped at the test level via pytest.param(..., marks=pytest.mark.skip). TRTLLM backend works.
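
One possible way to guard the online-EPLB entries once they are uncommented is a node-level availability check, sketched below. This is an assumption-laden illustration, not code from this PR: it assumes the gdrdrv kernel module exposes a `/dev/gdrdrv` device node (as the upstream gdrcopy package does), and both helper names are hypothetical.

```python
import os
from typing import Optional

# Assumption: the gdrdrv kernel module creates this device node when loaded.
GDRDRV_DEVICE = "/dev/gdrdrv"


def has_gdrcopy(device_path: str = GDRDRV_DEVICE) -> bool:
    """Return True if the gdrdrv device node is present on this host."""
    return os.path.exists(device_path)


def online_eplb_skip_reason(layer_updates_per_iter: int,
                            device_path: str = GDRDRV_DEVICE) -> Optional[str]:
    """Return a skip reason for online EPLB, or None if the test can run.

    Static EPLB (layer_updates_per_iter == 0) never needs gdrcopy.
    """
    if layer_updates_per_iter > 0 and not has_gdrcopy(device_path):
        return "online EPLB needs gdrcopy/gdrdrv, which is missing on this node"
    return None
```

A test could call `online_eplb_skip_reason(...)` at the top and pass any non-None result to `pytest.skip`, so nodes without gdrcopy report a skip rather than an executor crash during initialization.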

Test plan

After this PR merges, the following pre_merge stages will exercise V4 on every PR touching the relevant code:

DGX_B200-PyTorch-DS-1 runs the 8 V4 unit-test files (~10-20 min)
DGX_B200-4_GPUs-PyTorch-DS-1 runs test_modeling_deepseekv4.py + V4-Flash agg smoke (~15-25 min)
DGX_B300-4_GPUs-PyTorch-DS-1 runs V4-Flash-Base agg smoke + V4-Flash NVFP4 static EPLB with full GSM8K eval (~40-60 min)
gsm8k.yaml accuracy threshold check (95.11 ± hypothesis-test sigma) catches V4-Flash NVFP4 EPLB regressions

Follow-ups

Resolve the fused_moe_wide_ep.py:_get_quant_method MXFP4 gap so WIDEEP backend can also serve V4-Flash; then enable the WIDEEP-variant tests.
Validate V4-Flash-Base FP8 EPLB on B300 (the cublas issue may be B200-specific).
Once CI nodes have gdrcopy installed, uncomment the 3 online-EPLB entries and measure a GSM8K baseline for mtp_nextn=1 (the MTP variant), because it needs its own `spec_dec_algo: MTP` reference key.

lfr-0531 · 2026-04-30T07:56:03Z

/bot run

tensorrt-cicd · 2026-04-30T08:02:42Z

PR_Github #46340 [ run ] triggered by Bot. Commit: dff35aa Link to invocation

tensorrt-cicd · 2026-04-30T08:08:12Z

PR_Github #46340 [ run ] completed with state FAILURE. Commit: dff35aa

Link to invocation

Barry-Delaney · 2026-04-30T08:23:34Z

/bot run

tensorrt-cicd · 2026-04-30T08:33:17Z

PR_Github #46345 [ run ] triggered by Bot. Commit: 79591c4 Link to invocation

Barry-Delaney · 2026-04-30T09:22:58Z

/bot kill

tensorrt-cicd · 2026-04-30T09:29:26Z

PR_Github #46353 [ kill ] triggered by Bot. Commit: 120e606 Link to invocation

tensorrt-cicd · 2026-04-30T09:29:29Z

PR_Github #46345 [ run ] completed with state ABORTED. Commit: 79591c4

Link to invocation

tensorrt-cicd · 2026-04-30T09:30:00Z

PR_Github #46353 [ kill ] completed with state SUCCESS. Commit: 120e606
Successfully killed previous jobs for commit 120e606

Link to invocation

Layered V4-specific CI on top of Emma's per-platform _ds.yml structure.

YAML wiring:
* l0_b200_ds.yml (1x B200, pre_merge): add V4 sparse-attention unit tests (sparse_mla, cache_manager, indices_transform, o_proj, compressor_kernel, compressor_module, compressor_tf32) + tokenizer chat-template test.
* l0_dgx_b200_ds.yml (4x B200, pre_merge): add V4-Flash aggregate smoke (TP=4 NVFP4, TRTLLM backend, 1-sample MMLU) alongside Emma's test_modeling_deepseekv4.py.
* l0_dgx_b300_ds.yml (4x B300, pre_merge): merge V4 entries into Emma's existing pre_merge block (added orchestrator: mpi):
  - TestDeepSeekV4FlashBase::test_auto_dtype (FP8 aggregate smoke, TP=4)
  - TestDeepSeekV4Flash::test_nvfp4_4gpus_static_eplb[moe_backend=TRTLLM]
  B300's larger per-GPU memory (~275 GB) lets V4 EPLB and FP8 aggregate fit at TP=4. Known-issues comment lists 4 currently-blocked entries (FP8 EPLB cublas + 3 online-EPLB gdrcopy blockers).
* l0_b200.yml / l0_dgx_b200.yml: remove obsolete V4 entries that are now homed under the _ds.yml files.

Test code (tests/integration/defs/accuracy/test_llm_api_pytorch.py):
* TestDeepSeekV4Flash::test_auto_dtype: aggregate (non-disagg, non-EPLB) smoke at TP=4 NVFP4 with TRTLLM backend.
* TestDeepSeekV4FlashBase::test_auto_dtype: aggregate smoke at TP=4 FP8 with WIDEEP backend (CUTLASS path is Hopper-only on Blackwell).
* _run_deepseekv4_eplb: add tensor_parallel_size=4 parameter (default 4 for B300 sizing).
* Rename _8gpus_*_eplb -> _4gpus_*_eplb (both Flash and Flash-Base, both static and online); update skip_less_mpi_world_size to 4. WIDEEP variants of V4-Flash NVFP4 EPLB tests are pytest.param(..., marks=skip) with a reason pointing at the MXFP4 gap in fused_moe_wide_ep.py _get_quant_method.

Reference accuracy (tests/integration/defs/accuracy/references/gsm8k.yaml):
* deepseek-ai/DeepSeek-V4-Flash + quant_algo: FP8_BLOCK_SCALES = 95.11 (V4-Flash declares quant_method=fp8 globally even though routed MoE experts are MXFP4; AccuracyTask picks up FP8_BLOCK_SCALES). Comment records both measurements: 95.11 on 8x B200/TP=8 with tightened config and 95.38 on 4x B300/TP=4 with default config (drift +0.27 within hypothesis-test sigma).

Notes on currently-blocked entries (commented in YAML):
* V4-Flash-Base FP8 + EPLB: hits CUBLAS_STATUS_EXECUTION_FAILED in forward pass with WIDEEP backend on B200 (memory-pressure related). Aggregate FP8 path works on B300; EPLB variant may also work; left commented pending validation.
* Online EPLB (layer_updates_per_iter>0) requires gdrcopy/gdrdrv for HostAccessibleDeviceAllocator (moeLoadBalancer.cpp:846); without gdrcopy the executor crashes during initialization. Static EPLB does not hit this path. Uncomment when CI nodes have gdrcopy installed.

Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>

Two pre-existing failures in test_modeling_deepseekv4.py surfaced once the file was wired into pre_merge CI via the DGX_B200-4_GPUs-PyTorch-DS-1 stage:

* test_deepseek_v4_moe_swiglu_limit_applies_to_routed_and_shared_experts: removed. The test was an `inspect.getsource()` string-match check that asserted literal source text (`"supports_swiglu_limit = False"`, `"mode.has_w4a8_mxfp4_mxfp8()"`) which no longer exists in DeepseekV4MoE.__init__. The implementation was simplified to a runtime tuple-membership test (`moe_cls in (CutlassFusedMoE, ...)`) and dropped the quant-mode-specific gate. Brittle source-string matching is the wrong tool for verifying the swiglu_limit dispatch path; runtime behavior is exercised by any V4 test that constructs a model with `swiglu_limit` set in config.
* test_deepseek_v4_sparse_ratios_prefer_checkpoint_defaults: adjusted. The setup passed `sliding_window=256` and the assertion expected that to propagate to `sparse_attention_config.window_size`. But V4 sparse MLA hardcodes window_size==128 (FMHA kernel TileSizeKV; see the runtime assertion in DeepseekV4TrtllmAttentionMetadata.__post_init__), so 256 is rejected at runtime regardless of the resolution chain. Dropped the misleading `sliding_window=256` from the setup and changed the assertion to `== 128` to reflect the kernel constraint. The test's primary purpose (compress_ratios resolution from checkpoint) is unchanged and still passes.

Also drops the now-unused `DeepseekV4MoE` import (autoflake/ruff).

Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
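
To illustrate why the source-string check in the removed test was brittle, here is a minimal sketch with two toy implementations that behave identically at runtime. Everything in it is hypothetical: the function bodies are stand-ins rather than the real `DeepseekV4MoE.__init__`, and the sources are held as plain strings instead of being read through `inspect.getsource()` so the example runs anywhere.

```python
# Two toy implementations with the same runtime behavior.
OLD_SRC = '''
def supports_limit(backend):
    supports_swiglu_limit = False
    if backend == "cutlass":
        supports_swiglu_limit = True
    return supports_swiglu_limit
'''

# Refactored to a tuple-membership test; the literal text checked by the
# string-match assertion no longer appears.
NEW_SRC = '''
def supports_limit(backend):
    return backend in ("cutlass",)
'''


def load(src):
    """Execute a source string and return its supports_limit function."""
    namespace = {}
    exec(src, namespace)
    return namespace["supports_limit"]


def string_match_check(src):
    """Brittle check: passes only if a literal line survives refactoring."""
    return "supports_swiglu_limit = False" in src


def behavior_check(fn):
    """Robust check: asserts on what the function returns."""
    return fn("cutlass") is True and fn("other") is False


results = {
    "old": (string_match_check(OLD_SRC), behavior_check(load(OLD_SRC))),
    "new": (string_match_check(NEW_SRC), behavior_check(load(NEW_SRC))),
}
print(results)
```

The string-match check fails on the refactored version even though nothing observable changed, while the behavioral check passes on both. That is the failure mode the commit describes.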

Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>

Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com> Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>

Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com> Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com> (cherry picked from commit fa1e55e) Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>

github-actions Bot assigned Barry-Delaney Apr 30, 2026

Barry-Delaney requested a review from lfr-0531 April 30, 2026 06:41

Barry-Delaney marked this pull request as ready for review April 30, 2026 07:56

Barry-Delaney requested review from a team as code owners April 30, 2026 07:56

Barry-Delaney requested review from mzweilz and zeroepoch and removed request for a team April 30, 2026 07:56

Barry-Delaney force-pushed the user/jinshik/v4-ci-on-pr13604 branch from dff35aa to 79591c4 Compare April 30, 2026 08:21

Barry-Delaney force-pushed the user/jinshik/v4-ci-on-pr13604 branch from 5c0b9ba to 120e606 Compare April 30, 2026 09:22

Barry-Delaney added 2 commits April 30, 2026 18:37

Barry-Delaney force-pushed the user/jinshik/v4-ci-on-pr13604 branch from 120e606 to bfcb2c6 Compare April 30, 2026 10:37

Barry-Delaney merged commit 3992dd2 into NVIDIA:feat/deepseek_v4 Apr 30, 2026
4 checks passed

lfr-0531 added the deepseek-v4 label May 7, 2026

lfr-0531 pushed a commit that referenced this pull request May 7, 2026

[None][test] Add DeepSeek-V4 CI coverage (#13653)

ab5cfde

Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>

lfr-0531 pushed a commit that referenced this pull request May 14, 2026

[None][test] Add DeepSeek-V4 CI coverage (#13653)

fa1e55e

Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com> Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>

Barry-Delaney merged 2 commits into NVIDIA:feat/deepseek_v4 from Barry-Delaney:user/jinshik/v4-ci-on-pr13604