
[None][test] Add DeepSeek-V4 CI coverage #13653

Merged
Barry-Delaney merged 2 commits into NVIDIA:feat/deepseek_v4 from Barry-Delaney:user/jinshik/v4-ci-on-pr13604
Apr 30, 2026

@Barry-Delaney Barry-Delaney commented Apr 30, 2026

Summary

Layers V4-specific CI on top of the per-platform _ds.yml structure introduced in #13604. Adds unit-test coverage for the V4 sparse-attention kernels and tokenizer (single B200), an aggregate forward smoke test for V4-Flash NVFP4 (4 B200), and an aggregate smoke test plus a static EPLB sanity check for V4-Flash-Base FP8 / V4-Flash NVFP4 EPLB (4 B300).

Built on top of #13604 (emma/update_ds4_ci).

Changes

YAML wiring

| File | Change |
|---|---|
| tests/integration/test_lists/test-db/l0_b200_ds.yml | +12 lines. Append 7 V4 sparse-attention unit tests (test_deepseek_v4_sparse_mla, test_deepseek_v4_cache_manager, test_deepseek_v4_indices_transform, test_deepseek_v4_o_proj, test_compressor_kernel, test_compressor_module, test_compressor_tf32) + tokenizer chat-template tests. Single B200, pre_merge. |
| tests/integration/test_lists/test-db/l0_dgx_b200_ds.yml | +4 lines. Add TestDeepSeekV4Flash::test_auto_dtype (NVFP4 aggregate smoke, TP=4) to the existing 4-GPU pre_merge block alongside test_modeling_deepseekv4.py. |
| tests/integration/test_lists/test-db/l0_dgx_b300_ds.yml | +25 lines. Merge V4 entries into Emma's existing pre_merge block (added orchestrator: mpi). New entries: TestDeepSeekV4FlashBase::test_auto_dtype (FP8 aggregate smoke, TP=4) + TestDeepSeekV4Flash::test_nvfp4_4gpus_static_eplb[moe_backend=TRTLLM] (full GSM8K eval). Known-issues comment lists 4 currently-blocked entries (FP8 EPLB cublas + 3 online-EPLB gdrcopy blockers) for future unblocking. |
| tests/integration/test_lists/test-db/l0_b200.yml | -1 line. Remove duplicate test_modeling_deepseekv4.py entry (now homed in l0_dgx_b200_ds.yml 4-GPU block per #13604). |
| tests/integration/test_lists/test-db/l0_dgx_b200.yml | -4 lines. Remove obsolete commented V4 EPLB entries that are superseded by the new _ds.yml files. |

Test code

tests/integration/defs/accuracy/test_llm_api_pytorch.py (+103/-26):

  • TestDeepSeekV4Flash::test_auto_dtype: aggregate (non-disagg, non-EPLB) smoke at TP=4 NVFP4 + TRTLLM backend + attention DP. 1-sample MMLU via is_integration_test=True (no GSM8K reference required).
  • TestDeepSeekV4FlashBase::test_auto_dtype: aggregate smoke at TP=4 FP8 + WIDEEP backend (the CUTLASS path is Hopper-only, so it is not usable on Blackwell). 1-sample MMLU smoke.
  • _run_deepseekv4_eplb: add tensor_parallel_size=4 parameter (default 4 for B300 sizing). The earlier B200-specific tightening (kv_cache fraction=0.15, cuda_graph=None, max_num_tokens=2048) was reverted because B300's larger per-GPU memory absorbs the EPLB redundancy + DeepGemm MoE workspace pressure at default settings.
  • Rename _8gpus_*_eplb → _4gpus_*_eplb (both Flash and Flash-Base, both static and online); update skip_less_mpi_world_size to 4. WIDEEP variants of V4-Flash NVFP4 EPLB tests are wrapped in pytest.param(..., marks=pytest.mark.skip) with a reason pointing at the MXFP4 gap in fused_moe_wide_ep.py:_get_quant_method.
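
As a rough illustration of the skip wrapping described in the last bullet, the sketch below parametrizes a test over MoE backends and marks the WIDEEP variant as skipped. The reason string, the explicit ids, and the placeholder test body are assumptions made for this example; the real parametrization lives in test_llm_api_pytorch.py and may differ.

```python
import pytest

# Hypothetical reason text; the actual wording in the test file may differ.
WIDEEP_SKIP_REASON = (
    "fused_moe_wide_ep.py:_get_quant_method has no MXFP4 branch "
    "for V4-Flash routed experts"
)

# One entry per MoE backend. The WIDEEP entry stays in the list so it is easy
# to re-enable, but pytest reports it as skipped instead of running it.
MOE_BACKEND_PARAMS = [
    pytest.param("TRTLLM", id="moe_backend=TRTLLM"),
    pytest.param(
        "WIDEEP",
        id="moe_backend=WIDEEP",
        marks=pytest.mark.skip(reason=WIDEEP_SKIP_REASON),
    ),
]


@pytest.mark.parametrize("moe_backend", MOE_BACKEND_PARAMS)
def test_nvfp4_4gpus_static_eplb(moe_backend):
    # Placeholder body: the real test builds the model and runs GSM8K.
    assert moe_backend == "TRTLLM"
```

Keeping the blocked variant as a skipped parameter, rather than deleting it, means that unblocking it later is a one-line change once the MXFP4 gap is closed.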

Reference accuracy

tests/integration/defs/accuracy/references/gsm8k.yaml (+10):

deepseek-ai/DeepSeek-V4-Flash:
  # GSM8K measurements:
  # * 95.11 on 8x B200 178GB at TP=8 with tightened config (fraction=0.15,
  #   cuda_graph=None, max_num_tokens=2048) — original measurement.
  # * 95.38 on 4x B300 275GB at TP=4 with default config (fraction=0.5,
  #   CudaGraphConfig()) — current CI path (test_nvfp4_4gpus_static_eplb
  #   in l0_dgx_b300_ds.yml). Drift +0.27 from 95.11 is within sigma; the
  #   95.11 reference still holds for the hypothesis test.
  - quant_algo: FP8_BLOCK_SCALES
    accuracy: 95.11

V4-Flash declares quant_method=fp8 globally even though routed MoE experts are MXFP4; AccuracyTask picks up FP8_BLOCK_SCALES from the top-level config.
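
To make the "within sigma" remark concrete, here is a back-of-the-envelope sketch that treats each GSM8K answer as an independent Bernoulli trial. This is a simplified binomial approximation, not the hypothesis test that AccuracyTask actually runs, and it assumes the full eval uses the 1319-problem GSM8K test split.

```python
import math

# Assumption: the "full GSM8K eval" covers the public test split (1319 problems).
GSM8K_TEST_SIZE = 1319


def binomial_sigma_points(accuracy_pct, num_samples=GSM8K_TEST_SIZE):
    """Standard deviation of a measured accuracy, in percentage points."""
    p = accuracy_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / num_samples)


def drift_in_sigmas(reference_pct, measured_pct, num_samples=GSM8K_TEST_SIZE):
    """Absolute drift between two accuracies, in units of the reference sigma."""
    sigma = binomial_sigma_points(reference_pct, num_samples)
    return abs(measured_pct - reference_pct) / sigma


sigma = binomial_sigma_points(95.11)
drift = drift_in_sigmas(95.11, 95.38)
print(f"sigma ~ {sigma:.2f} points, drift = {drift:.2f} sigma")
```

Under these assumptions one sigma is about 0.6 accuracy points, so a +0.27 drift is well under one sigma.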

Stage layout

| Jenkins stage | yaml | V4 entries |
|---|---|---|
| DGX_B200-PyTorch-DS-1 (1× B200) | l0_b200_ds.yml | 7 sparse-attention unit tests + tokenizer test |
| DGX_B200-4_GPUs-PyTorch-DS-1 (4× B200) | l0_dgx_b200_ds.yml | test_modeling_deepseekv4.py + V4-Flash NVFP4 aggregate smoke |
| DGX_B300-4_GPUs-PyTorch-DS-1 (4× B300) | l0_dgx_b300_ds.yml | V4-Flash-Base FP8 aggregate smoke + V4-Flash NVFP4 static EPLB sanity |

Validation

Verified end-to-end on 4× B300:

| Test | Result | Wall-clock |
|---|---|---|
| TestDeepSeekV4FlashBase::test_auto_dtype (FP8 agg, TP=4) | ✅ PASSED | 20:40 |
| TestDeepSeekV4Flash::test_nvfp4_4gpus_static_eplb[TRTLLM] (full GSM8K) | ✅ PASSED (GSM8K 95.38) | 16:50 |

Earlier validation runs on 8× B200 confirmed:

  • 7 sparse-attention unit tests + tokenizer + 25 of 28 modeling-test functions pass on a single B200. The 3 modeling failures are stale string-match assertions in test_modeling_deepseekv4.py that pre-date this PR — not regressions.
  • V4-Flash NVFP4 EPLB on TRTLLM backend at TP=8 measured GSM8K = 95.11 (the reference baseline; B300/TP=4 drift to 95.38 is within sigma).

Known issues (commented out in YAMLs)

These entries are listed in YAML comments so they can be uncommented when the underlying issue is fixed:

  1. TestDeepSeekV4FlashBase::test_fp8_4gpus_static_eplb[WIDEEP] — V4-Flash-Base FP8 + EPLB hits CUBLAS_STATUS_EXECUTION_FAILED in forward pass with WIDEEP backend on B200 (memory-pressure tied). The aggregate FP8 path works on B300 — the EPLB variant likely also works there but was not validated in this PR.
  2. V4-Flash online EPLB (mtp_nextn=0,1) + V4-Flash-Base online EPLB — Online EPLB (layer_updates_per_iter > 0) requires gdrcopy/gdrdrv for HostAccessibleDeviceAllocator (moeLoadBalancer.cpp:846). Without gdrcopy the executor crashes during initialization. Static EPLB does not hit this path.
  3. WIDEEP variant of V4-Flash NVFP4 EPLB: fused_moe_wide_ep.py:_get_quant_method lacks an MXFP4 branch, so it raises ValueError: Unsupported quantization mode: [65536] for V4-Flash routed experts. Skipped at the test level via pytest.param(..., marks=pytest.mark.skip). TRTLLM backend works.
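
The static/online distinction in item 2 can be sketched as a small gating helper. This is illustrative only and not part of the PR: the function names are hypothetical, and checking for a /dev/gdrdrv device node is an assumed way of detecting that the gdrdrv kernel module is loaded.

```python
import os

# Assumption: the gdrdrv kernel module exposes this device node when loaded.
GDRDRV_DEVICE = "/dev/gdrdrv"


def is_online_eplb(layer_updates_per_iter):
    """Online EPLB is the case layer_updates_per_iter > 0; static EPLB is 0."""
    return layer_updates_per_iter > 0


def eplb_skip_reason(layer_updates_per_iter, gdrcopy_available=None):
    """Return a skip reason for an EPLB config, or None if it can run.

    Hypothetical helper: gdrcopy_available can be injected for testing;
    when omitted, it is probed from the assumed device node.
    """
    if gdrcopy_available is None:
        gdrcopy_available = os.path.exists(GDRDRV_DEVICE)
    if is_online_eplb(layer_updates_per_iter) and not gdrcopy_available:
        return "online EPLB needs gdrcopy/gdrdrv on the CI node"
    return None
```

With a helper like this, static EPLB entries always run, while online entries would be skipped with an explicit reason instead of crashing during executor initialization.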

Test plan

After this PR merges, the following pre_merge stages will exercise V4 on every PR touching the relevant code:

  • DGX_B200-PyTorch-DS-1 runs the 8 V4 unit-test files (~10-20 min)
  • DGX_B200-4_GPUs-PyTorch-DS-1 runs test_modeling_deepseekv4.py + V4-Flash agg smoke (~15-25 min)
  • DGX_B300-4_GPUs-PyTorch-DS-1 runs V4-Flash-Base agg smoke + V4-Flash NVFP4 static EPLB with full GSM8K eval (~40-60 min)
  • gsm8k.yaml accuracy threshold check (95.11 ± hypothesis-test sigma) catches V4-Flash NVFP4 EPLB regressions

Follow-ups

  • Resolve the fused_moe_wide_ep.py:_get_quant_method MXFP4 gap so WIDEEP backend can also serve V4-Flash; then enable the WIDEEP-variant tests.
  • Validate V4-Flash-Base FP8 EPLB on B300 (the cublas issue may be B200-specific).
  • Once CI nodes have gdrcopy installed, uncomment the 3 online-EPLB entries and measure GSM8K baseline for mtp_nextn=1 (MTP variant) since it's a separate spec_dec_algo: MTP reference key.

@lfr-0531

/bot run

@Barry-Delaney Barry-Delaney marked this pull request as ready for review April 30, 2026 07:56
@Barry-Delaney Barry-Delaney requested review from a team as code owners April 30, 2026 07:56
@Barry-Delaney Barry-Delaney requested review from mzweilz and zeroepoch and removed request for a team April 30, 2026 07:56
@tensorrt-cicd

PR_Github #46340 [ run ] triggered by Bot. Commit: dff35aa Link to invocation

@tensorrt-cicd

PR_Github #46340 [ run ] completed with state FAILURE. Commit: dff35aa

Link to invocation

@Barry-Delaney Barry-Delaney force-pushed the user/jinshik/v4-ci-on-pr13604 branch from dff35aa to 79591c4 Compare April 30, 2026 08:21
@Barry-Delaney

/bot run

@tensorrt-cicd

PR_Github #46345 [ run ] triggered by Bot. Commit: 79591c4 Link to invocation

@Barry-Delaney Barry-Delaney force-pushed the user/jinshik/v4-ci-on-pr13604 branch from 5c0b9ba to 120e606 Compare April 30, 2026 09:22
@Barry-Delaney

/bot kill

@tensorrt-cicd

PR_Github #46353 [ kill ] triggered by Bot. Commit: 120e606 Link to invocation

@tensorrt-cicd

PR_Github #46345 [ run ] completed with state ABORTED. Commit: 79591c4

Link to invocation

@tensorrt-cicd

PR_Github #46353 [ kill ] completed with state SUCCESS. Commit: 120e606
Successfully killed previous jobs for commit 120e606

Link to invocation

Layered V4-specific CI on top of Emma's per-platform _ds.yml structure.

YAML wiring:
* l0_b200_ds.yml (1x B200, pre_merge): add V4 sparse-attention unit tests
  (sparse_mla, cache_manager, indices_transform, o_proj, compressor_kernel,
  compressor_module, compressor_tf32) + tokenizer chat-template test.
* l0_dgx_b200_ds.yml (4x B200, pre_merge): add V4-Flash aggregate smoke
  (TP=4 NVFP4, TRTLLM backend, 1-sample MMLU) alongside Emma's
  test_modeling_deepseekv4.py.
* l0_dgx_b300_ds.yml (4x B300, pre_merge): merge V4 entries into Emma's
  existing pre_merge block (added orchestrator: mpi):
    - TestDeepSeekV4FlashBase::test_auto_dtype (FP8 aggregate smoke, TP=4)
    - TestDeepSeekV4Flash::test_nvfp4_4gpus_static_eplb[moe_backend=TRTLLM]
  B300's larger per-GPU memory (~275 GB) lets V4 EPLB and FP8 aggregate
  fit at TP=4. Known-issues comment lists 4 currently-blocked entries
  (FP8 EPLB cublas + 3 online-EPLB gdrcopy blockers).
* l0_b200.yml / l0_dgx_b200.yml: remove obsolete V4 entries that are now
  homed under the _ds.yml files.

Test code (tests/integration/defs/accuracy/test_llm_api_pytorch.py):
* TestDeepSeekV4Flash::test_auto_dtype: aggregate (non-disagg, non-EPLB)
  smoke at TP=4 NVFP4 with TRTLLM backend.
* TestDeepSeekV4FlashBase::test_auto_dtype: aggregate smoke at TP=4 FP8
  with WIDEEP backend (CUTLASS path is Hopper-only on Blackwell).
* _run_deepseekv4_eplb: add tensor_parallel_size=4 parameter (default 4
  for B300 sizing).
* Rename _8gpus_*_eplb -> _4gpus_*_eplb (both Flash and Flash-Base, both
  static and online); update skip_less_mpi_world_size to 4. WIDEEP
  variants of V4-Flash NVFP4 EPLB tests are pytest.param(..., marks=skip)
  with a reason pointing at the MXFP4 gap in fused_moe_wide_ep.py
  _get_quant_method.

Reference accuracy (tests/integration/defs/accuracy/references/gsm8k.yaml):
* deepseek-ai/DeepSeek-V4-Flash + quant_algo: FP8_BLOCK_SCALES = 95.11
  (V4-Flash declares quant_method=fp8 globally even though routed MoE
  experts are MXFP4; AccuracyTask picks up FP8_BLOCK_SCALES). Comment
  records both measurements: 95.11 on 8x B200/TP=8 with tightened config
  and 95.38 on 4x B300/TP=4 with default config (drift +0.27 within
  hypothesis-test sigma).

Notes on currently-blocked entries (commented in YAML):
* V4-Flash-Base FP8 + EPLB: hits CUBLAS_STATUS_EXECUTION_FAILED in
  forward pass with WIDEEP backend on B200 (memory-pressure related).
  Aggregate FP8 path works on B300 — EPLB variant may also work; left
  commented pending validation.
* Online EPLB (layer_updates_per_iter>0) requires gdrcopy/gdrdrv for
  HostAccessibleDeviceAllocator (moeLoadBalancer.cpp:846); without
  gdrcopy the executor crashes during initialization. Static EPLB does
  not hit this path. Uncomment when CI nodes have gdrcopy installed.

Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>

Two pre-existing failures in test_modeling_deepseekv4.py surfaced once the
file was wired into pre_merge CI via the DGX_B200-4_GPUs-PyTorch-DS-1
stage:

* test_deepseek_v4_moe_swiglu_limit_applies_to_routed_and_shared_experts
  — removed. The test was an `inspect.getsource()` string-match check that
  asserted literal source text (`"supports_swiglu_limit = False"`,
  `"mode.has_w4a8_mxfp4_mxfp8()"`) which no longer exists in
  DeepseekV4MoE.__init__ — the implementation simplified to a runtime
  tuple-membership test (`moe_cls in (CutlassFusedMoE, ...)`) and dropped
  the quant-mode-specific gate. Brittle source-string matching is the
  wrong tool for verifying the swiglu_limit dispatch path; runtime
  behavior is exercised by any V4 test that constructs a model with
  `swiglu_limit` set in config.

* test_deepseek_v4_sparse_ratios_prefer_checkpoint_defaults — adjusted.
  The setup passed `sliding_window=256` and the assertion expected that
  to propagate to `sparse_attention_config.window_size`. But V4 sparse
  MLA hardcodes window_size==128 (FMHA kernel TileSizeKV; see the
  runtime assertion in
  DeepseekV4TrtllmAttentionMetadata.__post_init__), so 256 is rejected
  at runtime regardless of the resolution chain. Dropped the misleading
  `sliding_window=256` from the setup and changed the assertion to
  `== 128` to reflect the kernel constraint. The test's primary purpose
  (compress_ratios resolution from checkpoint) is unchanged and still
  passes.

Also drops the now-unused `DeepseekV4MoE` import (autoflake/ruff).

Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
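
The window-size constraint described in the commit message above can be sketched with a plain dataclass. The class below is a hypothetical stand-in, not the real DeepseekV4TrtllmAttentionMetadata; only the value 128 and the use of a `__post_init__` assertion come from the commit message.

```python
from dataclasses import dataclass

# Value taken from the commit message (FMHA kernel TileSizeKV).
SUPPORTED_WINDOW_SIZE = 128


@dataclass
class SparseAttentionMetadataSketch:
    """Hypothetical stand-in for the metadata class named in the commit."""

    window_size: int = SUPPORTED_WINDOW_SIZE

    def __post_init__(self):
        # Reject any window the kernel cannot serve, whatever the config asked for.
        assert self.window_size == SUPPORTED_WINDOW_SIZE, (
            f"window_size must be {SUPPORTED_WINDOW_SIZE}, got {self.window_size}"
        )
```

Under this sketch, constructing the metadata with `window_size=256` fails at construction time, which is why the adjusted test asserts `== 128` rather than expecting `sliding_window=256` to propagate.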
@Barry-Delaney Barry-Delaney force-pushed the user/jinshik/v4-ci-on-pr13604 branch from 120e606 to bfcb2c6 Compare April 30, 2026 10:37
@Barry-Delaney Barry-Delaney merged commit 3992dd2 into NVIDIA:feat/deepseek_v4 Apr 30, 2026
4 checks passed
lfr-0531 pushed a commit that referenced this pull request May 7, 2026
Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
lfr-0531 pushed a commit that referenced this pull request May 14, 2026
Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
lfr-0531 pushed a commit to lfr-0531/TensorRT-LLM that referenced this pull request May 29, 2026
Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
(cherry picked from commit fa1e55e)
Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>