
[None][fix] Plumb swiglu_limit through DeepGEMM and TRTLLMGen FP8 fused MoE #13767

Merged: lfr-0531 merged 5 commits into NVIDIA:feat/deepseek_v4 from Barry-Delaney:user/jinshik/deepgemm-swiglu-limit on May 7, 2026

Conversation

@Barry-Delaney
Collaborator

@Barry-Delaney Barry-Delaney commented May 5, 2026

Summary

Plumb an optional per-expert swiglu_limit through both Blackwell FP8 fused-MoE paths so that the clamp declared in the DeepSeek-V4-Flash-Base config is actually applied to routed experts. Callers that pass no limit remain byte-identical: the Triton kernel guards on the HAS_SWIGLU_LIMIT constexpr, and the CUDA kernels guard on a null swigluLimitPtr.

Files changed

DeepGEMM Triton path:

  • tensorrt_llm/quantization/utils/fp8_utils.py
  • tensorrt_llm/_torch/modules/fused_moe/ops/moe_op_deepgemm.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_wide_ep.py
  • tensorrt_llm/_torch/modules/fused_moe/create_moe.py
  • tensorrt_llm/_torch/models/modeling_deepseekv4.py

TRTLLMGen FP8 path:

  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/DevKernel.h
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/DevKernel.cu
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.cu
  • cpp/tensorrt_llm/thop/fp8BlockScaleMoe.cpp
  • tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py
  • tensorrt_llm/_torch/modules/fused_moe/moe_op_backend.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py

Accuracy

V4-Flash-Base FP8, 8x B200, TP=8 EP=8, lm-eval gsm8k 5-shot, 1319 samples:

| Backend | Clamp | GSM8K avg |
|---|---|---|
| WIDEEP | off (silently dropped) | 91.17 |
| WIDEEP → DeepGEMM op (this PR) | on | 92.23 |
| TRTLLMGen FP8 (this PR) | off | 92.23 |
| TRTLLMGen FP8 (this PR) | on | 92.23 |

@Barry-Delaney Barry-Delaney self-assigned this May 5, 2026
@Barry-Delaney Barry-Delaney requested review from a team as code owners May 5, 2026 14:07
@Barry-Delaney Barry-Delaney requested review from HuiGao-NV, jiaganc, liji-nv and symphonylyh and removed request for a team May 5, 2026 14:07
@Barry-Delaney
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #46816 [ run ] triggered by Bot. Commit: c81e578 Link to invocation

@Barry-Delaney Barry-Delaney force-pushed the user/jinshik/deepgemm-swiglu-limit branch from c81e578 to f2ecdbf Compare May 5, 2026 17:29
@Barry-Delaney Barry-Delaney requested a review from a team as a code owner May 5, 2026 17:41
@tensorrt-cicd
Collaborator

PR_Github #46816 [ run ] completed with state SUCCESS. Commit: c81e578
/LLM/main/L0_MergeRequest_PR pipeline #36837 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@Barry-Delaney Barry-Delaney force-pushed the user/jinshik/deepgemm-swiglu-limit branch from ac32c9c to e07f48b Compare May 6, 2026 01:40
@Barry-Delaney Barry-Delaney requested a review from lfr-0531 May 6, 2026 01:41
@Barry-Delaney
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #46889 [ run ] triggered by Bot. Commit: e07f48b Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #46889 [ run ] completed with state FAILURE. Commit: e07f48b
/LLM/main/L0_MergeRequest_PR pipeline #36898 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

…sed MoE

Forward an optional per-expert swiglu_limit through both fused-MoE FP8
paths so DeepSeek-V4-Flash-Base (FP8 block-scale on Blackwell) actually
applies its config-declared gate/up clamp on routed experts, matching
the swiglu_torch reference. Existing callers that pass no limit are
unaffected: the Triton kernel guards on HAS_SWIGLU_LIMIT, and the CUDA
kernels guard on a null swigluLimitPtr.

DeepGEMM Triton path (used by WIDEEP and DEEPGEMM moe_backends):
  - silu_and_mul_masked_post_quant_fwd accepts an optional fp32 [g]
    tensor; the kernel applies gate.clamp(max=limit) and
    up.clamp(-limit, limit) before silu/mul.
  - WideEPMoE / DeepGemmFusedMoE __init__ accept swiglu_limit and
    propagate it to the underlying op via self.swiglu_limit.
  - create_moe.py allow-list includes WideEPMoE / DeepGemmFusedMoE.
  - DeepseekV4MoE supports_swiglu_limit set extends to those classes.
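
The clamp semantics in the first bullet can be sketched as a scalar reference in plain Python. This is only an illustration under assumptions: `swiglu_with_limit` and `silu` are hypothetical helper names, not functions from this PR, and the real kernel operates on quantized tensors rather than Python floats.

```python
import math
from typing import Optional


def silu(x: float) -> float:
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def swiglu_with_limit(gate: float, up: float,
                      limit: Optional[float] = None) -> float:
    """Scalar reference for the clamped SwiGLU described above.

    Hypothetical helper, not the Triton kernel. With limit=None the
    clamp is skipped entirely, which mirrors the no-limit guard.
    """
    if limit is not None:
        gate = min(gate, limit)            # gate.clamp(max=limit)
        up = max(-limit, min(up, limit))   # up.clamp(-limit, limit)
    return silu(gate) * up
```

Note the asymmetry: the gate is only clamped from above, while `up` is clamped on both sides, so a large negative gate passes through unchanged.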

TRTLLMGen FP8 path (run_fp8_block_scale_moe):
  - C++ binding accepts an optional gemm1_clamp_limit tensor of shape
    [local_num_experts]; setOpsData forwards it to
    activation::Data::swigluLimitPtr.
  - Both activationKernel and activationDeepSeekKernel apply the clamp
    after dequantization, before silu/mul. FP8 path treats the limit
    as uniform across experts (reads index [0]); the per-expert tensor
    shape is preserved for API symmetry with the NVFP4 path.
  - fp8_block_scale_moe_runner custom op grows the kwarg; the autotuner
    input list shifts replacement indices accordingly.
  - moe_op_backend TRTLLM impl forwards the new kwarg; Flashinfer impl
    raises NotImplementedError until its wrapper exposes the param.
  - _check_configs is split: bias/alpha/beta still gate to NVFP4/MXFP4
    (they need the fused-GEMM activation cubins); swiglu_limit also
    accepts FP8 block-scale via the separate-activation kernel.
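
The split check in the last bullet might look roughly like the sketch below. Everything here is an assumption made for illustration: the mode labels, the function name `check_configs`, and its keyword arguments are hypothetical and do not reproduce the real `_check_configs` signature.

```python
# Hypothetical quantization-mode labels, for illustration only.
FUSED_ACTIVATION_MODES = {"nvfp4", "mxfp4"}
SWIGLU_LIMIT_MODES = FUSED_ACTIVATION_MODES | {"fp8_block_scale"}


def check_configs(quant_mode: str, *, has_bias: bool = False,
                  has_alpha: bool = False, has_beta: bool = False,
                  has_swiglu_limit: bool = False) -> None:
    """Raise ValueError if an option is unsupported for quant_mode."""
    # bias/alpha/beta need the fused-GEMM activation cubins.
    if (has_bias or has_alpha or has_beta) \
            and quant_mode not in FUSED_ACTIVATION_MODES:
        raise ValueError(
            f"bias/alpha/beta are not supported for {quant_mode}")
    # swiglu_limit is checked separately, so FP8 block-scale can
    # accept it via the separate-activation kernel.
    if has_swiglu_limit and quant_mode not in SWIGLU_LIMIT_MODES:
        raise ValueError(
            f"swiglu_limit is not supported for {quant_mode}")
```

Keeping the two checks independent is what lets FP8 block-scale take a limit without also claiming support for bias/alpha/beta.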

Empirical V4-Flash-Base GSM8K (8x B200, TP=8 EP=8, lm-eval 5-shot):
  WIDEEP no clamp (silently dropped):       91.17
  WIDEEP -> DeepGEMM op, clamp on:          92.23 (+1.06 abs, ~1.4 sigma)
  TRTLLMGen FP8, clamp off:                 92.23
  TRTLLMGen FP8, clamp on (fixed kernel):   92.23 (no shift; FC1 rarely
                                                  trips +-10 in this path)

Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>

Parametrize TestDeepSeekV4FlashBase::test_auto_dtype on moe_backend
(WIDEEP, TRTLLM) and switch the 4xB300 pre_merge entry from the
WIDEEP-hardcoded form to the TRTLLM variant. TRTLLMGen FP8 is the
user-facing default on Blackwell (model_config.py::resolve_moe_backend)
and ~7% faster per step than WIDEEP in our V4-Flash-Base GSM8K runs;
holding the WIDEEP variant out of CI for now (still selectable
manually).

Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>

Drop the per-element fp32 round-trip and per-CTA / per-program global
load that the FP8 swiglu_limit plumbing introduced. swiglu_limit is
uniform across experts on the FP8 paths (V4-Flash-Base config), so the
per-expert tensor was redundant; lifting it to a scalar lets the clamp
run in the native dtype and bakes the value into the kernel.

DeepGEMM Triton kernel (silu_and_mul_masked_post_quant_fwd):
  - Drop the bf16 -> fp32 -> clamp -> bf16 round-trip on `up`. Clamp
    uses tl.cast(SWIGLU_LIMIT, input dtype); for V4 (limit ~7) this is
    bf16-exact, so semantics are preserved.
  - Replace the swiglu_limit_ptr argument with SWIGLU_LIMIT: tl.constexpr
    (Python float baked into the JIT). Removes one global load per
    program and lets the limit constant-fold.
  - Wrapper now takes Optional[float] instead of Optional[Tensor].
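
The "bf16-exact" claim in the first bullet can be checked with a small standard-library sketch. It relies on the fact that bfloat16 keeps the top 16 bits of an IEEE-754 float32 encoding; `is_bf16_exact` is a hypothetical helper name, and the sample values 7.0 and 10.0 are assumptions based on the "~7" and "+-10" figures quoted in these commit messages.

```python
import struct


def is_bf16_exact(value: float) -> bool:
    """Return True if `value` is exactly representable in bfloat16.

    bfloat16 is the upper 16 bits of a float32, so the value must be
    float32-exact and the low 16 bits of that encoding must be zero.
    """
    packed = struct.pack(">f", value)
    if struct.unpack(">f", packed)[0] != value:
        return False  # not even float32-exact
    bits = int.from_bytes(packed, "big")
    return bits & 0xFFFF == 0


print(is_bf16_exact(7.0), is_bf16_exact(10.0), is_bf16_exact(0.1))
```

A limit that is not bf16-exact would be rounded by the cast to the input dtype, so the native-dtype clamp would then differ slightly from an fp32 clamp.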

TRTLLMGen FP8 separate-activation kernels (DevKernel.cu):
  - Replace activation::Data::swigluLimitPtr with scalar swigluLimit +
    hasSwigluLimit. Eliminates the per-CTA fp32 global load.
  - Plumb a scalar gemm1_clamp_limit_value + has_gemm1_clamp_limit_value
    through MoERunnerArgs. The pre-existing gemm1_clamp_limit pointer
    is kept for NVFP4 / MXFP4 fused-activation cubins (which genuinely
    consume per-expert limits via fc31_alpha rescaling).
  - fp8BlockScaleMoe.cpp binding takes Optional<double>.

Autotuner (trtllm_gen_custom_ops.py):
  - Drop gemm1_clamp_limit from the FP8 runner's input_tensors_for_tuner.
    The limit doesn't influence tactic validity, so it was just
    fragmenting the cache key. Pass through the runner constructor
    instead.

MoE plumbing:
  - Add swiglu_limit_scalar to MoE base + thread through create_moe,
    DeepseekV4MoE, ConfigurableMoE, CutlassFusedMoE / DeepGemmFusedMoE /
    WideEPMoE / TRTLLMGenFusedMoE constructors. FP8 paths read
    self.swiglu_limit_scalar; NVFP4 paths still use self.swiglu_limit
    (per-expert tensor) unchanged.

Validation (V4-Flash-Base, 8x B200, TP=8 EP=8, lm-eval gsm8k 5-shot):
  WIDEEP backend (DeepGEMM Triton path, clamp load-bearing):
    92.57 +/- 0.72  (vs reference 92.23 with clamp on; 91.17 with
                     clamp silently dropped). Clamp is correctly
                     applied through the optimized Triton kernel.
  TRTLLM backend (TRTLLMGen FP8 separate-activation path):
    90.90 +/- 0.79  (vs reference 92.23; within 95% CI. Clamp is a
                     no-op semantically in this path per the parent
                     commit's measurements.)
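
The "within 95% CI" statement can be reproduced with simple arithmetic. The sketch below assumes the "+/-" figures are one standard error and uses a normal approximation (z = 1.96); both are assumptions, and `within_95_ci` is a hypothetical helper.

```python
def within_95_ci(measured: float, stderr: float, reference: float,
                 z: float = 1.96) -> bool:
    """True if `reference` lies within measured +/- z * stderr."""
    return abs(measured - reference) <= z * stderr


# TRTLLM backend: |90.90 - 92.23| = 1.33, half-width 1.96 * 0.79 ~= 1.55
print(within_95_ci(90.90, 0.79, 92.23))
# WIDEEP backend: |92.57 - 92.23| = 0.34, half-width 1.96 * 0.72 ~= 1.41
print(within_95_ci(92.57, 0.72, 92.23))
```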

Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>

DeepSeek-V4-Flash-Base is not yet on the CI shared model store
(/scratch.trt_llm_data/llm-models/DeepSeek-V4-Flash-Base does not
exist on B300 CI workers). When the path is missing,
_ModelWrapper.__post_init__ leaves model as a string, is_local_model
returns False, and LLM(...) falls into the HF download branch where
snapshot_download rejects the slash-bearing path with HFValidationError.
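
The failure chain above can be sketched as a simplified resolver. This is an assumption-laden illustration: `is_local_model`, `looks_like_hub_repo_id`, and `resolve` are hypothetical stand-ins, and the repo-id rule is reduced to "at most one slash, no empty components", which is much looser than the real Hub validation.

```python
from pathlib import Path


def is_local_model(model: str) -> bool:
    """Hypothetical check: treat the string as local if it is a directory."""
    return Path(model).is_dir()


def looks_like_hub_repo_id(model: str) -> bool:
    """Simplified rule: 'name' or 'namespace/name', no empty parts."""
    parts = model.split("/")
    return 1 <= len(parts) <= 2 and all(parts)


def resolve(model: str) -> str:
    """Return which branch a model string would take in this sketch."""
    if is_local_model(model):
        return "local"
    if looks_like_hub_repo_id(model):
        return "hub-download"
    # A missing absolute path ends up here: it is not local, and its
    # slashes make it an invalid repo id.
    return "invalid-repo-id"
```

Under these assumptions, a missing absolute path such as the one quoted above is neither local nor a valid repo id, which matches the reported validation error.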

Comment out the test until the model is uploaded; leave a TODO
pointing at the missing path.

Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>

…per fix

TestDeepSeekV4Flash::test_nvfp4_4gpus_static_eplb[moe_backend=TRTLLM]
fails on B300 with AttributeError: 'str' object has no attribute 'value'
out of cuda/bindings/driver.pyx. The throw originates from
kv_cache_manager_v2/_exceptions.py:49 calling drv.cuGetErrorString on a
plain Python string instead of a CUresult enum, so the worker's actual
init CUDA error is unreadable. Skip until that wrapper is fixed and the
underlying NVFP4 + EPLB init failure can be diagnosed.
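
The masking effect described above can be reproduced with a toy stand-in. `FakeCUresult` and `fake_get_error_string` are hypothetical and only imitate a binding that reads `.value` from an enum argument; they are not the cuda-python API.

```python
from enum import IntEnum


class FakeCUresult(IntEnum):
    """Toy stand-in for a driver result enum (illustration only)."""
    CUDA_SUCCESS = 0
    CUDA_ERROR_OUT_OF_MEMORY = 2


def fake_get_error_string(error) -> str:
    """Imitates a binding that expects an enum and reads .value."""
    code = error.value  # fails for a plain str argument
    return FakeCUresult(code).name


try:
    fake_get_error_string("worker init failed")
except AttributeError as exc:
    message = str(exc)  # the original error text is lost behind this
```

Because the secondary AttributeError is raised while formatting the first error, the original CUDA failure never reaches the log.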

Unrelated to the swiglu_limit FP8 path; surfaces independently in the
NVFP4 fused-activation cubin path.

Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>

@Barry-Delaney Barry-Delaney force-pushed the user/jinshik/deepgemm-swiglu-limit branch from e07f48b to 865c8b6 Compare May 6, 2026 05:21
@Barry-Delaney
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #46924 [ run ] triggered by Bot. Commit: 865c8b6 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #46924 [ run ] completed with state SUCCESS. Commit: 865c8b6
/LLM/main/L0_MergeRequest_PR pipeline #36930 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@Barry-Delaney
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #46973 [ run ] triggered by Bot. Commit: 865c8b6 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #46973 [ run ] completed with state SUCCESS. Commit: 865c8b6
/LLM/main/L0_MergeRequest_PR pipeline #36958 completed with status: 'SUCCESS'

CI Report

Link to invocation

@lfr-0531 lfr-0531 merged commit 1a52b72 into NVIDIA:feat/deepseek_v4 May 7, 2026
5 checks passed
lfr-0531 pushed a commit that referenced this pull request May 7, 2026
…ed MoE (#13767)

Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
(cherry picked from commit 1a52b72)
Signed-off-by: Yuhang He <58161490+heyuhhh@users.noreply.github.com>
lfr-0531 pushed a commit that referenced this pull request May 14, 2026
…ed MoE (#13767)

Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
(cherry picked from commit 1a52b72)
Signed-off-by: Yuhang He <58161490+heyuhhh@users.noreply.github.com>
Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
lfr-0531 pushed a commit to lfr-0531/TensorRT-LLM that referenced this pull request May 29, 2026
…ed MoE (NVIDIA#13767)

Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
(cherry picked from commit 1a52b72)
Signed-off-by: Yuhang He <58161490+heyuhhh@users.noreply.github.com>
Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
(cherry picked from commit 7a9b0ca)
Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>

3 participants