[None][feat] Enable EPLB for DeepSeek-V4 #13595
Merged
lfr-0531 merged 2 commits on Apr 29, 2026
Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
Bandit's `hardcoded_password_string` heuristic flags the DeepSeek tokenizer special tokens (BOS/EOS/USER/ASSISTANT/THINKING_END) as potential hardcoded passwords. They are tokenizer markers, not credentials. Mark each line with `# nosec B105` so the release_check CI step (which fails on any `Issue:` in bandit output) stops blocking on these false positives.
Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
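For illustration, the suppression described in this commit message might look like the sketch below. The constant names and the placeholder string values are hypothetical and are not taken from the actual tokenizer file; only the `# nosec B105` marker comes from the commit message.

```python
# Hypothetical constant names and placeholder values, for illustration only.
# Bandit's B105 check tends to flag string literals assigned to names that
# look credential-like (for example, names containing "token"), so each
# assignment carries a targeted suppression comment.
BOS_TOKEN = "<bos>"  # nosec B105
EOS_TOKEN = "<eos>"  # nosec B105
USER_TOKEN = "<user>"  # nosec B105
ASSISTANT_TOKEN = "<assistant>"  # nosec B105
THINKING_END_TOKEN = "</think>"  # nosec B105

# Collect the markers so callers can iterate over them.
SPECIAL_TOKENS = (
    BOS_TOKEN,
    EOS_TOKEN,
    USER_TOKEN,
    ASSISTANT_TOKEN,
    THINKING_END_TOKEN,
)
```

Scoping the suppression to `B105` keeps other Bandit checks active on the same lines.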
lfr-0531 pushed a commit that referenced this pull request on May 7, 2026
Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
lfr-0531 pushed a commit that referenced this pull request on May 14, 2026
Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
lfr-0531 pushed a commit to lfr-0531/TensorRT-LLM that referenced this pull request on May 29, 2026
Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
(cherry picked from commit eb85528)
Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
@coderabbitai summary
Description
Enables Expert Parallel Load Balancing (EPLB) for DeepSeek-V4 on top of the existing DeepSeek-V4 base support.
- Adds `DeepseekV4ForCausalLM` to `moe_model_arch_list` so the MoE load balancer recognizes the V4 architecture (`tensorrt_llm/_torch/modules/fused_moe/moe_load_balancer.py`).
- Adds accuracy tests for `DeepSeek-V4-Flash` (NVFP4) and `DeepSeek-V4-Flash-Base` (FP8) under both static and online EPLB on 8 GPUs (Blackwell).
- Adds `_make_deepseekv4_eplb_config(...)`, a small helper that builds `MoeLoadBalancerConfig` from the HF config. DeepSeek-V4 has no `first_k_dense_replace` prefix, so every layer in `0..num_hidden_layers-1` is MoE; `num_slots = n_routed_experts + 16 * EP` matches the redundancy used by `TestNemotronV3Super`.
- Adds pre-merge entries to `l0_dgx_b200.yml`, commented out until the `DeepSeek-V4-Flash` / `DeepSeek-V4-Flash-Base` checkpoints are published under `llm_models_root()`. They can be uncommented in a follow-up once the weights are staged.

No runtime behavior changes for existing models.
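As a rough sketch of the slot arithmetic described for the helper, the snippet below computes `num_slots` and the MoE layer list from an HF-style config. It is not the helper from this PR: `EplbConfigSketch`, `make_eplb_config_sketch`, and the field names are hypothetical stand-ins for `MoeLoadBalancerConfig`, and the expert and layer counts in the usage lines are illustrative values, not DeepSeek-V4's real ones.

```python
from dataclasses import dataclass
from types import SimpleNamespace
from typing import List


@dataclass
class EplbConfigSketch:
    """Hypothetical stand-in for MoeLoadBalancerConfig (field names assumed)."""
    num_slots: int
    moe_layers: List[int]


def make_eplb_config_sketch(hf_config, ep_size, redundant_slots_per_rank=16):
    """Build an EPLB config sketch from an HF-style config object.

    Assumes hf_config exposes n_routed_experts and num_hidden_layers,
    as named in the PR description.
    """
    # num_slots = n_routed_experts + 16 * EP
    num_slots = hf_config.n_routed_experts + redundant_slots_per_rank * ep_size
    # No first_k_dense_replace prefix: every layer 0..num_hidden_layers-1 is MoE.
    moe_layers = list(range(hf_config.num_hidden_layers))
    return EplbConfigSketch(num_slots=num_slots, moe_layers=moe_layers)


# Illustrative values only; these are not the real DeepSeek-V4 sizes.
hf_config = SimpleNamespace(n_routed_experts=256, num_hidden_layers=4)
cfg = make_eplb_config_sketch(hf_config, ep_size=8)
print(cfg.num_slots, cfg.moe_layers)
```

With EP=8, the formula adds 128 redundant slots on top of the routed experts, i.e. 16 extra slots per EP rank.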
Test Coverage
New tests in `tests/integration/defs/accuracy/test_llm_api_pytorch.py`:

- `TestDeepSeekV4Flash::test_nvfp4_8gpus_static_eplb[moe_backend=WIDEEP|CUTLASS]`
- `TestDeepSeekV4Flash::test_nvfp4_8gpus_online_eplb[moe_backend=WIDEEP|CUTLASS|TRTLLM][mtp_nextn=0|1]`
- `TestDeepSeekV4FlashBase::test_fp8_8gpus_static_eplb[moe_backend=WIDEEP|CUTLASS]`
- `TestDeepSeekV4FlashBase::test_fp8_8gpus_online_eplb[moe_backend=WIDEEP|CUTLASS]`

All tests are gated by `@skip_pre_blackwell` and require 8 GPUs with ≥140 GB memory each, plus `LLM_MODELS_ROOT` containing the V4 checkpoints. They run GSM8K through the LLM API with TP=8 / EP=8 and `enable_attention_dp=True`.

The `l0_dgx_b200` pre-merge entries for the static-EPLB WIDEEP case are checked in but commented out; please uncomment them after the checkpoints land.