Skip to content

[None][feat] support non-divisible EP in MoE alltoall and slurm benchmark#13888

Open
JacobHu-NV wants to merge 6 commits into
NVIDIA:mainfrom
JacobHu-NV:feat/non-divisible-ep
Open

[None][feat] support non-divisible EP in MoE alltoall and slurm benchmark#13888
JacobHu-NV wants to merge 6 commits into
NVIDIA:mainfrom
JacobHu-NV:feat/non-divisible-ep

Conversation

@JacobHu-NV
Copy link
Copy Markdown
Collaborator

@JacobHu-NV JacobHu-NV commented May 8, 2026

This pull request enhances support for non-divisible expert parallelism (EP) in Mixture-of-Experts (MoE) communication by implementing ceil/floor partitioning for expert assignment across ranks. The changes ensure that MoE communication and computation remain correct and efficient when the number of experts is not evenly divisible by the number of ranks. This includes updates to CUDA kernels, Python interfaces, strategy selection logic, and comprehensive unit tests.

Key changes include:

Core algorithm and kernel updates:

  • Updated the CUDA kernel (moeAlltoAllKernels.cu) to use ceil/floor partitioning in compute_target_rank_id, allowing non-divisible expert-to-rank mapping and ensuring all experts are correctly assigned even when num_experts % ep_size != 0. [1] [2]
  • Removed the requirement for num_experts to be divisible by ep_size in the C++ operator, reflecting the new partitioning logic.

Python interface and strategy selection:

  • Introduced _compute_ep_partition in the Python interface to mirror the kernel's partitioning logic, ensuring consistent expert slot assignment and correct initialization in both standard and load-balanced MoE scenarios. [1] [2] [3] [4]
  • Modified communication strategy selection to fall back to AllGatherReduceScatter when non-divisible EP is detected, and enforced that only whitelisted methods (NVLinkOneSided, AllGather) are allowed for non-divisible configurations. [1] [2]

Testing and validation:

  • Updated and expanded unit tests to validate correct behavior for non-divisible EP, including helper functions for partitioning and expert-to-rank mapping, and new test parameter generators to cover edge cases. [1] [2] [3] [4] [5] [6] [7] [8] [9] [10]

These changes collectively make the MoE implementation robust to non-uniform expert distributions, improving flexibility and correctness in distributed training scenarios.

Summary by CodeRabbit

  • New Features

    • Support for non-divisible expert distributions in Mixture-of-Experts configurations
    • Added compact_packing option for GPU worker placement in distributed SLURM deployments
  • Improvements

    • Enhanced CUDA device selection for distributed workers using SLURM environment variables and GPU mapping
  • Documentation

    • Updated SLURM benchmark configuration documentation with compact_packing usage guidance

Review Change Stack

@JacobHu-NV JacobHu-NV force-pushed the feat/non-divisible-ep branch 2 times, most recently from aa060db to ec776da Compare May 9, 2026 09:54
@JacobHu-NV JacobHu-NV changed the title Support non-divisible expert partitioning in MoE expert parallelism [None][feat] support non-divisible EP in MoE alltoall and slurm benchmark May 9, 2026
@JacobHu-NV JacobHu-NV force-pushed the feat/non-divisible-ep branch from ec776da to 40dda9b Compare May 20, 2026 06:15
JacobHu-NV and others added 4 commits May 20, 2026 00:43
…ded kernel

Replace strict equal-split partitioning with ceil/floor contiguous
partitioning in compute_target_rank_id so that num_experts does not
need to be divisible by ep_size. Ranks [0, num_experts % ep_size)
each own (num_experts / ep_size + 1) experts; the rest own
(num_experts / ep_size). When the remainder is zero the new logic is
mathematically identical to the previous uniform mapping, so all
existing callers see no behavior change.

The corresponding TORCH_CHECK in moeA2ADispatchOp is relaxed
accordingly. The op schema is unchanged.

Tests:
- Add test_moe_comm_non_divisible_ep covering (ep=2,n=5), (ep=4,n=17),
  (ep=4,n=22), each with and without low-precision combine. All 6
  cases pass on B200.
- Existing helpers _compute_ep_partition / _expert_id_to_rank are
  factored out to mirror the kernel logic so dispatch and combine
  reference computations stay correct in the non-divisible case.

NVLinkTwoSided still requires num_slots % ep_size == 0; the
test-side check_feasibility skips non-NVLinkOneSided comm types
for non-divisible configs.

Signed-off-by: JacobHu-NV <266902545+JacobHu-NV@users.noreply.github.com>
…tion

Add _compute_ep_partition() using ceil/floor distribution so that
num_experts % ep_size != 0 no longer crashes. Remove the
assert num_experts % ep_size == 0 in the no-EPLB fallback and use
the new helper for both __init__ and _init_moe_with_load_balancer.

Restrict comm backend selection for non-divisible EP: after NVLinkOneSided
is tried, fall back immediately to AllGatherReduceScatter instead of
attempting NVLinkTwoSided or DeepEP which both require divisible partitions.
Add a whitelist guard in _create_forced_method to reject unsupported methods.

Signed-off-by: JacobHu-NV <266902545+JacobHu-NV@users.noreply.github.com>
- submit.py: switch allocator to per-node round-robin where each ctx/gen
  worker owns ceil(world_size/gpus_per_node) whole nodes; advance cursor
  by full nodes so start_worker.sh can derive GPU id from SLURM_LOCALID
  without an explicit cuda_devices arg.
- submit.py: tag log dir with _ep{gen_ep_size} when EP differs from TP;
  honor CLI --log-dir over yaml log_dir; add --no-container-mount-home
  to client srun for consistency with worker srun.
- start_worker.sh: drop the cuda_devices positional arg; map GPU from
  SLURM_LOCALID.
- run_benchmark.sh: add --trust-remote-code to UCX warmup so trust-remote
  models (e.g. Kimi-K2) can warm up.

Signed-off-by: JacobHu-NV <jacohu@nvidia.com>
Add an opt-in `hardware.compact_packing` flag (default false) to the
disagg benchmark submitter. When enabled, pack workers onto the minimum
number of nodes by allowing two workers to share a physical node when
their GPU counts straddle a node boundary (e.g., two TP=6 ctx workers
fit in 3 four-GPU nodes via 4+2 / 2+4, with the middle node split
between them). NVL72's full NVLink fabric makes the shared node free
in performance terms. Default behavior is unchanged: each worker owns
whole nodes, round-robin across them.

submit.py
- allocate_gpus: take a compact_packing arg and dispatch to
  assign_server_compact (per-GPU cursor) or assign_server_round_robin
  (per-node cursor, preserved as the default).
- total_nodes: ceil(total_gpus / gpus_per_node) under compact_packing;
  ctx_nodes + gen_nodes otherwise.
- In compact_packing mode, emit per-worker hostfile_<role>_<id>_base.txt
  (one host per task in rank order) and gpu_map_<role>_<id>_base.txt
  (rank host gpu) with <nodeN_placeholder> hostnames; these drive srun
  --distribution=arbitrary and start_worker.sh's GPU lookup.
- Worker srun command: under compact_packing, prefix
  SLURM_HOSTFILE=<runtime_path> and pass --distribution=arbitrary
  (drop -N and --ntasks-per-node, both incompatible with arbitrary
  distribution); otherwise keep the original --nodelist + -N + --ntasks.

start_worker.sh
- CUDA_VISIBLE_DEVICES selection branches on whether the per-worker
  gpu_map file exists: compact_packing mode reads it via awk keyed by
  SLURM_PROCID (two workers sharing a node would both see LOCALID 0
  but each has its own gpu_map and PROCID space, so they pick distinct
  physical GPUs without collision); otherwise fall back to
  SLURM_LOCALID as before.

disaggr_torch.slurm
- After rewriting placeholders in the existing files, iterate over
  hostfile_*_base.txt and gpu_map_*_base.txt and apply the same
  replace_placeholder substitution. The glob's '[ -f ] || continue'
  guard makes this a no-op when compact_packing is off.

Validated on NVL72 with two real benchmarks (compact_packing=true):
  1ctx6_1gen6: 4 nodes -> 3 (T11 shared by ctx[4,5]/gen[0,1]),
               throughput 137232 -> 137232 tok/s (-0.35%, in noise).
  2ctx6_1gen4: previously raised IndexError; now 4 nodes
               (T11 shared by ctx0[4,5]/ctx1[0,1]) and bench completes.

Signed-off-by: JacobHu-NV <266902545+JacobHu-NV@users.noreply.github.com>
@JacobHu-NV JacobHu-NV force-pushed the feat/non-divisible-ep branch from 40dda9b to e426525 Compare May 20, 2026 07:47
@JacobHu-NV JacobHu-NV marked this pull request as ready for review May 20, 2026 07:54
@JacobHu-NV JacobHu-NV requested review from a team as code owners May 20, 2026 07:54
@JacobHu-NV
Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #49391 [ run ] triggered by Bot. Commit: e426525 Link to invocation

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 20, 2026

📝 Walkthrough

Walkthrough

This PR extends MoE expert parallelism to support non-divisible expert distributions via ceil/floor contiguous partitioning and introduces SLURM compact-packing mode for efficient cross-node GPU scheduling. CUDA kernels, PyTorch interfaces, and tests are updated to handle uneven expert-to-rank assignments; SLURM job submission adds optional per-worker hostfiles and GPU mapping.

Changes

Non-Divisible Expert Parallel Support

Layer / File(s) Summary
CUDA kernel expert mapping and op validation
cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu, cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp
compute_target_rank_id rewritten to compute ceil/floor expert partitioning from num_experts and ep_size instead of assuming even division; divisibility validation removed from the op.
MoE interface partition computation and initialization
tensorrt_llm/_torch/modules/fused_moe/interface.py
New _compute_ep_partition helper derives expert count and slot boundaries per EP rank using ceil/floor distribution; MoE init updated to use it when load balancer absent, removing divisibility assumptions.
Communication strategy guards and method selection
tensorrt_llm/_torch/modules/fused_moe/communication/communication_factory.py
create_strategy adds guard to skip unsupported methods for num_experts % ep_size != 0 and fall back to AllGatherReduceScatter; _create_forced_method enforces whitelist of NVLINK_ONE_SIDED and ALLGATHER for non-divisible EP.
Test helpers and reference compute implementation
tests/unittest/_torch/modules/moe/test_moe_comm.py (first 3 layers)
Added partition helpers and refactored simple_moe to accept explicit (slot_start, slot_end) instead of deriving from fixed experts_per_rank; worker pipeline computes per-rank slot ranges using new helper.
Dispatch and combine verification for non-divisible expert partitioning
tests/unittest/_torch/modules/moe/test_moe_comm.py (verification layers)
Token routing expectations, AllToAll slot validation, and DeepEP expert selection updated to use ceil/floor slot boundaries from _compute_ep_partition instead of fixed divisions.
Non-divisible EP test parameters and test case
tests/unittest/_torch/modules/moe/test_moe_comm.py (test generation)
Added parameter generator for non-divisible scenarios restricted to NVLinkOneSided and new test_moe_comm_non_divisible_ep exercising full dispatch→compute→combine pipeline.

SLURM Compact Packing Scheduling

Layer / File(s) Summary
Configuration schema and documentation for compact_packing
examples/disaggregated/slurm/benchmark/config.yaml, examples/disaggregated/slurm/benchmark/README.md
Commented-out compact_packing option and guidance added to benchmark config; README documents when compact packing is recommended (full-mesh NVLink) vs. discouraged (PCIe/partitioned NVLink).
Worker CUDA device selection and gpu_map lookup
examples/disaggregated/slurm/benchmark/start_worker.sh
CUDA device selection replaced: reads optional per-worker gpu_map_*.txt file to map SLURM_PROCID to GPU id via awk, falls back to SLURM_LOCALID, errors if mapping requested but missing.
Base file placeholder substitution
examples/disaggregated/slurm/benchmark/disaggr_torch.slurm
Loop added to detect hostfile_*_base.txt and gpu_map_*_base.txt files in log directory, strips _base.txt suffix, and replaces <nodeN_placeholder> values with actual hostnames for arbitrary srun distribution.
Benchmark warmup command argument formatting
examples/disaggregated/slurm/benchmark/run_benchmark.sh
Warmup benchmark Python command adjusted so --non-streaming and --trust-remote-code are properly continued across shell lines in same invocation.
GPU allocation strategy and node count computation
examples/disaggregated/slurm/benchmark/submit.py (allocation/node layers)
allocate_gpus extended with compact_packing parameter; when enabled performs cursor-based placement across nodes with shared-node GPU layouts; submit_job reads flag from config and computes single or dual node counts accordingly.
Log directory tagging with expert-parallel suffix
examples/disaggregated/slurm/benchmark/submit.py (log tagging)
Log directory suffix extended to include _ep{gen_ep_size} when expert-parallel size differs from tensor-parallel; generation TP label adjusted based on attention distributed-parallel configuration.
Worker launch planning, hostfile generation, and srun invocation
examples/disaggregated/slurm/benchmark/submit.py (worker launch)
Worker launch branches on compact_packing: when enabled creates per-worker hostfiles/gpu_maps from allocation nodes and uses SLURM_HOSTFILE with --distribution=arbitrary; when disabled uses prior node-list approach. Removed cuda_devices variable insertion; adjusted client srun prefix flag ordering.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

  • tfogal
  • galagam
🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Description check ⚠️ Warning The description is comprehensive, well-structured, and addresses all key sections including motivation, changes, and testing. However, the PR Checklist section (final required section of the template) is completely missing. Add the PR Checklist section with checkboxes confirming compliance with CODING_GUIDELINES, test coverage, API change labels, dependency scans, CODEOWNERS updates, documentation, and reviewer assignments.
Docstring Coverage ⚠️ Warning Docstring coverage is 67.86% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely describes the main feature: support for non-divisible expert parallelism in MoE alltoall and SLURM benchmark, matching the PR's core objective.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (1)
tests/unittest/_torch/modules/moe/test_moe_comm.py (1)

450-456: ⚡ Quick win

Keep the feasibility gate aligned with the production whitelist.

CommunicationFactory now allows AllGatherReduceScatter for non-divisible EP, but this helper still skips every backend except NVLinkOneSided. That makes the fallback path impossible to cover in this suite.

Suggested fix
-    if config.num_experts % config.ep_size != 0 and comm_type != COMM_NVLINK_ONE_SIDED:
-        # Only NVLinkOneSided supports non-divisible EP (ceil/floor partitioning).
-        # Other comm types still require num_experts divisible by ep_size.
+    if config.num_experts % config.ep_size != 0 and comm_type not in (
+        COMM_NVLINK_ONE_SIDED,
+        COMM_ALLGATHER_RS,
+    ):
+        # Non-divisible EP is currently covered for NVLinkOneSided and the
+        # AllGatherReduceScatter fallback path.
         return (
             f"comm_type={comm_type} requires num_experts divisible by ep_size, "
             f"got num_experts={config.num_experts}, ep_size={config.ep_size}"
         )

QA list updates look unnecessary here because this remains unittest-only coverage. As per coding guidelines, "Coverage expectations: ... Note missing negative tests ..." and "If the PR only touches unittest/ or narrow unit scope, say explicitly whether QA list updates are unnecessary or optional."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unittest/_torch/modules/moe/test_moe_comm.py` around lines 450 - 456,
The feasibility gate incorrectly only permits non-divisible EP for
COMM_NVLINK_ONE_SIDED while CommunicationFactory now also allows
COMM_ALLGATHER_REDUCESCATTER; update the conditional in the helper (the block
that checks config.num_experts % config.ep_size != 0 and comm_type !=
COMM_NVLINK_ONE_SIDED) to also accept COMM_ALLGATHER_REDUCESCATTER (e.g., change
the single-value exclusion to a membership test or OR-check), and adjust the
returned error message text to reflect which comm types require num_experts
divisible by ep_size so tests can exercise the fallback path; locate references
to COMM_NVLINK_ONE_SIDED and COMM_ALLGATHER_REDUCESCATTER in this file to apply
the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@examples/disaggregated/slurm/benchmark/submit.py`:
- Line 713: Remove the erroneous f-string prefix on the literal
f"--no-container-mount-home --mpi=pmix --overlap -N 1 -n 1" in submit.py (this
is the list/argument element used when building the sbatch/submit command);
replace it with a plain string "--no-container-mount-home --mpi=pmix --overlap
-N 1 -n 1" so the linter F541 is not triggered.
- Line 441: The assignment to compact_packing is using
bool(hw_config.get('compact_packing', False)) which will incorrectly coerce
string values like "false" to True; change the logic in the submit code that
reads hw_config.get('compact_packing') (the compact_packing variable) to accept
actual booleans and parse string inputs explicitly (e.g., treat case-insensitive
"true"/"1"/"yes" as True and "false"/"0"/"no" as False) or fallback to the
default False when the key is missing or unrecognized; ensure the code first
checks isinstance(value, bool) and otherwise normalizes str values before
setting compact_packing.

In `@tensorrt_llm/_torch/modules/fused_moe/interface.py`:
- Around line 449-458: The current branch builds initial_global_assignments from
moe_load_balancer_config.num_local_slots even when the real load balancer is
missing, causing mismatched initial_local_expert_ids; change the logic in the
constructor/code that sets initial_global_assignments (where
moe_load_balancer_config, init_expert_size_per_partition and
initial_global_assignments are computed) to detect the actual availability of
the load balancer (i.e., only use moe_load_balancer_config.num_local_slots when
get_moe_load_balancer() returned a valid balancer object); if the balancer is
absent, set initial_global_assignments to the safe sequential mapping
list(range(self.num_experts)) instead; apply the same defensive check and
fallback in the other identical block (the second occurrence around
initial_local_expert_ids).

---

Nitpick comments:
In `@tests/unittest/_torch/modules/moe/test_moe_comm.py`:
- Around line 450-456: The feasibility gate incorrectly only permits
non-divisible EP for COMM_NVLINK_ONE_SIDED while CommunicationFactory now also
allows COMM_ALLGATHER_REDUCESCATTER; update the conditional in the helper (the
block that checks config.num_experts % config.ep_size != 0 and comm_type !=
COMM_NVLINK_ONE_SIDED) to also accept COMM_ALLGATHER_REDUCESCATTER (e.g., change
the single-value exclusion to a membership test or OR-check), and adjust the
returned error message text to reflect which comm types require num_experts
divisible by ep_size so tests can exercise the fallback path; locate references
to COMM_NVLINK_ONE_SIDED and COMM_ALLGATHER_REDUCESCATTER in this file to apply
the change.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 778f0924-4f42-4cab-8f10-6e20b58a081f

📥 Commits

Reviewing files that changed from the base of the PR and between f7fb5f4 and e426525.

📒 Files selected for processing (11)
  • cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu
  • cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp
  • examples/disaggregated/slurm/benchmark/README.md
  • examples/disaggregated/slurm/benchmark/config.yaml
  • examples/disaggregated/slurm/benchmark/disaggr_torch.slurm
  • examples/disaggregated/slurm/benchmark/run_benchmark.sh
  • examples/disaggregated/slurm/benchmark/start_worker.sh
  • examples/disaggregated/slurm/benchmark/submit.py
  • tensorrt_llm/_torch/modules/fused_moe/communication/communication_factory.py
  • tensorrt_llm/_torch/modules/fused_moe/interface.py
  • tests/unittest/_torch/modules/moe/test_moe_comm.py

Comment thread examples/disaggregated/slurm/benchmark/submit.py
f"srun -l --container-name={container_name}",
f"--container-mounts={container_mount_str}",
f"--mpi=pmix --overlap -N 1 -n 1",
f"--no-container-mount-home --mpi=pmix --overlap -N 1 -n 1",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Remove the empty f-string prefix.

This triggers F541 (f-string without placeholders) and will fail lint gates.

🔧 Proposed fix
-        f"--no-container-mount-home --mpi=pmix --overlap -N 1 -n 1",
+        "--no-container-mount-home --mpi=pmix --overlap -N 1 -n 1",
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
f"--no-container-mount-home --mpi=pmix --overlap -N 1 -n 1",
"--no-container-mount-home --mpi=pmix --overlap -N 1 -n 1",
🧰 Tools
🪛 Flake8 (7.3.0)

[error] 713-713: f-string is missing placeholders

(F541)

🪛 Ruff (0.15.13)

[error] 713-713: f-string without any placeholders

Remove extraneous f prefix

(F541)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/disaggregated/slurm/benchmark/submit.py` at line 713, Remove the
erroneous f-string prefix on the literal f"--no-container-mount-home --mpi=pmix
--overlap -N 1 -n 1" in submit.py (this is the list/argument element used when
building the sbatch/submit command); replace it with a plain string
"--no-container-mount-home --mpi=pmix --overlap -N 1 -n 1" so the linter F541 is
not triggered.

Comment thread tensorrt_llm/_torch/modules/fused_moe/interface.py
@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #49391 [ run ] completed with state SUCCESS. Commit: e426525
/LLM/main/L0_MergeRequest_PR pipeline #39042 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@JacobHu-NV
Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #49538 [ run ] triggered by Bot. Commit: e426525 Link to invocation

@zongfeijing zongfeijing requested a review from qiaoxj07 May 21, 2026 02:49
Comment thread examples/disaggregated/slurm/benchmark/submit.py
@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #49538 [ run ] completed with state SUCCESS. Commit: e426525
/LLM/main/L0_MergeRequest_PR pipeline #39167 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

- submit.py: fix calculate_nodes to match assign_server_round_robin's
  per-worker whole-node ownership semantic. Pooled formula
  ceil(world_size * num_servers / gpus_per_node) under-counts nodes
  when world_size is not a multiple of gpus_per_node (qiaoxj07 review,
  e.g. 2 ctx workers x ws=6 + 1 gen worker x ws=4 with gpus_per_node=4
  -> old formula gave 4 but allocator needed 5 -> IndexError).

- submit.py: drop erroneous f-string prefix on line introduced by
  e2157c3 (ruff F541, CodeRabbit review).

- test_moe_comm.py: relax check_feasibility to also allow
  AllGatherReduceScatter under non-divisible EP, matching the
  production fallback in CommunicationFactory.

Signed-off-by: JacobHu-NV <266902545+JacobHu-NV@users.noreply.github.com>
Match the formatting the project's yapf hook expects (one element per
line + dangling comma + closing paren on its own line), as enforced by
the PR check pre-commit on 221c9d6.

Signed-off-by: JacobHu-NV <266902545+JacobHu-NV@users.noreply.github.com>
@JacobHu-NV
Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@JacobHu-NV JacobHu-NV requested a review from qiaoxj07 May 22, 2026 02:05
@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #49813 [ run ] triggered by Bot. Commit: d4b351e Link to invocation

tianyuz-nv added a commit to tianyuz-nv/TensorRT-LLM that referenced this pull request May 22, 2026
Port the per-rank hostfile + gpu_map launcher mechanism from PR NVIDIA#13888
(Jianbo Hu, "support non-divisible EP in MoE alltoall and slurm
benchmark") into the in-tree DWDP examples launcher trio
(submit_dwdp.py, disaggr_torch_dwdp.slurm, start_worker_dwdp.sh).

Motivation: SLURM block distribution requires (n_ctx + n_gen * gen_tp)
to be a multiple of gpus_per_node, otherwise GEN tensor-parallel ranks
get split across nodes/trays and incur a large allreduce penalty.
Previously worked around by adding empty CTX slots (over-provisioning)
or picking dwdp_group multiples that happen to divide cleanly — both
fragile and incompatible with Mode B non-uniform expert ranges where
num_ctx is e.g. 5 or 6.

Dual-path design:

* Divisible case (num_ctx_gpus % gpus_per_node == 0):
  Use the legacy --nodelist + -N + --ntasks-per-node srun command.
  Block distribution gives the natural rank-to-node mapping;
  trtllm-serve picks its CUDA device from SLURM_LOCALID. No hostfile
  / gpu_map files are emitted. Perf-optimal path.

* Non-divisible case (Mode B dwdp3 dg=2, dwdp5 dg=1, etc.):
  Emit hostfile_mpi_worker_base.txt and gpu_map_mpi_worker_base.txt
  under log_dir, iterating CTX servers then GEN servers so global
  rank ordering matches split_world_comm's ctx_cfgs + gen_cfgs.
  disaggr_torch_dwdp.slurm rewrites <nodeN_placeholder> tokens to
  real hostnames at runtime. srun uses SLURM_HOSTFILE +
  --distribution=arbitrary to pin every rank's host placement.

DWDP-specific deviation from PR NVIDIA#13888: start_worker_dwdp.sh does NOT
export CUDA_VISIBLE_DEVICES. DWDP relies on intra-node peer GPU
discovery (VA composite cuMemMap of peer MNNVL fabric handles, UCX
cuda_ipc / cuda_copy intra-node KV transports, PyTorch peer device
enumeration); restricting CUDA visibility to a single GPU breaks these
paths and causes a 15% per-CTX-GPU regression (TPOT unchanged, TTFT
std blows up 3x). With our allocate_gpus's sequential cursor, gpu_id
== SLURM_LOCALID for every rank, so trtllm-serve's internal
LOCALID-based device selection already lands each rank on the correct
GPU. The gpu_map file is kept for diagnostics and audit logging only.

Empirical per-CTX-GPU req/s (DSv3-FP4-V2.1, ISL=8192/OSL=1, conc=256):
  dwdp4 dg=1 (8 GPU divisible)       3.46
  dwdp3 dg=4 (16 GPU divisible)      3.41
  dwdp5 dg=4 (24 GPU divisible)      ~3.3 (extrapolated from R1b)
  dwdp3 dg=2 (10 GPU non-divisible)  3.51
  dwdp5 dg=1 (9 GPU non-divisible)   3.42
  TPOT 14-18 ms across all configs   (kernel-neutral).

Enables the non-uniform / Mode B perf configurations used in the
follow-up dwdp_size=3/5 experiments in this refactor.

Co-authored-by: Jianbo Hu <jacohu@nvidia.com>
Signed-off-by: tianyuz-nv <tianyuz@nvidia.com>
@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #49813 [ run ] completed with state SUCCESS. Commit: d4b351e
/LLM/main/L0_MergeRequest_PR pipeline #39399 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@JacobHu-NV
Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #49907 [ run ] triggered by Bot. Commit: d4b351e Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #49907 [ run ] completed with state SUCCESS. Commit: d4b351e
/LLM/main/L0_MergeRequest_PR pipeline #39486 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants