[None][feat] support non-divisible EP in MoE alltoall and slurm benchmark by JacobHu-NV · Pull Request #13888 · NVIDIA/TensorRT-LLM

JacobHu-NV · 2026-05-08T07:20:29Z

This pull request enhances support for non-divisible expert parallelism (EP) in Mixture-of-Experts (MoE) communication by implementing ceil/floor partitioning for expert assignment across ranks. The changes ensure that MoE communication and computation remain correct and efficient when the number of experts is not evenly divisible by the number of ranks. This includes updates to CUDA kernels, Python interfaces, strategy selection logic, and comprehensive unit tests.

Key changes include:

Core algorithm and kernel updates:

Updated the CUDA kernel (moeAlltoAllKernels.cu) to use ceil/floor partitioning in compute_target_rank_id, allowing non-divisible expert-to-rank mapping and ensuring all experts are correctly assigned even when num_experts % ep_size != 0. [1] [2]
Removed the requirement for num_experts to be divisible by ep_size in the C++ operator, reflecting the new partitioning logic.

Python interface and strategy selection:

Introduced _compute_ep_partition in the Python interface to mirror the kernel's partitioning logic, ensuring consistent expert slot assignment and correct initialization in both standard and load-balanced MoE scenarios. [1] [2] [3] [4]
Modified communication strategy selection to fall back to AllGatherReduceScatter when non-divisible EP is detected, and enforced that only whitelisted methods (NVLinkOneSided, AllGather) are allowed for non-divisible configurations. [1] [2]

Testing and validation:

Updated and expanded unit tests to validate correct behavior for non-divisible EP, including helper functions for partitioning and expert-to-rank mapping, and new test parameter generators to cover edge cases. [1] [2] [3] [4] [5] [6] [7] [8] [9] [10]

These changes collectively make the MoE implementation robust to non-uniform expert distributions, improving flexibility and correctness in distributed inference scenarios.

Summary by CodeRabbit

New Features
- Support for non-divisible expert distributions in Mixture-of-Experts configurations
- Added compact_packing option for GPU worker placement in distributed SLURM deployments
Improvements
- Enhanced CUDA device selection for distributed workers using SLURM environment variables and GPU mapping
Documentation
- Updated SLURM benchmark configuration documentation with compact_packing usage guidance

…ded kernel

Replace strict equal-split partitioning with ceil/floor contiguous partitioning in compute_target_rank_id so that num_experts does not need to be divisible by ep_size. Ranks [0, num_experts % ep_size) each own (num_experts / ep_size + 1) experts; the rest own (num_experts / ep_size). When the remainder is zero the new logic is mathematically identical to the previous uniform mapping, so all existing callers see no behavior change.

The corresponding TORCH_CHECK in moeA2ADispatchOp is relaxed accordingly. The op schema is unchanged.

Tests:
- Add test_moe_comm_non_divisible_ep covering (ep=2,n=5), (ep=4,n=17), (ep=4,n=22), each with and without low-precision combine. All 6 cases pass on B200.
- Existing helpers _compute_ep_partition / _expert_id_to_rank are factored out to mirror the kernel logic so dispatch and combine reference computations stay correct in the non-divisible case.

NVLinkTwoSided still requires num_slots % ep_size == 0; the test-side check_feasibility skips non-NVLinkOneSided comm types for non-divisible configs.

Signed-off-by: JacobHu-NV <266902545+JacobHu-NV@users.noreply.github.com>
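The ceil/floor rule described in this commit message can be sketched in a few lines of Python. This is a minimal illustration, not the PR's code: the function names and signatures below are hypothetical stand-ins for `_compute_ep_partition` and `_expert_id_to_rank`, whose real signatures are not shown on this page.

```python
from __future__ import annotations


def ceil_floor_partition(num_experts: int, ep_size: int, ep_rank: int) -> tuple[int, int]:
    """Return the half-open expert range [start, end) owned by ep_rank.

    Hypothetical helper: the first (num_experts % ep_size) ranks own one
    extra expert, the remaining ranks own floor(num_experts / ep_size).
    """
    base, rem = divmod(num_experts, ep_size)
    start = ep_rank * base + min(ep_rank, rem)
    end = start + base + (1 if ep_rank < rem else 0)
    return start, end


def expert_to_rank(expert_id: int, num_experts: int, ep_size: int) -> int:
    """Inverse mapping: which rank owns expert_id under the same rule."""
    assert 0 <= expert_id < num_experts
    base, rem = divmod(num_experts, ep_size)
    boundary = rem * (base + 1)  # first expert id owned by a "floor" rank
    if expert_id < boundary:
        return expert_id // (base + 1)
    return rem + (expert_id - boundary) // base


if __name__ == "__main__":
    # The three shapes exercised by test_moe_comm_non_divisible_ep.
    for ep_size, num_experts in [(2, 5), (4, 17), (4, 22)]:
        ranges = [ceil_floor_partition(num_experts, ep_size, r) for r in range(ep_size)]
        print(ep_size, num_experts, ranges)
```

When `num_experts % ep_size == 0` the remainder is zero and both functions reduce to the uniform mapping, which is the no-behavior-change property the commit message relies on.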

…tion

Add _compute_ep_partition() using ceil/floor distribution so that num_experts % ep_size != 0 no longer crashes. Remove the assert num_experts % ep_size == 0 in the no-EPLB fallback and use the new helper for both __init__ and _init_moe_with_load_balancer.

Restrict comm backend selection for non-divisible EP: after NVLinkOneSided is tried, fall back immediately to AllGatherReduceScatter instead of attempting NVLinkTwoSided or DeepEP which both require divisible partitions. Add a whitelist guard in _create_forced_method to reject unsupported methods.

Signed-off-by: JacobHu-NV <266902545+JacobHu-NV@users.noreply.github.com>
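The selection rule in this commit message can be pictured roughly as follows. This sketch is an assumption-laden illustration: the method names are plain strings, the candidate order is invented for the example, and `select_method` / `check_forced_method` are hypothetical names, not the actual `CommunicationFactory` API.

```python
# Methods the commit message says remain valid when num_experts % ep_size != 0.
NON_DIVISIBLE_WHITELIST = ("NVLINK_ONE_SIDED", "ALLGATHER")

# Assumed candidate order for illustration only; the real factory may differ.
CANDIDATE_ORDER = ("NVLINK_ONE_SIDED", "NVLINK_TWO_SIDED", "DEEPEP", "ALLGATHER")


def select_method(num_experts: int, ep_size: int, feasible: set) -> str:
    """Pick the first feasible method, skipping non-whitelisted ones
    when the expert count does not divide evenly across ranks."""
    non_divisible = num_experts % ep_size != 0
    for method in CANDIDATE_ORDER:
        if non_divisible and method not in NON_DIVISIBLE_WHITELIST:
            continue  # NVLinkTwoSided / DeepEP need divisible partitions
        if method in feasible:
            return method
    return "ALLGATHER"  # AllGatherReduceScatter is the final fallback


def check_forced_method(method: str, num_experts: int, ep_size: int) -> str:
    """Reject a user-forced method that cannot handle non-divisible EP."""
    if num_experts % ep_size != 0 and method not in NON_DIVISIBLE_WHITELIST:
        raise ValueError(
            f"{method} requires num_experts divisible by ep_size, "
            f"got num_experts={num_experts}, ep_size={ep_size}"
        )
    return method
```

The point of the guard is that a non-divisible configuration never reaches a backend that assumes equal partitions, whether the method is auto-selected or forced.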

- submit.py: switch allocator to per-node round-robin where each ctx/gen worker owns ceil(world_size/gpus_per_node) whole nodes; advance cursor by full nodes so start_worker.sh can derive GPU id from SLURM_LOCALID without an explicit cuda_devices arg.
- submit.py: tag log dir with _ep{gen_ep_size} when EP differs from TP; honor CLI --log-dir over yaml log_dir; add --no-container-mount-home to client srun for consistency with worker srun.
- start_worker.sh: drop the cuda_devices positional arg; map GPU from SLURM_LOCALID.
- run_benchmark.sh: add --trust-remote-code to UCX warmup so trust-remote models (e.g. Kimi-K2) can warm up.

Signed-off-by: JacobHu-NV <jacohu@nvidia.com>

Add an opt-in `hardware.compact_packing` flag (default false) to the disagg benchmark submitter. When enabled, pack workers onto the minimum number of nodes by allowing two workers to share a physical node when their GPU counts straddle a node boundary (e.g., two TP=6 ctx workers fit in 3 four-GPU nodes via 4+2 / 2+4, with the middle node split between them). NVL72's full NVLink fabric makes the shared node free in performance terms. Default behavior is unchanged: each worker owns whole nodes, round-robin across them.

submit.py
- allocate_gpus: take a compact_packing arg and dispatch to assign_server_compact (per-GPU cursor) or assign_server_round_robin (per-node cursor, preserved as the default).
- total_nodes: ceil(total_gpus / gpus_per_node) under compact_packing; ctx_nodes + gen_nodes otherwise.
- In compact_packing mode, emit per-worker hostfile_<role>_<id>_base.txt (one host per task in rank order) and gpu_map_<role>_<id>_base.txt (rank host gpu) with <nodeN_placeholder> hostnames; these drive srun --distribution=arbitrary and start_worker.sh's GPU lookup.
- Worker srun command: under compact_packing, prefix SLURM_HOSTFILE=<runtime_path> and pass --distribution=arbitrary (drop -N and --ntasks-per-node, both incompatible with arbitrary distribution); otherwise keep the original --nodelist + -N + --ntasks.

start_worker.sh
- CUDA_VISIBLE_DEVICES selection branches on whether the per-worker gpu_map file exists: compact_packing mode reads it via awk keyed by SLURM_PROCID (two workers sharing a node would both see LOCALID 0 but each has its own gpu_map and PROCID space, so they pick distinct physical GPUs without collision); otherwise fall back to SLURM_LOCALID as before.

disaggr_torch.slurm
- After rewriting placeholders in the existing files, iterate over hostfile_*_base.txt and gpu_map_*_base.txt and apply the same replace_placeholder substitution. The glob's '[ -f ] || continue' guard makes this a no-op when compact_packing is off.

Validated on NVL72 with two real benchmarks (compact_packing=true):
- 1ctx6_1gen6: 4 nodes -> 3 (T11 shared by ctx[4,5]/gen[0,1]), throughput 137232 -> 137232 tok/s (-0.35%, in noise).
- 2ctx6_1gen4: previously raised IndexError; now 4 nodes (T11 shared by ctx0[4,5]/ctx1[0,1]) and bench completes.

Signed-off-by: JacobHu-NV <266902545+JacobHu-NV@users.noreply.github.com>
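The difference between the two allocation modes in this commit message can be shown with a small sketch. It assumes a simplified model in which each worker is just a GPU count and each placement is a `(node_index, gpu_index)` pair; the function names are hypothetical and only loosely mirror `assign_server_round_robin` and `assign_server_compact`.

```python
import math


def place_whole_nodes(world_sizes, gpus_per_node):
    """Per-node cursor: every worker starts on a fresh node and owns
    ceil(world_size / gpus_per_node) whole nodes (the default mode)."""
    placements, node = [], 0
    for ws in world_sizes:
        placements.append([(node + r // gpus_per_node, r % gpus_per_node) for r in range(ws)])
        node += math.ceil(ws / gpus_per_node)
    return placements


def place_compact(world_sizes, gpus_per_node):
    """Per-GPU cursor: workers are packed back to back, so two workers
    may share the node where one ends and the next begins."""
    placements, cursor = [], 0
    for ws in world_sizes:
        placements.append([divmod(cursor + r, gpus_per_node) for r in range(ws)])
        cursor += ws
    return placements


def nodes_used(placements):
    """Number of distinct nodes touched by any rank."""
    return len({node for worker in placements for node, _ in worker})


if __name__ == "__main__":
    # Two TP=6 workers on four-GPU nodes, as in the commit message example.
    for fn in (place_whole_nodes, place_compact):
        layout = fn([6, 6], 4)
        print(fn.__name__, nodes_used(layout), layout)
```

In the compact layout the second worker's rank 0 lands on GPU 2 of the shared node, which illustrates why a local-id-based GPU choice is not enough there and a per-worker rank-to-GPU map is used instead.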

JacobHu-NV · 2026-05-20T07:55:26Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-20T08:01:09Z

PR_Github #49391 [ run ] triggered by Bot. Commit: e426525 Link to invocation

coderabbitai · 2026-05-20T08:02:05Z

Walkthrough

This PR extends MoE expert parallelism to support non-divisible expert distributions via ceil/floor contiguous partitioning and introduces SLURM compact-packing mode for efficient cross-node GPU scheduling. CUDA kernels, PyTorch interfaces, and tests are updated to handle uneven expert-to-rank assignments; SLURM job submission adds optional per-worker hostfiles and GPU mapping.

Changes

Non-Divisible Expert Parallel Support

Layer / File(s)	Summary
CUDA kernel expert mapping and op validation `cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu`, `cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp`	`compute_target_rank_id` rewritten to compute ceil/floor expert partitioning from `num_experts` and `ep_size` instead of assuming even division; divisibility validation removed from the op.
MoE interface partition computation and initialization `tensorrt_llm/_torch/modules/fused_moe/interface.py`	New `_compute_ep_partition` helper derives expert count and slot boundaries per EP rank using ceil/floor distribution; MoE init updated to use it when load balancer absent, removing divisibility assumptions.
Communication strategy guards and method selection `tensorrt_llm/_torch/modules/fused_moe/communication/communication_factory.py`	`create_strategy` adds guard to skip unsupported methods for `num_experts % ep_size != 0` and fall back to AllGatherReduceScatter; `_create_forced_method` enforces whitelist of `NVLINK_ONE_SIDED` and `ALLGATHER` for non-divisible EP.
Test helpers and reference compute implementation `tests/unittest/_torch/modules/moe/test_moe_comm.py` (first 3 layers)	Added partition helpers and refactored `simple_moe` to accept explicit `(slot_start, slot_end)` instead of deriving from fixed `experts_per_rank`; worker pipeline computes per-rank slot ranges using new helper.
Dispatch and combine verification for non-divisible expert partitioning `tests/unittest/_torch/modules/moe/test_moe_comm.py` (verification layers)	Token routing expectations, AllToAll slot validation, and DeepEP expert selection updated to use ceil/floor slot boundaries from `_compute_ep_partition` instead of fixed divisions.
Non-divisible EP test parameters and test case `tests/unittest/_torch/modules/moe/test_moe_comm.py` (test generation)	Added parameter generator for non-divisible scenarios restricted to `NVLinkOneSided` and new `test_moe_comm_non_divisible_ep` exercising full dispatch→compute→combine pipeline.

SLURM Compact Packing Scheduling

Layer / File(s)	Summary
Configuration schema and documentation for compact_packing `examples/disaggregated/slurm/benchmark/config.yaml`, `examples/disaggregated/slurm/benchmark/README.md`	Commented-out `compact_packing` option and guidance added to benchmark config; README documents when compact packing is recommended (full-mesh NVLink) vs. discouraged (PCIe/partitioned NVLink).
Worker CUDA device selection and gpu_map lookup `examples/disaggregated/slurm/benchmark/start_worker.sh`	CUDA device selection replaced: reads optional per-worker `gpu_map_*.txt` file to map `SLURM_PROCID` to GPU id via `awk`, falls back to `SLURM_LOCALID`, errors if mapping requested but missing.
Base file placeholder substitution `examples/disaggregated/slurm/benchmark/disaggr_torch.slurm`	Loop added to detect `hostfile_*_base.txt` and `gpu_map_*_base.txt` files in log directory, strips `_base.txt` suffix, and replaces `<nodeN_placeholder>` values with actual hostnames for arbitrary srun distribution.
Benchmark warmup command argument formatting `examples/disaggregated/slurm/benchmark/run_benchmark.sh`	Warmup benchmark Python command adjusted so `--non-streaming` and `--trust-remote-code` are properly continued across shell lines in same invocation.
GPU allocation strategy and node count computation `examples/disaggregated/slurm/benchmark/submit.py` (allocation/node layers)	`allocate_gpus` extended with `compact_packing` parameter; when enabled performs cursor-based placement across nodes with shared-node GPU layouts; `submit_job` reads flag from config and computes single or dual node counts accordingly.
Log directory tagging with expert-parallel suffix `examples/disaggregated/slurm/benchmark/submit.py` (log tagging)	Log directory suffix extended to include `_ep{gen_ep_size}` when expert-parallel size differs from tensor-parallel; generation TP label adjusted based on attention distributed-parallel configuration.
Worker launch planning, hostfile generation, and srun invocation `examples/disaggregated/slurm/benchmark/submit.py` (worker launch)	Worker launch branches on `compact_packing`: when enabled creates per-worker hostfiles/gpu_maps from allocation nodes and uses `SLURM_HOSTFILE` with `--distribution=arbitrary`; when disabled uses prior node-list approach. Removed `cuda_devices` variable insertion; adjusted client srun prefix flag ordering.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

tfogal
galagam

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Description check	⚠️ Warning	The description is comprehensive, well-structured, and addresses all key sections including motivation, changes, and testing. However, the PR Checklist section (final required section of the template) is completely missing.	Add the PR Checklist section with checkboxes confirming compliance with CODING_GUIDELINES, test coverage, API change labels, dependency scans, CODEOWNERS updates, documentation, and reviewer assignments.
Docstring Coverage	⚠️ Warning	Docstring coverage is 67.86% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and concisely describes the main feature: support for non-divisible expert parallelism in MoE alltoall and SLURM benchmark, matching the PR's core objective.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (1)

tests/unittest/_torch/modules/moe/test_moe_comm.py (1)

450-456: ⚡ Quick win

Keep the feasibility gate aligned with the production whitelist.

CommunicationFactory now allows AllGatherReduceScatter for non-divisible EP, but this helper still skips every backend except NVLinkOneSided. That makes the fallback path impossible to cover in this suite.

Suggested fix

-    if config.num_experts % config.ep_size != 0 and comm_type != COMM_NVLINK_ONE_SIDED:
-        # Only NVLinkOneSided supports non-divisible EP (ceil/floor partitioning).
-        # Other comm types still require num_experts divisible by ep_size.
+    if config.num_experts % config.ep_size != 0 and comm_type not in (
+        COMM_NVLINK_ONE_SIDED,
+        COMM_ALLGATHER_RS,
+    ):
+        # Non-divisible EP is currently covered for NVLinkOneSided and the
+        # AllGatherReduceScatter fallback path.
         return (
             f"comm_type={comm_type} requires num_experts divisible by ep_size, "
             f"got num_experts={config.num_experts}, ep_size={config.ep_size}"
         )

QA list updates look unnecessary here because this remains unittest-only coverage. As per coding guidelines, "Coverage expectations: ... Note missing negative tests ..." and "If the PR only touches unittest/ or narrow unit scope, say explicitly whether QA list updates are unnecessary or optional."

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unittest/_torch/modules/moe/test_moe_comm.py` around lines 450 - 456,
The feasibility gate incorrectly only permits non-divisible EP for
COMM_NVLINK_ONE_SIDED while CommunicationFactory now also allows
COMM_ALLGATHER_REDUCESCATTER; update the conditional in the helper (the block
that checks config.num_experts % config.ep_size != 0 and comm_type !=
COMM_NVLINK_ONE_SIDED) to also accept COMM_ALLGATHER_REDUCESCATTER (e.g., change
the single-value exclusion to a membership test or OR-check), and adjust the
returned error message text to reflect which comm types require num_experts
divisible by ep_size so tests can exercise the fallback path; locate references
to COMM_NVLINK_ONE_SIDED and COMM_ALLGATHER_REDUCESCATTER in this file to apply
the change.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@examples/disaggregated/slurm/benchmark/submit.py`:
- Line 713: Remove the erroneous f-string prefix on the literal
f"--no-container-mount-home --mpi=pmix --overlap -N 1 -n 1" in submit.py (this
is the list/argument element used when building the sbatch/submit command);
replace it with a plain string "--no-container-mount-home --mpi=pmix --overlap
-N 1 -n 1" so the linter F541 is not triggered.
- Line 441: The assignment to compact_packing is using
bool(hw_config.get('compact_packing', False)) which will incorrectly coerce
string values like "false" to True; change the logic in the submit code that
reads hw_config.get('compact_packing') (the compact_packing variable) to accept
actual booleans and parse string inputs explicitly (e.g., treat case-insensitive
"true"/"1"/"yes" as True and "false"/"0"/"no" as False) or fallback to the
default False when the key is missing or unrecognized; ensure the code first
checks isinstance(value, bool) and otherwise normalizes str values before
setting compact_packing.

In `@tensorrt_llm/_torch/modules/fused_moe/interface.py`:
- Around line 449-458: The current branch builds initial_global_assignments from
moe_load_balancer_config.num_local_slots even when the real load balancer is
missing, causing mismatched initial_local_expert_ids; change the logic in the
constructor/code that sets initial_global_assignments (where
moe_load_balancer_config, init_expert_size_per_partition and
initial_global_assignments are computed) to detect the actual availability of
the load balancer (i.e., only use moe_load_balancer_config.num_local_slots when
get_moe_load_balancer() returned a valid balancer object); if the balancer is
absent, set initial_global_assignments to the safe sequential mapping
list(range(self.num_experts)) instead; apply the same defensive check and
fallback in the other identical block (the second occurrence around
initial_local_expert_ids).

---

Nitpick comments:
In `@tests/unittest/_torch/modules/moe/test_moe_comm.py`:
- Around line 450-456: The feasibility gate incorrectly only permits
non-divisible EP for COMM_NVLINK_ONE_SIDED while CommunicationFactory now also
allows COMM_ALLGATHER_REDUCESCATTER; update the conditional in the helper (the
block that checks config.num_experts % config.ep_size != 0 and comm_type !=
COMM_NVLINK_ONE_SIDED) to also accept COMM_ALLGATHER_REDUCESCATTER (e.g., change
the single-value exclusion to a membership test or OR-check), and adjust the
returned error message text to reflect which comm types require num_experts
divisible by ep_size so tests can exercise the fallback path; locate references
to COMM_NVLINK_ONE_SIDED and COMM_ALLGATHER_REDUCESCATTER in this file to apply
the change.
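The `compact_packing` coercion issue raised in the inline comments above comes from the fact that `bool()` on any non-empty string is `True`, including `"false"`. A minimal sketch of explicit parsing follows; `parse_bool_flag` is a hypothetical helper name, not a function from submit.py, and the accepted spellings are the ones listed in the review comment.

```python
def parse_bool_flag(value, default=False):
    """Interpret a config value as a boolean without relying on bool().

    Real booleans pass through; recognised strings are normalised;
    anything else (including a missing key, i.e. None) yields default.
    """
    if isinstance(value, bool):
        return value
    if isinstance(value, str):
        text = value.strip().lower()
        if text in ("true", "1", "yes"):
            return True
        if text in ("false", "0", "no"):
            return False
    return default


if __name__ == "__main__":
    hw_config = {"compact_packing": "false"}  # e.g. a quoted YAML value
    print(bool(hw_config.get("compact_packing", False)))       # naive coercion
    print(parse_bool_flag(hw_config.get("compact_packing")))   # explicit parsing
```

An unquoted YAML `false` is already loaded as a Python `bool`, so the explicit parser only changes behavior for quoted or otherwise stringified values.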

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 778f0924-4f42-4cab-8f10-6e20b58a081f

📥 Commits

Reviewing files that changed from the base of the PR and between f7fb5f4 and e426525.

📒 Files selected for processing (11)

cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cu
cpp/tensorrt_llm/thop/moeAlltoAllOp.cpp
examples/disaggregated/slurm/benchmark/README.md
examples/disaggregated/slurm/benchmark/config.yaml
examples/disaggregated/slurm/benchmark/disaggr_torch.slurm
examples/disaggregated/slurm/benchmark/run_benchmark.sh
examples/disaggregated/slurm/benchmark/start_worker.sh
examples/disaggregated/slurm/benchmark/submit.py
tensorrt_llm/_torch/modules/fused_moe/communication/communication_factory.py
tensorrt_llm/_torch/modules/fused_moe/interface.py
tests/unittest/_torch/modules/moe/test_moe_comm.py

coderabbitai · 2026-05-20T08:02:08Z

        f"srun -l --container-name={container_name}",
        f"--container-mounts={container_mount_str}",
-        f"--mpi=pmix --overlap -N 1 -n 1",
+        f"--no-container-mount-home --mpi=pmix --overlap -N 1 -n 1",


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Remove the empty f-string prefix.

This triggers F541 (f-string without placeholders) and will fail lint gates.

🔧 Proposed fix

-        f"--no-container-mount-home --mpi=pmix --overlap -N 1 -n 1",
+        "--no-container-mount-home --mpi=pmix --overlap -N 1 -n 1",

🧰 Tools

🪛 Flake8 (7.3.0)

[error] 713-713: f-string is missing placeholders

(F541)

🪛 Ruff (0.15.13)

[error] 713-713: f-string without any placeholders

Remove extraneous f prefix

(F541)

tensorrt-cicd · 2026-05-20T21:23:25Z

PR_Github #49391 [ run ] completed with state SUCCESS. Commit: e426525
/LLM/main/L0_MergeRequest_PR pipeline #39042 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

JacobHu-NV · 2026-05-21T02:05:14Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-21T02:11:07Z

PR_Github #49538 [ run ] triggered by Bot. Commit: e426525 Link to invocation

tensorrt-cicd · 2026-05-21T04:52:19Z

PR_Github #49538 [ run ] completed with state SUCCESS. Commit: e426525
/LLM/main/L0_MergeRequest_PR pipeline #39167 completed with status: 'FAILURE'

- submit.py: fix calculate_nodes to match assign_server_round_robin's per-worker whole-node ownership semantic. Pooled formula ceil(world_size * num_servers / gpus_per_node) under-counts nodes when world_size is not a multiple of gpus_per_node (qiaoxj07 review, e.g. 2 ctx workers x ws=6 + 1 gen worker x ws=4 with gpus_per_node=4 -> old formula gave 4 but allocator needed 5 -> IndexError).
- submit.py: drop erroneous f-string prefix on line introduced by e2157c3 (ruff F541, CodeRabbit review).
- test_moe_comm.py: relax check_feasibility to also allow AllGatherReduceScatter under non-divisible EP, matching the production fallback in CommunicationFactory.

Signed-off-by: JacobHu-NV <266902545+JacobHu-NV@users.noreply.github.com>
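The node-count mismatch described in the first bullet can be reproduced with plain arithmetic. The two functions below are illustrative only (their names are hypothetical, not the ones in submit.py); they contrast the pooled formula with per-worker whole-node ownership for the configuration quoted in the commit message.

```python
import math


def pooled_nodes(world_size, num_servers, gpus_per_node):
    """Old formula: pool all GPUs of a role, then round up once."""
    return math.ceil(world_size * num_servers / gpus_per_node)


def whole_node_nodes(world_size, num_servers, gpus_per_node):
    """Per-worker ownership: round up per worker, then multiply."""
    return math.ceil(world_size / gpus_per_node) * num_servers


if __name__ == "__main__":
    gpus_per_node = 4
    # 2 ctx workers with world_size 6, plus 1 gen worker with world_size 4.
    old = pooled_nodes(6, 2, gpus_per_node) + pooled_nodes(4, 1, gpus_per_node)
    new = whole_node_nodes(6, 2, gpus_per_node) + whole_node_nodes(4, 1, gpus_per_node)
    print(old, new)
```

Each world_size=6 worker needs two four-GPU nodes on its own, so the allocator walks five nodes while the pooled count reserved only four, which is how the IndexError arose.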

Match the formatting the project's yapf hook expects (one element per line + dangling comma + closing paren on its own line), as enforced by the PR check pre-commit on 221c9d6. Signed-off-by: JacobHu-NV <266902545+JacobHu-NV@users.noreply.github.com>

JacobHu-NV · 2026-05-22T02:04:46Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-22T02:11:49Z

PR_Github #49813 [ run ] triggered by Bot. Commit: d4b351e Link to invocation

Port the per-rank hostfile + gpu_map launcher mechanism from PR NVIDIA#13888 (Jianbo Hu, "support non-divisible EP in MoE alltoall and slurm benchmark") into the in-tree DWDP examples launcher trio (submit_dwdp.py, disaggr_torch_dwdp.slurm, start_worker_dwdp.sh).

Motivation: SLURM block distribution requires (n_ctx + n_gen * gen_tp) to be a multiple of gpus_per_node, otherwise GEN tensor-parallel ranks get split across nodes/trays and incur a large allreduce penalty. Previously worked around by adding empty CTX slots (over-provisioning) or picking dwdp_group multiples that happen to divide cleanly; both are fragile and incompatible with Mode B non-uniform expert ranges where num_ctx is e.g. 5 or 6.

Dual-path design:
* Divisible case (num_ctx_gpus % gpus_per_node == 0): Use the legacy --nodelist + -N + --ntasks-per-node srun command. Block distribution gives the natural rank-to-node mapping; trtllm-serve picks its CUDA device from SLURM_LOCALID. No hostfile / gpu_map files are emitted. Perf-optimal path.
* Non-divisible case (Mode B dwdp3 dg=2, dwdp5 dg=1, etc.): Emit hostfile_mpi_worker_base.txt and gpu_map_mpi_worker_base.txt under log_dir, iterating CTX servers then GEN servers so global rank ordering matches split_world_comm's ctx_cfgs + gen_cfgs. disaggr_torch_dwdp.slurm rewrites <nodeN_placeholder> tokens to real hostnames at runtime. srun uses SLURM_HOSTFILE + --distribution=arbitrary to pin every rank's host placement.

DWDP-specific deviation from PR NVIDIA#13888: start_worker_dwdp.sh does NOT export CUDA_VISIBLE_DEVICES. DWDP relies on intra-node peer GPU discovery (VA composite cuMemMap of peer MNNVL fabric handles, UCX cuda_ipc / cuda_copy intra-node KV transports, PyTorch peer device enumeration); restricting CUDA visibility to a single GPU breaks these paths and causes a 15% per-CTX-GPU regression (TPOT unchanged, TTFT std blows up 3x). With our allocate_gpus's sequential cursor, gpu_id == SLURM_LOCALID for every rank, so trtllm-serve's internal LOCALID-based device selection already lands each rank on the correct GPU. The gpu_map file is kept for diagnostics and audit logging only.

Empirical per-CTX-GPU req/s (DSv3-FP4-V2.1, ISL=8192/OSL=1, conc=256):
* dwdp4 dg=1 (8 GPU divisible): 3.46
* dwdp3 dg=4 (16 GPU divisible): 3.41
* dwdp5 dg=4 (24 GPU divisible): ~3.3 (extrapolated from R1b)
* dwdp3 dg=2 (10 GPU non-divisible): 3.51
* dwdp5 dg=1 (9 GPU non-divisible): 3.42

TPOT 14-18 ms across all configs (kernel-neutral).

Enables the non-uniform / Mode B perf configurations used in the follow-up dwdp_size=3/5 experiments in this refactor.

Co-authored-by: Jianbo Hu <jacohu@nvidia.com>
Signed-off-by: tianyuz-nv <tianyuz@nvidia.com>
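The hostfile and gpu_map files mentioned in this commit message can be pictured with a short sketch. It assumes a placement given as `(node_index, gpu_index)` pairs in rank order, a placeholder spelled `<node{N}_placeholder>`, and the "rank host gpu" line layout quoted earlier on this page; the function names are hypothetical, and the lookup mirrors what the `awk` step keyed by `SLURM_PROCID` is described as doing rather than reproducing it.

```python
def build_launch_files(placement):
    """Return (hostfile_lines, gpu_map_lines) for one worker.

    hostfile: one host per task, in rank order.
    gpu_map:  'rank host gpu' per task.
    """
    hostfile, gpu_map = [], []
    for rank, (node, gpu) in enumerate(placement):
        host = f"<node{node}_placeholder>"  # rewritten to a real hostname at runtime
        hostfile.append(host)
        gpu_map.append(f"{rank} {host} {gpu}")
    return hostfile, gpu_map


def lookup_gpu(gpu_map_lines, procid):
    """Find the GPU index for a given process rank in gpu_map lines."""
    for line in gpu_map_lines:
        rank, _host, gpu = line.split()
        if int(rank) == procid:
            return int(gpu)
    raise KeyError(f"rank {procid} not present in gpu_map")


if __name__ == "__main__":
    # A worker that starts on GPU 2 of a shared node and spills onto the next.
    placement = [(1, 2), (1, 3), (2, 0), (2, 1)]
    hostfile, gpu_map = build_launch_files(placement)
    print("\n".join(hostfile))
    print("\n".join(gpu_map))
    print(lookup_gpu(gpu_map, 0))
```

Keying the lookup by the worker's own rank rather than the node-local id is what keeps two workers on a shared node from choosing the same physical GPU.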

tensorrt-cicd · 2026-05-22T09:07:56Z

PR_Github #49813 [ run ] completed with state SUCCESS. Commit: d4b351e
/LLM/main/L0_MergeRequest_PR pipeline #39399 completed with status: 'FAILURE'

JacobHu-NV · 2026-05-22T09:09:53Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-22T09:15:32Z

PR_Github #49907 [ run ] triggered by Bot. Commit: d4b351e Link to invocation

tensorrt-cicd · 2026-05-22T12:04:50Z

PR_Github #49907 [ run ] completed with state SUCCESS. Commit: d4b351e
/LLM/main/L0_MergeRequest_PR pipeline #39486 completed with status: 'FAILURE'

github-actions Bot assigned JacobHu-NV May 8, 2026

JacobHu-NV force-pushed the feat/non-divisible-ep branch 2 times, most recently from aa060db to ec776da Compare May 9, 2026 09:54

JacobHu-NV changed the title ~~Support non-divisible expert partitioning in MoE expert parallelism~~ [None][feat] support non-divisible EP in MoE alltoall and slurm benchmark May 9, 2026

JacobHu-NV force-pushed the feat/non-divisible-ep branch from ec776da to 40dda9b Compare May 20, 2026 06:15

JacobHu-NV and others added 4 commits May 20, 2026 00:43

JacobHu-NV force-pushed the feat/non-divisible-ep branch from 40dda9b to e426525 Compare May 20, 2026 07:47

JacobHu-NV marked this pull request as ready for review May 20, 2026 07:54

JacobHu-NV requested review from a team as code owners May 20, 2026 07:54

JacobHu-NV requested review from FrankD412, QiJune, bo-nv and reasonsolo May 20, 2026 07:54

coderabbitai Bot reviewed May 20, 2026

Shixiaowei02 requested review from Shixiaowei02 and kaiyux May 21, 2026 02:49

zongfeijing requested a review from qiaoxj07 May 21, 2026 02:49

qiaoxj07 reviewed May 21, 2026

Comment thread examples/disaggregated/slurm/benchmark/submit.py

JacobHu-NV added 2 commits May 21, 2026 01:44

JacobHu-NV requested a review from qiaoxj07 May 22, 2026 02:05

tianyuz-nv mentioned this pull request May 22, 2026

[None][feat] Refactor DWDP from CUDA IPC to CUDA VMM + MNNVL composite VA #14453

Draft

7 tasks
