[None][feat] support non-divisible EP in MoE alltoall and slurm benchmark#13888
[None][feat] support non-divisible EP in MoE alltoall and slurm benchmark#13888JacobHu-NV wants to merge 6 commits into
Conversation
aa060db to
ec776da
Compare
ec776da to
40dda9b
Compare
…ded kernel Replace strict equal-split partitioning with ceil/floor contiguous partitioning in compute_target_rank_id so that num_experts does not need to be divisible by ep_size. Ranks [0, num_experts % ep_size) each own (num_experts / ep_size + 1) experts; the rest own (num_experts / ep_size). When the remainder is zero the new logic is mathematically identical to the previous uniform mapping, so all existing callers see no behavior change. The corresponding TORCH_CHECK in moeA2ADispatchOp is relaxed accordingly. The op schema is unchanged. Tests: - Add test_moe_comm_non_divisible_ep covering (ep=2,n=5), (ep=4,n=17), (ep=4,n=22), each with and without low-precision combine. All 6 cases pass on B200. - Existing helpers _compute_ep_partition / _expert_id_to_rank are factored out to mirror the kernel logic so dispatch and combine reference computations stay correct in the non-divisible case. NVLinkTwoSided still requires num_slots % ep_size == 0; the test-side check_feasibility skips non-NVLinkOneSided comm types for non-divisible configs. Signed-off-by: JacobHu-NV <266902545+JacobHu-NV@users.noreply.github.com>
…tion Add _compute_ep_partition() using ceil/floor distribution so that num_experts % ep_size != 0 no longer crashes. Remove the assert num_experts % ep_size == 0 in the no-EPLB fallback and use the new helper for both __init__ and _init_moe_with_load_balancer. Restrict comm backend selection for non-divisible EP: after NVLinkOneSided is tried, fall back immediately to AllGatherReduceScatter instead of attempting NVLinkTwoSided or DeepEP which both require divisible partitions. Add a whitelist guard in _create_forced_method to reject unsupported methods. Signed-off-by: JacobHu-NV <266902545+JacobHu-NV@users.noreply.github.com>
- submit.py: switch allocator to per-node round-robin where each ctx/gen
worker owns ceil(world_size/gpus_per_node) whole nodes; advance cursor
by full nodes so start_worker.sh can derive GPU id from SLURM_LOCALID
without an explicit cuda_devices arg.
- submit.py: tag log dir with _ep{gen_ep_size} when EP differs from TP;
honor CLI --log-dir over yaml log_dir; add --no-container-mount-home
to client srun for consistency with worker srun.
- start_worker.sh: drop the cuda_devices positional arg; map GPU from
SLURM_LOCALID.
- run_benchmark.sh: add --trust-remote-code to UCX warmup so trust-remote
models (e.g. Kimi-K2) can warm up.
Signed-off-by: JacobHu-NV <jacohu@nvidia.com>
Add an opt-in `hardware.compact_packing` flag (default false) to the
disagg benchmark submitter. When enabled, pack workers onto the minimum
number of nodes by allowing two workers to share a physical node when
their GPU counts straddle a node boundary (e.g., two TP=6 ctx workers
fit in 3 four-GPU nodes via 4+2 / 2+4, with the middle node split
between them). NVL72's full NVLink fabric makes the shared node free
in performance terms. Default behavior is unchanged: each worker owns
whole nodes, round-robin across them.
submit.py
- allocate_gpus: take a compact_packing arg and dispatch to
assign_server_compact (per-GPU cursor) or assign_server_round_robin
(per-node cursor, preserved as the default).
- total_nodes: ceil(total_gpus / gpus_per_node) under compact_packing;
ctx_nodes + gen_nodes otherwise.
- In compact_packing mode, emit per-worker hostfile_<role>_<id>_base.txt
(one host per task in rank order) and gpu_map_<role>_<id>_base.txt
(rank host gpu) with <nodeN_placeholder> hostnames; these drive srun
--distribution=arbitrary and start_worker.sh's GPU lookup.
- Worker srun command: under compact_packing, prefix
SLURM_HOSTFILE=<runtime_path> and pass --distribution=arbitrary
(drop -N and --ntasks-per-node, both incompatible with arbitrary
distribution); otherwise keep the original --nodelist + -N + --ntasks.
start_worker.sh
- CUDA_VISIBLE_DEVICES selection branches on whether the per-worker
gpu_map file exists: compact_packing mode reads it via awk keyed by
SLURM_PROCID (two workers sharing a node would both see LOCALID 0
but each has its own gpu_map and PROCID space, so they pick distinct
physical GPUs without collision); otherwise fall back to
SLURM_LOCALID as before.
disaggr_torch.slurm
- After rewriting placeholders in the existing files, iterate over
hostfile_*_base.txt and gpu_map_*_base.txt and apply the same
replace_placeholder substitution. The glob's '[ -f ] || continue'
guard makes this a no-op when compact_packing is off.
Validated on NVL72 with two real benchmarks (compact_packing=true):
1ctx6_1gen6: 4 nodes -> 3 (T11 shared by ctx[4,5]/gen[0,1]),
throughput 137232 -> 137232 tok/s (-0.35%, in noise).
2ctx6_1gen4: previously raised IndexError; now 4 nodes
(T11 shared by ctx0[4,5]/ctx1[0,1]) and bench completes.
Signed-off-by: JacobHu-NV <266902545+JacobHu-NV@users.noreply.github.com>
40dda9b to
e426525
Compare
|
/bot run --disable-fail-fast |
|
PR_Github #49391 [ run ] triggered by Bot. Commit: |
📝 WalkthroughWalkthroughThis PR extends MoE expert parallelism to support non-divisible expert distributions via ceil/floor contiguous partitioning and introduces SLURM compact-packing mode for efficient cross-node GPU scheduling. CUDA kernels, PyTorch interfaces, and tests are updated to handle uneven expert-to-rank assignments; SLURM job submission adds optional per-worker hostfiles and GPU mapping. ChangesNon-Divisible Expert Parallel Support
SLURM Compact Packing Scheduling
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Suggested reviewers
🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (1)
tests/unittest/_torch/modules/moe/test_moe_comm.py (1)
450-456: ⚡ Quick winKeep the feasibility gate aligned with the production whitelist.
CommunicationFactorynow allowsAllGatherReduceScatterfor non-divisible EP, but this helper still skips every backend exceptNVLinkOneSided. That makes the fallback path impossible to cover in this suite.Suggested fix
- if config.num_experts % config.ep_size != 0 and comm_type != COMM_NVLINK_ONE_SIDED: - # Only NVLinkOneSided supports non-divisible EP (ceil/floor partitioning). - # Other comm types still require num_experts divisible by ep_size. + if config.num_experts % config.ep_size != 0 and comm_type not in ( + COMM_NVLINK_ONE_SIDED, + COMM_ALLGATHER_RS, + ): + # Non-divisible EP is currently covered for NVLinkOneSided and the + # AllGatherReduceScatter fallback path. return ( f"comm_type={comm_type} requires num_experts divisible by ep_size, " f"got num_experts={config.num_experts}, ep_size={config.ep_size}" )QA list updates look unnecessary here because this remains unittest-only coverage. As per coding guidelines, "Coverage expectations: ... Note missing negative tests ..." and "If the PR only touches unittest/ or narrow unit scope, say explicitly whether QA list updates are unnecessary or optional."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unittest/_torch/modules/moe/test_moe_comm.py` around lines 450 - 456, The feasibility gate incorrectly only permits non-divisible EP for COMM_NVLINK_ONE_SIDED while CommunicationFactory now also allows COMM_ALLGATHER_REDUCESCATTER; update the conditional in the helper (the block that checks config.num_experts % config.ep_size != 0 and comm_type != COMM_NVLINK_ONE_SIDED) to also accept COMM_ALLGATHER_REDUCESCATTER (e.g., change the single-value exclusion to a membership test or OR-check), and adjust the returned error message text to reflect which comm types require num_experts divisible by ep_size so tests can exercise the fallback path; locate references to COMM_NVLINK_ONE_SIDED and COMM_ALLGATHER_REDUCESCATTER in this file to apply the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@examples/disaggregated/slurm/benchmark/submit.py`:
- Line 713: Remove the erroneous f-string prefix on the literal
f"--no-container-mount-home --mpi=pmix --overlap -N 1 -n 1" in submit.py (this
is the list/argument element used when building the sbatch/submit command);
replace it with a plain string "--no-container-mount-home --mpi=pmix --overlap
-N 1 -n 1" so the linter F541 is not triggered.
- Line 441: The assignment to compact_packing is using
bool(hw_config.get('compact_packing', False)) which will incorrectly coerce
string values like "false" to True; change the logic in the submit code that
reads hw_config.get('compact_packing') (the compact_packing variable) to accept
actual booleans and parse string inputs explicitly (e.g., treat case-insensitive
"true"/"1"/"yes" as True and "false"/"0"/"no" as False) or fallback to the
default False when the key is missing or unrecognized; ensure the code first
checks isinstance(value, bool) and otherwise normalizes str values before
setting compact_packing.
In `@tensorrt_llm/_torch/modules/fused_moe/interface.py`:
- Around line 449-458: The current branch builds initial_global_assignments from
moe_load_balancer_config.num_local_slots even when the real load balancer is
missing, causing mismatched initial_local_expert_ids; change the logic in the
constructor/code that sets initial_global_assignments (where
moe_load_balancer_config, init_expert_size_per_partition and
initial_global_assignments are computed) to detect the actual availability of
the load balancer (i.e., only use moe_load_balancer_config.num_local_slots when
get_moe_load_balancer() returned a valid balancer object); if the balancer is
absent, set initial_global_assignments to the safe sequential mapping
list(range(self.num_experts)) instead; apply the same defensive check and
fallback in the other identical block (the second occurrence around
initial_local_expert_ids).
---
Nitpick comments:
In `@tests/unittest/_torch/modules/moe/test_moe_comm.py`:
- Around line 450-456: The feasibility gate incorrectly only permits
non-divisible EP for COMM_NVLINK_ONE_SIDED while CommunicationFactory now also
allows COMM_ALLGATHER_REDUCESCATTER; update the conditional in the helper (the
block that checks config.num_experts % config.ep_size != 0 and comm_type !=
COMM_NVLINK_ONE_SIDED) to also accept COMM_ALLGATHER_REDUCESCATTER (e.g., change
the single-value exclusion to a membership test or OR-check), and adjust the
returned error message text to reflect which comm types require num_experts
divisible by ep_size so tests can exercise the fallback path; locate references
to COMM_NVLINK_ONE_SIDED and COMM_ALLGATHER_REDUCESCATTER in this file to apply
the change.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 778f0924-4f42-4cab-8f10-6e20b58a081f
📒 Files selected for processing (11)
cpp/tensorrt_llm/kernels/communicationKernels/moeAlltoAllKernels.cucpp/tensorrt_llm/thop/moeAlltoAllOp.cppexamples/disaggregated/slurm/benchmark/README.mdexamples/disaggregated/slurm/benchmark/config.yamlexamples/disaggregated/slurm/benchmark/disaggr_torch.slurmexamples/disaggregated/slurm/benchmark/run_benchmark.shexamples/disaggregated/slurm/benchmark/start_worker.shexamples/disaggregated/slurm/benchmark/submit.pytensorrt_llm/_torch/modules/fused_moe/communication/communication_factory.pytensorrt_llm/_torch/modules/fused_moe/interface.pytests/unittest/_torch/modules/moe/test_moe_comm.py
| f"srun -l --container-name={container_name}", | ||
| f"--container-mounts={container_mount_str}", | ||
| f"--mpi=pmix --overlap -N 1 -n 1", | ||
| f"--no-container-mount-home --mpi=pmix --overlap -N 1 -n 1", |
There was a problem hiding this comment.
Remove the empty f-string prefix.
This triggers F541 (f-string without placeholders) and will fail lint gates.
🔧 Proposed fix
- f"--no-container-mount-home --mpi=pmix --overlap -N 1 -n 1",
+ "--no-container-mount-home --mpi=pmix --overlap -N 1 -n 1",📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| f"--no-container-mount-home --mpi=pmix --overlap -N 1 -n 1", | |
| "--no-container-mount-home --mpi=pmix --overlap -N 1 -n 1", |
🧰 Tools
🪛 Flake8 (7.3.0)
[error] 713-713: f-string is missing placeholders
(F541)
🪛 Ruff (0.15.13)
[error] 713-713: f-string without any placeholders
Remove extraneous f prefix
(F541)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@examples/disaggregated/slurm/benchmark/submit.py` at line 713, Remove the
erroneous f-string prefix on the literal f"--no-container-mount-home --mpi=pmix
--overlap -N 1 -n 1" in submit.py (this is the list/argument element used when
building the sbatch/submit command); replace it with a plain string
"--no-container-mount-home --mpi=pmix --overlap -N 1 -n 1" so the linter F541 is
not triggered.
|
PR_Github #49391 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
PR_Github #49538 [ run ] triggered by Bot. Commit: |
|
PR_Github #49538 [ run ] completed with state
|
- submit.py: fix calculate_nodes to match assign_server_round_robin's per-worker whole-node ownership semantic. Pooled formula ceil(world_size * num_servers / gpus_per_node) under-counts nodes when world_size is not a multiple of gpus_per_node (qiaoxj07 review, e.g. 2 ctx workers x ws=6 + 1 gen worker x ws=4 with gpus_per_node=4 -> old formula gave 4 but allocator needed 5 -> IndexError). - submit.py: drop erroneous f-string prefix on line introduced by e2157c3 (ruff F541, CodeRabbit review). - test_moe_comm.py: relax check_feasibility to also allow AllGatherReduceScatter under non-divisible EP, matching the production fallback in CommunicationFactory. Signed-off-by: JacobHu-NV <266902545+JacobHu-NV@users.noreply.github.com>
Match the formatting the project's yapf hook expects (one element per line + dangling comma + closing paren on its own line), as enforced by the PR check pre-commit on 221c9d6. Signed-off-by: JacobHu-NV <266902545+JacobHu-NV@users.noreply.github.com>
|
/bot run --disable-fail-fast |
|
PR_Github #49813 [ run ] triggered by Bot. Commit: |
Port the per-rank hostfile + gpu_map launcher mechanism from PR NVIDIA#13888 (Jianbo Hu, "support non-divisible EP in MoE alltoall and slurm benchmark") into the in-tree DWDP examples launcher trio (submit_dwdp.py, disaggr_torch_dwdp.slurm, start_worker_dwdp.sh). Motivation: SLURM block distribution requires (n_ctx + n_gen * gen_tp) to be a multiple of gpus_per_node, otherwise GEN tensor-parallel ranks get split across nodes/trays and incur a large allreduce penalty. Previously worked around by adding empty CTX slots (over-provisioning) or picking dwdp_group multiples that happen to divide cleanly — both fragile and incompatible with Mode B non-uniform expert ranges where num_ctx is e.g. 5 or 6. Dual-path design: * Divisible case (num_ctx_gpus % gpus_per_node == 0): Use the legacy --nodelist + -N + --ntasks-per-node srun command. Block distribution gives the natural rank-to-node mapping; trtllm-serve picks its CUDA device from SLURM_LOCALID. No hostfile / gpu_map files are emitted. Perf-optimal path. * Non-divisible case (Mode B dwdp3 dg=2, dwdp5 dg=1, etc.): Emit hostfile_mpi_worker_base.txt and gpu_map_mpi_worker_base.txt under log_dir, iterating CTX servers then GEN servers so global rank ordering matches split_world_comm's ctx_cfgs + gen_cfgs. disaggr_torch_dwdp.slurm rewrites <nodeN_placeholder> tokens to real hostnames at runtime. srun uses SLURM_HOSTFILE + --distribution=arbitrary to pin every rank's host placement. DWDP-specific deviation from PR NVIDIA#13888: start_worker_dwdp.sh does NOT export CUDA_VISIBLE_DEVICES. DWDP relies on intra-node peer GPU discovery (VA composite cuMemMap of peer MNNVL fabric handles, UCX cuda_ipc / cuda_copy intra-node KV transports, PyTorch peer device enumeration); restricting CUDA visibility to a single GPU breaks these paths and causes a 15% per-CTX-GPU regression (TPOT unchanged, TTFT std blows up 3x). With our allocate_gpus's sequential cursor, gpu_id == SLURM_LOCALID for every rank, so trtllm-serve's internal LOCALID-based device selection already lands each rank on the correct GPU. The gpu_map file is kept for diagnostics and audit logging only. Empirical per-CTX-GPU req/s (DSv3-FP4-V2.1, ISL=8192/OSL=1, conc=256): dwdp4 dg=1 (8 GPU divisible) 3.46 dwdp3 dg=4 (16 GPU divisible) 3.41 dwdp5 dg=4 (24 GPU divisible) ~3.3 (extrapolated from R1b) dwdp3 dg=2 (10 GPU non-divisible) 3.51 dwdp5 dg=1 (9 GPU non-divisible) 3.42 TPOT 14-18 ms across all configs (kernel-neutral). Enables the non-uniform / Mode B perf configurations used in the follow-up dwdp_size=3/5 experiments in this refactor. Co-authored-by: Jianbo Hu <jacohu@nvidia.com> Signed-off-by: tianyuz-nv <tianyuz@nvidia.com>
|
PR_Github #49813 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
PR_Github #49907 [ run ] triggered by Bot. Commit: |
|
PR_Github #49907 [ run ] completed with state
|
This pull request enhances support for non-divisible expert parallelism (EP) in Mixture-of-Experts (MoE) communication by implementing ceil/floor partitioning for expert assignment across ranks. The changes ensure that MoE communication and computation remain correct and efficient when the number of experts is not evenly divisible by the number of ranks. This includes updates to CUDA kernels, Python interfaces, strategy selection logic, and comprehensive unit tests.
Key changes include:
Core algorithm and kernel updates:
moeAlltoAllKernels.cu) to use ceil/floor partitioning incompute_target_rank_id, allowing non-divisible expert-to-rank mapping and ensuring all experts are correctly assigned even whennum_experts % ep_size != 0. [1] [2]num_expertsto be divisible byep_sizein the C++ operator, reflecting the new partitioning logic.Python interface and strategy selection:
_compute_ep_partitionin the Python interface to mirror the kernel's partitioning logic, ensuring consistent expert slot assignment and correct initialization in both standard and load-balanced MoE scenarios. [1] [2] [3] [4]AllGatherReduceScatterwhen non-divisible EP is detected, and enforced that only whitelisted methods (NVLinkOneSided, AllGather) are allowed for non-divisible configurations. [1] [2]Testing and validation:
These changes collectively make the MoE implementation robust to non-uniform expert distributions, improving flexibility and correctness in distributed training scenarios.
Summary by CodeRabbit
New Features
compact_packingoption for GPU worker placement in distributed SLURM deploymentsImprovements
Documentation
compact_packingusage guidance