[None][feat] Refactor DWDP from CUDA IPC to CUDA VMM + MNNVL composite VA#14453
Draft
tianyuz-nv wants to merge 11 commits into
Commit 1 of the DWDP IPC->VA refactor. Purely additive: no existing
code paths reference the new package yet (wired up in Commit 2).
Package tensorrt_llm/_torch/modules/dwdp/ (migrated from tekit):
* vmm.py CUDA VMM wrappers (RAII handle + VA region, fabric
handle export/import, granularity via lru_cache)
* specs.py WeightSpec / MnnvlHandleSet / PageAlignedLayout
* page_pool.py Double-buffer pool of fabric page handles
* weight_buffer.py Composite VA layout per (layer, weight)
* weight_manager.py Runtime prefetch / wait_and_bind / events
* transport.py MNNVL alloc + MPI-based peer handle exchange
* setup.py Orchestration: Transport -> WeightBuffer ->
WeightManager -> fixup_moe_backends
* __init__.py Package exports
MPI replaces tekit's TCPDWDPStore:
* Transport uses per-pair comm.allgather for handle bytes; the
allgather is itself a sync point (no explicit barrier required
between Phase 1 iterations).
* fixup_moe_backends uses comm.allgather for small scale params
(bias, fc31_alpha, fc2_alpha), replacing tekit's per-sender
serialized put/get/barrier pattern.
* tekit's store.py is intentionally NOT migrated.
collect_moe_params takes layer_indices as an input parameter (SSOT
from DwdpManager._registered_layers; wired up in Commit 2). This
replaces tekit's model-tree walking and makes the MoE layer set a
single source of truth.
Mapping (tensorrt_llm/mapping.py):
* Add dwdp_size / dwdp_rank kwargs (at tail, strictly additive)
* Validate range and store on _dwdp_size / _dwdp_rank
* Override moe_tp_size=1 / moe_ep_size=dwdp_size / cluster=1 when
DWDP enabled
* moe_ep_rank returns dwdp_rank when DWDP enabled
* Expose dwdp_size / dwdp_rank / dwdp_enabled properties
* Include in __eq__ / __hash__ / to_dict
Quality fixes applied during migration:
* vmm.py: logger.debug on cuda-bindings import fallback; thread-safe
granularity cache via functools.lru_cache; logger.info / warning
on tensor_from_ptr fallback paths
* page_pool.py: __del__ uses logger.debug instead of silent pass
* weight_manager.py: drop hardcoded "DeepSeek R1" in docstring
* transport.py: logger.error on create() exception (preserves root
cause before cleanup)
Tests (registered in tests/integration/test_lists/test-db/l0_a10.yml):
* tests/unittest/others/test_mapping_dwdp.py — construction
validation, MoE override, moe_ep_rank branch, to_dict roundtrip,
__eq__/__hash__
* tests/unittest/_torch/modules/test_dwdp_fixup_moe_backends.py —
mock MPI comm; verify allgather semantic equivalence to tekit
TCPDWDPStore version for _allgather_e_score_correction_bias and
_allgather_expert_scales
VA infrastructure originally authored by @dongxuy04.
Co-Authored-By: dongxuy04 <dongxuy@nvidia.com>
Signed-off-by: tianyuz-nv <tianyuz@nvidia.com>
Commit 2 of the DWDP IPC->VA refactor. This rewires the integration
layer to use the VA infrastructure added in the previous commit. Net
-487 lines across the modified files: 204 insertions, 691 deletions.
pyexecutor/dwdp.py (585 -> 208 lines):
* DwdpManager becomes a pure facade:
- __init__(config, dist, mapping) stores mapping, creates DWDP MPI
sub-communicator (COMM_WORLD.Create_group on rank // dwdp_size).
- __enter__/__exit__ manage the global singleton; duplicate __enter__
now raises instead of silently replacing.
- add_layer(layer_idx) appends to self._registered_layers, which is
the Single Source of Truth for MoE layer indices (setup_dwdp takes
them as input — no more model-tree walking).
- setup(model) delegates to setup_dwdp(model, mapping, device_id,
comm, layer_indices=sorted(_registered_layers)) and caches the
returned DWDPWeightManager.
- prefetch_first_layers / wait_and_bind / record_compute_and_prefetch_next
forward to the DWDPWeightManager.
* Removed: DwdpLayerHandleCollector (~90 lines), DwdpPrefetchBuffer
(~90 lines), all IPC handle plumbing (exchange_all_handles,
initialize_prefetch_buffer, build_weight_view, peer_expert_ranges,
prefetch_layer internals).
pyexecutor/py_executor_creator.py:
* DwdpConfig -> Mapping bridge: when dwdp_config.dwdp_size > 1, rebuild
the Mapping with dwdp_size/dwdp_rank injected (ParallelConfig.to_mapping
doesn't know about DWDP).
* DwdpManager ctor now gets mapping=mapping.
* Replace exchange_all_handles + initialize_prefetch_buffer with
dwdp_manager.setup(model_engine.model) — a single orchestration call.
modules/fused_moe/configurable_moe.py:
* __init__ order fix (F4): DWDP init block now runs BEFORE
_create_comm_strategy_auto so the factory can see self.enable_dwdp.
* _create_comm_strategy_auto returns None when DWDP enabled (fixup
moe backends ep_size=1, no alltoall needed).
* DWDP init simplified: only add_layer; removed dwdp_handle_collector,
dwdp_rank, backend.dwdp_handle_collector.
* wait_and_bind added to forward_impl at Step 3 entry (per-layer,
not per-chunk — correctness fix for multi-chunk forward).
* Removed dwdp_weight_view kwargs injection in _prepare_backend_kwargs.
modules/fused_moe/fused_moe_cute_dsl.py:
* Deleted run_moe_nvfp4_impl_dwdp (~100 lines, multi-B kernel path).
* run_moe_nvfp4 no longer branches on is_dwdp; single-tensor path
handles all cases (VA swaps param.data so kernel sees full weights).
* Removed dwdp_weight_view read in forward hook and
dwdp_handle_collector.register_weights call in load_weights.
modules/fused_moe/interface.py:
* Deleted _init_dwdp_expert_layout and its __init__ call; dropped
get_global_dwdp_manager import. fixup_moe_backends (run from
setup_dwdp) is now the single source of truth for DWDP expert
layout, eliminating the transient EP-sharded state window between
__init__ and setup().
llmapi/llm_args.py:
* DwdpConfig docstring: CUDA IPC -> CUDA VMM + MNNVL fabric handles.
* Fields and status (prototype) unchanged — api_stability passes.
Tests:
* tests/unittest/_torch/executor/test_dwdp_manager.py (new, ~280 lines):
mocks COMM_WORLD / global_mpi_rank / setup_dwdp; verifies
construction validation, global singleton, duplicate __enter__
rejection, add_layer SSOT, setup() forwards sorted layer_indices,
prefetch_first_layers double-depth warmup, record_compute_and_prefetch_next
last-layer no-op, wait_and_bind delegation.
* Registered in tests/integration/test_lists/test-db/l0_a10.yml.
Smoke tested on GB200: 78 pytest passes in 1.02s (Commit 1 + 2 unit
tests + full api_stability suite).
Co-Authored-By: dongxuy04 <dongxuy@nvidia.com>
Signed-off-by: tianyuz-nv <tianyuz@nvidia.com>
… unused by VA path)
Commit 3 of the DWDP IPC->VA refactor. Three files (custom op wrapper +
two blackwell kernels) are restored verbatim to their pre-PR-NVIDIA#12136
state.
These multi-B paths were introduced purely to support DWDP's IPC scheme,
which passed N peer expert shards as separate B tensors into each kernel
call. The VA pipeline swaps param.data to a single [num_experts, ...]
tensor via cuMemMap, so the standard single-B kernel path handles every
case; the multi-B parameters, MAX_B_TENSORS branches, and tuple-ified
b/sfb/alpha signatures become dead code.
Files reverted to the commit before e92ee4f (PR NVIDIA#12136):
* _torch/custom_ops/cute_dsl_custom_ops.py
- Removes *_multi_b custom op registrations
- Restores GatherGroupedGemmInputsHelper to single-tensor layout
* _torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_gather_grouped_gemm_swiglu_fusion.py
- Removes b_tensor_l_sizes param, MAX_B_TENSORS, num_b_tensors
const_expr branches for 1/2/3/4 B tensors, _make_tma_b helper,
kernel-side gB_nkl_0..3 / gSFB_nkl_0..3 expansions, tuple'd signatures
* _torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_grouped_gemm_finalize_fusion.py
- Same pattern
Verified no commits touched these files between PR NVIDIA#12136 and HEAD,
so the revert is surgical and does not risk clobbering unrelated work:
$ git log --oneline e92ee4f..HEAD -- <file> # empty for all 3
Net change: +371 / -1519 = -1148 lines in the three files.
Smoke tested: 78 pytest passes (DWDP units + full api_stability), plus
runtime import checks confirm both kernel modules and the custom ops
module load cleanly after the revert.
Co-Authored-By: dongxuy04 <dongxuy@nvidia.com>
Signed-off-by: tianyuz-nv <tianyuz@nvidia.com>
The VA refactor (a4efa38) inadvertently dropped the IPC version's
config-driven expert range computation, replacing it with a hard-coded
`num_experts // dwdp_size` in setup.py and transport.py. As a result,
DwdpConfig.num_experts_per_worker and num_prefetch_experts became dead
fields: read into DwdpManager but never consumed by the runtime. The
schema suggested the contract supported by IPC (size = range length,
stride = rank-to-rank offset) while the runtime silently assumed uniform
integer-division partitioning.
Restore the IPC contract by passing both fields through
DwdpManager.setup -> setup_dwdp -> DWDPTransport.create. For uniform
configs (dwdp=4: 64/64, dwdp=8: 32/32) the formula is mathematically
identical to the old integer division; behavior on tested workloads is
unchanged.
Add `_validate_partition_config` helper enforcing four invariants before
the range is computed:
1. positive `num_experts_per_worker` and `num_prefetch_experts`
2. stride <= size (no gaps between consecutive ranks)
3. coverage `(dwdp_size - 1) * stride + size >= num_experts_total`
(last rank reaches the final expert)
4. chunk shape parity with `num_experts_per_worker` (DwdpConfig and the
fused MoE loader agree on local expert count)
These catch the most common schema/runtime mismatches early with
specific error messages, and provide the plumbing needed to support
non-uniform / redundant partitioning in the future. Redundancy itself is
still blocked by the fused MoE backend's `num_experts % ep_size == 0`
assertion (interface.py:391); that is a separate, larger change.
Verification:
- Unit tests: 57 passed + 4 subtests, including 14 new validation cases
in tests/unittest/_torch/modules/test_dwdp_setup_validation.py
- Accuracy (DSv3-Lite + dwdp=2 + GSM8K): 63.95% (above 61.537 threshold;
~0.7pp above the 63.23% pre-fix data point, within sampling noise)
- Perf DEP baseline: 12.88 req/s (vs historical 12.92, -0.3%)
- Perf DWDP=4: 14.11 req/s (matches the historical 14.11 to two decimal
places, +9.6% over DEP baseline)
- Perf DWDP=8: 27.53 req/s on 8 GPUs across 2 trays (1.95x DWDP=4
scaling)
Signed-off-by: tianyuz-nv <tianyuz@nvidia.com>
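The four invariants listed in this commit message can be sketched roughly as follows. This is only an illustration under stated assumptions, not the helper in setup.py: the function name `validate_partition_config`, the parameter `local_expert_count` (standing in for the chunk shape reported by the fused MoE loader), and the error messages are all hypothetical. Only the checks themselves come from the commit message.

```python
def validate_partition_config(dwdp_size, num_experts_per_worker,
                              num_prefetch_experts, num_experts_total,
                              local_expert_count):
    """Hypothetical stand-in for the four checks; not the setup.py code."""
    size = num_experts_per_worker    # range length stored on each rank
    stride = num_prefetch_experts    # rank-to-rank offset

    # 1. Both fields must be positive.
    if size <= 0 or stride <= 0:
        raise ValueError(f"size={size} and stride={stride} must be positive")

    # 2. No gaps between consecutive ranks.
    if stride > size:
        raise ValueError(f"stride={stride} exceeds size={size}")

    # 3. The last rank must reach the final expert.
    coverage = (dwdp_size - 1) * stride + size
    if coverage < num_experts_total:
        raise ValueError(
            f"coverage {coverage} < num_experts_total {num_experts_total}")

    # 4. DwdpConfig and the loader must agree on the local expert count.
    if local_expert_count != size:
        raise ValueError(
            f"loader holds {local_expert_count} experts, config says {size}")

    return coverage


# Uniform dwdp=4 on 256 experts: 3 * 64 + 64 covers the whole table.
print(validate_partition_config(4, 64, 64, 256, 64))
```

Returning the computed coverage is a convenience of this sketch so the caller can log it; the commit message does not say what the real helper returns.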
`num_groups` has been a dead schema field in DwdpConfig from PR
NVIDIA#12136 (the original IPC implementation) through commit 1fbc0d4:
read into DwdpManager but never consumed by the runtime. Mis-configured
YAMLs where the user's declared topology disagrees with the launch were
left to fail mysteriously deeper in MPI sub-communicator creation, or to
silently accept a launch that didn't match the schema.
Convert the field into runtime validation in DwdpManager.__init__:
1. num_groups must be positive.
2. This rank's computed group_id (`rank // dwdp_size`) must be less than
num_groups, so the launch hasn't started more CTX workers than the
declared topology can hold.
3. `num_groups * dwdp_size <= MPI world size`, so the world is large
enough to fit all declared groups.
The three checks together catch over-subscription, world-size
under-allocation, and obviously bogus values, while remaining local (no
extra inter-rank communication required since DwdpConfig is identical on
every rank).
Verification:
- Unit tests: 61 passed + 4 subtests (4 new num_groups cases in
tests/unittest/_torch/executor/test_dwdp_manager.py)
- Accuracy (DSv3-Lite + dwdp=2 + GSM8K): PASS, above the 61.5% threshold
(3min 52s)
- Perf DEP baseline: 12.95 req/s
- Perf DWDP=4: 14.26 req/s (+10.1% over DEP)
- Perf DWDP=8 cross-tray: 27.21 req/s (1.91x DWDP=4 scaling)
Signed-off-by: tianyuz-nv <tianyuz@nvidia.com>
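A minimal sketch of the three `num_groups` checks, assuming they are evaluated from values every rank already has locally. The function name `validate_num_groups`, its signature, and the messages are hypothetical; the real checks live in DwdpManager.__init__ and may be structured differently.

```python
def validate_num_groups(num_groups, dwdp_size, rank, world_size):
    """Hypothetical local check; needs no inter-rank communication."""
    # 1. Obviously bogus values.
    if num_groups <= 0:
        raise ValueError(f"num_groups={num_groups} must be positive")

    # 2. Over-subscription: this rank falls outside the declared groups.
    group_id = rank // dwdp_size
    if group_id >= num_groups:
        raise ValueError(
            f"rank {rank} maps to group {group_id}, "
            f"but only {num_groups} groups are declared")

    # 3. Under-allocation: the world cannot hold all declared groups.
    if num_groups * dwdp_size > world_size:
        raise ValueError(
            f"{num_groups} groups x dwdp_size {dwdp_size} "
            f"exceeds world size {world_size}")

    return group_id


# Rank 5 of an 8-rank world with dwdp_size=4 lands in group 1 of 2.
print(validate_num_groups(num_groups=2, dwdp_size=4, rank=5, world_size=8))
```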
This is the first of two follow-up commits restoring IPC-era support for
non-uniform partitioning (dwdp_size that does not divide num_experts)
and redundancy storage on the VA-based DWDP path.
Pre-Phase-1 the VA refactor coupled DWDP to fused MoE EP by setting
mapping.moe_ep_size = dwdp_size. That triggered the
``num_experts % ep_size == 0`` assert deep in fused_moe/interface.py and
locked DWDP to integer-divisible values (2/4/8 only on DSv3's 256
experts). Phase 1 decouples the two:
* mapping.py: when dwdp_size > 1, set moe_ep_size = 1 (and
moe_ep_rank = 0). The fused MoE backend now sees a single full-table
partition; DWDP installs the per-rank layout itself.
* pyexecutor/dwdp.py: DwdpManager exposes start_expert_id /
end_expert_id derived from the rank-to-rank stride
(num_prefetch_experts) and storage size (num_experts_per_worker) in
DwdpConfig. These are read by the override hook below.
* fused_moe/interface.py: restore the IPC-era _init_dwdp_expert_layout
method, called at the end of MoE.__init__ after _init_load_balancer.
When DWDP is enabled (get_global_dwdp_manager() is not None) it
overrides expert_size_per_partition / slot_start / slot_end /
initial_local_expert_ids before create_weights() runs, so the backend
allocates num_experts_per_worker slots per rank.
For uniform integer-divisible configs (dwdp=4: 64/64, dwdp=8: 32/32)
this is mathematically equivalent to the legacy ep_size=dwdp_size
layout; this is verified by the existing 96 unit tests staying green.
Phase 2 (next commit) wires the runtime side so non-uniform / redundant
configs actually work end-to-end.
Tests:
* test_mapping_dwdp.py: assert moe_ep_size=1 / moe_ep_rank=0 under DWDP
and verify Mapping construction accepts dwdp_size=3/5 unconditionally
(Phase 1 contract: rejection moves to DwdpManager / setup_dwdp).
* test_dwdp_manager.py: new tests for start_expert_id / end_expert_id
under uniform and redundancy configs.
Signed-off-by: tianyuz-nv <tianyuz@nvidia.com>
…omposite VA
Second of two follow-up commits: Phase 2 + 3 of the restoration plan in
the prior commit. Phase 1 made model-load flexible (per-rank storage =
num_experts_per_worker regardless of ep_size); this commit makes the
runtime side handle non-uniform / overlapping peer ranges end-to-end.
Core abstraction (specs.py):
* compute_peer_ranges(dwdp_size, size, stride, num_experts) ->
PeerRanges: deterministic, no allgather. Returns
``[(start, end_capped)]`` per rank where
``end_capped = min(start + size, num_experts)``.
* lookup_owner(expert_id, peer_ranges) -> int: first-match owner lookup,
picking the lowest-rank owner under redundancy overlap. Equivalent to
``expert_id // (num_experts // dwdp_size)`` when the partition is
uniform.
Wire-through (transport.py / weight_manager.py / setup.py):
* Transport publishes peer_ranges via ``get_peer_ranges()``; setup_dwdp
threads them to fill_edge_bytes (replacing
``prev_expert // experts_per_rank``), DWDPWeightManager.prefetch_layer
(replacing ``cursor // _experts_per_rank``), and fixup_moe_backends.
The legacy ``num_experts % dwdp_size == 0`` assert in
``DWDPTransport.create()`` is removed.
* setup.py:_scatter_shards_to_full is the new common reconstruction
helper for fixup_moe_backends's MPI-allgathered scale params and the
gate's e_score_correction_bias. Each peer's shard is sized
num_experts_per_worker (uniform); ``shard[:end - start]`` is placed at
``full[start:end]``. For redundancy, overlapping writes hit the same
expert multiple times but values agree (every rank that owns expert
``e`` loaded ``e`` from the same checkpoint).
Validation (Phase 3, setup.py:_validate_partition_config):
* New check rejecting tail-padding configurations:
``(dwdp_size-1)*stride + size <= num_experts``. Combined with the
existing coverage check (``>=``) this enforces strict equality. Tail
padding is rejected because GB200 fabric handles cannot be partially
mapped into a num_experts-sized composite VA (cuMemMap requires
``size == handle_phys_size``; partial mapping returns
CUDA_ERROR_NOT_SUPPORTED).
* Error message lists Mode B recipes for common configs:
``dwdp=3 + 256 -> size=86 stride=85``; ``dwdp=5 + 256 -> 52 / 51``;
``dwdp=7 + 256 -> 40 / 36``.
Tests:
* tests/unittest/_torch/modules/test_dwdp_peer_ranges.py (new):
compute_peer_ranges + lookup_owner under uniform (dwdp=4/8), Mode B
overlap (dwdp=3/7), and out-of-range / negative inputs; defensive cap
tests for inputs that fail validation upstream.
* test_dwdp_setup_validation.py: Mode B overlap pass cases (dwdp=3/5/7)
+ tail-padding rejection cases (dwdp=3/5).
* test_dwdp_fixup_moe_backends.py: ``_scatter_shards_to_full`` direct
tests (uniform, Mode B overlap, higher-dim trailing shape) + an
end-to-end ``_allgather_expert_scales`` test for Mode B dwdp=3
reconstructing fc31_alpha to a full 256-element vector.
* l0_a10.yml: registers the new test file for pre-merge CI.
For uniform configs (dwdp=4 / dwdp=8), behaviour is unchanged:
``lookup_owner`` is mathematically equivalent to
``expert_id // (num_experts // dwdp_size)`` and ``compute_peer_ranges``
produces the same start/end tuples the formula would.
Signed-off-by: tianyuz-nv <tianyuz@nvidia.com>
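The peer-range abstraction described in this commit can be sketched in plain Python. This is an illustration only: the signatures follow the commit message, but the bodies are assumptions, `start = rank * stride` is inferred from the "rank-to-rank offset" wording, and Python lists stand in for the tensors that the real `_scatter_shards_to_full` operates on.

```python
def compute_peer_ranges(dwdp_size, size, stride, num_experts):
    """Per-rank (start, end_capped) tuples; deterministic, no allgather."""
    ranges = []
    for rank in range(dwdp_size):
        start = rank * stride                  # assumed rank-to-rank offset
        end = min(start + size, num_experts)   # cap at the expert table end
        ranges.append((start, end))
    return ranges


def lookup_owner(expert_id, peer_ranges):
    """First-match lookup: the lowest rank whose range holds expert_id."""
    for rank, (start, end) in enumerate(peer_ranges):
        if start <= expert_id < end:
            return rank
    raise ValueError(f"expert {expert_id} is not owned by any rank")


def scatter_shards_to_full(shards, peer_ranges, num_experts):
    """Place shard[:end - start] at full[start:end] for every peer."""
    full = [None] * num_experts
    for shard, (start, end) in zip(shards, peer_ranges):
        # Overlapping experts are written more than once; the values
        # agree because every owner loaded them from the same checkpoint.
        full[start:end] = shard[:end - start]
    return full


# Mode B recipe from the error message: dwdp=3 on 256 experts.
ranges = compute_peer_ranges(dwdp_size=3, size=86, stride=85, num_experts=256)
print(ranges)                    # [(0, 86), (85, 171), (170, 256)]
print(lookup_owner(85, ranges))  # 0: expert 85 is shared, lowest rank wins
```

With the uniform dwdp=4, 64/64 configuration, the same `lookup_owner` returns `expert_id // 64` for every expert, which is the equivalence the commit message states.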
The Phase 2 commit (``63908b7c``) propagated peer_ranges built from
``compute_peer_ranges(num_experts_total=spec.num_experts, ...)``, where
``spec.num_experts`` was derived in ``build_weight_specs`` as
``experts_per_rank * dwdp_size``. Under uniform divisible configurations
(dwdp=2 + 36/36 → 72, dwdp=4 + 64/64 → 256) this happened to equal the
model's gate-side ``num_experts`` and validation passed. Under Mode B
redundancy (e.g. dwdp=3 + size=26/stride=23 on DSv3-Lite's 72 experts),
``experts_per_rank * dwdp_size = 78 ≠ 72``, so the strict-equality
validation falsely reported coverage 72 < storage 78 and rejected the
configuration.
The fix queries the actual gate-side expert count from the first MoE
layer's ``experts_module.num_experts`` via the new
``_get_model_num_experts(model, layer_indices)`` helper, and threads it
through ``build_weight_specs`` as an explicit ``num_experts_total``
parameter. Downstream, this becomes ``spec.num_experts`` and is used
consistently by:
* ``_validate_partition_config``: strict equality is now checked against
the real expert count.
* ``compute_peer_ranges``: peer ranges cap at the real expert count.
* ``PageAlignedLayout`` / ``WeightBuffer._compute_remote_slices``:
composite VA's full tensor shape is ``(num_experts, ...)`` and remote
slices end at ``num_experts``.
For uniform divisible cases the new path is mathematically equivalent to
the prior code (since storage extent == gate's num_experts when
``size == stride == num_experts // dwdp_size``). Verified by re-running
the dwdp=4 accuracy regression (passes with GSM8K = 64.519) and by the
Mode B unit tests in ``test_dwdp_peer_ranges.py`` /
``test_dwdp_setup_validation.py``, which already use gate-side
``num_experts_total`` values directly.
Signed-off-by: tianyuz-nv <tianyuz@nvidia.com>
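The mismatch this commit fixes is plain arithmetic, reproduced below for the three configurations named in the message. The helper names `storage_extent` and `coverage` are hypothetical and exist only for this sketch.

```python
def storage_extent(size, dwdp_size):
    """The old derivation of spec.num_experts: experts_per_rank * dwdp_size."""
    return size * dwdp_size


def coverage(size, stride, dwdp_size):
    """Last expert index (exclusive) reached by the highest rank."""
    return (dwdp_size - 1) * stride + size


# (label, dwdp_size, size, stride, gate-side num_experts)
configs = [
    ("uniform dwdp=2", 2, 36, 36, 72),
    ("uniform dwdp=4", 4, 64, 64, 256),
    ("Mode B dwdp=3", 3, 26, 23, 72),
]
for label, dwdp_size, size, stride, gate in configs:
    old = storage_extent(size, dwdp_size)
    cov = coverage(size, stride, dwdp_size)
    # Strict equality must be checked against the gate-side count, not `old`.
    print(f"{label}: old={old} coverage={cov} gate={gate} ok={cov == gate}")
```

Only the Mode B row shows `old` (78) diverging from the gate-side count (72), while its coverage still equals 72. That is why validating against the derived storage extent rejected a valid configuration.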
Adds ``test_dwdp_accuracy_mode_b_overlap`` to TestDwdpDeepSeekV3Lite,
covering the Phase 2 redundancy (overlap) path end-to-end on a real
DSv3-Lite + GSM8K workload.
Config: ``dwdp_size=2``, ``num_experts_per_worker=40``,
``num_prefetch_experts=32``. The strict-equality check evaluates to
``1*32 + 40 == 72`` (== model's gate-side num_experts), with 8-expert
overlap between rank 0's ``[0, 40)`` and rank 1's ``[32, 72)``. This
exercises:
* ``_init_dwdp_expert_layout`` overriding ``expert_size_per_partition``
to 40 (not the uniform ``72 // 2 == 36``).
* ``lookup_owner`` first-match policy in ``fill_edge_bytes`` /
``weight_manager.prefetch_layer`` for the 8 shared experts.
* ``_scatter_shards_to_full`` writing each overlap expert from two
peers; values must agree because every rank that owns expert ``e``
loaded ``e`` from the same checkpoint.
Resource footprint matches the existing ``test_dwdp_accuracy`` (2 CTX
TP=1 + 1 GEN TP=2 = 4 GPUs on GB200). ``dwdp_size=2`` is kept
intentionally: a ``dwdp_size=3`` variant triggers an unrelated UCX bus
error in ``libtensorrt_llm_ucx_wrapper.so``'s
``WorkerProgressThread::progressUntilSync`` when 3 CTX workers
simultaneously exchange KV cache. The bug is in the KV transceiver layer
(not DWDP); the existing 2-CTX baseline runs cleanly and exercises the
overlap path equally well.
Verified: PASSED, GSM8K = 63.268 (threshold 61.537, reference 64.740);
runtime 234s.
Signed-off-by: tianyuz-nv <tianyuz@nvidia.com>
Port the per-rank hostfile + gpu_map launcher mechanism from PR
NVIDIA#13888 (Jianbo Hu, "support non-divisible EP in MoE alltoall and
slurm benchmark") into the in-tree DWDP examples launcher trio
(submit_dwdp.py, disaggr_torch_dwdp.slurm, start_worker_dwdp.sh).
Motivation: SLURM block distribution requires
(n_ctx + n_gen * gen_tp) to be a multiple of gpus_per_node, otherwise
GEN tensor-parallel ranks get split across nodes/trays and incur a large
allreduce penalty. Previously worked around by adding empty CTX slots
(over-provisioning) or picking dwdp_group multiples that happen to
divide cleanly. Both are fragile and incompatible with Mode B
non-uniform expert ranges where num_ctx is e.g. 5 or 6.
Dual-path design:
* Divisible case (num_ctx_gpus % gpus_per_node == 0): Use the legacy
--nodelist + -N + --ntasks-per-node srun command. Block distribution
gives the natural rank-to-node mapping; trtllm-serve picks its CUDA
device from SLURM_LOCALID. No hostfile / gpu_map files are emitted.
Perf-optimal path.
* Non-divisible case (Mode B dwdp3 dg=2, dwdp5 dg=1, etc.): Emit
hostfile_mpi_worker_base.txt and gpu_map_mpi_worker_base.txt under
log_dir, iterating CTX servers then GEN servers so global rank
ordering matches split_world_comm's ctx_cfgs + gen_cfgs.
disaggr_torch_dwdp.slurm rewrites <nodeN_placeholder> tokens to real
hostnames at runtime. srun uses SLURM_HOSTFILE +
--distribution=arbitrary to pin every rank's host placement.
DWDP-specific deviation from PR NVIDIA#13888: start_worker_dwdp.sh does
NOT export CUDA_VISIBLE_DEVICES. DWDP relies on intra-node peer GPU
discovery (VA composite cuMemMap of peer MNNVL fabric handles, UCX
cuda_ipc / cuda_copy intra-node KV transports, PyTorch peer device
enumeration); restricting CUDA visibility to a single GPU breaks these
paths and causes a 15% per-CTX-GPU regression (TPOT unchanged, TTFT std
blows up 3x). With our allocate_gpus's sequential cursor,
gpu_id == SLURM_LOCALID for every rank, so trtllm-serve's internal
LOCALID-based device selection already lands each rank on the correct
GPU. The gpu_map file is kept for diagnostics and audit logging only.
Empirical per-CTX-GPU req/s (DSv3-FP4-V2.1, ISL=8192/OSL=1, conc=256):
dwdp4 dg=1 (8 GPU divisible)       3.46
dwdp3 dg=4 (16 GPU divisible)      3.41
dwdp5 dg=4 (24 GPU divisible)      ~3.3 (extrapolated from R1b)
dwdp3 dg=2 (10 GPU non-divisible)  3.51
dwdp5 dg=1 (9 GPU non-divisible)   3.42
TPOT 14-18 ms across all configs (kernel-neutral).
Enables the non-uniform / Mode B perf configurations used in the
follow-up dwdp_size=3/5 experiments in this refactor.
Co-authored-by: Jianbo Hu <jacohu@nvidia.com>
Signed-off-by: tianyuz-nv <tianyuz@nvidia.com>
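The dual-path decision can be illustrated with a small planner. Everything here is a sketch under loud assumptions, not the logic in submit_dwdp.py: the function name `plan_launch`, its return shape, the `<node0_placeholder>` spelling of the placeholder token, and above all the packing rule (a server that does not fit in the current node's remaining GPUs moves to the next node) are guesses made for illustration. The sketch also rejects servers wider than one node, which the real launcher may handle.

```python
def plan_launch(ctx_tp_sizes, gen_tp_sizes, gpus_per_node):
    """Return None for the legacy block path, else (hostfile, gpu_map)."""
    # Divisible case: keep the legacy srun block distribution, no files.
    if sum(ctx_tp_sizes) % gpus_per_node == 0:
        return None

    hostfile, gpu_map = [], []
    node, cursor = 0, 0
    # CTX servers first, then GEN servers, so global rank order matches
    # the ctx_cfgs + gen_cfgs ordering mentioned in the commit message.
    for tp in list(ctx_tp_sizes) + list(gen_tp_sizes):
        if tp > gpus_per_node:
            raise ValueError("sketch only handles servers within one node")
        # Assumed packing rule: never split one server across nodes.
        if cursor + tp > gpus_per_node:
            node, cursor = node + 1, 0
        for _ in range(tp):
            hostfile.append(f"<node{node}_placeholder>")
            gpu_map.append(cursor)  # sequential cursor within the node
            cursor += 1
    return hostfile, gpu_map


# dwdp5 dg=1: five TP=1 CTX workers plus one TP=4 GEN server, 4 GPUs/node.
hostfile, gpu_map = plan_launch([1] * 5, [4], gpus_per_node=4)
print(hostfile)
print(gpu_map)
```

Because gaps only ever appear at the end of a node under this packing rule, each rank's GPU index equals its position among the ranks on that node, which is consistent with the `gpu_id == SLURM_LOCALID` observation in the commit message.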
…er.wait_and_bind
Explain the O(N_moe)→O(1) event simplification that the VA refactor
introduces in wait_and_bind. The IPC-era predecessor carried per-layer
compute_events to signal 'kernel(L) finished, slot is reusable'; the VA
implementation replaces them with per-slot consume_events recorded
inside this very method, relying on CUDA stream in-order semantics to
provide the same WAR ordering at O(1) bookkeeping cost.
Documents the assumption (event recorded on compute_stream AFTER bind
and BEFORE next layer's kernel launch) so future maintainers don't
silently break it when extending to deeper pipelines (num_buffers > 2).
Signed-off-by: tianyuz-nv <tianyuz@nvidia.com>
@coderabbitai summary
Description
Migrate the DWDP (Distributed Weight Data Parallelism) implementation from CUDA IPC to a CUDA VMM + MNNVL fabric-handle based design. DWDP accelerates the context (prefill) phase of disaggregated MoE serving by combining data parallelism with NVLink-based expert weight sharing — each worker holds a subset of experts locally and asynchronously prefetches the rest from peers, eliminating both alltoall latency and synchronization costs on DeepSeek-V3-class models.
The original IPC-based DWDP (PR #12136) used `cudaIpcGetMemHandle`/`cudaIpcOpenMemHandle` and a tensor-list interface to the MoE kernel. This refactor moves to `cuMemCreate(HANDLE_TYPE_FABRIC)` + a composite virtual address layout that interleaves zero-copy local expert data with double-buffered remote regions. The kernel now sees a single `[num_experts, ...]` tensor view per layer instead of a per-rank list, removing the need for DWDP-specific multi-buffer kernel paths.
Key properties preserved:
* `DwdpConfig` schema, `submit_dwdp.py`, `disaggr_torch_dwdp.slurm`, and YAML config layout are all identical to the IPC version. No user-facing breakage.
New capabilities unlocked:
* `dwdp_size=8` across 2 trays.
* `dwdp_size` + redundancy storage: `dwdp_size` values that don't divide `num_experts` (e.g. 3, 5, 7 on DSv3's 256 experts) and configurations where adjacent peer ranges intentionally overlap. The IPC implementation supported this via `_init_dwdp_expert_layout`; the early VA refactor inadvertently dropped it.
VA infrastructure was originally written by dongxuy04 on the tekit branch (commits `9a9729546f`, `48050a3d4f`); this PR migrates it into TRT-LLM main with the integration glue (DwdpManager facade, MoE backend hookup, `fixup_moe_backends` MPI allgather replacing tekit's TCPDWDPStore).
Commits
* a4efa38f
* 7a16cc70
* b3cfd890
* 1fbc0d49 … `num_experts_per_worker` semantics in DWDP setup
* 83a9cff2 … `DwdpConfig.num_groups` against the MPI launch
* 40364f14
* 63908b7c
* b99ba416 … `num_experts` for DWDP setup validation
* fc49b5e2
* d700ce7e … `--distribution=arbitrary`
* 2d433592 … `DWDPWeightManager.wait_and_bind`
Verification
New unit tests cover DwdpManager lifecycle, Mapping integration with the new DWDP fields, MPI allgather semantics in `fixup_moe_backends`, peer-range computation under uniform and Mode B layouts, and partition-config validation. All registered in `l0_a10.yml`.
A new integration test `test_dwdp_accuracy_mode_b_overlap` (commit 9) exercises the non-uniform Mode B path end-to-end on DSv3-Lite + GSM8K, alongside the existing `test_dwdp_accuracy` (uniform).
End-to-end performance was measured across uniform single-tray, cross-tray (`dwdp_size=8`), and Mode B non-divisible (`dwdp_size=3` / `dwdp_size=5`) configurations on DeepSeek-R1-0528-FP4-V2.1, matching the IPC-era baseline.
Cross-node coverage
Cross-node DWDP (where partner-expert prefetch traverses MNNVL fabric instead of intra-node NVLink) is not added as a pytest in this PR for the same reason PR #13888 doesn't: the `disaggregated_mpi_worker` test pattern uses nested `mpirun` and is incompatible with the multi-node CI infrastructure's outer-`srun` MPI world. The cross-node code path is logically equivalent to the intra-node path (model-side code doesn't branch on locality; only `cuMemMap` of partner expert weights uses a fabric handle instead of an IPC handle), and is transitively covered by the unit tests plus the end-to-end perf runs above. A manual one-off cross-node GSM8K run confirmed no accuracy regression.
Compatibility
* `DwdpConfig` (tensorrt_llm/llmapi/llm_args.py): same four fields (`dwdp_size`, `num_groups`, `num_experts_per_worker`, `num_prefetch_experts`); only the docstring is updated to mention CUDA VMM + MNNVL.
* `examples/disaggregated/slurm/benchmark/submit_dwdp.py` / `disaggr_torch_dwdp.slurm` / `start_worker_dwdp.sh`: launcher logic upgraded for non-divisible `dwdp_size` and cross-tray (commit 10); CLI flags unchanged.
* … (`ctx_config.dwdp_config` block): identical structure.
Acknowledgments
VA infrastructure (`_torch/modules/dwdp/`) was originally authored by dongxuy04 (Dongxu Yang) on the tekit branch. This PR is the migration into TRT-LLM main with the integration glue.
The hostfile + `srun --distribution=arbitrary` launcher mechanism in commit 10 is ported from Jianbo Hu's upstream PR #13888 ("support non-divisible EP in MoE alltoall and slurm benchmark"), adapted for the DWDP examples launcher trio.
PR Checklist
* `DwdpConfig` docstring updated; AGENTS.md not touched.
GitHub Bot Help
To see a list of available CI bot commands, please comment `/bot help`.