[None][feat] Refactor DWDP from CUDA IPC to CUDA VMM + MNNVL composite VA by tianyuz-nv · Pull Request #14453 · NVIDIA/TensorRT-LLM

tianyuz-nv · 2026-05-22T08:34:28Z

Description

Migrate the DWDP (Distributed Weight Data Parallelism) implementation from CUDA IPC to a CUDA VMM + MNNVL fabric-handle based design. DWDP accelerates the context (prefill) phase of disaggregated MoE serving by combining data parallelism with NVLink-based expert weight sharing — each worker holds a subset of experts locally and asynchronously prefetches the rest from peers, eliminating both alltoall latency and synchronization costs on DeepSeek-V3-class models.

The original IPC-based DWDP (PR #12136) used cudaIpcGetMemHandle / cudaIpcOpenMemHandle and a tensor-list interface to the MoE kernel. This refactor moves to cuMemCreate(HANDLE_TYPE_FABRIC) + a composite virtual address layout that interleaves zero-copy local expert data with double-buffered remote regions. The kernel now sees a single [num_experts, ...] tensor view per layer instead of a per-rank list, removing the need for DWDP-specific multi-buffer kernel paths.

Key properties preserved:

External contract unchanged — DwdpConfig schema, submit_dwdp.py, disaggr_torch_dwdp.slurm, and YAML config layout are all identical to the IPC version. No user-facing breakage.
Same performance envelope as IPC.
Same accuracy — GSM8K on DSv3-Lite still passes the configured threshold.

New capabilities unlocked:

Cross-tray scaling within an NVL72 rack — dwdp_size=8 across 2 trays.
Non-divisible dwdp_size + redundancy storage — dwdp_size values that don't divide num_experts (e.g. 3, 5, 7 on DSv3's 256 experts) and configurations where adjacent peer ranges intentionally overlap. The IPC implementation supported this via _init_dwdp_expert_layout; the early VA refactor inadvertently dropped it.

VA infrastructure was originally written by dongxuy04 on the tekit branch (commits 9a9729546f, 48050a3d4f); this PR migrates it into TRT-LLM main with the integration glue (DwdpManager facade, MoE backend hookup, fixup_moe_backends MPI allgather replacing tekit's TCPDWDPStore).

Commits

#	SHA	Title
1	`a4efa38f`	Add VA-based DWDP infrastructure and mapping support
2	`7a16cc70`	Switch DWDP integration layer from IPC to VA
3	`b3cfd890`	Revert multi-B kernel infrastructure (DWDP IPC-specific, unused by VA path)
4	`1fbc0d49`	Restore `num_experts_per_worker` semantics in DWDP setup
5	`83a9cff2`	Validate `DwdpConfig.num_groups` against the MPI launch
6	`40364f14`	Decouple DWDP expert layout from fused MoE EP
7	`63908b7c`	Support non-uniform / redundancy expert ranges in DWDP composite VA
8	`b99ba416`	Use gate-side `num_experts` for DWDP setup validation
9	`fc49b5e2`	Add Mode B overlap accuracy integration test
10	`d700ce7e`	Pin DWDP ranks via hostfile + srun `--distribution=arbitrary`
11	`2d433592`	Document implicit compute-done signal in `DWDPWeightManager.wait_and_bind`

Verification

New unit tests cover DwdpManager lifecycle, Mapping integration with the new DWDP fields, MPI allgather semantics in fixup_moe_backends, peer-range computation under uniform and Mode B layouts, and partition-config validation. All registered in l0_a10.yml.

A new integration test test_dwdp_accuracy_mode_b_overlap (commit 9) exercises the non-uniform Mode B path end-to-end on DSv3-Lite + GSM8K, alongside the existing test_dwdp_accuracy (uniform).

End-to-end performance was measured across uniform single-tray, cross-tray (dwdp_size=8), and Mode B non-divisible (dwdp_size=3 / dwdp_size=5) configurations on DeepSeek-R1-0528-FP4-V2.1, matching the IPC-era baseline.

Cross-node coverage

Cross-node DWDP (where partner-expert prefetch traverses MNNVL fabric instead of intra-node NVLink) is not added as a pytest in this PR for the same reason PR #13888 doesn't — the disaggregated_mpi_worker test pattern uses nested mpirun and is incompatible with the multi-node CI infrastructure's outer-srun MPI world. The cross-node code path is logically equivalent to the intra-node path (model-side code doesn't branch on locality; only cuMemMap of partner expert weights uses a fabric handle instead of an IPC handle), and is transitively covered by the unit tests plus the end-to-end perf runs above. A manual one-off cross-node GSM8K run confirmed no accuracy regression.

Compatibility

DwdpConfig (tensorrt_llm/llmapi/llm_args.py): same four fields (dwdp_size, num_groups, num_experts_per_worker, num_prefetch_experts); only the docstring is updated to mention CUDA VMM + MNNVL.
examples/disaggregated/slurm/benchmark/submit_dwdp.py / disaggr_torch_dwdp.slurm / start_worker_dwdp.sh: launcher logic upgraded for non-divisible dwdp_size and cross-tray (commit 10); CLI flags unchanged.
YAML config schema (ctx_config.dwdp_config block): identical structure.

Acknowledgments

VA infrastructure (_torch/modules/dwdp/) was originally authored by dongxuy04 (Dongxu Yang) on the tekit branch. This PR is the migration into TRT-LLM main with the integration glue.
The hostfile + srun --distribution=arbitrary launcher mechanism in commit 10 is ported from Jianbo Hu's upstream PR #13888 ("support non-divisible EP in MoE alltoall and slurm benchmark"), adapted for the DWDP examples launcher trio.

PR Checklist

PR description clearly explains what and why.
PR follows TRT-LLM CODING GUIDELINES.
Test cases are provided for new code paths.
No new dependencies introduced.
No CODEOWNERS changes needed (touches existing DWDP files only).
Documentation updates: DwdpConfig docstring updated; AGENTS.md not touched.
No tava architecture diagram update needed (DWDP is a runtime-internal change, not a backend addition).

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@dongxuy04

Commit 1 of the DWDP IPC->VA refactor. Purely additive: no existing code paths reference the new package yet (wired up in Commit 2).

Package tensorrt_llm/_torch/modules/dwdp/ (migrated from tekit):
* vmm.py             CUDA VMM wrappers (RAII handle + VA region, fabric handle export/import, granularity via lru_cache)
* specs.py           WeightSpec / MnnvlHandleSet / PageAlignedLayout
* page_pool.py       Double-buffer pool of fabric page handles
* weight_buffer.py   Composite VA layout per (layer, weight)
* weight_manager.py  Runtime prefetch / wait_and_bind / events
* transport.py       MNNVL alloc + MPI-based peer handle exchange
* setup.py           Orchestration: Transport -> WeightBuffer -> WeightManager -> fixup_moe_backends
* __init__.py        Package exports

MPI replaces tekit's TCPDWDPStore:
* Transport uses per-pair comm.allgather for handle bytes; the allgather is itself a sync point (no explicit barrier required between Phase 1 iterations).
* fixup_moe_backends uses comm.allgather for small scale params (bias, fc31_alpha, fc2_alpha), replacing tekit's per-sender serialized put/get/barrier pattern.
* tekit's store.py is intentionally NOT migrated.

collect_moe_params takes layer_indices as an input parameter (SSOT from DwdpManager._registered_layers; wired up in Commit 2). This replaces tekit's model-tree walking and makes the MoE layer set a single source of truth.

Mapping (tensorrt_llm/mapping.py):
* Add dwdp_size / dwdp_rank kwargs (at tail, strictly additive)
* Validate range and store on _dwdp_size / _dwdp_rank
* Override moe_tp_size=1 / moe_ep_size=dwdp_size / cluster=1 when DWDP enabled
* moe_ep_rank returns dwdp_rank when DWDP enabled
* Expose dwdp_size / dwdp_rank / dwdp_enabled properties
* Include in __eq__ / __hash__ / to_dict

Quality fixes applied during migration:
* vmm.py: logger.debug on cuda-bindings import fallback; thread-safe granularity cache via functools.lru_cache; logger.info / warning on tensor_from_ptr fallback paths
* page_pool.py: __del__ uses logger.debug instead of silent pass
* weight_manager.py: drop hardcoded "DeepSeek R1" in docstring
* transport.py: logger.error on create() exception (preserves root cause before cleanup)

Tests (registered in tests/integration/test_lists/test-db/l0_a10.yml):
* tests/unittest/others/test_mapping_dwdp.py: construction validation, MoE override, moe_ep_rank branch, to_dict roundtrip, __eq__/__hash__
* tests/unittest/_torch/modules/test_dwdp_fixup_moe_backends.py: mock MPI comm; verify allgather semantic equivalence to tekit TCPDWDPStore version for _allgather_e_score_correction_bias and _allgather_expert_scales

VA infrastructure originally authored by @dongxuy04.

Co-Authored-By: dongxuy04 <dongxuy@nvidia.com>
Signed-off-by: tianyuz-nv <tianyuz@nvidia.com>

Commit 2 of the DWDP IPC->VA refactor. This rewires the integration layer to use the VA infrastructure added in the previous commit. Net -487 lines across the modified files: 204 insertions, 691 deletions.

pyexecutor/dwdp.py (585 -> 208 lines):
* DwdpManager becomes a pure facade:
  - __init__(config, dist, mapping) stores mapping, creates DWDP MPI sub-communicator (COMM_WORLD.Create_group on rank // dwdp_size).
  - __enter__/__exit__ manage the global singleton; duplicate __enter__ now raises instead of silently replacing.
  - add_layer(layer_idx) appends to self._registered_layers, which is the Single Source of Truth for MoE layer indices (setup_dwdp takes them as input; no more model-tree walking).
  - setup(model) delegates to setup_dwdp(model, mapping, device_id, comm, layer_indices=sorted(_registered_layers)) and caches the returned DWDPWeightManager.
  - prefetch_first_layers / wait_and_bind / record_compute_and_prefetch_next forward to the DWDPWeightManager.
* Removed: DwdpLayerHandleCollector (~90 lines), DwdpPrefetchBuffer (~90 lines), all IPC handle plumbing (exchange_all_handles, initialize_prefetch_buffer, build_weight_view, peer_expert_ranges, prefetch_layer internals).

pyexecutor/py_executor_creator.py:
* DwdpConfig -> Mapping bridge: when dwdp_config.dwdp_size > 1, rebuild the Mapping with dwdp_size/dwdp_rank injected (ParallelConfig.to_mapping doesn't know about DWDP).
* DwdpManager ctor now gets mapping=mapping.
* Replace exchange_all_handles + initialize_prefetch_buffer with dwdp_manager.setup(model_engine.model), a single orchestration call.

modules/fused_moe/configurable_moe.py:
* __init__ order fix (F4): DWDP init block now runs BEFORE _create_comm_strategy_auto so the factory can see self.enable_dwdp.
* _create_comm_strategy_auto returns None when DWDP enabled (fixup moe backends ep_size=1, no alltoall needed).
* DWDP init simplified: only add_layer; removed dwdp_handle_collector, dwdp_rank, backend.dwdp_handle_collector.
* wait_and_bind added to forward_impl at Step 3 entry (per-layer, not per-chunk; correctness fix for multi-chunk forward).
* Removed dwdp_weight_view kwargs injection in _prepare_backend_kwargs.

modules/fused_moe/fused_moe_cute_dsl.py:
* Deleted run_moe_nvfp4_impl_dwdp (~100 lines, multi-B kernel path).
* run_moe_nvfp4 no longer branches on is_dwdp; single-tensor path handles all cases (VA swaps param.data so kernel sees full weights).
* Removed dwdp_weight_view read in forward hook and dwdp_handle_collector.register_weights call in load_weights.

modules/fused_moe/interface.py:
* Deleted _init_dwdp_expert_layout and its __init__ call; dropped get_global_dwdp_manager import. fixup_moe_backends (run from setup_dwdp) is now the single source of truth for DWDP expert layout, eliminating the transient EP-sharded state window between __init__ and setup().

llmapi/llm_args.py:
* DwdpConfig docstring: CUDA IPC -> CUDA VMM + MNNVL fabric handles.
* Fields and status (prototype) unchanged; api_stability passes.

Tests:
* tests/unittest/_torch/executor/test_dwdp_manager.py (new, ~280 lines): mocks COMM_WORLD / global_mpi_rank / setup_dwdp; verifies construction validation, global singleton, duplicate __enter__ rejection, add_layer SSOT, setup() forwards sorted layer_indices, prefetch_first_layers double-depth warmup, record_compute_and_prefetch_next last-layer no-op, wait_and_bind delegation.
* Registered in tests/integration/test_lists/test-db/l0_a10.yml.

Smoke tested on GB200: 78 pytest passes in 1.02s (Commit 1 + 2 unit tests + full api_stability suite).

Co-Authored-By: dongxuy04 <dongxuy@nvidia.com>
Signed-off-by: tianyuz-nv <tianyuz@nvidia.com>
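The facade lifecycle described in this commit (global singleton managed by `__enter__`/`__exit__`, duplicate entry rejected, `add_layer` as the single source of layer indices) can be illustrated with a minimal sketch. The names `DwdpManagerSketch`, `_GLOBAL_MANAGER`, `get_global_manager`, and `layer_indices` are hypothetical; the real DwdpManager also takes config, dist, and mapping and creates an MPI sub-communicator, all of which are omitted here.

```python
# Hypothetical stand-in for the module-level singleton slot.
_GLOBAL_MANAGER = None


def get_global_manager():
    """Return the currently active manager, or None outside a `with` block."""
    return _GLOBAL_MANAGER


class DwdpManagerSketch:
    """Illustrative facade: singleton lifecycle plus layer registration only."""

    def __init__(self):
        # Single source of truth for MoE layer indices.
        self._registered_layers = []

    def __enter__(self):
        global _GLOBAL_MANAGER
        if _GLOBAL_MANAGER is not None:
            # Duplicate entry raises instead of silently replacing the manager.
            raise RuntimeError("a DWDP manager is already active")
        _GLOBAL_MANAGER = self
        return self

    def __exit__(self, exc_type, exc, tb):
        global _GLOBAL_MANAGER
        if _GLOBAL_MANAGER is self:
            _GLOBAL_MANAGER = None
        return False  # never swallow exceptions

    def add_layer(self, layer_idx):
        self._registered_layers.append(layer_idx)

    def layer_indices(self):
        # setup() is described as forwarding the sorted list of registered layers.
        return sorted(self._registered_layers)
```

Because layers register themselves through `add_layer`, the setup step can receive the layer set as plain input rather than walking the model tree.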

… unused by VA path)

Commit 3 of the DWDP IPC->VA refactor.

Three files (custom op wrapper + two blackwell kernels) are restored verbatim to their pre-PR-NVIDIA#12136 state. These multi-B paths were introduced purely to support DWDP's IPC scheme, which passed N peer expert shards as separate B tensors into each kernel call. The VA pipeline swaps param.data to a single [num_experts, ...] tensor via cuMemMap, so the standard single-B kernel path handles every case; the multi-B parameters, MAX_B_TENSORS branches, and tuple-ified b/sfb/alpha signatures become dead code.

Files reverted to the commit before e92ee4f (PR NVIDIA#12136):
* _torch/custom_ops/cute_dsl_custom_ops.py
  - Removes *_multi_b custom op registrations
  - Restores GatherGroupedGemmInputsHelper to single-tensor layout
* _torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_gather_grouped_gemm_swiglu_fusion.py
  - Removes b_tensor_l_sizes param, MAX_B_TENSORS, num_b_tensors const_expr branches for 1/2/3/4 B tensors, _make_tma_b helper, kernel-side gB_nkl_0..3 / gSFB_nkl_0..3 expansions, tuple'd signatures
* _torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_grouped_gemm_finalize_fusion.py
  - Same pattern

Verified no commits touched these files between PR NVIDIA#12136 and HEAD, so the revert is surgical and does not risk clobbering unrelated work:

  $ git log --oneline e92ee4f..HEAD -- <file>   # empty for all 3

Net change: +371 / -1519 = -1148 lines in the three files.

Smoke tested: 78 pytest passes (DWDP units + full api_stability), plus runtime import checks confirm both kernel modules and the custom ops module load cleanly after the revert.

Co-Authored-By: dongxuy04 <dongxuy@nvidia.com>
Signed-off-by: tianyuz-nv <tianyuz@nvidia.com>

The VA refactor (a4efa38) inadvertently dropped the IPC version's config-driven expert range computation, replacing it with a hard-coded `num_experts // dwdp_size` in setup.py and transport.py. As a result, DwdpConfig.num_experts_per_worker and num_prefetch_experts became dead fields: read into DwdpManager but never consumed by the runtime. The schema suggested the contract supported by IPC (size = range length, stride = rank-to-rank offset) while the runtime silently assumed uniform integer-division partitioning.

Restore the IPC contract by passing both fields through DwdpManager.setup -> setup_dwdp -> DWDPTransport.create. For uniform configs (dwdp=4: 64/64, dwdp=8: 32/32) the formula is mathematically identical to the old integer division; behavior on tested workloads is unchanged.

Add `_validate_partition_config` helper enforcing four invariants before the range is computed:
1. positive `num_experts_per_worker` and `num_prefetch_experts`
2. stride <= size (no gaps between consecutive ranks)
3. coverage `(dwdp_size - 1) * stride + size >= num_experts_total` (last rank reaches the final expert)
4. chunk shape parity with `num_experts_per_worker` (DwdpConfig and the fused MoE loader agree on local expert count)

These catch the most common schema/runtime mismatches early with specific error messages, and provide the plumbing needed to support non-uniform / redundant partitioning in the future. Redundancy itself is still blocked by the fused MoE backend's `num_experts % ep_size == 0` assertion (interface.py:391); that is a separate, larger change.

Verification:
- Unit tests: 57 passed + 4 subtests, including 14 new validation cases in tests/unittest/_torch/modules/test_dwdp_setup_validation.py
- Accuracy (DSv3-Lite + dwdp=2 + GSM8K): 63.95% (above 61.537 threshold; ~0.7pp above the 63.23% pre-fix data point, within sampling noise)
- Perf DEP baseline: 12.88 req/s (vs historical 12.92, -0.3%)
- Perf DWDP=4: 14.11 req/s (matches the historical 14.11 to two decimal places, +9.6% over DEP baseline)
- Perf DWDP=8: 27.53 req/s on 8 GPUs across 2 trays (1.95x DWDP=4 scaling)

Signed-off-by: tianyuz-nv <tianyuz@nvidia.com>
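The four invariants listed in this commit can be sketched as a small standalone check. This is an illustration under stated assumptions, not the actual `_validate_partition_config`: the function name, its signature, the `chunk_experts` parameter (standing in for the local expert count seen by the fused MoE loader), and the error messages are all hypothetical.

```python
def validate_partition_config_sketch(dwdp_size, size, stride,
                                     num_experts_total, chunk_experts=None):
    """Check the four invariants; return the covered expert extent.

    size   = num_experts_per_worker (length of each rank's range)
    stride = num_prefetch_experts   (rank-to-rank start offset)
    """
    # 1. Both fields must be positive.
    if size <= 0 or stride <= 0:
        raise ValueError("size and stride must be positive")
    # 2. stride <= size, so consecutive ranges leave no gap.
    if stride > size:
        raise ValueError(f"stride {stride} > size {size} leaves gaps")
    # 3. The last rank's range must reach the final expert.
    coverage = (dwdp_size - 1) * stride + size
    if coverage < num_experts_total:
        raise ValueError(
            f"coverage {coverage} < num_experts_total {num_experts_total}")
    # 4. Config and loader must agree on the local expert count
    #    (hypothetical parameter; skipped when not supplied).
    if chunk_experts is not None and chunk_experts != size:
        raise ValueError(
            f"chunk holds {chunk_experts} experts, config says {size}")
    return coverage
```

For the uniform configurations named above (dwdp=4 with 64/64, dwdp=8 with 32/32, both on 256 experts) the coverage expression evaluates to exactly 256, which is why the restored formula matches the old integer division there.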

`num_groups` has been a dead schema field in DwdpConfig from PR NVIDIA#12136 (the original IPC implementation) through commit 1fbc0d4: read into DwdpManager but never consumed by the runtime. Mis-configured YAMLs where the user's declared topology disagrees with the launch were left to fail mysteriously deeper in MPI sub-communicator creation, or to silently accept a launch that didn't match the schema.

Convert the field into runtime validation in DwdpManager.__init__:
1. num_groups must be positive.
2. This rank's computed group_id (`rank // dwdp_size`) must be less than num_groups, so the launch hasn't started more CTX workers than the declared topology can hold.
3. `num_groups * dwdp_size <= MPI world size`, so the world is large enough to fit all declared groups.

The three checks together catch over-subscription, world-size under-allocation, and obviously bogus values, while remaining local (no extra inter-rank communication required since DwdpConfig is identical on every rank).

Verification:
- Unit tests: 61 passed + 4 subtests (4 new num_groups cases in tests/unittest/_torch/executor/test_dwdp_manager.py)
- Accuracy (DSv3-Lite + dwdp=2 + GSM8K): PASS, above the 61.5% threshold (3min 52s)
- Perf DEP baseline: 12.95 req/s
- Perf DWDP=4: 14.26 req/s (+10.1% over DEP)
- Perf DWDP=8 cross-tray: 27.21 req/s (1.91x DWDP=4 scaling)

Signed-off-by: tianyuz-nv <tianyuz@nvidia.com>
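A minimal sketch of the three `num_groups` checks follows. The function name, signature, and error text are hypothetical; the real checks are described as living in DwdpManager.__init__ and reading the rank and world size from MPI, which this sketch takes as plain integers.

```python
def validate_num_groups_sketch(num_groups, dwdp_size, rank, world_size):
    """Run the three local checks and return this rank's group id."""
    # 1. num_groups must be positive.
    if num_groups <= 0:
        raise ValueError(f"num_groups must be positive, got {num_groups}")
    # 2. This rank's group must exist in the declared topology.
    group_id = rank // dwdp_size
    if group_id >= num_groups:
        raise ValueError(
            f"rank {rank} maps to group {group_id}, "
            f"but only {num_groups} groups are declared")
    # 3. The MPI world must be large enough for all declared groups.
    if num_groups * dwdp_size > world_size:
        raise ValueError(
            f"{num_groups} groups x dwdp_size {dwdp_size} "
            f"exceeds world size {world_size}")
    return group_id
```

Every input is either rank-local or identical across ranks, so each rank can reach its verdict without any collective call.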

This is the first of two follow-up commits restoring IPC-era support for non-uniform partitioning (dwdp_size that does not divide num_experts) and redundancy storage on the VA-based DWDP path.

Pre-Phase-1 the VA refactor coupled DWDP to fused MoE EP by setting mapping.moe_ep_size = dwdp_size. That triggered the ``num_experts % ep_size == 0`` assert deep in fused_moe/interface.py and locked DWDP to integer-divisible values (2/4/8 only on DSv3's 256 experts).

Phase 1 decouples the two:
* mapping.py: when dwdp_size > 1, set moe_ep_size = 1 (and moe_ep_rank = 0). The fused MoE backend now sees a single full-table partition; DWDP installs the per-rank layout itself.
* pyexecutor/dwdp.py: DwdpManager exposes start_expert_id / end_expert_id derived from the rank-to-rank stride (num_prefetch_experts) and storage size (num_experts_per_worker) in DwdpConfig. These are read by the override hook below.
* fused_moe/interface.py: restore the IPC-era _init_dwdp_expert_layout method, called at the end of MoE.__init__ after _init_load_balancer. When DWDP is enabled (get_global_dwdp_manager() is not None) it overrides expert_size_per_partition / slot_start / slot_end / initial_local_expert_ids before create_weights() runs, so the backend allocates num_experts_per_worker slots per rank.

For uniform integer-divisible configs (dwdp=4: 64/64, dwdp=8: 32/32) this is mathematically equivalent to the legacy ep_size=dwdp_size layout, as verified by the existing 96 unit tests staying green. Phase 2 (next commit) wires the runtime side so non-uniform / redundant configs actually work end-to-end.

Tests:
* test_mapping_dwdp.py: assert moe_ep_size=1 / moe_ep_rank=0 under DWDP and verify Mapping construction accepts dwdp_size=3/5 unconditionally (Phase 1 contract: rejection moves to DwdpManager / setup_dwdp).
* test_dwdp_manager.py: new tests for start_expert_id / end_expert_id under uniform and redundancy configs.

Signed-off-by: tianyuz-nv <tianyuz@nvidia.com>

…omposite VA

Second of two follow-up commits: Phase 2 + 3 of the restoration plan in the prior commit. Phase 1 made model-load flexible (per-rank storage = num_experts_per_worker regardless of ep_size); this commit makes the runtime side handle non-uniform / overlapping peer ranges end-to-end.

Core abstraction (specs.py):
* compute_peer_ranges(dwdp_size, size, stride, num_experts) -> PeerRanges: deterministic, no allgather. Returns ``[(start, end_capped)]`` per rank where ``end_capped = min(start + size, num_experts)``.
* lookup_owner(expert_id, peer_ranges) -> int: first-match owner lookup, picking the lowest-rank owner under redundancy overlap. Equivalent to ``expert_id // (num_experts // dwdp_size)`` when the partition is uniform.

Wire-through (transport.py / weight_manager.py / setup.py):
* Transport publishes peer_ranges via ``get_peer_ranges()``; setup_dwdp threads them to fill_edge_bytes (replacing ``prev_expert // experts_per_rank``), DWDPWeightManager.prefetch_layer (replacing ``cursor // _experts_per_rank``), and fixup_moe_backends. The legacy ``num_experts % dwdp_size == 0`` assert in ``DWDPTransport.create()`` is removed.
* setup.py:_scatter_shards_to_full is the new common reconstruction helper for fixup_moe_backends's MPI-allgathered scale params and the gate's e_score_correction_bias. Each peer's shard is sized num_experts_per_worker (uniform); ``shard[:end - start]`` is placed at ``full[start:end]``. For redundancy, overlapping writes hit the same expert multiple times but values agree (every rank that owns expert ``e`` loaded ``e`` from the same checkpoint).

Validation (Phase 3, setup.py:_validate_partition_config):
* New check rejecting tail-padding configurations: ``(dwdp_size-1)*stride + size <= num_experts``. Combined with the existing coverage check (``>=``) this enforces strict equality. Tail padding is rejected because GB200 fabric handles cannot be partially mapped into a num_experts-sized composite VA (cuMemMap requires ``size == handle_phys_size``; partial mapping returns CUDA_ERROR_NOT_SUPPORTED).
* Error message lists Mode B recipes for common configs: ``dwdp=3 + 256 -> size=86 stride=85``; ``dwdp=5 + 256 -> 52 / 51``; ``dwdp=7 + 256 -> 40 / 36``.

Tests:
* tests/unittest/_torch/modules/test_dwdp_peer_ranges.py (new): compute_peer_ranges + lookup_owner under uniform (dwdp=4/8), Mode B overlap (dwdp=3/7), and out-of-range / negative inputs; defensive cap tests for inputs that fail validation upstream.
* test_dwdp_setup_validation.py: Mode B overlap pass cases (dwdp=3/5/7) + tail-padding rejection cases (dwdp=3/5).
* test_dwdp_fixup_moe_backends.py: ``_scatter_shards_to_full`` direct tests (uniform, Mode B overlap, higher-dim trailing shape) + an end-to-end ``_allgather_expert_scales`` test for Mode B dwdp=3 reconstructing fc31_alpha to a full 256-element vector.
* l0_a10.yml: registers the new test file for pre-merge CI.

For uniform configs (dwdp=4 / dwdp=8), behaviour is unchanged: ``lookup_owner`` is mathematically equivalent to ``expert_id // (num_experts // dwdp_size)`` and ``compute_peer_ranges`` produces the same start/end tuples the formula would.

Signed-off-by: tianyuz-nv <tianyuz@nvidia.com>
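The two helpers described in this commit can be sketched from the stated contract alone. The bodies below are an assumption reconstructed from the description (``end_capped = min(start + size, num_experts)`` and first-match lookup), not the code in specs.py; in particular, returning a plain list of tuples in place of the `PeerRanges` type and raising `ValueError` for an unowned expert are illustrative choices.

```python
def compute_peer_ranges(dwdp_size, size, stride, num_experts):
    """Per-rank [start, end) expert ranges; deterministic, no communication."""
    ranges = []
    for rank in range(dwdp_size):
        start = rank * stride
        end_capped = min(start + size, num_experts)
        ranges.append((start, end_capped))
    return ranges


def lookup_owner(expert_id, peer_ranges):
    """Return the lowest rank whose range contains expert_id (first match)."""
    for rank, (start, end) in enumerate(peer_ranges):
        if start <= expert_id < end:
            return rank
    raise ValueError(f"expert {expert_id} is not covered by any rank")


# Mode B recipe from the commit message: dwdp=3 on 256 experts.
mode_b = compute_peer_ranges(dwdp_size=3, size=86, stride=85, num_experts=256)
# mode_b == [(0, 86), (85, 171), (170, 256)]
# Expert 85 lies in both rank 0's and rank 1's range; first match picks rank 0.
shared_owner = lookup_owner(85, mode_b)
```

Each listed recipe satisfies the strict-equality rule: `2*85 + 86`, `4*51 + 52`, and `6*36 + 40` all equal 256, so adjacent ranges overlap while the last range ends exactly at the final expert.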

The Phase 2 commit (``63908b7c``) propagated peer_ranges built from ``compute_peer_ranges(num_experts_total=spec.num_experts, ...)``, where ``spec.num_experts`` was derived in ``build_weight_specs`` as ``experts_per_rank * dwdp_size``. Under uniform divisible configurations (dwdp=2 + 36/36 → 72, dwdp=4 + 64/64 → 256) this happened to equal the model's gate-side ``num_experts`` and validation passed. Under Mode B redundancy (e.g. dwdp=3 + size=26/stride=23 on DSv3-Lite's 72 experts), ``experts_per_rank * dwdp_size = 78 ≠ 72``, so the strict-equality validation falsely reported coverage 72 < storage 78 and rejected the configuration.

The fix queries the actual gate-side expert count from the first MoE layer's ``experts_module.num_experts`` via the new ``_get_model_num_experts(model, layer_indices)`` helper, and threads it through ``build_weight_specs`` as an explicit ``num_experts_total`` parameter. Downstream, this becomes ``spec.num_experts`` and is used consistently by:
* ``_validate_partition_config``: strict equality is now checked against the real expert count.
* ``compute_peer_ranges``: peer ranges cap at the real expert count.
* ``PageAlignedLayout`` / ``WeightBuffer._compute_remote_slices``: composite VA's full tensor shape is ``(num_experts, ...)`` and remote slices end at ``num_experts``.

For uniform divisible cases the new path is mathematically equivalent to the prior code (since storage extent == gate's num_experts when ``size == stride == num_experts // dwdp_size``). Verified by re-running the dwdp=4 accuracy regression (passes with GSM8K = 64.519) and by the Mode B unit tests in ``test_dwdp_peer_ranges.py`` / ``test_dwdp_setup_validation.py``, which already use gate-side ``num_experts_total`` values directly.

Signed-off-by: tianyuz-nv <tianyuz@nvidia.com>
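The mismatch is plain arithmetic, worked through below for the configurations named in this commit. The two helper names are hypothetical and exist only to label the two quantities being compared.

```python
def derived_storage_extent(dwdp_size, experts_per_rank):
    """The old derivation of spec.num_experts: per-rank storage times ranks."""
    return experts_per_rank * dwdp_size


def partition_coverage(dwdp_size, size, stride):
    """Expert extent actually spanned by the configured ranges."""
    return (dwdp_size - 1) * stride + size


# Uniform divisible configs: both quantities equal the gate-side count.
uniform_lite = (derived_storage_extent(2, 36), partition_coverage(2, 36, 36))
uniform_dsv3 = (derived_storage_extent(4, 64), partition_coverage(4, 64, 64))

# Mode B on DSv3-Lite (72 gate-side experts): the derived extent overshoots
# because the overlapping experts are counted once per owning rank.
mode_b_lite = (derived_storage_extent(3, 26), partition_coverage(3, 26, 23))
```

Here `uniform_lite` is `(72, 72)` and `uniform_dsv3` is `(256, 256)`, while `mode_b_lite` is `(78, 72)`: only the coverage figure matches the model's 72 experts, which is why validation needs the gate-side count rather than the derived product.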

Adds ``test_dwdp_accuracy_mode_b_overlap`` to TestDwdpDeepSeekV3Lite, covering the Phase 2 redundancy (overlap) path end-to-end on a real DSv3-Lite + GSM8K workload.

Config: ``dwdp_size=2``, ``num_experts_per_worker=40``, ``num_prefetch_experts=32``. The strict-equality check evaluates to ``1*32 + 40 == 72`` (== model's gate-side num_experts), with 8-expert overlap between rank 0's ``[0, 40)`` and rank 1's ``[32, 72)``.

This exercises:
* ``_init_dwdp_expert_layout`` overriding ``expert_size_per_partition`` to 40 (not the uniform ``72 // 2 == 36``).
* ``lookup_owner`` first-match policy in ``fill_edge_bytes`` / ``weight_manager.prefetch_layer`` for the 8 shared experts.
* ``_scatter_shards_to_full`` writing each overlap expert from two peers; values must agree because every rank that owns expert ``e`` loaded ``e`` from the same checkpoint.

Resource footprint matches the existing ``test_dwdp_accuracy`` (2 CTX TP=1 + 1 GEN TP=2 = 4 GPUs on GB200). ``dwdp_size=2`` is kept intentionally: a ``dwdp_size=3`` variant triggers an unrelated UCX bus error in ``libtensorrt_llm_ucx_wrapper.so``'s ``WorkerProgressThread::progressUntilSync`` when 3 CTX workers simultaneously exchange KV cache. The bug is in the KV transceiver layer (not DWDP); the existing 2-CTX baseline runs cleanly and exercises the overlap path equally well.

Verified: PASSED, GSM8K = 63.268 (threshold 61.537, reference 64.740); runtime 234s.

Signed-off-by: tianyuz-nv <tianyuz@nvidia.com>
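The overlap reconstruction for this test configuration can be sketched with plain Python lists standing in for tensors. The function below is an assumption built from the description of ``_scatter_shards_to_full`` (place ``shard[:end - start]`` at ``full[start:end]``); the name, signature, and list-based representation are illustrative only.

```python
def scatter_shards_to_full_sketch(shards, peer_ranges, num_experts):
    """Rebuild a full per-expert vector from per-rank shards."""
    full = [None] * num_experts
    for shard, (start, end) in zip(shards, peer_ranges):
        # Overlapping ranges are written more than once; values agree because
        # each owning rank loaded the expert from the same checkpoint.
        full[start:end] = shard[:end - start]
    return full


# Test config from this commit: dwdp_size=2, size=40, stride=32, 72 experts.
num_experts = 72
peer_ranges = [(0, 40), (32, 72)]
# Stand-in for a per-expert scale loaded from a shared checkpoint.
checkpoint = [0.5 * e for e in range(num_experts)]
shards = [checkpoint[start:end] for start, end in peer_ranges]
full = scatter_shards_to_full_sketch(shards, peer_ranges, num_experts)
```

Experts 32 through 39 are written first from rank 0's shard and then from rank 1's, and the result is identical to the checkpoint vector because both shards carry the same values for those entries.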

Port the per-rank hostfile + gpu_map launcher mechanism from PR NVIDIA#13888 (Jianbo Hu, "support non-divisible EP in MoE alltoall and slurm benchmark") into the in-tree DWDP examples launcher trio (submit_dwdp.py, disaggr_torch_dwdp.slurm, start_worker_dwdp.sh).

Motivation: SLURM block distribution requires (n_ctx + n_gen * gen_tp) to be a multiple of gpus_per_node, otherwise GEN tensor-parallel ranks get split across nodes/trays and incur a large allreduce penalty. Previously worked around by adding empty CTX slots (over-provisioning) or picking dwdp_group multiples that happen to divide cleanly; both are fragile and incompatible with Mode B non-uniform expert ranges where num_ctx is e.g. 5 or 6.

Dual-path design:
* Divisible case (num_ctx_gpus % gpus_per_node == 0): Use the legacy --nodelist + -N + --ntasks-per-node srun command. Block distribution gives the natural rank-to-node mapping; trtllm-serve picks its CUDA device from SLURM_LOCALID. No hostfile / gpu_map files are emitted. Perf-optimal path.
* Non-divisible case (Mode B dwdp3 dg=2, dwdp5 dg=1, etc.): Emit hostfile_mpi_worker_base.txt and gpu_map_mpi_worker_base.txt under log_dir, iterating CTX servers then GEN servers so global rank ordering matches split_world_comm's ctx_cfgs + gen_cfgs. disaggr_torch_dwdp.slurm rewrites <nodeN_placeholder> tokens to real hostnames at runtime. srun uses SLURM_HOSTFILE + --distribution=arbitrary to pin every rank's host placement.

DWDP-specific deviation from PR NVIDIA#13888: start_worker_dwdp.sh does NOT export CUDA_VISIBLE_DEVICES. DWDP relies on intra-node peer GPU discovery (VA composite cuMemMap of peer MNNVL fabric handles, UCX cuda_ipc / cuda_copy intra-node KV transports, PyTorch peer device enumeration); restricting CUDA visibility to a single GPU breaks these paths and causes a 15% per-CTX-GPU regression (TPOT unchanged, TTFT std blows up 3x). With our allocate_gpus's sequential cursor, gpu_id == SLURM_LOCALID for every rank, so trtllm-serve's internal LOCALID-based device selection already lands each rank on the correct GPU. The gpu_map file is kept for diagnostics and audit logging only.

Empirical per-CTX-GPU req/s (DSv3-FP4-V2.1, ISL=8192/OSL=1, conc=256):

  dwdp4 dg=1 (8 GPU divisible)        3.46
  dwdp3 dg=4 (16 GPU divisible)       3.41
  dwdp5 dg=4 (24 GPU divisible)       ~3.3 (extrapolated from R1b)
  dwdp3 dg=2 (10 GPU non-divisible)   3.51
  dwdp5 dg=1 (9 GPU non-divisible)    3.42

TPOT 14-18 ms across all configs (kernel-neutral).

Enables the non-uniform / Mode B perf configurations used in the follow-up dwdp_size=3/5 experiments in this refactor.

Co-authored-by: Jianbo Hu <jacohu@nvidia.com>
Signed-off-by: tianyuz-nv <tianyuz@nvidia.com>

…er.wait_and_bind

Explain the O(N_moe)→O(1) event simplification that the VA refactor introduces in wait_and_bind. The IPC-era predecessor carried per-layer compute_events to signal 'kernel(L) finished, slot is reusable'; the VA implementation replaces them with per-slot consume_events recorded inside this very method, relying on CUDA stream in-order semantics to provide the same WAR ordering at O(1) bookkeeping cost.

Documents the assumption (event recorded on compute_stream AFTER bind and BEFORE next layer's kernel launch) so future maintainers don't silently break it when extending to deeper pipelines (num_buffers > 2).

Signed-off-by: tianyuz-nv <tianyuz@nvidia.com>

tianyuz-nv and others added 11 commits April 19, 2026 22:34

github-actions Bot assigned tianyuz-nv May 22, 2026

tianyuz-nv wants to merge 11 commits into NVIDIA:main from tianyuz-nv:feat/dwdp-va-refactor
