
feat(ep): support independent combine dtype for EP dispatch/combine benchmarks #239

Merged
isytwu merged 1 commit into ROCm:main from isytwu:support-comb-dtype
Mar 30, 2026

Conversation


@isytwu isytwu commented Mar 30, 2026

Motivation

In real EP+MoE workloads, dispatch and combine often use different data types (e.g., FP4 dispatch with BF16 combine after MoE dequantization). The existing benchmarks assumed dispatch and combine share the same dtype, making it impossible to measure cross-type performance accurately. This PR adds independent combine dtype support to both intranode and internode benchmarks, along with a single-phase tuning mode.

Technical Details

Intranode benchmark (tests/python/ops/bench_dispatch_combine.py)

  • Add --combine-dtype CLI argument to specify a separate dtype for combine
  • Compute combine_hidden_dim internally based on FP4 packing (FP4 dispatch: hidden_dim/2, BF16 combine: full hidden_dim)
  • Config uses max(dispatch_hidden_dim, combine_hidden_dim) to satisfy C++ buffer assertions
  • Type conversion between dispatch and combine via unpack_fp4x2() (FP4→any) or .to() (non-FP4)
  • Separate bandwidth/bytes calculation for dispatch vs combine
  • Per-round latency (lat) printing
  • Skip e2e CUDA graph for cross-type (split graph still works)
  • Single-phase tuning: common sweep (same block/warp for both) + extra dispatch sweep (over-subscribe)
  • LaunchConfig namedtuple for readability
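The hidden-dim bookkeeping above can be sketched as follows. This is a minimal pure-Python illustration, not the benchmark's actual code: the helper name `element_hidden_dim` and the example `hidden_dim` value are assumptions; only the FP4 half-width packing and the `max(...)` config rule come from the description.

```python
FP4_PACK_FACTOR = 2  # two FP4 (E2M1) values packed per storage byte

def element_hidden_dim(dtype_name: str, hidden_dim: int) -> int:
    """Hidden dim as stored in memory: FP4 packs two values per element,
    so its storage width is half the logical hidden_dim."""
    return hidden_dim // FP4_PACK_FACTOR if dtype_name == "fp4" else hidden_dim

# Example: FP4 dispatch with BF16 combine, logical hidden_dim = 7168
hidden_dim = 7168
dispatch_hidden_dim = element_hidden_dim("fp4", hidden_dim)   # packed width
combine_hidden_dim = element_hidden_dim("bf16", hidden_dim)   # full width

# The config takes the max of the two so one buffer satisfies the
# C++ size assertions for both phases.
config_hidden_dim = max(dispatch_hidden_dim, combine_hidden_dim)
```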

Internode benchmark (examples/ops/dispatch_combine/test_dispatch_combine_internode.py)

  • Same cross-type support with --combine-dtype
  • 3-event timing structure to exclude type conversion from both dispatch and combine measurements
  • Separate RDMA and XGMI bandwidth for dispatch vs combine
  • PrettyTable annotates dtype and data volume per rank when cross-type
  • Tuning mode with 3D sweep: block_num × warp_per_block × rdma_block_num (constraint: rdma < block)
  • Stress test: non-graph loop supports cross-type; CUDA graph skipped for cross-type
  • Launch params (block_num, rdma_block_num, warp_per_block) forwarded through run_dispatch/run_combine
  • Fixed a pre-existing sweep_bench parameter-order bug
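The 3D tuning sweep with its `rdma < block` constraint can be sketched like this. The candidate values below are placeholders, not the benchmark's actual search space; only the three sweep axes and the constraint come from the description.

```python
from itertools import product

# Hypothetical sweep spaces; the real values live in the benchmark script.
BLOCK_NUMS = [16, 32, 64]
WARPS_PER_BLOCK = [4, 8]
RDMA_BLOCK_NUMS = [8, 16, 32]

def tuning_candidates():
    """Yield (block_num, warp_per_block, rdma_block_num) triples from the
    3D sweep, keeping only configs where the RDMA blocks are a strict
    subset of the total blocks (rdma_block_num < block_num)."""
    for block, warp, rdma in product(BLOCK_NUMS, WARPS_PER_BLOCK, RDMA_BLOCK_NUMS):
        if rdma < block:
            yield (block, warp, rdma)
```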

Shared (tests/python/ops/test_dispatch_combine.py)

  • unpack_fp4x2(tensor, dtype=bf16): LUT-based FP4 E2M1 unpacking to any float dtype
  • check_combine_result: accepts combine_data_type, skips only when combine is FP4 (not dispatch), enables verification for FP4 dispatch + non-FP4 combine
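The LUT-based unpacking idea can be illustrated in pure Python (the real `unpack_fp4x2` operates on torch tensors). The 16-entry table below is the standard FP4 E2M1 value set; the low-nibble-first byte layout is an assumption and may differ from the benchmark's actual packing.

```python
# FP4 E2M1 value for each 4-bit code (sign bit in the top bit position).
FP4_E2M1_LUT = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
                -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0]

def unpack_fp4x2(packed: bytes) -> list:
    """Unpack bytes holding two FP4 E2M1 codes each into floats via the LUT.

    Assumes the low nibble is the first value and the high nibble the
    second; a tensor version would index the LUT with the nibble codes
    and cast the result to the requested float dtype.
    """
    out = []
    for byte in packed:
        out.append(FP4_E2M1_LUT[byte & 0x0F])  # low nibble -> first value
        out.append(FP4_E2M1_LUT[byte >> 4])    # high nibble -> second value
    return out
```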

Kernel fix (src/ops/dispatch_combine/low_latency_async.cpp)

  • Buffer sizing fix for cross-type internode dispatch/combine

Test Plan

# Intranode
PYTHONPATH=$(pwd) python3 tests/python/ops/bench_dispatch_combine.py --dtype bf16 --max-tokens 128
PYTHONPATH=$(pwd) python3 tests/python/ops/bench_dispatch_combine.py --dtype fp4 --combine-dtype bf16 --world-size 4 --max-tokens 128 --zero-copy 0
PYTHONPATH=$(pwd) python3 tests/python/ops/bench_dispatch_combine.py --dtype fp4 --combine-dtype bf16 --world-size 4 --max-tokens 128 --zero-copy 1
PYTHONPATH=$(pwd) python3 tests/python/ops/bench_dispatch_combine.py --dtype fp8_e4m3_fnuz --combine-dtype bf16 --world-size 4 --max-tokens 128
PYTHONPATH=$(pwd) python3 tests/python/ops/bench_dispatch_combine.py --dtype fp4 --combine-dtype bf16 --world-size 4 --cmd stress

# Internode (single-node simulation)
PYTHONPATH=$(pwd) WORLD_SIZE=1 RANK=0 MASTER_ADDR=localhost MASTER_PORT=29500 MORI_DISABLE_P2P=1 \
  python examples/ops/dispatch_combine/test_dispatch_combine_internode.py \
  --kernel-type async_ll --cmd bench --max-tokens 128 --dtype fp4 --combine-dtype bf16

@isytwu isytwu self-assigned this Mar 30, 2026
@isytwu isytwu merged commit 05f33b2 into ROCm:main Mar 30, 2026
5 checks passed