
feat(ep): support independent combine dtype for EP dispatch/combine benchmarks #239

Merged
isytwu merged 1 commit into ROCm:main from isytwu:support-comb-dtype
Mar 30, 2026

Conversation


@isytwu isytwu commented Mar 30, 2026

Motivation

In real EP+MoE workloads, dispatch and combine often use different data types (e.g., FP4 dispatch with BF16 combine after MoE dequantization). The existing benchmarks assumed dispatch and combine share the same dtype, making it impossible to measure cross-type performance accurately. This PR adds independent combine dtype support to both intranode and internode benchmarks, along with a single-phase tuning mode.

Technical Details

Intranode benchmark (tests/python/ops/bench_dispatch_combine.py)

  • Add --combine-dtype CLI argument to specify a separate dtype for combine
  • Compute combine_hidden_dim internally based on FP4 packing (FP4 dispatch: hidden_dim/2, BF16 combine: full hidden_dim)
  • Config uses max(dispatch_hidden_dim, combine_hidden_dim) to satisfy C++ buffer assertions
  • Type conversion between dispatch and combine via unpack_fp4x2() (FP4→any) or .to() (non-FP4)
  • Separate bandwidth/bytes calculation for dispatch vs combine
  • Per-round latency (lat) printing
  • Skip e2e CUDA graph for cross-type (split graph still works)
  • Single-phase tuning: common sweep (same block/warp for both) + extra dispatch sweep (over-subscribe)
  • LaunchConfig namedtuple for readability
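The hidden-dim bookkeeping above can be sketched as follows. This is a minimal pure-Python illustration, not the benchmark's actual code: the helper name `element_hidden_dim` and the example `hidden_dim` value are assumptions; only the FP4 half-width packing and the `max(...)` config rule come from the description.

```python
FP4_PACK_FACTOR = 2  # two FP4 (E2M1) values packed per storage byte

def element_hidden_dim(dtype_name: str, hidden_dim: int) -> int:
    """Hidden dim as stored in memory: FP4 packs two values per element,
    so its storage width is half the logical hidden_dim."""
    return hidden_dim // FP4_PACK_FACTOR if dtype_name == "fp4" else hidden_dim

# Example: FP4 dispatch with BF16 combine, logical hidden_dim = 7168
hidden_dim = 7168
dispatch_hidden_dim = element_hidden_dim("fp4", hidden_dim)   # packed width
combine_hidden_dim = element_hidden_dim("bf16", hidden_dim)   # full width

# The config takes the max of the two so one buffer satisfies the
# C++ size assertions for both phases.
config_hidden_dim = max(dispatch_hidden_dim, combine_hidden_dim)
```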

Internode benchmark (examples/ops/dispatch_combine/test_dispatch_combine_internode.py)

  • Same cross-type support with --combine-dtype
  • 3-event timing structure to exclude type conversion from both dispatch and combine measurements
  • Separate RDMA and XGMI bandwidth for dispatch vs combine
  • PrettyTable annotates dtype and data volume per rank when cross-type
  • Tuning mode with 3D sweep: block_num × warp_per_block × rdma_block_num (constraint: rdma < block)
  • Stress test: non-graph loop supports cross-type; CUDA graph skipped for cross-type
  • Launch params (block_num, rdma_block_num, warp_per_block) forwarded through run_dispatch/run_combine
  • Fixed a pre-existing sweep_bench parameter-order bug
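The 3D tuning sweep with its `rdma < block` constraint can be sketched like this. The candidate values below are placeholders, not the benchmark's actual search space; only the three sweep axes and the constraint come from the description.

```python
from itertools import product

# Hypothetical sweep spaces; the real values live in the benchmark script.
BLOCK_NUMS = [16, 32, 64]
WARPS_PER_BLOCK = [4, 8]
RDMA_BLOCK_NUMS = [8, 16, 32]

def tuning_candidates():
    """Yield (block_num, warp_per_block, rdma_block_num) triples from the
    3D sweep, keeping only configs where the RDMA blocks are a strict
    subset of the total blocks (rdma_block_num < block_num)."""
    for block, warp, rdma in product(BLOCK_NUMS, WARPS_PER_BLOCK, RDMA_BLOCK_NUMS):
        if rdma < block:
            yield (block, warp, rdma)
```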

Shared (tests/python/ops/test_dispatch_combine.py)

  • unpack_fp4x2(tensor, dtype=bf16): LUT-based FP4 E2M1 unpacking to any float dtype
  • check_combine_result: accepts combine_data_type, skips only when combine is FP4 (not dispatch), enables verification for FP4 dispatch + non-FP4 combine
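The LUT-based unpacking idea can be illustrated in pure Python (the real `unpack_fp4x2` operates on torch tensors). The 16-entry table below is the standard FP4 E2M1 value set; the low-nibble-first byte layout is an assumption and may differ from the benchmark's actual packing.

```python
# FP4 E2M1 value for each 4-bit code (sign bit in the top bit position).
FP4_E2M1_LUT = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
                -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0]

def unpack_fp4x2(packed: bytes) -> list:
    """Unpack bytes holding two FP4 E2M1 codes each into floats via the LUT.

    Assumes the low nibble is the first value and the high nibble the
    second; a tensor version would index the LUT with the nibble codes
    and cast the result to the requested float dtype.
    """
    out = []
    for byte in packed:
        out.append(FP4_E2M1_LUT[byte & 0x0F])  # low nibble -> first value
        out.append(FP4_E2M1_LUT[byte >> 4])    # high nibble -> second value
    return out
```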

Kernel fix (src/ops/dispatch_combine/low_latency_async.cpp)

  • Buffer sizing fix for cross-type internode dispatch/combine

Test Plan

# Intranode
PYTHONPATH=$(pwd) python3 tests/python/ops/bench_dispatch_combine.py --dtype bf16 --max-tokens 128
PYTHONPATH=$(pwd) python3 tests/python/ops/bench_dispatch_combine.py --dtype fp4 --combine-dtype bf16 --world-size 4 --max-tokens 128 --zero-copy 0
PYTHONPATH=$(pwd) python3 tests/python/ops/bench_dispatch_combine.py --dtype fp4 --combine-dtype bf16 --world-size 4 --max-tokens 128 --zero-copy 1
PYTHONPATH=$(pwd) python3 tests/python/ops/bench_dispatch_combine.py --dtype fp8_e4m3_fnuz --combine-dtype bf16 --world-size 4 --max-tokens 128
PYTHONPATH=$(pwd) python3 tests/python/ops/bench_dispatch_combine.py --dtype fp4 --combine-dtype bf16 --world-size 4 --cmd stress

# Internode (single-node simulation)
PYTHONPATH=$(pwd) WORLD_SIZE=1 RANK=0 MASTER_ADDR=localhost MASTER_PORT=29500 MORI_DISABLE_P2P=1 \
  python examples/ops/dispatch_combine/test_dispatch_combine_internode.py \
  --kernel-type async_ll --cmd bench --max-tokens 128 --dtype fp4 --combine-dtype bf16

@isytwu isytwu self-assigned this Mar 30, 2026
@isytwu isytwu merged commit 05f33b2 into ROCm:main Mar 30, 2026
5 checks passed