Merged
85 commits
593bcf4
[AMD] Add vLLM disaggregated prefill-decode benchmark for MI355X
chunfangamd Mar 11, 2026
f805b62
[AMD] Refactor vLLM disagg recipe: models.yaml, UCX cleanup, QoS support
chunfangamd Mar 11, 2026
a65d6be
[AMD] Update vLLM disagg recipe for v0.17.1 NixlConnector API
chunfangamd Mar 11, 2026
d62d53c
[AMD] Make vLLM disagg recipe CI-compatible (mia1 cluster)
chunfangamd Mar 12, 2026
788aa2b
[AMD] Co-locate vLLM disagg router with prefill on NODE_RANK=0
chunfangamd Mar 12, 2026
efce933
[AMD] Use public vLLM base image with runtime dependency install
chunfangamd Mar 12, 2026
2ffd37f
[AMD] Enable Expert Parallelism with MoRI all-to-all on vLLM disagg d…
chunfangamd Mar 13, 2026
25345ce
[AMD] Switch vLLM disagg KV transfer to MoRI-IO with protocol-aware p…
chunfangamd Mar 13, 2026
c50b3c8
[AMD] BUG fix: RANDOM_RANGE_RATIO never reaches bench.sh
ichbinblau Mar 17, 2026
fa7794d
Bug fix: 1. With DRY_RUN=1, node 0 skipped starting proxy/prefill but…
ichbinblau Mar 17, 2026
8fb6f48
[AMD] Fix vLLM disagg hang: READ mode support + safety timeouts
chunfangamd Mar 19, 2026
5c5d072
Adapt vLLM disagg recipe for 9N mia1 cluster (mlx5 NICs)
chunfangamd Mar 21, 2026
776bde9
[AMD] Fix vLLM disagg sweep hang: KV cache leak + benchmark client ha…
chunfangamd Mar 22, 2026
a4b3658
[AMD] Fix vLLM disagg Slurm job never terminating after benchmark com…
chunfangamd Mar 22, 2026
a28dce5
[AMD] Enable MoRI-IO READ mode by default for vLLM disagg
chunfangamd Mar 22, 2026
af1bbb4
[AMD] Fix CI checkout failure caused by root-owned __pycache__ files
chunfangamd Mar 22, 2026
7eddefa
[AMD] Fix CI checkout EACCES by redirecting Python bytecache off NFS
chunfangamd Mar 23, 2026
1b791b6
[AMD] Fix KV reaper deadlock on high-ISL disagg workloads
chunfangamd Mar 23, 2026
5c5f0b2
[AMD] Enable reading PREFILL_TP,PREFILL_EP,PREFILL_DP_ATTN,DECODE_TP,…
ichbinblau Mar 24, 2026
a337fae
[AMD] Upgrade vLLM disagg image from v0.17.1 to v0.18.0
chunfangamd Mar 29, 2026
fb211a4
[AMD] Add Kimi-K2.5-MXFP4 disagg inference config (1P2D)
chunfangamd Mar 30, 2026
9b8159e
feat: add MiniMax M2.5 PD disaggregation recipe (1P2D, MoRI-EP + MoRI…
chunfangamd Apr 3, 2026
e3319a7
feat: add Dockerfile and runtime patch for MiniMax M2.5 WideEP + MoRI
chunfangamd Apr 3, 2026
17a4abf
Fix: rename minimaxm25 to minimaxm2.5 for CI naming consistency
chunfangamd Apr 3, 2026
fec9fe2
Optimize: add --gpu-memory-utilization 0.95 and --block-size 32 to Mi…
chunfangamd Apr 3, 2026
4a0a81a
Fix: MiniMax M2.5 disagg — require EP=8 for prefill, fix ROCm gate dtype
chunfangamd Apr 3, 2026
9445f6a
Remove unused docker/minimax-m25-disagg/ directory
chunfangamd Apr 3, 2026
4b94881
remove vllm disagg for dpsr1 and dpv3
ichbinblau Apr 13, 2026
c5ba7ea
consolidate amd_utils for sglang and vllm
ichbinblau Apr 21, 2026
ac064a8
use vLLM router as default router for vllm disagg
ichbinblau Apr 21, 2026
75b18c6
fix bugs
ichbinblau Apr 23, 2026
5fcca87
[AMD] Bump to nightly vllm and vllm-router images (#1208)
simondanielsson May 4, 2026
b4d0b48
update vllm image and vllm router image
ichbinblau May 12, 2026
b51320d
update the interface prefix for tw cluster
ichbinblau May 12, 2026
7d84712
add deps for ib device auto-detection
ichbinblau May 13, 2026
f377527
update vllm image
ichbinblau May 13, 2026
d868a77
fix indentation and add missing finally block in async_request_openai…
ichbinblau May 13, 2026
cd03311
fix tw-eth interface detection pattern in env.sh
ichbinblau May 13, 2026
e46ffbb
fix vllm-disagg config schema: use scenarios.fixed-seq-len
ichbinblau May 13, 2026
fecf422
fix vllm-disagg routing to multi_node benchmark subdir
ichbinblau May 13, 2026
b2664d0
fix result collection to use FRAMEWORK as log directory prefix
ichbinblau May 13, 2026
8a6c464
suppress tokenizer warnings and debug output in bench.sh
ichbinblau May 14, 2026
6ed08fb
fix vllm-disagg deadlock: stop router after rank 0 container exits
ichbinblau May 14, 2026
9fba828
reduce vllm-disagg concurrency sweep to single point for faster itera…
ichbinblau May 14, 2026
4ea260d
preserve slurm logs on failure and print stderr inline
ichbinblau May 14, 2026
756becb
enable set -x around docker privilege detection for CI debugging
ichbinblau May 14, 2026
7f9025f
fix docker detection: test on compute node, not batch host
ichbinblau May 14, 2026
400ef36
fix docker detection: per-node probe since group membership varies
ichbinblau May 14, 2026
21983ad
add vllm-disagg changelog entries and update kimi conc-list
ichbinblau May 14, 2026
898e901
switch vllm-disagg to 8k1k config to trigger multi-node eval
ichbinblau May 14, 2026
f311bfd
add multi-node eval feature
ichbinblau May 15, 2026
7b92e57
remove start_etcd.sh
ichbinblau May 15, 2026
e18e09d
change decode to 1, easier for testing
ichbinblau May 15, 2026
21eab91
add --served-model-name to vllm serve commands and wire up eval
ichbinblau May 15, 2026
58bb2a3
fix model name consistency between vllm serve and bench client
ichbinblau May 15, 2026
c17d4c1
add token patch to bench for vllm
ichbinblau May 15, 2026
47455c4
add --tokenizer passthrough to run_benchmark_serving
ichbinblau May 15, 2026
839b547
update vllm image for kimi2.5 and Minimax disagg.
ichbinblau May 15, 2026
3f43d14
Update setup_deps.sh
ichbinblau May 18, 2026
e4852e2
Update amd-master.yaml
ichbinblau May 18, 2026
61bc8b9
update req rate for vllm.
ichbinblau May 19, 2026
81203a3
make the sglang env consistent with upstream
ichbinblau May 19, 2026
895ba67
node blacklist
ichbinblau May 19, 2026
dab93b8
fix: remove faulty minimax patch
simondanielsson May 21, 2026
3e07aea
fix: remove unneeded commented-out code from setup_deps.sh
simondanielsson May 21, 2026
9237eac
fix: bump to latest nightly vllm image on minimax
simondanielsson May 21, 2026
4c1520d
fix: temporarily mount /coredumps
simondanielsson May 21, 2026
c2e0377
tmp: add better debugging capabilities
simondanielsson May 21, 2026
b172350
fix: disable custom all-reduce for minimax
simondanielsson May 21, 2026
9eaf548
fix: minimax segfault by avoiding M=8K fmoe kernel shape
simondanielsson May 21, 2026
2bde2b6
revert: fix: temporarily mount /coredumps
simondanielsson May 21, 2026
e6d26d7
feat: add VLLM_ROCM_SHUFFLE_KV_CACHE_LAYOUT=1 as in single node example
simondanielsson May 21, 2026
102e59f
fix: use FRAMEWORK arg in collect_latest_results.py to match vllm-dis…
ichbinblau May 26, 2026
c60e6af
remove unused vllm_disagg_utils directory
ichbinblau May 26, 2026
106a4e4
revert: restore backend_request_func.py to match main
ichbinblau May 26, 2026
8ccd28a
revert: restore benchmark_serving.py to match main
ichbinblau May 26, 2026
93da023
revert: fully restore benchmark_serving.py to match main
ichbinblau May 26, 2026
f242ee5
revert: fully restore backend_request_func.py to match main
ichbinblau May 26, 2026
b133e5f
add pr-link to vllm-disagg changelog entries
ichbinblau May 27, 2026
b53a95b
fix: sync env.sh with upstream main
ichbinblau May 27, 2026
8de53c8
fix: restore SGLANG_MORI_COMBINE_DTYPE in server launch commands
ichbinblau May 27, 2026
9fe9b24
refactor: move static vLLM env vars to env.sh, remove dead etcd code
ichbinblau May 27, 2026
6286f44
fix: pass IS_MULTINODE into Docker container
ichbinblau May 27, 2026
37733fb
fix: improve vllm-disagg changelog descriptions
ichbinblau May 27, 2026
b1ae781
fix: restore DP+EP override blocks and trailing newline in server_sgl…
ichbinblau May 27, 2026
109 changes: 109 additions & 0 deletions .github/configs/amd-master.yaml
@@ -1350,6 +1350,115 @@ dsr1-fp8-mi355x-sglang-disagg-mtp:
           - "DECODE_NODES=1"
           - "DECODE_MTP_SIZE=2"

+kimik2.5-fp4-mi355x-vllm-disagg:
+  image: vllm/vllm-openai-rocm:nightly-bf610c2f56764e1b30bc6065f4ceace3d6e59036
+  model: amd/Kimi-K2.5-MXFP4
+  model-prefix: kimik2.5
+  runner: mi355x-disagg
+  precision: fp4
+  framework: vllm-disagg
+  multinode: true
+  disagg: true
+  scenarios:
+    fixed-seq-len:
+    - isl: 1024
+      osl: 1024
+      search-space:
+      # 1P2D: 1 prefill node (co-located with proxy) + 2 decode nodes = 3 nodes total
+      - spec-decoding: "none"
+        conc-list: [ 8, 16, 32, 64, 128, 256, 512 ]
+        prefill:
+          num-worker: 1
+          tp: 8
+          ep: 1
+          dp-attn: false
+          additional-settings:
+          - "PREFILL_NODES=1"
+          - "VLLM_MORIIO_CONNECTOR_READ_MODE=1"
+        decode:
+          num-worker: 2
+          tp: 8
+          ep: 8
+          dp-attn: false
+          additional-settings:
+          - "DECODE_NODES=2"
+
+    - isl: 8192
+      osl: 1024
+      search-space:
+      - spec-decoding: "none"
+        conc-list: [ 8, 16, 32, 64, 128, 256, 512 ]
+        prefill:
+          num-worker: 1
+          tp: 8
+          ep: 1
+          dp-attn: false
+          additional-settings:
+          - "PREFILL_NODES=1"
+          - "VLLM_MORIIO_CONNECTOR_READ_MODE=1"
+        decode:
+          num-worker: 2
+          tp: 8
+          ep: 8
+          dp-attn: false
+          additional-settings:
+          - "DECODE_NODES=2"
+
+minimaxm2.5-fp8-mi355x-vllm-disagg:
+  image: vllm/vllm-openai-rocm:nightly-a6682d1d259cca69a9ae737ea5608fbbe7520031
+  model: MiniMaxAI/MiniMax-M2.5
+  model-prefix: minimaxm2.5
+  runner: mi355x-disagg
+  precision: fp8
+  framework: vllm-disagg
+  multinode: true
+  disagg: true
+  scenarios:
+    fixed-seq-len:
+    - isl: 1024
+      osl: 1024
+      search-space:
+      # 1P2D: 1 prefill node (co-located with proxy) + 2 decode nodes = 3 nodes total
+      # Prefill also needs EP=8: MiniMax M2.5 expert intermediate_size=1536,
+      # TP8 shards to 192 which is not divisible by FP8 block_n=128.
+      - spec-decoding: "none"
+        conc-list: [ 8, 16, 32, 64, 128, 256, 512 ]
+        prefill:
+          num-worker: 1
+          tp: 8
+          ep: 8
+          dp-attn: false
+          additional-settings:
+          - "PREFILL_NODES=1"
+          - "VLLM_MORIIO_CONNECTOR_READ_MODE=1"
+        decode:
+          num-worker: 2
+          tp: 8
+          ep: 8
+          dp-attn: false
+          additional-settings:
+          - "DECODE_NODES=2"
+
+    - isl: 8192
+      osl: 1024
+      search-space:
+      - spec-decoding: "none"
+        conc-list: [ 8, 16, 32, 64, 128, 256, 512 ]
+        prefill:
+          num-worker: 1
+          tp: 8
+          ep: 8
+          dp-attn: false
+          additional-settings:
+          - "PREFILL_NODES=1"
+          - "VLLM_MORIIO_CONNECTOR_READ_MODE=1"
+        decode:
+          num-worker: 2
+          tp: 8
+          ep: 8
+          dp-attn: false
+          additional-settings:
+          - "DECODE_NODES=2"

 dsr1-fp4-mi355x-sglang-disagg:
   image: lmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x-20260519
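Note on the MiniMax M2.5 prefill comment: the divisibility claim can be checked with a few lines of shell arithmetic. This is a minimal sketch, not code from the recipe. `shard_fits` is a hypothetical helper, and it assumes the only constraint is that each rank's slice of the expert intermediate dimension must be a whole multiple of the FP8 block size. The second call models experts that are not sliced at all, which is an assumption about how expert parallelism lays them out.

```shell
#!/bin/bash
# Hypothetical helper (not part of the recipe): does an expert intermediate
# dimension, split across `tp` ranks, divide evenly into FP8 blocks of
# `block_n` columns?
shard_fits() {
    local intermediate=$1 tp=$2 block_n=$3
    local shard=$(( intermediate / tp ))
    if (( intermediate % tp == 0 && shard % block_n == 0 )); then
        echo "yes"
    else
        echo "no"
    fi
}

# Values taken from the config comment: intermediate_size=1536, block_n=128.
shard_fits 1536 8 128   # TP8 slice is 192, and 192 % 128 = 64, so this prints "no"
shard_fits 1536 1 128   # unsliced expert: 1536 = 12 * 128, so this prints "yes"
```

The first call reproduces the failure the comment describes; the second shows why keeping each expert whole avoids it.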
9 changes: 9 additions & 0 deletions benchmarks/benchmark_lib.sh
@@ -210,6 +210,7 @@ run_benchmark_serving() {
     local dsv4=false
     local trust_remote_code=false
     local server_pid=""
+    local tokenizer=""

     while [[ $# -gt 0 ]]; do
         case $1 in
@@ -278,6 +279,10 @@ run_benchmark_serving() {
                 server_pid="$2"
                 shift 2
                 ;;
+            --tokenizer)
+                tokenizer="$2"
+                shift 2
+                ;;
             *)
                 echo "Unknown parameter: $1"
                 return 1
@@ -385,6 +390,10 @@ run_benchmark_serving() {
         benchmark_cmd+=(--trust-remote-code)
     fi

+    if [[ -n "$tokenizer" ]]; then
+        benchmark_cmd+=(--tokenizer "$tokenizer")
+    fi
+
     # Run benchmark with optional server monitoring
     set -x
     if [[ -n "$server_pid" ]]; then
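Note on the `--tokenizer` passthrough: the three hunks above follow one pattern: declare an empty local, fill it in the `case` loop, and append the flag to the command array only when it was supplied. A self-contained sketch of that pattern follows. `build_cmd` is a hypothetical stand-in that prints the command instead of running it, and the `python3 benchmark_serving.py` prefix is an assumption about how the real command array starts.

```shell
#!/bin/bash
# Hypothetical stand-in for run_benchmark_serving: parses flags and prints the
# command it would run rather than launching a benchmark.
build_cmd() {
    local model="" tokenizer=""
    while [[ $# -gt 0 ]]; do
        case $1 in
            --model)     model="$2";     shift 2 ;;
            --tokenizer) tokenizer="$2"; shift 2 ;;
            *) echo "Unknown parameter: $1" >&2; return 1 ;;
        esac
    done

    # Assumed command prefix; the real array is built elsewhere in the function.
    local cmd=(python3 benchmark_serving.py --model "$model")
    # Forward --tokenizer only when the caller supplied one.
    if [[ -n "$tokenizer" ]]; then
        cmd+=(--tokenizer "$tokenizer")
    fi
    printf '%s\n' "${cmd[*]}"
}

build_cmd --model served-name
build_cmd --model served-name --tokenizer /models/local-copy
```

Building the command as an array keeps a tokenizer path containing spaces as a single argument, which a flat string would not.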
79 changes: 62 additions & 17 deletions benchmarks/multi_node/amd_utils/bench.sh
@@ -1,63 +1,108 @@
 #!/bin/bash
+# Dual-Engine Disaggregated Benchmark Runner
+#
+# ENGINE=sglang-disagg (default): SGLang benchmark
+# ENGINE=vllm-disagg: vLLM benchmark
+#
+# Produces JSON result files via benchmark_serving.py so that the CI pipeline
+# can collect and process results.
+#
+# Usage: bash bench.sh <n_prefill> <n_decode> <prefill_gpus> <decode_gpus> \
+#                      <model_dir> <model_name> <log_path> <isl> <osl> \
+#                      <concurrency_list> <req_rate> <random_range_ratio> <num_prompts_multiplier>

+ENGINE="${ENGINE:-sglang-disagg}"

 n_prefill=$1
 n_decode=$2
 prefill_gpus=$3
 decode_gpus=$4
 model_path=$5
 model_name=$6
-MODEL_PATH="${model_path}/${model_name}"
+MODEL_PATH="${MODEL_PATH:-${model_path}/${model_name}}"
+# vllm-disagg uses --served-model-name MODEL_NAME; sglang defaults to MODEL_PATH
+if [[ "$ENGINE" == "vllm-disagg" ]]; then
+    BENCH_MODEL="${MODEL_NAME:-${MODEL_PATH}}"
+else
+    BENCH_MODEL="${MODEL_PATH}"
+fi
 log_path=$7

 chosen_isl=${8:-1024}
 chosen_osl=${9:-1024}
 concurrency_list=${10:-"512x1"}
-chosen_req_rate=${11:-1}
+if [[ "$ENGINE" == "vllm-disagg" ]]; then
+    chosen_req_rate=${11:-inf}
+else
+    chosen_req_rate=${11:-1}
+fi
 random_range_ratio=${12:-0.8}
 num_prompts_multiplier=${13:-10}

 IFS='x' read -r -a chosen_concurrencies <<< "$concurrency_list"

-echo "Config ${chosen_isl}; ${chosen_osl}; ${chosen_concurrencies[0]}; ${chosen_req_rate}"
+ROUTER_PORT="${ROUTER_PORT:-30000}"

-head_node="localhost"
-head_port="30000"
+export TRANSFORMERS_VERBOSITY=error
+export TOKENIZERS_PARALLELISM=false

+echo "Config ${chosen_isl}; ${chosen_osl}; ${chosen_concurrencies[0]}; ${chosen_req_rate}"

-profile_folder="${log_path}/sglang_isl_${chosen_isl}_osl_${chosen_osl}"
-mkdir -p $profile_folder
+profile_folder="${log_path}/${ENGINE}_isl_${chosen_isl}_osl_${chosen_osl}"
+mkdir -p "$profile_folder"

 source "$(dirname "$0")/../../benchmark_lib.sh"

 # Repo root inside the container (3 levels up from this script's directory)
 REPO_ROOT="$(cd "$(dirname "$0")/../../.." && pwd)"

-for max_concurrency in ${chosen_concurrencies[@]}; do
+for max_concurrency in "${chosen_concurrencies[@]}"; do

     export_file="${profile_folder}/concurrency_${max_concurrency}_req_rate_${chosen_req_rate}_gpus_$((prefill_gpus+decode_gpus))_ctx_${prefill_gpus}_gen_${decode_gpus}"

+    num_prompts=$(( max_concurrency * num_prompts_multiplier ))
+    if [[ "$num_prompts" -lt 16 ]]; then
+        num_prompts=16
+    fi

     echo "profile_folder: $profile_folder"
     echo "max_concurrency: $max_concurrency"
     echo "chosen_req_rate: $chosen_req_rate"
     echo "MODEL_PATH: $MODEL_PATH"
-    echo "head_port: $head_port"
+    echo "ROUTER_PORT: $ROUTER_PORT"
     echo "chosen_isl: $chosen_isl"
     echo "chosen_osl: $chosen_osl"
+    echo "num_prompts: $num_prompts"
     echo "export_file: $export_file"

+    # Engine-specific extra flags
+    extra_flags=""
+    if [[ "$ENGINE" == "vllm-disagg" ]]; then
+        extra_flags="--trust-remote-code --tokenizer $MODEL_PATH"
+    else
+        if [ "$IS_MTP" = "true" ]; then
+            extra_flags="--use-chat-template"
+        fi
+    fi

     run_benchmark_serving \
         --bench-serving-dir "$REPO_ROOT" \
-        --model ${MODEL_PATH} \
-        --port ${head_port} \
+        --model "$BENCH_MODEL" \
+        --port "$ROUTER_PORT" \
         --backend openai \
-        --input-len ${chosen_isl} \
-        --output-len ${chosen_osl} \
-        --random-range-ratio ${random_range_ratio} \
-        --num-prompts $(( $max_concurrency * $num_prompts_multiplier )) \
+        --input-len "$chosen_isl" \
+        --output-len "$chosen_osl" \
+        --random-range-ratio "$random_range_ratio" \
+        --num-prompts "$num_prompts" \
         --max-concurrency "$max_concurrency" \
         --result-filename "$export_file" \
         --result-dir /workspace/ \
-        $( [ "$IS_MTP" = "true" ] && echo "--use-chat-template" )
+        $extra_flags

     echo "-----------------------------------------"

+    # vLLM: cooldown between rounds for idle KV block reaper
+    if [[ "$ENGINE" == "vllm-disagg" ]]; then
+        echo "[BENCH] Cooldown: waiting 10s for idle KV block reaper..."
+        sleep 10
+    fi
 done
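Note on the sizing rules in `bench.sh`: the script splits an `x`-separated concurrency list, multiplies each entry by `num_prompts_multiplier` with a floor of 16 prompts, and picks a default request rate per engine. The sketch below isolates that logic so it can be run without a server. `plan_rounds` is a hypothetical name, and taking the values as function arguments instead of positional script parameters is a simplification.

```shell
#!/bin/bash
# Hypothetical helper that mirrors bench.sh's sizing rules without launching
# anything. Arguments: engine, concurrency list, multiplier, request rate.
plan_rounds() {
    local engine=$1 concurrency_list=$2 multiplier=${3:-10} req_rate=${4:-}

    # Default request rate depends on the engine, as in the diff.
    if [[ -z "$req_rate" ]]; then
        if [[ "$engine" == "vllm-disagg" ]]; then
            req_rate=inf
        else
            req_rate=1
        fi
    fi

    # "8x16x32" -> (8 16 32); IFS is changed only for this read.
    local -a concurrencies
    IFS='x' read -r -a concurrencies <<< "$concurrency_list"

    local c num_prompts
    for c in "${concurrencies[@]}"; do
        num_prompts=$(( c * multiplier ))
        # Same floor as the diff: never fewer than 16 prompts per round.
        if (( num_prompts < 16 )); then
            num_prompts=16
        fi
        echo "conc=$c prompts=$num_prompts rate=$req_rate"
    done
}

plan_rounds vllm-disagg "8x16x512"
plan_rounds sglang-disagg "1x4" 2
```

With the default multiplier of 10 the floor only matters for a concurrency of 1; it applies more often when a smaller multiplier is passed.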