[AMD][MI355X] add the kimik2.5_int4_mi355x_vllm-disagg support for AMD GPU. by haic0 · Pull Request #1581 · SemiAnalysisAI/InferenceX

haic0 · 2026-05-28T15:28:55Z

Summary

- Add Kimi-K2.5 INT4 to the MI355X vLLM disaggregated multi-node model registry.
- Add a matching MI355X vLLM-disagg multi-node config and launcher following the Kimi-K2.5 INT4 topology.
- Add Kimi-K2.5-INT4 disagg inference configs (1P2D).
- Consolidate amd_utils to support both sglang and vllm disagg engines.

Test plan

- Parsed edited YAML files with python3/yaml.
- Generated the new multi-node matrix config with generate_sweep_configs.py.
- Verify CI passes on amd-master config with vllm-disagg entries.
- Validate multi-node sglang benchmarks still work (no regressions from amd_utils refactor).
- Run vllm-disagg multi-node benchmark on MI355X cluster.
- Confirm Kimi-K2.5 INT4 disagg recipes launch correctly.

Note

Low Risk
Benchmark and CI matrix wiring only; no runtime application or auth paths change.

Overview
Adds Kimi-K2.5 INT4 disaggregated prefill–decode benchmarking on MI355X with vLLM (vllm/vllm-openai-rocm:v0.21.0), aligned with the existing Kimi MXFP4 vLLM-disagg 1P+2D layout.

Registers kimik2.5-int4-mi355x-vllm-disagg in amd-master.yaml for 1k1k and 8k1k (conc 8–512): one prefill worker (TP8, EP1) plus two decode workers (TP8, EP8), with PREFILL_NODES/DECODE_NODES and VLLM_MORIIO_CONNECTOR_READ_MODE=1.

Adds Kimi-K2.5-INT4 in models_vllm.yaml (prefill/decode CLI flags, MoRI on decode, INT4 quick-reduce / V2 model-runner env, models--moonshotai--Kimi-K2.5 cache dir) and a new multinode launcher kimik2.5_int4_mi355x_vllm-disagg.sh that maps matrix EP/DP settings and submits via amd_utils/submit.sh. Documents the change in perf-changelog.yaml.

Reviewed by Cursor Bugbot for commit 0aceaf2. Bugbot is set up for automated code reviews on this repo.

github-actions · 2026-05-28T15:29:08Z

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get a PR approval from their respective company's CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 47cb1c7.

cursor · 2026-05-28T19:05:11Z

+    exit 1
+fi
+
+echo "$JOB_ID"


New launcher script is verbatim duplicate of existing scripts

Low Severity

kimik2.5_int4_mi355x_vllm-disagg.sh is a verbatim copy of minimaxm2.5_fp8_mi355x_vllm-disagg.sh and nearly identical to kimik2.5_fp4_mi355x_vllm-disagg.sh (differing only by two comment lines). The sglang-disagg launcher scripts (glm5_fp8, qwen3.5_fp8, dsr1_fp8) also share the same logic. There are now 6+ near-identical disagg launchers. Since none contain model-specific logic (all configuration comes from environment variables and models_vllm.yaml), a single shared script could replace all of them, reducing maintenance burden and risk of inconsistent fixes across copies.
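Before consolidating, the duplication claim can be checked mechanically by comparing launchers pairwise. The sketch below is a hypothetical helper, not something in the repo: it takes script bodies as strings and scores them with `difflib.SequenceMatcher`. The sample script text is an invented stand-in; real launcher contents would be read from disk.

```python
import difflib
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Return a 0..1 line-level similarity ratio between two script bodies."""
    return difflib.SequenceMatcher(None, a.splitlines(), b.splitlines()).ratio()


def near_duplicates(scripts: dict[str, str], threshold: float = 0.9):
    """Yield (name_a, name_b, ratio) for script pairs at or above threshold."""
    for (name_a, body_a), (name_b, body_b) in combinations(sorted(scripts.items()), 2):
        ratio = similarity(body_a, body_b)
        if ratio >= threshold:
            yield name_a, name_b, ratio


# Invented stand-in for launcher contents (assumption, not the real script).
BODY = 'set -euo pipefail\nJOB_ID=$(bash amd_utils/submit.sh)\necho "$JOB_ID"\n'
scripts = {
    "kimik2.5_int4_mi355x_vllm-disagg.sh": BODY,
    "minimaxm2.5_fp8_mi355x_vllm-disagg.sh": BODY,
    "unrelated.sh": "echo hello\n",
}
pairs = list(near_duplicates(scripts))
```

With real files, lowering `threshold` slightly would also catch copies that differ only by a couple of comment lines, as described for the fp4 launcher.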

claude · 2026-05-28T19:08:35Z

+Kimi-K2.5-INT4:
+  prefill_flags: "--tensor-parallel-size 8 --compilation-config '{\"cudagraph_mode\":\"PIECEWISE\"}' --no-enable-prefix-caching --gpu-memory-utilization 0.9 --mm-encoder-tp-mode data --trust-remote-code"
+  decode_flags: "--tensor-parallel-size 8 --all2all-backend mori --compilation-config '{\"cudagraph_mode\":\"PIECEWISE\"}' --no-enable-prefix-caching --gpu-memory-utilization 0.9 --mm-encoder-tp-mode data --trust-remote-code"
+  env: "VLLM_ROCM_USE_AITER=1 VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=INT4 VLLM_ROCM_USE_AITER_RMSNORM=0 VLLM_USE_V2_MODEL_RUNNER=1 VLLM_ENGINE_READY_TIMEOUT_S=3600"
+  hf_dir: "models--moonshotai--Kimi-K2.5"


🔴 The new Kimi-K2.5-INT4 entry in benchmarks/multi_node/amd_utils/models_vllm.yaml sets hf_dir: 'models--moonshotai--Kimi-K2.5' (the BF16 base-model cache), but the kimik2.5-int4-mi355x-vllm-disagg entry in .github/configs/amd-master.yaml uses model: moonshotai/Kimi-K2.5-INT4. The pair is internally inconsistent: every other entry in this file follows the HF cache convention models--<org>--<repo>, and the sibling single-node INT4 entries use model: moonshotai/Kimi-K2.5 with runtime VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=INT4. Because job.slurm searches ${MODEL_DIR}/${DISK_DIR_NAME} first, on a node where the BF16 cache already exists (very likely, since the sibling single-node INT4 entries use the same BF16 base) the disagg run will silently load BF16 weights and produce meaningless numbers. Fix by either (a) setting hf_dir: 'models--moonshotai--Kimi-K2.5-INT4' (if the INT4 checkpoint is intended), or (b) changing model: to moonshotai/Kimi-K2.5 to match the single-node INT4 recipe.

Extended reasoning...

Bug

In benchmarks/multi_node/amd_utils/models_vllm.yaml (lines 33–37 of the new entry), the Kimi-K2.5-INT4 key declares:

Kimi-K2.5-INT4:
  ...
  env: "... VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=INT4 ..."
  hf_dir: "models--moonshotai--Kimi-K2.5"

The hf_dir points at the BF16 base model's HF cache directory, but the matching .github/configs/amd-master.yaml entry kimik2.5-int4-mi355x-vllm-disagg is keyed to model: moonshotai/Kimi-K2.5-INT4. Every other entry in this file follows the HuggingFace cache convention models--<org>--<repo> — e.g. Kimi-K2.5-MXFP4 → models--amd--Kimi-K2.5-MXFP4 (model amd/Kimi-K2.5-MXFP4), MiniMax-M2.5 → models--MiniMaxAI--MiniMax-M2.5 (model MiniMaxAI/MiniMax-M2.5). For model moonshotai/Kimi-K2.5-INT4 the canonical hf_dir would be models--moonshotai--Kimi-K2.5-INT4.
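The naming convention is mechanical, so the expected hf_dir can be derived from the model id. A minimal sketch, assuming ids always have the form `<org>/<repo>`; the function name `hf_cache_dir` is hypothetical and not taken from the repo:

```python
def hf_cache_dir(repo_id: str) -> str:
    """Map '<org>/<repo>' to the cache folder name 'models--<org>--<repo>'."""
    org, sep, repo = repo_id.partition("/")
    if not sep or not org or not repo or "/" in repo:
        raise ValueError(f"expected '<org>/<repo>', got {repo_id!r}")
    return f"models--{org}--{repo}"


# Model ids quoted in this review, mapped to their conventional cache folders.
derived = {
    repo_id: hf_cache_dir(repo_id)
    for repo_id in (
        "amd/Kimi-K2.5-MXFP4",
        "MiniMaxAI/MiniMax-M2.5",
        "moonshotai/Kimi-K2.5-INT4",
    )
}
```

The last entry derives to `models--moonshotai--Kimi-K2.5-INT4`, which differs from the hf_dir value in the diff.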

How it manifests at runtime

runners/launch_mi355x-amds.sh:37 derives MODEL_NAME=${MODEL##*/}, so for moonshotai/Kimi-K2.5-INT4 it becomes Kimi-K2.5-INT4, which is what job.slurm uses to look the entry up in models_vllm.yaml. In benchmarks/multi_node/amd_utils/job.slurm lines 128–166, the vLLM-disagg branch then:

1. Extracts DISK_DIR_NAME from the hf_dir field (lines 129–132) → models--moonshotai--Kimi-K2.5.
2. Searches in this order (lines 149–154):
   1. ${MODEL_DIR}/${DISK_DIR_NAME} (the BF16 cache path)
   2. ${MODEL_DIR}/${MODEL_NAME}
   3. /nfsdata/hf_hub_cache-0/${DISK_DIR_NAME}
   4. /nfsdata/hf_hub_cache-0/${MODEL_NAME}

The first hit wins. Since the sibling single-node entries kimik2.5-int4-mi{300,325,355}x-vllm use model: moonshotai/Kimi-K2.5 (no -INT4 suffix) with runtime VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=INT4, the BF16 cache under models--moonshotai--Kimi-K2.5 is almost certainly already on disk. The disagg run will resolve to it at the first entry of SEARCH_PATHS (index 0) and silently load BF16 weights: no error, no warning, just wrong numbers.
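The first-hit behaviour can be reproduced in isolation. The sketch below is a Python stand-in for the search described above, not the actual job.slurm logic: the function names are hypothetical, and it assumes a candidate counts as a hit when the directory exists (the real check_model_path may test more than that).

```python
import tempfile
from pathlib import Path


def candidate_paths(model_dir, disk_dir_name, model_name,
                    nfs_cache="/nfsdata/hf_hub_cache-0"):
    """Build the four candidates in the order listed in the review."""
    return [
        Path(model_dir) / disk_dir_name,
        Path(model_dir) / model_name,
        Path(nfs_cache) / disk_dir_name,
        Path(nfs_cache) / model_name,
    ]


def resolve_model_path(paths):
    """Return the first candidate that exists as a directory, else None."""
    for path in paths:
        if path.is_dir():
            return path
    return None


with tempfile.TemporaryDirectory() as model_dir:
    # Simulate a node that holds both the BF16 cache and an INT4 directory.
    (Path(model_dir) / "models--moonshotai--Kimi-K2.5").mkdir()
    (Path(model_dir) / "Kimi-K2.5-INT4").mkdir()
    paths = candidate_paths(
        model_dir, "models--moonshotai--Kimi-K2.5", "Kimi-K2.5-INT4"
    )
    resolved = resolve_model_path(paths)
    resolved_name = resolved.name
    resolved_index = paths.index(resolved)
```

Even with an INT4 directory present, the BF16 cache folder is selected because it is checked first.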

Internal inconsistency, even ignoring the cache race

Even if the BF16 cache is absent and the search later fails, the configuration is internally inconsistent. The sibling single-node INT4 configs treat INT4 as a runtime quant of the BF16 checkpoint (env var VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=INT4), so the model identifier is moonshotai/Kimi-K2.5. The new disagg config keeps that same env var but switches the model identifier to moonshotai/Kimi-K2.5-INT4, with no obvious explanation for why an INT4-quantized checkpoint should also need the runtime quant flag. One of the two must be wrong.

Step-by-step proof

1. CI launches the matrix: launch_mi355x-amds.sh reads model: moonshotai/Kimi-K2.5-INT4 from amd-master.yaml:1525.
2. MODEL_NAME=${MODEL##*/} → Kimi-K2.5-INT4 (launch_mi355x-amds.sh:37).
3. job.slurm is invoked, finds Kimi-K2.5-INT4: in models_vllm.yaml, extracts hf_dir → DISK_DIR_NAME=models--moonshotai--Kimi-K2.5.
4. SEARCH_PATHS[0] = ${MODEL_DIR}/models--moonshotai--Kimi-K2.5, which is the BF16 cache path.
5. On any cluster node where the BF16 cache for the sibling single-node INT4 configs is already populated, this path exists.
6. check_model_path succeeds at index 0; the run boots vLLM against BF16 weights with VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=INT4 set.
7. The benchmark completes and reports numbers labelled as Kimi-K2.5-INT4 disagg, but the loaded weights are BF16.
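The name-derivation part of this chain can be traced without a cluster. The sketch below uses in-memory stand-ins for the two config files, reduced to the fields quoted in this review. `/models` is a hypothetical MODEL_DIR, and `rsplit("/", 1)[-1]` is used as the Python equivalent of the bash expansion `${MODEL##*/}`.

```python
# Stand-ins for amd-master.yaml and models_vllm.yaml (only the quoted fields).
MASTER_ENTRY = {"model": "moonshotai/Kimi-K2.5-INT4"}
MODELS_VLLM = {"Kimi-K2.5-INT4": {"hf_dir": "models--moonshotai--Kimi-K2.5"}}


def model_name(model: str) -> str:
    """Mirror ${MODEL##*/}: keep everything after the last slash."""
    return model.rsplit("/", 1)[-1]


def first_search_path(model: str, registry: dict, model_dir: str) -> str:
    """Follow model id -> registry key -> hf_dir -> first search candidate."""
    name = model_name(model)
    disk_dir_name = registry[name]["hf_dir"]
    return f"{model_dir}/{disk_dir_name}"


# "/models" is an assumed MODEL_DIR used only for illustration.
name = model_name(MASTER_ENTRY["model"])
path = first_search_path(MASTER_ENTRY["model"], MODELS_VLLM, "/models")
```

The resulting path contains no INT4 suffix, which matches step 4.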

Fix

Two alternative fixes; pick one:

A — set hf_dir: 'models--moonshotai--Kimi-K2.5-INT4' in models_vllm.yaml, if the intent is to benchmark against an actual INT4 checkpoint at moonshotai/Kimi-K2.5-INT4.

B — change model: in amd-master.yaml from moonshotai/Kimi-K2.5-INT4 to moonshotai/Kimi-K2.5 and drop the hf_dir override (or keep hf_dir: 'models--moonshotai--Kimi-K2.5'), matching the existing single-node INT4 recipe that relies on runtime INT4 quant of the BF16 checkpoint.

Given that all three sibling single-node INT4 configs (kimik2.5-int4-mi{300,325,355}x-vllm) consistently use B, that's likely the intended choice — but either way, the pair as currently shipped is broken.
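Either option can be guarded by a small consistency check over (model, hf_dir) pairs. This is a hypothetical lint sketch, not an existing CI step; it assumes the models--<org>--<repo> convention applies to every entry.

```python
def expected_hf_dir(model: str) -> str:
    """Conventional cache folder for a '<org>/<repo>' model id."""
    return "models--" + model.replace("/", "--")


def is_consistent(model: str, hf_dir: str) -> bool:
    """True when hf_dir matches the folder implied by the model id."""
    return hf_dir == expected_hf_dir(model)


# (model, hf_dir) pairs: as shipped in this PR, and under each proposed fix.
cases = {
    "as_shipped": ("moonshotai/Kimi-K2.5-INT4", "models--moonshotai--Kimi-K2.5"),
    "fix_a": ("moonshotai/Kimi-K2.5-INT4", "models--moonshotai--Kimi-K2.5-INT4"),
    "fix_b": ("moonshotai/Kimi-K2.5", "models--moonshotai--Kimi-K2.5"),
}
results = {label: is_consistent(*pair) for label, pair in cases.items()}
```

The check only covers the model/hf_dir pairing. Under option B, MODEL_NAME would derive to Kimi-K2.5, so the registry key in models_vllm.yaml would presumably need to match as well; that is not verified here.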

github-actions · 2026-05-28T19:25:00Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26596955818
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26596955818

github-actions · 2026-05-30T01:34:29Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26666418976
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26666418976

haic0 self-assigned this May 28, 2026

github-project-automation Bot added this to InferenceMAX Board May 28, 2026

haic0 changed the title ~~[Rocm_DI] add the kimik2.5_int4_mi355x_vllm-disagg support for AMD GPU.~~ [AMD][MI355X] add the kimik2.5_int4_mi355x_vllm-disagg support for AMD GPU. May 28, 2026

functionstackx force-pushed the haic0/kimik2.5-int4-mi355x-vllm-disagg branch from b2aac29 to 47cb1c7 Compare May 28, 2026 18:59

functionstackx marked this pull request as ready for review May 28, 2026 18:59

functionstackx requested a review from a team May 28, 2026 18:59

functionstackx requested review from 1am9trash, billishyahao, chunfangamd, seungrokj and yctseng0211 as code owners May 28, 2026 18:59

functionstackx added the sweep-enabled label May 28, 2026

cursor Bot reviewed May 28, 2026

claude Bot reviewed May 28, 2026

haic0 and others added 2 commits May 28, 2026 15:22

[AMD] Add Kimi K2.5 INT4 MI355X vLLM disagg

88e318c

Co-authored-by: Cursor <cursoragent@cursor.com>

[AMD] Trigger Kimi K2.5 INT4 MI355X sweep

0aceaf2

Co-authored-by: Cursor <cursoragent@cursor.com>

functionstackx force-pushed the haic0/kimik2.5-int4-mi355x-vllm-disagg branch from 47cb1c7 to 0aceaf2 Compare May 28, 2026 19:22

This was referenced May 29, 2026

fix(process_result): fail loudly on zero-throughput disagg runs (no more masked ZeroDivisionError) #1590

Open

ci(disagg): fail before writing result file + surface real failure class #1591

Open

arygupt force-pushed the haic0/kimik2.5-int4-mi355x-vllm-disagg branch from 00f74d8 to 0aceaf2 Compare May 30, 2026 02:47

haic0 wants to merge 2 commits into main from haic0/kimik2.5-int4-mi355x-vllm-disagg

2 participants