
[AMD][MI355X] add the kimik2.5_int4_mi355x_vllm-disagg support for AMD GPU. #1581

Open
haic0 wants to merge 2 commits into main from haic0/kimik2.5-int4-mi355x-vllm-disagg

Conversation

@haic0
Collaborator

@haic0 haic0 commented May 28, 2026

Summary

  • Add Kimi-K2.5 INT4 to the MI355X vLLM disaggregated multi-node model registry.
  • Add a matching MI355X vLLM-disagg multi-node config and launcher following the Kimi-K2.5 INT4 topology.
    - Add Kimi-K2.5-INT4 disagg inference configs (1P2D)
    - Consolidate amd_utils to support both sglang and vllm disagg engines

Test plan

  • Parsed edited YAML files with python3/yaml.
  • Generated the new multi-node matrix config with generate_sweep_configs.py.
    - Verify CI passes on amd-master config with vllm-disagg entries
    - Validate multi-node sglang benchmarks still work (no regressions from amd_utils refactor)
    - Run vllm-disagg multi-node benchmark on MI355X cluster
    - Confirm Kimi-K2.5 INT4 disagg recipes launch correctly

Note

Low Risk
Benchmark and CI matrix wiring only; no runtime application or auth paths change.

Overview
Adds Kimi-K2.5 INT4 disaggregated prefill–decode benchmarking on MI355X with vLLM (vllm/vllm-openai-rocm:v0.21.0), aligned with the existing Kimi MXFP4 vLLM-disagg 1P+2D layout.

Registers kimik2.5-int4-mi355x-vllm-disagg in amd-master.yaml for 1k1k and 8k1k (conc 8–512): one prefill worker (TP8, EP1) plus two decode workers (TP8, EP8), with PREFILL_NODES/DECODE_NODES and VLLM_MORIIO_CONNECTOR_READ_MODE=1.

Adds Kimi-K2.5-INT4 in models_vllm.yaml (prefill/decode CLI flags, MoRI on decode, INT4 quick-reduce / V2 model-runner env, models--moonshotai--Kimi-K2.5 cache dir) and a new multinode launcher kimik2.5_int4_mi355x_vllm-disagg.sh that maps matrix EP/DP settings and submits via amd_utils/submit.sh. Documents the change in perf-changelog.yaml.

Reviewed by Cursor Bugbot for commit 0aceaf2. Bugbot is set up for automated code reviews on this repo.

@github-actions
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

@haic0 haic0 changed the title from "[Rocm_DI] add the kimik2.5_int4_mi355x_vllm-disagg support for AMD GPU." to "[AMD][MI355X] add the kimik2.5_int4_mi355x_vllm-disagg support for AMD GPU." May 28, 2026
@functionstackx functionstackx force-pushed the haic0/kimik2.5-int4-mi355x-vllm-disagg branch from b2aac29 to 47cb1c7 Compare May 28, 2026 18:59
@functionstackx functionstackx marked this pull request as ready for review May 28, 2026 18:59
@functionstackx functionstackx requested a review from a team May 28, 2026 18:59

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


Reviewed by Cursor Bugbot for commit 47cb1c7.

exit 1
fi

echo "$JOB_ID"

New launcher script is verbatim duplicate of existing scripts

Low Severity

kimik2.5_int4_mi355x_vllm-disagg.sh is a verbatim copy of minimaxm2.5_fp8_mi355x_vllm-disagg.sh and nearly identical to kimik2.5_fp4_mi355x_vllm-disagg.sh (differing only by two comment lines). The sglang-disagg launcher scripts (glm5_fp8, qwen3.5_fp8, dsr1_fp8) also share the same logic. There are now 6+ near-identical disagg launchers. Since none contain model-specific logic (all configuration comes from environment variables and models_vllm.yaml), a single shared script could replace all of them, reducing maintenance burden and risk of inconsistent fixes across copies.
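The duplication claim can be checked mechanically before consolidating. The following is a minimal sketch, not part of the PR: `strip_noise` and `same_logic` are hypothetical helper names, and the check only ignores full-line comments and blank lines, so trailing comments or whitespace changes would still count as differences.

```shell
# Sketch: decide whether two launcher scripts differ only in comments or blank lines.

# Print a file without full-line comments (the shebang included) and empty lines.
strip_noise() {
  grep -Ev '^[[:space:]]*(#|$)' "$1"
}

# Succeed (exit 0) when both files are identical after stripping that noise.
same_logic() {
  [ "$(strip_noise "$1")" = "$(strip_noise "$2")" ]
}

# Usage (paths depend on where the launchers live in the repo):
#   same_logic kimik2.5_int4_mi355x_vllm-disagg.sh kimik2.5_fp4_mi355x_vllm-disagg.sh \
#     && echo "only comments differ"
```

If the check passes for every pair, the scripts carry no model-specific logic and a single shared launcher becomes a safe refactor.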


Comment on lines +33 to +37
Kimi-K2.5-INT4:
  prefill_flags: "--tensor-parallel-size 8 --compilation-config '{\"cudagraph_mode\":\"PIECEWISE\"}' --no-enable-prefix-caching --gpu-memory-utilization 0.9 --mm-encoder-tp-mode data --trust-remote-code"
  decode_flags: "--tensor-parallel-size 8 --all2all-backend mori --compilation-config '{\"cudagraph_mode\":\"PIECEWISE\"}' --no-enable-prefix-caching --gpu-memory-utilization 0.9 --mm-encoder-tp-mode data --trust-remote-code"
  env: "VLLM_ROCM_USE_AITER=1 VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=INT4 VLLM_ROCM_USE_AITER_RMSNORM=0 VLLM_USE_V2_MODEL_RUNNER=1 VLLM_ENGINE_READY_TIMEOUT_S=3600"
  hf_dir: "models--moonshotai--Kimi-K2.5"
Contributor


🔴 The new Kimi-K2.5-INT4 entry in benchmarks/multi_node/amd_utils/models_vllm.yaml sets hf_dir: 'models--moonshotai--Kimi-K2.5' (the BF16 base-model cache), but .github/configs/amd-master.yaml's kimik2.5-int4-mi355x-vllm-disagg uses model: moonshotai/Kimi-K2.5-INT4. Per the HF cache convention used everywhere else in this file (models--<org>--<repo>), and consistent with the sibling INT4 single-node entries which use model: moonshotai/Kimi-K2.5 + runtime VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=INT4, the pair is internally inconsistent — job.slurm searches ${MODEL_DIR}/${DISK_DIR_NAME} first, so on a node where the BF16 cache already exists (very likely, since the sibling single-node INT4 entries use the same BF16 base), the disagg run will silently load BF16 weights and produce meaningless numbers. Fix by either (a) setting hf_dir: 'models--moonshotai--Kimi-K2.5-INT4' (if the INT4 checkpoint is intended), or (b) changing model: to moonshotai/Kimi-K2.5 to match the single-node INT4 recipe.

Extended reasoning...

Bug

In benchmarks/multi_node/amd_utils/models_vllm.yaml (lines 33–37 of the new entry), the Kimi-K2.5-INT4 key declares:

Kimi-K2.5-INT4:
  ...
  env: "... VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=INT4 ..."
  hf_dir: "models--moonshotai--Kimi-K2.5"

The hf_dir points at the BF16 base model's HF cache directory, but the matching .github/configs/amd-master.yaml entry kimik2.5-int4-mi355x-vllm-disagg is keyed to model: moonshotai/Kimi-K2.5-INT4. Every other entry in this file follows the HuggingFace cache convention models--<org>--<repo>, e.g. Kimi-K2.5-MXFP4 → models--amd--Kimi-K2.5-MXFP4 (model amd/Kimi-K2.5-MXFP4) and MiniMax-M2.5 → models--MiniMaxAI--MiniMax-M2.5 (model MiniMaxAI/MiniMax-M2.5). For model moonshotai/Kimi-K2.5-INT4 the canonical hf_dir would be models--moonshotai--Kimi-K2.5-INT4.
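The models--<org>--<repo> convention amounts to replacing each "/" in the repo id with "--" and adding a "models--" prefix. A minimal sketch follows; `hf_cache_dir` is a hypothetical helper name used only for illustration and does not exist in the repository.

```shell
# Map a Hugging Face repo id (org/repo) to its hub cache directory name.
hf_cache_dir() {
  printf 'models--%s\n' "$(printf '%s' "$1" | sed 's|/|--|g')"
}

hf_cache_dir "moonshotai/Kimi-K2.5-INT4"   # models--moonshotai--Kimi-K2.5-INT4
hf_cache_dir "moonshotai/Kimi-K2.5"        # models--moonshotai--Kimi-K2.5
```

The two outputs differ by the -INT4 suffix, which is exactly the gap between the model id in amd-master.yaml and the hf_dir value in models_vllm.yaml.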

How it manifests at runtime

runners/launch_mi355x-amds.sh:37 derives MODEL_NAME=${MODEL##*/}, so for moonshotai/Kimi-K2.5-INT4 it becomes Kimi-K2.5-INT4, which is what job.slurm uses to look the entry up in models_vllm.yaml. In benchmarks/multi_node/amd_utils/job.slurm lines 128–166, the vLLM-disagg branch then:

  1. Extracts DISK_DIR_NAME from the hf_dir field (lines 129–132) → models--moonshotai--Kimi-K2.5.
  2. Searches in this order (lines 149–154):
    • ${MODEL_DIR}/${DISK_DIR_NAME} — the BF16 cache path
    • ${MODEL_DIR}/${MODEL_NAME}
    • /nfsdata/hf_hub_cache-0/${DISK_DIR_NAME}
    • /nfsdata/hf_hub_cache-0/${MODEL_NAME}

The first hit wins. Since the sibling single-node entries kimik2.5-int4-mi{300,325,355}x-vllm use model: moonshotai/Kimi-K2.5 (no -INT4 suffix) with runtime VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=INT4, the BF16 cache under models--moonshotai--Kimi-K2.5 is almost certainly already on disk. The disagg run will resolve to it on entry 1 of SEARCH_PATHS and load BF16 weights silently — no error, no warning, just wrong numbers.
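The first-hit-wins behaviour described above can be reproduced in isolation. This is a sketch under stated assumptions, not the job.slurm source: `resolve_model_path` is a hypothetical stand-in for the real lookup, MODEL_DIR is a throwaway temp directory, and only the first two of the four search locations are simulated.

```shell
# Hypothetical stand-in for the lookup: print the first candidate that exists
# as a directory, or fail if none of them do.
resolve_model_path() {
  for candidate in "$@"; do
    if [ -d "$candidate" ]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1
}

MODEL="moonshotai/Kimi-K2.5-INT4"
MODEL_NAME=${MODEL##*/}                        # strips the org prefix: Kimi-K2.5-INT4
DISK_DIR_NAME="models--moonshotai--Kimi-K2.5"  # hf_dir value from the PR
MODEL_DIR=$(mktemp -d)                         # assumption: stand-in for the real model root

# Simulate a node that holds both the base-model cache and an INT4 directory.
mkdir -p "$MODEL_DIR/$DISK_DIR_NAME" "$MODEL_DIR/$MODEL_NAME"

# Same ordering as the first two search entries: hf_dir path first, model name second.
resolve_model_path "$MODEL_DIR/$DISK_DIR_NAME" "$MODEL_DIR/$MODEL_NAME"
```

Even with an INT4 directory present, the printed path ends in models--moonshotai--Kimi-K2.5, because the hf_dir candidate is checked first.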

Internal inconsistency, even ignoring the cache race

Even if the BF16 cache is absent and the search later fails, the configuration is internally inconsistent. The sibling single-node INT4 configs treat INT4 as a runtime quant of the BF16 checkpoint (env var VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=INT4), so the model identifier is moonshotai/Kimi-K2.5. The new disagg config keeps that same env var but switches the model identifier to moonshotai/Kimi-K2.5-INT4, with no obvious explanation for why an INT4-quantized checkpoint should also need the runtime quant flag. One of the two must be wrong.

Step-by-step proof

  1. CI launches the matrix: launch_mi355x-amds.sh reads model: moonshotai/Kimi-K2.5-INT4 from amd-master.yaml:1525.
  2. MODEL_NAME=${MODEL##*/} → Kimi-K2.5-INT4 (launch_mi355x-amds.sh:37).
  3. job.slurm is invoked, finds Kimi-K2.5-INT4: in models_vllm.yaml, extracts hf_dir → DISK_DIR_NAME=models--moonshotai--Kimi-K2.5.
  4. SEARCH_PATHS[0] = ${MODEL_DIR}/models--moonshotai--Kimi-K2.5 — which is the BF16 cache path.
  5. On any cluster node where the BF16 cache for the sibling single-node INT4 configs is already populated, this path exists.
  6. check_model_path succeeds at index 0; the run boots vLLM against BF16 weights with VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=INT4 set.
  7. The benchmark completes and reports numbers labelled as Kimi-K2.5-INT4 disagg, but the loaded weights are BF16.

Fix

Two alternative fixes; pick one:

  • A — set hf_dir: 'models--moonshotai--Kimi-K2.5-INT4' in models_vllm.yaml, if the intent is to benchmark against an actual INT4 checkpoint at moonshotai/Kimi-K2.5-INT4.
  • B — change model: in amd-master.yaml from moonshotai/Kimi-K2.5-INT4 to moonshotai/Kimi-K2.5 and drop the hf_dir override (or keep hf_dir: 'models--moonshotai--Kimi-K2.5'), matching the existing single-node INT4 recipe that relies on runtime INT4 quant of the BF16 checkpoint.

Given that all three sibling single-node INT4 configs (kimik2.5-int4-mi{300,325,355}x-vllm) consistently use B, that's likely the intended choice — but either way, the pair as currently shipped is broken.
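Whichever option is chosen, a pre-flight guard could catch this class of mismatch before a benchmark runs. The sketch below is hypothetical: `check_hf_dir` is not an existing helper in amd_utils, and it assumes every entry is meant to follow the models--<org>--<repo> convention with no deliberate overrides.

```shell
# Hypothetical pre-flight check: fail when hf_dir does not match the cache
# directory name implied by the model id.
check_hf_dir() {
  model=$1
  hf_dir=$2
  expected="models--$(printf '%s' "$model" | sed 's|/|--|g')"
  if [ "$hf_dir" != "$expected" ]; then
    echo "mismatch: model=$model expects hf_dir=$expected, got $hf_dir" >&2
    return 1
  fi
}

# Usage with the pair from this PR (reports a mismatch on stderr):
#   check_hf_dir "moonshotai/Kimi-K2.5-INT4" "models--moonshotai--Kimi-K2.5"
```

Under fix A or fix B the guard passes, since both make the model id and hf_dir agree.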

haic0 and others added 2 commits May 28, 2026 15:22
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
@functionstackx functionstackx force-pushed the haic0/kimik2.5-int4-mi355x-vllm-disagg branch from 47cb1c7 to 0aceaf2 Compare May 28, 2026 19:22
@arygupt arygupt force-pushed the haic0/kimik2.5-int4-mi355x-vllm-disagg branch from 00f74d8 to 0aceaf2 Compare May 30, 2026 02:47