Skip to content

gpt-oss-fp4-mi355x: pin to v0.19 + switch to AITER-env-based recipe#1531

Draft
xiaohuguo2023 wants to merge 5 commits into
mainfrom
gptoss-mi355x-pin-v019-aiter-recipe
Draft

gpt-oss-fp4-mi355x: pin to v0.19 + switch to AITER-env-based recipe#1531
xiaohuguo2023 wants to merge 5 commits into
mainfrom
gptoss-mi355x-pin-v019-aiter-recipe

Conversation

@xiaohuguo2023
Copy link
Copy Markdown
Collaborator

@xiaohuguo2023 xiaohuguo2023 commented May 20, 2026

  • Pinning back to vllm 0.19.0 (was bumped to 0.21.0 in Update gptoss-fp4-mi355x-vllm vLLM ROCm image to v0.21.0 #1406). vllm 0.21 hits
    a ~10% regression on MI355x gpt-oss-fp4 — bisected to ~4% from the new ROCm
    fuse_allreduce_rms pass (recoverable via --compilation-config) and ~6%
    from the ROCm 7.0201 → 7.0202 patch in the v0.21 image (not launcher-fixable).
    Staying on 0.19 while we sort both. Heads-up: Klaud Cold will try to re-bump
    on its next cron.

  • Also reworked the launcher: dropped AMDGCN_USE_BUFFER_OPS,
    VLLM_ROCM_USE_AITER_TRITON_ROPE, the explicit --attention-backend,
    fuse_rope_kvcache, and use_inductor_graph_partition. Added the AITER env
    vars (VLLM_ROCM_USE_AITER_MOE/RMSNORM/UNIFIED_ATTENTION/FUSED_MOE_A16W4 +
    HSA_NO_SCRATCH_RECLAIM=1), plus --max-num-seqs 256 and --async-scheduling.

Local numbers (single MI355x, v0.19.0 image, 3 runs each, median):

TP ISL/OSL conc dashboard this PR Δ
1 1k/1k 4 1653 1662 +0.5%
1 1k/1k 8 2668 2787 +4.5%
1 1k/1k 16 4189 4549 +8.6%
1 1k/1k 32 6275 6785 +8.1%
1 1k/1k 64 8812 10573 +20.0%
1 1k/1k 128 13186 16067 +21.8%
4 1k/1k 4 1203 2498 +107.6%
4 1k/1k 8 2100 4253 +102.5%
8 1k/1k 4 2152 2505 +16.4%
8 1k/1k 8 3950 4678 +18.4%
8 1k/1k 16 6668 8571 +28.5%
1 8k/1k 4 5974 7222 +20.9%
1 8k/1k 8 9210 11154 +21.1%
1 8k/1k 16 14499 16663 +14.9%
1 8k/1k 32 20528 23445 +14.2%
1 8k/1k 64 29448 32071 +8.9%
1 8k/1k 128 38532 39333 +2.1%
4 8k/1k 4 4826 10520 +118.0%
8 8k/1k 4 7966 10602 +33.1%
8 8k/1k 8 14198 19205 +35.3%

Avg +30.3% across all 20 combos vs the 2026-03-09 dashboard. TP=4 sees the
biggest wins (~100%+).


Note

Low Risk
Benchmark-only image, env, sweep, and launcher changes for one MI355X vLLM config; no production app or security surface.

Overview
Pins gptoss-fp4-mi355x-vllm back to vllm/vllm-openai-rocm:v0.19.0 (reverting the v0.21.0 bump) and widens the fixed-seq-len sweep with TP=2 and higher TP=4 concurrency ranges on 1k/1k and 8k/1k.

The MI355X launcher (gptoss_fp4_mi355x.sh) moves to an AITER env-var recipe (MOE, RMSNorm, unified attention, fused MoE A16W4, HSA_NO_SCRATCH_RECLAIM), drops the old MEC-firmware gate, buffer-ops, explicit attention backend, and fuse_rope_kvcache compile flags, and adds --max-num-seqs 256 and --async-scheduling.

A discarded pre-flight warmup (CONC×5 prompts at full concurrency) runs before the measured benchmark so AITER Triton autotune does not under-report cold-cache CI throughput. perf-changelog.yaml documents the change.

Reviewed by Cursor Bugbot for commit e658541. Bugbot is set up for automated code reviews on this repo. Configure here.

@github-actions
Copy link
Copy Markdown
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes is similar to the official vLLM recipes and/or the SGLang cookbook

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

Comment thread perf-changelog.yaml Outdated
@chunfangamd
Copy link
Copy Markdown
Collaborator

@claude /sweep test-config --config-files .github/configs/amd-master.yaml --config-keys gptoss-fp4-mi355x-vllm

@Klaud-Cold
Copy link
Copy Markdown
Collaborator

Klaud-Cold commented May 20, 2026

Claude finished @chunfangamd's task in 0s —— View job


I'll analyze this and get back to you.

Copy link
Copy Markdown
Collaborator

@chunfangamd chunfangamd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@github-actions
Copy link
Copy Markdown
Contributor

Copy link
Copy Markdown
Collaborator

@functionstackx functionstackx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can u instead fix it in vllm upstream & switch to using nightly and then switch to that? what are the in progress PRs to fix it in vllm & rocm?

regression from ROCm 7.0201 → 7.0202
regression from v0.19 -> v0.21.0

fuse_allreduce_rms pass (recoverable via --compilation-config)

add it to recipes? https://recipes.vllm.ai/openai/gpt-oss-120b?hardware=mi355x

@chunfangamd
Copy link
Copy Markdown
Collaborator

@functionstackx There seems to be a misunderstanding. If we check the script changes, we can find that this PR is not intended to address the regression issue; that fix is separate and currently ongoing. @xiaohuguo2023 based his work on version 0.19.0 and has also applied it to the latest image. However, the performance gain in the latest version is lower than that of version 0.19.0, which was anticipated.

Absolutely, we should update the recipe page

@chunfangamd
Copy link
Copy Markdown
Collaborator

@xiaohuguo2023, could you please double-check the performance gain? It seems not to align with the local test.

@functionstackx
Copy link
Copy Markdown
Collaborator

and currently ongoing. @xiaohuguo2023 based his work on version 0.19.0 and has also applied it to the latest image. However, the performance gain in the latest version is lower than that of version 0.19.0, which was anticipated.

what is the vLLM PRs that fixes this in upstream HEAD

@github-actions
Copy link
Copy Markdown
Contributor

@xiaohuguo2023 xiaohuguo2023 force-pushed the gptoss-mi355x-pin-v019-aiter-recipe branch from 88e13ec to 1b98226 Compare May 21, 2026 12:50
@chunfangamd
Copy link
Copy Markdown
Collaborator

/sweep test-config --config-files .github/configs/amd-master.yaml --config-keys gptoss-fp4-mi355x-vllm

@github-actions
Copy link
Copy Markdown
Contributor

@chunfangamd Kicking off a sweep.

Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/26227187638
Command: test-config --config-files .github/configs/amd-master.yaml --config-keys gptoss-fp4-mi355x-vllm
Pinned ref: 1b98226
Approval: not required (trusted collaborator).

@github-actions
Copy link
Copy Markdown
Contributor

@chunfangamd chunfangamd marked this pull request as draft May 23, 2026 17:45
@xiaohuguo2023
Copy link
Copy Markdown
Collaborator Author

/sweep test-config --config-files .github/configs/amd-master.yaml --config-keys gptoss-fp4-mi355x-vllm

@github-actions
Copy link
Copy Markdown
Contributor

@xiaohuguo2023 Kicking off a sweep.

Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/26358364428
Command: test-config --config-files .github/configs/amd-master.yaml --config-keys gptoss-fp4-mi355x-vllm
Pinned ref: 1b98226
Approval: not required (trusted collaborator).

@chunfangamd chunfangamd marked this pull request as ready for review May 25, 2026 06:36
@chunfangamd
Copy link
Copy Markdown
Collaborator

/sweep test-config --config-files .github/configs/amd-master.yaml --config-keys gptoss-fp4-mi355x-vllm

@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 25, 2026

Claude finished @chunfangamd's task in 1m 33s —— View job


PR Review: gpt-oss-fp4-mi355x pin to v0.19 + AITER recipe

  • Gather context and read changed files
  • Validate master config changes
  • Validate benchmark script style and correctness
  • Validate perf-changelog entry
  • Post review findings

Summary

One blocking issue found: the new perf-changelog.yaml entry landed in the middle of the file (line ~3086 of 3147) instead of at the end, because the Merge branch 'main' commit brought in newer entries that now sit after it. Move the block to the end of the file to restore chronological order.

Everything else looks good:

  • Config: Image pin to v0.19.0 is valid, model prefix gptoss is correct, search-space extensions (TP=2 added, TP=4 conc-end widened) are reasonable.
  • Benchmark script: vllm serve args are properly formatted on separate lines. The pre-flight warmup pass is well-motivated (AITER Triton JIT cold-cache bias) and correctly discards results to /tmp/. HSA_NO_SCRATCH_RECLAIM=1 is now unconditional (previously gated on MEC FW version) — intentional per the AITER recipe.
  • Image accessibility: vllm/vllm-openai-rocm:v0.19.0 is a public Docker Hub image.
  • PR link: perf-changelog.yaml correctly references pull/1531.

@github-actions
Copy link
Copy Markdown
Contributor

@chunfangamd Kicking off a sweep.

Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/26387067776
Command: test-config --config-files .github/configs/amd-master.yaml --config-keys gptoss-fp4-mi355x-vllm
Pinned ref: 6b98c26
Approval: not required (trusted collaborator).

Comment thread perf-changelog.yaml Outdated
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 6b98c26. Configure here.

--num-prompts "$((CONC * 5))" \
--max-concurrency "$CONC" \
--result-filename "warmup_$RESULT_FILENAME" \
--result-dir /tmp/ > /dev/null 2>&1 || true
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Warmup failures silently ignored

Medium Severity

The new pre-flight run_benchmark_serving warmup discards all output and ends with || true, so warmup errors never fail the job. The measured benchmark can still write RESULT_FILENAME.json while kernels and compile caches are cold, understating throughput—the issue this warmup was meant to fix.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 6b98c26. Configure here.

--num-prompts "$((CONC * 5))" \
--max-concurrency "$CONC" \
--result-filename "warmup_$RESULT_FILENAME" \
--result-dir /tmp/ > /dev/null 2>&1 || true
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eval-only still runs warmup

Medium Severity

The new pre-flight run_benchmark_serving warmup always runs, but run_benchmark_serving in benchmark_lib.sh skips the measured throughput pass when EVAL_ONLY is true. Eval-only workflow jobs still load the server with CONC*5 prompts before eval, wasting time and skewing GPU state versus eval-only intent.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 6b98c26. Configure here.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The bot's claim is incorrect. Looking at benchmark_lib.sh lines 188-192:

  run_benchmark_serving() {
      # In eval-only mode, skip the throughput benchmark entirely.
      if [ "${EVAL_ONLY}" = "true" ]; then
          echo "EVAL_ONLY mode: skipping throughput benchmark"
          return 0
      fi
      ...
  }

Copy link
Copy Markdown
Contributor

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM — benchmark recipe update (image pin + AITER env vars + sweep-space expansion); only finding is a nit on the warmup-comment wording.

Extended reasoning...

Overview

This PR is scoped to the gpt-oss-fp4-mi355x-vllm benchmark recipe: it pins the vLLM image back from v0.21.0 to v0.19.0 (regression hedge), switches benchmarks/single_node/gptoss_fp4_mi355x.sh to an AITER-env-based recipe (MOE/RMSNORM/UnifiedAttn/A16W4 + HSA_NO_SCRATCH_RECLAIM=1), drops legacy launcher flags, adds --max-num-seqs 256 and --async-scheduling, adds a pre-flight warmup pass, and expands the search-space in .github/configs/amd-master.yaml (adds tp=2 and widens tp=4). A matching perf-changelog.yaml entry is included, and pr-link was correctly substituted to #1531 after my prior comment.

Security risks

None. Changes are confined to benchmark configuration, a launcher shell script, and a human-readable changelog. No auth, crypto, permissions, network, or user-input handling code is touched.

Level of scrutiny

Low. This is benchmark/CI infrastructure (the launcher is invoked by the sweep system, not production serving), and the substantive choices (image pin rationale, AITER flag set, sweep grid) are domain-specific tuning decisions that the AMD recipe owners have already iterated on in-thread — chunfangamd and xiaohuguo2023 have run multiple /sweep test-config runs against this branch.

Other factors

The only finding from the bug-hunting pass is a [Nit] inline comment noting the warmup-pass rationale mis-describes the underlying tool — purely a maintainer-readability concern, no functional impact. My earlier review's blocker (literal XXX placeholder in the pr-link) has been resolved in the current diff. Sweep runs have been kicked off by the recipe owners, which is the appropriate validation for this kind of change.

Comment on lines +62 to +69
# per-shape Triton JIT autotune and torch.compile cache misses bias the first
# ~20-40s of the real benchmark window low (CI cold-cache measurements showed
# a ~10% under-report). vllm bench's built-in --num-warmups defaults to 16
# serial prompts, which doesn't exercise the full-concurrency kernel variants.
run_benchmark_serving \
--model "$MODEL" \
--port "$PORT" \
--backend vllm \
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 The new pre-flight warmup block's comment (and the matching perf-changelog.yaml entry for PR #1531) describes the wrong tool: the script doesn't invoke upstream vllm bench, the wrapper hard-codes --num-warmups to 2 * max_concurrency (not 16), and that warmup is dispatched concurrently via asyncio.gather under an asyncio.Semaphore(max_concurrency) — so it's not serial. The pre-flight pass itself is harmless, but the rationale is misleading for future maintainers who might try to consolidate by tuning the wrapper's --num-warmups. Suggest rewording to describe what's actually happening: the wrapper's existing 2*CONC concurrent warmup is too short for AITER's first-shape Triton JIT autotune to converge, so an additional CONC*5-prompt pass at full concurrency is added so the measured window starts at steady state.

Extended reasoning...

What the comment claims vs. what the code does\n\nThe new comment in benchmarks/single_node/gptoss_fp4_mi355x.sh (lines 62-69) and the third bullet of the new perf-changelog.yaml entry justify the pre-flight pass with:\n\n> "vllm bench's built-in --num-warmups defaults to 16 serial prompts, which doesn't exercise the full-concurrency kernel variants."\n\nThis sentence is wrong on three independent points when checked against the code this PR actually runs.\n\n### 1. The script doesn't use upstream vllm bench\n\nThe new pre-flight block calls run_benchmark_serving, which is sourced from benchmarks/benchmark_lib.sh. At benchmark_lib.sh:344-364 that helper invokes the in-repo benchmark:\n\nbash\nlocal benchmark_cmd=(\n python3 "$workspace_dir/utils/bench_serving/benchmark_serving.py"\n ...\n)\n\n\nIt never shells out to vllm bench. So the "vllm bench's built-in default" framing is referring to a tool that isn't on the codepath.\n\n### 2. The wrapper hard-codes --num-warmups to 2*max_concurrency, not 16\n\nbenchmark_lib.sh:360 adds:\n\nbash\n--num-warmups "$((2 * max_concurrency))"\n\n\nfor every call. For CONC=64 that's already 128 warmup requests; for CONC=128 it's 256. The in-repo benchmark_serving.py argparse default is in fact 0 (parser.add_argument('--num-warmups', type=int, default=0) at line 1235) — neither 16 nor anything else. So the 16-prompt figure does not correspond to either the wrapper-set value or the underlying default.\n\n### 3. The warmup is concurrent, not serial\n\nutils/bench_serving/benchmark_serving.py:507-522:\n\npython\nif num_warmups > 0:\n ...\n warmup_semaphore = asyncio.Semaphore(max_concurrency) if max_concurrency else None\n async def warmup_limited_req_fn():\n if warmup_semaphore is None:\n return await request_func(request_func_input=test_input, pbar=warmup_pbar)\n async with warmup_semaphore:\n return await request_func(request_func_input=test_input, pbar=warmup_pbar)\n warmup_tasks = []\n for _ in range(num_warmups):\n task = asyncio.create_task(warmup_limited_req_fn())\n warmup_tasks.append(task)\n _ = await asyncio.gather(*warmup_tasks)\n\n\nAll num_warmups tasks are created up front and awaited via asyncio.gather, gated only by an asyncio.Semaphore(max_concurrency). The warmup therefore already runs at full max_concurrency — exactly the "full-concurrency kernel variants" the comment claims it doesn't exercise.\n\n### Step-by-step proof at CONC=64\n\n1. The sweep invokes gptoss_fp4_mi355x.sh with CONC=64.\n2. The script reaches the new pre-flight block and calls run_benchmark_serving ... --num-prompts $((64*5))=320 --max-concurrency 64.\n3. Inside run_benchmark_serving (benchmark_lib.sh:344-364), this becomes python3 utils/bench_serving/benchmark_serving.py ... --num-warmups $((2*64))=128 --max-concurrency 64 .... Note: vllm bench is never invoked.\n4. benchmark_serving.py constructs warmup_semaphore = asyncio.Semaphore(64), schedules 128 warmup tasks via asyncio.create_task, and asyncio.gathers them. So the warmup-before-the-warmup is already 128 requests at concurrency 64.\n5. The same applies to the real benchmark run that follows it (also at concurrency 64).\n\nNet effect: the comment's three concrete claims (the tool name, the "16" default, and "serial") all disagree with what the code does.\n\n### Impact\n\nFunctionally none — the extra pre-flight pass is harmless and may genuinely help (e.g., it samples fresh random prompts rather than re-issuing the single test_input that benchmark_serving.py uses for warmup). But the documented reason is incorrect on three points, and the same misleading text is copied verbatim into perf-changelog.yaml. A future maintainer trying to consolidate this pre-flight pass by tuning the wrapper's --num-warmups would, following the comment, conclude that the existing warmup is serial and capped at 16 — both of which are false — and could be misled into the wrong intervention.\n\n### Suggested fix\n\nReplace the rationale (in both the script comment and the changelog bullet) with something that matches the code, e.g.:\n\n> The wrapper's existing 2*CONC concurrent warmup is still too short for AITER's first-shape Triton JIT autotune (and torch.compile cache misses) to converge, so we add another CONC*5-prompt pass at full concurrency before the measured window starts.

@github-actions
Copy link
Copy Markdown
Contributor

1 similar comment
@github-actions
Copy link
Copy Markdown
Contributor

Pins the image back to vllm/vllm-openai-rocm:v0.19.0 (was bumped to
v0.21.0 in #1406). v0.21 introduces a ROCm/AITER perf regression on
MI355x for gpt-oss that we're still tracking down; staying on v0.19
in the meantime.

Also rewrites the launcher to enable the AITER kernel paths via env
vars (AITER MOE/RMSNorm/UnifiedAttn/A16W4 + HSA_NO_SCRATCH_RECLAIM=1)
and drops the now-obsolete TRITON_ROPE/BUFFER_OPS/--attention-backend/
fuse_rope_kvcache/use_inductor_graph_partition bits. Also adds
--max-num-seqs 256 and --async-scheduling.
@xiaohuguo2023 xiaohuguo2023 force-pushed the gptoss-mi355x-pin-v019-aiter-recipe branch from 6b98c26 to e658541 Compare May 25, 2026 10:42
@chunfangamd chunfangamd marked this pull request as draft May 25, 2026 10:59
@github-actions
Copy link
Copy Markdown
Contributor

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

4 participants