
Update qwen3.5-bf16-mi355x-sglang and mtp SGLang ROCm image to v0.5.12-rocm720-mi35x-20260517#1443

Open
Klaud-Cold wants to merge 5 commits into main from claude/issue-1154-qwen3.5-bf16-mi355x-sglang-mtp


Conversation

@Klaud-Cold
Collaborator

@Klaud-Cold Klaud-Cold commented May 17, 2026

Summary

  • Update qwen3.5-bf16-mi355x-sglang image from lmsysorg/sglang-rocm:v0.5.10rc0-rocm720-mi35x-20260415 to lmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x-20260517
  • Update qwen3.5-bf16-mi355x-sglang-mtp image from lmsysorg/sglang-rocm:v0.5.10rc0-rocm720-mi35x-20260415 to lmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x-20260517

Ref #1154

Generated with Claude Code

@github-actions
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

Contributor

@claude claude Bot left a comment

LGTM: straightforward SGLang ROCm image bump. Please replace the placeholder in the perf-changelog pr-link with the real PR number before merge.

Extended reasoning...

Overview

This PR bumps the SGLang ROCm image for two configs (qwen3.5-bf16-mi355x-sglang and qwen3.5-bf16-mi355x-sglang-mtp) in .github/configs/amd-master.yaml from v0.5.10rc0-rocm720-mi35x-20260415 to v0.5.12-rocm720-mi35x-20260517, and adds a corresponding entry to perf-changelog.yaml. Two files, four lines of actual change plus a 7-line changelog block.

Security risks

None. This is a benchmark/CI config data file — no executable code, no auth, no secrets. The new image tag points to a vendor-published SGLang ROCm image used by the perf sweep runners; the same image registry (lmsysorg/sglang-rocm) is already in use across many configs in this file.

Level of scrutiny

Low. Image version bumps for benchmark configs are mechanical and follow an established pattern visible throughout this PR's surrounding context (qwen3.5-bf16-mi325x-sglang, glm5-fp8-mi355x-sglang, etc. all live in the same file and get bumped similarly via dedicated PRs — #1394, #1416, #1429 are the immediate predecessors in the changelog).

Other factors

The single issue surfaced by the bug hunter is a cosmetic placeholder (pull/XXX instead of pull/1443) in the changelog pr-link — already attached as an inline comment. It doesn't affect runtime behavior but should be fixed before merge so the changelog back-reference resolves. No CODEOWNER-sensitive paths, no logic changes, and the PR is labeled full-sweep-enabled so CI will exercise the new image.

Comment thread perf-changelog.yaml Outdated
- qwen3.5-bf16-mi355x-sglang-mtp
description:
- "Update SGLang ROCm image from v0.5.10rc0-rocm720-mi35x-20260415 to v0.5.12-rocm720-mi35x-20260517"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX
Contributor

🟡 The new perf-changelog entry added by this PR contains a placeholder pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX at line 2632 instead of the real PR number (1443). Every other recent entry in the file uses the actual PR number, so this should be replaced with /pull/1443 before merge to avoid leaving a broken changelog link.

Extended reasoning...

What the bug is: The new perf-changelog.yaml entry added by this PR (#1443) at lines 2626-2632 documents the SGLang ROCm image bump for qwen3.5-bf16-mi355x-sglang and qwen3.5-bf16-mi355x-sglang-mtp. The pr-link field on line 2632 is set to a literal placeholder string:

  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX

The specific code path that triggers it: This is a static YAML data file (the perf changelog), so the bug manifests not at runtime but the moment the PR is merged. GitHub PR IDs are numeric, so the URL .../pull/XXX can never resolve to a real PR, and any tooling or human reader that follows the link from the changelog entry hits a 404.

Why existing code doesn't prevent it: There is no schema validation or CI check that enforces pr-link references a numeric PR ID — it's a plain string field. The author replaced every other placeholder with concrete values (model names, image tags, dates) but left this one with the boilerplate XXX.

Step-by-step proof:

  1. Open perf-changelog.yaml at line 2632 in this PR's diff.
  2. Observe the value: pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX.
  3. Compare to the immediately preceding entries in the same file:
    • line 2606: pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1423
    • line 2612: pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1429
    • line 2619: pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1416
    • line 2625: pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1394
  4. The PR metadata (and the description, which references issue #1154, "[Auto] Docker Image Updates Available - 2026-04-25") confirms this PR's number is 1443.
  5. After merge, navigating to https://github.com/SemiAnalysisAI/InferenceX/pull/XXX returns 404, so the changelog entry has no working back-reference.

Impact: Cosmetic / documentation only — no runtime behavior is affected, no benchmark is impacted. But the perf-changelog exists specifically so future readers can trace any image/version bump back to the PR that introduced it; a placeholder link defeats that purpose for this entry.

How to fix: Replace XXX with 1443 on line 2632 before merge:

  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1443
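
A possible guard against this class of mistake, sketched under assumptions: the comment above notes that no CI check enforces a numeric PR ID in pr-link. The snippet below is a minimal illustration of what such a check could look like. The function name `find_bad_pr_links` and the regex are hypothetical and not part of this repository; it scans raw text line by line rather than parsing YAML so it needs only the standard library.

```python
import re

# Assumed convention (not confirmed by the repo): every pr-link value
# is a GitHub pull-request URL ending in a numeric ID.
PR_LINK_RE = re.compile(r"^https://github\.com/[\w.-]+/[\w.-]+/pull/\d+$")


def find_bad_pr_links(changelog_text):
    """Return (line_number, value) pairs for pr-link values that are not numeric PR URLs."""
    bad = []
    for lineno, line in enumerate(changelog_text.splitlines(), start=1):
        stripped = line.strip()
        if stripped.startswith("pr-link:"):
            value = stripped[len("pr-link:"):].strip()
            if not PR_LINK_RE.match(value):
                bad.append((lineno, value))
    return bad


# Hypothetical excerpt shaped like the entries quoted in this review.
sample = (
    "  description:\n"
    '    - "Update SGLang ROCm image"\n'
    "  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX\n"
    "  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1429\n"
)

for lineno, value in find_bad_pr_links(sample):
    print(f"line {lineno}: placeholder or malformed pr-link: {value}")
```

Run in CI against perf-changelog.yaml, a non-empty result could fail the job; whether that is worth adding is a maintainer decision.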

@Klaud-Cold
Collaborator Author

e2e test result: SLURM infrastructure failure

Run 26000387400: All single-node and eval jobs failed — SLURM salloc cancelled while waiting for GPU allocation (same mi355x cluster contention). Not related to the image update.

…2-rocm720-mi35x-20260517

Ref #1154

Co-authored-by: Klaud Cold <Klaud-Cold@users.noreply.github.com>
@functionstackx functionstackx force-pushed the claude/issue-1154-qwen3.5-bf16-mi355x-sglang-mtp branch from 7695122 to 5f42a2e Compare May 17, 2026 23:03
Drops the -20260517 nightly suffix so the recipe uses the
lmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x release tag rather than
a date-pinned nightly build.
@functionstackx functionstackx changed the title from "Update qwen3.5-bf16-mi355x-sglang and mtp SGLang ROCm image to v0.5.12-rocm720-mi35x-20260517" to "Update qwen3.5-bf16-mi355x-sglang and mtp SGLang ROCm image to v0.5.12-rocm720-mi35x" on May 17, 2026
Docker Hub does not publish a clean lmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x
release tag — only the dated nightly variant. The earlier switch to the
un-suffixed tag was a mistake (caused 'manifest not found' on every job).

Restoring the dated nightly tag that does exist.
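
A pre-flight check along these lines might catch a 'manifest not found' before jobs are scheduled. This is only a sketch: the helper names (`split_image`, `hub_tag_url`, `tag_exists`) are hypothetical, and the Docker Hub v2 tag endpoint used here is an assumption that should be verified against the Docker Hub API documentation before relying on it.

```python
import urllib.error
import urllib.request


def split_image(image):
    """Split 'namespace/repo:tag' into (namespace, repo, tag)."""
    name, _, tag = image.partition(":")
    namespace, _, repo = name.partition("/")
    if not (namespace and repo and tag):
        raise ValueError(f"expected namespace/repo:tag, got {image!r}")
    return namespace, repo, tag


def hub_tag_url(image):
    """Build the (assumed) Docker Hub v2 URL that describes a single tag."""
    namespace, repo, tag = split_image(image)
    return f"https://hub.docker.com/v2/repositories/{namespace}/{repo}/tags/{tag}"


def tag_exists(image, timeout=10):
    """Return True if Docker Hub answers 200 for the tag, False on 404.

    Performs a network request; other HTTP errors are re-raised.
    """
    try:
        with urllib.request.urlopen(hub_tag_url(image), timeout=timeout) as resp:
            return resp.status == 200
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        raise


# Build the lookup URL for the tag this PR restores (no network call here).
print(hub_tag_url("lmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x-20260517"))
```

Calling `tag_exists(...)` on each image in the config before a sweep would need network access from the CI runner, so it is left uncalled in the sketch.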
@functionstackx functionstackx changed the title from "Update qwen3.5-bf16-mi355x-sglang and mtp SGLang ROCm image to v0.5.12-rocm720-mi35x" to "Update qwen3.5-bf16-mi355x-sglang and mtp SGLang ROCm image to v0.5.12-rocm720-mi35x-20260517" on May 18, 2026
# Conflicts:
#	perf-changelog.yaml