Update glm5.1-fp4-mi355x-sglang SGLang ROCm image to v0.5.12-rocm720-mi35x-20260517 #1441

Open
Klaud-Cold wants to merge 7 commits into main from claude/issue-1154-glm5.1-fp4-mi355x-sglang

@Klaud-Cold
Collaborator

@Klaud-Cold Klaud-Cold commented May 17, 2026

Summary

  • Update glm5.1-fp4-mi355x-sglang image from lmsysorg/sglang-rocm:v0.5.10rc0-rocm720-mi35x-20260415 to lmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x-20260517

Ref #1154

Generated with Claude Code

…mi35x-20260517

Ref #1154

Co-authored-by: Klaud Cold <Klaud-Cold@users.noreply.github.com>
@github-actions
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single-node PR into the master branch. Let's ensure that the documentation is first-class so that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

Contributor

@claude claude Bot left a comment

LGTM — straightforward SGLang ROCm image tag bump for glm5.1-fp4-mi355x-sglang with a corresponding changelog entry.

Extended reasoning...

Overview

This PR is a single-line image tag bump for the glm5.1-fp4-mi355x-sglang config in .github/configs/amd-master.yaml (v0.5.10rc0-rocm720-mi35x-20260415 → v0.5.12-rocm720-mi35x-20260517) plus a matching new entry appended to perf-changelog.yaml. No code or runtime logic changes.
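
To make the shape of the bumped tag concrete, here is a minimal sketch that splits such a tag into its parts. The field names (`version`, `rocm`, `arch`, `date`) and the regex are assumptions inferred only from the two tags quoted in this PR; they are not a documented SGLang naming scheme.

```python
import re
from typing import NamedTuple, Optional


class RocmImageTag(NamedTuple):
    """Hypothetical breakdown of an sglang-rocm image tag."""
    version: str          # e.g. "0.5.12" or "0.5.10rc0"
    rocm: str             # e.g. "720"
    arch: str             # e.g. "mi35x"
    date: Optional[str]   # e.g. "20260517", or None when no date suffix


# Assumed pattern: v<version>-rocm<digits>-<arch>[-<YYYYMMDD>]
_TAG_RE = re.compile(
    r"^v(?P<version>[0-9][0-9A-Za-z.]*)"
    r"-rocm(?P<rocm>\d+)"
    r"-(?P<arch>[a-z0-9]+)"
    r"(?:-(?P<date>\d{8}))?$"
)


def parse_tag(tag: str) -> RocmImageTag:
    """Split a tag into its parts, raising ValueError on an unknown shape."""
    m = _TAG_RE.match(tag)
    if m is None:
        raise ValueError(f"unrecognized tag shape: {tag!r}")
    return RocmImageTag(m["version"], m["rocm"], m["arch"], m["date"])


old = parse_tag("v0.5.10rc0-rocm720-mi35x-20260415")
new = parse_tag("v0.5.12-rocm720-mi35x-20260517")
# Only the SGLang version and the build date differ between the two tags.
changed = [f for f in RocmImageTag._fields if getattr(old, f) != getattr(new, f)]
```

Under these assumptions, `changed` contains only `version` and `date`, which matches the description of the change as a pure image bump on the same ROCm release and GPU architecture.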

Security risks

None. The change is data-only YAML, just selecting a different upstream container image tag. No auth, crypto, secrets, or permission surfaces are touched.

Level of scrutiny

Low. This is the standard recurring pattern for benchmark image bumps already established across many sibling configs in the same file (e.g., the qwen3.5 and glm5 entries use the same image-tag shape). It's a mechanical, easy-to-revert change.

Other factors

The bug hunting system flagged one nit — the new perf-changelog entry's pr-link still contains the literal XXX placeholder instead of the PR number. That's a documentation traceability issue only and is captured as an inline comment; it doesn't affect the runtime correctness of the bump itself, so I'm comfortable approving rather than blocking.

Comment thread perf-changelog.yaml Outdated
- glm5.1-fp4-mi355x-sglang
description:
- "Update SGLang ROCm image from v0.5.10rc0-rocm720-mi35x-20260415 to v0.5.12-rocm720-mi35x-20260517"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX
Contributor

🟡 The new perf-changelog.yaml entry has pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX — the XXX placeholder was never substituted with the real PR number. This link will 404 and breaks changelog traceability; replace XXX with 1441 before merging.

Extended reasoning...

What the bug is

Line 2631 of perf-changelog.yaml (the newly added entry in this PR) contains:

  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX

The XXX is a literal placeholder left over from the changelog-entry template and was not substituted with the actual PR number.

How it manifests

Anyone clicking the link from the perf-changelog (or any tooling that follows pr-link URLs to attribute changes to a PR) will hit https://github.com/SemiAnalysisAI/InferenceX/pull/XXX, which 404s — there is no PR numbered XXX. The changelog loses its traceability for this entry: a reader can no longer jump from the SGLang ROCm image bump (v0.5.10rc0 → v0.5.12) back to the PR that introduced it.

Why surrounding code didn't catch it

The file is data-only YAML; there is no schema validator that asserts pr-link resolves to an existing PR, and no pre-commit/CI hook that rejects literal XXX placeholders. Every nearby entry (PRs 1394, 1416, 1423, 1429) correctly uses a numeric PR id in the same URL shape, so the convention is unambiguous — this one was just a missed find-and-replace.
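
A check of the kind described as missing could look roughly like the sketch below. It is only an illustration: the function name `find_bad_pr_links` and both regexes are hypothetical, and the check verifies the numeric URL shape only, not that the PR actually exists.

```python
import re
from typing import List, Tuple

# Matches a `pr-link: <url>` line at any indentation.
PR_LINK_RE = re.compile(r"^\s*pr-link:\s*(?P<url>\S+)\s*$")
# Assumed convention: https://github.com/<org>/<repo>/pull/<digits>
NUMERIC_PR_RE = re.compile(r"^https://github\.com/[^/\s]+/[^/\s]+/pull/\d+$")


def find_bad_pr_links(text: str) -> List[Tuple[int, str]]:
    """Return (line number, url) for every pr-link without a numeric PR id."""
    bad = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        m = PR_LINK_RE.match(line)
        if m and not NUMERIC_PR_RE.match(m["url"]):
            bad.append((lineno, m["url"]))
    return bad


sample = """\
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1394
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX
"""
bad_links = find_bad_pr_links(sample)
```

Run over the two-line sample, only the second entry is reported. A pre-commit or CI hook could fail whenever the returned list is non-empty.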

Step-by-step proof

  1. Open this PR's metadata — the PR number is #1441.
  2. Open the diff for perf-changelog.yaml (the only changelog change in this PR). The new entry ends with pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX.
  3. XXX is not a number, so the URL points to no real GitHub resource. Fetching https://github.com/SemiAnalysisAI/InferenceX/pull/XXX will return GitHub's standard 404 page.
  4. Compare to the immediately preceding entry in the same diff context (the vLLM v0.15.1→v0.20.2 bump for gptoss-fp4-b200-vllm), which correctly carries pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1394 — confirming the convention is a numeric PR id, not a placeholder.

Fix

Replace pull/XXX with pull/1441 on line 2631:

  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1441

Impact

Documentation-only — no runtime behavior is affected, and no automation in the repo appears to parse pr-link strictly. But the link is the changelog's only pointer back to the source change, so leaving it broken degrades the changelog's auditability. Trivially fixable before merge, hence severity nit.

@Klaud-Cold
Collaborator Author

e2e test result: SLURM infrastructure failure

Run 26000058779: All 4 single-node jobs failed — SLURM salloc cancelled while waiting for GPU allocation (same mi355x cluster contention seen across all dispatches today). Not related to the image update.

claude-fix-bot and others added 2 commits May 17, 2026 19:03
Keep main's perf-changelog.yaml and append glm5.1-fp4-mi355x-sglang
entry at tail. amd-master.yaml auto-merged cleanly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drops the -20260517 nightly suffix so the recipe uses the
lmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x release tag rather than
a date-pinned nightly build.
@functionstackx functionstackx changed the title from "Update glm5.1-fp4-mi355x-sglang SGLang ROCm image to v0.5.12-rocm720-mi35x-20260517" to "Update glm5.1-fp4-mi355x-sglang SGLang ROCm image to v0.5.12-rocm720-mi35x" on May 17, 2026
Docker Hub does not publish a clean lmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x
release tag — only the dated nightly variant. The earlier switch to the
un-suffixed tag was a mistake (caused 'manifest not found' on every job).

Restoring the dated nightly tag that does exist.
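
The 'manifest not found' failure described in this commit message could in principle be caught before a job is dispatched. The sketch below is one possible way, assuming a Docker CLI with the `docker manifest inspect` subcommand is available where the check runs; the helper names and the injectable `run` parameter are hypothetical and exist only so the logic can be exercised without Docker.

```python
import subprocess
from typing import Callable, Sequence

# A runner takes an argv list and returns the process exit code.
Runner = Callable[[Sequence[str]], int]


def _docker_runner(cmd: Sequence[str]) -> int:
    """Run the command quietly and return its exit code."""
    result = subprocess.run(
        list(cmd),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        check=False,
    )
    return result.returncode


def manifest_exists(image: str, run: Runner = _docker_runner) -> bool:
    """Return True if `docker manifest inspect <image>` exits with 0."""
    return run(["docker", "manifest", "inspect", image]) == 0
```

Calling `manifest_exists("lmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x")` before changing the config would be expected to return False if the registry has no such tag, as this commit message reports. Note that the default runner raises `FileNotFoundError` when the `docker` binary is not installed.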
@functionstackx functionstackx changed the title from "Update glm5.1-fp4-mi355x-sglang SGLang ROCm image to v0.5.12-rocm720-mi35x" to "Update glm5.1-fp4-mi355x-sglang SGLang ROCm image to v0.5.12-rocm720-mi35x-20260517" on May 18, 2026
# Conflicts:
#	perf-changelog.yaml
functionstackx added a commit that referenced this pull request May 18, 2026
)

Root-caused via the failed sweeps on #1431, #1432, #1440, #1441,
#1443 — every failure landed on either:

  mia1-p01-g09  pyxis: failed to create container filesystem
                (extended attributes not supported on the destination
                filesystem; pyxis can't mount the squashfs)
  mia1-p01-g11  permission denied while trying to connect to docker.sock
                (cluster-cleanup `docker stop` step fails; cascading
                into pyxis-init failure)

Both are already known-bad per KLAUD_DEBUG.md §5.1 / §5.2, but the
launcher wasn't excluding them. This mirrors the existing pattern in
runners/launch_mi300x-amds.sh (#1462 — pin to known-good nodes) and
runners/launch_mi325x-amds.sh (#1477 — exclude chi-mi325x-pod1-121).

Once this lands the 5 affected mi355x PRs can be rebased to pick it up
and the failed jobs will land on healthy nodes only.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
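
The node exclusion described in the commit message above can be sketched as follows. The real launcher is not shown in this thread, so this is only an illustration in Python: `salloc_args` and `KNOWN_BAD_NODES` are hypothetical names, and the only Slurm feature relied on is the standard `--exclude=<node list>` option of `salloc`.

```python
from typing import Iterable, List

# Node names taken from the commit message above.
KNOWN_BAD_NODES = ("mia1-p01-g09", "mia1-p01-g11")


def salloc_args(
    base_args: Iterable[str],
    bad_nodes: Iterable[str] = KNOWN_BAD_NODES,
) -> List[str]:
    """Append an --exclude option listing known-bad nodes, if there are any."""
    args = list(base_args)
    nodes = sorted(set(bad_nodes))  # de-duplicate and keep the order stable
    if nodes:
        args.append("--exclude=" + ",".join(nodes))
    return args


cmd = salloc_args(["salloc", "--nodes=1"])
```

With the defaults, `cmd` ends with `--exclude=mia1-p01-g09,mia1-p01-g11`, so the scheduler would not place the allocation on either node. Keeping the list in one place makes it easy to remove a node again once it has been repaired.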
# Conflicts:
#	perf-changelog.yaml