mi355x dsv4 retry#1160
Conversation
|
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes is similar to the official vLLM recipes and/or the SGLang cookbook If it is not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack. |
| - config-keys: | ||
| - dsv4-fp8-mi355x-sglang | ||
| description: | ||
| - "Drop --mem-fraction-static 0.88 and --max-total-tokens from dsv4_fp8_mi355x.sh" | ||
| - "Bump --chunked-prefill-size from 4096 to 8192" | ||
| - "Retrigger dsv4-fp8-mi355x-sglang" | ||
| pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1159 |
There was a problem hiding this comment.
🟡 The newly added perf-changelog entry sets pr-link: .../pull/1159, but this is PR #1160 ("mi355x dsv4 retry"); #1159 is the prior already-merged retrigger PR (commit d578b7e). The entry was copy-pasted from the unlinked block above without bumping the number — following the convention used for the analogous retrigger pattern at lines 1799-1805 (where #1148 self-references), this should point to pull/1160.
Extended reasoning...
What the bug is
The new perf-changelog.yaml entry added at lines 1830-1836 sets:
- config-keys:
- dsv4-fp8-mi355x-sglang
description:
- "Drop --mem-fraction-static 0.88 and --max-total-tokens from dsv4_fp8_mi355x.sh"
- "Bump --chunked-prefill-size from 4096 to 8192"
- "Retrigger dsv4-fp8-mi355x-sglang"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1159But this PR is #1160 ("mi355x dsv4 retry"), not #1159. PR #1159 ("retrigger mi355x dsv4") is the prior, already-merged retrigger PR — see commit d578b7e in git log. The new entry was clearly copy-pasted from the unlinked sibling block at lines 1823-1828 (which belongs to #1159) and only had pr-link appended, without updating the PR number.
Why this is a bug (convention violation)
The established convention in this file is that each entry's pr-link points to its own PR. The directly analogous retrigger pattern is at lines 1799-1805:
- config-keys:
- dsv4-fp8-mi355x-sglang
description:
- "Bump MI355X SLURM allocation from --time=180 to --time=300 ..."
...
- "Retriggering dsv4-fp8-mi355x-sglang"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1148Per git show --stat 694f206, PR #1148 was a retrigger PR that only modified perf-changelog.yaml (no recipe changes — just adding the duplicate linked entry), and its pr-link self-references #1148 even though the actual recipe changes were made in PR #1147. This PR (#1160) is structurally identical — it's a retrigger of #1159's recipe changes that only adds a duplicate changelog entry — but uses pull/1159 instead of pull/1160, breaking the precedent.
Step-by-step proof
- PR metadata says this is mi355x dsv4 retry #1160 ("mi355x dsv4 retry"), opened 2026-04-25.
git logshows commitd578b7e= "retrigger mi355x dsv4 (retrigger mi355x dsv4 #1159)" — already merged.- The diff adds exactly one block at lines 1830-1836 with
pr-link: ...pull/1159. - Lines 1823-1828 (already on
main, added by retrigger mi355x dsv4 #1159) contain the sameconfig-keys/descriptionas lines 1830-1836 but with nopr-linkfield — the new block is a copy-paste of that prior unlinked block. - Following the Bump MI355X SLURM time-limit to 300m; retrigger dsv4-fp8-mi355x-sglang #1147 → Increase timeout #1148 precedent (retrigger PR self-references its own PR number), the correct value here is
https://github.com/SemiAnalysisAI/InferenceX/pull/1160.
Impact and fix
Severity is nit: perf-changelog.yaml is documentation/metadata only — no runtime or functional impact. The misleading link still resolves to a real (related) PR, so a reader can still find their way, just not directly. But it does break traceability and the file's self-consistency. Fix: change 1159 → 1160 on line 1836.
No description provided.