[Klaud Cold] Update dsv4-fp8-h200-vllm (+mtp) vLLM image to v0.21.0 #1461
functionstackx wants to merge 3 commits into
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR first before we can merge your single-node PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.
PR authors are responsible for ensuring that all GitHub Action jobs fully pass after merging. Failures are often just flakes, and re-running the failed jobs will fix them. If a re-run is attempted, PR authors are responsible for ensuring that it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow
As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26006220868
LGTM — straightforward vLLM image bump to v0.21.0; the stale-comment nit is non-functional.
Overview
This PR bumps the container image for two YAML entries in .github/configs/nvidia-master.yaml (dsv4-fp8-h200-vllm and dsv4-fp8-h200-vllm-mtp) to vllm/vllm-openai:v0.21.0, and appends a matching entry to perf-changelog.yaml. The total diff is 4 changed lines plus a 7-line changelog block — purely a config tag update with no logic changes.
Security risks
None. These are container image tags consumed by CI/benchmark sweeps; no auth, crypto, permissions, or input-handling code is touched. The new tag v0.21.0 is the canonical upstream vLLM release tag that several other recipes in this repo were already updated to (see prior PRs #1404, #1394, #1416 in the recent history).
Level of scrutiny
Low. This follows the exact same mechanical pattern as multiple recently merged image-bump PRs in this repo, the change is fully reversible by reverting two lines, and the full-sweep-enabled label gates correctness on the actual benchmark sweep rather than human eyeballing.
Other factors
The only finding is a nit posted inline: the explanatory comments above both entries still reference the old deepseekv4-cu129 and v0.20.1 tags. It is purely a documentation staleness issue with no runtime impact, and not worth blocking the PR on. The author can address it in a follow-up or as a quick amend.
  # flag is omitted. Max-model-len is pinned at 800k per the recipe.
  dsv4-fp8-h200-vllm:
-   image: vllm/vllm-openai:deepseekv4-cu129
+   image: vllm/vllm-openai:v0.21.0
🟡 The two header comments above dsv4-fp8-h200-vllm and dsv4-fp8-h200-vllm-mtp are now stale: lines 2843–2845 still say "Uses the cu129 image" and lines 2867–2869 still say "Uses the canonical v0.20.1 image (the non-MTP entry above is still on the deepseekv4-cu129 tag)". After this PR both entries are on v0.21.0, so all three of those tag references are factually wrong. Please update both comment blocks to match the new image tag.
What's stale
This PR bumps the image for both DSv4 entries to vllm/vllm-openai:v0.21.0, but the explanatory comments that document the reason for the previous tags were not updated. Two distinct comment blocks are now incorrect:
Block 1 — above dsv4-fp8-h200-vllm (lines 2843–2845):
# DeepSeek-V4-Pro H200 recipe from https://vllm.ai/blog/deepseek-v4
# Uses the cu129 image. H200 has no FP4 path, so the FP4 indexer cache
# flag is omitted. Max-model-len is pinned at 800k per the recipe.
The new image vllm/vllm-openai:v0.21.0 (line 2847) contains no cu129 designator, so "Uses the cu129 image" is wrong.
Block 2 — above dsv4-fp8-h200-vllm-mtp (lines 2867–2869):
# MTP variant of dsv4-fp8-h200-vllm. Uses the canonical v0.20.1 image
# (the non-MTP entry above is still on the deepseekv4-cu129 tag) and adds
# --speculative-config '{"method":"mtp","num_speculative_tokens":2}'.
Both factual claims in the parenthetical are now false: the MTP entry is on v0.21.0 (not v0.20.1), and the non-MTP entry above is also on v0.21.0 (not deepseekv4-cu129).
Step-by-step proof
- Before this PR, dsv4-fp8-h200-vllm.image was vllm/vllm-openai:deepseekv4-cu129, matching "Uses the cu129 image" in block 1 and "non-MTP entry above is still on the deepseekv4-cu129 tag" in block 2.
- Before this PR, dsv4-fp8-h200-vllm-mtp.image was vllm/vllm-openai:v0.20.1@sha256:9eff9734..., matching "Uses the canonical v0.20.1 image" in block 2.
- This PR's diff at line 2847 mutates the first image to vllm/vllm-openai:v0.21.0, and the diff at line 2871 mutates the second image to vllm/vllm-openai:v0.21.0.
- Neither comment block is touched by the diff, so all three tag references in the comments are now factually invalidated by this PR.
Impact
No runtime effect — the comments are descriptive, not load-bearing. The harm is that a maintainer cross-referencing comments with images (e.g., when triaging another image bump or debugging a tag mismatch) will be actively misled about what container is in use. The PR's perf-changelog.yaml entry correctly describes the swap, which suggests the in-file comments were simply overlooked.
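A check along the following lines could catch this kind of staleness automatically. It is a minimal sketch, not part of the PR: the `CONFIG` excerpt is a simplified stand-in for the two entries in nvidia-master.yaml, and the helper name `stale_comment_tags` and the `TAG_TOKEN` pattern are hypothetical. It collects the comment lines above each entry and reports any version-like token (`vX.Y.Z` or a `cuNNN` build name) that does not appear in that entry's image value.

```python
import re

# Simplified, hypothetical excerpt of the two entries after the image bump.
CONFIG = """\
# DeepSeek-V4-Pro H200 recipe from https://vllm.ai/blog/deepseek-v4
# Uses the cu129 image. H200 has no FP4 path, so the FP4 indexer cache
# flag is omitted. Max-model-len is pinned at 800k per the recipe.
dsv4-fp8-h200-vllm:
  image: vllm/vllm-openai:v0.21.0
# MTP variant of dsv4-fp8-h200-vllm. Uses the canonical v0.20.1 image
# (the non-MTP entry above is still on the deepseekv4-cu129 tag) and adds
# --speculative-config '{"method":"mtp","num_speculative_tokens":2}'.
dsv4-fp8-h200-vllm-mtp:
  image: vllm/vllm-openai:v0.21.0
"""

# Tag-like tokens worth flagging in comments: vX.Y.Z versions and cuNNN builds
# (optionally with a hyphenated prefix such as "deepseekv4-").
TAG_TOKEN = re.compile(r"\bv\d+\.\d+\.\d+\b|\b(?:[\w-]+-)?cu\d+\b")


def stale_comment_tags(text):
    """Return (entry, token) pairs where a header comment names a tag-like
    token that is absent from the entry's image value."""
    findings = []
    comments = []  # comment lines seen since the previous entry's image
    entry = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            comments.append(stripped)
        elif re.match(r"^[\w-]+:\s*$", line):
            # An unindented "name:" line starts a new entry.
            entry = stripped.rstrip(":")
        elif stripped.startswith("image:") and entry is not None:
            image = stripped.split(":", 1)[1].strip()
            for comment in comments:
                for token in TAG_TOKEN.findall(comment):
                    if token not in image:
                        findings.append((entry, token))
            comments = []
            entry = None
    return findings


for entry, token in stale_comment_tags(CONFIG):
    print(f"{entry}: comment mentions {token!r}, which is not in the image tag")
```

On this excerpt the sketch reports three tokens, one for the first block and two for the second, which lines up with the three stale references described above. A real check would need to parse the YAML properly and tolerate comments that mention old tags on purpose.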
Suggested fix
Update both blocks to reflect v0.21.0. For example:
# DeepSeek-V4-Pro H200 recipe from https://vllm.ai/blog/deepseek-v4
# Uses the v0.21.0 image. H200 has no FP4 path, so the FP4 indexer cache
# flag is omitted. Max-model-len is pinned at 800k per the recipe.
# MTP variant of dsv4-fp8-h200-vllm. Uses the v0.21.0 image (matching the
# non-MTP entry above) and adds
# --speculative-config '{"method":"mtp","num_speculative_tokens":2}'.
Re: refutation
One verifier flagged bug_002 as a duplicate of bug_003. The synthesis agent has already merged the two original bugs (bug_001 covering block 1 and bug_002 covering block 2) into a single report that covers both stale comment blocks in one place, which addresses the fragmentation concern.
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26006222386
vLLM v0.21.0 uses more GPU memory at model load than the old custom deepseekv4-cu129 / v0.20.1 builds the recipe was previously pinned to. At --gpu-memory-utilization 0.95 the new image OOMs on GPU 2 during weight loading (CUDA out of memory: 138.83/139.81 GiB already used, need 1008 MiB more). Drop to 0.90 in both dsv4_fp8_h200.sh and dsv4_fp8_h200_mtp.sh (matches the pattern we use for other vLLM B200/B300 recipes since the v0.20.x->v0.21.x bump expanded the runtime footprint).
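The numbers in the OOM message can be checked with a few lines of arithmetic. This is an illustrative sketch only: the three figures are copied from the error text above, and `budget_gib` is a hypothetical helper that models `--gpu-memory-utilization` as a plain fraction of the reported device capacity, which simplifies how vLLM actually accounts for memory.

```python
GIB_TO_MIB = 1024

total_gib = 139.81    # device capacity reported in the OOM message
used_gib = 138.83     # memory already in use when the allocation failed
requested_mib = 1008  # size of the allocation that could not be satisfied

# Free memory at the moment of failure, and how far short it fell.
free_mib = (total_gib - used_gib) * GIB_TO_MIB
shortfall_mib = requested_mib - free_mib
print(f"free: {free_mib:.1f} MiB, shortfall: {shortfall_mib:.1f} MiB")


def budget_gib(utilization, total=total_gib):
    """Memory budget implied by a utilization fraction (simplified model)."""
    return utilization * total


for u in (0.95, 0.90):
    print(f"utilization {u:.2f} -> budget {budget_gib(u):.2f} GiB")

headroom_gib = budget_gib(0.95) - budget_gib(0.90)
print(f"extra headroom from 0.95 -> 0.90: {headroom_gib:.2f} GiB")
```

Under these assumptions the failed allocation missed by only a few MiB (roughly 1003.5 MiB free against 1008 MiB requested), while dropping the fraction from 0.95 to 0.90 leaves about 7 GiB more per GPU outside the budget.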
Diagnosis + fix attempt: lowering
Summary
- dsv4-fp8-h200-vllm from vllm/vllm-openai:deepseekv4-cu129 (custom DSV4 tag, 11d old) to vllm/vllm-openai:v0.21.0.
- dsv4-fp8-h200-vllm-mtp from SHA-pinned vllm/vllm-openai:v0.20.1@sha256:9eff9734... (11d old) to vllm/vllm-openai:v0.21.0.
Test plan
- full-sweep-enabled label.
🤖 Generated with Claude Code