[Klaud Cold] Update dsv4-fp8-h200-vllm (+mtp) vLLM image to v0.21.0 by functionstackx · Pull Request #1461 · SemiAnalysisAI/InferenceX

functionstackx · 2026-05-17T23:46:05Z

Summary

Bumps dsv4-fp8-h200-vllm from vllm/vllm-openai:deepseekv4-cu129 (custom DSV4 tag, 11d old) to vllm/vllm-openai:v0.21.0.
Bumps dsv4-fp8-h200-vllm-mtp from SHA-pinned vllm/vllm-openai:v0.20.1@sha256:9eff9734... (11d old) to vllm/vllm-openai:v0.21.0.

Test plan

Full sweep passes with full-sweep-enabled label.

🤖 Generated with Claude Code

github-actions · 2026-05-17T23:46:13Z

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix them. If failed jobs are re-run, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

github-actions · 2026-05-17T23:46:35Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26006220868
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26006220868

claude

LGTM — straightforward vLLM image bump to v0.21.0; the stale-comment nit is non-functional.

Extended reasoning...

Overview

This PR bumps the container image for two YAML entries in .github/configs/nvidia-master.yaml (dsv4-fp8-h200-vllm and dsv4-fp8-h200-vllm-mtp) to vllm/vllm-openai:v0.21.0, and appends a matching entry to perf-changelog.yaml. The total diff is 4 changed lines plus a 7-line changelog block — purely a config tag update with no logic changes.

Security risks

None. These are container image tags consumed by CI/benchmark sweeps; no auth, crypto, permissions, or input-handling code is touched. The new tag v0.21.0 is the canonical upstream vLLM release tag that several other recipes in this repo were already updated to (see prior PRs #1404, #1394, #1416 in the recent history).

Level of scrutiny

Low. This follows the exact same mechanical pattern as multiple recently merged image-bump PRs in this repo, the change is fully reversible by reverting two lines, and the full-sweep-enabled label gates correctness on the actual benchmark sweep rather than human eyeballing.

Other factors

The only finding is a nit posted inline: the explanatory comments above both entries still reference the old deepseekv4-cu129 and v0.20.1 tags. It is purely a documentation staleness issue with no runtime impact, and not worth blocking the PR on. The author can address it in a follow-up or as a quick amend.

claude · 2026-05-17T23:51:36Z

 # flag is omitted. Max-model-len is pinned at 800k per the recipe.
 dsv4-fp8-h200-vllm:
-  image: vllm/vllm-openai:deepseekv4-cu129
+  image: vllm/vllm-openai:v0.21.0


🟡 The two header comments above dsv4-fp8-h200-vllm and dsv4-fp8-h200-vllm-mtp are now stale: lines 2843–2845 still say "Uses the cu129 image" and lines 2867–2869 still say "Uses the canonical v0.20.1 image (the non-MTP entry above is still on the deepseekv4-cu129 tag)". After this PR both entries are on v0.21.0, so all three of those tag references are factually wrong. Please update both comment blocks to match the new image tag.

Extended reasoning...

What's stale

This PR bumps the image for both DSv4 entries to vllm/vllm-openai:v0.21.0, but the explanatory comments that document the reason for the previous tags were not updated. Two distinct comment blocks are now incorrect:

Block 1 — above dsv4-fp8-h200-vllm (lines 2843–2845):

# DeepSeek-V4-Pro H200 recipe from https://vllm.ai/blog/deepseek-v4
# Uses the cu129 image. H200 has no FP4 path, so the FP4 indexer cache
# flag is omitted. Max-model-len is pinned at 800k per the recipe.

The new image vllm/vllm-openai:v0.21.0 (line 2847) contains no cu129 designator, so "Uses the cu129 image" is wrong.

Block 2 — above dsv4-fp8-h200-vllm-mtp (lines 2867–2869):

# MTP variant of dsv4-fp8-h200-vllm. Uses the canonical v0.20.1 image
# (the non-MTP entry above is still on the deepseekv4-cu129 tag) and adds
# --speculative-config '{"method":"mtp","num_speculative_tokens":2}'.

Both tag claims in this block are now false: the MTP entry is on v0.21.0 (not v0.20.1), and the non-MTP entry above is also on v0.21.0 (not deepseekv4-cu129).

Step-by-step proof

Before this PR, dsv4-fp8-h200-vllm.image was vllm/vllm-openai:deepseekv4-cu129 — matching "Uses the cu129 image" in block 1, and matching "non-MTP entry above is still on the deepseekv4-cu129 tag" in block 2.

Before this PR, dsv4-fp8-h200-vllm-mtp.image was vllm/vllm-openai:v0.20.1@sha256:9eff9734... — matching "Uses the canonical v0.20.1 image" in block 2.

This PR's diff at line 2847 mutates the first image to vllm/vllm-openai:v0.21.0, and the diff at line 2871 mutates the second image to vllm/vllm-openai:v0.21.0.

Neither comment block is touched by the diff, so all three tag references in the comments are now factually invalidated by this PR.

Impact

No runtime effect — the comments are descriptive, not load-bearing. The harm is that a maintainer cross-referencing comments with images (e.g., when triaging another image bump or debugging a tag mismatch) will be actively misled about what container is in use. The PR's perf-changelog.yaml entry correctly describes the swap, which suggests the in-file comments were simply overlooked.

Suggested fix

Update both blocks to reflect v0.21.0. For example:

# DeepSeek-V4-Pro H200 recipe from https://vllm.ai/blog/deepseek-v4
# Uses the v0.21.0 image. H200 has no FP4 path, so the FP4 indexer cache
# flag is omitted. Max-model-len is pinned at 800k per the recipe.

# MTP variant of dsv4-fp8-h200-vllm. Uses the v0.21.0 image (matching the
# non-MTP entry above) and adds
# --speculative-config '{"method":"mtp","num_speculative_tokens":2}'.
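A drift check along these lines could flag such comments automatically. This is only a sketch: `find_stale_tag_mentions`, the regexes, and the assumed layout (a top-level `name:` line preceded by its `#` comment block and followed by an indented `image:` line) are hypothetical and not taken from the repo's tooling. The sample text reuses the two stale comment blocks with the post-bump image lines.

```python
import re

# Hypothetical token pattern: release tags like v0.20.1 and CUDA-suffixed
# tags like cu129 or deepseekv4-cu129.
TAG_TOKEN = re.compile(r"v\d+\.\d+\.\d+|[\w-]*cu\d+")
ENTRY = re.compile(r"^([\w.-]+):\s*$")               # top-level "name:" line
IMAGE = re.compile(r"^\s+image:\s*\S+?:(\S+)\s*$")   # captures the image tag


def find_stale_tag_mentions(yaml_text):
    """Return (entry, token) pairs where the comment block directly above
    an entry mentions a tag token that is not part of the entry's image tag."""
    stale = []
    comments = []        # comment lines seen since the last non-comment line
    entry = None         # entry currently waiting for its image line
    entry_comments = []
    for line in yaml_text.splitlines():
        if line.lstrip().startswith("#"):
            comments.append(line)
            continue
        header = ENTRY.match(line)
        if header:
            entry, entry_comments = header.group(1), comments
        else:
            image = IMAGE.match(line)
            if image and entry is not None:
                tag = image.group(1)
                for comment in entry_comments:
                    for token in TAG_TOKEN.findall(comment):
                        if token not in tag:
                            stale.append((entry, token))
                entry = None
        comments = []    # any non-comment line ends the current comment block
    return stale


SAMPLE = """\
# DeepSeek-V4-Pro H200 recipe from https://vllm.ai/blog/deepseek-v4
# Uses the cu129 image. H200 has no FP4 path, so the FP4 indexer cache
# flag is omitted. Max-model-len is pinned at 800k per the recipe.
dsv4-fp8-h200-vllm:
  image: vllm/vllm-openai:v0.21.0

# MTP variant of dsv4-fp8-h200-vllm. Uses the canonical v0.20.1 image
# (the non-MTP entry above is still on the deepseekv4-cu129 tag) and adds
# --speculative-config '{"method":"mtp","num_speculative_tokens":2}'.
dsv4-fp8-h200-vllm-mtp:
  image: vllm/vllm-openai:v0.21.0
"""

for entry, token in find_stale_tag_mentions(SAMPLE):
    print(f"{entry}: comment mentions {token!r}, which is not in the image tag")
```

A substring match against the image tag is deliberately loose, so a comment that mentions `cu129` would still pass while the entry is on `deepseekv4-cu129`.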

Re: refutation

One verifier flagged bug_002 as a duplicate of bug_003. The synthesis agent has already merged the two original bugs (bug_001 covering block 1 and bug_002 covering block 2) into a single report that covers both stale comment blocks in one place, which addresses the fragmentation concern.

github-actions · 2026-05-18T00:57:37Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26006222386
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26006222386

vLLM v0.21.0 uses more GPU memory at model load than the old custom deepseekv4-cu129 / v0.20.1 builds the recipe was previously pinned to. At --gpu-memory-utilization 0.95 the new image OOMs on GPU 2 during weight loading (CUDA out of memory: 138.83/139.81 GiB already used, need 1008 MiB more). Drop to 0.90 in both dsv4_fp8_h200.sh and dsv4_fp8_h200_mtp.sh (matches the pattern we use for other vLLM B200/B300 recipes since the v0.20.x->v0.21.x bump expanded the runtime footprint).

functionstackx · 2026-05-18T01:52:03Z

Diagnosis + fix attempt: lowering `--gpu-memory-utilization` 0.95 → 0.90

Failing run (pre-fix sweep): https://github.com/SemiAnalysisAI/InferenceX/actions/runs/26006222386

Representative failing job: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/26006222386/job/76438133541 (dsv4_8k1k fp8 h200 vllm | tp=8 ep=1 dpa=false | conc-256 | eval-only)

What I read in the log

All 8 TP workers (Worker_TP0…TP7) crash during model loading with:

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1008.00 MiB.
GPU 2 has a total capacity of 139.81 GiB of which 993.44 MiB is free.
Including non-PyTorch memory, this process has 138.83 GiB memory in use.
[gpu_model_runner.py:4957] Failed to load model — not enough GPU memory.
Try lowering --gpu-memory-utilization to free memory for weights, …

The server is launched with --gpu-memory-utilization 0.95. By the time the weights stream in, every GPU is already at 138.83 / 139.81 GiB used and the final 1008 MiB allocation tips it over.
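To make the margin concrete, the numbers in the OOM message can be pulled out and compared. The following is a rough sketch that assumes the message wording matches the lines quoted above; `parse_oom` and `OOM_LOG` are illustrative names, not part of vLLM or PyTorch.

```python
import re

# First three lines of the quoted traceback, copied from the failing job.
OOM_LOG = """\
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1008.00 MiB.
GPU 2 has a total capacity of 139.81 GiB of which 993.44 MiB is free.
Including non-PyTorch memory, this process has 138.83 GiB memory in use.
"""

MIB_PER_UNIT = {"MiB": 1.0, "GiB": 1024.0}


def _to_mib(value, unit):
    """Convert a (number, unit) pair from the log into MiB."""
    return float(value) * MIB_PER_UNIT[unit]


def parse_oom(log_text):
    """Extract requested, total, and free memory (in MiB) from an OOM message."""
    size = r"([\d.]+) (MiB|GiB)"
    requested = re.search(r"Tried to allocate " + size, log_text)
    total = re.search(r"total capacity of " + size, log_text)
    free = re.search(r"of which " + size + r" is free", log_text)
    if not (requested and total and free):
        raise ValueError("not a recognizable CUDA OOM message")
    return {
        "requested_mib": _to_mib(*requested.groups()),
        "total_mib": _to_mib(*total.groups()),
        "free_mib": _to_mib(*free.groups()),
    }


info = parse_oom(OOM_LOG)
shortfall_mib = info["requested_mib"] - info["free_mib"]
print(f"allocation was short by {shortfall_mib:.2f} MiB")
```

On these numbers the failing allocation missed by under 15 MiB, so the 0.95 setting left essentially no slack on this GPU.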

Why this is the v0.21.0 bump, not a pre-existing issue

This recipe was previously pinned to the custom DSV4 tag vllm/vllm-openai:deepseekv4-cu129 (MTP-off variant) and the SHA-pinned vllm/vllm-openai:v0.20.1@sha256:9eff97... (MTP variant). Both ran cleanly at the same --gpu-memory-utilization 0.95. This PR bumps both to the generic, non-custom vllm/vllm-openai:v0.21.0. v0.21.0 appears to have a larger runtime footprint (CUDA-graph profiler, larger weight-cast buffers), the same pattern we've already hit on:

Update kimik2.5-fp4-b200-vllm vLLM image to v0.21.0 #1395 — kimik2.5-fp4-b200-vllm v0.20.2 CUDA-graph profiler ate ~57 GB/GPU upfront (fixed with VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0 + 0.90)
Update gptoss-fp4-mi300x-vllm vLLM ROCm image to v0.21.0 #1403 — gptoss-fp4-mi300x-vllm v0.20.2 OOMed at 0.95 on MI300X (fixed with 0.95 → 0.90)

The fix I just pushed (`49570ada`)

Drop --gpu-memory-utilization from 0.95 to 0.90 in both:

benchmarks/single_node/dsv4_fp8_h200.sh:65
benchmarks/single_node/dsv4_fp8_h200_mtp.sh:73

0.90 leaves ~14 GiB/GPU outside vLLM's budget (10% of 139.81 GiB), vs. the ~1 GiB that was actually free at 0.95 in the failing run. That should be enough room for v0.21.0's larger load-time footprint while still giving the KV cache the bulk of HBM.
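The headroom figure follows directly from the per-GPU capacity in the OOM message. A small sketch of the arithmetic, where `headroom_gib` is an illustrative helper and the 0.85 row corresponds to the next fallback step:

```python
TOTAL_GIB = 139.81  # per-GPU capacity reported in the OOM message


def headroom_gib(total_gib, utilization):
    """Memory per GPU left outside the --gpu-memory-utilization budget."""
    return total_gib * (1.0 - utilization)


for util in (0.95, 0.90, 0.85):
    budget = TOTAL_GIB * util
    print(f"util={util:.2f}  budget={budget:7.2f} GiB  "
          f"headroom={headroom_gib(TOTAL_GIB, util):6.2f} GiB")
```

Note that the 138.83 GiB in use reported by the log is above the nominal 0.95 budget of about 132.8 GiB, so the nominal headroom is an upper bound on what is free at load time, not a guarantee.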

Fallbacks if this isn't enough

Also set export VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0 before vllm serve (the kimik2.5_fp4_b200 pattern). v0.21.0 is reported to enable the estimator by default — disabling it can claw back ~20+ GB/GPU of pre-reserved budget.
Drop further to 0.85 if 0.90 still OOMs.
Revert the image bump and stay on vllm/vllm-openai:deepseekv4-cu129 / v0.20.1 until v0.21.0's footprint stabilizes.
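The non-revert steps can be written down as an ordered ladder of launch overrides. This is a hedged sketch, not the benchmark scripts' actual structure: `launch_overrides` and `LADDER` are hypothetical names, and keeping the environment variable set at the 0.85 step is an assumption the list above does not spell out.

```python
def launch_overrides(gpu_mem_util, estimate_cudagraphs=True):
    """Return (env_overrides, cli_args) for one step of the fallback ladder."""
    env = {}
    if not estimate_cudagraphs:
        # Variable name taken from the fallback list above.
        env["VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS"] = "0"
    args = ["--gpu-memory-utilization", f"{gpu_mem_util:.2f}"]
    return env, args


# Pushed fix first, then the two fallbacks that keep the v0.21.0 image.
LADDER = [
    launch_overrides(0.90),
    launch_overrides(0.90, estimate_cudagraphs=False),
    launch_overrides(0.85, estimate_cudagraphs=False),
]

for env, args in LADDER:
    # A real launch would export env and append args to the existing
    # `vllm serve ...` command in the benchmark script.
    exports = " ".join(f"{k}={v}" for k, v in env.items())
    print((exports + " " if exports else "") + " ".join(args))
```

Trying the steps in this order changes one knob at a time, which keeps it clear which change resolved the OOM.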

[Klaud Cold] Update dsv4-fp8-h200-vllm (+mtp) vLLM image to v0.21.0

778120f

functionstackx requested a review from a team May 17, 2026 23:46

functionstackx added the full-sweep-enabled label May 17, 2026

functionstackx requested review from jgangani and kedarpotdar-nv as code owners May 17, 2026 23:46

github-project-automation Bot added this to InferenceMAX Board May 17, 2026

chore: fill pr-link for #1461

c43b994

claude Bot reviewed May 17, 2026

functionstackx wants to merge 3 commits into main from update-dsv4-fp8-h200-vllm-v0.21.0
