
[Klaud Cold] Update dsv4-fp8-h200-vllm (+mtp) vLLM image to v0.21.0 #1461

Open
functionstackx wants to merge 3 commits into main from update-dsv4-fp8-h200-vllm-v0.21.0

Conversation

@functionstackx
Collaborator

Summary

  • Bumps dsv4-fp8-h200-vllm from vllm/vllm-openai:deepseekv4-cu129 (custom DSV4 tag, 11d old) to vllm/vllm-openai:v0.21.0.
  • Bumps dsv4-fp8-h200-vllm-mtp from SHA-pinned vllm/vllm-openai:v0.20.1@sha256:9eff9734... (11d old) to vllm/vllm-openai:v0.21.0.

Test plan

  • Full sweep passes with full-sweep-enabled label.

🤖 Generated with Claude Code

@github-actions
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get approval from their respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

@claude claude Bot left a comment

LGTM — straightforward vLLM image bump to v0.21.0; the stale-comment nit is non-functional.

Extended reasoning...

Overview

This PR bumps the container image for two YAML entries in .github/configs/nvidia-master.yaml (dsv4-fp8-h200-vllm and dsv4-fp8-h200-vllm-mtp) to vllm/vllm-openai:v0.21.0, and appends a matching entry to perf-changelog.yaml. The total diff is 4 changed lines plus a 7-line changelog block — purely a config tag update with no logic changes.

Security risks

None. These are container image tags consumed by CI/benchmark sweeps; no auth, crypto, permissions, or input-handling code is touched. The new tag v0.21.0 is the canonical upstream vLLM release tag that several other recipes in this repo were already updated to (see prior PRs #1404, #1394, #1416 in the recent history).

Level of scrutiny

Low. This follows the exact same mechanical pattern as multiple recently merged image-bump PRs in this repo, the change is fully reversible by reverting two lines, and the full-sweep-enabled label gates correctness on the actual benchmark sweep rather than human eyeballing.

Other factors

The only finding is a nit posted inline: the explanatory comments above both entries still reference the old deepseekv4-cu129 and v0.20.1 tags. It is purely a documentation staleness issue with no runtime impact, and not worth blocking the PR on. The author can address it in a follow-up or as a quick amend.

  # flag is omitted. Max-model-len is pinned at 800k per the recipe.
  dsv4-fp8-h200-vllm:
-   image: vllm/vllm-openai:deepseekv4-cu129
+   image: vllm/vllm-openai:v0.21.0

🟡 The two header comments above dsv4-fp8-h200-vllm and dsv4-fp8-h200-vllm-mtp are now stale: lines 2843–2845 still say "Uses the cu129 image" and lines 2867–2869 still say "Uses the canonical v0.20.1 image (the non-MTP entry above is still on the deepseekv4-cu129 tag)". After this PR both entries are on v0.21.0, so all three of those tag references are factually wrong. Please update both comment blocks to match the new image tag.

Extended reasoning...

What's stale

This PR bumps the image for both DSv4 entries to vllm/vllm-openai:v0.21.0, but the explanatory comments that document the reason for the previous tags were not updated. Two distinct comment blocks are now incorrect:

Block 1 — above dsv4-fp8-h200-vllm (lines 2843–2845):

# DeepSeek-V4-Pro H200 recipe from https://vllm.ai/blog/deepseek-v4
# Uses the cu129 image. H200 has no FP4 path, so the FP4 indexer cache
# flag is omitted. Max-model-len is pinned at 800k per the recipe.

The new image vllm/vllm-openai:v0.21.0 (line 2847) contains no cu129 designator, so "Uses the cu129 image" is wrong.

Block 2 — above dsv4-fp8-h200-vllm-mtp (lines 2867–2869):

# MTP variant of dsv4-fp8-h200-vllm. Uses the canonical v0.20.1 image
# (the non-MTP entry above is still on the deepseekv4-cu129 tag) and adds
# --speculative-config '{"method":"mtp","num_speculative_tokens":2}'.

Both factual claims in the parenthetical are now false: the MTP entry is on v0.21.0 (not v0.20.1), and the non-MTP entry above is also on v0.21.0 (not deepseekv4-cu129).

Step-by-step proof

  1. Before this PR, dsv4-fp8-h200-vllm.image was vllm/vllm-openai:deepseekv4-cu129 — matching "Uses the cu129 image" in block 1, and matching "non-MTP entry above is still on the deepseekv4-cu129 tag" in block 2.
  2. Before this PR, dsv4-fp8-h200-vllm-mtp.image was vllm/vllm-openai:v0.20.1@sha256:9eff9734... — matching "Uses the canonical v0.20.1 image" in block 2.
  3. This PR's diff at line 2847 mutates the first image to vllm/vllm-openai:v0.21.0, and the diff at line 2871 mutates the second image to vllm/vllm-openai:v0.21.0.
  4. Neither comment block is touched by the diff, so all three tag references in the comments are now factually invalidated by this PR.

Impact

No runtime effect — the comments are descriptive, not load-bearing. The harm is that a maintainer cross-referencing comments with images (e.g., when triaging another image bump or debugging a tag mismatch) will be actively misled about what container is in use. The PR's perf-changelog.yaml entry correctly describes the swap, which suggests the in-file comments were simply overlooked.

Suggested fix

Update both blocks to reflect v0.21.0. For example:

# DeepSeek-V4-Pro H200 recipe from https://vllm.ai/blog/deepseek-v4
# Uses the v0.21.0 image. H200 has no FP4 path, so the FP4 indexer cache
# flag is omitted. Max-model-len is pinned at 800k per the recipe.

# MTP variant of dsv4-fp8-h200-vllm. Uses the v0.21.0 image (matching the
# non-MTP entry above) and adds
# --speculative-config '{"method":"mtp","num_speculative_tokens":2}'.
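
A check like the following could catch this kind of comment drift automatically. It is a minimal sketch, not part of the repo: `stale_tags` and `TAG_PATTERN` are hypothetical names, the tag pattern only covers the tag shapes quoted in this review, and it assumes each entry's comment block sits directly above the entry key with the `image:` line below it.

```python
import re

# Hypothetical pattern: matches release tags such as "v0.20.1" and CUDA-style
# designators such as "cu129" or "deepseekv4-cu129". Not an exhaustive tag grammar.
TAG_PATTERN = re.compile(r"\bv\d+\.\d+\.\d+\b|\b(?:deepseekv4-)?cu\d+\b")


def stale_tags(yaml_text):
    """Return (entry_name, tag) pairs where a comment above an entry
    mentions a tag that does not appear in that entry's image tag."""
    findings = []
    comment_tags = []  # tags mentioned in comments since the last image line
    entry = None
    for line in yaml_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            comment_tags.extend(TAG_PATTERN.findall(stripped))
        elif stripped.startswith("image:"):
            image = stripped.split(":", 1)[1].strip()
            # Drop an optional "@sha256:..." digest, then take the tag after the last ":".
            tag = image.split("@", 1)[0].rsplit(":", 1)[-1]
            findings.extend((entry, t) for t in comment_tags if t not in tag)
            comment_tags = []
        elif stripped.endswith(":"):
            entry = stripped[:-1]
    return findings
```

Run against the two entries as they stand after this PR, it would report `cu129` for the first entry and both `v0.20.1` and `deepseekv4-cu129` for the MTP entry.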

Re: refutation

One verifier flagged bug_002 as a duplicate of bug_003. The synthesis agent has already merged the two original bugs (bug_001 covering block 1 and bug_002 covering block 2) into a single report that covers both stale comment blocks in one place, which addresses the fragmentation concern.

@github-actions
Contributor

vLLM v0.21.0 uses more GPU memory at model load than the old custom
deepseekv4-cu129 / v0.20.1 builds the recipe was previously pinned to.
At --gpu-memory-utilization 0.95 the new image OOMs on GPU 2 during
weight loading (CUDA out of memory: 138.83/139.81 GiB already used,
need 1008 MiB more).

Drop to 0.90 in both dsv4_fp8_h200.sh and dsv4_fp8_h200_mtp.sh (matches
the pattern we use for other vLLM B200/B300 recipes since the
v0.20.x->v0.21.x bump expanded the runtime footprint).
@functionstackx
Collaborator Author

Diagnosis + fix attempt: lowering --gpu-memory-utilization 0.95 → 0.90

Failing run (pre-fix sweep): https://github.com/SemiAnalysisAI/InferenceX/actions/runs/26006222386

Representative failing job: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/26006222386/job/76438133541 (dsv4_8k1k fp8 h200 vllm | tp=8 ep=1 dpa=false | conc-256 | eval-only)

What I read in the log

All 8 TP workers (Worker_TP0 through Worker_TP7) crash during model loading with:

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1008.00 MiB.
GPU 2 has a total capacity of 139.81 GiB of which 993.44 MiB is free.
Including non-PyTorch memory, this process has 138.83 GiB memory in use.
[gpu_model_runner.py:4957] Failed to load model — not enough GPU memory.
Try lowering --gpu-memory-utilization to free memory for weights, …

The server is launched with --gpu-memory-utilization 0.95. By the time the weights stream in, every GPU is already at 138.83 / 139.81 GiB used and the final 1008 MiB allocation tips it over.
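
To make the numbers in that log easier to compare across jobs, the message can be parsed mechanically. The sketch below is illustrative only: `parse_oom` is a hypothetical helper, and the regular expressions assume the exact wording of the three log lines quoted above.

```python
import re

# Log excerpt copied from the failing job quoted above.
OOM_LOG = """\
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1008.00 MiB.
GPU 2 has a total capacity of 139.81 GiB of which 993.44 MiB is free.
Including non-PyTorch memory, this process has 138.83 GiB memory in use.
"""


def parse_oom(log):
    """Extract request size, capacity, and free memory from a CUDA OOM message."""
    requested = re.search(r"Tried to allocate ([\d.]+) MiB", log)
    capacity = re.search(
        r"GPU (\d+) has a total capacity of ([\d.]+) GiB of which ([\d.]+) MiB is free",
        log,
    )
    in_use = re.search(r"this process has ([\d.]+) GiB memory in use", log)
    if not (requested and capacity and in_use):
        raise ValueError("log does not contain a recognizable CUDA OOM message")
    requested_mib = float(requested.group(1))
    free_mib = float(capacity.group(3))
    return {
        "gpu": int(capacity.group(1)),
        "requested_mib": requested_mib,
        "total_gib": float(capacity.group(2)),
        "free_mib": free_mib,
        "in_use_gib": float(in_use.group(1)),
        # How far the failed allocation overshot the memory still free.
        "shortfall_mib": round(requested_mib - free_mib, 2),
    }


print(parse_oom(OOM_LOG))
```

On the excerpt above, the shortfall works out to 14.56 MiB: the 1008.00 MiB request only narrowly missed the 993.44 MiB that was still free.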

Why this is the v0.21.0 bump, not a pre-existing issue

This recipe was previously pinned to the custom DSV4 tag vllm/vllm-openai:deepseekv4-cu129 (MTP-off variant) and the SHA-pinned vllm/vllm-openai:v0.20.1@sha256:9eff97... (MTP variant). Both ran cleanly at the same --gpu-memory-utilization 0.95. The image bump in this PR is to vllm/vllm-openai:v0.21.0 (generic, non-custom). v0.21.0 has expanded its runtime footprint (CUDA-graph profiler, larger weight-cast buffers), the same pattern we've already hit on other vLLM B200/B300 recipes.

The fix I just pushed (49570ada)

Drop --gpu-memory-utilization from 0.95 to 0.90 in both:

  • benchmarks/single_node/dsv4_fp8_h200.sh:65
  • benchmarks/single_node/dsv4_fp8_h200_mtp.sh:73

0.90 nominally leaves ~14 GiB/GPU outside vLLM's budget (10% of 139.81 GiB), versus the ~1 GiB that was actually free at 0.95 when the load failed. That should be enough room for v0.21.0's larger load-time footprint while still giving the KV cache the bulk of HBM.
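
The headroom figure can be reproduced from the capacity reported in the log. This is a back-of-the-envelope sketch that treats --gpu-memory-utilization as a plain fraction of the 139.81 GiB total; `budget` is a hypothetical helper, and the result is a nominal budget, not a measurement of what any vLLM version actually reserves.

```python
TOTAL_GIB = 139.81  # per-GPU capacity reported in the OOM log


def budget(utilization, total_gib=TOTAL_GIB):
    """Return (GiB inside the vLLM budget, GiB left outside it) for a utilization fraction."""
    inside = total_gib * utilization
    return inside, total_gib - inside


for util in (0.95, 0.90, 0.85):
    inside, outside = budget(util)
    print(f"util={util:.2f}: budget={inside:.2f} GiB, outside budget={outside:.2f} GiB")
```

Under this reading, 0.95 leaves about 7 GiB outside the budget and 0.90 about 14 GiB. The ~1 GiB quoted for 0.95 is the free memory observed in the failing log, not this nominal figure.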

Fallbacks if this isn't enough

  1. Also set export VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0 before vllm serve (the kimik2.5_fp4_b200 pattern). v0.21.0 is reported to enable the estimator by default — disabling it can claw back ~20+ GB/GPU of pre-reserved budget.
  2. Drop further to 0.85 if 0.90 still OOMs.
  3. Revert the image bump and stay on vllm/vllm-openai:deepseekv4-cu129 / v0.20.1 until v0.21.0's footprint stabilizes.
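
The order of attempts can be written down as data so a retry script walks it consistently. This is a sketch under assumptions: `fallback_ladder` and `launch_flag` are hypothetical names, fallback 2 is assumed to be cumulative with fallback 1 (the comment does not say), and fallback 3 (reverting the image) is left out because it is a config change rather than a launch setting.

```python
def fallback_ladder():
    """Return (label, gpu_memory_utilization, extra_env) tuples in the order to try them."""
    # Environment variable named in fallback 1 above.
    no_estimator = {"VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS": "0"}
    return [
        ("pushed fix", 0.90, {}),
        ("fallback 1", 0.90, no_estimator),
        # Assumption: fallback 2 keeps the estimator disabled as well.
        ("fallback 2", 0.85, no_estimator),
    ]


def launch_flag(utilization):
    """Format the vllm serve flag for a given utilization fraction."""
    return f"--gpu-memory-utilization {utilization:.2f}"


for label, util, env in fallback_ladder():
    exports = " ".join(f"{k}={v}" for k, v in env.items())
    print(f"{label}: {exports} {launch_flag(util)}".replace(":  ", ": "))
```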
