
feat(inference): update DGX Station vLLM to DeepSeek V4 Flash#4867

Merged: cv merged 2 commits into main from feat/vllm-dgx-station-deepseek-v4-flash on Jun 5, 2026

Conversation

@zyang-dev
Contributor

@zyang-dev zyang-dev commented Jun 5, 2026

Summary

Updates the DGX Station managed vLLM recipe to serve DeepSeek V4 Flash with the NVIDIA vLLM 26.05.post1 container.

Related Issue

Changes

  • Set the DGX Station managed vLLM profile to nvcr.io/nvidia/vllm:26.05.post1-py3.
  • Add deepseek-ai/DeepSeek-V4-Flash to the managed vLLM model registry with the DGX Station launch flags.
  • Move the default --gpu-memory-utilization 0.7 setting from shared vLLM args into the model entries that use it.
  • Update managed vLLM docs and command references with the new deepseek-v4-flash slug and DGX Station default.
  • Add tests for DGX Station profile selection and DeepSeek V4 Flash command generation.
  • Add vLLM profile tests for DGX Station, DGX Spark, and generic Linux, plus pull-image watchdog coverage for managed vLLM image downloads.
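
The split between shared and per-model arguments described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code from `src/lib/inference/vllm-models.ts`: the `VllmModel` shape, `SHARED_ARGS`, `findModel`, `buildServeArgs`, and the `example-org/example-model` entry are all hypothetical, and the real DeepSeek V4 Flash launch flags are deliberately left out.

```typescript
// Hypothetical shapes for illustration; the real registry in
// src/lib/inference/vllm-models.ts may be structured differently.
interface VllmModel {
  slug: string; // short name used in docs and env overrides
  id: string; // Hugging Face model id passed to `vllm serve`
  modelArgs: string[]; // flags that apply to this model only
}

// Flags applied to every model. The review in this PR notes that
// --trust-remote-code is still shared.
const SHARED_ARGS: readonly string[] = ["--trust-remote-code"];

const MODELS: readonly VllmModel[] = [
  {
    slug: "example-model", // hypothetical entry that opts into the 0.7 setting
    id: "example-org/example-model",
    modelArgs: ["--gpu-memory-utilization", "0.7"],
  },
  {
    slug: "deepseek-v4-flash",
    id: "deepseek-ai/DeepSeek-V4-Flash",
    modelArgs: [], // the PR's real launch flags are not reproduced here
  },
];

// Look up a registry entry by slug, failing loudly on unknown names.
function findModel(slug: string): VllmModel {
  const model = MODELS.find((m) => m.slug === slug);
  if (!model) throw new Error(`unknown vLLM model slug: ${slug}`);
  return model;
}

// Shared flags come first, then whatever the model entry asks for.
function buildServeArgs(model: VllmModel): string[] {
  return ["vllm", "serve", model.id, ...SHARED_ARGS, ...model.modelArgs];
}
```

With this layout, a model that does not list `--gpu-memory-utilization` in its own `modelArgs` no longer inherits it, which is the behavior change the refactor implies for generated serve commands.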

Type of Change

  • Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only (prose changes, no code sample modifications)
  • Doc only (includes code sample changes)

Verification

  • npx prek run --all-files passes
  • npm test passes
  • Tests added or updated for new or changed behavior
  • No secrets, API keys, or credentials committed
  • Docs updated for user-facing behavior changes
  • npm run docs builds without warnings (doc changes only)
  • Doc pages follow the style guide (doc changes only)
  • New doc pages include SPDX header and frontmatter (new pages only)

Signed-off-by: Your Name your-email@example.com

Summary by CodeRabbit

  • New Features

    • Added DeepSeek V4 Flash as a managed vLLM model and made it the DGX Station default.
  • Documentation

    • Updated inference and command-reference docs to reflect the new model slug and DGX Station default; removed the prior slug from the recognized list.
  • Refactor

    • Serving defaults moved to per-model configuration, changing generated serve command behavior.
  • Tests

    • Added/expanded tests for vLLM profile detection, image pulling, and the new model configuration.

Signed-off-by: zyang-dev <267119621+zyang-dev@users.noreply.github.com>
@coderabbitai
Contributor

coderabbitai Bot commented Jun 5, 2026


No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 12181078-7de2-4d94-af35-d14197c5df58

📥 Commits

Reviewing files that changed from the base of the PR and between de40cc6 and ca05927.

📒 Files selected for processing (1)
  • test/detect-vllm-profile.test.ts

📝 Walkthrough

This PR moves gpu-memory-utilization into per-model args, adds a DeepSeek V4 Flash vLLM registry entry with extensive serve flags, sets DGX Station to use a new NGC image and DeepSeek V4 Flash by default, and updates tests and docs to match.

Changes

vLLM Model Registry and Station Profile Update

| Layer | File(s) | Summary |
| --- | --- | --- |
| GPU Memory Utilization Per-Model | src/lib/inference/vllm-models.ts | --gpu-memory-utilization removed from shared args and added to the modelArgs of each affected model. |
| DeepSeek V4 Flash Model Registry & Tests | src/lib/inference/vllm-models.ts, src/lib/inference/vllm-models.test.ts | Added deepseek-ai/DeepSeek-V4-Flash registry entry with model-specific vLLM flags; tests verify registry entry, env-based selection, and generated vllm serve command for DGX Station. |
| DGX Station Profile Reconfiguration | src/lib/inference/vllm.ts | Station profile now uses ngc2605Post1 image and defaults to deepseekV4FlashModel() (DeepSeek V4 Flash); removed DEFAULT_VLLM_MODEL import and updated comments. |
| vLLM Infrastructure & Profile Testing | src/lib/inference/vllm.test.ts, test/detect-vllm-profile.test.ts | New tests for profile detection and image pulling; updated Station profile test expectations for image tag and default model env value. |
| Documentation Updates | docs/inference/inference-options.mdx, docs/reference/commands.mdx, docs/reference/commands-nemohermes.mdx | Docs updated to list deepseek-v4-flash as the DGX Station default slug and to remove qwen3.6-27b from recognized slugs. |
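
As a rough illustration of the Station profile change summarized above, the sketch below pairs the new image tag with the new default slug and lets an environment override replace the default only when it names a recognized slug. The `VllmProfile` shape, `STATION_PROFILE`, `KNOWN_SLUGS`, and `resolveModelSlug` are hypothetical names, and treating `NEMOCLAW_VLLM_MODEL` as a slug (rather than a full model id) is an assumption.

```typescript
// Hypothetical profile shape; not the structure used in src/lib/inference/vllm.ts.
interface VllmProfile {
  name: string;
  image: string;
  defaultModelSlug: string;
}

// Image tag and default slug are taken from the PR description.
const STATION_PROFILE: VllmProfile = {
  name: "dgx-station",
  image: "nvcr.io/nvidia/vllm:26.05.post1-py3",
  defaultModelSlug: "deepseek-v4-flash",
};

// Only slugs on this allowlist may be selected. A real registry would list more.
const KNOWN_SLUGS: ReadonlySet<string> = new Set(["deepseek-v4-flash"]);

// Assumption: NEMOCLAW_VLLM_MODEL carries a slug. The env is passed in
// explicitly so the function stays easy to test.
function resolveModelSlug(
  profile: VllmProfile,
  env: Record<string, string | undefined>,
): string {
  const requested = env["NEMOCLAW_VLLM_MODEL"];
  if (requested === undefined || requested === "") {
    return profile.defaultModelSlug;
  }
  if (!KNOWN_SLUGS.has(requested)) {
    // Reject before any Docker or shell call can see the value.
    throw new Error(`unrecognized vLLM model slug: ${JSON.stringify(requested)}`);
  }
  return requested;
}
```

Checking the override against a static allowlist before anything else runs means a value such as `bad;touch /tmp/pwn` never reaches command construction.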

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • NVIDIA/NemoClaw#4810: Modifies DGX vLLM image tag configuration and profile behavior in src/lib/inference/vllm.ts.
  • NVIDIA/NemoClaw#4619: Related changes to vLLM model registry and serve-command generation logic affecting model defaults.

Suggested labels

area: inference, Provider: vLLM, Platform: Station, feature

Suggested reviewers

  • cv

Poem

🐰 A flash of DeepSeek brightens Station's hall,
Per-model memory whispers, no more shared call,
Tests hop and watch the image pull along,
Docs sing the slug's new, steadfast song,
Happy hops — the v4 lights the wall.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and accurately summarizes the main change: updating DGX Station managed vLLM to use DeepSeek V4 Flash instead of the previous Qwen model, which is the central theme across all modified files. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@github-actions
Contributor

github-actions Bot commented Jun 5, 2026

E2E Advisor Recommendation

Required E2E: None
Optional E2E: None

Workflow run

Full advisor summary

E2E Recommendation Advisor

Failed: Could not parse JSON from advisor output; see /home/runner/work/NemoClaw/NemoClaw/artifacts/e2e-advisor/e2e-advisor-raw-output.txt

@github-actions
Contributor

github-actions Bot commented Jun 5, 2026

E2E Scenario Advisor Recommendation

Required scenario E2E: None
Optional scenario E2E: None

Workflow run

Full scenario advisor summary

E2E Scenario Advisor

Failed: Could not parse JSON from advisor output; see /home/runner/work/NemoClaw/NemoClaw/artifacts/e2e-advisor/e2e-scenario-advisor-raw-output.txt

@github-actions
Contributor

github-actions Bot commented Jun 5, 2026

PR Review Advisor

Findings: 0 needs attention, 2 worth checking, 0 nice ideas
Since last review: 1 prior item resolved, 2 still apply, 0 new items found

Review findings

🛠️ Needs attention

  • None.

🔎 Worth checking

  • New Station default still relies on mutable Hugging Face remote code (src/lib/inference/vllm-models.ts:282): DGX Station now defaults to `deepseek-ai/DeepSeek-V4-Flash`, and the shared vLLM arguments still include `--trust-remote-code`. The model is downloaded and served by mutable Hugging Face model id, and the NGC image is tag-pinned but not digest-pinned. This is an installer-trust and supply-chain concern for a managed inference lifecycle, even though the selected model and flags are allowlisted static registry entries.
    • Recommendation: Pin a reviewed Hugging Face revision for the default model and, where feasible, an NGC image digest; or document why mutable refs are required. Consider making `--trust-remote-code` a per-model opt-in with rationale instead of a shared default.
    • Evidence: `src/lib/inference/vllm-models.ts:98` adds `deepseek-ai/DeepSeek-V4-Flash`; `src/lib/inference/vllm.ts:141-142` makes Station use `nvcr.io/nvidia/vllm:26.05.post1-py3` and `deepseekV4FlashModel()`; `src/lib/inference/vllm-models.ts:282` includes shared `--trust-remote-code`; `downloadModel()` and `buildVllmServeCommand()` use `model.id` without a revision.
  • Runtime validation is still recommended for the new Station vLLM recipe (src/lib/inference/vllm.test.ts:34): The added tests cover profile selection, command-string contents, and pull watchdog behavior with mocked Docker/runner boundaries. The changed behavior is still an infrastructure path that depends on the 26.05.post1 container, DGX Station GPU selection, DeepSeek V4 Flash launch flags, shell quoting, model download, and `/v1/models` readiness.
    • Recommendation: Add or identify targeted runtime/integration validation that DGX Station starts `deepseek-ai/DeepSeek-V4-Flash` on `nvcr.io/nvidia/vllm:26.05.post1-py3` and reaches `/v1/models`. Also add focused unit coverage for mixed-GPU GB300 selection, invalid `NEMOCLAW_VLLM_MODEL` rejection before Docker/runShell, HF token non-leakage in the full generated docker command, and preservation of DeepSeek JSON args through final command construction.
    • Evidence: `src/lib/inference/vllm.test.ts` mocks Docker, runner, and GPU detection; `src/lib/inference/vllm-models.test.ts` asserts command substrings but does not launch the container; deterministic test-depth context marked the changed inference/network/runtime surfaces as `runtime_validation_recommended`.
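
One way the pinning recommendation in the first finding could look is sketched below. It is an assumption-laden illustration, not a proposal for the actual registry: `PinnedModel`, `buildPinnedServeArgs`, and `pinImage` are hypothetical names, and the revision and digest values a caller supplies would have to come from a real review. The sketch relies on vLLM's `--revision` option and on Docker's `repository@sha256:<digest>` image reference form.

```typescript
// Hypothetical registry entry that carries a reviewed revision and makes
// remote code an explicit per-model opt-in.
interface PinnedModel {
  id: string;
  revision?: string; // branch, tag, or commit id on the Hugging Face Hub
  trustRemoteCode?: boolean; // off unless the entry asks for it
}

// Build `vllm serve` arguments, adding --revision and --trust-remote-code
// only when the entry requests them.
function buildPinnedServeArgs(model: PinnedModel): string[] {
  const args = ["vllm", "serve", model.id];
  if (model.revision) args.push("--revision", model.revision);
  if (model.trustRemoteCode) args.push("--trust-remote-code");
  return args;
}

// Turn a repository plus a sha256 digest into an immutable image reference.
function pinImage(repository: string, digest: string): string {
  if (!/^sha256:[0-9a-f]{64}$/.test(digest)) {
    throw new Error(`not a sha256 digest: ${digest}`);
  }
  return `${repository}@${digest}`;
}
```

A tag such as `26.05.post1-py3` can be repointed by the registry owner, whereas a digest reference cannot, which is why the finding distinguishes tag-pinned from digest-pinned.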

🌱 Nice ideas

  • None.
Consider writing more tests for

Unit tests now cover the prior stale Station expectation, registry selection, mocked profile selection, command-string contents, and pull watchdog mapping. Because the PR changes a managed Docker/vLLM runtime path, confidence would improve with targeted validation of the actual Station container launch, GPU selection, shell quoting, model download, and `/v1/models` readiness.

  • **Runtime validation**: On DGX Station, validate that managed vLLM starts `deepseek-ai/DeepSeek-V4-Flash` with `nvcr.io/nvidia/vllm:26.05.post1-py3` and `/v1/models` returns a model list.
  • **Runtime validation**: Add a unit test that Station `buildDockerRunFlags()` selects only GB300 devices on a mixed-GPU host, emits the quoted multi-device form for multiple GB300s, and falls back to `all` when no GB300 is detected.
  • **Runtime validation**: Add a unit test that `installVllm()` with `NEMOCLAW_VLLM_MODEL='bad;touch /tmp/pwn'` fails before `dockerPullWithProgressWatchdog`, `dockerSpawn`, or `runShell` are called.
  • **Runtime validation**: Add a unit test that the full generated `docker run` command forwards HF credentials only as `-e HF_TOKEN` or `-e HUGGING_FACE_HUB_TOKEN` and never embeds the actual token value.
  • **Runtime validation**: Add a unit test that DeepSeek V4 `--compilation-config` and `--speculative-config` JSON strings are preserved in the final `docker run ... /bin/bash -lc ...` command, not just in `buildVllmServeCommand()`.
  • **Runtime validation is still recommended for the new Station vLLM recipe**: Add or identify targeted runtime/integration validation that DGX Station starts `deepseek-ai/DeepSeek-V4-Flash` on `nvcr.io/nvidia/vllm:26.05.post1-py3` and reaches `/v1/models`. Also add focused unit coverage for mixed-GPU GB300 selection, invalid `NEMOCLAW_VLLM_MODEL` rejection before Docker/runShell, HF token non-leakage in the full generated docker command, and preservation of DeepSeek JSON args through final command construction.
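
Two of the suggested unit tests target behavior that is easy to sketch in isolation: GB300-only device selection and name-only forwarding of Hugging Face credentials. The helpers below are hypothetical stand-ins, not the project's `buildDockerRunFlags()`; matching GPUs by the substring `GB300` and the `Gpu` shape are assumptions. The flag forms themselves follow Docker's documented `--gpus` syntax and its `-e NAME` form, which copies a variable from the caller's environment without placing its value on the command line.

```typescript
// Hypothetical GPU description; real detection output may differ.
interface Gpu {
  index: number;
  name: string;
}

// Select only GB300 devices, falling back to `all` when none are found.
// Values are argv entries, so the double quotes in the multi-device form
// are literal characters that Docker's --gpus parser expects.
function gpuFlags(gpus: readonly Gpu[]): string[] {
  const gb300 = gpus.filter((g) => g.name.includes("GB300")).map((g) => g.index);
  if (gb300.length === 0) return ["--gpus", "all"];
  if (gb300.length === 1) return ["--gpus", `device=${gb300[0]}`];
  return ["--gpus", `"device=${gb300.join(",")}"`];
}

// Forward Hugging Face credentials by variable name only, so the token
// value never appears in the generated command.
function hfTokenFlags(env: Record<string, string | undefined>): string[] {
  const flags: string[] = [];
  for (const name of ["HF_TOKEN", "HUGGING_FACE_HUB_TOKEN"]) {
    if (env[name]) flags.push("-e", name);
  }
  return flags;
}
```

A test built on these helpers can assert on the joined command string and check that a known token value is absent from it, which is the non-leakage property the advisor asks for.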
Workflow run details

This is an automated advisory review. A human maintainer must make the final merge decision.

Signed-off-by: zyang-dev <267119621+zyang-dev@users.noreply.github.com>
@cv cv merged commit 4f0ae44 into main Jun 5, 2026
34 checks passed
@cv cv deleted the feat/vllm-dgx-station-deepseek-v4-flash branch June 5, 2026 21:17
miyoungc added a commit that referenced this pull request Jun 6, 2026
## Summary
- Adds the `v0.0.60` section to `docs/about/release-notes.mdx` using the
dev announcement from discussion #4877.
- Fills the source-doc gaps found during release-prep review across
inference, policy tiers, command behavior, security boundaries, Hermes
dashboard/tooling, runtime context, and troubleshooting.
- Refreshes generated agent skills under `.agents/skills/` from the
current Fern docs output and upgrades Fern from `5.44.3` to `5.45.0`.

## Source summary
- #4037 -> `docs/reference/architecture.mdx`,
`docs/about/how-it-works.mdx`, `docs/about/release-notes.mdx`: Documents
system-only runtime context that stays out of visible chat.
- #4875 -> `docs/reference/architecture.mdx`,
`docs/about/how-it-works.mdx`, `docs/about/release-notes.mdx`: Documents
try-first sandbox network/filesystem guidance and clearer failure
classification.
- #4788 -> `docs/security/best-practices.mdx`,
`docs/about/release-notes.mdx`: Documents shared OpenClaw
device-approval policy for startup and connect.
- #4768 -> `docs/reference/network-policies.mdx`,
`docs/network-policy/integration-policy-examples.mdx`,
`docs/get-started/quickstart.mdx`,
`docs/get-started/quickstart-hermes.mdx`, `docs/reference/commands.mdx`:
Documents `weather`, `public-reference`, and Hermes managed-tool gateway
preset behavior.
- #3788 and #4864 -> `docs/reference/network-policies.mdx`,
`docs/reference/commands.mdx`: Documents non-interactive policy-tier
fail-fast behavior and interactive prompt fallback.
- #4756 and #4866 -> `docs/reference/commands.mdx`: Documents env-aware
default sandbox resolution for `list`, `status`, and `tunnel` commands.
- #4320 -> `docs/reference/commands.mdx`: Documents `$$nemoclaw tunnel
status` behavior.
- #4328 -> `docs/reference/commands.mdx`: Documents line-scoped policy
preset descriptions in `policy-list`.
- #4580 and #4748 -> `docs/reference/architecture.mdx`: Documents
package-managed OpenShell gateway service and Docker-driver
gateway-marker behavior.
- #4598 -> `docs/manage-sandboxes/lifecycle.mdx`: Documents concurrent
gateway/dashboard cleanup isolation by sandbox name and port.
- #4777 -> `docs/reference/troubleshooting.mdx`: Documents Docker GPU
patch rollback behavior.
- #4610 -> `docs/reference/troubleshooting.mdx`,
`docs/reference/commands.mdx`: Keeps mutable OpenClaw config permission
guidance aligned and removes skipped experimental wording.
- #4868 -> `docs/reference/commands.mdx`: Keeps `.dockerignore` handling
for custom `onboard --from <Dockerfile>` contexts in generated skills.
- #4870 -> `docs/reference/commands.mdx`,
`docs/manage-sandboxes/runtime-controls.mdx`: Documents
`NEMOCLAW_MINIMAL_BOOTSTRAP` and generated skill coverage.
- #4641 -> `docs/inference/inference-options.mdx`,
`docs/reference/troubleshooting.mdx`: Documents local NVIDIA NIM
platform-digest pulls and served-model id adoption.
- #4810 and #4867 -> `docs/inference/inference-options.mdx`: Documents
stable NGC managed-vLLM image lineage and DGX Station DeepSeek V4 Flash
coverage.
- #4852 -> `docs/inference/use-local-inference.mdx`,
`docs/reference/troubleshooting.mdx`: Documents Ollama model fit
filtering, 16K context floor, cold-load retry, and failed-model
exclusion.
- #4847 -> `docs/inference/switch-inference-providers.mdx`: Documents
API-family sync, Hermes `api_mode`, and Bedrock Runtime exception.
- #4800 -> `docs/inference/tool-calling-reliability.mdx`: Documents
Nemotron managed-inference native tool-search fallback.
- #4333 -> `docs/inference/switch-inference-providers.mdx`: Documents
interactive multimodal input prompting.
- #4086 -> `docs/reference/troubleshooting.mdx`: Keeps proxy bypass
normalization in generated troubleshooting coverage.
- #4811 and #4855 -> `docs/get-started/quickstart-hermes.mdx`: Documents
prebuilt Hermes dashboard assets and TUI recovery without runtime
rebuilds.
- #4854 -> `docs/inference/switch-inference-providers.mdx`,
`docs/reference/commands.mdx`: Documents Hermes proxy API-key
placeholder preservation during inference switches.
- #4248 -> `docs/manage-sandboxes/messaging-channels.mdx`,
`.agents/skills/`: Keeps messaging enrollment behavior aligned with
manifest-hook implementation.
- #4771 -> `docs/security/best-practices.mdx`,
`docs/security/credential-storage.mdx`: Documents Hermes
placeholder-only secret boundary for sandbox-visible runtime files.
- #4787 -> `docs/security/best-practices.mdx`,
`docs/about/release-notes.mdx`: Documents expanded memory scanner
examples for OpenAI project keys and Slack app-level tokens.
- #4848 -> `docs/reference/commands.mdx`: Documents OpenClaw skill
install mirroring into the agent home directory.
- #4790 -> `docs/about/release-notes.mdx`: Uses the prior release-prep
structure and generated `.agents/skills/` refresh as the template for
this release.

## Verification
- `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix
nemoclaw-user --doc-platform fern-mdx`
- `python3 scripts/docs-to-skills.py docs/ .agents/skills/ skills/
--prefix nemoclaw-user --doc-platform fern-mdx --dry-run`
- `npm run docs`
- `git diff --check`
- skip-term scan across `docs/`, `.agents/skills/`, and `skills/`
- `npm run build:cli`
- `npm run typecheck:cli`
- Commit and pre-push hook suites, including markdownlint, gitleaks,
env-var docs gate, docs-to-skills verification, and skills YAML tests

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

## Release Notes

* **New Features**
  * DeepSeek-V4-Flash now available as default inference model for DGX Station.
  * Hermes dashboard improved with dedicated port and OAuth-authenticated tool gateway selection.
  * Added weather and public-reference policy presets for expanded agent capabilities.
  * Enhanced Ollama model selection with GPU memory filtering and automatic retry for timeouts.

* **Bug Fixes**
  * Improved policy tier validation to prevent invalid configurations.
  * Better sandbox cleanup scoping by port to prevent conflicts across deployments.
  * Added GPU patch failure recovery with automatic rollback.

* **Documentation**
  * Expanded troubleshooting guides for inference, security, and sandbox lifecycle.
  * Added .dockerignore best practices for custom deployments.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Carlos Villela <cvillela@nvidia.com>
2 participants