fix(preflight): reject WDDM placeholder GPU names on non-NVIDIA firmware#4062
Conversation
…are (#3988) `detectGpu()`'s primary nvidia-smi path used to accept any name the CLI returned, so a Snapdragon X WSL2 d3d12 shim that publishes `nvidia-smi.exe` (and prints a `JMJWOA-Generic-GPU` placeholder for the Snapdragon iGPU) made preflight report "NVIDIA GPU detected" on hosts with no NVIDIA hardware. Onboard then proceeded with GPU paths that later failed in rebuild's CDI check. Gate the primary path with `isPlausibleNvidiaGpuName()` so a name without an `NVIDIA` token and without a recognised NVIDIA product family is only trusted when the firmware platform vouches for it (`spark` / `station` / `jetson`). Real DGX Spark, which also returns the same placeholder string (#3510), keeps detection via the firmware classification path. Fixes #3988 Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (2)
📝 WalkthroughWalkthroughAdds a GPU-name plausibility check and firmware-gated filtering to NVIDIA detection, rejecting WSL2/WDDM shim placeholders like "JMJWOA-Generic-GPU" on non-NVIDIA platforms and retaining them only when firmware confirms an NVIDIA platform. ChangesNVIDIA GPU detection robustness
Sequence DiagramsequenceDiagram
participant detectGpu
participant detectNvidiaPlatform
participant nvidia-smi
participant isPlausibleNvidiaGpuName
detectGpu->>detectNvidiaPlatform: detectNvidiaPlatform()
detectGpu->>nvidia-smi: run --query-gpu=name,memory.total
nvidia-smi-->>detectGpu: parsed GPU rows
alt firmware confirms NVIDIA
detectGpu->>detectGpu: trust all parsed GPUs and compute totals
else firmware does not confirm
detectGpu->>isPlausibleNvidiaGpuName: filter parsed rows by name
isPlausibleNvidiaGpuName-->>detectGpu: trustedRows
alt trustedRows non-empty
detectGpu->>detectGpu: compute totals from trustedRows
else trustedRows empty
detectGpu-->>detectGpu: return null
end
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 ESLint
ESLint skipped: no ESLint configuration detected in root package.json. To enable, add Comment |
E2E Advisor RecommendationRequired E2E: Auto-dispatched E2E: Full advisor summaryE2E Recommendation AdvisorBase: Required E2E
Optional E2E
New E2E recommendations
|
Selective E2E Results —
|
| Job | Result |
|---|---|
| gpu-e2e | ⏭️ skipped |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
src/lib/inference/nim.test.ts (1)
321-381: ⚡ Quick winAdd a generic-firmware rejection test for
NVIDIA JMJWOA-Generic-GPU.These new tests are good, but they only reject the unprefixed placeholder. Add one case asserting
detectGpu()returnsnullfor vendor-prefixed placeholder on generic firmware to lock the intended guardrail.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/inference/nim.test.ts` around lines 321 - 381, Add a new test case in src/lib/inference/nim.test.ts that mirrors the existing "rejects WDDM placeholder names..." test but returns the vendor-prefixed placeholder string "NVIDIA JMJWOA-Generic-GPU, 65471" from the mocked runCapture; use loadNimWithMockedRunner to get nimModule and wrap the assertion in withFirmwareModel("Microsoft Corporation Virtual Machine", ...) and assert nimModule.detectGpu() === null to ensure vendor-prefixed placeholders are also rejected on generic firmware.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@src/lib/inference/nim.test.ts`:
- Around line 321-381: Add a new test case in src/lib/inference/nim.test.ts that
mirrors the existing "rejects WDDM placeholder names..." test but returns the
vendor-prefixed placeholder string "NVIDIA JMJWOA-Generic-GPU, 65471" from the
mocked runCapture; use loadNimWithMockedRunner to get nimModule and wrap the
assertion in withFirmwareModel("Microsoft Corporation Virtual Machine", ...) and
assert nimModule.detectGpu() === null to ensure vendor-prefixed placeholders are
also rejected on generic firmware.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: c459a9c1-e934-46eb-9b6e-b6397430341f
📒 Files selected for processing (2)
src/lib/inference/nim.test.tssrc/lib/inference/nim.ts
PR Review AdvisorRecommendation: blocked This is an automated advisory review. A human maintainer must make the final merge decision. Limitations: Review used trusted deterministic PR metadata and the supplied diff only; no scripts, tests, package-manager commands, or E2E jobs were executed by this advisor.; Issue #3988 had no comments in trusted context, so acceptance mapping uses the issue body clauses plus PR/E2E/review-thread metadata.; Runtime behavior on real WSL2 Snapdragon X and DGX Spark hardware is not proven by the available evidence.; The PR review advisor is advisory and does not replace maintainer review. Full advisor summaryPR Review AdvisorBase: The targeted GPU-name validation fix now covers the vendor-prefixed WDDM placeholder, but merge is blocked by GitHub mergeStateStatus=BLOCKED, missing required E2E for the current head SHA, and monolith-growth policy blockers. Gate status
🔴 Blockers
🟡 Warnings
🔵 Suggestions
Acceptance coverage
Security review
Test / E2E status
✅ What looks good
Review completeness
|
PR #4062 review fix. The original name guard accepted any name matching `\bNVIDIA\b`, which made `NVIDIA JMJWOA-Generic-GPU` plausible on generic Linux firmware. Some WDDM/d3d12 shims (the same family that originally surfaced as bare `JMJWOA-Generic-GPU` on Snapdragon X WSL2) may add the vendor prefix, which would re-open #3988. Add `NVIDIA_GPU_NAME_DENYLIST_PATTERN` so the placeholder is treated as suspect regardless of any `NVIDIA ` prefix. Real DGX Spark (#3510) still passes via the firmware-vouch path in `detectGpu()`. Adds a regression test mirroring the CodeRabbit review comment on PR #4062. Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Selective E2E Results —
|
| Job | Result |
|---|---|
| gpu-e2e | ⏭️ skipped |
Selective E2E Results —
|
| Job | Result |
|---|---|
| gpu-e2e | ⏭️ skipped |
|
✅ Brev E2E (gpu): PASSED on branch |
ericksoa
left a comment
There was a problem hiding this comment.
Approved. The WDDM placeholder guard is narrowly scoped to non-NVIDIA firmware and includes the vendor-prefixed placeholder regression coverage CodeRabbit asked for. Live checks are green at 655efbe, including WSL and dedicated Brev GPU branch validation.
## Summary Refreshes the NemoClaw docs for the v0.0.49 hardening release, including release notes, command reference updates, troubleshooting guidance, version metadata, and regenerated user skills. ## Changes - #3796, #3854, #3863, #3866, #3984, #4001, #4011, #4013, #4020, #4022, #4023, #4060, #4062 -> `docs/about/release-notes.mdx`: Adds the v0.0.49 hardening release summary covering gateway reliability, status/doctor/shields and debug UX, OpenClaw compatibility, messaging channel teardown, Hermes policy scoping, snapshots, source installs and Docker group security note, GPU preflight, CLI usage, E2E, and CI improvements. - #3796 -> `docs/manage-sandboxes/backup-restore.mdx` and `docs/reference/commands.mdx`: Documents `snapshot restore --to` overwrite protection and the `--force` opt-in. - #3863, #4013, #4020, #4023 -> `docs/reference/commands.mdx`: Documents missing channel argument usage, sandbox-scoped custom preset matching, session policy preset sync, and gateway failure classification (uses the real probe states from `src/lib/status-command-deps.ts`). - #4022, #4060, #4062 -> `docs/reference/troubleshooting.mdx`: Adds guidance for gateway-down `connect`, source checkout OpenShell bootstrapping, WDDM placeholder GPU names, and Jetson sandbox GPU passthrough. - Release prep -> `docs/project.json`, `docs/versions1.json`, `.agents/skills/nemoclaw-user-*`: Bumps docs metadata to 0.0.49 and refreshes generated user skills from the Fern docs. ## Type of Change - [ ] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [x] Doc only (includes code sample changes) ## Verification - [x] `npx prek run --all-files` passes - [ ] `npm test` passes - [ ] Tests added or updated for new or changed behavior - [x] No secrets, API keys, or credentials committed - [x] Docs updated for user-facing behavior changes - [ ] `make docs` builds without warnings (doc changes only) - [x] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only) - [ ] New doc pages include SPDX header and frontmatter (new pages only) \`make docs\` was attempted locally but did not complete because \`npm\` returned \`403 Forbidden\` while fetching \`fern-api\` from \`registry.npmjs.org\` in the sandboxed environment. --- Signed-off-by: Miyoung Choi <miyoungc@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Documentation** * Released v0.0.49 with reliability and compatibility improvements including faster gateway failure diagnostics and safer snapshot restore behavior * Enhanced snapshot restore documentation with `--to` cloning and `--force` overwrite requirements * Expanded troubleshooting guides for source installs, GPU setup, and gateway recovery * Clarified Docker group access requirements and improved CLI command reference * **Chores** * Version bumped to 0.0.49 <!-- review_stack_entry_start --> [](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/4078?utm_source=github_walkthrough&utm_medium=github&utm_campaign=change_stack) <!-- review_stack_entry_end --> <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
On Snapdragon X WSL2 hosts (Windows on ARM, no NVIDIA hardware), a d3d12/WDDM
nvidia-smi.exeshim is published into the WSL distro and returnsJMJWOA-Generic-GPUas the GPU name.detectGpu()'s primarynvidia-smi --query-gpu=name,memory.totalpath accepted any non-empty name, so preflight reported a fake "NVIDIA GPU detected (JMJWOA-Generic-GPU, 65471 MB)" and onboard proceeded down GPU-enabled code paths that later broke in rebuild's stricter CDI check.Real DGX Spark legitimately reports the same placeholder string (see #3510), so the rejection must be conditional: trust the name when it contains
NVIDIAor a known NVIDIA product family, otherwise require the firmware platform to vouch for it (spark/station/jetson).Related Issue
Fixes #3988
Changes
src/lib/inference/nim.ts: addisPlausibleNvidiaGpuName()andNVIDIA_GPU_NAME_PATTERN. PrimarydetectGpu()path filters parsed rows through this check unlessdetectNvidiaPlatform()already classifies the host asspark/station/jetson; if all rows fail validation, returnsnullinstead of fabricating an NVIDIA GPU.src/lib/inference/nim.test.ts: two regression tests — Snapdragon X-style WSL2 placeholder rejected on generic Linux firmware; same placeholder accepted under DGX Spark firmware.Type of Change
Verification
npx prek run --all-filespassesnpm testpassesmake docsbuilds without warnings (doc changes only)Signed-off-by: Tinson Lai tinsonl@nvidia.com
Summary by CodeRabbit
Bug Fixes
Tests