fix(install): auto-generate missing NVIDIA CDI device spec during installer preflight #3428
…taller preflight Signed-off-by: zyang-dev <267119621+zyang-dev@users.noreply.github.com>
📝 Walkthrough
Installer preflight now auto-detects missing NVIDIA CDI device specs, computes the expected spec path, ensures the target directory exists, attempts a systemd refresh unit, or runs …
Changes: CDI Device Spec Auto-Generation
Sequence Diagram
sequenceDiagram
participant Installer as Installer Shell
participant NodePreflight as Node preflight module
participant System as systemd (systemctl)
participant CTK as nvidia-ctk
participant FS as Host filesystem (/etc/cdi)
Installer->>NodePreflight: assessHost()
NodePreflight-->>Installer: cdiNvidiaGpuSpecMissing + dockerCdiSpecDirs
Installer->>System: systemctl enable --now nvidia-cdi-refresh*
System-->>Installer: success/failure
alt refresh success
System->>CTK: refresh unit triggers generation
CTK->>FS: write /etc/cdi/nvidia.yaml
CTK-->>Installer: cdi list shows nvidia.com/gpu
else refresh failed
Installer->>Installer: sudo -v (if non-root)
Installer->>Installer: sudo mkdir -p <specDir>
Installer->>CTK: sudo nvidia-ctk cdi generate --output=<specPath>
CTK->>FS: write <specPath>
CTK-->>Installer: nvidia-ctk cdi list -> contains nvidia.com/gpu
end
Installer->>Installer: continue onboarding
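The flow in the diagram can be sketched as a small shell routine. This is an illustrative sketch, not the installer's actual code: the function names (`cdi_spec_path`, `cdi_list_has_nvidia_gpu`, `repair_cdi_spec`), the `CDI_REFRESH_UNITS` variable, the concrete unit names, and the `nvidia.yaml` file name for non-default spec directories are all assumptions. Only `systemctl enable --now`, `sudo -v`, `sudo mkdir -p`, `nvidia-ctk cdi generate --output=`, and `nvidia-ctk cdi list` come from the diagram.

```shell
#!/usr/bin/env bash
# Sketch of the repair flow in the sequence diagram. Assumptions, not taken
# from the PR: all function names, the nvidia.yaml file name for custom spec
# directories, and the concrete refresh unit names.

# Assumed unit names; the diagram only shows "nvidia-cdi-refresh*".
CDI_REFRESH_UNITS="${CDI_REFRESH_UNITS:-nvidia-cdi-refresh.path nvidia-cdi-refresh.service}"

# Build the expected spec path from a CDI spec directory.
cdi_spec_path() {
  local spec_dir="${1:-/etc/cdi}"
  printf '%s/nvidia.yaml\n' "${spec_dir%/}"
}

# Succeed when the text on stdin mentions an nvidia.com/gpu device.
cdi_list_has_nvidia_gpu() {
  grep -q 'nvidia\.com/gpu'
}

repair_cdi_spec() {
  local spec_dir="${1:-/etc/cdi}"
  local spec_path
  spec_path="$(cdi_spec_path "$spec_dir")"

  # Preferred path: enable the refresh unit(s) and check the result.
  # The unit list is intentionally left unquoted so it splits into words.
  if systemctl enable --now $CDI_REFRESH_UNITS >/dev/null 2>&1 \
    && nvidia-ctk cdi list 2>/dev/null | cdi_list_has_nvidia_gpu; then
    return 0
  fi

  # Fallback: generate the spec directly, prompting for sudo if non-root.
  if [ "$(id -u)" -ne 0 ]; then
    sudo -v || return 1
  fi
  sudo mkdir -p "$spec_dir" || return 1
  sudo nvidia-ctk cdi generate --output="$spec_path" || return 1

  # Final check: the device must now appear in the CDI listing.
  nvidia-ctk cdi list 2>/dev/null | cdi_list_has_nvidia_gpu
}
```

Calling `repair_cdi_spec /etc/cdi` returns zero only when `nvidia-ctk cdi list` reports `nvidia.com/gpu` after either path, which matches the final check in both branches of the diagram.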
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
…eneration Signed-off-by: zyang-dev <267119621+zyang-dev@users.noreply.github.com>
## Summary

Refreshes the NemoClaw docs for the 0.0.43 release window, covering GPU onboarding fixes, installer CDI repair behavior, and Linux uninstall cleanup. Updates the docs version metadata and regenerates the user skill references from the source docs.

## Related Issue

None.

## Changes

- #3428 -> `docs/reference/troubleshooting.md`: Documents the installer path that repairs missing NVIDIA CDI device specs before onboarding.
- #3515 and #3543 -> `docs/about/release-notes.md` and `docs/reference/troubleshooting.md`: Documents the Linux Docker-driver GPU proof permission fix for `/proc/<pid>/task/<tid>/comm` writes.
- #3536 -> `docs/reference/commands.md`: Documents that `nemoclaw uninstall` removes Linux gateway state under `~/.local/state/nemoclaw`.
- Refreshes generated `nemoclaw-user-*` skill references from the updated source docs.
- Bumps `docs/project.json` and `docs/versions1.json` to 0.0.43.

## Type of Change

- [ ] Code change (feature, bug fix, or refactor)
- [ ] Code change with doc updates
- [ ] Doc only (prose changes, no code sample modifications)
- [x] Doc only (includes code sample changes)

## Verification

- [ ] `npx prek run --all-files` passes
- [ ] `npm test` passes
- [ ] Tests added or updated for new or changed behavior
- [x] No secrets, API keys, or credentials committed
- [x] Docs updated for user-facing behavior changes
- [x] `make docs` builds without warnings (doc changes only)
- [x] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only)
- [ ] New doc pages include SPDX header and frontmatter (new pages only)

---

Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>

## Summary by CodeRabbit

* **Bug Fixes**
  * Improved GPU onboarding on Linux Docker-driver with automatic CDI spec repair and fallback mechanisms.
  * Fixed permission issues affecting GPU proof writes during Linux onboarding.
  * Enhanced uninstall to properly clean up gateway state and auth proxy processes on Linux.
* **Documentation**
  * Updated release notes, command references, and troubleshooting guides for v0.0.43.

Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>
Summary
Auto-repair missing NVIDIA CDI device specs during the standard NemoClaw installer flow so fresh DGX Spark installs do not stop before onboarding. Direct `nemoclaw onboard` still reports the manual remediation if users bypass the installer.
Related Issue
Fixes #3252
Changes
- … `nvidia.com/gpu` CDI specs during installer host preflight.
- … `nvidia-ctk cdi generate` when available.
- … `nvidia-ctk` output while preserving failure diagnostics.
Type of Change
Verification
- `npx prek run --all-files` passes
- `npm test` passes
- `make docs` builds without warnings (doc changes only)
Signed-off-by: zyang-dev <267119621+zyang-dev@users.noreply.github.com>
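The Changes list mentions handling `nvidia-ctk` output while preserving failure diagnostics. One common way to do that in shell, assuming the intent is to hide output on success and surface it on failure, is to capture the output in a temporary file and replay it only when the command fails. The helper name `run_quiet` is hypothetical and not taken from the PR.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the PR): run a command, discard its output
# on success, and replay the captured output on stderr if it fails.
run_quiet() {
  local log rc
  log="$(mktemp)" || return 1
  if "$@" >"$log" 2>&1; then
    rc=0
  else
    rc=$?            # exit status of the wrapped command
    cat "$log" >&2   # keep the diagnostics visible on failure
  fi
  rm -f "$log"
  return "$rc"
}
```

With this sketch, `run_quiet sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml` would print nothing when generation succeeds and would print the full captured output, with the original exit status, when it fails.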
Summary by CodeRabbit
Documentation
Improvements
Tests