fix(preflight): bootstrap nvidia-container-toolkit when nvidia-ctk is missing#3507
Conversation
… missing (#3506) Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (2)
📝 WalkthroughWalkthroughAdds detection of NVIDIA Container Toolkit to host assessment and updates remediation planning: if ChangesNVIDIA Container Toolkit Detection and Remediation
Sequence DiagramsequenceDiagram
participant Client
participant assessHost
participant HostAssessment
participant planHostRemediation
participant buildContainerToolkitBootstrapCommands
participant RemediationPlan
Client->>assessHost: run host assessment
assessHost->>HostAssessment: probe for nvidia-ctk, set nvidiaContainerToolkitInstalled
Client-->>planHostRemediation: pass HostAssessment
planHostRemediation->>planHostRemediation: prepare generateCommands
alt nvidiaContainerToolkitInstalled == true
planHostRemediation->>RemediationPlan: emit generate_nvidia_cdi_spec action
else
planHostRemediation->>buildContainerToolkitBootstrapCommands: request install sequence + generateCommands
buildContainerToolkitBootstrapCommands-->>planHostRemediation: return install + generate sequence
planHostRemediation->>RemediationPlan: emit install_nvidia_container_toolkit action (blocking)
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 ESLint
ESLint skipped: no ESLint configuration detected in root package.json. To enable, add Comment |
E2E Advisor RecommendationRequired E2E: Dispatch hint: Auto-dispatched E2E: Full advisor summaryE2E Recommendation AdvisorBase: Required E2E
Optional E2E
New E2E recommendations
Dispatch hint
|
There was a problem hiding this comment.
🧹 Nitpick comments (1)
src/lib/onboard/preflight.test.ts (1)
937-1029: ⚡ Quick winAdd a RHEL-family package-manager test for the new toolkit bootstrap branch.
Current coverage validates
aptandunknown, but the new helper also has a dedicateddnf/yumpath. A small parameterized case would lock that branch against regressions.Suggested test shape
+ it.each(["dnf", "yum"] as const)( + "emits install_nvidia_container_toolkit with %s bootstrap when nvidia-ctk is missing", + (packageManager) => { + const actions = planHostRemediation({ + platform: "linux", + isWsl: false, + runtime: "docker", + packageManager, + systemctlAvailable: true, + dockerServiceActive: true, + dockerServiceEnabled: true, + dockerInstalled: true, + dockerRunning: true, + dockerReachable: true, + nodeInstalled: true, + openshellInstalled: true, + dockerCgroupVersion: "v2", + dockerDefaultCgroupnsMode: "unknown", + isContainerRuntimeUnderProvisioned: false, + hasNestedOverlayConflict: false, + requiresHostCgroupnsFix: false, + isUnsupportedRuntime: false, + isHeadlessLikely: false, + hasNvidiaGpu: true, + dockerCdiSpecDirs: ["/etc/cdi"], + cdiNvidiaGpuSpecMissing: true, + nvidiaContainerToolkitInstalled: false, + notes: [], + }); + + const action = actions.find((entry) => entry.id === "install_nvidia_container_toolkit"); + expect(action).toBeTruthy(); + expect(action?.commands[1]).toBe(`sudo ${packageManager} install -y nvidia-container-toolkit`); + }, + );🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/onboard/preflight.test.ts` around lines 937 - 1029, Tests cover apt and unknown packageManager branches for install_nvidia_container_toolkit but miss the RHEL-family (dnf/yum) bootstrap branch; add a parameterized test calling planHostRemediation with packageManager set to "dnf" (and/or "yum") and the same input flags (hasNvidiaGpu: true, cdiNvidiaGpuSpecMissing: true, nvidiaContainerToolkitInstalled: false) then assert the returned install_nvidia_container_toolkit action exists and that its commands include the RHEL-specific install command (e.g., "sudo dnf install -y nvidia-container-toolkit" or the appropriate yum equivalent) and still include the cdi generate command; place this alongside the existing apt/unknown tests and reuse the same expectation checks (action?.kind, .blocking, .title, .reason, cdi generate presence, and ordering relative to install command).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@src/lib/onboard/preflight.test.ts`:
- Around line 937-1029: Tests cover apt and unknown packageManager branches for
install_nvidia_container_toolkit but miss the RHEL-family (dnf/yum) bootstrap
branch; add a parameterized test calling planHostRemediation with packageManager
set to "dnf" (and/or "yum") and the same input flags (hasNvidiaGpu: true,
cdiNvidiaGpuSpecMissing: true, nvidiaContainerToolkitInstalled: false) then
assert the returned install_nvidia_container_toolkit action exists and that its
commands include the RHEL-specific install command (e.g., "sudo dnf install -y
nvidia-container-toolkit" or the appropriate yum equivalent) and still include
the cdi generate command; place this alongside the existing apt/unknown tests
and reuse the same expectation checks (action?.kind, .blocking, .title, .reason,
cdi generate presence, and ordering relative to install command).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 8f83134f-2b4d-40d7-9e43-5692e5c7937f
📒 Files selected for processing (2)
src/lib/onboard/preflight.test.tssrc/lib/onboard/preflight.ts
…issing-hint Signed-off-by: Tinson Lai <tinsonl@nvidia.com> # Conflicts: # src/lib/onboard/preflight.ts
Selective E2E Results — ✅ All requested jobs passedRun: 25896582408
|
Summary
Host preflight currently emits a
nvidia-ctk cdi generate ...hint whenever Docker advertises CDI dirs and nonvidia.com/gpuspec is present, but it never checks whethernvidia-ctkitself is on PATH. On a fresh Ubuntu 24.04 host that has the NVIDIA driver but not the Container Toolkit, the hint fails withnvidia-ctk: command not foundand onboarding can't proceed.This PR adds a
nvidiaContainerToolkitInstalledprobe to the host assessment and branches the remediation action: when the toolkit is missing, preflight emits aninstall_nvidia_container_toolkitaction whose commands prepend the apt (or dnf/yum) bootstrap before the existingnvidia-ctk cdi generatestep. On unknown package managers, it points at NVIDIA's official install guide.Related Issue
Fixes #3506
Changes
src/lib/onboard/preflight.ts: newnvidiaContainerToolkitInstalledfield onHostAssessment, probed via the existingcommandExists("nvidia-ctk", ...)helper. Remediation forcdiNvidiaGpuSpecMissingnow branches on it: toolkit-present keeps the currentgenerate_nvidia_cdi_specaction verbatim; toolkit-missing emitsinstall_nvidia_container_toolkitwith bootstrap commands from a new exported helperbuildContainerToolkitBootstrapCommands(packageManager, generateCommands)(apt / dnf / yum / docs-pointer fallback).src/lib/onboard/preflight.test.ts: existing CDI fixtures updated withnvidiaContainerToolkitInstalled: trueto keep their behaviour. Two new cases cover the toolkit-missing branch on apt and on the docs-pointer fallback (unknown package manager).Type of Change
Verification
npx prek run --all-filespassesnpm testpassesmake docsbuilds without warnings (doc changes only)Signed-off-by: Tinson Lai tinsonl@nvidia.com
Summary by CodeRabbit
New Features
Tests