fix(onboard): ensure GPU recreate has SYS_PTRACE + apparmor=unconfined for proc/comm write #3515
…d for proc/comm write (#3511) Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Walkthrough
Adds logic and tests so GPU-recreated sandbox containers include the
Changes
GPU Container Recreation Permission Fix
❌ Brev E2E (full): FAILED on branch |
Selective E2E Results —
|
| Job | Result |
|---|---|
| gpu-e2e | ⏭️ skipped |
Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Selective E2E Results —
|
| Job | Result |
|---|---|
| gpu-double-onboard-e2e | ⏭️ skipped |
| gpu-e2e | ⏭️ skipped |
|
@cv @jyaunches adding for review; this resolves a QA blocker (#3511).
Selective E2E Results —
|
| Job | Result |
|---|---|
| gpu-e2e | ⏭️ skipped |
## Summary
Refreshes the NemoClaw docs for the 0.0.43 release window, covering GPU onboarding fixes, installer CDI repair behavior, and Linux uninstall cleanup. Updates the docs version metadata and regenerates the user skill references from the source docs.

## Related Issue
None.

## Changes
- #3428 -> `docs/reference/troubleshooting.md`: Documents the installer path that repairs missing NVIDIA CDI device specs before onboarding.
- #3515 and #3543 -> `docs/about/release-notes.md` and `docs/reference/troubleshooting.md`: Documents the Linux Docker-driver GPU proof permission fix for `/proc/<pid>/task/<tid>/comm` writes.
- #3536 -> `docs/reference/commands.md`: Documents that `nemoclaw uninstall` removes Linux gateway state under `~/.local/state/nemoclaw`.
- Refreshes generated `nemoclaw-user-*` skill references from the updated source docs.
- Bumps `docs/project.json` and `docs/versions1.json` to 0.0.43.

## Type of Change
- [ ] Code change (feature, bug fix, or refactor)
- [ ] Code change with doc updates
- [ ] Doc only (prose changes, no code sample modifications)
- [x] Doc only (includes code sample changes)

## Verification
- [ ] `npx prek run --all-files` passes
- [ ] `npm test` passes
- [ ] Tests added or updated for new or changed behavior
- [x] No secrets, API keys, or credentials committed
- [x] Docs updated for user-facing behavior changes
- [x] `make docs` builds without warnings (doc changes only)
- [x] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only)
- [ ] New doc pages include SPDX header and frontmatter (new pages only)

---
Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>

## Summary by CodeRabbit
* **Bug Fixes**
  * Improved GPU onboarding on Linux Docker-driver with automatic CDI spec repair and fallback mechanisms.
  * Fixed permission issues affecting GPU proof writes during Linux onboarding.
  * Enhanced uninstall to properly clean up gateway state and auth proxy processes on Linux.
* **Documentation**
  * Updated release notes, command references, and troubleshooting guides for v0.0.43.
Summary
The Docker-driver GPU patch (`src/lib/onboard/docker-gpu-patch.ts`) recreates the OpenShell-managed sandbox container with `--gpus all` plus a reconstruction of the baseline container's flags, then runs a three-step GPU proof in `src/lib/onboard/initial-policy.ts`, including `PROC_COMM_WRITE_PROBE`, which writes to `/proc/<pid>/task/<tid>/comm`.

On DGX Spark hosts (#3511) and other Docker-driver Linux hosts where the OpenShell-created baseline container's `CapAdd` does not include `SYS_PTRACE` and its `SecurityOpt` does not include `apparmor=unconfined`, the recreate inherits a flag set the kernel/LSM stack rejects for that proc write. The proof then aborts onboarding.

The reporter confirmed that a bare `docker run --rm --gpus all ubuntu sh -c "echo test > /proc/1/task/1/comm"` succeeds on the same host, so the kernel itself allows the write under the right Docker flags; the problem is which flags reach the recreated container.

This PR makes the GPU recreate self-sufficient for the operations the GPU proof checks, regardless of the non-GPU baseline:

- Injects `--cap-add SYS_PTRACE` into the clone, deduped via a `Set` so baselines that already have it stay flat.
- Injects `--security-opt apparmor=unconfined` only when the baseline did not pin a specific apparmor profile. Docker rejects duplicate `--security-opt apparmor=…` entries, so a baseline that explicitly chose `apparmor=docker-default` (or similar) is respected. This stays scoped to the GPU recreate path and does not override deliberate operator choices.

Related Issue
Resolves #3511
Related context:
- `/proc/comm` write probe. This PR hardens the recreate to satisfy what the probe checks across more baselines.

Changes
- `src/lib/onboard/docker-gpu-patch.ts`: dedup `CapAdd`/`SecurityOpt` via `Set`, always inject `SYS_PTRACE`, and inject `apparmor=unconfined` only when no apparmor profile is pinned by the baseline.
- `src/lib/onboard/docker-gpu-patch.test.ts`: four new cases: `SYS_PTRACE` always added; `SYS_PTRACE` deduped when the baseline already has it; `apparmor=unconfined` injected on empty baselines; baseline-pinned apparmor profile preserved.

Type of Change

Verification
- `npx prek run --all-files` passes
- `npm test` passes
- `make docs` builds without warnings (doc changes only)

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
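
For readers without the diff open, the flag-injection rules described in this PR can be sketched roughly as follows. This is a minimal illustration, not the code in `src/lib/onboard/docker-gpu-patch.ts`: the `GpuRecreateFlags` shape and the `hardenGpuRecreateFlags` and `toDockerArgs` helpers are hypothetical names, and the sketch assumes a pinned profile always appears as an `apparmor=<name>` entry in `SecurityOpt`.

```typescript
// Hypothetical shape: the two baseline fields this PR touches, named after
// the CapAdd / SecurityOpt keys of the baseline container's HostConfig.
interface GpuRecreateFlags {
  capAdd: string[];
  securityOpt: string[];
}

// Assumed helper (not the PR's actual function): returns the flag set the
// GPU recreate would use for a given baseline.
function hardenGpuRecreateFlags(baseline: GpuRecreateFlags): GpuRecreateFlags {
  // A Set keeps insertion order and drops duplicates, so a baseline that
  // already carries SYS_PTRACE does not gain a second copy.
  const capAdd = new Set<string>(baseline.capAdd);
  capAdd.add("SYS_PTRACE");

  const securityOpt = new Set<string>(baseline.securityOpt);
  // Assumption: a pinned profile shows up as "apparmor=<name>".
  const pinsAppArmor = baseline.securityOpt.some((opt) =>
    opt.startsWith("apparmor="),
  );
  if (!pinsAppArmor) {
    // Only fill the gap; never override a profile the baseline chose.
    securityOpt.add("apparmor=unconfined");
  }

  return {
    capAdd: Array.from(capAdd),
    securityOpt: Array.from(securityOpt),
  };
}

// Assumed helper: flattens the flag set into `docker run` arguments.
function toDockerArgs(flags: GpuRecreateFlags): string[] {
  const args: string[] = [];
  for (const cap of flags.capAdd) {
    args.push("--cap-add", cap);
  }
  for (const opt of flags.securityOpt) {
    args.push("--security-opt", opt);
  }
  return args;
}

// Empty baseline: both flags are injected.
const emptyBaseline = hardenGpuRecreateFlags({ capAdd: [], securityOpt: [] });

// Baseline that already has SYS_PTRACE and pins a profile: nothing is
// duplicated and the pinned profile is left alone.
const pinnedBaseline = hardenGpuRecreateFlags({
  capAdd: ["SYS_PTRACE", "NET_ADMIN"],
  securityOpt: ["apparmor=docker-default"],
});

console.log(toDockerArgs(emptyBaseline).join(" "));
console.log(toDockerArgs(pinnedBaseline).join(" "));
```

The two sample baselines mirror the four test cases listed under Changes: injection on an empty baseline, deduplication of `SYS_PTRACE`, and preservation of a baseline-pinned apparmor profile.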