fix(onboard): ensure GPU recreate has SYS_PTRACE + apparmor=unconfined for proc/comm write by laitingsheng · Pull Request #3515 · NVIDIA/NemoClaw

laitingsheng · 2026-05-14T13:26:53Z

Summary

The Docker-driver GPU patch (src/lib/onboard/docker-gpu-patch.ts) recreates the OpenShell-managed sandbox container with --gpus all plus a reconstruction of the baseline container's flags, then runs a three-step GPU proof in src/lib/onboard/initial-policy.ts — including PROC_COMM_WRITE_PROBE, which writes to /proc/<pid>/task/<tid>/comm.

On DGX Spark hosts (#3511) and other Docker-driver Linux hosts where the OpenShell-created baseline container's CapAdd does not include SYS_PTRACE and its SecurityOpt does not include apparmor=unconfined, the recreate inherits a flag set the kernel/LSM stack rejects for that proc write. The proof aborts onboarding:

✗ GPU proof failed: /proc/<pid>/task/<tid>/comm write
Error: GPU proof failed: /proc/<pid>/task/<tid>/comm write (status 2):
  sh: 1: cannot create /proc/<pid>/task/<tid>/comm: Permission denied

The reporter confirmed a bare docker run --rm --gpus all ubuntu sh -c "echo test > /proc/1/task/1/comm" succeeds on the same host, so the kernel itself allows the write under the right Docker flags — the problem is which flags reach the recreated container.

This PR makes the GPU recreate self-sufficient for the operations the GPU proof checks, regardless of the non-GPU baseline:

- Always inject --cap-add SYS_PTRACE into the clone, deduplicated via a Set so baselines that already have it do not get a second entry.
- Inject --security-opt apparmor=unconfined only when the baseline did not pin a specific AppArmor profile. Docker rejects duplicate --security-opt apparmor=… entries, so a baseline that explicitly chose apparmor=docker-default (or similar) is respected. The change stays scoped to the GPU recreate path and does not override deliberate operator choices.
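
The two rules above can be sketched in TypeScript as follows. This is a minimal illustration, not the code in docker-gpu-patch.ts: the `BaselineFlags` type and the `gpuRecreateSecurityArgs` function are hypothetical names, and the sketch assumes AppArmor entries use the `apparmor=<profile>` form.

```typescript
// Hypothetical shape for the baseline container's relevant flags.
interface BaselineFlags {
  capAdd: string[];
  securityOpt: string[];
}

// Sketch only: builds the capability and security-option arguments for the
// GPU recreate from the baseline's values.
function gpuRecreateSecurityArgs(baseline: BaselineFlags): string[] {
  // A Set dedupes, so a baseline that already lists SYS_PTRACE yields one entry.
  const capAdd = new Set<string>(baseline.capAdd);
  capAdd.add("SYS_PTRACE");

  const securityOpt = new Set<string>(baseline.securityOpt);

  // Add apparmor=unconfined only when the baseline pinned no AppArmor profile.
  const pinsAppArmor = baseline.securityOpt.some((opt) =>
    opt.startsWith("apparmor=")
  );
  if (!pinsAppArmor) {
    securityOpt.add("apparmor=unconfined");
  }

  const args: string[] = [];
  capAdd.forEach((cap) => args.push("--cap-add", cap));
  securityOpt.forEach((opt) => args.push("--security-opt", opt));
  return args;
}

// A baseline that already has SYS_PTRACE and a pinned profile is left unchanged.
console.log(
  gpuRecreateSecurityArgs({
    capAdd: ["SYS_PTRACE"],
    securityOpt: ["apparmor=docker-default"],
  }).join(" ")
);
// prints "--cap-add SYS_PTRACE --security-opt apparmor=docker-default"
```

Using a Set keeps the merge order-preserving and idempotent, so running the sketch on its own output would not grow the argument list.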

Related Issue

Resolves #3511

Related context:

fix(onboard): repair Docker GPU sandbox readiness #3434 — PR that introduced the Docker-driver GPU recreate flow and the /proc/comm write probe. This PR hardens the recreate to satisfy what the probe checks across more baselines.

Changes

- src/lib/onboard/docker-gpu-patch.ts: deduplicates CapAdd / SecurityOpt via Set, always injects SYS_PTRACE, and injects apparmor=unconfined only when the baseline pins no AppArmor profile.
- src/lib/onboard/docker-gpu-patch.test.ts: adds four cases: SYS_PTRACE is always added; SYS_PTRACE is deduplicated when the baseline already has it; apparmor=unconfined is injected on empty baselines; a baseline-pinned AppArmor profile is preserved.

Type of Change

Code change (feature, bug fix, or refactor)
Code change with doc updates
Doc only (prose changes, no code sample modifications)
Doc only (includes code sample changes)

Verification

npx prek run --all-files passes
npm test passes
Tests added or updated for new or changed behavior
No secrets, API keys, or credentials committed
Docs updated for user-facing behavior changes
make docs builds without warnings (doc changes only)
Doc pages follow the style guide (doc changes only)
New doc pages include SPDX header and frontmatter (new pages only)

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

Summary by CodeRabbit

Bug Fixes
- GPU-enabled containers are now configured with the necessary kernel security capabilities during clone/recreate operations.
- Security profile handling was fixed to preserve existing security options and avoid unintended overrides or duplicate entries.
Tests
- Added tests covering GPU container clone behavior across various capability and security-profile scenarios to prevent regressions.


copy-pr-bot · 2026-05-14T13:26:58Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

coderabbitai · 2026-05-14T13:27:01Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 8072a2c5-57c3-40bd-b102-dbe1529afe00

📥 Commits

Reviewing files that changed from the base of the PR and between 86f25ac and 12f5cc8.

📒 Files selected for processing (1)

src/lib/onboard/docker-gpu-patch.test.ts

🚧 Files skipped from review as they are similar to previous changes (1)

src/lib/onboard/docker-gpu-patch.test.ts

📝 Walkthrough


Adds logic and tests so GPU-recreated sandbox containers include the SYS_PTRACE capability and conditionally add apparmor=unconfined only when the baseline has no AppArmor entry, preserving other baseline capability and security options.

Changes

GPU Container Recreation Permission Fix

Layer / File(s)	Summary
Permission-aware GPU recreate args `src/lib/onboard/docker-gpu-patch.ts`	Builds capability and security-option sets from the baseline, ensures `SYS_PTRACE` via `--cap-add`, preserves `--cap-drop`, and injects `--security-opt apparmor=unconfined` only if no existing `apparmor` entry is present.
Test coverage for capability and AppArmor handling `src/lib/onboard/docker-gpu-patch.test.ts`	Adds four tests validating SYS_PTRACE is added when missing, not duplicated when present, `apparmor=unconfined` is injected when absent, and existing AppArmor/securityOpt entries are preserved.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

NVIDIA/NemoClaw#3434: Related edits to docker-gpu-patch and buildDockerGpuCloneRunArgs around SYS_PTRACE and AppArmor handling.

Suggested labels

Docker, v0.0.41

Suggested reviewers

ericksoa
prekshivyas
jyaunches

Poem

🐰 I nudge the args with careful paws,
Adding SYS_PTRACE without a cause,
If AppArmor's missing, I unbind,
Else I keep what the host defined,
Hopping off—GPU proofs applaud!

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately summarizes the main change: ensuring GPU recreate has SYS_PTRACE and apparmor=unconfined capabilities for /proc/comm write operations, which directly addresses the core issue.
Linked Issues check	✅ Passed	The code changes meet the primary objective from issue `#3511`: the implementation adds SYS_PTRACE capability and conditional apparmor=unconfined to the GPU recreate flow, directly fixing the /proc/comm write permission denial.
Out of Scope Changes check	✅ Passed	All changes are directly scoped to fixing the GPU recreate Docker argument construction and adding corresponding test coverage; no unrelated modifications are present.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.


Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.


github-actions · 2026-05-14T13:28:39Z

E2E Advisor Recommendation

Required E2E: gpu-e2e
Optional E2E: gpu-double-onboard-e2e

Dispatch hint: gpu-e2e

Auto-dispatched E2E: gpu-e2e via nightly-e2e.yaml at f07f682a9c1d079d9700116db8a2cee61ff6bd79 — nightly run

Workflow run

Full advisor summary

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

gpu-e2e (high): Directly exercises the real local Ollama GPU onboarding flow, including Docker GPU patch activation, sandbox GPU proof, OpenShell reconnect path, and inference through the recreated GPU-capable sandbox. This is the required regression check for changes to Docker GPU clone/security flags.

Optional E2E

gpu-double-onboard-e2e (high): Useful adjacent coverage for running GPU onboarding twice with Ollama and verifying the sandbox/proxy still works after re-onboard. It also exercises the GPU sandbox lifecycle path, but the changed code is primarily covered by the single GPU onboarding E2E.

New E2E recommendations

docker-gpu-security-options (medium): Existing GPU E2E validates the Docker GPU patch indirectly through proof/inference, but it does not appear to explicitly assert that the recreated sandbox container received SYS_PTRACE and the intended AppArmor behavior. A targeted assertion would catch future regressions in the exact security flags changed here.
- Suggested test: Extend the GPU E2E proof to inspect the recreated sandbox container and verify HostConfig.CapAdd contains SYS_PTRACE and that AppArmor handling matches the expected Docker GPU patch mode without duplicating or overriding pinned profiles.
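
A targeted assertion of this kind could look roughly like the sketch below. It is an illustration under stated assumptions, not part of the existing E2E suite: `InspectedContainer` and `checkGpuRecreateFlags` are hypothetical names, the input is assumed to be one parsed entry from `docker inspect <container>`, and capability names are normalized on the assumption that they may appear with or without a `CAP_` prefix.

```typescript
// Minimal subset of one `docker inspect` entry; all other fields are omitted.
interface InspectedContainer {
  HostConfig: {
    CapAdd: string[] | null;
    SecurityOpt: string[] | null;
  };
}

// Returns a list of problems; an empty list means the flags look as expected.
// `expectedAppArmor` is optional: pass the baseline's pinned profile name to
// verify that it was preserved rather than overridden.
function checkGpuRecreateFlags(
  container: InspectedContainer,
  expectedAppArmor?: string
): string[] {
  const capAdd = container.HostConfig.CapAdd ?? [];
  const securityOpt = container.HostConfig.SecurityOpt ?? [];
  const problems: string[] = [];

  // Assumption: accept both "SYS_PTRACE" and "CAP_SYS_PTRACE" spellings.
  const caps = capAdd.map((cap) => cap.toUpperCase().replace(/^CAP_/, ""));
  if (caps.indexOf("SYS_PTRACE") === -1) {
    problems.push("CapAdd is missing SYS_PTRACE");
  }

  const appArmor = securityOpt.filter((opt) => opt.startsWith("apparmor="));
  if (appArmor.length === 0) {
    problems.push("SecurityOpt has no apparmor entry");
  } else if (appArmor.length > 1) {
    problems.push("duplicate apparmor entries: " + appArmor.join(", "));
  } else if (
    expectedAppArmor !== undefined &&
    appArmor[0] !== "apparmor=" + expectedAppArmor
  ) {
    problems.push("expected apparmor=" + expectedAppArmor + ", got " + appArmor[0]);
  }
  return problems;
}

// Example with a hand-written sample in place of real `docker inspect` output.
const sample: InspectedContainer[] = JSON.parse(
  '[{"HostConfig":{"CapAdd":["SYS_PTRACE"],"SecurityOpt":["apparmor=unconfined"]}}]'
);
console.log(checkGpuRecreateFlags(sample[0]));
// prints []
```

Returning a list of problems instead of throwing on the first one lets a test report every mismatched flag in a single failure message.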

Dispatch hint

Workflow: nightly-e2e.yaml
jobs input: gpu-e2e

github-actions · 2026-05-14T13:35:40Z

❌ Brev E2E (full): FAILED on branch fix/3511-gpu-recreate-proc-comm-flags — See logs

github-actions · 2026-05-14T13:36:16Z

Selective E2E Results — ⚠️ No requested jobs ran

Run: 25863084990
Target ref: 86f25ac28e5b47261c387a5520cc3e33d9ce9e1a
Workflow ref: main
Requested jobs: gpu-e2e
Summary: 0 passed, 0 failed, 1 skipped

Job	Result
gpu-e2e	⏭️ skipped

github-actions · 2026-05-14T13:48:50Z

❌ Brev E2E (full): FAILED on branch fix/3511-gpu-recreate-proc-comm-flags — See logs

github-actions · 2026-05-14T14:40:29Z

❌ Brev E2E (full): FAILED on branch fix/3511-gpu-recreate-proc-comm-flags — See logs


github-actions · 2026-05-14T14:55:19Z

Selective E2E Results — ⚠️ No requested jobs ran

Run: 25867131940
Target ref: 12f5cc80851d0ebcd3b9a03ba1b29b291788ddb5
Workflow ref: main
Requested jobs: gpu-e2e,gpu-double-onboard-e2e
Summary: 0 passed, 0 failed, 2 skipped

Job	Result
gpu-double-onboard-e2e	⏭️ skipped
gpu-e2e	⏭️ skipped

wscurran · 2026-05-14T16:14:27Z

@cv @jyaunches adding for review, this resolves a QA blocker (#3511).

github-actions · 2026-05-14T21:43:37Z

Selective E2E Results — ⚠️ No requested jobs ran

Run: 25887406164
Target ref: f07f682a9c1d079d9700116db8a2cee61ff6bd79
Workflow ref: main
Requested jobs: gpu-e2e
Summary: 0 passed, 0 failed, 1 skipped

Job	Result
gpu-e2e	⏭️ skipped

## Summary

Refreshes the NemoClaw docs for the 0.0.43 release window, covering GPU onboarding fixes, installer CDI repair behavior, and Linux uninstall cleanup. Updates the docs version metadata and regenerates the user skill references from the source docs.

## Related Issue

None.

## Changes

- #3428 -> `docs/reference/troubleshooting.md`: Documents the installer path that repairs missing NVIDIA CDI device specs before onboarding.
- #3515 and #3543 -> `docs/about/release-notes.md` and `docs/reference/troubleshooting.md`: Documents the Linux Docker-driver GPU proof permission fix for `/proc/<pid>/task/<tid>/comm` writes.
- #3536 -> `docs/reference/commands.md`: Documents that `nemoclaw uninstall` removes Linux gateway state under `~/.local/state/nemoclaw`.
- Refreshes generated `nemoclaw-user-*` skill references from the updated source docs.
- Bumps `docs/project.json` and `docs/versions1.json` to 0.0.43.

## Type of Change

- [ ] Code change (feature, bug fix, or refactor)
- [ ] Code change with doc updates
- [ ] Doc only (prose changes, no code sample modifications)
- [x] Doc only (includes code sample changes)

## Verification

- [ ] `npx prek run --all-files` passes
- [ ] `npm test` passes
- [ ] Tests added or updated for new or changed behavior
- [x] No secrets, API keys, or credentials committed
- [x] Docs updated for user-facing behavior changes
- [x] `make docs` builds without warnings (doc changes only)
- [x] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only)
- [ ] New doc pages include SPDX header and frontmatter (new pages only)

---

Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>

## Summary by CodeRabbit

* **Bug Fixes**
  * Improved GPU onboarding on Linux Docker-driver with automatic CDI spec repair and fallback mechanisms.
  * Fixed permission issues affecting GPU proof writes during Linux onboarding.
  * Enhanced uninstall to properly clean up gateway state and auth proxy processes on Linux.
* **Documentation**
  * Updated release notes, command references, and troubleshooting guides for v0.0.43.

fix(onboard): ensure GPU recreate has SYS_PTRACE + apparmor=unconfined for proc/comm write (#3511)

86f25ac

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

laitingsheng added the NemoClaw CLI, fix, and v0.0.42 Release target labels May 14, 2026

laitingsheng removed the v0.0.42 Release target label May 14, 2026

laitingsheng marked this pull request as ready for review May 14, 2026 13:33

test(onboard): trim issue number from new docker GPU patch test names

12f5cc8

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

wscurran added the v0.0.42 Release target label May 14, 2026

wscurran requested review from cv and jyaunches May 14, 2026 16:13

cv added v0.0.43 Release target and removed v0.0.42 Release target labels May 14, 2026

Merge branch 'main' into fix/3511-gpu-recreate-proc-comm-flags

f07f682

cv approved these changes May 14, 2026


cv merged commit f42b9d9 into main May 14, 2026
24 checks passed

coderabbitai Bot mentioned this pull request May 14, 2026

fix(onboard): allow proc writes for Docker GPU patch #3543

Merged

12 tasks

miyoungc mentioned this pull request May 15, 2026

docs(release): refresh 0.0.43 docs #3613

Merged

12 tasks

coderabbitai Bot mentioned this pull request May 21, 2026

fix(hermes): restore Spark GPU recreate startup #3963

Merged
