fix(sandbox): attribute supervisor-helper peers via pid-1 netns fallback #956

Open
Kh4L wants to merge 2 commits into NVIDIA:main from Kh4L:fix/helper-netns-peer-attribution

@Kh4L Kh4L commented Apr 24, 2026

Summary

Fix peer attribution for policy-proxy CONNECTs that originate from supervisor helpers.

parse_proc_net_tcp currently searches only /proc/<entrypoint_pid>/net/tcp to identify the peer process behind a CONNECT. Supervisor helpers introduced in #953 share the supervisor's network namespace, not the workload's, so helper-originated connections never appear in the workload procfs view. The proxy logs "failed to resolve peer binary: No ESTABLISHED TCP connection found" and denies the request even when the host is allowlisted.
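
To make the failure mode concrete, here is a minimal sketch of the kind of lookup involved. It is not the code in procfs.rs: the names `TcpRow`, `parse_row`, and `find_established_inode` are hypothetical, and it handles only the IPv4 table. It relies on the documented /proc/net/tcp layout, where field 1 is the local address, field 3 is the hex state (01 is ESTABLISHED), and field 9 is the socket inode.

```rust
/// One parsed row of a `/proc/<pid>/net/tcp` table (IPv4 only).
/// Hypothetical type for illustration; not taken from procfs.rs.
#[derive(Debug, PartialEq)]
struct TcpRow {
    local_port: u16,
    state: u8,
    inode: u64,
}

/// Kernel value for an ESTABLISHED TCP socket.
const TCP_ESTABLISHED: u8 = 0x01;

/// Parses a single data row; returns None for the header or malformed rows.
fn parse_row(line: &str) -> Option<TcpRow> {
    let fields: Vec<&str> = line.split_whitespace().collect();
    if fields.len() < 10 {
        return None;
    }
    // Addresses look like "0100007F:1F90": hex IP, colon, hex port.
    let (_, port_hex) = fields[1].rsplit_once(':')?;
    Some(TcpRow {
        local_port: u16::from_str_radix(port_hex, 16).ok()?,
        state: u8::from_str_radix(fields[3], 16).ok()?,
        inode: fields[9].parse().ok()?,
    })
}

/// Returns the socket inode of the ESTABLISHED connection whose local port
/// equals the peer's source port as seen by the proxy, if this table has one.
fn find_established_inode(table: &str, peer_port: u16) -> Option<u64> {
    table
        .lines()
        .filter_map(parse_row)
        .find(|row| row.state == TCP_ESTABLISHED && row.local_port == peer_port)
        .map(|row| row.inode)
}
```

A helper's socket lives in the supervisor's namespace, so a search like this over the workload's table returns `None` for it, which is the condition that surfaces as the error above.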

This PR adds a narrow fallback: check the workload's procfs view first, then fall back to /proc/1/net/tcp, where pid 1 is the sandbox supervisor.

This is stacked on #953 and should land after that PR. Until #953 lands, the diff against main will include the supervisor-helpers commit as well as this procfs fix.

Related Issue

Related to #681 and #684.

Depends on #953.

Why pid 1

The previous #684 proposal used /proc/net/tcp, which is broader and raised a port-collision concern because it searches the init-namespace global view.

This PR keeps the fallback inside the sandbox pod boundary:

| Approach | Search space | Port-collision risk |
| --- | --- | --- |
| Current behavior | /proc/<workload_pid>/net/tcp | None, but misses helpers and some redirected flows |
| #684 | /proc/net/tcp | Broader host/global view |
| This PR | /proc/1/net/tcp | Same pod namespace; workload checked first |

Inside a sandbox pod, pid 1 is the supervisor. Helpers share the supervisor network namespace by design. Workload peers still win because the workload procfs view is checked first, so existing workload behavior is unchanged unless the original lookup already missed.
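
The lookup order can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: `resolve_with_fallback`, `read_proc_net_tcp`, and `SUPERVISOR_PID` are hypothetical names, the table reader and row matcher are passed in as closures so the ordering logic can be exercised without a real procfs, and the error wording is a guess at what "mention both lookup paths" might look like.

```rust
use std::fs;
use std::io;

/// Pid of the sandbox supervisor inside the pod, per the PR description.
const SUPERVISOR_PID: u32 = 1;

/// Searches the workload's TCP table first, then the supervisor's.
/// Returns the pid whose view matched together with the socket inode.
/// `read_table` and `matches` are injected (hypothetical seams for testing).
fn resolve_with_fallback<R, M>(
    entrypoint_pid: u32,
    read_table: R,
    matches: M,
) -> Result<(u32, u64), String>
where
    R: Fn(u32) -> io::Result<String>,
    M: Fn(&str) -> Option<u64>,
{
    let mut tried = Vec::new();
    // Order matters: the workload view is always consulted before pid 1.
    for pid in [entrypoint_pid, SUPERVISOR_PID] {
        if let Ok(table) = read_table(pid) {
            if let Some(inode) = matches(&table) {
                return Ok((pid, inode));
            }
        }
        tried.push(format!("/proc/{pid}/net/tcp"));
    }
    // Hypothetical wording; the real message in procfs.rs may differ.
    Err(format!(
        "No ESTABLISHED TCP connection found in {}",
        tried.join(" or ")
    ))
}

/// Production-style reader: loads the table for one pid from procfs.
fn read_proc_net_tcp(pid: u32) -> io::Result<String> {
    fs::read_to_string(format!("/proc/{pid}/net/tcp"))
}
```

Because the loop returns on the first hit, a connection visible in the workload's table never reaches the pid 1 lookup, which is why existing workload attribution is unaffected.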

Downstream identity checks in proxy::resolve_process_identity apply unchanged. Binary hash TOFU and ancestor walking still catch binary swaps or unexpected process lineage.
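
For readers unfamiliar with ancestor walking, the sketch below shows one way a process lineage can be read from procfs. It does not reflect the internals of proxy::resolve_process_identity, which this PR does not show; `parse_ppid` and `ancestors` are hypothetical names, and the stat reader is injected as a closure. The only fact assumed about the kernel is the /proc/<pid>/stat layout: pid, parenthesized comm, state, then ppid.

```rust
/// Extracts the parent pid from the contents of `/proc/<pid>/stat`.
fn parse_ppid(stat: &str) -> Option<u32> {
    // comm is wrapped in parentheses and may itself contain spaces or ')',
    // so split at the last ')' rather than on whitespace.
    let (_, rest) = stat.rsplit_once(')')?;
    let mut fields = rest.split_whitespace();
    let _state = fields.next()?;
    fields.next()?.parse().ok()
}

/// Follows parent links upward from `pid` and returns the chain of ancestors.
/// Stops when a stat read fails, a parse fails, or the parent pid is 0.
fn ancestors<F>(pid: u32, read_stat: F) -> Vec<u32>
where
    F: Fn(u32) -> Option<String>,
{
    let mut chain = Vec::new();
    let mut current = pid;
    // Bounded loop guards against cycles in malformed input.
    for _ in 0..64 {
        let ppid = match read_stat(current).as_deref().and_then(parse_ppid) {
            Some(p) if p != 0 => p,
            _ => break,
        };
        chain.push(ppid);
        current = ppid;
    }
    chain
}
```

A policy check could then compare the returned chain against the lineage it expects, so a peer with an unexpected parent is rejected regardless of which procfs view located its socket.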

Changes

  • Extended crates/openshell-sandbox/src/procfs.rs to search [entrypoint_pid, 1].
  • Updated the peer-resolution error message to mention both lookup paths.
  • Added comments explaining the lookup order and security boundary.

Testing

Manual validation confirmed helper-originated CONNECTs to api.telegram.org and openrouter.ai are attributed and allowed by policy instead of failing with "No ESTABLISHED TCP connection found".

Checklist

Kh4L added 2 commits April 23, 2026 19:15
Declarative JSON config for daemons the supervisor should spawn alongside
the workload with operator-declared ambient capabilities. Each listed
helper is forked by the supervisor before the workload, has its requested
caps raised into the inheritable and ambient sets via capset(2) and
prctl(PR_CAP_AMBIENT, PR_CAP_AMBIENT_RAISE, ...) in the pre-exec child,
and is then execve'd. After execve the ambient set becomes part of the
helper's permitted and effective sets, so it keeps its capabilities even
though the workload path still runs with PR_SET_NO_NEW_PRIVS=1.

Helpers run in the supervisor's context and do not inherit the per-workload
seccomp filter, PR_SET_NO_NEW_PRIVS, Landlock, or the uid drop. The
supervisor seccomp prelude added in NVIDIA#891 is applied after helpers spawn
and only blocks mount/fs*/pivot_root/bpf/kexec/module/userfaultfd syscalls,
so capset, prctl, clone, and execve remain available to the helper spawn
path.

Validation rejects empty names, duplicate names, empty commands,
non-absolute command[0], and unknown capability names. A cap outside the
supervisor's permitted set (which the pod's bounding set clamps) fails at
capset with EPERM and the supervisor exits before the workload starts.

The SSH handshake secret is scrubbed from the helper's inherited
environment, matching what the workload sees. Each helper start emits an
OCSF AppLifecycle Start event with the helper name and pid.

This generalizes the hardcoded DNS-proxy-shaped pattern into a public,
declarative API so operators can register audited daemons (capability
brokers, privileged IPC bridges) without patching the supervisor binary.

Intentionally out of scope for this v0 landing (tracked for follow-ups):
per-helper Landlock, workload rendezvous, restart policy, readiness fd,
stdio routing into OCSF JSONL, per-helper cgroup limits, and runAsUser
for dropping to non-root while retaining caps. All additive on the v0
schema.

Flag: --helpers-config <path> (env: OPENSHELL_HELPERS_CONFIG).
Architecture: architecture/supervisor-helpers.md.
User docs: docs/sandboxes/supervisor-helpers.mdx.

Signed-off-by: Serge Panev <spanev@nvidia.com>
Signed-off-by: Serge Panev <spanev@Serges-Mac-mini.local>

parse_proc_net_tcp only searches /proc/<entrypoint_pid>/net/tcp to
identify the peer of a proxy CONNECT. Supervisor helpers (declared
via --helpers-config, spawned pre-workload by helpers::spawn_helpers)
share the supervisor's netns, not the workload's, so their connections
never appear in the workload's procfs view. The policy proxy logs
"failed to resolve peer binary: No ESTABLISHED TCP connection found"
and denies every allowlisted host, breaking helpers that legitimately
need to reach external services (capability brokers, inference
gateways running alongside the sandboxed agent).

Add a fallback: try the workload's netns first, then fall back to
pid 1's (the supervisor's). Workload-originated connections always
take precedence because the workload netns is checked first, and
downstream identity checks (binary hash TOFU + ancestor walk) apply
unchanged — a helper swapped for a different binary still trips
integrity verification the same way a workload peer would.

This closes the peer-attribution gap that kept sclaw's mediator-
spawned openclaw gateway helper from reaching api.telegram.org
through the policy proxy.

Signed-off-by: Serge Panev <spanev@Serges-Mac-mini.local>
Signed-off-by: Serge Panev <spanev@nvidia.com>

copy-pr-bot Bot commented Apr 24, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.
