refactor(openshell-sandbox): Split `sandbox` into `process` and `network` subcrates. by rrhubenov · Pull Request #1650 · NVIDIA/OpenShell

rrhubenov · 2026-06-01T13:03:36Z

Summary

This PR is an implementation of #1305.
For full context on why this is required, please read #981 and #1305 in that order.

In summary, this split has been proposed for 2 main reasons:

As a precursor to a possible split-pod topology of the supervisor (previously sandbox)
In general a better decomposition of the 2 main "categories" of functionality ("process" and "network")

Although this change has been mostly evaluated from the perspective of the "split pod" topology, this splitting allows for a more agile topology regardless of environment. If for any reason the "process" and "network" functionality need to work as standalone processes, this split is required.

Review suggestions:

This refactor is hard to review due to a big collection of changes that are mostly movements of files and/or code blocks. For this reason, I suggest that reviewers take the approach of reviewing the changes commit by commit. I have taken care that each commit is as small as possible, moving only one functionality (module, function, etc.) in one commit.

Bear in mind that during the implementation, one very important architectural decision is that the 2 new crates (supervisor-network and supervisor-process) do not have a dependency on one another.
Only the "orcherstrator" crate, that being the openshell-supervisor should have a dependency on both of them.
One other important point was that no behavioural changes should occur as a side-effect from this refactoring.

NOTE:
I haven't renamed the openshell-sandbox crate to openshell-supervisor since if done fully (e.g. renaming the binary as well), would result in a breaking change which I see as out of scope for this PR.

I have also tried to minimise the amount of architectural design decisions in this PR, since it is already big enough.

Related Issue

Closes #1305

Changes

Testing

mise run pre-commit passes
Unit tests added/updated (no update was required due to this change being only a refactor) N/A
E2E tests added/updated (if applicable) N/A

Checklist

Follows Conventional Commits
Commits are signed off (DCO)
Architecture docs updated (if applicable) N/A - review determined that docs do not go into implementation details

Lifts TLS state generation, network namespace setup, proxy startup, bypass monitor spawn, and SSH-side proxy URL / netns FD computation out of run_sandbox into a sibling async fn `run_networking` that returns a Networking struct. The identity cache moves with it (only consumed by the proxy). Entrypoint PID allocation moves just above the call site so it can be passed in. No behavior changes — same OCSF emits, same async order, same RAII lifetimes for the proxy and bypass-monitor handles, now held by the returned Networking value in run_sandbox's frame. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

Lifts the post-networking tail of `run_sandbox` (zombie reaper, SSH server, supervisor session, process spawn, OPA probe, policy poll loop, denial aggregator, wait/exit) into a sibling async fn `run_process`. Also moves network namespace creation out of `run_networking` into a new `create_netns_for_proxy` helper invoked from `run_sandbox`, so `run_networking` is purely the proxy component (OPA evaluation, TLS interception, credential injection, inference routing, gRPC control API). The netns is then borrowed into both `run_networking` and `run_process`. No behavior change. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

…ell-supervisor-process crates Add empty placeholder crates that subsequent commits will populate as the sandbox decomposition proceeds. Both crates compile clean as part of the workspace and are picked up automatically by the existing `members = ["crates/*"]` glob. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

The DenialEvent struct is emitted by both the proxy/L7 layer (networking-side) and the bypass monitor (process-side), and crosses the run_networking -> run_process API boundary. Move it to openshell-core so the eventual supervisor-networking and supervisor-process crates can both reference it without depending on each other. DenialAggregator and the channel/flush helpers stay in openshell-sandbox for now. A thin `pub use openshell_core::DenialEvent;` re-export from denial_aggregator.rs keeps every existing `crate::denial_aggregator::DenialEvent` call site resolving without further edits. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

Move the lexical path-normalization helper from openshell-policy to openshell-core::paths so it can be reached from crates that sit below openshell-policy in the dependency graph. openshell-policy keeps its existing public API via a `pub use` re-export, so all current call sites (e.g. openshell-sandbox/src/policy.rs) continue to resolve unchanged. This is a prerequisite for lifting openshell-sandbox/src/policy.rs into openshell-core: that file's `From<ProtoFilesystemPolicy>` impl calls normalize_path, and lifting it as-is would cycle through openshell-policy. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

Move openshell-sandbox/src/policy.rs (SandboxPolicy, NetworkPolicy, ProxyPolicy, FilesystemPolicy, LandlockPolicy, ProcessPolicy, NetworkMode, LandlockCompatibility, plus their Proto* TryFrom/From impls) to openshell-core/src/policy.rs. Both prospective supervisor leaves (networking and process) dispatch on SandboxPolicy. Hosting it in openshell-core lets either leaf reach for it without depending on the other (or on the future orchestrator). The From<ProtoFilesystemPolicy> impl now calls the in-crate openshell_core::paths::normalize_path lifted in the previous commit, which is what made this move cycle-free. Update all crate::policy::* call sites in openshell-sandbox to openshell_core::policy::*. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

child_env (proxy_env_vars, tls_env_vars) is process-side only — consumed by process.rs and ssh.rs. With the orchestrator staying in openshell-sandbox (Shape A), openshell-sandbox depends on the new leaf crates, so process-only modules can land in openshell-supervisor-process directly. Add openshell-supervisor-process as a path dependency of openshell-sandbox. Update process.rs and ssh.rs to import from openshell_supervisor_process::child_env. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

Move the static skills installer (and its embedded resource directory) out of openshell-sandbox into openshell-supervisor-process. The module is process-side only — invoked once during sandbox start to drop agent skill files into the workspace — and has no cross-leaf consumers. Adds miette as a dependency and tempfile as a dev-dependency on openshell-supervisor-process. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

…ll-sandbox Move the mechanistic mapper (HTTP method/path → operation classifier that derives policy proposals from connection summaries) out of openshell-sandbox into openshell-supervisor-networking. Single internal caller (run_policy_poll_loop in lib.rs) and only depends on openshell-core + tracing — no cross-leaf entanglement. First population of the openshell-supervisor-networking crate; adds openshell-core and tracing as dependencies. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

Move procfs (PID lookups, ancestor walking, /proc/net/tcp socket-owner resolution, file SHA256 hashing) from openshell-sandbox into openshell-core. The module is consumed cross-leaf — by bypass_monitor on the process side and by identity / proxy on the networking side — so it has to sit below both leaves. Adds tracing, sha2, and hex as dependencies on openshell-core. Updates the three call sites in openshell-sandbox to import from openshell_core::procfs. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

Move BinaryIdentityCache (path → SHA256 cache used to identify the process behind an outbound connection) from openshell-sandbox into openshell-supervisor-networking. The cache is consumed only by the networking-side proxy and the orchestrator; with procfs already in openshell-core there are no remaining cross-leaf dependencies. Adds miette as a dependency and tempfile as a dev-dependency on openshell-supervisor-networking. Adds a Default impl for BinaryIdentityCache to satisfy clippy::new_without_default now that the type is publicly exposed across crates. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

…l-sandbox Move AGENT_PROPOSALS_ENABLED, agent_proposals_enabled(), and the test-only ProposalsFlagGuard out of openshell-sandbox into openshell-supervisor-process::proposals. The flag is read only by the process-side policy_local route handler and the orchestrator; lifting it to openshell-core would have made core carry sandbox-owned runtime state without buying anything. The test-only ProposalsFlagGuard is still consumed from networking-side l7/rest tests today (until the wider Q2 OCSF/gRPC injection work lands). Expose it via a new optional `test-helpers` feature on openshell-supervisor-process so test crates opt in explicitly without pulling tokio sync primitives into production builds. openshell-sandbox keeps its existing crate-private path (`crate::AGENT_PROPOSALS_ENABLED`, `crate::test_helpers`) via re-exports so call sites and tests are unchanged. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

Move crates/openshell-sandbox/src/secrets.rs to crates/openshell-core/src/secrets.rs so both supervisor leaves can reach SecretResolver and the placeholder helpers without depending on openshell-sandbox. Add base64 to openshell-core deps (only stdlib + base64 are used). Promote previously pub(crate) constructors and methods on SecretResolver to pub since cross-crate callers (provider_credentials, proxy/L7 tests) now name them across the crate boundary. Update import paths in proxy.rs, l7/{rest,relay,websocket}.rs, and provider_credentials.rs from crate::secrets to openshell_core::secrets. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

Move crates/openshell-sandbox/src/provider_credentials.rs to crates/openshell-core/src/provider_credentials.rs. Both supervisor leaves now name ProviderCredentialState in their function signatures (run_networking takes &ProviderCredentialState, run_process takes ProviderCredentialState by value), and under Shape A leaves can't depend on openshell-sandbox, so the type must live in openshell-core. The orchestrator (run_sandbox in openshell-sandbox) remains the only writer: it constructs ProviderCredentialState::from_environment and the policy poll loop calls install_environment on credential rotation. Both leaves stay pure readers via snapshot()/resolver(). Update import paths in proxy.rs, ssh.rs, and lib.rs from crate::provider_credentials to openshell_core::provider_credentials. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

Move the process-wide OCSF SandboxContext OnceLock + LazyLock fallback + getter from openshell-sandbox/src/lib.rs into a new openshell-ocsf::ctx module. The type already lives in openshell-ocsf, so its singleton lives next to it. Add openshell_ocsf::ctx::set_ctx() and openshell_ocsf::ctx::ctx(). The orchestrator (run_sandbox) now calls set_ctx during startup. Sandbox keeps a pub(crate) use openshell_ocsf::ctx::ctx as ocsf_ctx; re-export so the 138 existing crate::ocsf_ctx() call sites resolve unchanged. When the sandbox modules themselves migrate into the leaf crates, they'll import openshell_ocsf::ctx directly and the re-export goes away. Under Shape A neither leaf can depend on openshell-sandbox; both already depend on openshell-ocsf to construct events, so this adds no new dep edge. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

Both prospective leaves (supervisor-networking and supervisor-process) need CachedOpenShellClient, AuthedChannel, and the connect/fetch helpers. Under Shape A the leaves cannot depend on openshell-sandbox, so the type has to live below them. openshell-core already pulls in tonic and miette; this enables tonic's channel/tls features and adds tokio as a direct dep. Updates all crate::grpc_client::* call sites in openshell-sandbox to openshell_core::grpc_client::*. No re-export shim — the call-site count was small enough to update directly. See architecture/plans/sandbox-split-design-choices.md for the full rationale and trade-offs. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

…l-sandbox DenialAggregator and FlushableDenialSummary belong with the proxy and L7 layer that emit denials. Moves the file into openshell-supervisor-networking; adds tokio as a regular dep there since DenialAggregator uses tokio::sync::mpsc. Drops the pub use openshell_core::DenialEvent re-export inside the moved file (no longer needed cross-crate). Updates bypass_monitor.rs, proxy.rs, and lib.rs to import openshell_core::DenialEvent directly. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

LogPushLayer is a process-side tracing layer that streams sandbox logs to the gateway via gRPC. Moves into openshell-supervisor-process; adds openshell-core, openshell-ocsf, tokio-stream, tracing, and tracing-subscriber as direct deps there. Updates the only external call site (openshell-sandbox/src/main.rs) to import from openshell_supervisor_process::log_push. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

bypass_monitor reads /dev/kmsg for nftables drop log lines and emits denial events. Pure process-side concern, called only from run_networking which spawns it on the netns. Moves into openshell-supervisor-process; all deps (openshell-core, openshell-ocsf, tokio, tracing) were already declared there. Replaces crate::ocsf_ctx() shim calls inside the moved file with openshell_ocsf::ctx::ctx() — first leaf-side caller to import the OCSF context singleton directly instead of going through openshell-sandbox's re-export. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

debug_rpc is the CLI subcommand handler that exercises authenticated gRPC calls (issue-token, refresh-token, get-config, etc.). Pure process-side concern, called only from openshell-sandbox/main.rs. Adds base64, hex, serde_json, sha2, and tonic (with channel/tls features) as direct deps on openshell-supervisor-process. Updates the single call site in main.rs. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

…sandbox supervisor_session opens a bidirectional gRPC stream that lets the gateway initiate shells inside the sandbox. Pure process-side concern, called only from run_process. Adds uuid as a direct dep on openshell-supervisor-process. Replaces crate::ocsf_ctx() shim calls inside the moved file with openshell_ocsf::ctx::ctx() — same pattern as bypass_monitor. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

…shell-sandbox The MANAGED_CHILDREN set tracks PIDs of supervisor-spawned children (entrypoint + SSH sessions) so the orchestrator's SIGCHLD reaper can distinguish them from incidental zombies. Pure process-side concern, moves to openshell_supervisor_process::managed_children with three public fns: register, unregister, is_managed. Updates lib.rs reaper, process.rs, and ssh.rs to call through the new module path. Drops the now-unused HashSet import from lib.rs. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

…andbox Lift the process-only hardening pieces (landlock, seccomp, PreparedSandbox, prepare/enforce, log_sandbox_readiness, top-level apply, and apply_supervisor_startup_hardening) from crates/openshell-sandbox/src/sandbox/ to crates/openshell-supervisor-process/src/sandbox/. Leave netns.rs and nft_ruleset.rs in openshell-sandbox for now, since both eventual leaf crates (supervisor-networking and supervisor-process) read from NetworkNamespace and its final home is decided when run_networking and run_process are extracted. Replace crate::ocsf_ctx() shims in landlock.rs and the new linux/mod.rs with direct openshell_ocsf::ctx::ctx() calls. Update call sites in lib.rs, process.rs, and ssh.rs to import sandbox from openshell_supervisor_process while keeping the netns import unchanged. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

Move proposals.rs (AGENT_PROPOSALS_ENABLED OnceLock + agent_proposals_enabled reader + test_helpers::ProposalsFlagGuard) from openshell-supervisor-process to openshell-core so both eventual leaf crates can read it without depending on each other. The flag is process-wide singleton state initialised once during sandbox startup and read by both the policy.local route (networking-side) and the skills installer (process-side) — same shape as openshell_ocsf::ctx. Move the test-helpers Cargo feature alongside it: openshell-core gains the feature, openshell-supervisor-process loses it, and openshell-sandbox's dev-dependency now enables openshell-core/test-helpers. Update the sandbox re-export shim to point at openshell_core::proposals. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

Move NetworkNamespace and the nft_ruleset bypass-rule generator from crates/openshell-sandbox/src/sandbox/linux/ to crates/openshell-core/src/netns/. Both eventual leaf crates (supervisor-networking and supervisor-process) read from NetworkNamespace, so it must live somewhere both can depend on without violating the Shape A no-leaf-to-leaf rule. Replace crate::ocsf_ctx() shims in netns with direct openshell_ocsf::ctx::ctx() calls, matching the pattern used in already-migrated process modules. Update super::nft_ruleset references inside netns to nft_ruleset since the module is now a sibling sub-module of netns/mod.rs. Add openshell-ocsf and uuid as linux-only dependencies of openshell-core, and gate pub mod netns on target_os = "linux" since the implementation uses netlink, ip(8), and namespace fds. Delete the now-empty sandbox/{mod.rs, linux/mod.rs} stubs and update NetworkNamespace import paths in lib.rs and process.rs to point at openshell_core::netns. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

…ll-sandbox Lift the entrypoint process spawn module and the embedded SSH server module into openshell-supervisor-process. openshell-sandbox now re-exports ProcessHandle/ProcessStatus and calls openshell_supervisor_process::ssh::run_ssh_server directly. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

…om openshell-sandbox Lift the egress proxy, L7 enforcement modules, OPA engine, and policy.local advisor API into openshell-supervisor-networking. Move accompanying data files (sandbox-policy.rego), test fixtures (testdata/), and integration tests (system_inference, websocket_upgrade). Sandbox lib.rs now references these via openshell_supervisor_networking::* and ProxyHandle::start_with_bind_addr is exposed as pub for the orchestrator call site. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

…orchestrator Move the symlink-resolver, policy poll loop, and denial-aggregator flush spawns out of run_process and into run_sandbox so run_process no longer needs OpaEngine, retained_proto, the local policy context, the sandbox name, the gateway endpoint for telemetry, the OCSF flag, or the denial receiver. These long-running orchestrator-owned tasks now live alongside the other sandbox-startup wiring, matching the design log decision in architecture/plans/sandbox-split-design-choices.md (Q5). Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

Lift the workload supervision entry point (zombie reaper, SSH server spawn, supervisor session, entrypoint child spawn, exit-with-timeout) into its own module in openshell-supervisor-process. The orchestrator in openshell-sandbox now calls openshell_supervisor_process::run::run_process directly. With this move run_process names only types from openshell-core, openshell-ocsf, openshell-supervisor-process itself, std, and tokio — no openshell-supervisor-networking dependency. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

derekwaynecarr · 2026-06-01T14:35:13Z

/ok to test 1e22061

copy-pr-bot · 2026-06-01T14:35:18Z

/ok to test 1e22061

@derekwaynecarr, there was an error processing your request: E2

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/

derekwaynecarr · 2026-06-01T14:37:12Z

/ok to test 1e22061

copy-pr-bot · 2026-06-01T14:37:17Z

/ok to test 1e22061

@derekwaynecarr, there was an error processing your request: E2

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/

TaylorMutch · 2026-06-01T15:46:15Z

/ok to test 33e00fd

github-actions · 2026-06-01T18:09:06Z

Label test:e2e applied for 33e00fd. Open the existing run and click Re-run all jobs to execute with the label set. The run will execute the standard E2E suite after building the required gateway and supervisor images once. The matching required CI gate status on this PR will flip green automatically once the run finishes.

drew

This is looking good, thanks for taking this on.

I haven't renamed the openshell-sandbox crate to openshell-supervisor since if done fully (e.g. renaming the binary as well), would result in a breaking change which I see as out of scope for this PR.

Good call, we can figure this out separately.

One area I focused the review on were packages moved to openshell-core. Generally these looked good. Moving grpc client helpers, policy infra, provider components to core all make sense. Left a comment around the netns/nft_ruleset moves.

drew · 2026-06-02T06:11:59Z

+        let resolve_proto = proto.clone();
+        let resolve_pid = entrypoint_pid.clone();
+        tokio::spawn(async move {
+            let pid = resolve_pid.load(Ordering::Acquire);


(from codex review)

run_networking starts before run_process stores the child PID, so loading entrypoint_pid once here usually captures 0. The retry loop then keeps probing /proc/0/root and never reloads OPA with the real workload root, which regresses symlinked binary policies such as /usr/bin/python3.

Suggested fix: wait for a non-zero PID separately, then start the existing /proc//root probe loop. That keeps the old 5s retry window focused on “is proc root ready?” instead of spending it before the process has even spawned.

let mut pid = 0; for attempt in 1..=20 { pid = resolve_pid.load(Ordering::Acquire); if pid != 0 { break; } debug!(attempt, "Entrypoint PID not available yet, retrying symlink resolution"); tokio::time::sleep(Duration::from_millis(250)).await; } if pid == 0 { warn!("Entrypoint PID not available; binary symlink resolution skipped"); return; } let probe_path = format!("/proc/{pid}/root/"); for attempt in 1..=10 { tokio::time::sleep(Duration::from_millis(500)).await; if std::fs::metadata(&probe_path).is_ok() { // existing reload_from_proto_with_pid(&resolve_proto, pid) block return; } }

drew · 2026-06-02T06:14:02Z

 name = "openshell-sandbox"
 path = "src/main.rs"

 [dependencies]


Now that openshell-sandbox is acting as the orchestrator over the process/network leaf crates, this manifest still appears to carry a lot of leaf-only dependencies from the old monolith (regorus, russh, rcgen, tokio-rustls, ipnet, apollo-parser, openshell-router, etc.).

Can we prune the direct deps here to only what the orchestrator actually uses?

drew · 2026-06-02T06:21:06Z

 pub mod metadata;
 pub mod net;
+#[cfg(target_os = "linux")]
+pub mod netns;


netns/nft_ruleset probably don't belong in openshell-core. These modules are privileged Linux supervisor runtime implementation: they create namespaces, manage veths, invoke ip/nsenter/nft, install bypass rules, and emit sandbox OCSF events. That makes core own sandbox enforcement machinery, not just shared types/config.

Can we keep this out of core? The process leaf appears to only need the namespace fd for setns, so one option is for the network/orchestrator side to own the NetworkNamespace RAII handle and pass an Option<i32> fd into run_process. If more sharing is needed, a small supervisor-runtime/netns crate would preserve the no process <-> network dependency rule without expanding openshell-core into privileged runtime code.

Absolutely, I've generally tried to keep the number of additions to openshell-core as minimal as possible, but this specific module was hard to position correctly, since it's shared by both crates and I didn't want to create a 3rd crate since it would have put even more strain on the PR.

Thank you for the solution with passing the fd! I hadn't seen that this simplification was possible. I was generally trying not to introduce API changes to existing functions, structs, etc., again to simplify the PR.

But now looking at this with a fresh pair of eyes, I would even take this a step further and argue that all network isolation and nftables logic should be moved fully to the supervisor-process crate instead. My logic being that this also would classify as "process isolation", even tough it's "networking related".
This is from the perspective that in my view, if the network (previously proxy) part would be one day deployable by itself, it would not be responsible for these things. Rather the supervisor-process would have that responsibility.
I would even apply this logic to the bypass monitor!

We can leave this as a discussion for the future, if you would prefer that.

I think this boils down to strictly identifying/defining the set of responsibilities of both supervisor-network and supervisor-process.
In my opinion it should be:

supervisor-network: roughly credential injection, policy enforcement, TLS interception, HTTP CONNECT proxy

supervisor-process: roughly everything else (all process isolation techniques, bypass monitoring, ssh server)

That split works for me 🚢

drew · 2026-06-02T06:27:58Z

/ok to test 8526c54

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

The procfs/bypass_monitor/proxy test modules use libc::{fork, exec, fcntl, kill, waitpid} but the dep wasn't declared in this crate's Cargo.toml. It was previously satisfied transitively when these modules lived in openshell-core; the move left the test target unable to resolve libc. Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

drew · 2026-06-05T03:38:23Z

/ok to test 7a3fdd7

The denial aggregator and mechanistic mapper consume denial events produced by the proxy and (subsequently) the bypass monitor. With both supervisor leaves becoming pure producers of `DenialEvent`, the consumer-side aggregation belongs in the orchestrator, not in either leaf. Move `denial_aggregator.rs` and `mechanistic_mapper.rs` from `openshell-supervisor-network` to `openshell-sandbox` (the orchestrator). The orchestrator now owns the unbounded denial channel: it constructs `(tx, rx)`, hands `tx` to `run_networking` for the proxy to clone, drains `rx` via the aggregator task, and runs the gateway flush helper itself. `run_networking`'s signature gains a `denial_tx` parameter and loses its internal channel construction, aggregator spawn, and `flush_proposals_to_gateway` helper. `DenialEvent` stays in `openshell-supervisor-network` for now; a follow-up commit will lift it to `openshell-core` alongside the bypass monitor relocation.

`bypass_monitor` is process-isolation machinery: it tails the kernel log via `dmesg --follow`, parses nftables LOG lines emitted from the workload's network namespace, resolves PIDs via `/proc`, and emits OCSF events plus optional `DenialEvent`s. None of this touches the proxy, OPA, TLS, or any other supervisor-network state — it only shared the denial channel because both feed the same aggregator. Move `bypass_monitor.rs` from `openshell-supervisor-network` to `openshell-supervisor-process` (as `bypass_monitor/mod.rs`). Spawn it in `run_process` where the netns name and entrypoint PID are already in scope. The orchestrator hands an extra `bypass_denial_tx` clone of the denial channel sender to `run_process` for this purpose. Lift `DenialEvent` from `openshell-supervisor-network` to `openshell-core`. Both supervisor leaves now produce it, so it needs a shared location that neither leaf depends on. This reverses an earlier commit that pulled the type into the network leaf when it was the only producer. Copy the minimal subset of `/proc` parsers used by `bypass_monitor` into a private `bypass_monitor::procfs` submodule. The alternative — extracting a shared procfs crate — is a much larger refactor that this commit does not need; supervisor-network's `procfs.rs` continues to serve the proxy and identity cache.

The ssh_netns_fd was computed in run_networking purely to forward it through the Networking struct and back into run_process. supervisor-network never read it. Move the derivation to run_process where the NetworkNamespace handle is already in scope.

The ssh_proxy_url was computed in run_networking purely to forward it through the Networking struct and back into run_process. supervisor-network never read it. Move the derivation to run_process where the NetworkNamespace handle and SandboxPolicy are already in scope. After this commit the Networking struct no longer carries any SSH-shaped fields, and supervisor-network reads only host_ip from the netns (for the proxy bind address).

…netns run_networking only ever read host_ip from the netns it was passed (the SSH plumbing reads moved to run_process in earlier commits). Replace the NetworkNamespace parameter with a plain Option<IpAddr> the orchestrator extracts. supervisor-network's run module no longer references the netns type for any consumer, only for create_netns_for_proxy (which still lives in this crate; relocates next).

Relocates the NetworkNamespace handle, nft ruleset builder, and create_netns_for_proxy constructor into openshell-supervisor-process. The orchestrator (openshell-sandbox) phantom-owns the RAII handle for the duration of run_sandbox; supervisor-network no longer references the type at all. Drops uuid, libc, nix, openshell-ocsf, and tempfile from core's Linux target deps (all were exclusive to netns). tempfile becomes a Linux runtime dep on supervisor-process for nft ruleset application.

cargo-machete flagged 26 direct dependencies that were carried over from the pre-split monolith and are no longer used by the orchestrator itself: regorus, russh, rcgen, tokio-rustls, ipnet, apollo-parser, openshell-router, anyhow, base64, bytes, flate2, glob, hex, hmac, nix, rand_core, rustls-pemfile, serde, serde_yml, sha1, sha2, thiserror, tokio-stream, uuid, webpki-roots. These now live (transitively) in openshell-supervisor-network and openshell-supervisor-process where they are actually consumed.

- Drop unused `url` from openshell-supervisor-network. - Mark `prost` and `prost-types` as cargo-machete-ignored in openshell-core: they have no source-level `use`, but the tonic- generated proto code references them via `::prost::Message` etc. - openshell-supervisor-process is already clean.

The OPA symlink-resolution task reads entrypoint_pid once at the top of the spawned closure. Because the spawn happens before run_process publishes the workload PID, the load returns 0, the probe path bakes in as /proc/0/root/, and the loop exhausts its retries against a path that does not exist on Linux. The reload never fires, so policies that whitelist symlinked binaries (e.g. /usr/bin/python3 → python3.11) get silent denials when the workload exec's the realpath. Split the wait into two phases: 5s polling entrypoint_pid for a non-zero value, then the existing 5s window probing /proc/<pid>/root/. Distinct warn messages on each timeout so future debugging can tell "PID never published" apart from "container fs never appeared".

rrhubenov · 2026-06-05T18:30:55Z

@drew Thank you for the review, I've applied changes based on your comments.

Notable new things are:

Denial aggregator is now part of the orchestrator (openshell-sandbox) rather than the
openshell-supervisor-process crate.

Moving the bypass monitor to the process leaf crate caused the need for both leaf crates
to send denial events to the aggregator. This also lead to them both having to depend
on the DenialEvent type, consequently it had to be moved to the core crate.

I think having the aggregator in the orchestrator is actually a pretty natural fit.

Bypass monitor moved to process leaf crate as per our discussion.

Only nastier thing here is that a subset of procfs had to be copied into the process
leaf crate, since the network crate is also using it. The same logic that applied to
netns applies to procfs, so it did not fit well in openshell-core. Alternatively a
separate crate could be created for procfs, but I opted out of it since I did not want
to strain the PR. I think a future cleanup could handle this, or I can create the new
crate in this PR, if preferred.

There was some leftover SSH code in the network crate.

This acted as both a cleanup and a prerequisite to moving netns into the process crate.

netns was moved from core to process leaf crate.

This was easy to do after the previous steps, however one notable thing is that the
NetworkNamespace struct had to be created in the orchestrator and passed downwards to
the run_process func. This was due to the orchestrator having a longer lifetime than the
run_process & run_network funcs, so it was a more natural fit for it to hold the RAII
handle.

There's probably a bunch of nice followup refactors and designs that can be applied, but
I'm not sure this PR is correct for that.

Applied your suggested fix for the PID of the sandboxed process.
Nasty merge conflict had to be fixed.

Opted for a merge commit since rebasing would have taken ages and would have been error
prone. Most interesting thing there is the movement of the activity-aggregator component
and thread. It very much follows the design of the denial-aggregator (shared types in
core, thread spawned by orchestrator, transfer channels passed downstream to the leaf
crates).

drew · 2026-06-05T18:36:29Z

@rrhubenov thanks for addressing the comments. Do you mind creating a followup ticket just so we don't lose track of the remaining cleanups? Otherwise this looks good to me and seems like a good start towards getting the supervisor running outside the agent container.

I'll let @derekwaynecarr give the final 👍.

drew · 2026-06-05T18:36:40Z

/ok to test ed12505

Reconciles main's anonymous activity-aggregator feature with the supervisor split topology. ActivityEvent / ActivitySender are placed in openshell-core::activity (mirroring the DenialEvent producer/consumer pattern); the orchestrator owns the channel and the aggregator, while both supervisor leaves (proxy in supervisor-network, bypass monitor in supervisor-process) emit as pure producers via cloned senders.

rrhubenov added 30 commits June 1, 2026 11:30

style: rustfmt import ordering

b9dc830

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

drew added the test:e2e Requires end-to-end coverage label Jun 1, 2026

drew reviewed Jun 2, 2026

View reviewed changes

rrhubenov added 2 commits June 2, 2026 10:14

style(supervisor-network): run cargo fmt

bbefa45

Signed-off-by: Radoslav Hubenov <rrhubenov@gmail.com>

rrhubenov force-pushed the refactor/split-sandbox branch from 8526c54 to 7a3fdd7 Compare June 2, 2026 13:00

jganoff mentioned this pull request Jun 4, 2026

feat: establish the Isolation Backend interface #1737

Open

2 tasks

drew mentioned this pull request Jun 5, 2026

feat(sandbox): pluggable handler for bootstrap-subsystem failures #1549

Open

7 tasks

kon-angelo mentioned this pull request Jun 5, 2026

Kubernetes Operator #1719

Open

rrhubenov added 9 commits June 5, 2026 15:21

rrhubenov force-pushed the refactor/split-sandbox branch from ed12505 to b9a6a7b Compare June 5, 2026 20:49

Conversation

rrhubenov commented Jun 1, 2026

Summary

Review suggestions:

Related Issue

Changes

Testing

Checklist

Uh oh!

derekwaynecarr commented Jun 1, 2026

Uh oh!

copy-pr-bot Bot commented Jun 1, 2026

Uh oh!

derekwaynecarr commented Jun 1, 2026

Uh oh!

copy-pr-bot Bot commented Jun 1, 2026

Uh oh!

TaylorMutch commented Jun 1, 2026

Uh oh!

github-actions Bot commented Jun 1, 2026

Uh oh!

drew left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

drew Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

drew Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

drew Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

rrhubenov Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rrhubenov Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

drew Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

drew commented Jun 2, 2026

Uh oh!

drew commented Jun 5, 2026

Uh oh!

rrhubenov commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

drew commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

drew commented Jun 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

drew left a comment •

edited

Loading

rrhubenov Jun 3, 2026 •

edited

Loading

rrhubenov commented Jun 5, 2026 •

edited

Loading

drew commented Jun 5, 2026 •

edited

Loading