Problem
Sandbox state is spread across four places in `openshell-server`, each authoritative for a different aspect, none aware of the others:

| Concern | Owner | Source of truth |
| --- | --- | --- |
| Intent (spec + desired phase) | `Sandbox` proto in `Store` | DB |
| Observed pod state | `openshell-driver-kubernetes` → `compute::reconcile_*` | K8s, reflected into DB |
| Control-plane liveness | `SupervisorSessionRegistry` (in-memory) | Registry |
| Ephemeral per-request state | `pending_relays`, exec tasks | Registry (transient) |
The fragmentation causes concrete symptoms today:
- `cleanup_sandbox_state` in `compute/mod.rs:475` tears down the tracing/watch buses but not the supervisor-session registry. Registry entries are eventually cleaned up when the gRPC stream dies, but nothing guarantees the ordering; when a sandbox is deleted externally (`kubectl`, driver reconcile), the registry lags behind.
- `GetSandbox` returns `phase=Ready` even when the supervisor is disconnected. The only way to find out is to attempt an `exec` and receive `Unavailable`. Operators have no way to see "pod is up but supervisor session is gone".
- `WatchSandbox` streams pod-level phase transitions but silently drops supervisor-session connect / disconnect events. Agents watching for readiness can't tell supervisor liveness apart from pod phase.
- Multi-gateway (OS-88 onwards) needs a lease directory. Adding it as a fifth disconnected subsystem would make this worse.
Proposal
Introduce a `SandboxLifecycle` façade inside `openshell-server` that owns all four concerns. Callers — `handle_get_sandbox`, `handle_watch_sandbox`, `cleanup_sandbox_state`, `handle_connect_supervisor`, the relay handlers, and the compute reconciler — stop talking directly to `Store`, `SupervisorSessionRegistry`, and the compute driver. They go through the façade.
The façade's surface:
```rust
pub struct SandboxLifecycle { /* holds store, registry, driver handles */ }

impl SandboxLifecycle {
    // Reads
    pub async fn get_state(&self, sandbox_id: &str) -> FullState;
    pub fn subscribe(&self, sandbox_id: &str) -> EventStream;

    // Writes — driver observations
    pub async fn apply_driver_update(&self, snapshot: DriverSandbox);
    pub async fn apply_driver_delete(&self, sandbox_id: &str);

    // Writes — supervisor session
    pub fn register_session(&self, sandbox_id: &str, session_id: &str, tx: Sender);
    pub fn revoke_session(&self, sandbox_id: &str, session_id: &str);

    // Single cleanup entry point
    pub async fn forget(&self, sandbox_id: &str);
}

pub struct FullState {
    pub sandbox: Sandbox,               // from DB
    pub session: Option<SessionStatus>, // from registry
    pub lease: Option<LeaseInfo>,       // from lease store
    pub last_driver_snapshot_ms: i64,
}
```
The façade cleanly encodes rules the current split cannot:

- "if the driver says the pod is gone, revoke the lease and close the session" — today these are three uncoordinated code paths.
- "session liveness is part of the observable state" — today it isn't.
- "there is exactly one place that knows whether a sandbox exists" — today there are three.
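The first rule can be sketched as a single code path. This is a toy in-memory model, not the real types: the `leases` and `sessions` maps stand in for the lease store and `SupervisorSessionRegistry`:

```rust
use std::collections::HashMap;

// Minimal in-memory stand-ins for the real lease store and session registry.
#[derive(Default)]
pub struct SandboxLifecycle {
    pub leases: HashMap<String, String>,   // sandbox_id -> owner_gateway_id
    pub sessions: HashMap<String, String>, // sandbox_id -> session_id
}

impl SandboxLifecycle {
    /// "Pod is gone" arrives from the driver: one code path revokes the
    /// lease and closes the session, so the two can never drift apart.
    pub fn apply_driver_delete(&mut self, sandbox_id: &str) {
        self.leases.remove(sandbox_id);   // revoke the lease
        self.sessions.remove(sandbox_id); // close the supervisor session
        // ...tracing/watch bus teardown would follow here (elided)
    }
}
```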
Scope (phased)
Phase 1: internal refactor, no behaviour change
- Introduce `SandboxLifecycle` wrapping the existing `Store` + `SupervisorSessionRegistry` + compute references.
- Route `cleanup_sandbox_state`, `handle_connect_supervisor`'s `remove_if_current`, and the compute reconciler's update/delete paths through it.
- Smallest shape change: `cleanup_sandbox_state` becomes `lifecycle.forget(sandbox_id)` and also cleans the registry.
- Outcome: the registry/reconciler coupling is fixed; no user-visible change.
Phase 2: lease store (multi-gateway prerequisite)
- Add a `supervisor_lease` table in the DB (or equivalent on the `Sandbox` row): `(sandbox_id, owner_gateway_id, session_id, lease_expires_at)`.
- `SandboxLifecycle::register_session` writes the lease; `revoke_session` and `forget` delete it.
- Heartbeat tick refreshes `lease_expires_at`.
- Crash recovery: a stale lease with an expired TTL is reclaimable by the next connecting supervisor.
- Abstract behind a `LeaseStore` trait; DB is the first backend. Swap for etcd later if needed.
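One possible shape for the `LeaseStore` trait and its reclaim rule. The method names and the in-memory backend are illustrative assumptions, not the final API; a DB backend would implement the same trait with conditional UPDATEs:

```rust
use std::collections::HashMap;

// Illustrative lease record matching the proposed table columns.
#[derive(Clone, Debug, PartialEq)]
pub struct Lease {
    pub sandbox_id: String,
    pub owner_gateway_id: String,
    pub session_id: String,
    pub lease_expires_at_ms: i64,
}

// DB is the first backend; etcd could implement the same trait later.
pub trait LeaseStore {
    /// Acquire the lease if it is free or its TTL has expired.
    fn try_acquire(&mut self, lease: Lease, now_ms: i64) -> bool;
    /// Heartbeat: push the expiry forward for the current owner.
    fn refresh(&mut self, sandbox_id: &str, owner: &str, expires_at_ms: i64);
    /// Called from revoke_session / forget.
    fn release(&mut self, sandbox_id: &str);
    fn get(&self, sandbox_id: &str) -> Option<Lease>;
}

// In-memory reference implementation of the reclaim rule.
#[derive(Default)]
pub struct MemLeaseStore {
    leases: HashMap<String, Lease>,
}

impl LeaseStore for MemLeaseStore {
    fn try_acquire(&mut self, lease: Lease, now_ms: i64) -> bool {
        match self.leases.get(&lease.sandbox_id) {
            // A live lease blocks the newcomer; an expired one is reclaimable.
            Some(existing) if existing.lease_expires_at_ms > now_ms => false,
            _ => {
                self.leases.insert(lease.sandbox_id.clone(), lease);
                true
            }
        }
    }
    fn refresh(&mut self, sandbox_id: &str, owner: &str, expires_at_ms: i64) {
        if let Some(l) = self.leases.get_mut(sandbox_id) {
            if l.owner_gateway_id == owner {
                l.lease_expires_at_ms = expires_at_ms;
            }
        }
    }
    fn release(&mut self, sandbox_id: &str) {
        self.leases.remove(sandbox_id);
    }
    fn get(&self, sandbox_id: &str) -> Option<Lease> {
        self.leases.get(sandbox_id).cloned()
    }
}
```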
Phase 3: surface session state in the API
- Extend the `Sandbox` proto (or add a sibling `SandboxHealth`) with `session_connected`, `last_session_seen_ms`, `lease_owner`.
- `GetSandbox` populates these from the façade.
- `WatchSandbox` emits session connect / disconnect events alongside existing phase transitions.
- Agent tooling can now distinguish "pod ready" from "ready and reachable".
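A sketch of the handler-side mapping from the façade's state to the new fields. The field names come from the proposal; the surrounding structs are simplified stand-ins for the proto-generated types:

```rust
// Simplified stand-ins for the façade's FullState and the registry's session status.
pub struct SessionStatus {
    pub session_id: String,
    pub last_seen_ms: i64,
}

pub struct FullState {
    pub session: Option<SessionStatus>,
    pub lease_owner: Option<String>,
}

// Proposed additive fields, as they might appear on the response.
#[derive(Debug, PartialEq)]
pub struct SandboxHealth {
    pub session_connected: bool,
    pub last_session_seen_ms: i64,
    pub lease_owner: String,
}

fn health_from_state(state: &FullState) -> SandboxHealth {
    SandboxHealth {
        // "pod ready" vs "ready and reachable" is exactly this bit.
        session_connected: state.session.is_some(),
        last_session_seen_ms: state.session.as_ref().map_or(0, |s| s.last_seen_ms),
        lease_owner: state.lease_owner.clone().unwrap_or_default(),
    }
}
```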
Phase 4: multi-gateway forwarding (OS-88 territory)
- `SupervisorSessionRegistry::lookup_session` returns `SessionLocation { Local, Remote(gateway_id), Unknown }` by consulting the lease store on a miss.
- `open_relay` branches: local → as today; remote → forward the relay RPC to the owning gateway, or return a gateway-hint error for the CLI to retry.
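The lookup branch might reduce to something like this. The `SessionLocation` variants come from the proposal; the registry shape below is a simplified assumption (a set of locally connected sandboxes plus a view of lease owners):

```rust
use std::collections::{HashMap, HashSet};

#[derive(Debug, PartialEq)]
pub enum SessionLocation {
    Local,
    Remote(String), // owning gateway_id
    Unknown,
}

// Simplified registry: local sessions plus a lease-store view.
pub struct Registry {
    pub local_sessions: HashSet<String>,   // sandbox_ids connected to this gateway
    pub lease_owners: HashMap<String, String>, // sandbox_id -> owning gateway_id
    pub self_gateway_id: String,
}

impl Registry {
    /// A local hit wins; on a miss, consult the lease store for the owner.
    pub fn lookup_session(&self, sandbox_id: &str) -> SessionLocation {
        if self.local_sessions.contains(sandbox_id) {
            return SessionLocation::Local;
        }
        match self.lease_owners.get(sandbox_id) {
            Some(owner) if *owner != self.self_gateway_id => {
                SessionLocation::Remote(owner.clone())
            }
            // Our own stale lease, or no lease at all: we don't know.
            _ => SessionLocation::Unknown,
        }
    }
}
```

`open_relay` would then match on the result: `Local` proceeds as today, `Remote` forwards or returns the gateway hint, `Unknown` surfaces an error.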
Phases 1–3 can ship before multi-gateway is a feature; they are valuable on their own. Phase 4 is multi-gateway-specific and is the natural continuation.
Non-goals
- High-frequency session telemetry in the DB (bytes, packets, per-connection metrics). That belongs in Prometheus.
- A replacement for the compute driver trait. This façade consumes driver observations; it does not replace the driver abstraction.
- Breaking the existing `Sandbox` proto schema. New fields are additive.
Links
- Parent: OS-31
- Predecessor: OS-86 (SSH/exec over supervisor sessions)
- Related: OS-88 (expose services through the gateway), OS-83 (event-based streaming over supervisor sessions)
- Related PR: #867 (gRPC RelayStream data plane — where the registry abstraction lives today)
- Related follow-up: OS-91 (Direct `SupervisorExec` RPC)