Skip to content

Unify sandbox state management behind a SandboxLifecycle facade #71

@linear

Description

@linear

Problem

Sandbox state is spread across four places in openshell-server, each authoritative for a different aspect, none aware of the others:

Concern Owner Source of truth
Intent (spec + desired phase) Sandbox proto in Store DB
Observed pod state openshell-driver-kubernetescompute::reconcile_* K8s, reflected into DB
Control-plane liveness SupervisorSessionRegistry (in-memory) Registry
Ephemeral per-request state pending_relays, exec tasks Registry (transient)

The fragmentation causes concrete symptoms today:

  • cleanup_sandbox_state in compute/mod.rs:475 tears down the tracing/watch buses but not the supervisor-session registry.
    Registry entries are eventually cleaned up when the gRPC stream dies, but nothing guarantees the ordering. When a sandbox is deleted externally (kubectl, driver reconcile), the registry is lagging.
  • GetSandbox returns phase=Ready even when the supervisor is disconnected. The only way to find out is to attempt an exec and receive Unavailable. Operators have no way to see "pod is up but supervisor session is gone".
  • WatchSandbox streams pod-level phase transitions but silently drops supervisor-session connect / disconnect events.
    Agents watching for readiness can't tell liveness apart from pod-phase.
  • Multi-gateway (OS-88 onwards) needs a lease directory. Adding it as a fifth disconnected subsystem would make this worse.

Proposal

Introduce a SandboxLifecycle façade inside openshell-server that owns all four concerns. Callers — handle_get_sandbox, handle_watch_sandbox, cleanup_sandbox_state, handle_connect_supervisor, the relay handlers, and the compute reconciler — stop talking directly to Store, SupervisorSessionRegistry, and the compute driver. They go through the façade.

The façade's surface:

pub struct SandboxLifecycle { /* holds store, registry, driver handles */ }

impl SandboxLifecycle {
    // Reads
    pub async fn get_state(&self, sandbox_id: &str) -> FullState;
    pub fn subscribe(&self, sandbox_id: &str) -> EventStream;

    // Writes — driver observations
    pub async fn apply_driver_update(&self, snapshot: DriverSandbox);
    pub async fn apply_driver_delete(&self, sandbox_id: &str);

    // Writes — supervisor session
    pub fn register_session(&self, sandbox_id: &str, session_id: &str, tx: Sender);
    pub fn revoke_session(&self, sandbox_id: &str, session_id: &str);

    // Single cleanup entry point
    pub async fn forget(&self, sandbox_id: &str);
}

pub struct FullState {
    pub sandbox: Sandbox,                   // from DB
    pub session: Option<SessionStatus>,     // from registry
    pub lease: Option<LeaseInfo>,           // from lease store
    pub last_driver_snapshot_ms: i64,
}

The façade encodes rules the current split cannot, cleanly:

  • "if the driver says the pod is gone, revoke the lease and close the session" - today these are three uncoordinated code paths.
  • "session liveness is part of the observable state" — today it isn't.
  • "there is exactly one place that knows whether a sandbox exists" - today there are three.

Scope (phased)

Phase 1: internal refactor, no behaviour change

  • Introduce SandboxLifecycle wrapping the existing Store + SupervisorSessionRegistry + compute references.
  • Route cleanup_sandbox_state, handle_connect_supervisor's remove_if_current, and the compute reconciler's update/delete paths through it.
  • Smallest shape change: cleanup_sandbox_state becomes lifecycle.forget(sandbox_id) and also cleans the registry.
  • Outcome: the registry/reconciler coupling is fixed; no user-visible change.

Phase 2: lease store (multi-gateway prerequisite)

  • Add a supervisor_lease table in the DB (or equivalent on the Sandbox row): (sandbox_id, owner_gateway_id, session_id, lease_expires_at).
  • SandboxLifecycle::register_session writes the lease.
  • revoke_session and forget delete it.
  • Heartbeat tick refreshes lease_expires_at.
  • Crash recovery: a stale lease with expired TTL is reclaimable by the next connecting supervisor.
  • Abstract behind a LeaseStore trait; DB is the first backend. Swap for etcd later if needed.

Phase 3: surface session state in the API

  • Extend Sandbox proto (or add a sibling SandboxHealth) with session_connected, last_session_seen_ms, lease_owner.
  • GetSandbox populates these from the façade.
  • WatchSandbox emits session connect / disconnect events alongside existing phase transitions.
  • Agent tooling can now distinguish "pod ready" from "ready and reachable".

Phase 4: multi-gateway forwarding (OS-88 territory)

  • SupervisorSessionRegistry::lookup_session returns SessionLocation { Local, Remote(gateway_id), Unknown } by consulting the lease store on miss.
  • open_relay branches: local → as today; remote → forward the relay RPC to the owning gateway or return a gateway-hint error for the CLI to retry.

Phases 1–3 can ship before multi-gateway is a feature; they are valuable on their own. Phase 4 is multi-gateway-specific and is the natural continuation.

Non-goals

  • High-frequency session telemetry in the DB (bytes, packets, per-connection metrics). That belongs in Prometheus.
  • A replacement for the compute driver trait. This façade consumes driver observations; it does not replace the driver abstraction.
  • Breaking the existing Sandbox proto schema. New fields are additive.

Links

  • Parent: OS-31
  • Predecessor: OS-86 (SSH/exec over supervisor sessions)
  • Related: OS-88 (expose services through the gateway), OS-83 (event-based streaming over supervisor sessions)
  • Related PR: #867 (gRPC RelayStream data plane — where the registry abstraction lives today)
  • Related follow-up: OS-91 (Direct SupervisorExec RPC)

Metadata

Metadata

Assignees

Labels

Type

No type

Projects

Status

Done

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions