Unify sandbox state management behind a SandboxLifecycle facade

## Problem

Sandbox state is spread across four places in `openshell-server`, each authoritative for a different aspect, none aware of the others:


| Concern | Owner | Source of truth |
| -- | -- | -- |
| Intent (spec + desired phase) | `Sandbox` proto in `Store` | DB |
| Observed pod state | `openshell-driver-kubernetes` → `compute::reconcile_*` | K8s, reflected into DB |
| Control-plane liveness | `SupervisorSessionRegistry` (in-memory) | Registry |
| Ephemeral per-request state | `pending_relays`, exec tasks | Registry (transient) |

The fragmentation causes concrete symptoms today:

* `cleanup_sandbox_state` in `compute/mod.rs:475` tears down the tracing/watch buses but not the supervisor-session registry.<br>Registry entries are eventually cleaned up when the gRPC stream dies, but nothing guarantees the ordering. When a sandbox is deleted externally (kubectl, driver reconcile), the registry is lagging.
* `GetSandbox` returns `phase=Ready` even when the supervisor is disconnected. The only way to find out is to attempt an `exec` and receive `Unavailable`. Operators have no way to see "pod is up but supervisor session is gone".
* `WatchSandbox` streams pod-level phase transitions but silently drops supervisor-session connect / disconnect events.<br>Agents watching for readiness can't tell liveness apart from pod-phase.
* **Multi-gateway (**[OS-88](https://linear.app/nvidia/issue/OS-88/expose-services-through-the-gateway) **onwards)** needs a lease directory. Adding it as a fifth disconnected subsystem would make this worse.

## Proposal

Introduce a `SandboxLifecycle` façade inside `openshell-server` that owns all four concerns. Callers — `handle_get_sandbox`, `handle_watch_sandbox`, `cleanup_sandbox_state`, `handle_connect_supervisor`, the relay handlers, and the compute reconciler — stop talking directly to `Store`, `SupervisorSessionRegistry`, and the compute driver. They go through the façade.

The façade's surface:

```rust
pub struct SandboxLifecycle { /* holds store, registry, driver handles */ }

impl SandboxLifecycle {
    // Reads
    pub async fn get_state(&self, sandbox_id: &str) -> FullState;
    pub fn subscribe(&self, sandbox_id: &str) -> EventStream;

    // Writes — driver observations
    pub async fn apply_driver_update(&self, snapshot: DriverSandbox);
    pub async fn apply_driver_delete(&self, sandbox_id: &str);

    // Writes — supervisor session
    pub fn register_session(&self, sandbox_id: &str, session_id: &str, tx: Sender);
    pub fn revoke_session(&self, sandbox_id: &str, session_id: &str);

    // Single cleanup entry point
    pub async fn forget(&self, sandbox_id: &str);
}

pub struct FullState {
    pub sandbox: Sandbox,                   // from DB
    pub session: Option<SessionStatus>,     // from registry
    pub lease: Option<LeaseInfo>,           // from lease store
    pub last_driver_snapshot_ms: i64,
}
```

The façade encodes rules the current split cannot, cleanly:

* "if the driver says the pod is gone, revoke the lease and close the session" - today these are three uncoordinated code paths.
* "session liveness is part of the observable state" — today it isn't.
* "there is exactly one place that knows whether a sandbox exists" - today there are three.

## Scope (phased)

### Phase 1: internal refactor, no behaviour change

* Introduce `SandboxLifecycle` wrapping the existing `Store` + `SupervisorSessionRegistry` + `compute` references.
* Route `cleanup_sandbox_state`, `handle_connect_supervisor`'s `remove_if_current`, and the compute reconciler's update/delete paths through it.
* Smallest shape change: `cleanup_sandbox_state` becomes `lifecycle.forget(sandbox_id)` and also cleans the registry.
* Outcome: the registry/reconciler coupling is fixed; no user-visible change.

### Phase 2: lease store (multi-gateway prerequisite)

* Add a `supervisor_lease` table in the DB (or equivalent on the `Sandbox` row): `(sandbox_id, owner_gateway_id, session_id, lease_expires_at)`.
* `SandboxLifecycle::register_session` writes the lease.
* `revoke_session` and `forget` delete it.
* Heartbeat tick refreshes `lease_expires_at`.
* Crash recovery: a stale lease with expired TTL is reclaimable by the next connecting supervisor.
* Abstract behind a `LeaseStore` trait; DB is the first backend. Swap for etcd later if needed.

### Phase 3: surface session state in the API

* Extend `Sandbox` proto (or add a sibling `SandboxHealth`) with `session_connected`, `last_session_seen_ms`, `lease_owner`.
* `GetSandbox` populates these from the façade.
* `WatchSandbox` emits session connect / disconnect events alongside existing phase transitions.
* Agent tooling can now distinguish "pod ready" from "ready and reachable".

### Phase 4: multi-gateway forwarding ([OS-88](https://linear.app/nvidia/issue/OS-88/expose-services-through-the-gateway) territory)

* `SupervisorSessionRegistry::lookup_session` returns `SessionLocation { Local, Remote(gateway_id), Unknown }` by consulting the lease store on miss.
* `open_relay` branches: local → as today; remote → forward the relay RPC to the owning gateway or return a gateway-hint error for the CLI to retry.

Phases 1–3 can ship before multi-gateway is a feature; they are valuable on their own. Phase 4 is multi-gateway-specific and is the natural continuation.

## Non-goals

* High-frequency session telemetry in the DB (bytes, packets, per-connection metrics). That belongs in Prometheus.
* A replacement for the compute driver trait. This façade consumes driver observations; it does not replace the driver abstraction.
* Breaking the existing `Sandbox` proto schema. New fields are additive.

## Links

* Parent: [OS-31](https://linear.app/nvidia/issue/OS-31/openshell-long-running-agentservice-primitives)
* Predecessor: [OS-86](https://linear.app/nvidia/issue/OS-86/implement-sshexec-over-supervisor-sessions) (SSH/exec over supervisor sessions)
* Related: [OS-88](https://linear.app/nvidia/issue/OS-88/expose-services-through-the-gateway) (expose services through the gateway), [OS-83](https://linear.app/nvidia/issue/OS-83/switch-supervisor-gateway-communication-to-event-based-streaming-over) (event-based streaming over supervisor sessions)
* Related PR: #867 (gRPC RelayStream data plane — where the registry abstraction lives today)
* Related follow-up: [OS-91](https://linear.app/nvidia/issue/OS-91/direct-execsandbox-via-supervisorexec-rpc-bypass-ssh-for-non) (Direct `SupervisorExec` RPC)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unify sandbox state management behind a SandboxLifecycle facade #71

Problem

Proposal

Scope (phased)

Phase 1: internal refactor, no behaviour change

Phase 2: lease store (multi-gateway prerequisite)

Phase 3: surface session state in the API

Phase 4: multi-gateway forwarding (OS-88 territory)

Non-goals

Links

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Concern	Owner	Source of truth
Intent (spec + desired phase)	`Sandbox` proto in `Store`	DB
Observed pod state	`openshell-driver-kubernetes` → `compute::reconcile_*`	K8s, reflected into DB
Control-plane liveness	`SupervisorSessionRegistry` (in-memory)	Registry
Ephemeral per-request state	`pending_relays`, exec tasks	Registry (transient)

Unify sandbox state management behind a SandboxLifecycle facade #71

Description

Problem

Proposal

Scope (phased)

Phase 1: internal refactor, no behaviour change

Phase 2: lease store (multi-gateway prerequisite)

Phase 3: surface session state in the API

Phase 4: multi-gateway forwarding (OS-88 territory)

Non-goals

Links

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions