agent sandboxes are idle most of the time, waiting on tool calls, model responses, or humans. right now the `ComputeDriver` lifecycle is create/stop/delete, so an idle sandbox either holds on to its resources or gets torn down and cold-restarted (libkrun first boot is ~10-30s). what i'd want is suspend/resume: checkpoint a sandbox's memory + fs, free the resources, and restore it in well under a second on the next request, i.e. the same idle economics Lambda gets from Firecracker snapshots.
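a minimal sketch of how suspend/resume could sit next to the existing create/stop/delete lifecycle, assuming snapshot only applies to a running sandbox and restore only to a suspended one. `InMemoryDriver`, `SandboxState`, `TRANSITIONS`, and the `checkpoints/<id>` location are hypothetical names for illustration, not OpenShell's `ComputeDriver` API, and no real VM is touched. what it shows: a SUSPENDED sandbox keeps its identity and checkpoint but reserves no host compute.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional


class SandboxState(Enum):
    RUNNING = "running"
    SUSPENDED = "suspended"  # memory + fs checkpointed, host compute released
    STOPPED = "stopped"
    DELETED = "deleted"


# Legal transitions: today's stop/delete plus the proposed
# snapshot (RUNNING -> SUSPENDED) and restore (SUSPENDED -> RUNNING).
TRANSITIONS = {
    ("snapshot", SandboxState.RUNNING): SandboxState.SUSPENDED,
    ("restore", SandboxState.SUSPENDED): SandboxState.RUNNING,
    ("stop", SandboxState.RUNNING): SandboxState.STOPPED,
    ("delete", SandboxState.RUNNING): SandboxState.DELETED,
    ("delete", SandboxState.STOPPED): SandboxState.DELETED,
    ("delete", SandboxState.SUSPENDED): SandboxState.DELETED,
}


@dataclass
class Sandbox:
    sandbox_id: str
    state: SandboxState = SandboxState.RUNNING
    holds_compute: bool = True  # vCPUs + guest RAM reserved on the host
    checkpoint_ref: Optional[str] = None  # where memory + fs state was written


class InMemoryDriver:
    """Hypothetical driver for illustration only; no real VM is involved."""

    def __init__(self) -> None:
        self.sandboxes: Dict[str, Sandbox] = {}

    def _apply(self, op: str, sandbox_id: str) -> Sandbox:
        sb = self.sandboxes[sandbox_id]
        next_state = TRANSITIONS.get((op, sb.state))
        if next_state is None:
            # state is left unchanged on an illegal transition
            raise ValueError(f"{op} not allowed from {sb.state.value}")
        sb.state = next_state
        return sb

    def create(self, sandbox_id: str) -> Sandbox:
        sb = Sandbox(sandbox_id)
        self.sandboxes[sandbox_id] = sb
        return sb

    def snapshot(self, sandbox_id: str) -> Sandbox:
        sb = self._apply("snapshot", sandbox_id)
        sb.checkpoint_ref = f"checkpoints/{sandbox_id}"  # placeholder location
        sb.holds_compute = False  # the whole point: free host resources while idle
        return sb

    def restore(self, sandbox_id: str) -> Sandbox:
        sb = self._apply("restore", sandbox_id)
        sb.holds_compute = True  # models re-acquiring resources from the checkpoint
        return sb

    def stop(self, sandbox_id: str) -> Sandbox:
        sb = self._apply("stop", sandbox_id)
        sb.holds_compute = False
        return sb

    def delete(self, sandbox_id: str) -> Sandbox:
        sb = self._apply("delete", sandbox_id)
        sb.holds_compute = False
        sb.checkpoint_ref = None  # a real driver would also remove the checkpoint files
        return sb
```

e.g. `create("sb-1")`, then `snapshot("sb-1")` when the agent goes idle and `restore("sb-1")` on the next request. calling `restore` twice raises `ValueError`, since the second call starts from RUNNING.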
a few questions:

- there's already a `VmBackend { Libkrun, Qemu }` split and the QEMU backend from #992 ("Adding qemu vm driver support with GPU pass-through"), plus the closed Cloud Hypervisor attempt in #851 ("Vcauxbrisebo/vm gpu support"), so the backend abstraction already exists. would adding `Snapshot`/`Restore` RPCs to `ComputeDriver` (alongside create/stop/delete) be welcome?
- preferred path: snapshot on the QEMU backend (`savevm`/migrate-to-file), or a dedicated Firecracker / Cloud Hypervisor backend? both have first-class snapshot/restore.
- or is this intentionally deferred behind the multi-tenant milestone?

context: evaluating OpenShell as the sandbox runtime for an agent-compute substrate where suspend/resume + KV-cache-aware routing/offloading are the core economics. happy to help scope or send a patch if there's appetite.
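for the "preferred path" question, a rough sketch of the call sequence a `Snapshot`/`Restore` pair would have to drive on each candidate backend. it only builds the requests; no sockets or processes are opened. the QMP commands (`stop`, `migrate` with an `exec:` URI, `query-migrate`, `cont`, relaunch with `-S -incoming`), the Firecracker endpoints (`PATCH /vm`, `PUT /snapshot/create`, `PUT /snapshot/load`) and the Cloud Hypervisor endpoints (`vm.pause`, `vm.snapshot`, `vm.restore`, `vm.resume`) follow their upstream docs as best understood and should be checked against the versions actually pinned. `suspend_steps`, `resume_steps`, the step tuple, and the file names under `state_dir` are hypothetical. it takes the migrate-to-file route rather than `savevm`, which needs qcow2 images for internal snapshots. caveat: the QEMU migration stream carries RAM + device state only, so the guest disk image has to be kept (or snapshotted) alongside it to get a full memory + fs checkpoint.

```python
from pathlib import PurePosixPath
from typing import List, Optional, Tuple

# One step = (channel, target, JSON body or None). Hypothetical shape:
#   "qmp":     QMP command name on QEMU's monitor socket
#   "http":    "METHOD /path" on the Firecracker / Cloud Hypervisor API socket
#   "process": an action on the VMM process itself
Step = Tuple[str, str, Optional[dict]]


def _unsupported(backend: str) -> ValueError:
    return ValueError(f"no snapshot path sketched for backend {backend!r}")


def suspend_steps(backend: str, state_dir: str) -> List[Step]:
    d = PurePosixPath(state_dir)
    if backend == "qemu":
        return [
            ("qmp", "qmp_capabilities", None),  # mandatory QMP handshake
            ("qmp", "stop", None),  # pause vCPUs so guest RAM stops changing
            ("qmp", "migrate", {"uri": f"exec:cat > {d / 'vmstate'}"}),
            ("qmp", "query-migrate", None),  # poll until status == "completed"
            ("qmp", "quit", None),  # exit QEMU, releasing host CPU/RAM
        ]
    if backend == "firecracker":
        return [
            ("http", "PATCH /vm", {"state": "Paused"}),
            ("http", "PUT /snapshot/create", {
                "snapshot_type": "Full",
                "snapshot_path": str(d / "vm.snap"),  # VM/device state
                "mem_file_path": str(d / "vm.mem"),  # guest memory
            }),
            ("process", "kill", None),  # free the microVM's resources
        ]
    if backend == "cloud-hypervisor":
        return [
            ("http", "PUT /api/v1/vm.pause", None),
            ("http", "PUT /api/v1/vm.snapshot", {"destination_url": f"file://{d}"}),
            ("http", "PUT /api/v1/vmm.shutdown", None),
        ]
    raise _unsupported(backend)


def resume_steps(backend: str, state_dir: str) -> List[Step]:
    d = PurePosixPath(state_dir)
    if backend == "qemu":
        return [
            # relaunch with the SAME device config plus these flags; -S keeps
            # the guest paused after the incoming migration finishes
            ("process", "launch", {"extra_args": ["-S", "-incoming", f"exec:cat {d / 'vmstate'}"]}),
            ("qmp", "qmp_capabilities", None),
            ("qmp", "query-migrate", None),  # wait for "completed"
            ("qmp", "cont", None),  # resume vCPUs
        ]
    if backend == "firecracker":
        return [
            ("process", "launch", None),  # fresh, unconfigured VMM process
            ("http", "PUT /snapshot/load", {
                "snapshot_path": str(d / "vm.snap"),
                "mem_backend": {"backend_type": "File", "backend_path": str(d / "vm.mem")},
                "resume_vm": True,
            }),
        ]
    if backend == "cloud-hypervisor":
        return [
            ("process", "launch", None),  # VMM started with an API socket, no VM yet
            ("http", "PUT /api/v1/vm.restore", {"source_url": f"file://{d}"}),
            ("http", "PUT /api/v1/vm.resume", None),
        ]
    raise _unsupported(backend)
```

the main design difference this surfaces: the QEMU path restores by relaunching with an identical device configuration and replaying the saved stream, while Firecracker and Cloud Hypervisor load the snapshot into a freshly started VMM process. libkrun is deliberately left out and raises `ValueError`.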