feat(driver-vm): snapshot/restore (suspend/resume) for idle sandboxes #1551

@athreesh

agent sandboxes are idle most of the time, waiting on tool calls, model responses, or humans. right now the ComputeDriver lifecycle is create/stop/delete, so an idle sandbox either holds its resources or gets torn down and cold-restarted (libkrun first boot is ~10-30s). what i'd want is suspend/resume: checkpoint a sandbox's memory + fs, free the resources, and restore it in well under a second on the next request. that's the same idle economics Lambda gets from Firecracker snapshots.

a few questions:

  • is VM snapshot/restore on the roadmap at all? i see the VmBackend { Libkrun, Qemu } split and the QEMU backend from #992 ("Adding qemu vm driver support with GPU pass-through"), plus the closed Cloud Hypervisor (CH) attempt in #851 ("Vcauxbrisebo/vm gpu support"), so the backend abstraction already exists.
  • would adding Snapshot/Restore RPCs to ComputeDriver (alongside create/stop/delete) be welcome?
  • preferred path: snapshot on the QEMU backend (savevm/migrate-to-file), or a dedicated Firecracker / Cloud Hypervisor backend? both have first-class snapshot/restore.
  • or is this intentionally deferred behind the multi-tenant milestone?
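
to make the second question concrete, here's a rough sketch of the lifecycle i have in mind. everything in it is hypothetical: the trait shape, the `SandboxState` and `DriverError` types, the `snapshot`/`restore` method names, and the in-memory `MockDriver` are illustration only and not the actual ComputeDriver API. it only models the state transitions (running -> suspended -> running), not the memory/fs checkpointing itself.

```rust
use std::collections::HashMap;

/// Hypothetical lifecycle states; `Suspended` is the proposed addition.
#[derive(Debug, Clone, PartialEq, Eq)]
enum SandboxState {
    Running,
    Suspended { snapshot_id: String },
    Stopped,
}

#[derive(Debug, PartialEq, Eq)]
enum DriverError {
    NotFound,
    InvalidState,
}

/// Assumed shape of the driver interface: the existing create/stop/delete
/// plus the two proposed calls. Real signatures will differ.
trait ComputeDriver {
    fn create(&mut self, id: &str) -> Result<(), DriverError>;
    fn stop(&mut self, id: &str) -> Result<(), DriverError>;
    fn delete(&mut self, id: &str) -> Result<(), DriverError>;
    /// Checkpoint a running sandbox and release its resources.
    fn snapshot(&mut self, id: &str) -> Result<String, DriverError>;
    /// Bring a suspended sandbox back from the named snapshot.
    fn restore(&mut self, id: &str, snapshot_id: &str) -> Result<(), DriverError>;
}

/// In-memory stand-in that only tracks state transitions.
#[derive(Default)]
struct MockDriver {
    sandboxes: HashMap<String, SandboxState>,
    next_snapshot: u64,
}

impl ComputeDriver for MockDriver {
    fn create(&mut self, id: &str) -> Result<(), DriverError> {
        if self.sandboxes.contains_key(id) {
            return Err(DriverError::InvalidState);
        }
        self.sandboxes.insert(id.to_string(), SandboxState::Running);
        Ok(())
    }

    fn stop(&mut self, id: &str) -> Result<(), DriverError> {
        let state = self.sandboxes.get_mut(id).ok_or(DriverError::NotFound)?;
        *state = SandboxState::Stopped;
        Ok(())
    }

    fn delete(&mut self, id: &str) -> Result<(), DriverError> {
        self.sandboxes.remove(id).map(|_| ()).ok_or(DriverError::NotFound)
    }

    fn snapshot(&mut self, id: &str) -> Result<String, DriverError> {
        let state = self.sandboxes.get_mut(id).ok_or(DriverError::NotFound)?;
        // Only a running sandbox can be suspended.
        if *state != SandboxState::Running {
            return Err(DriverError::InvalidState);
        }
        self.next_snapshot += 1;
        let snapshot_id = format!("{}-snap-{}", id, self.next_snapshot);
        *state = SandboxState::Suspended { snapshot_id: snapshot_id.clone() };
        Ok(snapshot_id)
    }

    fn restore(&mut self, id: &str, snapshot_id: &str) -> Result<(), DriverError> {
        let state = self.sandboxes.get_mut(id).ok_or(DriverError::NotFound)?;
        // Only resume if the sandbox is suspended on exactly this snapshot.
        let held_matches = match &*state {
            SandboxState::Suspended { snapshot_id: held } => held.as_str() == snapshot_id,
            _ => false,
        };
        if !held_matches {
            return Err(DriverError::InvalidState);
        }
        *state = SandboxState::Running;
        Ok(())
    }
}
```

the idea is that a gateway would call `snapshot` once a sandbox has been idle past some threshold and `restore` on the next incoming request. whether snapshots should be addressed by an id like this or tied 1:1 to the sandbox is an open design question.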

context: i'm evaluating OpenShell as the sandbox runtime for an agent-compute substrate where suspend/resume + KV-cache-aware routing/offloading are the core economics. happy to help scope or send a patch if there's appetite.

Metadata

    Labels

    state:triage-needed (opened without agent diagnostics and needs triage)
