v0.5.0

Released by @fslongjin on 03 Jul 09:24 (commit 30b4e25)

2026.07.03 Release v0.5.0

CubeSandbox 0.5.0 introduces AutoPause/AutoResume, platform-level sandbox lifecycle automation that transparently suspends idle sandboxes and resumes them on demand when the next dataplane request arrives. The release also delivers native ARM64 (aarch64) support across the entire stack, from hypervisor to CI/CD; a TencentCloud Terraform cluster deployer for production-grade one-click deployment; and network security hardening through per-sandbox traffic access tokens, a fail-closed CubeEgress bootstrap, and policy-routing egress. Other highlights include a pure-Go native rootfs export pipeline that bypasses Docker, skopeo, and umoci entirely; a snapshot runtime locking refactor that eliminates deadlocks under high concurrency; uid/gid preservation fixes for non-root container images; E2B SDK alignment with complete filesystem and PTY APIs; and a one-click upgrade mode with three-way config merge. The release comprises 116 commits from 26 contributors.

🎯 Major Features

AutoPause / AutoResume: Sandbox Lifecycle Automation

Sandboxes in agent workflows spend most of their time idle: waiting for user input, callbacks, or the next RL rollout cycle. AutoPause/AutoResume lets the platform automatically suspend idle sandboxes and wake them when the next request arrives, releasing physical host resources while they sit idle. It is implemented as a platform-side, per-sandbox capability whose semantics align with the E2B lifecycle parameter.

  • AutoPause mechanism: A sweeper in the new cube-proxy-sidecar (under CubeProxy/sidecar/) tracks sandbox activity via last_active timestamps reported by CubeProxy's log_phase.lua. When idle >= timeout_seconds, the sidecar triggers a pause through CubeMaster → Cubelet, which snapshots the full VM state (memory + filesystem) to /data/cubelet/root/pausevm/<sandbox>, then shuts down the MicroVM. A configurable BootstrapWarmup window prevents premature pausing during sidecar startup.
  • AutoResume mechanism: When a dataplane request arrives for a paused sandbox, CubeProxy's sandbox_state.lua gate intercepts it and fires an internal sub-request to the sidecar's /internal/resume endpoint. The sidecar drives a resume RPC through CubeMaster → Cubelet → containerd, which restores the VM from the pause snapshot. The dataplane request blocks until the resume completes (bounded by nginx's proxy_read_timeout). Concurrent resumes of the same sandbox are coalesced in-process (singleflight pattern); cross-replica coordination uses Redis SETNX locks.
  • Configurable resource release ratio (#553): A node-level configuration host.quota.paused_resource_release_ratio (float [0, 1], default 0) controls how much CPU/memory quota paused sandboxes release back to the scheduler. At ratio 1.0, all quota is released for maximum node density; at ratio 0, paused sandboxes retain full quota (guaranteed resume). Before resuming, a local admission check verifies the node has capacity — if not, the resume is rejected with HTTP 409 and a precise capacity diagnostic.
  • Traffic access token gating (#639): Sandboxes created with network.allow_public_traffic=false receive a per-sandbox traffic_access_token (UUID v4). CubeProxy enforces this token on every inbound request (both cold-path and cache-hit), returning HTTP 403 for missing or mismatched tokens. Token values are redacted from all logs. Accepts both e2b-traffic-access-token and cube-traffic-access-token headers.
  • Kill-path lifecycle: Sandboxes created with on_timeout="kill" are terminated via task.Kill when their idle timeout expires. New POST /cube/sandbox/timeout and POST /cube/sandbox/refresh APIs expose end_at for deterministic lifecycle management. The CubeProxy gate returns 410 Gone for sandboxes that are being killed or have already been killed.
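A minimal Python sketch of the two decisions described above: when the sweeper pauses a sandbox, and whether a node can admit a resume under host.quota.paused_resource_release_ratio. The Sandbox fields, the helper names, and the exact warmup rule (no pauses while the sidecar is still inside its warmup window) are illustrative assumptions, not the sidecar's or Cubelet's actual code. The admission rule also assumes a paused sandbox only needs to re-acquire the share of its quota it released.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Sandbox:
    sandbox_id: str
    last_active: float      # epoch seconds, as reported by the proxy's log phase
    timeout_seconds: float  # per-sandbox idle timeout
    cpu_millis: int         # full CPU quota (millicores)
    mem_mib: int            # full memory quota (MiB)


def should_pause(sb: Sandbox, now: float, sidecar_started_at: float,
                 bootstrap_warmup: float) -> bool:
    """Sweeper decision: pause once idle time reaches the timeout."""
    # Assumed warmup rule: never pause while the sidecar is still inside its
    # warmup window, because activity data may not have been reloaded yet.
    if now - sidecar_started_at < bootstrap_warmup:
        return False
    return now - sb.last_active >= sb.timeout_seconds


def released_quota(sb: Sandbox, ratio: float) -> Tuple[int, int]:
    """CPU/memory a paused sandbox hands back to the scheduler."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("paused_resource_release_ratio must be in [0, 1]")
    return int(sb.cpu_millis * ratio), int(sb.mem_mib * ratio)


def admit_resume(sb: Sandbox, ratio: float, free_cpu_millis: int,
                 free_mem_mib: int) -> Tuple[int, str]:
    """Local admission check before resume: returns (HTTP status, diagnostic)."""
    need_cpu, need_mem = released_quota(sb, ratio)
    if need_cpu > free_cpu_millis or need_mem > free_mem_mib:
        return 409, (f"cannot resume {sb.sandbox_id}: needs cpu={need_cpu}m "
                     f"mem={need_mem}MiB, node has cpu={free_cpu_millis}m "
                     f"mem={free_mem_mib}MiB free")
    return 200, "ok"
```

At the default ratio of 0 nothing is released, so the check always passes (guaranteed resume); at 1.0 the full quota has to fit on the node again, which is the density/guarantee trade-off the setting controls.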

New files: CubeProxy/sidecar/ (Go binary — sweeper, resumer, registry, stream consumer, last-active poller, Redis coordination); CubeProxy Lua modules (sandbox_state.lua, admin_phase.lua); CubeMaster lifecycle endpoints; Cubelet pause/resume RPCs.
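In-process resume coalescing follows the singleflight pattern: the first caller for a sandbox runs the resume, and concurrent callers with the same key wait for and share its result. The Python sketch below illustrates the pattern only (the sidecar itself is a Go binary); the class and method names are hypothetical, and cross-replica coordination through Redis SETNX is out of scope.

```python
import threading
from typing import Any, Callable, Optional


class _Call:
    """State shared between the leader and the waiters for one key."""

    def __init__(self) -> None:
        self.done = threading.Event()
        self.result: Any = None
        self.error: Optional[Exception] = None


class SingleFlight:
    """Coalesce concurrent calls that share a key into a single execution."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._calls = {}  # key -> _Call currently in flight

    def do(self, key: str, fn: Callable[[], Any]) -> Any:
        with self._lock:
            call = self._calls.get(key)
            leader = call is None
            if leader:
                call = _Call()
                self._calls[key] = call
        if leader:
            try:
                call.result = fn()
            except Exception as exc:  # the same error is re-raised to every waiter
                call.error = exc
            finally:
                # Forget the key before waking waiters, so a later pause/resume
                # cycle starts a fresh call instead of reusing this result.
                with self._lock:
                    del self._calls[key]
                call.done.set()
        else:
            call.done.wait()
        if call.error is not None:
            raise call.error
        return call.result
```

A resume handler would wrap its RPC as `flights.do(sandbox_id, lambda: resume(sandbox_id))`, so a burst of requests hitting the same paused sandbox produces one resume RPC.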

ARM64 (aarch64) Native Support

CubeSandbox now runs natively on ARM64 hosts, spanning the hypervisor, guest agent, shim, networking (BPF), build system, CI/CD, and deployment tooling. The work was a deep collaboration between Arm engineering and the Cube project team, progressing from feasibility to formal enablement.

  • Hypervisor port (4dc7275): The SysCtrl device (guest-to-host signaling for shutdown, reboot, vsock-ready) was rewritten from PIO (x86-only) to MMIO for ARM64, registered on the mmio_bus at LEGACY_SYS_CTRL_MAPPED_IO_START. KVM register access (get_one_reg) was updated for a changed upstream API signature. Seccomp rules were aligned for ARM64 syscall number differences (SYS_lstat vs SYS_fstatat/SYS_newfstatat). Live migration support remains x86_64-gated.
  • Guest agent (cb5706a): The RPC readiness signal changed from x86 ioperm() + PIO port write (port 0x680) to ARM64 /dev/mem mmap at physical address 0x0903_0000 (SysCtrl MMIO region) with ptr::write_volatile. Build target auto-detected from host arch.
  • CubeShim (fe3044c, 4feab88): The kernel command line is adapted per architecture: console=ttyAMA0,115200 (ARM PL011 UART) vs console=hvc0 (virtio-console). x86-only mitigations (no_timer_check, noreplace-smp) are gated behind #[cfg(target_arch = "x86_64")]. Seccomp allow-lists were aligned: SYS_mkdir is replaced by SYS_mkdirat on ARM64, and the missing SYS_faccessat2 (needed for glibc path resolution) was added on ARM64.
  • BPF / CubeNet (cb263f2): Hardcoded -target amd64 in BPF //go:generate directives replaced with -target $GOARCH. Per-architecture vmlinux.h headers (amd64 + arm64 BTF dumps). BPF object files regenerated at build time; prebuilt .o files removed from git. Requires clang ≥ 14 (added to builder image via apt.llvm.org).
  • Multi-arch builds (c6bb376, 475b488): Dockerfile.builder parameterized with TARGETARCH for Go, protoc, and Rust toolchain downloads. CubeEgress, CubeAPI, and envd images support multi-arch manifest lists. A containerized make guest-kernel target supports both native and cross builds with architecture-specific kernel configs (kernel-oc9.x86_64.config, kernel-oc9.aarch64.config).
  • CI/CD (#720): Builder image, VMLinux, and one-click release workflows all produce per-architecture artifacts. Release workflow split into release_amd64 and release_arm64 jobs running on native runners; both upload to the same GitHub Release.
  • Deployment (cb52bc7): Dev environment (run_vm.sh) auto-detects architecture — ARM64 uses machine virt with UEFI firmware (qemu-efi-aarch64). Release bundle script packages per-architecture mkcert binaries.
  • Known limitations (documented): PVM (nested KVM) is x86_64-only; ARM64 requires bare-metal with native KVM. Live migration is x86_64-only.
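The per-architecture kernel command line can be sketched as follows. CubeShim makes this choice at compile time with #[cfg(target_arch = ...)]; this Python version decides at runtime instead, purely for illustration. The function names and the arm64/amd64 alias normalization are assumptions, and the sketch assumes x86_64 guests use console=hvc0; the parameter values themselves are the ones listed above.

```python
import platform
from typing import Optional, Sequence

# Parameters the release notes list as x86-only.
X86_ONLY_PARAMS = ("no_timer_check", "noreplace-smp")

# platform.machine() reports different names on different OSes.
_ARCH_ALIASES = {"arm64": "aarch64", "amd64": "x86_64"}


def normalize_arch(arch: str) -> str:
    arch = arch.lower()
    return _ARCH_ALIASES.get(arch, arch)


def guest_console(arch: str) -> str:
    if arch == "aarch64":
        return "console=ttyAMA0,115200"  # ARM PL011 UART
    if arch == "x86_64":
        return "console=hvc0"  # virtio-console
    raise ValueError(f"unsupported architecture: {arch}")


def kernel_cmdline(arch: Optional[str] = None, extra: Sequence[str] = ()) -> str:
    """Build the guest kernel command line for the given (or host) architecture."""
    arch = normalize_arch(arch or platform.machine())
    params = [guest_console(arch)]
    if arch == "x86_64":
        params.extend(X86_ONLY_PARAMS)
    params.extend(extra)
    return " ".join(params)
```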

TencentCloud Terraform Cluster Deployer

A production-grade, one-click cluster deployment for TencentCloud driven entirely by Terraform IaC. From a single release bundle and create.sh entry point, the deployer provisions a full CubeSandbox cluster with managed control plane, HA middleware, and elastic compute nodes.

  • Infrastructure provisioning (#629): Terraform provisions a private VPC (10.0.0.0/16) with per-zone subnets, NAT Gateway with EIP, security groups, and a bastion jumpserver. Managed cloud services are automatically created:
    • MySQL (TencentDB 8.0): Multi-AZ with semi-sync replication, 4 GB / 200 GB, application account cube with database-level privileges.
    • Redis (TencentDB 7.0): Standard master/replica architecture, configurable memory (default 1 GB), password-protected.
    • CFS (Cloud File Storage): NFS share for cube-master's shared persistent storage (/data/CubeMaster/storage), mounted ReadWriteMany across replicas. Optional — enables multi-replica HA mode.
    • TCR (Tencent Container Registry): Private registry with VPC peering, namespace per deployment, long-lived access token.
  • TKE control plane (#629): Managed Kubernetes cluster (v1.34.1, GlobalRouter, containerd) with intranet-only apiserver. Four control-plane components deployed as Deployments with CLB Services:
    • cube-master: Shared CFS NFS volume for template/snapshot/runtime state. CubeEgress MITM CA (ECDSA P-256, Terraform-generated). Internal CLB on port 8089.
    • cube-api: Public CLB on port 3000, proxies to cube-master via cluster DNS.
    • cube-proxy: Public CLB on ports 80/443, with Redis and TLS Secrets.
    • cube-webui: Public CLB on port 80, nginx reverse-proxying to cube-api and cube-proxy.
  • Configurable replicas (#658): cubemaster_replicas, cube_api_replicas, cube_proxy_replicas, cube_webui_replicas variables (default 1 for POC mode). When TENCENTCLOUD_USE_CFS=true, cube-master multi-replica HA is enabled with cubemaster_replicas driving both spec.replicas and scheduler concurrency apportionment.
  • Elastic compute nodes: Configurable CVM instances (PVM or bare-metal) in private VPC, auto-scaling via TKE node pool. Default 1 node, configurable count and instance types.
  • One-click upgrade mode (#538): install.sh --mode=upgrade detects existing installations, performs three-way .env config merge (new defaults + old customizations + explicit overrides), runs fail-fast preflight checks (disk space, semver compatibility, CIDR conflict), and backs up configuration before any destructive change. User customizations are preserved; secrets are redacted in the diff report but retained in the actual merged file.
  • External MySQL/Redis (#514): One-click installer supports pointing at pre-existing external MySQL and Redis instances via CUBE_EXTERNAL_MYSQL_* / CUBE_EXTERNAL_REDIS_* variables. Local Docker containers are masked, and all components (CubeMaster, CubeAPI, CubeProxy) are configured to use the external endpoints.
  • Other installer improvements: CUBE_PROXY_HOST_PORT deprecated, split into CUBE_PROXY_HTTP_PORT / CUBE_PROXY_HTTPS_PORT (#588). Manual SQL seed removed in favor of CubeMaster embedded migrations (#628). CubeEgress integrated into compute node startup (#707). Improved external dependency compatibility with shell-safe env persistence and redis-cli timeout detection (#673). Cubelet reporting interval and scheduler scoring configured for multi-node TencentCloud deployments (b8f2242). Image defaults updated to v0.5.0 tags (#719).
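A minimal sketch of the three-way .env merge behind --mode=upgrade, assuming simple layered precedence (explicit overrides > old customizations > new defaults) and a key-name heuristic for deciding which values are secrets. The helper names, the secret-detection pattern, and the example keys other than CUBE_PROXY_HTTP_PORT are hypothetical; the real install.sh logic may differ. Note that the merged result keeps the real secret while only the diff report redacts it.

```python
import re
from typing import Dict, List, Mapping, Optional

# Heuristic (assumption): keys ending in these suffixes hold secrets.
SECRET_KEY = re.compile(r"(PASSWORD|SECRET|TOKEN|_KEY)$")


def parse_env(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and comments."""
    env: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env


def merge_env(new_defaults: Mapping[str, str], old_env: Mapping[str, str],
              overrides: Mapping[str, str]) -> Dict[str, str]:
    """Later layers win: new defaults < old customizations < explicit overrides."""
    merged = dict(new_defaults)
    merged.update(old_env)
    merged.update(overrides)
    return merged


def _show(key: str, value: Optional[str]) -> str:
    if value is None:
        return "<unset>"
    return "***" if SECRET_KEY.search(key) else value


def diff_report(new_defaults: Mapping[str, str], merged: Mapping[str, str]) -> List[str]:
    """Keys whose merged value differs from the new default, with secrets redacted."""
    return [
        f"{key}: default={_show(key, new_defaults.get(key))} merged={_show(key, merged[key])}"
        for key in sorted(merged)
        if merged[key] != new_defaults.get(key)
    ]
```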

New files: deploy/one-click/terraform/tencentcloud/ (create.sh, destroy.sh, main.tf, variables.tf, tke-addons.tf, outputs.tf, lib-state-sync.sh, env.example); docs/guide/tencentcloud-terraform-deploy.md (EN + ZH).
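Two of the upgrade preflights (semver compatibility and CIDR conflict) can be sketched with the Python standard library. The names semver_compare and version_lt mirror the deployer's shell helpers listed under Bug Fixes, but this is not their implementation; the no-downgrade rule, ignoring pre-release/build suffixes, and the example CIDRs are assumptions.

```python
import ipaddress
from typing import Iterable, List, Tuple


def parse_semver(version: str) -> Tuple[int, int, int]:
    """'v0.5.0' or '0.5.0-rc1' -> (0, 5, 0); pre-release/build suffixes are ignored."""
    core = version.strip().lstrip("v").split("+", 1)[0].split("-", 1)[0]
    major, minor, patch = (int(part) for part in core.split("."))
    return major, minor, patch


def semver_compare(a: str, b: str) -> int:
    """Return -1, 0 or 1, comparing numerically (so 0.10.0 > 0.9.0)."""
    pa, pb = parse_semver(a), parse_semver(b)
    return (pa > pb) - (pa < pb)


def version_lt(a: str, b: str) -> bool:
    return semver_compare(a, b) < 0


def upgrade_allowed(installed: str, target: str) -> bool:
    # Hypothetical rule: refuse downgrades; allow re-runs and upgrades.
    return not version_lt(target, installed)


def cidr_conflicts(sandbox_cidr: str, host_cidrs: Iterable[str]) -> List[str]:
    """Return the host networks that overlap the sandbox CIDR."""
    wanted = ipaddress.ip_network(sandbox_cidr, strict=False)
    conflicts = []
    for cidr in host_cidrs:
        net = ipaddress.ip_network(cidr, strict=False)
        if net.version == wanted.version and net.overlaps(wanted):
            conflicts.append(str(net))
    return conflicts
```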

Network Security Hardening

Three critical patches harden sandbox network security (inbound access control, a fail-closed outbound bootstrap, and policy-routing egress), alongside several BPF datapath fixes and a new hardening guide.

  • Per-sandbox traffic access token (#639): Described in AutoPause/AutoResume above. CubeProxy enforces traffic_access_token on every inbound request when AllowPublicTraffic=false, returning 403 for missing/bad tokens. Token values are never logged.
  • CubeEgress fail-closed bootstrap (c5f811d): During CubeEgress startup, before L7 policies are loaded from CubeMaster (bootstrap_status ≠ "ready"), the proxy now returns 403 for all non-audit traffic instead of the previous fail-open behavior. This eliminates the security gap where sandbox outbound traffic could bypass all host/SNI/method/path controls during a restart window.
  • Route-aware egress / cube-router (7c514e9): An optional cube-router kernel dummy device allows sandbox outbound traffic to enter the Linux host routing stack instead of being hard-redirected to the primary NIC. Traffic can leave through any routable device (eth0, eth1, GRE tunnels, VXLAN, WireGuard), enabling seamless integration with existing multi-NIC and VPN network infrastructure. Existing CubeEgress L7 policy, DNS allow-list, and port mapping are all preserved.
  • BPF TCP checksum fix (7c2dd1f): Fixed invalid TCP checksums on cross-node port-mapped sandbox replies. The snat_tcp() function was incorrectly using BPF_F_PSEUDO_HDR when only the TCP port had changed, causing checksum corruption on multi-node deployments.
  • from_world cleanup (5b91946): Removed ineffective from_world TC filter attachment on the loopback interface, reducing unnecessary eBPF hook points.
  • Host service access (48d080e): Sandbox outbound traffic to the host's own IP is now redirected via BPF shortcut, enabling sandboxed code to reach host-local services.
  • Network hardening guide (#663): New bilingual documentation covering default control-plane attack surface, binding strategies (private NIC vs firewall whitelisting), CubeAPI auth callback with path+method validation, and TLS/credential rotation guidance.
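The inbound token gate and the CubeEgress fail-closed bootstrap both reduce to small admission decisions. The Python sketch below is illustrative only (CubeProxy implements the inbound gate in Lua); the function names, the header precedence (the e2b header is checked first), and how audit traffic is handled after the bootstrap gate are assumptions. It uses hmac.compare_digest so that comparison time does not reveal how much of a guessed token matched.

```python
import hmac
from typing import Dict, Mapping, Optional

TOKEN_HEADERS = ("e2b-traffic-access-token", "cube-traffic-access-token")


def check_traffic_token(headers: Mapping[str, str], allow_public_traffic: bool,
                        expected_token: Optional[str]) -> int:
    """Inbound gate: 200 to proxy the request, 403 to reject it."""
    if allow_public_traffic:
        return 200
    lowered = {name.lower(): value for name, value in headers.items()}
    presented = next((lowered[h] for h in TOKEN_HEADERS if h in lowered), None)
    if presented is None or not expected_token:
        return 403
    # Constant-time comparison of the presented and expected tokens.
    if not hmac.compare_digest(presented.encode(), expected_token.encode()):
        return 403
    return 200


def redacted_headers(headers: Mapping[str, str]) -> Dict[str, str]:
    """Copy of the headers that is safe to log: token values are replaced."""
    return {name: "[REDACTED]" if name.lower() in TOKEN_HEADERS else value
            for name, value in headers.items()}


def egress_bootstrap_gate(bootstrap_status: str, is_audit: bool) -> Optional[int]:
    """Outbound gate: 403 for non-audit traffic until L7 policies are loaded.

    Returns None when this gate does not decide, i.e. normal host/SNI/method/path
    policy evaluation applies.
    """
    if bootstrap_status != "ready" and not is_audit:
        return 403
    return None
```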

✨ Enhancements

SDK

  • E2B filesystem API alignment (#678): Go and Python SDKs now implement the full E2B filesystem API surface: list, stat, exists, remove, rename, mkdir, and watch. Comprehensive integration tests included.
  • Python SDK PTY APIs (250d248): Complete PTY (pseudo-terminal) interface added to the Python SDK — create, connect, kill, send_stdin, resize — with streaming output via PtyHandle iterator. Speaks envd's Connect-JSON RPC directly; no dependency on e2b Python packages.
  • Python SDK E2B network.rules transforms (#568): Compatibility with E2B's per-host {transform: {headers: {...}}} credential injection shape, translated into CubeEgress L7 action.inject rules. Drop-in replacement for codebases using E2B's per-host credential injection.
  • Python SDK double-encoding fix (#572): Execution logs and errors are no longer double-encoded, fixing corrupted output display.

Performance

  • Pure-Go native rootfs export (#558): A daemonless, pure-Go rootfs export pipeline that bypasses Docker, skopeo, and umoci entirely. Features concurrent prefetch, loop-mount streaming directly into ext4 block devices, and a "decompress-and-delete" strategy that significantly reduces peak memory and build time compared to the previous skopeo/umoci pipeline. Enabled by default.
  • VirtIO block performance (#575): Enables VIRTIO_BLK_F_SEG_MAX (multi-segment requests) and VIRTIO_RING_F_INDIRECT_DESC (indirect descriptors) in the hypervisor's virtio-blk device, improving sequential write throughput from 2888 MiB/s to 3293 MiB/s (about 14%, measured with fio).

Template Management

  • Image pull progress TUI (#580): Real-time pull progress tracking with Redis-backed persistence. New template watch and template build-watch CLI commands provide an interactive bubbletea TUI showing live metrics (download speed, per-layer completion, step-by-step checklist), with plain-text fallback on non-TTY terminals.
  • Latest job ID in template list (#546): Template list and detail APIs now expose each template's latest create/rebuild job ID. The Web UI automatically opens build logs when viewing a template with an active (running/pending/building) job.
  • Artifact resource leak fixes (#631): Introduces t_cube_artifact_node_placement table for node-level artifact tracking independent of replica lifecycle. Periodic artifact GC with MySQL GET_LOCK for HA coordination. Hardened cascade cleanup across CubeMaster, Cubelet, and CubeAPI. AgentHub snapshot cascade cleanup with path traversal protection.

Web UI

  • Template creation overhaul (#675): New multi-step template creation form with image source selection, instance type configuration, network settings, and advanced options. Improved validation and field organization.

AgentHub

  • LLM env-var fallback removal (#602): All AgentHub LLM secrets and settings now live exclusively in the database, encrypted with a per-installation CSPRNG-generated master key. Environment variable fallback paths are eliminated. The one-click upgrade script actively deletes obsolete env keys. decrypt_or_passthrough now fails closed for undecryptable payloads.
  • Assistant state persistence (#582): Authentication, OpenClaw runtime state, template inheritance, and snapshot recovery behavior are now persisted in the database. The change also adds bcrypt-based authentication, WeCom secret decryption, and rate limiting on auth entry points, and enables backup/restore and safe cloning of digital assistants.

Deployment & Installer

  • One-click upgrade mode (#538): Described in the TencentCloud Terraform section above; --mode=upgrade provides a three-way env merge, pre-upgrade backup, and fail-fast preflights.
  • External MySQL/Redis (#514): Described above; the one-click installer can point at pre-existing external database instances.
  • CUBE_PROXY port split (#588): CUBE_PROXY_HOST_PORT deprecated; CUBE_PROXY_HTTP_PORT (default 80) and CUBE_PROXY_HTTPS_PORT (default 443) provide separate HTTP/HTTPS control.
  • Static builds (#583): cube-api, cubemaster, and cubemastercli are now fully static binaries (CGO_ENABLED=0 / musl target), eliminating host glibc dependency and preventing version-skew failures.
  • Embedded migrations (#628): Manual SQL seed removed; CubeMaster embedded goose migrations own single-node seed rows. Eliminates mysql client dependency on control nodes.
  • CubeEgress compute integration (#707): CubeEgress integrated into compute node startup/down scripts, extending egress policy enforcement to all nodes.
  • resolvectl compatibility (#703): Tolerates missing resolvectl default-route on older systemd (pre-v240), preventing install failures on RHEL 8.3 and similar distributions.

Infrastructure

  • Redis key unification (#609): All Redis keys centralized in pkg/base/rediskey with consistent naming. Read/write pool separation dropped in favor of a single pool, simplifying multi-node deployment configuration.
  • Migration identity hardening (#620): Three-layer defense against silent migration skipping: 14-digit UTC timestamp prefixes for new migrations, out-of-order application support, and SHA-256 content fingerprinting with startup verification. CI enforces immutability of already-merged migration files.
  • Node label management API (#633): POST /nodes/{id}/labels and DELETE /nodes/{id}/labels endpoints for admin-managed labels. Kubernetes-compatible naming (DNS1123 subdomain prefix), SELECT FOR UPDATE race protection, 64-label limit, system-reserved namespace protection (kubernetes.io, beta.kubernetes.io, cube.cloud.tencentcloud.com).
  • envd version reporting (#650): Collected envd versions propagated as sandbox annotations, enabling E2B SDK feature gating against real runtime versions instead of a hardcoded constant.
  • Configurable HTTP bind (#662): CUBEMASTER_HTTP_BIND makes the HTTP listen address configurable, supporting private-NIC binding for network security hardening.
  • TencentCloud scheduler scoring (b8f2242): Multi-node scheduler scoring with real_time_weighted_average plugin balancing mvm_num, local_create_num, cpu_usage, and quota_mem_usage. priority_select_num dynamically capped at min(compute_node_count, 3).
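Two of the three migration-identity layers (14-digit UTC timestamp prefixes and SHA-256 content fingerprinting), plus out-of-order pending detection, can be sketched as follows. The filename pattern beyond the prefix, the in-memory map of recorded fingerprints, and all helper names are hypothetical; CubeMaster's goose integration stores and checks this state in its own way.

```python
import hashlib
import re
from datetime import datetime
from typing import Dict, List

# Hypothetical naming rule: <YYYYMMDDHHMMSS>_<name>.sql
NAME = re.compile(r"^(\d{14})_\w+\.sql$")


def migration_version(filename: str) -> int:
    """Return the 14-digit UTC timestamp prefix as an int."""
    match = NAME.match(filename)
    if match is None:
        raise ValueError(f"{filename}: expected a 14-digit UTC timestamp prefix")
    stamp = match.group(1)
    # Validate as a real calendar timestamp; datetime() rejects e.g. month 13.
    datetime(int(stamp[0:4]), int(stamp[4:6]), int(stamp[6:8]),
             int(stamp[8:10]), int(stamp[10:12]), int(stamp[12:14]))
    return int(stamp)


def fingerprint(content: bytes) -> str:
    return hashlib.sha256(content).hexdigest()


def verify_applied(files: Dict[str, bytes], recorded: Dict[int, str]) -> List[str]:
    """Startup check: flag applied migrations whose file content has changed."""
    errors = []
    for name, content in files.items():
        expected = recorded.get(migration_version(name))
        if expected is not None and expected != fingerprint(content):
            errors.append(f"{name}: content changed after it was applied")
    return errors


def pending(files: Dict[str, bytes], recorded: Dict[int, str]) -> List[str]:
    """Out-of-order friendly: any unapplied file is pending, even if older
    than the newest applied version."""
    return sorted((name for name in files if migration_version(name) not in recorded),
                  key=migration_version)
```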

🐛 Bug Fixes

These fixes address issues present in v0.4.0:

  • High-concurrency rollback deadlock (#693): Snapshot runtime active binding refactored into a dedicated t_cube_snapshot_runtime_active table with sandbox-level and resource-level distributed locks. Eliminates MySQL 1213 deadlock errors.
  • Image uid/gid squashing (#671, #608): Two complementary fixes for image ownership preservation:
    • Native export path (#671): Removed the WithNoSameOwner() option, which squashed every file's uid/gid to the extracting user and broke image ownership.
    • Template center (#608): umoci unpack --rootless is now passed only when euid ≠ 0, that is, not when running as root. The Docker-export fallback uses --same-owner --numeric-owner. This fixes Chromium profile write failures and CDP unreachability in browser sandboxes, as well as envd exec EACCES on /home/user for Python images.
  • Hung hostdir mounts (#691): Hostdir bind and remount operations now run through a bounded 3-second timeout subprocess. Previously, stale NFS mounts or other hung filesystems would block sandbox creation indefinitely.
  • CubeProxy envd streaming buffering (#647): Nginx response buffering disabled for envd server-streaming endpoints. Previously, nginx buffered early stream frames, breaking immediate-return semantics for background commands and watch streams.
  • MSI-X table/PBA hardening (#619): Guest-triggered panics in the VMM's MSI-X table/PBA read/write paths replaced with graceful error handling. A malicious or buggy guest can no longer crash the VMM process via invalid MSI-X accesses.
  • CubeProxy implementation detail leakage (#653): Removed X-Cube-Retcode response header and $cube_retcode access-log field that exposed internal failure-mode codes. Errors now return opaque HTTP status codes with uniform JSON bodies, preventing sandbox ID enumeration and infrastructure probing.
  • Auth callback method forwarding (#315): X-Request-Method now forwarded to auth callbacks alongside X-Request-Path, enabling fine-grained (path + method) authorization. Previously, a read-only credential could access destructive endpoints on the same path.
  • envd command env propagation (#566): Fixed create-time environment variables being dropped when starting envd via commands.run.
  • Sandbox preview route (#570): Fixed missing /cube/sandbox/preview route handler that caused cubemastercli tpl render to silently fail.
  • One-click upgrade bundle integrity (#597): Fixed missing scripts/common/ directory in upgrade bundles that broke preflight validation scripts for the upgrade feature (#538).
  • Template creation Cube CA forwarding (#652): Fixed with_cube_ca parameter not being forwarded when creating templates, ensuring clients can control whether the CubeEgress root CA is baked into template rootfs.
  • CubeEgress compute CA refresh (#614): CubeEgress on compute nodes now refreshes its MITM CA from CubeMaster, fixing CA mismatch after master rotation.
  • Template info backfill (#594): Backfilled missing created_at and image_info fields in template detail responses.
  • Snapshot delete job cleanup (#559): Cleaned orphaned job rows after snapshot deletion, preventing stale build job references.
  • Cross-replica node sync (#542): Periodic reload goroutine started for nodemeta cross-replica synchronization, fixing stale node state in multi-replica CubeMaster deployments.
  • BPF inner map BTF key/value (#595): CubeVS inner maps now created with BTF key/value, fixing BPF map compatibility on newer kernels.
  • One-click install root enforcement (#649): Configurable install root paths were removed, and the install-prefix safety assertion was hardened to prevent accidental wipes of system directories.
  • One-click MIRROR persistence (#622): MIRROR env now persisted to .one-click.env for consistent image registry selection across restarts.
  • One-click same-CIDR reinstall (#586): The installer now distinguishes a same-CIDR reinstall from a CIDR change and uses systemd to stop services cleanly.
  • One-click compute quickcheck race (#637): Post-start compute node checks now tolerate transient startup races.
  • One-click DATABASE_URL persistence (#611): DATABASE_URL now persisted for local MySQL in install and systemd start paths.
  • Python SDK streaming body (377a99d): Request bodies buffered before copying in IPOverrideTransport, fixing multipart upload failures.
  • CubeProxy log cleanup (#593): Dead log fields and faulty-backend stubs removed from CubeProxy.
  • TencentCloud deployer semver helpers (#587): semver_compare and version_lt functions added for correct upgrade decision logic.
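The hung-hostdir fix (#691) bounds each bind or remount by running it in a child process with a 3-second deadline. A Python sketch of that pattern follows; run_bounded is a hypothetical helper, not Cubelet code. One subtlety worth keeping: a child blocked in uninterruptible I/O on a stale NFS mount may not exit even after SIGKILL, so the sketch reaps the killed child on a background thread instead of waiting for it inline.

```python
import subprocess
import threading
from typing import Sequence, Tuple

MOUNT_TIMEOUT_SECONDS = 3.0


def run_bounded(argv: Sequence[str],
                timeout: float = MOUNT_TIMEOUT_SECONDS) -> Tuple[bool, str]:
    """Run a mount-style command with a hard deadline; returns (ok, detail)."""
    proc = subprocess.Popen(list(argv), stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE, text=True)
    try:
        out, err = proc.communicate(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()
        # A child stuck in uninterruptible I/O may ignore SIGKILL until the
        # I/O completes, so reap it in the background rather than blocking
        # the caller (and sandbox creation) on wait().
        threading.Thread(target=proc.wait, daemon=True).start()
        return False, f"timed out after {timeout}s: {' '.join(argv)}"
    if proc.returncode != 0:
        return False, err.strip() or f"exit status {proc.returncode}"
    return True, out.strip()
```

A caller would use it as, for example, `run_bounded(["mount", "--bind", src, dst])` and fail sandbox creation with the returned detail instead of hanging.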

📚 Documentation

  • ARM64 deployment guides (64931ff): All deployment guides (quickstart, bare-metal, dev environment, multi-node, self-build) updated in EN + ZH with architecture-specific instructions. ARM64 PVM limitation documented.
  • Network hardening guide (#663): Bilingual operational security guide covering control-plane attack surface, binding strategies, auth callback configuration, and TLS/credential guidance.
  • TencentCloud Terraform deploy guide (ae81121): Full deployment guide (EN + ZH) for the Terraform cluster deployer.
  • Snapshot/clone/rollback deep-dive (#680): Technical deep-dive blog post on snapshot, clone, and rollback mechanisms (EN + ZH).
  • Sandbox logs guide (#692): New cubecli logs usage guide with examples.
  • Multi-node scheduler scoring (#672): Guidance on configuring multi-node scheduler scoring for balanced workload distribution.
  • Host mount permission fixes (#560): Documentation explaining host mount permission handling.
  • v0.4.0 release blog posts (#585): Release announcement and agent-friendly-service posts (EN + ZH).
  • Network deep-dive blog (#627): Technical deep-dive on CubeSandbox networking (EN + ZH).
  • README improvements (#640, #666, #646): Product highlights, v0.4 showcase, benchmark report links, architecture diagram update, homepage tagline refinement.
  • Documentation optimization (#665): Cross-documentation link fixes, sidebar navigation for snapshot-rollback-clone, localized docs flow preservation (#668, #664).
  • Install guide links (#475, #466): Installation guide callouts in benchmark posts, troubleshooting links in install error messages.

⚙️ Engineering Improvements

  • Multi-arch CI (#720): Builder image, VMLinux, and one-click release workflows all support amd64 + arm64 with per-architecture artifacts and multi-arch manifests.
  • Python SDK publish workflow (#700): GitHub Actions workflow for PyPI publishing triggered by python-sdk-v* tags. Version cross-validation (tag vs pyproject.toml vs __init__.py), smart change detection to skip unchanged publishes, and twine check validation.
  • Cubelet reporting default (#722): Reporting interval default changed to 1s for more responsive metrics.
  • CI release workflow hardening (#724): --repo flag added to gh release commands for correct repository targeting.
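The publish workflow's version cross-validation (tag vs pyproject.toml vs __init__.py) comes down to extracting three strings and comparing them. This sketch uses regular expressions and assumes the first top-level `version =` line in pyproject.toml belongs to the [project] table; the function names are hypothetical and the actual workflow may parse these files differently.

```python
import re
from typing import List

TAG = re.compile(r"^python-sdk-v(\d+\.\d+\.\d+\S*)$")
PYPROJECT_VERSION = re.compile(r'^version\s*=\s*"([^"]+)"', re.MULTILINE)
INIT_VERSION = re.compile(r"^__version__\s*=\s*[\"']([^\"']+)[\"']", re.MULTILINE)


def version_from_tag(tag: str) -> str:
    """'python-sdk-v0.5.0' or 'refs/tags/python-sdk-v0.5.0' -> '0.5.0'."""
    name = tag.strip().rsplit("/", 1)[-1]
    match = TAG.match(name)
    if match is None:
        raise ValueError(f"not a python-sdk release tag: {tag!r}")
    return match.group(1)


def cross_validate(tag: str, pyproject_text: str, init_text: str) -> List[str]:
    """Return one error per source that disagrees with the tag (empty if all agree)."""
    expected = version_from_tag(tag)
    errors = []
    for source, pattern, text in (
        ("pyproject.toml", PYPROJECT_VERSION, pyproject_text),
        ("__init__.py", INIT_VERSION, init_text),
    ):
        match = pattern.search(text)
        if match is None:
            errors.append(f"{source}: no version declaration found")
        elif match.group(1) != expected:
            errors.append(f"{source} declares {match.group(1)}, tag says {expected}")
    return errors
```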