Deploy cubesandbox on k8s #171

dushulin · 2026-05-09T03:43:52Z

Deploying CubeSandbox on k8s would let it take advantage of k8s resource elasticity and resource management: cube becomes easier to use at the resource level, and control-plane service management becomes more elastic and highly available.

aisjca · 2026-05-09T03:49:02Z

Deploy across two Kubernetes clusters.

kinwin-ustc · 2026-05-09T04:10:05Z

Maintainer

The management-cluster part of this architecture diagram is almost identical to our actual intranet deployment, which is managed with the deployment capabilities of k8s. The difference is that we manage the compute cluster with an internal OSS system. If someone could get k8s management of the compute cluster working end to end, it would make a good best-practice reference.

vwxyzjn · 2026-05-13T04:40:20Z

Thank you for the reference chart @aisjca. Is there a reference k8s deployment example?

jamesxia20181001 · 2026-05-27T03:22:53Z

K8s has a large user ecosystem. We recommend prioritizing support for managing CubeSandbox sandboxes through K8s, which would greatly expand CubeSandbox's ecosystem. The recently open-sourced project https://github.com/agent-substrate/substrate by Google natively supports K8s management and offers a friendly experience for users in the K8s ecosystem.

kinwin-ustc · 2026-05-27T10:15:15Z

Maintainer

Thank you for your suggestion. We will provide guidelines later for convenient cluster deployment using Terraform or k8s configuration.

zyl1121 · 2026-06-09T13:57:37Z

I have been experimenting with running CubeSandbox in a Kubernetes environment recently, so I wanted to add a few notes from my tests.

Using Kubernetes to manage CubeSandbox makes sense to me. It would let us reuse the existing ecosystem for deployment, scheduling, rollouts, monitoring, service discovery, and general operations.

From what I have seen so far, the control-plane services are probably not the hard part. Components such as cube-api, cubemaster, and cubeproxy are relatively conventional long-running services, so they should fit into Kubernetes reasonably well.

My main concern is on the compute/data-plane side, especially cubelet, CubeVS, and network-agent. These components touch the host networking stack directly: TAP devices, routes, port mappings, and TC/eBPF programs. On a normal Kubernetes worker node, those areas are already managed or affected by the CNI, so the coexistence story is not trivial.

In my cluster, the CNI is Cilium. I saw CubeSandbox’s eBPF programs being loaded, but the datapath did not always become effective. For example, the from_envoy attachment on cube-dev egress was missing or unstable in one case, and sandbox probes / port mappings failed as a result. The same image and template creation flow worked on a clean VM without an existing Kubernetes CNI, so this looks more like an interaction with the node networking stack than an image or template issue.
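To make the "missing or unstable" observation easier to reproduce, the egress filters on cube-dev can be snapshotted and searched for the program by name. This is only a minimal Python sketch under assumptions: it assumes `tc filter show dev cube-dev egress` prints one line per attached bpf filter and that the program name appears somewhere on that line. `has_bpf_program` and `egress_filters` are hypothetical helpers, not part of CubeSandbox, and the sample output is illustrative rather than captured from a real node.

```python
import subprocess


def has_bpf_program(tc_output: str, program: str) -> bool:
    """Return True if any bpf filter line in `tc filter show` output names `program`."""
    for line in tc_output.splitlines():
        tokens = line.split()
        # Only look at lines describing a bpf classifier. Match the program
        # name as a substring of a token, because tc may print it either as
        # "obj.o:[section]" or as a bare name depending on how it was loaded.
        if "bpf" in tokens and any(program in tok for tok in tokens):
            return True
    return False


def egress_filters(dev: str) -> str:
    """Run `tc filter show dev <dev> egress` and return stdout (usually needs root)."""
    result = subprocess.run(
        ["tc", "filter", "show", "dev", dev, "egress"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Illustrative output only (assumed format, not captured from a real node).
SAMPLE = (
    "filter protocol all pref 1 bpf chain 0\n"
    "filter protocol all pref 1 bpf chain 0 handle 0x1 from_envoy "
    "direct-action not_in_hw id 42 jited\n"
)
```

On a node, `has_bpf_program(egress_filters("cube-dev"), "from_envoy")` would report whether the attachment is present at that instant. A single snapshot cannot show instability, so the check would need to be repeated over time.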

I also noticed issue #443, which reports a similar networking problem in a Kubernetes environment using Calico. That makes me think this may not be Cilium-specific, but a broader compatibility question between CubeSandbox’s networking components and Kubernetes CNI implementations.

It would be useful to clarify the recommended architecture here: can cubelet, CubeVS, and network-agent run safely on regular Kubernetes worker nodes, or should compute nodes be dedicated / configured with specific CNI assumptions? Some guidance around CNI compatibility, TC/eBPF hook ownership, TAP devices, routing, port mapping, and possibly CRI/runtime assumptions would be very helpful.

devincd · 2026-06-17T09:11:32Z

@zyl1121 thanks for writing this up in detail. This matches what I'd expect from mixing two independent network-management stacks on the same host.

I think the real issue isn't "CNI vs no CNI" but ownership of the host network stack. cubelet/CubeVS/network-agent aren't just using the network like a normal workload, they're doing the same job a CNI does (TAP/veth setup, routing, TC/eBPF programs). Cilium (and apparently Calico, per #443) also assumes it owns the host endpoint and reconciles TC hooks it doesn't recognize — that reconciliation loop is a plausible explanation for why from_envoy shows up as "missing or unstable" rather than just absent: two controllers are periodically fighting over the same hook instead of failing once and staying failed.
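The difference between "failed once and stayed failed" and "two controllers keep fighting" could be tested by sampling whether the hook is present at a fixed interval and counting state changes. A rough Python sketch, assuming the samples come from some periodic probe of the TC hook; the function name, labels, and the two-transition threshold are arbitrary choices for illustration, not anything CubeSandbox or Cilium defines:

```python
from typing import Sequence


def classify_hook(observations: Sequence[bool]) -> str:
    """Classify a time-ordered series of "is the hook attached?" samples.

    "stable"       -> attached in every sample
    "absent"       -> attached in no sample
    "changed-once" -> exactly one transition (e.g. removed once, stayed removed)
    "flapping"     -> two or more transitions, consistent with a reconcile fight
    """
    if not observations:
        raise ValueError("need at least one observation")
    # Count how many times consecutive samples disagree.
    transitions = sum(
        1 for prev, cur in zip(observations, observations[1:]) if prev != cur
    )
    if transitions == 0:
        return "stable" if observations[0] else "absent"
    if transitions == 1:
        return "changed-once"
    return "flapping"
```

A "flapping" result would support the reconciliation-loop explanation. "absent" or "changed-once" would point more toward a one-time attach failure or a single removal.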

Given that, I'd suggest the recommended architecture is to not run a general-purpose CNI on dedicated compute nodes at all, and let network-agent be the only thing managing TAP devices, routes, and TC/eBPF on those nodes. This only works cleanly if compute nodes are a separate pool that doesn't also need to schedule regular pod-networked workloads (which seems consistent with how the management cluster is described above — control-plane services are conventional, compute nodes are the special case). One thing to watch: kube-proxy (or Cilium's kube-proxy-replacement) also writes to the host's iptables/eBPF for Service routing, so it would need to be excluded from these nodes too, otherwise the conflict surface just moves from CNI to kube-proxy.

On how to actually run these three components: kubelet + static pods would work mechanically — static pods bypass the scheduler entirely (kubelet reads them straight from the local manifest directory), so they'll start even if the node is NotReady due to no CNI being configured, and with hostNetwork: true the CRI never even invokes a CNI plugin for them. But I'd push back gently on static pods as the long-term answer: they solve "can this start without API server/scheduler involvement," which doesn't seem to be the actual constraint here (there's no chicken-and-egg bootstrap problem like there is for etcd/apiserver). They also opt out of the rollout/monitoring/service-discovery benefits that were the original motivation for moving to k8s in the first place. A DaemonSet with hostNetwork: true and explicit tolerations for the network-unavailable taint gets you the same "always running, restarts on crash, no CNI dependency" properties while staying inside the normal k8s deployment/rollout/observability path — this is effectively how kube-proxy and most CNI agents are deployed today, and I don't see a reason cubelet/CubeVS/network-agent would need to deviate from that pattern.
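As a rough illustration of the DaemonSet shape described above, here is a Python sketch that builds the manifest as a plain dict and prints it as JSON (which `kubectl apply -f -` accepts). The namespace `cube-system`, the node label `cube.example/compute`, the image reference, and the `privileged` security context are all assumptions made for the example; nothing in this thread specifies them. The taint keys are the standard Kubernetes ones.

```python
import json


def compute_daemonset(name: str, image: str, pool_label: str) -> dict:
    """Build a hostNetwork DaemonSet manifest for one compute-node component."""
    return {
        "apiVersion": "apps/v1",
        "kind": "DaemonSet",
        # Namespace is a hypothetical choice for this sketch.
        "metadata": {"name": name, "namespace": "cube-system"},
        "spec": {
            "selector": {"matchLabels": {"app": name}},
            "template": {
                "metadata": {"labels": {"app": name}},
                "spec": {
                    # Share the node's network namespace, so no CNI plugin
                    # is invoked for this pod.
                    "hostNetwork": True,
                    # Restrict to the dedicated compute pool (hypothetical label).
                    "nodeSelector": {pool_label: "true"},
                    "tolerations": [
                        # Allow scheduling on nodes with no network configured.
                        {
                            "key": "node.kubernetes.io/network-unavailable",
                            "operator": "Exists",
                            "effect": "NoSchedule",
                        },
                        # Nodes without a CNI typically report NotReady.
                        {"key": "node.kubernetes.io/not-ready", "operator": "Exists"},
                    ],
                    "containers": [
                        {
                            "name": name,
                            "image": image,
                            # Assumption: TAP/TC/eBPF setup needs elevated privileges.
                            "securityContext": {"privileged": True},
                        }
                    ],
                },
            },
        },
    }


if __name__ == "__main__":
    manifest = compute_daemonset(
        "network-agent",
        "registry.example.invalid/cube/network-agent:dev",  # hypothetical image
        "cube.example/compute",
    )
    print(json.dumps(manifest, indent=2))
```

The same function could be called once per component (cubelet, CubeVS, network-agent). Whether they belong in separate DaemonSets or in one pod is a design question this sketch does not settle.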

So in short: dedicate compute nodes, drop the general CNI (and kube-proxy) there, let network-agent be the de facto CNI for that pool, and manage the three components as a DaemonSet rather than static pods unless there's a bootstrap-ordering reason I'm missing.

zyl1121 · 2026-06-17T09:45:15Z

@devincd

Thanks, this makes sense to me. I agree that framing this as ownership of the host networking stack is more accurate than just “CNI vs no CNI”. cubelet, CubeVS, and network-agent are not normal workloads from a networking perspective, and having them compete with Cilium or Calico over routes, TAP devices, and TC/eBPF hooks is likely to be fragile.

> Given that, I'd suggest the recommended architecture is to not run a general-purpose CNI on dedicated compute nodes at all, and let network-agent be the only thing managing TAP devices, routes, and TC/eBPF on those nodes.

My concern is that removing the general CNI and kube-proxy from compute nodes solves only the local datapath conflict, not the full architecture. cubelet still needs to report node state to cubemaster, and template creation/distribution also requires reliable communication between the control plane and compute nodes. If the control plane is managed by Kubernetes but the compute nodes cannot use normal Kubernetes Service paths such as ClusterIP, then we need another host-network-reachable path for cubemaster, such as an external LB/VIP, hostNetwork endpoint, or separate routing/service-discovery setup.
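To make the connectivity question concrete: without ClusterIP, a compute node would need an explicit list of host-network-reachable cubemaster endpoints and some way to pick a live one. A minimal Python sketch of that selection, under assumptions: the addresses and port in the usage note are placeholders (the thread does not state cubemaster's real endpoint), and `tcp_probe` / `first_reachable` are hypothetical helpers rather than existing cubelet behavior.

```python
import socket
from typing import Callable, Iterable, Optional, Tuple

Endpoint = Tuple[str, int]


def tcp_probe(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def first_reachable(
    endpoints: Iterable[Endpoint],
    probe: Callable[[str, int], bool] = tcp_probe,
) -> Optional[Endpoint]:
    """Return the first endpoint, in priority order, for which `probe` succeeds."""
    for host, port in endpoints:
        if probe(host, port):
            return (host, port)
    return None
```

For example, `first_reachable([("192.0.2.10", 8080), ("192.0.2.11", 8080)])` would try an external VIP first and a hostNetwork endpoint second, both placeholder values here. A TCP connect only shows the port is open, so real failover would still need an application-level health check.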

That may still be a valid direction, but it makes the compute pool much more special. The design would need to clearly document which components are managed by Kubernetes, which nodes run CNI/kube-proxy, how compute nodes reach the control plane, how failover/LB is handled, and what networking assumptions are required on the compute side.

So I think the “no CNI on compute nodes” model is worth considering, but it should be treated as part of a larger architecture rather than just a deployment detail. Ideally we can either find a safe coexistence model with common Kubernetes networking stacks, or document a dedicated-compute-pool model with clear control-plane connectivity, failover, and operational assumptions.
