Checkpoint/Restart or Live Motion #3803

jonathan-3play · 2024-02-28T19:16:46Z

What I'd like:

Live container migration: the ability to checkpoint containers and restart them on a different node, ideally with an ease and confidence rivaling the workload migration seen in enterprise IT using virtual machines (e.g. VMware vMotion or Hyper-V Live Migration).

This will perhaps be viewed as outside the remit of a minimized, security-first OS such as Bottlerocket. OTOH Bottlerocket aspires to be the OS, foundation, and infrastructure for containerized workloads and world-leading k8s environments (e.g. EKS). Enterprise computing has long enjoyed workload migration (vMotion released 2003 and known to be used in production at scale by 2006). We'd love to see that in the container/k8s world.

In fact, we need workload migration in the container/k8s world. While autoscalers (e.g. Karpenter) can eagerly provision more resources when needed, if a workload contains a mixture of short-, medium-, and long-duration jobs (ours most certainly do!), autoscalers are almost guaranteed to "strand" some nodes awaiting completion of the longest-running jobs. Without workload migration, there is no way to effectively consolidate the long-running jobs and "compact" the cluster's resources.
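To make the stranding arithmetic concrete, here is a minimal sketch. The cluster shape, the slot count, and the function names `nodes_in_use` and `nodes_after_compaction` are hypothetical, and it assumes every job occupies one equally sized slot, which real workloads will not.

```python
import math


def nodes_in_use(jobs_per_node):
    """Count nodes that still host at least one running job."""
    return sum(1 for jobs in jobs_per_node if jobs > 0)


def nodes_after_compaction(jobs_per_node, slots_per_node):
    """Fewest nodes needed if running jobs could be moved freely.

    Assumes uniform job size: each job takes exactly one slot.
    """
    total_jobs = sum(jobs_per_node)
    return math.ceil(total_jobs / slots_per_node)


# Hypothetical snapshot: 6 nodes with 4 slots each. The short jobs have
# finished, leaving a single long-running job on five of the nodes.
remaining_jobs = [1, 1, 0, 1, 1, 1]
SLOTS_PER_NODE = 4

without_migration = nodes_in_use(remaining_jobs)
with_migration = nodes_after_compaction(remaining_jobs, SLOTS_PER_NODE)
print(f"nodes held without migration: {without_migration}")
print(f"nodes needed with migration:  {with_migration}")
```

In this made-up snapshot, five nodes stay up for five jobs that would fit on two nodes if they could be moved.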

Any alternatives you've considered:

Segmenting possibly long-running jobs onto a separate node pool in the hopes of stranding fewer resources. Effortful and home-grown. Difficult to accurately determine every job's likely run duration a priori. Somewhat challenging to link app-based duration signals with infrastructure-level (Karpenter/k8s) scheduling controls. Not clear that node segmentation would be efficient or effective.
CRIU. Unclear if supported on Bottlerocket, or how well.
DIY checkpoint/restart. Effortful and home-grown. Feels like it should be system-supported, as in the VM world.
Lighting a candle that Karpenter over time becomes smarter about recognizing node stranding and uses that understanding to better bin-pack jobs and revisit previous do-not-schedule and deprovisioning decisions.
Probably others. None feels compelling.

yeazelm · 2024-02-29T19:56:33Z

Hello @jonathan-3play, thanks for cutting this well-written issue! There has been some work around the ability to checkpoint and restore containers in cri-o and k8s. I don’t believe there is anything off the shelf, though, for what you are describing. This is a pretty interesting feature request, and I think there are some compelling use cases for being able to checkpoint/restore a long-running container. This issue will first require a deep dive into the current state of the various tools and what might be needed to deliver this type of functionality. I’d like to use this issue to track any findings that might be of interest around checkpoint/restore and CRIU in Bottlerocket.

kannon92 · 2024-03-29T13:04:13Z

kubernetes/enhancements#2008

jonathan-3play added the status/needs-triage and type/enhancement labels on Feb 28, 2024

yeazelm added the area/kubernetes label and removed the status/needs-triage label on Feb 29, 2024
