[KEP-5359] Pod-Level Swap Control #5360


Open
wants to merge 2 commits into base: master

Conversation

iholder101
Contributor

@iholder101 iholder101 commented Jun 1, 2025

  • One-line PR description: Introduces pod-level swap control.

Other comments:

/sig node

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jun 1, 2025
@iholder101 iholder101 marked this pull request as draft June 1, 2025 14:26
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: iholder101
Once this PR has been reviewed and has the lgtm label, please assign derekwaynecarr for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. labels Jun 1, 2025
@k8s-ci-robot k8s-ci-robot requested review from dchen1107 and mrunalp June 1, 2025 14:26
@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Jun 1, 2025
@iholder101 iholder101 mentioned this pull request Jun 1, 2025
@iholder101 iholder101 force-pushed the kep/1st-pod-level-swap-control branch from 839434c to 852de62 on June 3, 2025 08:34
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jun 3, 2025
@iholder101 iholder101 force-pushed the kep/1st-pod-level-swap-control branch 3 times, most recently from b36dbc4 to 860eb09 on June 3, 2025 08:38
@iholder101 iholder101 marked this pull request as ready for review June 3, 2025 08:39
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 3, 2025
@iholder101 iholder101 force-pushed the kep/1st-pod-level-swap-control branch 2 times, most recently from 820c2fb to b48a955 on June 3, 2025 13:50

* Introduce a new `swapPolicy` field in PodSpec to allow users to configure swap for an individual pod.
Member

Per-pod or per-container?

Contributor Author

Thank you! That's a good question.

Perhaps we can start with a pod-level knob, then add a container-level one if we need to?
Do you have an opinion on this @ajaysundark?

Member

I'd rather NOT do both, but so far the trend seems to be that whichever we start with, we end up wanting the other, and then we have to define the intersection of them. So we should think about it up front.


I believe a swap "policy" at the pod level should be sufficient. Swap decisions are more workload-centric; I think applications should be considered as a 'single atomic unit' for swap:

  1. If the main application container is latency-sensitive, its critical sidecars (maybe a service mesh proxy) are also latency-sensitive because they are in the request path. They should all be governed by the same swap policy. It is hard to envision a scenario where this choice would differ between tightly-coupled processes within the same Pod.
  2. A pod-level policy provides a clear attribute for future scheduling enhancements like Node-capabilities. For example, workloads specifying swapPolicy as 'Avoid' or 'Required' would allow the scheduler to place them onto nodes (NoSwap or LimitedSwap) that can satisfy that capability, as sketched below.
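
As a loose illustration of point 2 (purely hypothetical: neither the swapPolicy field nor swap-aware scheduling exists today, and all names are placeholders), a scheduler-side feasibility check might look like this:

// Hypothetical sketch: match a pod's desired swap policy against the node's
// configured swap behavior, per the idea above. All names are placeholders.
func nodeSatisfiesSwapPolicy(podSwapPolicy, nodeSwapBehavior string) bool {
	switch podSwapPolicy {
	case "Avoid":
		// Latency-sensitive workloads: only nodes with swap fully off.
		return nodeSwapBehavior == "NoSwap"
	case "Required":
		// Workloads that depend on swap: only swap-enabled nodes.
		return nodeSwapBehavior == "LimitedSwap"
	default:
		// No preference expressed: any node qualifies.
		return true
	}
}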

Member

I agree, Pod-level sounds like quite enough.

Member

@thockin thockin Jun 10, 2025

If the main application container is latency-sensitive, its critical sidecars (maybe a service mesh proxy) are also latency-sensitive

Ehh, that's not necessarily true. People use sidecars for lots of things, and not all of them are critical or latency-sensitive. Disk cleanup, log-sending, etc. are all non-critical.

I'm fine to make it per-pod, but I'll bet $5 that a really good use-case for per-container control emerges within 12 months of GA :)

So WHEN that happens - what does per-container control mean when per-pod control is already defined?


You're right, not all sidecars are latency-sensitive. The goal for this KEP was centered on finding the simplest possible model that solves the immediate, well-understood use cases for protecting a latency-sensitive workload.

This KEP provides workload autonomy over swap, a necessary 'opt-out' policy for application owners when required. 'swapPolicy' could also provide the scheduling attribute needed for matching swap capability in the future, which is a pod-level decision.

If a future KEP introduced per-container swap control, it would likely need to be a resource-model design (since the container is already on a swap-enabled node).

It would probably look like:

  1. The pod-level swapPolicy selects the environment (e.g., a new mode AssignedLimits).
  2. A new, per-container field (e.g., resources.limits.swap) could then be introduced to set a specific ceiling on swap usage within the environment established by the pod.

When requests for container-level resource control emerge, we need to evaluate the use-cases further (a rough sketch of this shape follows below).
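
For concreteness, here is a rough sketch of how those two layers might compose. This is hypothetical: none of these fields exist today, and AssignedLimits and resources.limits.swap are placeholder names from this thread.

// Hypothetical future shape, sketched from the two steps above.
type PodSpec struct {
	// SwapPolicy selects the pod-wide swap environment. A placeholder
	// "AssignedLimits" mode would activate per-container ceilings.
	SwapPolicy SwapPolicyMode `json:"swapPolicy,omitempty"`
	// ... other PodSpec fields elided ...
}

// A pod could then combine the two levels, e.g.:
//
//	spec:
//	  swapPolicy: AssignedLimits   # pod-level: selects the environment
//	  containers:
//	  - name: app
//	    resources:
//	      limits:
//	        swap: 1Gi              # container-level ceiling within it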

// SwapPolicy defines the desired swap memory policy for this pod. This field
// is immutable after the pod has been created.
//
// If unspecified, the default value is "NoPreference", which means the pod's swap behavior is determined by the node's swap configuration.
Member

Do we want a literal "NoPreference" value or just leave it unset?

Contributor Author

Do you mean using something like the following?

// +kubebuilder:validation:Enum=NoPreference;Disabled
// +kubebuilder:default=NoPreference

I think it's not a bad idea!

Member

I'm asking, not telling. We often hear people complain about the sheer volume of YAML we produce, so leaving it "" and omitempty has some attractiveness. OTOH, we also generally prefer to be explicit, so setting a default value has some attractiveness.

The real value of an explicit default is that if we version the API we can change the default. I'm not sure that applies here.

We can leave this open for API review, but let's PLEASE ask the question at that time. In the meantime we can get some opinions.

And for the record, we don't use kubebuilder tags, we just say // +default="NoPreference" or // +default=ref(SwapPolicyNoPreference) (but I think I have come around to the simpler form)
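
For illustration, a field following that convention might look like the sketch below. This is not merged API; it simply mirrors the comment-tag style described above.

// SwapPolicy defines the desired swap memory policy for this pod.
//
// +optional
// +default="NoPreference"
SwapPolicy SwapPolicyMode `json:"swapPolicy,omitempty"`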

Contributor

I like the idea of an enum but I think the default could change over time.

For some of our OpenShift APIs, we push defaulting into the controller rather than the API so it's easier to control over time.

For alpha, I'd prefer "disabled" to be on the safer side. I don't know if we are in a place to encourage swap everywhere in the Kubernetes ecosystem, so I would default to disabling swap and let users that want swap enable it in their podSpecs.

Member

For some of our OpenShift APIs, we push defaulting into the controller rather than the API so it's easier to control over time.

Kube treats defaults as intrinsic to the GVK. Pushing it into the controller is not "defaulting" it's "setting" :)

Member

It all boils down to the question of a default behavior. If we want to support opt-in, #2400 should be extended to enable a "preferSwap" or "oktoswap" option.

Member

If we want to support opt-in, #2400 should be extended to enable a "preferSwap" or "oktoswap" option.

I thought that the current state of #2400 was predicated on swap being a thing that would start as opaque to pods, would have to be opted into at the node level, and only clusters / nodes that know enough about their workloads would choose to enable that node-wide setting. If that is still the case, I don't see a reason to take away the ability for cluster admins to opt nodes into swap for all pods just via node config. That capability seems ready.

If there is evidence that there are clusters who cannot turn on node-wide swap by default via #2400 because they don't know enough about their workloads, then adding a node option and Pod API surface to opt into swap could be reasonable, but it seems feasible to evolve the node config and pod API as part of this KEP.

Member

I thought that the current state of #2400 was predicated on swap being a thing that would start as opaque to pods, would have to be opted into at the node level, and only clusters / nodes that know enough about their workloads would choose to enable that node-wide setting.

I do not know how we can limit who will be using swap. I would expect many users will enable it as a "cheap" expansion of memory on the node.

We already have feedback that in #2400 customers want:

  • disable swap per-pod
  • enable swap on guaranteed pod (currently we have a default behavior to exclude guaranteed pods from swap)

Because of this feedback we created this KEP (#5359) and the API discussion led to this question about the default Pod behavior.

@liggitt given the need to disable swap per Pod (even when node has swap enabled), and the proposal of API in this KEP, do you think it is OK to proceed with #2400 as-is or we need to change the default in #2400 to simplify the "disable" behavior implementation?

Member

@liggitt liggitt Jun 18, 2025

@liggitt given the need to disable swap per Pod (even when node has swap enabled), and the proposal of API in this KEP, do you think it is OK to proceed with #2400 as-is or we need to change the default in #2400 to simplify the "disable" behavior implementation?

The node default is NoSwap, that is a reasonable default.

#2400 allows admins to configure nodes to automatically apply swap in limited ways to as many pods as they can, and that seems like an ~ok lever to me, especially if we're clear that the current LimitedSwap node setting will exclude pods that explicitly opt out of swap if that becomes possible in the future. If admins try to use the LimitedSwap lever to get "free memory" and their workloads become unhappy, they might have to stop using that setting and wait for pod-level opt-in to make use of swap.

Requiring a double opt-in (node-level enablement AND modifying every single pod to opt-in) is just going to lead to automated opt-in "solutions" at the pod level that do things like mutate pods as they are created in order to opt them into swap.

I think graduating #2400 as-is is ~ok, not everyone will be able to make use of it yet, and we have work to do in this KEP to give the pod-level controls, scheduling visibility, and probably a third node-level option for OptInLimitedSwap or something.

Member

Requiring a double opt-in (node-level enablement AND modifying every single pod to opt-in) is just going to lead to automated opt-in "solutions" at the pod level that do things like mutate pods as they are created in order to opt them into swap.

I think enabling double opt-in may allow an even better experience, where swap is enabled more broadly as admins rely on pod owners to check that a Pod is OK to swap from a reliability and security perspective. And automated opt-in solutions will likely always respect an explicit disable. So this argument can go both ways.

#2400 allows admins to configure nodes to automatically apply swap in limited ways to as many pods as they can, and that seems like an ~ok lever to me, especially if we're clear that the current LimitedSwap node setting will exclude pods that explicitly opt out of swap if that becomes possible in the future. If admins try to use the LimitedSwap lever to get "free memory" and their workloads become unhappy, they might have to stop using that setting and wait for pod-level opt-in to make use of swap.

The problem is that this now requires interaction between cluster admins and pod owners, and planning on where each pod needs to run based on swap preferences. So we are creating another tension point and uncertain behavior for Pods.

and we have work to do in this KEP to give the pod-level controls, scheduling visibility, and probably a third node-level option for OptInLimitedSwap or something.

One of the asks for #2400, and this is a typical bar for many KEPs, is to explain the API evolution. Since the only piece of API evolution is this KEP, I do not like the idea of graduating #2400 as-is. I do not think I am being inconsistent here. If we have a path forward explained, it will be easier to say that in a couple of releases people will have all the levers we are thinking of providing to them. With just an idea that we may have this API in the future, I do not feel comfortable saying this.

@iholder101 iholder101 force-pushed the kep/1st-pod-level-swap-control branch from b48a955 to d04c1f0 on June 9, 2025 11:14
Signed-off-by: Itamar Holder <iholder@redhat.com>
Signed-off-by: Itamar Holder <iholder@redhat.com>
@iholder101 iholder101 force-pushed the kep/1st-pod-level-swap-control branch from d04c1f0 to fe25dad on June 9, 2025 11:15
@k8s-ci-robot
Contributor

@iholder101: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-enhancements-test | fe25dad | link | true | /test pull-enhancements-test |
| pull-enhancements-verify | fe25dad | link | true | /test pull-enhancements-verify |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.


## Summary

<!--
Contributor

As you fill out sections, can you please remove the comments?

In the first phase, we'll focus on the ability for a pod to disable swap for all its containers irrespective of the underlying node
swap behavior.

### Goals
Member

I posted this comment: #5387 (comment) It may speed up the introduction of per-Pod enablement and will shape this KEP better.


const (
// SwapPolicyModeDisabled explicitly disables swap for the pod.
SwapPolicyModeDisabled SwapPolicyMode = "Disabled"


++ @liggitt had suggestions to rename Disabled to NoSwap to align with Node config

In any case, I'll keep this here for reference.

One potential concern with the `Disabled` mode is that it forces the API to allow pod owners to disable swap, regardless
of the `swapBehavior` setting configured by the admin at the kubelet level.


Enforcing field restrictions is still possible with higher-level constructs, e.g. a validation webhook or policy agents such as OPA or Kyverno.
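
A minimal sketch of that idea, assuming a hypothetical swapPolicy pod field (it does not exist in k8s.io/api today, so a plain string stands in for the real type):

import "fmt"

// validateSwapPolicy rejects pods that opt out of swap when a cluster
// policy forbids it. Sketch only; both the field and the policy are
// hypothetical, and a real webhook would decode the Pod from an
// AdmissionReview before calling something like this.
func validateSwapPolicy(namespace, name, swapPolicy string) error {
	if swapPolicy == "Disabled" {
		return fmt.Errorf("pod %s/%s: swapPolicy=Disabled is not permitted by cluster policy",
			namespace, name)
	}
	return nil
}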

// SwapPolicyModeDisabled explicitly disables swap for the pod.
SwapPolicyModeDisabled SwapPolicyMode = "Disabled"
// SwapPolicyModeNoPreference states the pod should follow node-level swap configuration.
SwapPolicyModeNoPreference SwapPolicyMode = "NoPreference"
Member

If there is a "No preference" mode, should we also have a status field which indicates the result (and leaves room for likely extensions)?

Member

I've had a hard time following how much detail about swap we want surfaced via the API ... is it supposed to be a hidden implementation detail or not?

Also, it seems ... weird for a pod to indicate "I don't care whether I use swap or not" and then want to know whether it used swap or not...

Member

I've had a hard time following how much detail about swap we want surfaced via the API

As little as we can get away with, but it seems like something that people might want to know, for example, when their app crashes - was it running with swap?

* Maintain backward compatibility: existing pods that run on swap-enabled nodes should behave as they do by default.
* Provide a mechanism to alleviate concerns regarding the "all-or-nothing" nature of node-level swap enablement, potentially
unblocking KEP-2400’s path to GA.
* Open the door to future enhancements, both at the kubelet-level swap behaviors and the new pod-level swap control API
Member

So this covers the transition from 2400 to the ability to disable swap. Why are we not covering how we will transition to specifying the limit explicitly? We need at least an idea of how pods with limits specified will work with a kubelet with limited swap. Or, if they will be completely incompatible, then we need to explicitly say that LimitedSwap is a limited-time feature and will be fully replaced with something else.

Contributor

I think pod-level swap can be another swap type for the pod. The Pod API always wins: if the kubelet is LimitedSwap then we add swap when the pod says "NoPreference", and if swap is disabled in the kubelet and the pod says "NoPreference" then there is no swap.
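
A minimal sketch of that precedence rule, with hypothetical names (the pod-level field is still under discussion, and this helper exists nowhere in the kubelet today):

// effectiveSwap resolves the node-level swapBehavior against the pod-level
// policy: an explicit pod opt-out always wins, while "NoPreference" (or an
// unset field) defers to the node configuration.
func effectiveSwap(nodeSwapBehavior, podSwapPolicy string) bool {
	if podSwapPolicy == "Disabled" {
		return false
	}
	return nodeSwapBehavior == "LimitedSwap"
}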

Contributor

also: if the node doesn't have swap enabled, then pods requesting swap via the pod API fail at admission (maybe with the NodeCapabilities feature to help with scheduling hints)

Member

I suggested something here: https://docs.google.com/document/d/19muCmfBndxfr-yq1EmeI54yq-upYmZsWLsLvcm1i6J4/edit?disco=AAABljwWs3w. In general it feels like we may have a path forward.

@dchen1107
Member

dchen1107 commented Jun 17, 2025

We discussed this topic today at SIG Node meeting. I'd like to reiterate my position and rationale raised at the meeting earlier today, building on the points we've discussed.

I strongly prefer to proceed with graduating Node Swap (KEP-2400) to GA in 1.34 as planned, without incorporating per-pod enable/disable APIs proposed in this KEP at this stage. I believe this phased approach offers the most robust and beneficial path forward for Kubernetes users:

  • Node-Level Swap Delivers Foundational Value Now:

Graduating KEP-2400 to GA in 1.34 provides immediate and tangible benefits. It enables a fundamental operating system feature (swap) for Kubernetes nodes, which is crucial for improving node stability, preventing abrupt OOM kills for burstable workloads, and enhancing overall resource utilization. Many existing applications expect swap, and this allows them to run more reliably on Kubernetes. Getting this core feature to GA unblocks a significant capability for our users and allows us to focus on future enhancements.

  • Per-Pod Enable/Disable APIs Offer Limited and Potentially Misleading Value (KEP-5359):

While the intent behind per-pod controls is valid, a simple "enable/disable" API has significant limitations and could provide a "false sense of security." As we discussed, even if a pod explicitly disables swap, it can still be negatively impacted by "noisy neighbors" on the same node that are heavily utilizing swap. The performance degradation caused by I/O contention from swapping affects the entire node, regardless of individual pod settings. Furthermore, tying swap calculation to memoryRequest rather than actual usage patterns also limits its effectiveness, particularly for burstable workloads.

  • Swap-Aware Scheduling is the Strategic Long-Term Solution:

To truly address the concerns around latency-sensitive workloads and "noisy neighbors" in a swap-enabled environment, the most effective solution is swap-aware scheduling. This allows the scheduler to intelligently place pods based on node capabilities and a pod's sensitivity or tolerance to swap activity. This provides genuine isolation and optimized resource allocation, going far beyond what a simple per-pod enable/disable API can offer. This should be our next major focus after node-level GA.

  • Per-Pod Swap Limits are a Subsequent Refinement:

More granular per-pod swap limits would be a valuable feature for fine-tuning specific workload behaviors, but this should be pursued as a separate KEP. It represents a more complex implementation detail that would build upon the foundation of node-level swap and be most effective when coupled with swap-aware scheduling.

My proposed priority, therefore, is:

  1. Graduating NodeSwap (KEP-2400) to GA in 1.34. @iholder101 @ajaysundark @SergeyKanzhelev @haircommander @kannon92 The feature should be disabled by default and the blocker issues are included and discussed in https://docs.google.com/document/d/14OLKSPl3BDzw2BBacdN0kHpDt-JN8AOTSei_XExh-2s/edit?tab=t.0
  2. Developing Swap-aware scheduling. @pravk03 @ajaysundark @yujuhong @tallclair
  3. Introducing per-pod swap limits. @ajaysundark @iholder101

This phased approach ensures we deliver immediate value with a robust node-level feature, then systematically tackle the more complex challenges of intelligent workload placement and granular controls. Attempting to bundle a limited per-pod API now would, in my view, delay essential GA, introduce a suboptimal design, and not fully resolve the underlying performance concerns. This is just my 2 cents, and happy to be convinced otherwise.

@SergeyKanzhelev
Member

I want to reiterate my position from the meeting. I do not think the comment above resolves the issue at hand.
The problem I am trying to address is that a coherent long-term memory swap API requires a minimal change to this KEP to be implemented.

As we discussed, we need to find a way forward from node-level enablement to per-Pod swap-aware scheduling to confirm that this KEP is not making the next steps harder. The need for a plan resulted in two artifacts, which were:

Reviewing the first artifact, we got clear feedback that the best default Pod behavior is NoSwap, even on nodes that have swap enabled. The proposed change was to introduce a swap mode picker for the Pod in #2400, enabling other modes in the future and making the transition to swap safer.

The proposed change to this KEP will improve it in many ways:

  • No more different behavior for guaranteed pods, and the ability to enable swap for them.
  • Addressing users' security concerns by allowing mass-disabling of swap for sensitive Pods if needed (without changing the bin-packing strategy).
  • Simplifying the transition to per-Pod swap-aware scheduling, which will be a different mode easily distinguishable from LimitedSwap (since LimitedSwap will be an explicit setting).
  • Making it easier for people to disable swap quickly, without recreating Nodes, if they have performance concerns.

Given that this addition to the KEP is small, and that it makes implementing the next KEPs easier and the API more consistent, I recommend including this addition in this KEP.

If there is an alternative proposal for how the long-term API will look, let's put it up for discussion. My understanding was that we have many agreements in this thread on the API future and no alternatives.

@liggitt
Member

liggitt commented Jun 18, 2025

we got clear feedback that the best default Pod behavior is NoSwap, even on nodes that have swap enabled. The proposed change was to introduce a swap mode picker for the Pod in #2400, enabling other modes in the future and making the transition to swap safer.

Pods can't currently express opinions about swap; it's currently an opaque, node-wide setting, which node admins should only enable if they understand their workloads well enough to know that it is safe. Some clusters know enough about their workloads to enable it node-wide as part of #2400 now. I don't think graduating that work makes later per-pod controls harder.

I can see why some clusters wouldn't know enough about their workloads to do that, and allowing for gradual per-pod opt-in likely will require a new middle option on the node, but I have a hard time seeing why that's part of #2400 ... it seems like work to do as part of this KEP.

@dchen1107's point about swap implications not being well-contained is a good one. Once we open the can of worms of per-pod opinions around swap/noswap, aren't there a bunch of scheduling / requirement levels that are possible? e.g.:

  1. "I prefer to run on a node with swap enabled"
  2. "I require running on a node with swap enabled"
  3. "I prefer to run without swap enabled for myself"
  4. "I prefer to run on a node that doesn't enable swap for any pod"
  5. "I require running without swap enabled for myself; don't schedule me if that cannot be honored"
  6. "I require running on a node that doesn't enable swap for any pod; don't schedule me if that cannot be honored"

The trickiest parts of per-pod / opt-in swap would seem to be around surfacing node state and scheduling for those cross-pod bits. Saying if you want opt-in swap you need to configure your nodes to OptInLimitedSwap or some new node-level setting seems ok to me, so that pods that don't have an opinion get their node default behavior, which is no swap. I see this KEP as unlocking the possibility of swap for clusters / admins who today are stuck with NoSwap because they don't know enough about their workloads. The clusters / admins who are ok with enabling LimitedSwap today are not really in view.


Hoisting my reply from #5360 (comment) as well for visibility:

@liggitt given the need to disable swap per Pod (even when node has swap enabled), and the proposal of API in this KEP, do you think it is OK to proceed with #2400 as-is or we need to change the default in #2400 to simplify the "disable" behavior implementation?

The node default is NoSwap, that is a reasonable default.

#2400 allows admins to configure nodes to automatically apply swap in limited ways to as many pods as they can, and that seems like an ~ok lever to me, especially if we're clear that the current LimitedSwap node setting will exclude pods that explicitly opt out of swap if that becomes possible in the future. If admins try to use the LimitedSwap lever to get "free memory" and their workloads become unhappy, they might have to stop using that setting and wait for pod-level opt-in to make use of swap.

Requiring a double opt-in (node-level enablement AND modifying every single pod to opt-in) is just going to lead to automated opt-in "solutions" at the pod level that do things like mutate pods as they are created in order to opt them into swap.

I think graduating #2400 as-is is ~ok, not everyone will be able to make use of it yet, and we have work to do in this KEP to give the pod-level controls, scheduling visibility, and probably a third node-level option for OptInLimitedSwap or something.

@thockin
Member

thockin commented Jun 18, 2025

My primary concern is that the norm is "safe". My fear is that we do this early work to enable swap at all and then we never follow up with "safe" (or we take 3 years at which point everyone will have started using "unsafe").

I accept the "admins are making the decision" argument, but we're giving them only 2 options: "bad" and "worse".

It's no secret that I am pretty anti-swap for most use-cases, so I feel the default should be NOT to use swap unless I specifically say otherwise on a workload level.

We're going to end up with something like:

|   | per-node | per-pod | result |
| --- | --- | --- | --- |
| 1 | (not specified) | (not specified) | no swap |
| 2 | enabled | (not specified) | enabled |
| 3 | enabled | disabled | disabled (but still impacted) |
| 4 | opt-in | (not specified) | disabled (but still impacted) |
| 5 | opt-in | enabled | enabled |

Or more, if we find ways to modulate it further. That's a lot for people to consume.

Are we (sig-node) COMMITTED to having "opt-in" (aka "let the pod decide") as the next per-node kubelet config for swap? Is there a KEP for that? Why not?

How long will we be sitting in the "unsafe by default" state (row 2)?

Do we have a plan that gets non-swapped pods out of "being impacted by swap pods" mode? Or is it always going to be something that pods need to be aware of?

@iholder101
Contributor Author

Thanks everyone for this discussion.

As @dchen1107, @liggitt and others suggested, I do think we should GA 2400 as-is and continue the API discussion in parallel.
Here are a few small replies to try and support this argument.


@SergeyKanzhelev

I do not know how we can limit who will be using swap. I would expect many users will enable it as a "cheap" expansion of memory on the node.

I honestly don't think this is a strong argument.

AFAICT we're here to give admins options which they can choose to use or not. We turn swap off by default, and we'll provide thorough documentation about the pros, cons, and risks, letting admins decide what's best for their environment. As @liggitt said, some admins wouldn't want to take the risk without the extra safety mechanisms that will come in the future, and that's completely fine, but I don't understand why this should block other admins who do want to use it as-is in their production setups.

Because of this feedback we created this KEP (#5359) and the API discussion led to this question about the default Pod behavior.

I know there has been some misunderstanding, but as I and others understood it, the original idea was to create a KEP to discuss APIs while we continue with the 2400 GA. I still think we shouldn't tie the two together.

To be honest, most of the feedback I've received, by far, is that people from various companies have been using this in production for a long while, are very happy, and are desperate for a GA.

I think enabling double opt-in may allow an even better experience, where swap is enabled more broadly as admins rely on pod owners to check that a Pod is OK to swap from a reliability and security perspective. And automated opt-in solutions will likely always respect an explicit disable.

I tend to disagree that immediate broad adoption is very important.
Quite the contrary - I see this as a feature, not a bug. This way, only whoever's willing to take some risks is going to use it and provide feedback, from which we can learn and use to design APIs more carefully.

In addition, demanding a double opt-in plus a webhook/MPA is both bad UX and would limit the API design moving forward. From my POV, I'm not sure at all that the right API is a simple "disable" switch.


@thockin

I accept the "admins are making the decision" argument, but we're giving them only 2 options: "bad" and "worse".

I'm not sure I understand what these two options are.
I'd phrase this differently: we give them two options - "do" or "don't" ("don't" by default).

And, as time goes by, we'll do our best to provide more mechanisms that will ease the adoption of this.

It's no secret that I am pretty anti-swap for most use-cases, so I feel the default should be NOT to use swap unless I specifically say otherwise on a workload level.

As an admin, you'd have swap off by default and probably wouldn't opt-in to it.
I'm not sure I understand why a double opt-in helps here.

We're going to end up with something like:
...
Or more if we find way to modulate it further. That's a lot for people to consume.

I agree that the suggested API here is far from being perfect.
For this reason, I don't think we should rush the "enable/disable" field; I feel we can do better, mainly by letting people adopt this in production and provide feedback.

Are we (sig-node) COMMITTED to having "opt-in" (aka "let the pod decide") as the next per-node kubelet config for swap?

As I see it, we should introduce more swapBehaviors in the future. I'm not sure exactly what they will look like, but I don't want to commit to any pod always being able to dictate turning off swap despite the admin's decision. I believe the best path forward is probably for workloads to hint to the admin about their desires, but for the admin to make the final call. After all, swap is first and foremost a node-level configuration (that most workloads don't care about at all).

My fear is that we do this early work to enable swap at all and then we never follow up with "safe"

FWIW, I commit to investing a huge amount of effort in the swap-related follow-up KEPs, leading them and doing most of the work needed. I've been working on this feature for about 2 years at this point, and I am willing to continue doing so.

Contributor

@soltysh soltysh left a comment

/hold
If you're still planning to merge this for 1.34, please make sure to fill in the required PRR documents and sign me up for PRR review.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 18, 2025
@tallclair
Member

I tried to capture my concerns with LimitedSwap in https://docs.google.com/document/d/16Z8CfYg3JKOzV2uC0MwC5dbthAjBhJjZkAUOu4Jhtyw/edit?tab=t.0#heading=h.s2siryc5tp3l.

I agree with @thockin's comment above that we're probably going to end up with a new node config option to enable per-pod opt-in. If we aren't fully happy with the swap calculation, and intend to add a new node-config option anyway, I'm worried that we're rushing this LimitedSwap option to GA that we're not really happy with and don't expect people to use long-term.

@iholder101
Contributor Author

iholder101 commented Jun 18, 2025

I tried to capture my concerns with LimitedSwap in docs.google.com/document/d/16Z8CfYg3JKOzV2uC0MwC5dbthAjBhJjZkAUOu4Jhtyw/edit?tab=t.0#heading=h.s2siryc5tp3l.

I agree with @thockin's comment above that we're probably going to end up with a new node config option to enable per-pod opt-in. If we aren't fully happy with the swap calculation, and intend to add a new node-config option anyway, I'm worried that we're rushing this LimitedSwap option to GA that we're not really happy with and don't expect people to use long-term.

Thank you @tallclair.

As noted in the document, the fact that the swap feature is not complete is known and is stated very clearly in the KEP's scope and intentions.

What I expect this document to include is: why do these problems mean that swap cannot be GAed now and enhanced later? Why block the (many!) admins who are using it in production and are desperate for it to GA?

And, what's the justification to go against what the community already merged and is encoded clearly in the KEP?

Let me mention again what is written in the KEP:

This KEP aims to introduce basic swap enablement and leave further extensions to follow-up KEPs. This way Kubernetes users / vendors would be able to use swap in a basic manner quickly while extensions would be brought to discussion in dedicated KEPs that would progress in the meantime.

For example, to achieve this goal, this KEP does not introduce any APIs that allow customizing how the feature behaves

@haircommander
Contributor

I agree with @dchen1107 and @iholder101; I'm anxious about combining the two features and rushing a pod API, which has more edge cases and particularities than I want to introduce into a beta KEP. I think swap as-is is useful, even if not perfect. GA'ing 2400 and continuing to iterate in 5359 is my preference.

@mrunalp
Contributor

mrunalp commented Jun 18, 2025

I am inclined towards @dchen1107's position on this. We can consider opening a follow-on KEP to this one for pod opt-out, alongside fleshing out the pod-level swap KEP as @thockin mentioned in his comment above.

@kannon92
Contributor

I think we should mention that there is so much inertia behind swap being off for Kubernetes clusters.

To "accidental" enable swap, one needs to create a node, provision swap, turn off kubelet gate fail-swap-on=false, enable kubelet config for LimitedSwap, and then pods that are burstable will be swap.

AFAIK most vendors/organizations have probably defaulted to swap not being enabled on a node. Admins that want to enable swap have a few steps to complete before swap is on in their clusters.

@yujuhong
Contributor

+1 for having more time to flesh out the pod-level swap design and not rush it with the node-level swap feature.

As for whether the current node-level swap is a useful standalone feature that's worth graduating to GA: AFAIK, there have been users running the feature in production since it was alpha. Maybe others could chime in more on this. For reliability and observability concerns, there have been a few rounds of discussions, and I believe @ajaysundark also looked into this in depth and did not raise extra concerns (@ajaysundark correct me if I'm wrong). With the pod-level swap design still in its early phase, I'm inclined to agree that we should let the node-level swap proceed to GA to benefit users immediately.

@ajaysundark

ajaysundark commented Jun 18, 2025

Given the many aspects we discussed around the API, I think folding a minimal API KEP into the existing design is a decision we may be rushing into.

default should be NOT to use swap unless I specifically say otherwise on a workload level.

@thockin I think most folks agree that 'disabled' is the ideal default swap-choice per pod.

I added it elsewhere, but want to capture it here for data points as well. Many production-grade applications commonly deployed in k8s want to dictate swap behavior, with a preference to keep it disabled.
Some examples: Elasticsearch, Kafka, MySQL, Redis, Postgres, Cassandra.

Kubernetes' default choices should align with these interests.

  1. NoSwap is the current default on the node.
  2. When the pod-level API comes to fruition, it could preferably be disabled by default too.

By making swap 'disabled' by default, we accept the impact on existing swap-enabled pods, which would need a one-line YAML change to use swap again.

I think clear scheduling isolation to some extent will reduce this migration concern. We could introduce minimal improvements like capturing the swap-mode as a label, so exclusivity can be achieved in some form (node selection or affinity).

aren't there a bunch of scheduling / requirement levels that are possible?

@liggitt I feel like capability awareness for swap (comment) will help with different long-term qualitative scheduling needs to achieve different affinity levels.

@SergeyKanzhelev
Member

I still believe that GA-ing #2400 would be a rash decision.

The ask to understand the long-term swap story and the transition from LimitedSwap to swap-aware scheduling was made long ago, as was the ask to allow customers to opt out per-Pod. We worked hard on last-minute reviews and, after the back-and-forth, came up with some minimal changes that might make #2400 better and safer.

Now, based on limited customer experience, we want to GA it, justifying that by the users who use it in production, and suggesting that other users' feedback will be managed by proper warnings in release notes and by users setting taints and tolerations. We also suggest that we will come up with a better story later, without an articulated path forward. As pointed out above, the path forward may involve deprecating LimitedSwap.

It feels to me that LimitedSwap is seen as a temporary feature that is being pushed in order to ship faster and collect more user feedback. If this is the positioning of the feature, then it is no better and no worse than being in beta.

@thockin
Member

thockin commented Jun 19, 2025

I'm not sure I understand why a double opt-in helps here.

Opting in per-node is very coarse, and it requires applications to be aware of swap even if they don't want swap. It's a silent hazard that the cluster admin is putting in place for apps.

I don't want to commit to the fact that any pod is able to always dictate to turn off swap despite the admin's decision

This is the "bad" vs. "worse". The only options are "no swap at all" (bad) or "swap for everyone, unless they take action to avoid it" (worse).

There's always going to be tension between what apps want and what admins want. Most important, in my opinion, is not to break things that already work.

When I see things like "as time goes by, we'll do our best to provide more mechanisms" what I really hear is "we think this is good enough and maybe we'll revisit at some unknown future date". I don't mean that anyone is intentionally lying, I mean that we don't seem to be on the same page WRT risk.

I don't think I was part of the review for the main KEP, but scanning it now I see that taints were considered and then made "advisory". :(

I get that there are people using it as-is. I also know that once it goes GA there will be MORE people using it as-is. You can document the risks in big bold letters, but it will still be an attractive nuisance. People will get hurt, and it will be our fault (partly, anyway). I haven't heard ANY specific plans for mitigating the risks, which makes me very uncomfortable.

All that said, the preponderance of SIG Node leaders' and stakeholders' opinions seem to say that the value outweighs the risk, so I am going to back down. What I really want is a commitment that the end-goal is a) safe-by-default for pods which do not want swap; and b) coming within a couple releases. If we are staring at 1.38 and don't have this pinned down, I will be a very grumpy person.

I'd like to see the public docs for this include a STRONG recommendation that swap-enabled nodes be tainted to prevent accidentally swapping pods that are not ready for it.

@dchen1107
Member

Thank you all for this incredibly comprehensive and passionate discussion over the past 24 hours. I've read through every comment carefully from both sides. As the senior tech lead, I need to make a call on the path forward that balances immediate user value, long-term API coherence, and critical safety considerations.

My decision is to proceed with graduating NodeSwap (KEP-2400) to GA in Kubernetes 1.34 as-is, without incorporating the proposed per-pod enable/disable API from KEP-5359 into KEP-2400 at this time. This decision for unblocking is contingent upon successful completion of all GA requirements, including robust testing, comprehensive documentation, and no new blockers emerging during the implementation phase.

To explain how I've arrived at this conclusion, I asked myself the following four key questions, integrating the points raised in this KEP from the community:

Question 1: Is NodeSwap (KEP-2400) as-is useful and ready for GA, given its current scope?
Conclusion: Yes.

  • Immediate Value for Informed Admins: As @iholder101, @yujuhong, and I have noted, there are existing users in production, running large stateful workloads and CI/CD pipelines, who are eagerly awaiting this feature to GA. It significantly improves node stability and resource utilization for burstable workloads by providing a controlled way to leverage swap.

  • Urgency for AI/ML Workloads: This immediate value is becoming even more critical with the increasing prevalence of memory-intensive AI/ML workloads. These jobs frequently push RAM limits, and controlled swap enablement offers crucial stability, preventing OOM kills and allowing for more efficient utilization of existing hardware for larger models, even with a performance trade-off.

  • "Per Pod Per Node" Usage Pattern: Furthermore, our observations suggest that current swap usage in production often involves a few large, memory-hungry pods dominating swap use on a given node. This 'per pod per node' pattern makes node-level swap, managed with explicit taints, a practical and immediately useful solution for cluster administrators addressing these specific, high-demand workloads.

  • Explicit Opt-In and Controlled Enablement at the Node Level: Enabling NodeSwap is not an "accidental" action. As @kannon92 indicates, it requires explicit steps by a cluster administrator (provisioning swap, disabling --fail-swap-on, and setting MemorySwap.SwapBehavior to LimitedSwap in the kubelet config). Furthermore, KEP-2400 includes a crucial recommendation to taint nodes with swap available, as stated in the "Best Practices" section. This means that pods must explicitly tolerate this taint to be scheduled on such nodes. This ensures that only admins who understand their specific environment and workloads, and whose pods are designed to run on swap-enabled nodes, will utilize this feature. This is how GKE users have used the feature since the Alpha phase. For these informed users, KEP-2400 delivers immediate, tangible value. As @liggitt acknowledged, this KEP "unlocks the possibility of swap for clusters / admins who today are stuck with NoSwap because they don't know enough about their workloads."

Question 2: Does graduating KEP-2400 as-is hinder or pave the way for future comprehensive per-pod swap controls and a coherent long-term API?
Conclusion: It paves the way, and does not hinder future development.

  • No Hindrance to Future Work: A critical point, articulated by @liggitt, is that graduating KEP-2400 "does not make later per-pod controls harder." This directly addresses the concern that moving forward now might lead to an inconsistent or limited long-term API.

  • Focus on Foundational Layer: By stabilizing the node-level capability, we establish a robust base. The complexities of per-pod swap opinions (e.g., "prefer/require," "self/node-wide" swap) outlined in the discussion demand dedicated, comprehensive design and implementation efforts that are best handled in separate KEPs.

  • Avoid Double Opt-In Anti-Pattern: As @liggitt pointed out, forcing a "double opt-in" (node-level enablement AND per-pod opt-in in KEP-2400) would likely lead to cumbersome automated solutions (like webhooks) that create a poor user experience. Our current path avoids this.

Question 3: Does the proposed minimal per-pod API (from KEP-5359) effectively address the core "safety by default" and "noisy neighbor" concerns, and is it the right addition to KEP-2400 now?
Conclusion: No, it does not fully address core concerns and is not the right addition to KEP-2400 now, but a more robust per-pod API is viable as a follow-up.

  • Limited "Safety" and "Noisy Neighbor" Mitigation with Simple Flag: As I and @yujuhong initially argued, a simple per-pod "enable/disable" flag (as previously discussed for KEP-5359 integration) provides limited, potentially "false sense of security." If a pod is scheduled on a swap-enabled node (which it must opt into via toleration), even if its own swap is disabled, it can still suffer performance degradation from other "noisy neighbor" pods on the same node heavily utilizing swap, leading to shared I/O contention. This core problem is fundamentally a scheduling concern, not a per-pod disablement issue.

  • Risk of Suboptimal and Limiting API: Trying to fold even a "minimal" per-pod API into KEP-2400 now risks delaying its GA and could pre-empt a more holistic, extensible design for future per-pod swap management that truly integrates with the scheduler and considers the various opinion levels. @haircommander and @mrunalp also expressed anxiety about rushing a pod API with complex edge cases into a Beta KEP.

  • A Promising Alternative: Per-Pod SwapMaxLimit: This is an idea I've championed multiple times, including during this week's SIG Node meeting, and I've now documented my thoughts on it here. Instead of a simple enable/disable flag, I envision a more robust per-pod API, such as introducing a SwapMaxLimit field in PodSpec (e.g., in bytes). This would default to 0 (meaning no swap usage for the pod).

    1. A pod with SwapMaxLimit = 0 and no taint toleration would only schedule on non-swap nodes (safe by default).
    2. A pod with SwapMaxLimit = 0 but with a NodeSwap taint toleration could schedule on a swap-enabled node, and the Kubelet would enforce its 0 swap usage (explicit disablement).
    3. A pod with a non-zero SwapMaxLimit and a NodeSwap taint toleration would schedule on a swap-enabled node, and its swap usage would be limited to that value (explicit limit).

This design offers fine-grained control, leverages existing taint/toleration mechanisms for node selection, and provides a very smooth and safe migration path for existing workloads. Of course this is just some brainstorming idea in my mind, there are more details required to be ironed out.
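
Sketching that brainstorm concretely (SwapMaxLimit is hypothetical and exists nowhere today; the only grounded detail is that cgroup v2 exposes a per-cgroup memory.swap.max knob the kubelet could write):

// Hypothetical SwapMaxLimit field, per the idea above; not a merged API.
type PodSpec struct {
	// SwapMaxLimit caps the pod's swap usage in bytes.
	// 0 (the default) means the pod must not use swap at all.
	SwapMaxLimit int64 `json:"swapMaxLimit,omitempty"`
	// ... other PodSpec fields elided ...
}

// Kubelet-side enforcement could then reduce to writing the pod cgroup's
// memory.swap.max: 0 for scenarios 1 and 2 above, the configured value for
// scenario 3, with node placement still gated by the NodeSwap taint.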

Question 4: What is our concrete commitment and plan for achieving a truly "safe by default" and "opt-in" swap experience in Kubernetes, addressing the impact on non-swapped pods?
Conclusion: Our commitment is firm, and the plan is a clear, prioritized roadmap immediately following KEP-2400 GA, built upon the existing node-level opt-in model and a proposed per-pod API, with SwapMaxLimit as one example. Swap-aware scheduling remains crucial for cluster-wide optimization.

  • "Safe by Default" is Achieved via Node Taints Today: I believe this directly addresses @thockin's "unsafe by default" concern. KEP-2400's strong recommendation to taint nodes with swap enabled means that pods are not scheduled onto these nodes by default. Instead, pods must explicitly add a toleration for that taint to be scheduled on a swap-enabled node. This establishes an opt-in mechanism at the node scheduling level right from the start, providing a foundational safety layer for pods not explicitly configured for swap.

  • Per-Pod SwapMaxLimit or something equivalent as the Next Step for Granular Control: My proposed introduction of a SwapMaxLimit in PodSpec will be the immediate follow-up to NodeSwap GA. This API will provide the precise per-pod control needed, building on the existing node-level opt-in. Its default of 0 makes it inherently safe for unconfigured pods, enabling a smoother migration and broader adoption.

  • Swap-Aware Scheduling for Cluster-Wide Optimization and Advanced Use Cases: Even with SwapMaxLimit and taints, a more advanced Swap-Aware Scheduling remains critical (but outside SIG Node). Its role shifts from basic safety (now largely covered by taints and SwapMaxLimit) to:

    • Quantitative Resource Allocation: Enabling the scheduler to understand and manage the total available swap on nodes, preventing over-subscription of swap resources when multiple pods with non-zero SwapMaxLimit are scheduled.
    • Qualitative Resource Matching: Facilitating intelligent placement based on swap characteristics (e.g., faster vs. slower swap devices) or ensuring more nuanced isolation for highly latency-sensitive workloads from nodes experiencing high overall swap I/O from other pods.
      This ensures optimal cluster utilization and performance for diverse workloads.
  • Commitment to Follow-Through: As @iholder101 explicitly stated, there is a commitment to invest significant effort in driving these follow-up KEPs. I echo that commitment and will personally champion the SwapMaxLimit API KEP immediately, followed by the Swap-Aware Scheduling KEP but outside SIG Node.

cc/ @mrunalp @haircommander @SergeyKanzhelev @yujuhong @ajaysundark @iholder101 @liggitt @thockin @derekwaynecarr

@haircommander
Contributor

Excellent summary @dchen1107, I wanna echo your gratitude for the thorough conversation as well! As a procedural FYI: KEP-2400 is proposed to go GA here whenever we feel ready to move forward.

@iholder101 iholder101 mentioned this pull request Jun 19, 2025