[KEP-5359] Pod-Level Swap Control #5360
Conversation
* Introduce a new `swapPolicy` field in PodSpec to allow users to configure swap for an individual pod.
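For concreteness, here is a minimal sketch of what such a field might look like, using the names from this thread; the exact shape is a matter for API review, and everything below is illustrative:

```go
package core

// SwapPolicyMode is the proposed enum type for pod-level swap policy.
type SwapPolicyMode string

// PodSpec sketch showing only the proposed addition; all existing
// PodSpec fields are elided.
type PodSpec struct {
	// SwapPolicy defines the desired swap memory policy for this pod.
	// +optional
	SwapPolicy SwapPolicyMode `json:"swapPolicy,omitempty"`
}
```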
Per-pod or per-container?
Thank you! That's a good question.
Perhaps we can start with a pod-level knob, then add a container-level one if we need to?
Do you have an opinion on this @ajaysundark?
I'd rather NOT do both, but so far the trend seems to be that whichever we start with, we end up wanting the other, and then we have to define the intersection of them. So we should think about it up front.
I believe the swap "policy" at pod-level should be sufficient. The swap decisions are more workload-centric; I think applications should be considered a 'single atomic unit' for swap:
- If the main application container is latency-sensitive, its critical sidecars (maybe a service mesh proxy) are also latency-sensitive because they are in the request path. They should all be governed by the same swap policy. It is hard to envision a scenario where this choice differs between tightly-coupled processes within the same Pod.
- A pod-level policy provides a clear attribute for future scheduling enhancements like node capabilities. For example, workloads specifying swapPolicy as 'Avoid' or 'Required' would allow the scheduler to place them onto nodes (NoSwap or LimitedSwap) that can satisfy that capability; see the sketch after this list.
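A rough sketch of that future capability match, purely illustrative ('Avoid' and 'Required' are hypothetical values, not part of this KEP):

```go
package scheduling

// nodeSatisfiesSwapPolicy sketches how a scheduler could match a pod's
// hypothetical swap policy against a node's configured swap behavior.
func nodeSatisfiesSwapPolicy(podPolicy, nodeSwapBehavior string) bool {
	switch podPolicy {
	case "Required":
		// Only place the pod on nodes that actually provide swap.
		return nodeSwapBehavior == "LimitedSwap"
	case "Avoid":
		// Only place the pod on nodes with swap disabled.
		return nodeSwapBehavior == "NoSwap"
	default:
		// "NoPreference" (or unset): any node is acceptable.
		return true
	}
}
```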
I agree, Pod-level sounds like quite enough.
If the main application container is latency-sensitive, its critical sidecars (maybe a service mesh proxy) are also latency-sensitive
Ehh, that's not necessarily true. People use sidecars for lots of things, not all of them are critical or LS. Disk cleanup, log-sending, etc are all non-critical.
I'm fine to make it per-pod, but I'll bet $5 that a really good use-case for per-container control emerges within 12 months of GA :)
So WHEN that happens - what does per-container control mean when per-pod control is already defined?
You're right, not all sidecars are latency-sensitive. The goal for this KEP was centered on finding the simplest possible model that solves the immediate, well-understood use cases for protecting a latency-sensitive workload.
This KEP provides workload autonomy over swap: a necessary 'opt-out' policy for application owners when they require it. 'swapPolicy' could also provide the scheduling attribute needed for matching swap capability in the future, which is a pod-level decision.
If a future KEP introduced per-container swap control, it would likely need to be a resource-model design (as the container is already on a swap-enabled node).
It would probably look like:
- The pod-level `swapPolicy` selects the environment (e.g., a new mode `AssignedLimits`).
- A new, per-container field (e.g., `resources.limits.swap`) could then be introduced to set a specific ceiling on swap usage within the environment established by the pod.
When requests for container level resource control emerge, we need to evaluate the use-cases further.
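To make the shape of that hypothetical evolution concrete, here is a sketch; the `AssignedLimits` mode and the per-container swap field are illustrative names from the comment above, not an agreed design:

```go
package core

// SwapPolicyMode is the pod-level knob proposed in this KEP.
type SwapPolicyMode string

// SwapPolicyModeAssignedLimits is a hypothetical future mode in which
// containers may carry explicit swap ceilings.
const SwapPolicyModeAssignedLimits SwapPolicyMode = "AssignedLimits"

// ContainerSwapResources sketches where a per-container ceiling could
// live, conceptually mirroring resources.limits.swap in the manifest.
type ContainerSwapResources struct {
	// SwapLimitBytes is the maximum swap usage for this container.
	SwapLimitBytes int64 `json:"swap,omitempty"`
}
```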
```go
// SwapPolicy defines the desired swap memory policy for this pod. This field
// is immutable after the pod has been created.
//
// If unspecified, the default value is "NoPreference", which means the pod's swap behavior is determined by the node's swap
```
Do we want a literal "NoPreference" value or just leave it unset?
Do you mean using something like the following?
// +kubebuilder:validation:Enum=NoPreference;Disabled
// +kubebuilder:default=NoPreference
I think it's not a bad idea!
I'm asking, not telling. We often hear people complain about the sheer volume of YAML we produce, so leaving it `""` with `omitempty` has some attractiveness. OTOH, we also generally prefer to be explicit, so setting a default value has some attractiveness.
The real value of an explicit default is that if we version the API we can change the default. I'm not sure that applies here.
We can leave this open for API review, but let's PLEASE ask the question at that time. In the meantime we can get some opinions.
And for the record, we don't use kubebuilder tags; we just say `// +default="NoPreference"` or `// +default=ref(SwapPolicyNoPreference)` (but I think I have come around to the simpler form).
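Spelled out on the field, the simpler form might look like the following (illustrative only; whether an explicit default is used at all is left for API review):

```go
package core

type SwapPolicyMode string

const SwapPolicyNoPreference SwapPolicyMode = "NoPreference"

type PodSpec struct {
	// SwapPolicy defines the desired swap memory policy for this pod.
	// The ref() form would instead read:
	//   +default=ref(SwapPolicyNoPreference)
	// +optional
	// +default="NoPreference"
	SwapPolicy SwapPolicyMode `json:"swapPolicy,omitempty"`
}
```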
I like the idea of an enum but I think the default could change over time.
For some of our OpenShift APIs, we push defaulting into the controller rather than the API so it's easier to control over time.
For alpha, I'd prefer "disabled" to be on the safer side. I don't know if we are in a place to encourage swap everywhere in the Kubernetes ecosystem, so I would default to disabling swap; users that want swap can enable it in their podSpecs.
For some of our openshift APIs, we push defaulting into the controller rather than the API so its easier to control over time.
Kube treats defaults as intrinsic to the GVK. Pushing it into the controller is not "defaulting" it's "setting" :)
It all boils down to the question of a default behavior. If we want to support opt-in, #2400 should be extended to enable a "preferSwap" or "oktoswap" option.
If we want to support opt-in, #2400 should be extended to enable "preferSwap" or "oktoswap" option.
I thought that the current state of #2400 was predicated on swap being a thing that would start as opaque to pods, would have to be opted into at the node level, and only clusters / nodes that know enough about their workloads would choose to enable that node-wide setting. If that is still the case, I don't see a reason to take away the ability for cluster admins to opt nodes into swap for all pods just via node config. That capability seems ready.
If there is evidence that there are clusters who cannot turn on node-wide swap by default via #2400 because they don't know enough about their workloads, then adding a node option and Pod API surface to opt into swap could be reasonable, but that seems feasible to evolve node config and pod API as part of this KEP.
I thought that the current state of #2400 was predicated on swap being a thing that would start as opaque to pods, would have to be opted into at the node level, and only clusters / nodes that know enough about their workloads would choose to enable that node-wide setting.
I do not know how we can limit who will be using swap. I would expect many users will enable it as a "cheap" expansion of memory on the node.
We already have feedback that in #2400 customers want:
- disable swap per-pod
- enable swap on guaranteed pod (currently we have a default behavior to exclude guaranteed pods from swap)
Because of this feedback we created this KEP (#5359) and the API discussion led to this question about the default Pod behavior.
@liggitt given the need to disable swap per Pod (even when node has swap enabled), and the proposal of API in this KEP, do you think it is OK to proceed with #2400 as-is or we need to change the default in #2400 to simplify the "disable" behavior implementation?
@liggitt given the need to disable swap per Pod (even when node has swap enabled), and the proposal of API in this KEP, do you think it is OK to proceed with #2400 as-is or we need to change the default in #2400 to simplify the "disable" behavior implementation?
The node default is NoSwap, that is a reasonable default.
#2400 allows admins to configure nodes to automatically apply swap in limited ways to as many pods as they can, and that seems like an ~ok lever to me, especially if we're clear that the current `LimitedSwap` node setting will exclude pods that explicitly opt out of swap if that becomes possible in the future. If admins try to use the `LimitedSwap` lever to get "free memory" and their workloads become unhappy, they might have to stop using that setting and wait for pod-level opt-in to make use of swap.
Requiring a double opt-in (node-level enablement AND modifying every single pod to opt-in) is just going to lead to automated opt-in "solutions" at the pod level that do things like mutate pods as they are created in order to opt them into swap.
I think graduating #2400 as-is is ~ok, not everyone will be able to make use of it yet, and we have work to do in this KEP to give the pod-level controls, scheduling visibility, and probably a third node-level option for OptInLimitedSwap or something.
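For reference, this is the existing kubelet knob such an option would extend; `OptInLimitedSwap` is only a placeholder name floated in this thread, not an implemented value:

```go
package kubeletconfig

// MemorySwapConfiguration mirrors the existing kubelet configuration
// knob from KEP-2400.
type MemorySwapConfiguration struct {
	// SwapBehavior configures swap for container workloads. Existing
	// values are "NoSwap" (the default) and "LimitedSwap"; a hypothetical
	// "OptInLimitedSwap" would apply swap only to pods that explicitly
	// opt in.
	SwapBehavior string `json:"swapBehavior,omitempty"`
}
```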
Requiring a double opt-in (node-level enablement AND modifying every single pod to opt-in) is just going to lead to automated opt-in "solutions" at the pod level that do things like mutate pods as they are created in order to opt them into swap.
I think enabling double opt-in may allow an even better experience, where swap is enabled more broadly, as admins will rely on pod owners to check that a Pod is OK to swap from a reliability and security perspective. And automated opt-in solutions will likely always respect an explicit disable. So this argument can go both ways.
#2400 allows admins to configure nodes to automatically apply swap in limited ways to as many pods as they can, and that seems like an ~ok lever to me, especially if we're clear that the current `LimitedSwap` node setting will exclude pods that explicitly opt out of swap if that becomes possible in the future. If admins try to use the `LimitedSwap` lever to get "free memory" and their workloads become unhappy, they might have to stop using that setting and wait for pod-level opt-in to make use of swap.
The problem is that this now requires interaction between cluster admins and pod owners, and planning on where each pod needs to run based on swap preferences. So we are creating another tension point and uncertain behavior for Pods.
and we have work to do in this KEP to give the pod-level controls, scheduling visibility, and probably a third node-level option for OptInLimitedSwap or something.
One of the asks for #2400, and this is a typical bar for many KEPs, is to explain the API evolution. Since the only piece of API evolution is this KEP, I do not like the idea of graduating #2400 as-is. I do not think I am being inconsistent here. If we have a path forward explained, it will be easier to say that in a couple of releases people will have all the levers we are thinking of providing to them. With just an idea that we may have this API in the future, I do not feel comfortable saying this.
## Summary
As you fill out sections, can you please remove the comments?
In the first phase, we'll focus on the ability to disable swap for all its containers irrespective of underlying node
swap behavior.

### Goals
I posted this comment: #5387 (comment) It may speed up the introduction of per-Pod enablement and will shape this KEP better.
```go
const (
	// SwapPolicyModeDisabled explicitly disables swap for the pod.
	SwapPolicyModeDisabled SwapPolicyMode = "Disabled"
```
++ @liggitt had suggestions to rename `Disabled` to `NoSwap` to align with Node config.
In any case, I'll keep this here for reference.
One potential concern with the `Disabled` mode is that it forces the API to allow pod owners to disable swap, regardless
of the `swapBehavior` setting configured by the admin at the kubelet level.
Enforcing field restrictions is still possible with higher-level constructs, e.g. a validation webhook or policy agents such as OPA or Kyverno.
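As a minimal illustration of that kind of out-of-tree enforcement (local stand-in types, not a real admission plugin or webhook API):

```go
package admission

import "fmt"

// Pod is a local stand-in for the relevant slice of the Pod API.
type Pod struct {
	SwapPolicy string
}

// validateSwapPolicy rejects pods whose swap policy is not in the
// admin-allowed set; an empty (unset) policy is always accepted.
func validateSwapPolicy(pod Pod, allowed map[string]bool) error {
	if pod.SwapPolicy == "" || allowed[pod.SwapPolicy] {
		return nil
	}
	return fmt.Errorf("swapPolicy %q is not permitted in this cluster", pod.SwapPolicy)
}
```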
```go
	// SwapPolicyModeDisabled explicitly disables swap for the pod.
	SwapPolicyModeDisabled SwapPolicyMode = "Disabled"
	// SwapPolicyModeNoPreference states the pod should follow node-level swap configuration.
	SwapPolicyModeNoPreference SwapPolicyMode = "NoPreference"
```
If there is a "No preference" mode, should we also have a status field which indicates the result (and leaves room for likely extensions)?
I've had a hard time following how much detail about swap we want surfaced via the API ... is it supposed to be a hidden implementation detail or not?
Also, it seems ... weird for a pod to indicate "I don't care whether I use swap or not" and then want to know whether it used swap or not...
I've had a hard time following how much detail about swap we want surfaced via the API
As little as we can get away with, but it seems like something that people might want to know, for example, when their app crashes - was it running with swap?
* Maintain backward compatibility: existing pods that run on swap-enabled nodes should behave as they do by default.
* Provide a mechanism to alleviate concerns regarding the "all-or-nothing" nature of node-level swap enablement, potentially
unblocking KEP-2400's path to GA.
* Open the door to future enhancements, both in the kubelet-level swap behaviors and the new pod-level swap control API
So this covers the transition from 2400 to the ability to disable swap. Why are we not covering how we will transition to specifying the limit explicitly? We need at least an idea of how pods with limits specified will work with a kubelet using LimitedSwap. Or, if they will be completely incompatible, then we need to explicitly say that LimitedSwap is a limited-time feature and will be fully replaced with something else.
I think pod-level swap can be another swap type for the pod. The Pod API always wins: if the kubelet is LimitedSwap then we also add swap if the pod said "NoPreference", and if swap is disabled in the kubelet and the pod says "NoPreference" then there is no swap.
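That precedence rule, written out as a sketch (names are the ones used in this thread; the real kubelet logic may differ):

```go
package kubelet

// podGetsSwap sketches the precedence described above: an explicit
// pod-level "Disabled" always wins, while "NoPreference" (or unset)
// defers to the node's configured swapBehavior.
func podGetsSwap(podSwapPolicy, nodeSwapBehavior string) bool {
	if podSwapPolicy == "Disabled" {
		return false
	}
	return nodeSwapBehavior == "LimitedSwap"
}
```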
Also: if the node doesn't have swap enabled, then pods that request swap via the Pod API fail at admission (maybe with the NodeCapabilities feature to help with scheduling hints).
I suggested something here: https://docs.google.com/document/d/19muCmfBndxfr-yq1EmeI54yq-upYmZsWLsLvcm1i6J4/edit?disco=AAABljwWs3w. In general it feels like we may have a path forward.
We discussed this topic today at SIG Node meeting. I'd like to reiterate my position and rationale raised at the meeting earlier today, building on the points we've discussed. I strongly prefer to proceed with graduating Node Swap (KEP-2400) to GA in 1.34 as planned, without incorporating per-pod enable/disable APIs proposed in this KEP at this stage. I believe this phased approach offers the most robust and beneficial path forward for Kubernetes users:
Graduating KEP-2400 to GA in 1.34 provides immediate and tangible benefits. It enables a fundamental operating system feature (swap) for Kubernetes nodes, which is crucial for improving node stability, preventing abrupt OOM kills for burstable workloads, and enhancing overall resource utilization. Many existing applications expect swap, and this allows them to run more reliably on Kubernetes. Getting this core feature to GA unblocks a significant capability for our users and allows us to focus on future enhancements.
While the intent behind per-pod controls is valid, a simple "enable/disable" API has significant limitations and could provide a "false sense of security." As we discussed, even if a pod explicitly disables swap, it can still be negatively impacted by "noisy neighbors" on the same node that are heavily utilizing swap. The performance degradation caused by I/O contention from swapping affects the entire node, regardless of individual pod settings. Furthermore, tying swap calculation to memoryRequest rather than actual usage patterns also limits its effectiveness, particularly for burstable workloads.
To truly address the concerns around latency-sensitive workloads and "noisy neighbors" in a swap-enabled environment, the most effective solution is swap-aware scheduling. This allows the scheduler to intelligently place pods based on node capabilities and a pod's sensitivity or tolerance to swap activity. This provides genuine isolation and optimized resource allocation, going far beyond what a simple per-pod enable/disable API can offer. This should be our next major focus after node-level GA.
More granular per-pod swap limits would be a valuable feature for fine-tuning specific workload behaviors, but this should be pursued as a separate KEP. It represents a more complex implementation detail that would build upon the foundation of node-level swap and be most effective when coupled with swap-aware scheduling. My proposed priority, therefore, is:
1. Graduate node-level swap (KEP-2400) to GA.
2. Pursue swap-aware scheduling as the next major focus.
3. Address more granular per-pod swap limits in a separate, follow-up KEP.
This phased approach ensures we deliver immediate value with a robust node-level feature, then systematically tackle the more complex challenges of intelligent workload placement and granular controls. Attempting to bundle a limited per-pod API now would, in my view, delay essential GA, introduce a suboptimal design, and not fully resolve the underlying performance concerns. This is just my 2 cents, and I'm happy to be convinced otherwise.
I want to reiterate my position from the meeting. I do not think the comment above resolves the issue at hand. As we discussed, we need to find a way forward from node-level enablement to per-Pod swap-aware scheduling, to confirm that this KEP is not making it harder to take the next steps. The need for a plan resulted in two artifacts:
Reviewing the first artifact, we got clear feedback that the best default Pod behavior is NoSwap, even on nodes that have swap enabled. The proposed change was to introduce the swap mode picker for the Pod in 2400, enabling other modes in the future and making the transition to swap safer. The proposed change to this KEP will improve it in many ways:
Given that this addition to the KEP is small, and it makes implementing the next KEPs easier and the API more consistent, I recommend including this addition in this KEP. If there is an alternative proposal for how the long-term API will look, let's put it up for discussion. My understanding was that we have many agreements in this thread on the API future and no alternatives.
Pods can't currently express opinions about swap; it's currently an opaque and node-wide setting, which node admins should only enable if they understand their workloads well enough to know that it is safe. Some clusters know enough about their workloads to enable it node-wide as part of #2400 now. I don't think graduating that work makes later per-pod controls harder. I can see why some clusters wouldn't know enough about their workloads to do that, and allowing for gradual per-pod opt-in will likely require a new middle option on the node, but I have a hard time seeing why that's part of #2400 ... it seems like work to do as part of this KEP. @dchen1107's point about swap implications not being well-contained is a good one. Once we open the can of worms of per-pod opinions around swap/noswap, aren't there a bunch of scheduling / requirement levels that are possible? e.g.:
The trickiest parts of per-pod / opt-in swap would seem to be around surfacing node state and scheduling for those cross-pod bits. Saying if you want opt-in swap you need to configure your nodes to …
Hoisting my reply from #5360 (comment) as well for visibility:
The node default is NoSwap, that is a reasonable default.
#2400 allows admins to configure nodes to automatically apply swap in limited ways to as many pods as they can, and that seems like an ~ok lever to me, especially if we're clear that the current `LimitedSwap` node setting will exclude pods that explicitly opt out of swap if that becomes possible in the future. If admins try to use the `LimitedSwap` lever to get "free memory" and their workloads become unhappy, they might have to stop using that setting and wait for pod-level opt-in to make use of swap.
Requiring a double opt-in (node-level enablement AND modifying every single pod to opt in) is just going to lead to automated opt-in "solutions" at the pod level that do things like mutate pods as they are created in order to opt them into swap.
I think graduating #2400 as-is is ~ok, not everyone will be able to make use of it yet, and we have work to do in this KEP to give the pod-level controls, scheduling visibility, and probably a third node-level option for OptInLimitedSwap or something.
My primary concern is that the norm is "safe". My fear is that we do this early work to enable swap at all, and then we never follow up with "safe" (or we take 3 years, at which point everyone will have started using "unsafe"). I accept the "admins are making the decision" argument, but we're giving them only 2 options: "bad" and "worse". It's no secret that I am pretty anti-swap for most use-cases, so I feel the default should be NOT to use swap unless I specifically say otherwise at the workload level. We're going to end up with something like:
Or more, if we find ways to modulate it further. That's a lot for people to consume. Are we (sig-node) COMMITTED to having "opt-in" (aka "let the pod decide") as the next per-node kubelet config for swap? Is there a KEP for that? Why not? How long will we be sitting in the "unsafe by default" state (row 2)? Do we have a plan that gets non-swapped pods out of "being impacted by swap pods" mode? Or is it always going to be something that pods need to be aware of?
Thanks everyone for this discussion. As @dchen1107, @liggitt and others suggested, I do think we should GA 2400 as-is and continue the API discussion in parallel.
I honestly don't think this is a strong argument. AFAICT we're here to give admins options which they can choose to use or not. We turn swap off by default, and we'll provide thorough documentation about the pros, cons and risks, letting admins decide what's best for their environment. As @liggitt said, some admins wouldn't want to take the risk without the extra safety mechanisms that will come in the future - and that's completely fine - but I don't understand why this should block other admins who do want to use it as-is in their production setups.
I know there has been some misunderstanding, but as I and others understood it, the original idea was to create a KEP to discuss APIs while we continue with 2400 GA. I still think we shouldn't tie the two together. Honestly, by far the most feedback I've received is that people from various companies have been using this in production for a long while, are very happy, and are desperate for a GA.
I tend to disagree that immediate broad adoption is very important. In addition, demanding a double opt-in plus a webhook/MPA is both bad UX and would limit the API design moving forward. From my POV, I'm not sure at all that the right API is a simple "disable" switch.
I'm not sure I understand what these two options are. And, as time goes by, we'll do our best to provide more mechanisms that will ease the adoption of this.
As an admin, you'd have swap off by default and probably wouldn't opt-in to it.
I agree that the suggested API here is far from being perfect.
As I see it, we should introduce more
FWIW, I commit to investing a huge amount of effort in the swap-related follow-up KEPs, leading them and doing most of the work needed. I've been working on this feature for about 2 years at this point, and I am willing to continue doing so.
/hold
if you're still planning to merge this for 1.34 please make sure to fill in the required PRR documents and sign me up for PRR review
I tried to capture my concerns with LimitedSwap in https://docs.google.com/document/d/16Z8CfYg3JKOzV2uC0MwC5dbthAjBhJjZkAUOu4Jhtyw/edit?tab=t.0#heading=h.s2siryc5tp3l. I agree with @thockin's comment above that we're probably going to end up with a new node config option to enable per-pod opt-in. If we aren't fully happy with the swap calculation, and intend to add a new node-config option anyway, I'm worried that we're rushing to GA a LimitedSwap option that we're not really happy with and don't expect people to use long-term.
Thank you @tallclair. As written in the document, the fact that the swap feature is not complete is known and is stated very clearly in the KEP's scope and intentions. What I expect this document to include is: why do these problems mean that swap cannot be GAed now and enhanced later? Why block (many!) admins that are using it in production and are desperate for GA? And what's the justification for going against what the community already merged and is encoded clearly in the KEP? Let me mention again what is written in the KEP:
I agree with @dchen1107 and @iholder101. I'm anxious about combining the two features and rushing a pod API, which has more edge cases and particularities than I want to introduce into a beta KEP. I think swap as-is is useful, though not perfect. GA'ing 2400 and continuing to iterate in 5359 is my preference.
I am inclined toward @dchen1107's position on this. We can consider opening a follow-on KEP to this one for pod opt-out, alongside fleshing out the pod-level swap KEP as @thockin mentioned in his comment above.
I think we should mention that there is so much inertia behind swap being off for Kubernetes clusters. To "accidentally" enable swap, one needs to create a node, provision swap, and turn off the kubelet gate. AFAIK most vendors/organizations have probably defaulted to swap not being enabled on a node. Admins that want to enable swap have a few steps to do before swap is on in their clusters.
+1 for having more time to flesh out the pod-level swap design and not rushing it with the node-level swap feature. As for whether the current node-level swap is a useful standalone feature worth graduating to GA: AFAIK, there are users who have been using the feature in production since it was alpha. Maybe others could chime in more on this. For reliability and observability concerns, there have been a few rounds of discussions, and I believe @ajaysundark also looked into this in depth and did not raise extra concerns (@ajaysundark correct me if I'm wrong). With the pod-level swap design still in its early phase, I'm inclined to agree that we should let node-level swap proceed to GA to benefit users immediately.
Given the many aspects we discussed around the API, I think folding a minimal API KEP into the existing design feels like a decision we may be rushing.
@thockin I think most folks agree that 'disabled' is the ideal default swap choice per pod. I added it elsewhere, but want to capture it here for data-points as well. Many production-grade applications commonly deployed in k8s want to dictate swap behavior, with a preference to keep it disabled. Kubernetes' default choices should align with these interests.
By making swap 'disabled' by default, we accept the impact on swap-enabled pods, which need to add a one-line YAML change to use swap again. I think clear scheduling isolation will, to some extent, reduce this migration concern. We could introduce minimal improvements like capturing the swap mode as a label, so exclusivity can be achieved in some form (node selection or affinity), as sketched below.
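A rough sketch of that label-based exclusivity, assuming a hypothetical node label (the label name and the overall approach are invented for illustration):

```go
package swapscheduling

import corev1 "k8s.io/api/core/v1"

// noSwapNodeAffinity builds a node affinity that keeps a pod off
// swap-enabled nodes, assuming nodes advertised their swap mode via a
// hypothetical "node.kubernetes.io/swap-behavior" label.
func noSwapNodeAffinity() *corev1.NodeAffinity {
	return &corev1.NodeAffinity{
		RequiredDuringSchedulingIgnoredDuringExecution: &corev1.NodeSelector{
			NodeSelectorTerms: []corev1.NodeSelectorTerm{{
				MatchExpressions: []corev1.NodeSelectorRequirement{{
					Key:      "node.kubernetes.io/swap-behavior", // hypothetical
					Operator: corev1.NodeSelectorOpNotIn,
					Values:   []string{"LimitedSwap"},
				}},
			}},
		},
	}
}
```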
@liggitt I feel like capability awareness for swap (comment) will help with different long-term qualitative scheduling needs to achieve different affinity levels.
I still believe that GA-ing #2400 would be a rash decision. The ask to understand the long-term swap story and the transition from LimitedSwap to swap-aware scheduling was made long ago, as was the ask to allow customers to opt out per-Pod. We worked hard on last-minute reviews and, after the back-and-forth, came up with some minimal changes that might make #2400 better and safer. Now, based on limited customer experience, we want to GA it, justifying it by the users who use it in production, and suggesting that other user feedback will be managed by proper warnings in release notes and by users setting taints and tolerations. We also suggest that we will come up with a better story later, without an articulated path forward. As pointed out above, the path forward may involve deprecating LimitedSwap. It feels to me that LimitedSwap is seen as a temporary feature that is being pushed to enable the feature faster and collect more user feedback. If this is the positioning of the feature, then it is no better and no worse than being in beta.
Opting in per-node is very coarse, and it requires applications to be aware of swap even if they don't want swap. It's a silent hazard that the cluster admin is putting in place for apps.
This is the "bad" vs. "worse". The only options are "no swap at all" (bad) or "swap for everyone, unless they take action to avoid it" (worse). There's always going to be tension between what apps want and what admins want. Most important, in my opinion, is not to break things that already work. When I see things like "as time goes by, we'll do our best to provide more mechanisms", what I really hear is "we think this is good enough and maybe we'll revisit at some unknown future date". I don't mean that anyone is intentionally lying; I mean that we don't seem to be on the same page WRT risk. I don't think I was part of the review for the main KEP, but scanning it now I see that taints were considered and then made "advisory". :( I get that there are people using it as-is. I also know that once it goes GA there will be MORE people using it as-is. You can document the risks in big bold letters, but it will still be an attractive nuisance. People will get hurt, and it will be our fault (partly, anyway). I haven't heard ANY specific plans for mitigating the risks, which makes me very uncomfortable. All that said, the preponderance of SIG Node leaders' and stakeholders' opinions seems to say that the value outweighs the risk, so I am going to back down. What I really want is a commitment that the end-goal is a) safe-by-default for pods which do not want swap; and b) coming within a couple of releases. If we are staring at 1.38 and don't have this pinned down, I will be a very grumpy person. I'd like to see the public docs for this include a STRONG recommendation that swap-enabled nodes be tainted to prevent accidentally swapping pods that are not ready for it.
Thank you all for this incredibly comprehensive and passionate discussion over the past 24 hours. I've read through every comment carefully from both sides. As the senior tech lead, I need to make a call on the path forward that balances immediate user value, long-term API coherence, and critical safety considerations.
My decision is to proceed with graduating NodeSwap (KEP-2400) to GA in Kubernetes 1.34 as-is, without incorporating the proposed per-pod enable/disable API from KEP-5359 into KEP-2400 at this time. This decision to unblock is contingent upon successful completion of all GA requirements, including robust testing, comprehensive documentation, and no new blockers emerging during the implementation phase.
To explain how I arrived at this conclusion, I asked myself the following four key questions, integrating the points raised in this KEP by the community:
Question 1: Is NodeSwap (KEP-2400) as-is useful and ready for GA, given its current scope?
Question 2: Does graduating KEP-2400 as-is hinder or pave the way for future comprehensive per-pod swap controls and a coherent long-term API?
Question 3: Does the proposed minimal per-pod API (from KEP-5359) effectively address the core "safety by default" and "noisy neighbor" concerns, and is it the right addition to KEP-2400 now?
This design offers fine-grained control, leverages existing taint/toleration mechanisms for node selection, and provides a very smooth and safe migration path for existing workloads. Of course, this is just a brainstorming idea in my mind; there are more details to be ironed out.
Question 4: What is our concrete commitment and plan for achieving a truly "safe by default" and "opt-in" swap experience in Kubernetes, addressing the impact on non-swapped pods?
cc/ @mrunalp @haircommander @SergeyKanzhelev @yujuhong @ajaysundark @iholder101 @liggitt @thockin @derekwaynecarr
Excellent summary @dchen1107, I wanna echo your gratitude for the thorough conversation as well! As a procedural FYI: KEP 2400 is proposed to go GA here whenever we feel ready to move forward.
Other comments:
/sig node