KEP-5233: proposal for NodeReadinessGates #5416
Conversation
ajaysundark commented on Jun 17, 2025
- One-line PR description: adding new KEP
- Issue link: Node Readiness Gates #5233
- Other comments: Incorporates feedback from the API review to make probing mechanisms an inherent part of the design.
This design has been discussed with more folks, and below is the summary of the key feedback:
```yaml
authors:
  - "@ajaysundark"
owning-sig: sig-node
participating-sigs: []
```
Need to add sig-scheduling and a reviewer/approver from us.
Overall, we have a lot of existing extension points to allow building something a lot like this, but out of tree.
We put in those extension points for a reason. So, I think we should:
- build this out of tree
- add our voices to calls for better add-on management
### Initial Taints without a Central API

This approach uses `--register-with-taints` to apply multiple readiness taints at startup. Each component is then responsible for removing its own taint. This is less flexible and discoverable than a formal, versioned API for defining the readiness requirements. It also adds operational complexity: every critical DaemonSet must tolerate every other potential readiness taint, which is unmanageable in practice when the components are managed by different teams or providers.
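For reference, a minimal sketch of how this is wired with the existing kubelet configuration; the taint keys are illustrative and not defined by any existing component:

```yaml
# KubeletConfiguration fragment: the node registers with one readiness
# taint per dependency; each owning component must later remove its own.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
registerWithTaints:
  - key: "example.com/network-pending"        # cleared by the network agent
    effect: NoSchedule
  - key: "example.com/device-driver-pending"  # cleared by the device plugin
    effect: NoSchedule
```

Every critical DaemonSet would also need matching `tolerations` for every such taint, which is the maintenance burden described above.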
Suggested change (adds an additional alternative after the section above):

### Initial taints (replaced), with out-of-tree controller

This approach uses `--register-with-taints` to apply a single initial taint at startup. A controller then atomically sets a set of replacement taints (configured using a custom resource) and removes the initial taint. For each replacement taint, each component is then responsible for removing its own taint. This is easier to maintain (no in-tree code) but requires people to run an additional controller in their cluster.
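One possible shape for that custom resource, sketched here with entirely hypothetical group, kind, and field names:

```yaml
# Hypothetical CR read by the out-of-tree controller. On node registration,
# the controller atomically replaces the single bootstrap taint with the
# taints listed below; each component then removes only its own taint.
apiVersion: readiness.example.com/v1alpha1
kind: NodeReadinessTaintSet
metadata:
  name: default
spec:
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker: ""
  bootstrapTaintKey: "readiness.example.com/bootstrap-pending"
  replacementTaints:
    - key: "readiness.example.com/network-pending"
      effect: NoSchedule
    - key: "readiness.example.com/device-driver-pending"
      effect: NoSchedule
```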
```yaml
    matchLabels:
      readiness-requirement: "network"
  requiredConditions:
    - type: "network.k8s.io/CalicoReady"
```
- type: "network.k8s.io/CalicoReady" | |
- type: "vendor.example/NetworkReady" |
```yaml
  requiredConditions:
    - type: "network.k8s.io/CalicoReady"
    - type: "network.k8s.io/NetworkProxyReady"
    - type: "network.k8s.io/DRANetReady"
```
- type: "network.k8s.io/DRANetReady" | |
- type: "vendor.example/LowLatencyInterconnectReady" |
key: "readiness.k8s.io/network-pending" | ||
effect: NoSchedule | ||
``` | ||
|
How about a CRD that defines a set of rules that map (custom) conditions to taints?
Yes, you can break your cluster with a single, misguided cluster-scoped policy, but we already have that in other places (e.g., ValidatingAdmissionPolicy).
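As a sketch of that idea (hypothetical group, kind, and fields; nothing here exists today): a cluster-scoped rule that keeps a taint on a node until a named condition reports `True`.

```yaml
# Hypothetical rule object: while the referenced node condition is absent
# or not "True", the controller keeps the corresponding taint in place.
apiVersion: readiness.example.com/v1alpha1
kind: NodeConditionTaintRule
metadata:
  name: network-ready
spec:
  conditionType: "vendor.example/NetworkReady"
  requiredStatus: "True"
  taint:
    key: "readiness.example.com/network-pending"
    effect: NoSchedule
```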
```yaml
# Hypothetical Kubelet Configuration
nodeReadinessProbes:
  - name: "CNIReady"
    conditionType: "network.k8s.io/CalicoCNIReady"
```
conditionType: "network.k8s.io/CalicoCNIReady" | |
conditionType: "vendor.example/NetworkReady" |
- conditionType: "vendor.com/DeviceDriverReady" | ||
- conditionType: "network.k8s.io/CalicoCNIReady" |
- conditionType: "vendor.com/DeviceDriverReady" | |
- conditionType: "network.k8s.io/CalicoCNIReady" | |
- conditionType: "vendor.example.com/DeviceDriverReady" | |
- conditionType: "vendor.example/NetworkReady" |
```
Note over NA, CNI: Node-Agent Probes for Readiness
NA->>CNI: Probe for readiness (e.g., check health endpoint)
CNI-->>NA: Report Ready
NA->>N: Patch status.conditions:<br/>network.k8s.io/CNIReady=True
```
We shouldn't make an assumption that network plugins use CNI.
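For illustration, the node agent in the diagram could publish a plugin-agnostic condition instead; a status patch along these lines (condition type and reason are made up) avoids baking CNI into the API:

```yaml
# Node status patch applied by the node agent; the condition type is
# deliberately vendor-neutral rather than CNI-specific.
status:
  conditions:
    - type: "vendor.example/NetworkReady"
      status: "True"
      reason: "DataplaneProgrammed"
      message: "Pod network is configured on this node"
      lastHeartbeatTime: "2025-06-17T00:00:00Z"
      lastTransitionTime: "2025-06-17T00:00:00Z"
```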
This approach allows critical components to directly influence when a node is ready, complementing the existing `Ready` condition with more granular, user-defined control.

### User Stories (Optional)
Suggested change:

```diff
-### User Stories (Optional)
+### User Stories
```
* Defining the right set of gates requires careful consideration by the cluster administrator.

## Alternatives
For autoscaling, cluster autoscalers can directly pay attention to the existing `.status.conditions` on nodes. I think that's a viable alternative and one we should list.
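For example (condition types are illustrative), an autoscaler could treat a node like the one below as still provisioning, without any new API:

```yaml
# Node status fragment: the node is Ready at the kubelet level, but a
# custom readiness condition published by an agent is still False.
status:
  conditions:
    - type: "Ready"
      status: "True"
    - type: "vendor.example/NetworkReady"
      status: "False"
      reason: "AwaitingDataplane"
```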
###### How can this feature be enabled / disabled in a live cluster?

1. Feature gate (also fill in values in `kep.yaml`)
   - Feature gate name:
Mention `NodeReadinessGates` here.
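If that suggestion is taken, the corresponding `kep.yaml` stanza might look like the following; the gate name comes from this KEP, while the component list is only an assumption at this stage:

```yaml
# kep.yaml fragment (component list is an assumption, not settled here)
feature-gates:
  - name: NodeReadinessGates
    components:
      - kubelet
      - kube-apiserver
```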