KEP-5233: proposal for NodeReadinessGates #5416

Open
wants to merge 4 commits into master

Conversation

ajaysundark

  • One-line PR description: adding new KEP
  • Other comments: Incorporating feedback from the API review to include probing mechanisms as an inherent part of the design.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/node Categorizes an issue or PR as relevant to SIG Node. labels Jun 17, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ajaysundark
Once this PR has been reviewed and has the lgtm label, please assign dchen1107 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Jun 17, 2025
@ajaysundark
Author

This design has been discussed with more folks, and below is the summary of the key feedback:

  1. The current design with a new, explicit API is likely not necessary for the identified use cases. The recommended path forward is to first explore a simpler design that does not require a new API. This decision can be revisited if a POC or use cases demonstrate that the simpler approach is impractical or introduces unforeseen complexities.
  2. There was a strong preference for a node-local probing mechanism to report readiness. This approach is favored for its high-fidelity signals and better security posture compared to granting `nodes/status` patch permissions to multiple external agents (for contrast, see the RBAC sketch after this list).
  3. An alternative proposal based on global control (a CRD) for node readiness is undesirable due to the risk of large-scale impact from misconfiguration.
  4. Admins typically know readiness requirements before node provisioning, so mutable readiness gates are not necessary; the conditions themselves are what may change.
  5. When handling readiness states, it is important to differentiate between an agent that is not yet present and an agent that is failing.
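
To make the security trade-off in point 2 concrete: without node-local probing, each external agent would need its own grant to patch node status. A minimal sketch of such a grant, with an illustrative role name not taken from the KEP:

```yaml
# Hypothetical ClusterRole each external readiness agent would need in order
# to report node conditions itself; kubelet-driven node-local probing avoids
# handing this permission to multiple agents.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: readiness-agent-node-status  # illustrative name
rules:
- apiGroups: [""]
  resources: ["nodes/status"]
  verbs: ["get", "patch"]
```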

@k8s-ci-robot
Contributor

@ajaysundark: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-enhancements-verify | 1dfa3f8 | link | true | `/test pull-enhancements-verify` |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

authors:
- "@ajaysundark"
owning-sig: sig-node
participating-sigs: []
Member


need to put sig-scheduling and reviewer/approver from us.


@lmktfy lmktfy left a comment


Overall, we have a lot of existing extension points to allow building something a lot like this, but out of tree.

We put in those extension points for a reason. So, I think we should:

  • build this out of tree
  • add our voices to calls for better add-on management

Comment on lines +560 to +562
### Initial Taints without a Central API

This approach uses `--register-with-taints` to apply multiple readiness taints at startup. Each component is then responsible for removing its own taint. This is less flexible and discoverable than a formal, versioned API for defining readiness requirements. It also adds operational complexity: every critical DaemonSet would need to tolerate every other potential readiness taint, which is unmanageable in practice when the components are managed by different teams or providers.

Suggested change
### Initial Taints without a Central API
This approach uses `--register-with-taints` to apply multiple readiness taints at startup. Each component is then responsible for removing its own taint. This is less flexible and discoverable than a formal, versioned API for defining readiness requirements. It also adds operational complexity: every critical DaemonSet would need to tolerate every other potential readiness taint, which is unmanageable in practice when the components are managed by different teams or providers.
### Initial taints (replaced), with out-of-tree controller
This approach uses `--register-with-taints` to apply a single initial taint at startup. A controller then atomically sets a set of replacement taints (configured using a custom resource) and removes the initial taint.
For each replacement taint, each component is then responsible for removing its own taint.
This is easier to maintain (no in-tree code) but requires people to run an additional controller in their cluster.
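
Either taint-based variant can also set its startup taints through the kubelet configuration file instead of the command-line flag. A minimal sketch, assuming illustrative taint keys that are not defined by the KEP:

```yaml
# KubeletConfiguration fragment applying readiness taints at node
# registration; `registerWithTaints` is the config-file counterpart of the
# `--register-with-taints` flag. Keys are illustrative placeholders.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
registerWithTaints:
- key: "readiness.example.io/network-pending"
  effect: NoSchedule
- key: "readiness.example.io/device-driver-pending"
  effect: NoSchedule
```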

matchLabels:
  readiness-requirement: "network"
requiredConditions:
- type: "network.k8s.io/CalicoReady"

Suggested change
- type: "network.k8s.io/CalicoReady"
- type: "vendor.example/NetworkReady"

requiredConditions:
- type: "network.k8s.io/CalicoReady"
- type: "network.k8s.io/NetworkProxyReady"
- type: "network.k8s.io/DRANetReady"

Suggested change
- type: "network.k8s.io/DRANetReady"
- type: "vendor.example/LowLatencyInterconnectReady"

key: "readiness.k8s.io/network-pending"
effect: NoSchedule
```


How about a CRD that defines a set of rules that map (custom) conditions to taints?

Yes, you can break your cluster with a single, misguided cluster-scoped policy, but we already have that in other places (e.g., ValidatingAdmissionPolicy).
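
A minimal sketch of what such a rule object could look like; the group, kind, and field names below are all hypothetical, not part of this KEP:

```yaml
# Hypothetical cluster-scoped rule mapping a node condition to a taint.
# An out-of-tree controller would keep the taint on the node while the
# condition is absent or False, and remove it once the condition is True.
apiVersion: readiness.example.io/v1alpha1
kind: NodeReadinessRule
metadata:
  name: network-readiness
spec:
  conditionType: "vendor.example/NetworkReady"
  taint:
    key: "readiness.example.io/network-pending"
    effect: NoSchedule
```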

# Hypothetical Kubelet Configuration
nodeReadinessProbes:
- name: "CNIReady"
  conditionType: "network.k8s.io/CalicoCNIReady"

Suggested change
  conditionType: "network.k8s.io/CalicoCNIReady"
  conditionType: "vendor.example/NetworkReady"

Comment on lines +276 to +277
- conditionType: "vendor.com/DeviceDriverReady"
- conditionType: "network.k8s.io/CalicoCNIReady"

Suggested change
- conditionType: "vendor.com/DeviceDriverReady"
- conditionType: "network.k8s.io/CalicoCNIReady"
- conditionType: "vendor.example.com/DeviceDriverReady"
- conditionType: "vendor.example/NetworkReady"

Note over NA, CNI: Node-Agent Probes for Readiness
NA->>CNI: Probe for readiness (e.g., check health endpoint)
CNI-->>NA: Report Ready
NA->>N: Patch status.conditions:<br/>network.k8s.io/CNIReady=True

We shouldn't make an assumption that network plugins use CNI.
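
Regardless of which plugin mechanism the agent probes, the observable outcome is the same: a condition written to the node's status. A sketch of the patched status, using an illustrative vendor-neutral condition type:

```yaml
# What the Node status might look like after a successful probe; the
# condition type, reason, and timestamps are illustrative placeholders.
status:
  conditions:
  - type: "vendor.example/NetworkReady"
    status: "True"
    reason: "ProbeSucceeded"
    message: "network plugin health endpoint reported ready"
    lastHeartbeatTime: "2025-06-17T10:00:00Z"
    lastTransitionTime: "2025-06-17T10:00:00Z"
```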


This approach allows critical components to directly influence when a node is ready, complementing the existing `Ready` condition with more granular, user-defined control.

### User Stories (Optional)

Suggested change
### User Stories (Optional)
### User Stories

* Defining the right set of gates requires careful consideration by the cluster administrator.

## Alternatives


For autoscaling, cluster autoscalers can directly pay attention to the existing .status.conditions on nodes. I think that's a viable alternative and one we should list.

###### How can this feature be enabled / disabled in a live cluster?

1. Feature gate (also fill in values in `kep.yaml`)
- Feature gate name:

Mention NodeReadinessGates here
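
A sketch of how the filled-in stanza in `kep.yaml` might read; the component list is an assumption based on the design discussed above, not something the KEP confirms:

```yaml
# Hypothetical kep.yaml feature-gate stanza; components are assumed.
feature-gates:
- name: NodeReadinessGates
  components:
  - kubelet
  - kube-apiserver
```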
