KEP-4816 update for beta in 1.34 #5261

Merged
merged 1 commit into kubernetes:master on Jun 20, 2025

Conversation

mortent
Member

@mortent mortent commented Apr 27, 2025

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory labels Apr 27, 2025
@k8s-ci-robot k8s-ci-robot added the sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. label Apr 27, 2025
@github-project-automation github-project-automation bot moved this to Needs Triage in SIG Scheduling Apr 27, 2025
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Apr 27, 2025

Scheduling a claim that uses this feature may take a bit longer, if it is
necessary to go deeper into the list of alternative options before finding a
suitable device. We can measure this impact in alpha.
Member Author


I'm not sure if we can easily measure this, as it largely depends on the structure of the ResourceClaim.

But in general each subrequest requires the same amount of work as a regular request, so a request with two subrequests will in the worst case take about twice as long as a single request. The maximum number of subrequests per request is 8, so in the worst case, where none of the eight subrequests succeed, it would take 8 times longer than a normal request.

Since the subrequests are tried in priority order, the extra work is only needed in situations where the first subrequest cannot be satisfied, i.e. situations where using just a single request would have failed to allocate devices for the request.
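
To make the worst-case bound concrete, here is a minimal sketch of the first-match loop described above, with made-up types and a fake allocation check. This is not the actual DRA scheduler plugin code; it only illustrates why the cost is bounded by the number of subrequests times the cost of a single request.

```go
// Hypothetical illustration of why a prioritized-list request costs at most
// N times a plain request: subrequests are tried in priority order and the
// search stops at the first one that can be satisfied. All names are made up;
// this is not the real DRA allocator.
package main

import "fmt"

type SubRequest struct {
	Name        string
	DeviceClass string
}

// tryAllocate stands in for the per-request allocation work; each call costs
// roughly as much as allocating a single, non-prioritized request.
func tryAllocate(sub SubRequest) bool {
	// Real device matching against the available pool would happen here.
	return sub.DeviceClass == "small-gpu" // pretend only this class has free devices
}

// allocateFirstAvailable returns the name of the first subrequest that fits,
// or false if none do (worst case: N calls to tryAllocate, with N <= 8).
func allocateFirstAvailable(subs []SubRequest) (string, bool) {
	for _, sub := range subs {
		if tryAllocate(sub) {
			return sub.Name, true
		}
	}
	return "", false
}

func main() {
	subs := []SubRequest{
		{Name: "big", DeviceClass: "large-gpu"},
		{Name: "small", DeviceClass: "small-gpu"},
	}
	if name, ok := allocateFirstAvailable(subs); ok {
		fmt.Println("allocated subrequest:", name)
	}
}
```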

without this feature. However, the references to the original request in the
`DeviceRequestAllocationResult.Request` field will be on the format
`<main request>/<subrequest>` for `DeviceRequest`s using the feature. The
driver must be updated to understand this format in order to correctly
Member


This seems like a risk. Is there any way we could version this?

At some point as DRA goes into beta we probably want to maintain skew support for the drivers? I would want to be able to upgrade Kubernetes without worrying about the drivers failing immediately.

cc @aojea


Sorry for the nitpicking but

At some point as DRA goes into beta

I assume we mean as DRA goes into GA? (it is already in beta)

Member


I assume we mean as DRA goes into GA? (it is already in beta)

Well ... parts of DRA are beta, but there are many parts still in alpha (like this KEP).

I think the overall point stands that we should be avoiding breakage by now?

Member


And no, not as it goes to GA; GA should only be a rename and an additional stability commitment, see #5242

https://groups.google.com/a/kubernetes.io/g/dev/c/thuIea5aYtw/m/YYcOY2dhEwAJ

Kubernetes production clusters are trusted to be stable by our users. Beta is our state where features are accessible for usage in production clusters, with a feature gate to disable the feature if it malfunctions. This is “GA-quality without field feedback”. We have updated the KEP template to help guide future KEPs (ref) by explicitly indicating that beta must be complete and promotion from beta to GA must have no significant change for the release. PRR reviewers will check for these criteria during the approval process and help contributors and sig leads explicitly handle promotion criteria.

Member Author


There are no changes to the format of these references for ResourceClaims that don't use the PrioritizedList feature. So enabling this feature does not affect any existing claims, regardless of whether the driver is upgraded to support the feature.

What this means is that:

  • The user needs to make sure that the feature is enabled on the cluster and that the drivers they are running do support this feature. If they don't plan to use the feature, no action is needed.
  • If the driver on a node is rolled back to a version that doesn't support the feature, pods that reference claims using the feature will not run on the node.

I think this is expected behavior and it will be the case for any new DRA feature that requires changes to the drivers.
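
As an illustration of the driver-side handling discussed here, a minimal sketch: only the `<main request>/<subrequest>` separator comes from the KEP text, while the helper name and surrounding code are hypothetical.

```go
// Hypothetical driver-side helper: split a DeviceRequestAllocationResult.Request
// value into the main request and optional subrequest name. A claim that does
// not use PrioritizedList keeps the plain "<request>" form, so the second
// return value is empty in that case. This is a sketch, not code from any
// actual DRA driver.
package main

import (
	"fmt"
	"strings"
)

func splitRequestRef(ref string) (mainRequest, subRequest string) {
	parts := strings.SplitN(ref, "/", 2)
	if len(parts) == 2 {
		return parts[0], parts[1]
	}
	return ref, ""
}

func main() {
	fmt.Println(splitRequestRef("gpu"))       // "gpu" and "" (feature not used)
	fmt.Println(splitRequestRef("gpu/small")) // "gpu" and "small" (prioritized list)
}
```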

Member


ACK.

If the driver on a node is rolled back to a version that doesn't support the feature, pods that reference claims using the feature will not run on the node.

Assuming the driver support and kubelet level are the same? Or some other mechanism?


Beyond the scope of this KEP, I think we should be thinking about how DRA drivers evolve safely with new capabilities.

For example, CNI config is versioned so implementations advertise which version they're using when writing the config (which is read by the CRI).

This reminds me of discussions with @tallclair about node/kubelet capabilities <> scheduler following in-place resize.

Member Author


Assuming the driver support and kubelet level are the same? Or some other mechanism?

The kubelet has no changes related to this feature, so a rollback of the kubelet will not matter. Obviously, the kubelet needs to be at a version that supports DRA (1.32+) and have the DRA feature enabled for anything DRA-related to work.

The driver needs to support this feature, otherwise it will not be able to understand the reference to the selected DeviceRequest in the ResourceClaim status.

I agree that some way to express capabilities supported by drivers would be useful here, at the very least so that users can determine which DRA features a specific version of a driver supports. We will have similar questions for many/most of the other DRA features that are in flight, as many of them will require support in the drivers.
Not sure if we want drivers to advertise capabilities in a similar way to what is being discussed for the kubelet, but it seems like we have a very similar challenge for drivers.
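
Purely as a strawman for the capability question (no such API exists today), a sketch of what driver capability advertisement could look like. Every name below is hypothetical.

```go
// Strawman only: a hypothetical way for a driver to advertise which DRA
// features it understands, so users or tooling can check before relying on a
// feature. None of these types or names are real API fields.
package main

import "fmt"

type DriverCapabilities struct {
	Driver   string
	Features []string // e.g. "PrioritizedList"
}

// Supports reports whether the driver has declared support for a feature.
func (c DriverCapabilities) Supports(feature string) bool {
	for _, f := range c.Features {
		if f == feature {
			return true
		}
	}
	return false
}

func main() {
	caps := DriverCapabilities{
		Driver:   "gpu.example.com",
		Features: []string{"PrioritizedList"},
	}
	fmt.Println(caps.Supports("PrioritizedList")) // true
}
```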

Member

@johnbelamaric johnbelamaric left a comment


We discussed adding a timeout for DRA scheduling for this feature - or perhaps just in general. Is that part of #4381 or this KEP?

@mortent
Member Author

mortent commented Jun 7, 2025

We discussed adding a timeout for DRA scheduling for this feature - or perhaps just in general. Is that part of #4381 or this KEP?

That is not specific to the PrioritizedList feature, so I think we should handle that as part of #4381

@mortent
Member Author

mortent commented Jun 9, 2025

/wg device-management

@k8s-ci-robot k8s-ci-robot added the wg/device-management Categorizes an issue or PR as relevant to WG Device Management. label Jun 9, 2025
@pohly pohly moved this from 🆕 New to 👀 In review in Dynamic Resource Allocation Jun 10, 2025
Contributor

@deads2k deads2k left a comment


PRR metrics addition requested. PRR lgtm otherwise.

@mortent
Member Author

mortent commented Jun 12, 2025

PRR metrics addition requested. PRR lgtm otherwise.

Thanks! @deads2k can I add you as the PRR reviewer for this change since John is OOO?

@@ -1010,7 +1017,6 @@ ensure they are handled by the scheduler as described in this KEP.
#### Beta

- Gather feedback
- Implement node scoring
- Evaluate feasibilty of cluster auto scaler implementation
Member


Was it evaluated?

Member Author


This is something we want to do and it is tracked in #4970. It was also discussed in #5261 (comment). However, it is not on the roadmap for this year.

Member


I meant the "Evaluate feasibilty of cluster auto scaler implementation" point

Member Author


Yeah, I don't think we have any concerns about the cluster autoscaler integration with this feature, but I will verify. Good catch.

Member


Please state it explicitly in the KEP before merging it

Contributor


@mortent this still seems to be outstanding

Member Author


SIG Autoscaling has confirmed we don't have any concerns about this and has approved this PR. I've updated the KEP.

One indicator are unexpected restarts of the cluster control plane components
(kube-scheduler, apiserver, kube-controller-manager).

If the `scheduler_pending_pods` metric in the kube-scheduler suddenly increases,
Member


Or when it's constant, i.e. no pods are being scheduled?

Member Author


Yeah, I've updated the language to clarify.

@deads2k
Contributor

deads2k commented Jun 16, 2025

PRR lgtm. Approving for PRR only

/approve

@macsko
Member

macsko commented Jun 18, 2025

/approve as SIG Scheduling

/hold for #5261 (comment)

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Jun 18, 2025
@jackfrancis

@haircommander @mortent I'll take a look for SIG Autoscaling shortly, stand by

rollout. Similarly, consider large clusters and how enablement/disablement
will rollout across nodes.
-->
Workloads that doesn't use the Prioritized List feature should not be impacted,


This is the key cluster-autoscaler story for the near term (CA + DRA doesn't currently implement the prioritized alternatives feature). We can make a point to identify non-implemented 1.34 DRA features when CA ships its 1.34 equivalent release, to ensure that CA customers understand how they can use DRA. (And we will follow that pattern going forward as new DRA functionality arrives and CA implements support for it in its releases.)

Member Author


Thanks. And just to confirm, there is nothing about this feature that we think will be problematic to support in the Cluster Autoscaler and therefore should stop us from taking this to beta?


Confirmed.


I think the problematic part to evaluate was how to solve the scoring part in Cluster Autoscaler. The node-local part of choosing the most optimal device on a given Node should work out of the box with the vendored DRA scheduler plugin.

So IMO we should revisit the feasibility when scoring is in scope. Unless we have a more generic CA scoring integration in place by then, we can probably write a CA Expander that would prefer the node groups with more optimal devices, right @jackfrancis?
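
A rough sketch of that Expander idea, using hypothetical types rather than the real cluster-autoscaler expander interface: rank candidate node groups by the highest-priority subrequest their devices can satisfy and scale up the best one.

```go
// Hypothetical sketch of an Expander-style choice that prefers node groups
// offering devices for the highest-priority subrequest of a claim. The types
// and scoring are invented for illustration; the real cluster-autoscaler
// expander interface is different.
package main

import "fmt"

type NodeGroupOption struct {
	Name string
	// Rank of the best subrequest this node group's devices could satisfy
	// (0 = most preferred subrequest in the claim's prioritized list).
	BestSubrequestRank int
}

// pickBestOption returns the node group satisfying the highest-priority
// subrequest; options is assumed to be non-empty.
func pickBestOption(options []NodeGroupOption) NodeGroupOption {
	best := options[0]
	for _, o := range options[1:] {
		if o.BestSubrequestRank < best.BestSubrequestRank {
			best = o
		}
	}
	return best
}

func main() {
	options := []NodeGroupOption{
		{Name: "pool-small-gpu", BestSubrequestRank: 1},
		{Name: "pool-large-gpu", BestSubrequestRank: 0},
	}
	fmt.Println("scale up:", pickBestOption(options).Name)
}
```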


sgtm

@jackfrancis

/lgtm
/approve for SIG Autoscaling

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 20, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, jackfrancis, macsko, mortent

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@mortent mortent force-pushed the Update4816ForBeta branch from 17fe043 to d6bbba9 Compare June 20, 2025 17:47
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 20, 2025
@haircommander
Contributor

/lgtm
/hold

awaiting one confirmation from @jackfrancis #5261 (comment) but I think we can tag this. Jack, would you unhold when you confirm?

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 20, 2025
@jackfrancis

/lgtm

@jackfrancis

/hold cancel

thank you @mortent @haircommander and everyone 🚀

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 20, 2025
@k8s-ci-robot k8s-ci-robot merged commit 2197b2e into kubernetes:master Jun 20, 2025
3 of 4 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.34 milestone Jun 20, 2025
@github-project-automation github-project-automation bot moved this from Needs Triage to Done in SIG Scheduling Jun 20, 2025
@pohly pohly moved this from 👀 In review to ✅ Done in Dynamic Resource Allocation Jun 23, 2025
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lgtm "Looks good to me", indicates that a PR is ready to be merged. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. wg/device-management Categorizes an issue or PR as relevant to WG Device Management.
Projects
SIG Scheduling: Done
Dynamic Resource Allocation: Done