KEP-4816 update for beta in 1.34 #5261
Conversation
Scheduling a claim that uses this feature may take a bit longer, if it is
necessary to go deeper into the list of alternative options before finding a
suitable device. We can measure this impact in alpha.
I'm not sure if we can easily measure this, as it largely depends on the structure of the ResourceClaim.
But in general each subrequest requires the same amount of work as a regular request, so a request with two subrequests will in the worst case take about twice as long as a single request. The maximum number of subrequests for each request is 8, so in the worst case, where none of the eight subrequests succeed, it would take 8 times longer than a normal request.
Since the subrequests are tried in priority order, the extra work is only needed in situations where the first subrequests can not be satisfied, i.e. situations where using just a single request would have failed to allocate devices at all.
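A rough sketch of the cost argument above, using made-up types rather than the real allocator code: the subrequests are tried in priority order and the search stops at the first one that can be satisfied, so the extra work only shows up when earlier alternatives fail.

```go
package main

import "fmt"

// subRequest is a stand-in for a DRA subrequest; the real API types differ.
type subRequest struct {
	name  string
	count int // devices needed
}

// tryAllocate stands in for the per-request allocation work; each call costs
// roughly as much as allocating a plain request.
func tryAllocate(sub subRequest, available int) bool {
	return available >= sub.count
}

// allocate walks the subrequests in priority order (at most 8 per request)
// and returns the first one that fits, so the worst case is N attempts.
func allocate(subs []subRequest, available int) (string, error) {
	for _, sub := range subs {
		if tryAllocate(sub, available) {
			return sub.name, nil
		}
	}
	return "", fmt.Errorf("none of the %d subrequests could be satisfied", len(subs))
}

func main() {
	subs := []subRequest{{name: "large", count: 4}, {name: "small", count: 1}}
	chosen, err := allocate(subs, 2)
	fmt.Println(chosen, err) // only 2 devices available, so it falls back to "small"
}
```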
without this feature. However, the references to the original request in the
`DeviceRequestAllocationResult.Request` field will be in the format
`<main request>/<subrequest>` for `DeviceRequest`s using the feature. The
driver must be updated to understand this format in order to correctly
This seems like a risk; is there any way we could version this?
At some point as DRA goes into beta we probably want to maintain skew support for the drivers? I would want to be able to upgrade Kubernetes without worrying about the drivers failing immediately.
cc @aojea
Sorry for the nitpicking but
At some point as DRA goes into beta
I assume we mean as DRA goes into GA? (it is already in beta)
I assume we mean as DRA goes into GA? (it is already in beta)
Well ... parts of DRA are beta, but many parts are still in alpha (like this KEP).
I think the overall point stands that we should be avoiding breakage by now?
And no, not as it goes to GA; GA should only be a rename and an additional stability commitment, see #5242
https://groups.google.com/a/kubernetes.io/g/dev/c/thuIea5aYtw/m/YYcOY2dhEwAJ
Kubernetes production clusters are trusted to be stable by our users. Beta is our state where features are accessible for usage in production clusters, with a feature gate to disable the feature if it malfunctions. This is “GA-quality without field feedback”. We have updated the KEP template to help guide future KEPs (ref) by explicitly indicating that beta must be complete and promotion from beta to GA must have no significant change for the release. PRR reviewers will check for these criteria during the approval process and help contributors and sig leads explicitly handle promotion criteria.
There are no changes to the format of these references for ResourceClaims that don't use the PrioritizedList feature. So enabling this feature does not affect any existing claims, regardless of whether the driver is upgraded to support the feature.
What this means is that:
- The user needs to make sure that the feature is enabled on the cluster and that the drivers they are running support this feature. If they don't plan to use the feature, no action is needed.
- If the driver on a node is rolled back to a version that doesn't support the feature, pods that reference claims using the feature will not run on that node.
I think this is expected behavior, and it will be the case for any new DRA feature that requires changes to the drivers.
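To make the format concrete, here is a minimal parsing sketch (not taken from any driver; the helper and the request names are made up): a plain request keeps its bare name, while a request using the feature is referenced as `<main request>/<subrequest>`.

```go
package main

import (
	"fmt"
	"strings"
)

// splitRequestRef splits a DeviceRequestAllocationResult.Request value into
// the main request name and, if the PrioritizedList feature was used, the
// subrequest name. splitRequestRef is a hypothetical helper, not driver code.
func splitRequestRef(ref string) (mainRequest, subRequest string) {
	if i := strings.Index(ref, "/"); i >= 0 {
		return ref[:i], ref[i+1:]
	}
	return ref, ""
}

func main() {
	for _, ref := range []string{"gpu", "gpu/large"} {
		mainReq, sub := splitRequestRef(ref)
		fmt.Printf("ref=%q main=%q subrequest=%q\n", ref, mainReq, sub)
	}
}
```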
ACK.
If the driver on a node is rolled back to a version that doesn't support the feature, pods that reference claims using the feature will not run on that node.
Assuming the driver support and kubelet level are the same? Or some other mechanism?
Beyond the scope of this KEP, I think we should be thinking about how DRA drivers evolve safely with new capabilities.
For example, CNI config is versioned so implementations advertise which version they're using when writing the config (which is read by the CRI).
This reminds me of discussions with @tallclair about node/kubelet capabilities <> scheduler following in-place resize.
Assuming the driver support and kubelet level are the same? Or some other mechanism?
The kubelet has no changes related to this feature, so a rollback of the kubelet will not matter. Obviously, the kubelet needs to be at a version that supports DRA (1.32+) and have the DRA feature enabled for anything DRA-related to work.
The driver needs to support this feature, otherwise it will not be able to understand the reference to the selected DeviceRequest in the ResourceClaim status.
I agree that some way to express capabilities supported by drivers would be useful here. At the very least so it is possible for users to determine which features of DRA a specific version of a driver supports. We will have similar questions to this for many/most of the other DRA features that are in-flight, as many of them will require support in the drivers.
Not sure if we want drivers to advertise capabilities in a similar way to what is being discussed about the kubelet, but it seems like we have a very similar challenge for drivers.
We discussed adding a timeout for DRA scheduling for this feature - or perhaps just in general. Is that part of #4381 or this KEP?
/wg device-management
PRR metrics addition requested. PRR lgtm otherwise.
Thanks! @deads2k can I add you as the PRR reviewer for this change since John is OOO?
@@ -1010,7 +1017,6 @@ ensure they are handled by the scheduler as described in this KEP.
#### Beta

- Gather feedback
- Implement node scoring
- Evaluate feasibility of cluster auto scaler implementation
Was it evaluated?
This is something we want to do and it is tracked in #4970; it was also discussed in #5261 (comment). However, it is not on the roadmap for this year.
I meant the "Evaluate feasibility of cluster auto scaler implementation" point.
Yeah, I don't think we have any concerns about the cluster autoscaler integration with this feature, but I will verify. Good catch.
Please state it explicitly in the KEP before merging it
@mortent this still seems to be outstanding
SIG-Autoscaling has confirmed we don't have any concern about this and has approved this PR. I've updated the KEP.
One indicator is unexpected restarts of the cluster control plane components
(kube-scheduler, apiserver, kube-controller-manager).

If the `scheduler_pending_pods` metric in the kube-scheduler suddenly increases,
Or when it's constant, i.e. no pods are being scheduled?
Yeah, I've updated the language to clarify.
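A minimal sketch, not part of the KEP, of checking that signal: it filters a Prometheus text-format metrics dump from the kube-scheduler (read from stdin; how the dump is obtained is environment-specific and assumed here) down to the `scheduler_pending_pods` samples, so an operator can watch whether the value is growing or stuck.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	scanner := bufio.NewScanner(os.Stdin)
	for scanner.Scan() {
		line := scanner.Text()
		// The Prometheus text format emits one sample per line, for example:
		//   scheduler_pending_pods{queue="unschedulable"} 3
		if strings.HasPrefix(line, "scheduler_pending_pods") {
			fmt.Println(line)
		}
	}
	if err := scanner.Err(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```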
PRR lgtm. Approving for PRR only. /approve
/approve as SIG Scheduling. /hold for #5261 (comment)
@haircommander @mortent I'll take a look for SIG Autoscaling shortly, stand by.
rollout. Similarly, consider large clusters and how enablement/disablement
will roll out across nodes.
-->
Workloads that don't use the Prioritized List feature should not be impacted,
This is the key cluster-autoscaler story for the near term (CA + DRA doesn't currently implement prioritized alternatives features). We can make a point to identify non-implemented 1.34 DRA features when CA ships its 1.34 equivalent release to ensure that CA customers understand how they can use DRA. (and we will follow that pattern going forward as new DRA functionality arrives and CA implements support for them in its releases)
Thanks. And just to confirm, there is nothing about this feature that we think will be problematic to support in the Cluster Autoscaler and therefore should stop us from taking this to beta?
Confirmed.
I think the problematic part to evaluate was how to solve the scoring part in Cluster Autoscaler. The node-local part of choosing the most optimal device on a given Node should work out of the box with the vendored DRA scheduler plugin.
So IMO we should revisit the feasibility when scoring is in scope. Unless we have a more generic CA scoring integration in place by then, we can probably write a CA Expander that would prefer the node groups with more optimal devices, right @jackfrancis?
sgtm
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: deads2k, jackfrancis, macsko, mortent
The full list of commands accepted by this bot can be found here. The pull request process is described here.
/lgtm Awaiting one confirmation from @jackfrancis (#5261 (comment)), but I think we can tag this. Jack, would you unhold when you confirm?
/lgtm
/hold cancel Thank you @mortent @haircommander and everyone 🚀
One-line PR description: Update KEP to prepare for beta in 1.34
Issue link: DRA: Prioritized Alternatives in Device Requests #4816
Other comments: Some of the material is already covered in KEP-4381: DRA Structure Parameters, so that KEP is referenced in some places.