
Implement ProvisioningClass for best-effort-atomic-scale-up.kubernetes.io ProvisioningRequests #6824

Merged
merged 2 commits into kubernetes:master on May 27, 2024

Conversation

aleksandra-malinowska
Contributor

@aleksandra-malinowska aleksandra-malinowska commented May 14, 2024

What type of PR is this?

/kind feature

What this PR does / why we need it:

Implements provisioning class for best-effort-atomic-scale-up.x-k8s.io ProvisioningRequests: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/proposals/provisioning-request.md

Which issue(s) this PR fixes:

Issue #6815

Does this PR introduce a user-facing change?

Added support for `best-effort-atomic-scale-up.x-k8s.io` ProvisioningRequests

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

- [KEP]: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/proposals/provisioning-request.md

/cc @yaroslava-serdiuk @kisieland @towca

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label May 14, 2024
@k8s-ci-robot k8s-ci-robot requested a review from towca May 14, 2024 10:58
@k8s-ci-robot
Contributor

@aleksandra-malinowska: GitHub didn't allow me to request PR reviews from the following users: kisieland.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

What type of PR is this?

/kind feature

What this PR does / why we need it:

Implements provisioning class for atomic-scale-up.kubernetes.io ProvisioningRequests: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/proposals/provisioning-request.md

Which issue(s) this PR fixes:

Issue #6815

Special notes for your reviewer:

Please ONLY consider the second commit, the first one is being reviewed in #6821

Does this PR introduce a user-facing change?

Added support for `atomic-scale-up.kubernetes.io` ProvisioningRequests

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

- [KEP]: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/proposals/provisioning-request.md

/cc @yaroslava-serdiuk @kisieland @towca

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. area/cluster-autoscaler size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels May 14, 2024
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels May 16, 2024
// Best effort atomic provisioning class requests scale-up only if it's possible
// to atomically request enough resources for all pods specified in a
// ProvisioningRequest. It's "best effort" as it admits workload immediately
// after successful request, without waiting to verify that resources started.
Contributor

I don't think the comment is correct. It's best effort because it relies on the AtomicIncreaseSize API, and if that's not implemented it falls back to IncreaseSize. And for the IncreaseSize API your comment is correct.

Contributor Author

It's still best effort unless AtomicIncreaseSize is supposed to wait for the resources to be provisioned (for the VMs to start and register as nodes) - a VM can still fail and get stuck.

If AtomicIncreaseSize is supposed to wait for everything to be ready, it would be a serious performance issue. We would have to wait for it asynchronously if we wanted to have a true guarantee.

Contributor

If AtomicIncreaseSize is supposed to wait for everything to be ready

@kisieland I suppose it is, correct?

Contributor

AtomicIncreaseSize is not supposed to wait for everything to be ready; the API should either add X or do nothing.
Whereas IncreaseSize is add X or whatever you are able to (anything between 0 and X).
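
For reference, here is a rough sketch of the contract being described, against the two NodeGroup methods this thread is about (the comments paraphrase this discussion and are not the actual interface documentation):

type NodeGroup interface {
	// IncreaseSize requests delta more nodes. The provider may end up
	// satisfying the request only partially (anything between 0 and delta).
	IncreaseSize(delta int) error

	// AtomicIncreaseSize requests delta more nodes as all-or-nothing:
	// either the whole delta is provisioned or nothing is.
	AtomicIncreaseSize(delta int) error
}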

Collaborator

What does "you are able to" mean in this context?

I have also been under the impression that AtomicIncreaseSize guarantees the VMs coming up reasonably soon if it succeeds. If not that, how is it different than regular IncreaseSize? At least in GCE, most issues with creating instances happen after the increase call (stockouts, quota issues, ip exhaustion, etc.). Is AtomicIncreaseSize supposed to account for them and guarantee they won't happen if the request succeeds?

Contributor Author

If not that, how is it different than regular IncreaseSize?

I believe the cloudprovider API within autoscaler doesn't specify that IncreaseSize has to do nothing but return an error if the entire delta is impossible to request. GCE cloudprovider happens to do that, though, so there's not much difference in this case.

The only way AtomicIncreaseSize could provide a stronger guarantee would be if it were a blocking call, and that's infeasible from performance perspective - if we want to wait for VMs before admitting workload, we'll need to find a way to do so without blocking other autoscaler ops.

One way to do that would be to add a checkpoint to the ProvisioningRequest, something like TriggeredScaleUp=true, then treat the request as if it were of the check-capacity class, and only update it with Provisioned=true once the capacity is provisioned.
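
A rough sketch of that checkpoint idea, using the same conditions helper that appears elsewhere in this PR (the TriggeredScaleUp condition name and the reason/message strings are hypothetical, taken only from the comment above):

// Hypothetical only: record that the scale-up was requested, then let a
// check-capacity-style flow set Provisioned=true once the capacity shows up.
conditions.AddOrUpdateCondition(pr, "TriggeredScaleUp", metav1.ConditionTrue,
	"ScaleUpRequested", "Scale-up was requested, waiting for capacity.", metav1.Now())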

Contributor

The main difference between AtomicIncreaseSize and IncreaseSize is that if the call would result in a partial scale-up (as can happen with IncreaseSize, for example due to stock-outs), then no VMs will be provisioned and the whole scale-up will fail, i.e. no retries will be made.

Regarding the duration of the call, it should provide an answer within a similar time frame as IncreaseSize, which AFAIK is now also handled asynchronously.

Collaborator

The only way AtomicIncreaseSize could provide a stronger guarantee would be if it were a blocking call, and that's infeasible from performance perspective - if we want to wait for VMs before admitting workload, we'll need to find a way to do so without blocking other autoscaler ops.

I guess it could also e.g. provide a stronger guarantee asynchronously with a (much) longer timeout. I think that's what happens in the GKE ProvReq case right? Just want to understand the differences here.

I believe the cloudprovider API within autoscaler doesn't specify that IncreaseSize has to do nothing but return an error if the entire delta is impossible to request. GCE cloudprovider happens to do that, though, so there's not much difference in this case.

@aleksandra-malinowska Hmmm I see, so the only difference is "atomic version guarantees that the method fails if it knows that not all VMs can be provisioned before doing the API call"? I have a couple of questions:

Do you really anticipate it being useful in practice, given that most issues impacting VM provisioning can't be detected before doing the increase call? The only thing I can see this catching is exceeding the node group's max limit, which is already covered by the scale-up code after your previous changes. Is this just for completeness with your previous changes?

The main difference between AtomicIncreaseSize and IncreaseSize is that if the call would result in a partial scale-up (as can happen with IncreaseSize, for example due to stock-outs), then no VMs will be provisioned and the whole scale-up will fail, i.e. no retries will be made.

Regarding the duration of the call, it should provide an answer within a similar time frame as IncreaseSize, which AFAIK is now also handled asynchronously.

@kisieland This seems to go against what @aleksandra-malinowska is saying above? I haven't seen IncreaseSize calls failing because of stockouts - at least not in GCE, which is the main target for this feature. In my experience, the increase call itself ~always succeeds unless the max MIG size limit is exceeded. Stockout issues are only visible when the VMs fail to come up after the call. Or is the atomic version supposed to predict such stockout issues after all, and fail the API call if so?

This isn't blocking for this PR, but IMO we should really clearly set the expectations here, both internally for the cloud providers and externally for ProvReq users. Even people deeply involved with the feature don't seem to have full clarity on this.

Contributor Author

The strength of the atomicity guarantees depends on the implementation in a given cloud provider. The call itself only expresses an intent.

AtomicIncreaseSize is used for scale-ups which only make sense if the entire delta is provisioned.

IncreaseSize is used when a partial scale-up might still be useful. This was always the default assumption in the autoscaling logic; we're just making it explicit by introducing another option.

In GCE, we can use the ResizeRequest API for a more atomic resize. This still doesn't guarantee that the nodes will register with the KCP, but it helps avoid partial scale-ups in some scenarios, such as stockouts.

The assumption is that the available external APIs are likely to differ substantially between providers, and so will the cases they handle or don't handle. Some providers may have no support for atomic resizes whatsoever, in which case the implementation will be identical to IncreaseSize's.

Because it's really only best effort on the cloud provider's side, we don't want to promise that AtomicIncreaseSize gives any guarantee to the autoscaling logic, and we don't rely on it later - a failed scale-up is handled in exactly the same way.
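
A minimal sketch of the behaviour described above - prefer the atomic call and fall back to the regular one when the provider doesn't implement it (the helper name and the exact error comparison are assumptions, not the actual executor code):

// Illustrative sketch only.
func increaseSize(ng cloudprovider.NodeGroup, delta int, atomic bool) error {
	if atomic {
		err := ng.AtomicIncreaseSize(delta)
		if err != cloudprovider.ErrNotImplemented {
			// Success or a real failure - either way we're done; a failed
			// scale-up is handled the same way as for IncreaseSize.
			return err
		}
		// No atomic resize support - behave exactly like IncreaseSize.
	}
	return ng.IncreaseSize(delta)
}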

// For provisioning requests, unschedulablePods are actually all injected pods. Some may even be schedulable!
actuallyUnschedulablePods, err := o.filterOutSchedulable(unschedulablePods)
if err != nil {
conditions.AddOrUpdateCondition(pr, v1beta1.Provisioned, metav1.ConditionFalse, conditions.CapacityIsNotFoundReason, "Capacity is not found, CA will try to find it later.", metav1.Now())
Contributor

Can we add a different message, explaining that there was an error while filtering out schedulable pods?

Contributor Author

Done

st, err := o.scaleUpOrchestrator.ScaleUp(actuallyUnschedulablePods, nodes, daemonSets, nodeInfos, true)
if err == nil && st.Result == status.ScaleUpSuccessful {
// Happy path - all is well.
conditions.AddOrUpdateCondition(pr, v1beta1.Provisioned, metav1.ConditionTrue, conditions.CapacityIsFoundReason, conditions.CapacityIsFoundMsg, metav1.Now())
Contributor

Let's add a different reason/message describing that the ScaleUp request was successful, e.g. CapacityIsProvisioned.
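
For illustration, the suggested change would look roughly like this (the reason and message strings below just spell out the suggestion; they are not necessarily the constants that were merged):

conditions.AddOrUpdateCondition(pr, v1beta1.Provisioned, metav1.ConditionTrue,
	"CapacityIsProvisioned", "Scale-up request was successful.", metav1.Now())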

Contributor Author

Done

provreqwrapper.TestProvReqOptions{
Name: "autoprovisioningAtomicScaleUpReq",
CPU: "100m",
Memory: "100",
PodCount: int32(5),
Class: v1beta1.ProvisioningClassAtomicScaleUp,
Contributor

Let's rename the provisioning class in the api package

Contributor Author

I'm not sure I understand what you mean. I don't think we were planning to change API here?

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. and removed cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels May 16, 2024
Contributor

@yaroslava-serdiuk yaroslava-serdiuk left a comment

Please update description as well

if err != nil {
return status.UpdateScaleUpError(&status.ScaleUpStatus{}, errors.NewAutoscalerError(errors.InternalError, err.Error()))
}
if pr.Spec.ProvisioningClassName != v1beta1.ProvisioningClassAtomicScaleUp {
Contributor

I mean, since we name ProvisioningClass as bestEffortAtomic, let's rename v1beta1.ProvisioningClassAtomicScaleUp as well for consistency.

Contributor Author

I previously assumed the discussion about 'best effort' and other possible names was only related to the terms used inside the CA implementation, not to change the actual ProvisioningRequest API.

@kisieland - as KEP author, what's your take on API change?

Contributor

It was meant to change the const used in the ProvisioningClassName field.

Contributor Author

Thanks for clearing this up. I don't think we want to mix an API change with the implementation in a single PR though. Let's make the API change (and the KEP update) separately.

Contributor Author

API change submitted in #6854


@aleksandra-malinowska
Contributor Author

/easycla

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels May 20, 2024
@aleksandra-malinowska
Contributor Author

@yaroslava-serdiuk @kisieland I believe all current comments are resolved now

@kisieland
Contributor

/lgtm

@k8s-ci-robot
Contributor

@kisieland: changing LGTM is restricted to collaborators

In response to this:

/lgtm

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@aleksandra-malinowska
Contributor Author

/cc @towca

@yaroslava-serdiuk
Contributor

yaroslava-serdiuk commented May 23, 2024

Could you update the class name in the description, please

@aleksandra-malinowska aleksandra-malinowska changed the title from "Implement provisionig class for atomic-scale-up.kubernetes.io ProvisioningRequests" to "Implement provisionig class for best-effort-atomic-scale-up.kubernetes.io ProvisioningRequests" May 23, 2024
@aleksandra-malinowska
Contributor Author

Could you update the class name in the description, please

Done

@yaroslava-serdiuk
Contributor

For a complete implementation we also need to update the ProvisioningRequestProcessor (https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/processors/provreq/provisioning_request_processor.go), which updates ProvReq conditions.

Seems we can reuse the check-capacity processor: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/provisioningrequest/checkcapacity/processor.go#L53, so we just need to add the besteffortatomic class name to the class check and probably move the processor to another place (not sure about this).
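
For illustration, the class check could be extended roughly like this (a sketch only; v1beta1.ProvisioningClassCheckCapacity is assumed by analogy with the ProvisioningClassAtomicScaleUp constant used in this PR, and the final shape depends on where the processor ends up living):

// Sketch: handle a ProvisioningRequest only if its class is one of the
// two supported classes, instead of check-capacity only.
class := pr.Spec.ProvisioningClassName
if class != v1beta1.ProvisioningClassCheckCapacity &&
	class != v1beta1.ProvisioningClassAtomicScaleUp {
	return // not a class this processor handles
}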

Also, could you add a test scenario with a ProvReq that has a BookingExpired condition, so that ScaleUp is not needed (without the BookingExpired condition, ScaleUp is needed)?


@towca
Collaborator

towca commented May 27, 2024

The code changes LGTM.

I understand that this PR doesn't really add support for the new provisioning class without the processor changes mentioned by @yaroslava-serdiuk, right? Could you update the PR/commit descriptions to capture that (also the suffix in the class name is still wrong I think) before unholding? And squash the commits/let Tide do it.

I'd also really like for the API expectations discussion to result in a follow-up PR clarifying things.

/hold
/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 27, 2024
@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels May 27, 2024

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aleksandra-malinowska, towca

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@aleksandra-malinowska aleksandra-malinowska changed the title from "Implement provisionig class for best-effort-atomic-scale-up.kubernetes.io ProvisioningRequests" to "Implement ProvisioningClass for best-effort-atomic-scale-up.kubernetes.io ProvisioningRequests" May 27, 2024
@aleksandra-malinowska
Contributor Author

aleksandra-malinowska commented May 27, 2024

I understand that this PR doesn't really add support for the new provisioning class without the processor changes mentioned by Yaroslava Serdiuk, right?

It implements the ProvisioningClass (an internal object) for the best-effort-atomic provisioning class, the ProvisioningClass being the part which actually provisions the resources. Naming things is hard.

The missing processor (which will need a refactor to clean up the code) and injector (a trivial change) are tracked as separate items in #6815.

/label tide/merge-method-squash
/unhold

@k8s-ci-robot k8s-ci-robot added tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels May 27, 2024
@k8s-ci-robot k8s-ci-robot merged commit 6c1e3e7 into kubernetes:master May 27, 2024
6 checks passed