
Implement ProvisioningClass for best-effort-atomic-scale-up.kubernetes.io ProvisioningRequests #6824

Merged
merged 2 commits into kubernetes:master on May 27, 2024

Conversation

aleksandra-malinowska
Contributor

@aleksandra-malinowska aleksandra-malinowska commented May 14, 2024

What type of PR is this?

/kind feature

What this PR does / why we need it:

Implements provisioning class for best-effort-atomic-scale-up.x-k8s.io ProvisioningRequests: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/proposals/provisioning-request.md

Which issue(s) this PR fixes:

Issue #6815

Does this PR introduce a user-facing change?

Added support for `best-effort-atomic-scale-up.x-k8s.io` ProvisioningRequests

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

- [KEP]: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/proposals/provisioning-request.md

/cc @yaroslava-serdiuk @kisieland @towca

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label May 14, 2024
@k8s-ci-robot k8s-ci-robot requested a review from towca May 14, 2024 10:58
@k8s-ci-robot
Contributor

@aleksandra-malinowska: GitHub didn't allow me to request PR reviews from the following users: kisieland.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

What type of PR is this?

/kind feature

What this PR does / why we need it:

Implements provisioning class for atomic-scale-up.kubernetes.io ProvisioningRequests: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/proposals/provisioning-request.md

Which issue(s) this PR fixes:

Issue #6815

Special notes for your reviewer:

Please ONLY consider the second commit, the first one is being reviewed in #6821

Does this PR introduce a user-facing change?

Added support for `atomic-scale-up.kubernetes.io` ProvisioningRequests

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

- [KEP]: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/proposals/provisioning-request.md

/cc @yaroslava-serdiuk @kisieland @towca

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. area/cluster-autoscaler size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels May 14, 2024
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels May 16, 2024
// Best effort atomic provisioning class requests scale-up only if it's possible
// to atomically request enough resources for all pods specified in a
// ProvisioningRequest. It's "best effort" as it admits workload immediately
// after successful request, without waiting to verify that resources started.
Contributor

I don't think the comment is correct. It's best effort because it relies on the AtomicIncreaseSize API, and if that's not implemented it falls back to IncreaseSize. And for the IncreaseSize API your comment is correct.

Contributor Author

It's still best effort unless AtomicIncreaseSize is supposed to wait for the resources to be provisioned (for the VMs to start and register as nodes) - a VM can still fail and get stuck.

If AtomicIncreaseSize is supposed to wait for everything to be ready, it would be a serious performance issue. We would have to wait for it asynchronously if we wanted to have a true guarantee.

Contributor

If AtomicIncreaseSize is supposed to wait for everything to be ready

@kisieland I suppose it is, correct?

Contributor

AtomicIncreaseSize is not supposed to wait for everything to be ready; the API should either add X or do nothing.
Whereas IncreaseSize is add X or whatever you are able to (anything between 0 and X).
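
For reference, here is a rough sketch of the contract being described, against the two NodeGroup methods this thread is about (the comments paraphrase this discussion and are not the actual interface documentation):

type NodeGroup interface {
	// IncreaseSize requests delta more nodes. The provider may end up
	// satisfying the request only partially (anything between 0 and delta).
	IncreaseSize(delta int) error

	// AtomicIncreaseSize requests delta more nodes as all-or-nothing:
	// either the whole delta is provisioned or nothing is.
	AtomicIncreaseSize(delta int) error
}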

Collaborator

What does "you are able to" mean in this context?

I have also been under the impression that AtomicIncreaseSize guarantees the VMs coming up reasonably soon if it succeeds. If not that, how is it different than regular IncreaseSize? At least in GCE, most issues with creating instances happen after the increase call (stockouts, quota issues, ip exhaustion, etc.). Is AtomicIncreaseSize supposed to account for them and guarantee they won't happen if the request succeeds?

Contributor Author

If not that, how is it different than regular IncreaseSize?

I believe the cloudprovider API within autoscaler doesn't specify that IncreaseSize has to do nothing but return an error if the entire delta is impossible to request. GCE cloudprovider happens to do that, though, so there's not much difference in this case.

The only way AtomicIncreaseSize could provide a stronger guarantee would be if it were a blocking call, and that's infeasible from performance perspective - if we want to wait for VMs before admitting workload, we'll need to find a way to do so without blocking other autoscaler ops.

One way to do that would be to add a checkpoint to the ProvisioningRequest, something like TriggeredScaleUp=true, then treat the request as if it were of the check-capacity class, and only update it with Provisioned=true once the capacity is provisioned.
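
A rough sketch of that checkpoint idea, using the same conditions helper that appears elsewhere in this PR (the TriggeredScaleUp condition name and the reason/message strings are hypothetical, taken only from the comment above):

// Hypothetical only: record that the scale-up was requested, then let a
// check-capacity-style flow set Provisioned=true once the capacity shows up.
conditions.AddOrUpdateCondition(pr, "TriggeredScaleUp", metav1.ConditionTrue,
	"ScaleUpRequested", "Scale-up was requested, waiting for capacity.", metav1.Now())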

Contributor

The main difference between AtomicIncreaseSize and IncreaseSize is that if the call would result in a partial scale-up (as can happen with IncreaseSize, for example due to stock-outs), then no VMs will be provisioned and the whole scale-up will fail, i.e. no retries will be made.

Regarding the duration of the call, it should provide an answer within a similar time frame as IncreaseSize, which AFAIK is now also handled asynchronously.

Collaborator

The only way AtomicIncreaseSize could provide a stronger guarantee would be if it were a blocking call, and that's infeasible from performance perspective - if we want to wait for VMs before admitting workload, we'll need to find a way to do so without blocking other autoscaler ops.

I guess it could also e.g. provide a stronger guarantee asynchronously with a (much) longer timeout. I think that's what happens in the GKE ProvReq case right? Just want to understand the differences here.

I believe the cloudprovider API within autoscaler doesn't specify that IncreaseSize has to do nothing but return an error if the entire delta is impossible to request. GCE cloudprovider happens to do that, though, so there's not much difference in this case.

@aleksandra-malinowska Hmmm I see, so the only difference is "atomic version guarantees that the method fails if it knows that not all VMs can be provisioned before doing the API call"? I have a couple of questions:

Do you really anticipate it being useful in practice, given that most issues impacting VM provisioning can't be detected before doing the increase call? The only thing I can see this catching is exceeding the node group's max limit, which is already covered by the scale-up code after your previous changes. Is this just for completeness with your previous changes?

The main difference between AtomicIncreaseSize and IncreaseSize is that if the call would result in a partial scale-up (as can happen with IncreaseSize, for example due to stock-outs), then no VMs will be provisioned and the whole scale-up will fail, i.e. no retries will be made.

Regarding the duration of the call, it should provide an answer within a similar time frame as IncreaseSize, which AFAIK is now also handled asynchronously.

@kisieland This seems to go against what @aleksandra-malinowska is saying above? I haven't seen IncreaseSize calls failing because of stockouts - at least not in GCE, which is the main target for this feature. In my experience, the increase call itself ~always succeeds unless the max MIG size limit is exceeded. Stockout issues are only visible when the VMs fail to come up after the call. Or is the atomic version supposed to predict such stockout issues after all, and fail the API call if so?

This isn't blocking for this PR, but IMO we should really clearly set the expectations here, both internally for the cloud providers and externally for ProvReq users. Even people deeply involved with the feature don't seem to have full clarity on this.

Contributor Author

The strength of the atomicity guarantees depends on the implementation in a given cloud provider. The call itself only expresses an intent.

AtomicIncreaseSize is used for scale-ups which only make sense if the entire delta is provisioned.

IncreaseSize is used when a partial scale-up might still be useful. This was always the default assumption in the autoscaling logic; we're just making it explicit by introducing another option.

In GCE, we can use the ResizeRequest API for a more atomic resize. This still doesn't guarantee that the nodes will register with the KCP, but it helps avoid partial scale-ups in some scenarios, such as stockouts.

The assumption is that the available external APIs are likely to differ substantially between providers, and so will the cases they handle or don't handle. Some providers may have no support for atomic resizes whatsoever, in which case the implementation will be identical to IncreaseSize's.

Because it's really only best effort on the cloud provider's side, we don't want to promise that AtomicIncreaseSize gives any guarantee to the autoscaling logic, and we don't rely on it later - a failed scale-up is handled in exactly the same way.
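
A minimal sketch of the behaviour described above - prefer the atomic call and fall back to the regular one when the provider doesn't implement it (the helper name and the exact error comparison are assumptions, not the actual executor code):

// Illustrative sketch only.
func increaseSize(ng cloudprovider.NodeGroup, delta int, atomic bool) error {
	if atomic {
		err := ng.AtomicIncreaseSize(delta)
		if err != cloudprovider.ErrNotImplemented {
			// Success or a real failure - either way we're done; a failed
			// scale-up is handled the same way as for IncreaseSize.
			return err
		}
		// No atomic resize support - behave exactly like IncreaseSize.
	}
	return ng.IncreaseSize(delta)
}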

// For provisioning requests, unschedulablePods are actually all injected pods. Some may even be schedulable!
actuallyUnschedulablePods, err := o.filterOutSchedulable(unschedulablePods)
if err != nil {
conditions.AddOrUpdateCondition(pr, v1beta1.Provisioned, metav1.ConditionFalse, conditions.CapacityIsNotFoundReason, "Capacity is not found, CA will try to find it later.", metav1.Now())
Contributor

Can we add a different message, explaining that there was an error while filtering out schedulable pods?

Contributor Author

Done

st, err := o.scaleUpOrchestrator.ScaleUp(actuallyUnschedulablePods, nodes, daemonSets, nodeInfos, true)
if err == nil && st.Result == status.ScaleUpSuccessful {
// Happy path - all is well.
conditions.AddOrUpdateCondition(pr, v1beta1.Provisioned, metav1.ConditionTrue, conditions.CapacityIsFoundReason, conditions.CapacityIsFoundMsg, metav1.Now())
Contributor

Let's add a different reason/message describing that the ScaleUp request was successful, e.g. CapacityIsProvisioned.
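
For illustration, the suggested change would look roughly like this (the reason and message strings below just spell out the suggestion; they are not necessarily the constants that were merged):

conditions.AddOrUpdateCondition(pr, v1beta1.Provisioned, metav1.ConditionTrue,
	"CapacityIsProvisioned", "Scale-up request was successful.", metav1.Now())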

Contributor Author

Done

provreqwrapper.TestProvReqOptions{
Name: "autoprovisioningAtomicScaleUpReq",
CPU: "100m",
Memory: "100",
PodCount: int32(5),
Class: v1beta1.ProvisioningClassAtomicScaleUp,
Contributor

Let's rename the provisioning class in the api package

Contributor Author

I'm not sure I understand what you mean. I don't think we were planning to change API here?

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. and removed cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels May 16, 2024
Contributor

@yaroslava-serdiuk yaroslava-serdiuk left a comment

Please update description as well

if err != nil {
return status.UpdateScaleUpError(&status.ScaleUpStatus{}, errors.NewAutoscalerError(errors.InternalError, err.Error()))
}
if pr.Spec.ProvisioningClassName != v1beta1.ProvisioningClassAtomicScaleUp {
Contributor

I mean, since we name ProvisioningClass as bestEffortAtomic, let's rename v1beta1.ProvisioningClassAtomicScaleUp as well for consistency.

Contributor Author

I previously assumed the discussion about 'best effort' and other possible names was only related to the terms used inside the CA implementation, not to change the actual ProvisioningRequest API.

@kisieland - as KEP author, what's your take on API change?

Contributor

It was meant to change the const used in the ProvisioningClassName field.

Contributor Author

Thanks for clearing this up. I don't think we want to mix an API change with the implementation in a single PR though. Let's make the API change (and the KEP update) separately.

Contributor Author

API change submitted in #6854


@aleksandra-malinowska
Contributor Author

/easycla

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels May 20, 2024
@aleksandra-malinowska
Contributor Author

@yaroslava-serdiuk @kisieland I believe all current comments are resolved now

@kisieland
Contributor

/lgtm

@k8s-ci-robot
Contributor

@kisieland: changing LGTM is restricted to collaborators

In response to this:

/lgtm

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@aleksandra-malinowska
Contributor Author

/cc @towca

@yaroslava-serdiuk
Contributor

yaroslava-serdiuk commented May 23, 2024

Could you update the class name in the description, please

@aleksandra-malinowska aleksandra-malinowska changed the title from "Implement provisionig class for atomic-scale-up.kubernetes.io ProvisioningRequests" to "Implement provisionig class for best-effort-atomic-scale-up.kubernetes.io ProvisioningRequests" May 23, 2024
@aleksandra-malinowska
Contributor Author

Could you update the class name in the description, please

Done

@yaroslava-serdiuk
Contributor

For a complete implementation we also need to update the ProvisioningRequestProcessor (https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/processors/provreq/provisioning_request_processor.go), which updates ProvReq conditions.

Seems we can reuse the check-capacity processor: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/provisioningrequest/checkcapacity/processor.go#L53, so we just need to add the besteffortatomic class name to the class check and probably move the processor to another place (not sure about this).
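
For illustration, the class check could be extended roughly like this (a sketch only; v1beta1.ProvisioningClassCheckCapacity is assumed by analogy with the ProvisioningClassAtomicScaleUp constant used in this PR, and the final shape depends on where the processor ends up living):

// Sketch: handle a ProvisioningRequest only if its class is one of the
// two supported classes, instead of check-capacity only.
class := pr.Spec.ProvisioningClassName
if class != v1beta1.ProvisioningClassCheckCapacity &&
	class != v1beta1.ProvisioningClassAtomicScaleUp {
	return // not a class this processor handles
}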

Also, could you add a test scenario with a ProvReq that has a BookingExpired condition, so that ScaleUp is not needed (without the BookingExpired condition, ScaleUp is needed)?


@towca
Collaborator

towca commented May 27, 2024

The code changes LGTM.

I understand that this PR doesn't really add support for the new provisioning class without the processor changes mentioned by @yaroslava-serdiuk, right? Could you update the PR/commit descriptions to capture that (also the suffix in the class name is still wrong I think) before unholding? And squash the commits/let Tide do it.

I'd also really like for the API expectations discussion to result in a follow-up PR clarifying things.

/hold
/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 27, 2024
@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels May 27, 2024

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aleksandra-malinowska, towca

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@aleksandra-malinowska aleksandra-malinowska changed the title from "Implement provisionig class for best-effort-atomic-scale-up.kubernetes.io ProvisioningRequests" to "Implement ProvisioningClass for best-effort-atomic-scale-up.kubernetes.io ProvisioningRequests" May 27, 2024
@aleksandra-malinowska
Contributor Author

aleksandra-malinowska commented May 27, 2024

I understand that this PR doesn't really add support for the new provisioning class without the processor changes mentioned by Yaroslava Serdiuk, right?

It implements the ProvisioningClass (an internal object) for the best-effort-atomic provisioning class, the ProvisioningClass being the part which actually provisions the resources. Naming things is hard.

The missing processor (which will need a refactor to clean up the code) and injector (a trivial change) are tracked as separate items in #6815.

/label tide/merge-method-squash
/unhold

@k8s-ci-robot k8s-ci-robot added tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels May 27, 2024
@k8s-ci-robot k8s-ci-robot merged commit 6c1e3e7 into kubernetes:master May 27, 2024
6 checks passed