Implement ProvisioningClass for best-effort-atomic-scale-up.kubernetes.io ProvisioningRequests #6824
Conversation
@aleksandra-malinowska: GitHub didn't allow me to request PR reviews from the following users: kisieland. Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
// Best effort atomic provisioning class requests scale-up only if it's possible
// to atomically request enough resources for all pods specified in a
// ProvisioningRequest. It's "best effort" as it admits workload immediately
// after successful request, without waiting to verify that resources started.
I don't think the comment is correct. It's best effort because it relies on the AtomicIncreaseSize API, and if that's not implemented it falls back to IncreaseSize. For the IncreaseSize API, your comment is correct.
It's still best effort unless AtomicIncreaseSize is supposed to wait for the resources to be provisioned (for the VMs to start and register as nodes) - a VM can still fail and get stuck.
If AtomicIncreaseSize is supposed to wait for everything to be ready, it would be a serious performance issue. We would have to wait for it asynchronously if we wanted to have a true guarantee.
If AtomicIncreaseSize is supposed to wait for everything to be ready
@kisieland I suppose it is, correct?
AtomicIncreaseSize is not supposed to wait for everything to be ready; the API contract is "add X or do nothing". Whereas IncreaseSize is "add X or whatever you are able to" (anything between 0 and X).
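For reference, a simplified, non-authoritative sketch of the two node group calls being contrasted here (the real cloudprovider interface has many more methods; the comments paraphrase the semantics described in this thread, not the canonical godoc):

```go
// Trimmed sketch of the cluster-autoscaler cloudprovider node group contract,
// limited to the two calls under discussion. Comments paraphrase this thread.
package cloudprovider

type NodeGroup interface {
	// IncreaseSize requests delta additional nodes. The provider may satisfy
	// the request only partially: anything between 0 and delta nodes may
	// actually come up.
	IncreaseSize(delta int) error

	// AtomicIncreaseSize requests delta additional nodes all-or-nothing:
	// either the whole delta is requested or nothing is, with no retries.
	// It does not wait for the VMs to boot or register as nodes.
	AtomicIncreaseSize(delta int) error
}
```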
What does "you are able to" mean in this context?
I have also been under the impression that AtomicIncreaseSize guarantees the VMs coming up reasonably soon if it succeeds. If not that, how is it different from regular IncreaseSize? At least in GCE, most issues with creating instances happen after the increase call (stockouts, quota issues, IP exhaustion, etc.). Is AtomicIncreaseSize supposed to account for them and guarantee they won't happen if the request succeeds?
If not that, how is it different from regular IncreaseSize?
I believe the cloudprovider API within autoscaler doesn't specify that IncreaseSize has to do nothing but return an error if the entire delta is impossible to request. GCE cloudprovider happens to do that, though, so there's not much difference in this case.
The only way AtomicIncreaseSize could provide a stronger guarantee would be if it were a blocking call, and that's infeasible from a performance perspective - if we want to wait for VMs before admitting the workload, we'll need to find a way to do so without blocking other autoscaler operations.
One way to do that would be to add a checkpoint to the ProvisioningRequest, something like TriggeredScaleUp=true, then treat the request as if it were of the check-capacity class, and only update it with Provisioned=true once the capacity is provisioned.
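For illustration only, that checkpoint idea could take roughly this shape. TriggeredScaleUp is a hypothetical condition that does not exist in the v1beta1 API, the helper name is invented, and imports are elided; the condition call mirrors the one used elsewhere in this PR:

```go
// Hypothetical sketch - not part of this PR. "TriggeredScaleUp" is an invented
// condition type used only to illustrate the proposed checkpoint flow.
const TriggeredScaleUp = "TriggeredScaleUp"

func markScaleUpTriggered(pr *provreqwrapper.ProvisioningRequest) {
	// After a successful atomic scale-up request, record only that the
	// scale-up was triggered instead of marking capacity as provisioned.
	conditions.AddOrUpdateCondition(pr, TriggeredScaleUp, metav1.ConditionTrue,
		"ScaleUpTriggered", "Atomic scale-up was requested; waiting for capacity.", metav1.Now())
	// On later loops the request would be treated like a check-capacity one,
	// and Provisioned=true would be set only once capacity is actually observed.
}
```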
The main change of AtomicIncreaseSize vs IncreaseSize is that if the call would result in a partial scale-up (as can happen with IncreaseSize, for example due to stock-outs), then no VMs will be provisioned and the whole scale-up will fail, i.e. no retries will be made.
Regarding the duration of the call, it should provide the answer within a similar time-frame as IncreaseSize, which AFAIK is now also handled asynchronously.
The only way AtomicIncreaseSize could provide a stronger guarantee would be if it were a blocking call, and that's infeasible from a performance perspective - if we want to wait for VMs before admitting the workload, we'll need to find a way to do so without blocking other autoscaler operations.
I guess it could also e.g. provide a stronger guarantee asynchronously with a (much) longer timeout. I think that's what happens in the GKE ProvReq case, right? Just want to understand the differences here.
I believe the cloudprovider API within autoscaler doesn't specify that IncreaseSize has to do nothing but return an error if the entire delta is impossible to request. GCE cloudprovider happens to do that, though, so there's not much difference in this case.
@aleksandra-malinowska Hmmm I see, so the only difference is "the atomic version guarantees that the method fails if it knows that not all VMs can be provisioned before doing the API call"? I have a couple of questions:
Do you really anticipate this being useful in practice, given that most issues impacting VM provisioning can't be detected before doing the increase call? The only thing I can see this catching is exceeding the node group's max limit, which is already covered by the scale-up code after your previous changes. Is this just for completeness with your previous changes?
The main change of AtomicIncreaseSize vs IncreaseSize is that if the call would result in a partial scale-up (as can happen with IncreaseSize, for example due to stock-outs), then no VMs will be provisioned and the whole scale-up will fail, i.e. no retries will be made. Regarding the duration of the call, it should provide the answer within a similar time-frame as IncreaseSize, which AFAIK is now also handled asynchronously.
@kisieland This seems to go against what @aleksandra-malinowska is saying above? I haven't seen IncreaseSize calls failing because of stockouts - at least not in GCE, which is the main target for this feature. In my experience, the increase call itself ~always succeeds unless the max MIG size limit is exceeded. Stockout issues are only visible when the VMs fail to come up after the call. Unless the atomic version is supposed to predict such stockout issues and fail the API call if so after all?
This isn't blocking for this PR, but IMO we should really clearly set the expectations here, both internally for the cloud providers and externally for ProvReq users. Even people deeply involved with the feature don't seem to have full clarity on this.
The strength of the atomicity guarantees depends on the implementation in a given cloud provider. The call itself expresses only an intent. AtomicIncreaseSize is used for scale-ups which only make sense if the entire delta is provisioned. IncreaseSize is used when a partial scale-up might still be useful. This was always the default assumption in the autoscaling logic; we're just making it explicit by introducing another option.
In GCE, we can use the ResizeRequest API for a more atomic resize. This still doesn't guarantee that the nodes will register with the KCP, but it helps avoid partial scale-ups in some scenarios, such as stockouts.
The assumption is that the available external APIs are likely to differ substantially between providers, and so will the cases they handle or don't handle. Some providers may have no support for atomic resizes whatsoever, in which case the implementation will be identical to IncreaseSize's.
Because it's really only best effort on the cloud provider's side, we don't want to promise that AtomicIncreaseSize gives any guarantee to the autoscaling logic, and we don't rely on it later - a failed scale-up is handled the exact same way.
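To make the "identical to IncreaseSize's" case concrete, here is a minimal, purely illustrative sketch of how a provider without a native atomic resize API might implement the method (the receiver types are invented; returning cloudprovider.ErrNotImplemented lets the caller fall back to the regular path):

```go
// Illustrative only - not any real provider's implementation. Types are
// invented; cloudprovider refers to the cluster-autoscaler cloudprovider package.

// Option A: no atomic resize support at all; signal it and let the caller
// fall back to the regular IncreaseSize path.
type noAtomicSupportNodeGroup struct{ /* provider-specific fields */ }

func (ng *noAtomicSupportNodeGroup) AtomicIncreaseSize(delta int) error {
	return cloudprovider.ErrNotImplemented
}

// Option B: delegate to the regular resize, so the "atomicity" is only as
// strong as whatever the underlying IncreaseSize call happens to guarantee.
type delegatingNodeGroup struct{ cloudprovider.NodeGroup }

func (ng *delegatingNodeGroup) AtomicIncreaseSize(delta int) error {
	return ng.IncreaseSize(delta)
}
```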
// For provisioning requests, unschedulablePods are actually all injected pods. Some may even be schedulable!
actuallyUnschedulablePods, err := o.filterOutSchedulable(unschedulablePods)
if err != nil {
	conditions.AddOrUpdateCondition(pr, v1beta1.Provisioned, metav1.ConditionFalse, conditions.CapacityIsNotFoundReason, "Capacity is not found, CA will try to find it later.", metav1.Now())
Can we add a different message, explaining that there was an error while filtering out schedulable pods?
Done
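For reference, the adjusted error path would look roughly like this; the reason and message strings below are placeholders, not the exact text merged in the PR:

```go
// Placeholder strings - this only illustrates the requested change.
actuallyUnschedulablePods, err := o.filterOutSchedulable(unschedulablePods)
if err != nil {
	conditions.AddOrUpdateCondition(pr, v1beta1.Provisioned, metav1.ConditionFalse,
		"FailedToFilterSchedulablePods", // hypothetical reason constant
		"Error while filtering out schedulable pods; CA will retry later.", metav1.Now())
	return status.UpdateScaleUpError(&status.ScaleUpStatus{},
		errors.NewAutoscalerError(errors.InternalError, err.Error()))
}
```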
st, err := o.scaleUpOrchestrator.ScaleUp(actuallyUnschedulablePods, nodes, daemonSets, nodeInfos, true)
if err == nil && st.Result == status.ScaleUpSuccessful {
	// Happy path - all is well.
	conditions.AddOrUpdateCondition(pr, v1beta1.Provisioned, metav1.ConditionTrue, conditions.CapacityIsFoundReason, conditions.CapacityIsFoundMsg, metav1.Now())
Let's add a different reason/message that describes that the ScaleUp request was successful, e.g. CapacityIsProvisioned.
Done
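Similarly, the happy path with a scale-up-specific reason could look like this; the reason and message are placeholders for the constants actually added in the PR:

```go
// Placeholder reason/message - the PR introduces dedicated constants for this.
st, err := o.scaleUpOrchestrator.ScaleUp(actuallyUnschedulablePods, nodes, daemonSets, nodeInfos, true)
if err == nil && st.Result == status.ScaleUpSuccessful {
	// Happy path - the atomic scale-up request was accepted.
	conditions.AddOrUpdateCondition(pr, v1beta1.Provisioned, metav1.ConditionTrue,
		"CapacityIsProvisioned",
		"Scale-up request was successful; capacity is being provisioned.", metav1.Now())
}
```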
provreqwrapper.TestProvReqOptions{
	Name:     "autoprovisioningAtomicScaleUpReq",
	CPU:      "100m",
	Memory:   "100",
	PodCount: int32(5),
	Class:    v1beta1.ProvisioningClassAtomicScaleUp,
Let's rename the provisioning class in the api package.
I'm not sure I understand what you mean. I don't think we were planning to change API here?
Please update the description as well.
if err != nil {
	return status.UpdateScaleUpError(&status.ScaleUpStatus{}, errors.NewAutoscalerError(errors.InternalError, err.Error()))
}
if pr.Spec.ProvisioningClassName != v1beta1.ProvisioningClassAtomicScaleUp {
I mean, since we name the ProvisioningClass bestEffortAtomic, let's rename v1beta1.ProvisioningClassAtomicScaleUp as well for consistency.
I previously assumed the discussion about 'best effort' and other possible names was only related to the terms used inside the CA implementation, not to change the actual ProvisioningRequest API.
@kisieland - as KEP author, what's your take on API change?
It was meant to change the const used in the ProvisioningClassName field.
Thanks for clearing this up. I don't think we want to mix an API change with the implementation in a single PR though. Let's make the API change (and the KEP update) separately.
API change submitted in #6854
/easycla
@yaroslava-serdiuk @kisieland I believe all current comments are resolved now
/lgtm
@kisieland: changing LGTM is restricted to collaborators. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/cc @towca
Could you update the class name in the description, please?
Done
For complete implementation we also need to update the ProvisioningRequestProcessor (https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/processors/provreq/provisioning_request_processor.go), which updates ProvReq conditions. It seems we can reuse the check-capacity processor: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/provisioningrequest/checkcapacity/processor.go#L53, so we just need to add the besteffortatomic class name to the class check and probably move the processor to another place (not sure about this).
Also, could you add a test scenario with a ProvReq that has a BookingExpired condition, so that ScaleUp is not needed (without the BookingExpired condition, ScaleUp is needed)?
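As a rough illustration of the suggested reuse (the loop variable and the assumption that a ProvisioningClassCheckCapacity constant exists alongside ProvisioningClassAtomicScaleUp are mine; the final refactor and the processor's location may differ):

```go
// Sketch of widening the check-capacity processor's class filter so it also
// maintains conditions for best-effort atomic requests. Illustrative only;
// provReqs stands for the list of requests the processor iterates over.
for _, provReq := range provReqs {
	className := provReq.Spec.ProvisioningClassName
	if className != v1beta1.ProvisioningClassCheckCapacity &&
		className != v1beta1.ProvisioningClassAtomicScaleUp {
		continue // other classes are not handled by this processor
	}
	// ...existing condition bookkeeping (BookingExpired, Provisioned, etc.)...
}
```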
The code changes LGTM. I understand that this PR doesn't really add support for the new provisioning class without the processor changes mentioned by @yaroslava-serdiuk, right? Could you update the PR/commit descriptions to capture that (also the suffix in the class name is still wrong, I think) before unholding? And squash the commits/let Tide do it. I'd also really like the API expectations discussion to result in a follow-up PR clarifying things.
/hold
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: aleksandra-malinowska, towca
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
It implements the ProvisioningClass (internal object) for a provisioning class. The missing processor (which will need a refactor to clean up the code) & injector (a trivial change) are tracked as separate items in #6815.
/label tide/merge-method-squash
…s.io ProvisioningRequests (kubernetes#6824) * Add support for atomic scale-up provisioning request class * review fixes
/cherry-pick cluster-autoscaler-release-1.30
@yaroslava-serdiuk: #6824 failed to apply on top of branch "cluster-autoscaler-release-1.30":
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/cherry-pick cluster-autoscaler-release-1.30
@yaroslava-serdiuk: #6824 failed to apply on top of branch "cluster-autoscaler-release-1.30":
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/cherry-pick cluster-autoscaler-release-1.30
@yaroslava-serdiuk: new pull request created: #7030
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
What type of PR is this?
/kind feature
What this PR does / why we need it:
Implements provisioning class for best-effort-atomic-scale-up..x-k8s.io ProvisioningRequests: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/proposals/provisioning-request.md
Which issue(s) this PR fixes:
Issue #6815
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:
/cc @yaroslava-serdiuk @kisieland @towca