Unavailable Offerings cache should respond to OperationNotAllowed by caching instances by SKU Series #192
Labels
area/instance-types
Issues or PRs related to instance types
area/unavailable-offerings
Issues or PRs related to unavailable offerings cache
kind/bug
Categorizes issue or PR as related to a bug.
needs-triage
Indicates an issue or PR lacks a `triage/foo` label and requires one.
Tell us about your request
In the Karpenter provisioning loop, we mark an instance type as unavailable when we encounter SKU series errors. The problem is that we cache the individual instance type rather than the SKU series.
The dpldsv5 series has its own cores quota associated with the subscription. But rather than caching the series itself, and with it all instances that are part of that series, we cache only the instance type.
For the dpldsv5 series these are the instance types
In Karpenter's current implementation, we see quota errors for several of these instance types. We fail multiple times before trying a different SKU series, even though no other instance type in the exhausted series could succeed.
"Operation could not be completed as it results in exceeding approved standardDplsv5Family Cores quota"
This is the error we get back from ARM for each failed instance provisioning. We hit it over and over because the quota is shared by all instance types that belong to that SKU series.
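As a rough illustration of the request, the family name could be pulled out of the error message and used as the cache key, so that one quota failure marks every instance type in that family as unavailable. This is a minimal sketch and not the provider's actual code: `quotaFamily`, `familyCache`, and the regular expression are hypothetical names, and the pattern assumes the message always has the shape quoted above.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
	"sync"
	"time"
)

// quotaFamilyRe captures the family name from messages shaped like the one
// quoted in this issue. The exact wording is an assumption.
var quotaFamilyRe = regexp.MustCompile(`exceeding approved (\w+) Cores quota`)

// quotaFamily returns the lowercased family named in a quota error message.
// The boolean is false when the message does not match the pattern.
func quotaFamily(msg string) (string, bool) {
	m := quotaFamilyRe.FindStringSubmatch(msg)
	if m == nil {
		return "", false
	}
	return strings.ToLower(m[1]), true
}

// familyCache is a hypothetical TTL cache keyed by quota family rather than
// by instance type.
type familyCache struct {
	mu      sync.Mutex
	ttl     time.Duration
	now     func() time.Time // injectable clock
	expires map[string]time.Time
}

func newFamilyCache(ttl time.Duration) *familyCache {
	return &familyCache{ttl: ttl, now: time.Now, expires: map[string]time.Time{}}
}

// MarkUnavailable records that the whole family is out of quota until the TTL passes.
func (c *familyCache) MarkUnavailable(family string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.expires[strings.ToLower(family)] = c.now().Add(c.ttl)
}

// IsUnavailable reports whether the family was marked and the entry has not expired.
func (c *familyCache) IsUnavailable(family string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	exp, ok := c.expires[strings.ToLower(family)]
	return ok && c.now().Before(exp)
}

func main() {
	msg := "Operation could not be completed as it results in exceeding approved standardDplsv5Family Cores quota"
	cache := newFamilyCache(time.Hour)
	if family, ok := quotaFamily(msg); ok {
		cache.MarkUnavailable(family)
	}
	fmt.Println(cache.IsUnavailable("standardDplsv5Family"))
}
```

With a cache like this, the instance type filter would look up the family of each candidate instance type before attempting to create the VM. How an instance type maps to its quota family is left out here and would need to come from the SKU data.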
Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
In extreme cases, because we do not cache by SKU series, we hit the nodeclaim registration TTL before we manage to provision the VM.
In this particular case, we hit the error for the instances in the standardDLSv5Family and then the standardDASv5Family. There are 6 instance types at similar price points across these two families, so we go through all of them.
There are only two series here. If we cached by series, we could get the node to provision in two attempts. Instead, we try 6 times and the registration TTL is hit.
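The difference between the two caching strategies can be sketched with a small simulation. Everything here is illustrative: the instance type names are placeholders, the even 3/3 split across the two families is an assumption (the issue only says there are 6 instance types in two families), and the third family with available quota is hypothetical.

```go
package main

import "fmt"

// offering pairs an instance type with the quota family it draws from.
type offering struct {
	instanceType string
	family       string
}

// failedAttempts walks the candidates in order and counts the create calls
// rejected for quota before one succeeds. exhausted lists the families with
// no quota left. When cacheByFamily is true, one rejection marks the whole
// family as unavailable; otherwise only the rejected instance type is skipped.
func failedAttempts(candidates []offering, exhausted map[string]bool, cacheByFamily bool) int {
	unavailableFamilies := map[string]bool{}
	failures := 0
	for _, o := range candidates {
		if cacheByFamily && unavailableFamilies[o.family] {
			continue // skipped without calling the API
		}
		if exhausted[o.family] {
			failures++
			unavailableFamilies[o.family] = true
			continue
		}
		return failures // this create call succeeds
	}
	return failures
}

func main() {
	// Placeholder instance type names; the family split is an assumption.
	candidates := []offering{
		{"dls-type-1", "standardDLSv5Family"},
		{"dls-type-2", "standardDLSv5Family"},
		{"dls-type-3", "standardDLSv5Family"},
		{"das-type-1", "standardDASv5Family"},
		{"das-type-2", "standardDASv5Family"},
		{"das-type-3", "standardDASv5Family"},
		{"other-type-1", "hypotheticalFamilyWithQuota"},
	}
	exhausted := map[string]bool{
		"standardDLSv5Family": true,
		"standardDASv5Family": true,
	}
	fmt.Println("by instance type:", failedAttempts(candidates, exhausted, false))
	fmt.Println("by family:", failedAttempts(candidates, exhausted, true))
}
```

Under these assumptions, caching by instance type produces 6 failed attempts before a usable family is reached, while caching by family produces 2.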
Quota errors
Are you currently working around this issue?
The general workaround is to let Karpenter fail 2-3 times and move on to a different SKU series. But in the case of nodeclaim `general-purpose-hcdgt` this wasn't enough. Since we hit this error for 2 separate SKU series, we weren't able to remediate in time and the node didn't provision.
Additional Context
No response
Attachments
No response