Unavailable Offerings cache should respond to OperationNotAllowed by caching instances by SKU Series #192

Open
Bryce-Soghigian opened this issue Mar 11, 2024 · 0 comments
Labels
  • area/instance-types: Issues or PRs related to instance types
  • area/unavailable-offerings: Issues or PRs related to unavailable offerings cache
  • kind/bug: Categorizes issue or PR as related to a bug.
  • needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.

Comments

Collaborator

Bryce-Soghigian commented Mar 11, 2024

Tell us about your request

In the Karpenter provisioning loop, we mark an instance type as unavailable when we encounter SKU series errors. The problem is that we cache the instance type rather than the SKU series.

The Dplsv5 series has its own cores quota associated with the subscription. But rather than caching the series itself, and with it all instance types that are part of that series, we cache only the individual instance type.
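
To make the idea concrete, here is a minimal sketch of an unavailable-offerings cache keyed by quota family instead of instance type. It is not Karpenter's implementation: the `UnavailableFamilies` class, the `available_instance_types` helper, the TTL handling, and the instance-type-to-family mapping passed in by the caller are all assumptions made for illustration.

```python
import time
from typing import Callable, Dict, List


class UnavailableFamilies:
    """Hypothetical cache of quota families that recently failed, with a TTL."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._expires_at: Dict[str, float] = {}

    def mark_unavailable(self, family: str) -> None:
        # Family names are compared case-insensitively (an assumption).
        self._expires_at[family.lower()] = self._clock() + self._ttl

    def is_unavailable(self, family: str) -> bool:
        key = family.lower()
        expires = self._expires_at.get(key)
        if expires is None:
            return False
        if self._clock() >= expires:
            # Entry has aged out, so the family may be retried.
            del self._expires_at[key]
            return False
        return True


def available_instance_types(candidates: List[str],
                             family_of: Dict[str, str],
                             cache: UnavailableFamilies) -> List[str]:
    """Drop every candidate whose whole family is currently marked unavailable."""
    return [it for it in candidates if not cache.is_unavailable(family_of[it])]
```

With a cache like this, a single quota failure for one size would filter out every other size that shares the same family quota until the entry expires.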

For the Dplsv5 series, these are the instance types:

| Size | vCPU | Memory (GiB) | Temp storage (SSD) (GiB) | Max data disks | Max uncached disk throughput (IOPS/MBps) | Max burst uncached disk throughput (IOPS/MBps) | Max NICs | Max network bandwidth (Mbps) |
|---|---|---|---|---|---|---|---|---|
| Standard_D2pls_v5 | 2 | 4 | Remote Storage Only | 4 | 3750/85 | 10000/1200 | 2 | 12500 |
| Standard_D4pls_v5 | 4 | 8 | Remote Storage Only | 8 | 6400/145 | 20000/1200 | 2 | 12500 |
| Standard_D8pls_v5 | 8 | 16 | Remote Storage Only | 16 | 12800/290 | 20000/1200 | 4 | 12500 |
| Standard_D16pls_v5 | 16 | 32 | Remote Storage Only | 32 | 25600/600 | 40000/1200 | 4 | 12500 |
| Standard_D32pls_v5 | 32 | 64 | Remote Storage Only | 32 | 51200/865 | 80000/2000 | 8 | 16000 |
| Standard_D48pls_v5 | 48 | 96 | Remote Storage Only | 32 | 76800/1315 | 80000/3000 | 8 | 24000 |
| Standard_D64pls_v5 | 64 | 128 | Remote Storage Only | 32 | 80000/1735 | 80000/3000 | 8 | 40000 |

In Karpenter's current implementation, we see quota errors for several of these instance types. We fail multiple times before trying a different SKU series, even though those attempts would never work.

"Operation could not be completed as it results in exceeding approved standardDplsv5Family Cores quota"

This is the error we get back from ARM for each failed instance provisioning attempt. We hit it over and over, because the quota is shared by all instance types that belong to that SKU series.
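
Since the family name is embedded in the error text, one way to learn which series to cache is to pull it out of the message. The sketch below assumes the wording shown above stays stable and that the family token always ends in `Family`; the function name `quota_family_from_error` is made up for this example.

```python
import re
from typing import Optional

# Matches the family token in messages shaped like the one quoted above.
# Requiring the "Family" suffix is an assumption that keeps other quota
# messages from being mistaken for a per-family quota error.
_QUOTA_FAMILY = re.compile(r"exceeding approved (\w+Family) Cores quota")


def quota_family_from_error(message: str) -> Optional[str]:
    """Return the quota family named in the error message, or None."""
    match = _QUOTA_FAMILY.search(message)
    return match.group(1) if match else None
```

Matching on message text is brittle, so anything built on it would need a fallback to the existing per-instance-type behavior when no family can be extracted.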

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

In extreme cases, because we do not cache correctly by SKU series, we hit the nodeclaim registration TTL before we get to provision the VM.

In this particular case, we hit the error for the instance types in the standardDLSv5Family, then the standardDASv5Family. There are 6 instance types at similar price points across these two families, so we go through all of them.

~/dev/karp/karpenter-core (bsoghigian/disruption-by-method-poc*) » cat file.xt| grep "general-purpose-hcdg" | grep -E "Selected instance type"
{"level":"INFO","time":"2024-03-09T03:30:06.696Z","logger":"controller.nodeclaim.lifecycle","message":"Selected instance type Standard_D64ls_v5","commit":"c32dbc1-dirty","nodeclaim":"general-purpose-hcdgt"}
{"level":"INFO","time":"2024-03-09T03:30:17.284Z","logger":"controller.nodeclaim.lifecycle","message":"Selected instance type Standard_D64as_v5","commit":"c32dbc1-dirty","nodeclaim":"general-purpose-hcdgt"}
{"level":"INFO","time":"2024-03-09T03:30:29.511Z","logger":"controller.nodeclaim.lifecycle","message":"Selected instance type Standard_D64_v3","commit":"c32dbc1-dirty","nodeclaim":"general-purpose-hcdgt"}
{"level":"INFO","time":"2024-03-09T03:43:31.703Z","logger":"controller.nodeclaim.lifecycle","message":"Selected instance type Standard_D64ls_v5","commit":"c32dbc1-dirty","nodeclaim":"general-purpose-hcdgt"}
{"level":"INFO","time":"2024-03-09T03:43:40.525Z","logger":"controller.nodeclaim.lifecycle","message":"Selected instance type Standard_D64as_v5","commit":"c32dbc1-dirty","nodeclaim":"general-purpose-hcdgt"}
{"level":"INFO","time":"2024-03-09T03:43:49.312Z","logger":"controller.nodeclaim.lifecycle","message":"Selected instance type Standard_D64_v3","commit":"c32dbc1-dirty","nodeclaim":"general-purpose-hcdgt"}

Only two series are involved here (standardDLSv5Family and standardDASv5Family). If we cached by series, we could get the node to provision after two attempts. Instead we try 6 times, and the nodeclaim registration TTL is hit before the VM is provisioned.
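
The difference in attempt counts can be illustrated with a small simulation. This is only a sketch: the candidate list, its ordering, and the family assigned to each instance type are hypothetical stand-ins chosen to mirror the "6 instance types in two families" situation, and `failed_attempts_before_success` is not a Karpenter function.

```python
from typing import Dict, List, Optional, Set

# Hypothetical candidates in the order they would be tried.
CANDIDATES: List[str] = [
    "Standard_D64ls_v5", "Standard_D48ls_v5", "Standard_D32ls_v5",
    "Standard_D64as_v5", "Standard_D48as_v5", "Standard_D32as_v5",
    "Standard_D64_v3",
]

# Assumed instance-type-to-family mapping, for illustration only.
FAMILY_OF: Dict[str, str] = {
    "Standard_D64ls_v5": "standardDLSv5Family",
    "Standard_D48ls_v5": "standardDLSv5Family",
    "Standard_D32ls_v5": "standardDLSv5Family",
    "Standard_D64as_v5": "standardDASv5Family",
    "Standard_D48as_v5": "standardDASv5Family",
    "Standard_D32as_v5": "standardDASv5Family",
    "Standard_D64_v3": "standardDv3Family",
}


def failed_attempts_before_success(candidates: List[str],
                                   family_of: Dict[str, str],
                                   exhausted: Set[str],
                                   cache_by_family: bool) -> Optional[int]:
    """Count failed create attempts before one succeeds (None if none does)."""
    blocked: Set[str] = set()
    failures = 0
    for instance_type in candidates:
        family = family_of[instance_type]
        key = family if cache_by_family else instance_type
        if key in blocked:
            continue  # skipped thanks to the cache, so no attempt is made
        if family not in exhausted:
            return failures
        failures += 1
        blocked.add(key)
    return None
```

With both v5 families marked as exhausted, the per-instance-type strategy fails 6 times before reaching a usable candidate, while the per-family strategy fails twice.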

Quota errors

cat file.xt| grep "general-purpose-hcdg" | grep -E "Cores"
{"level":"ERROR","time":"2024-03-09T03:30:14.102Z","logger":"controller.nodeclaim.lifecycle","message":"Creating virtual machine \"aks-general-purpose-hcdgt\" failed: PUT https://management.azure.com/subscriptions/redacted/resourceGroups/MC_redacted_RG/providers/Microsoft.Compute/virtualMachines/aks-general-purpose-hcdgt\n--------------------------------------------------------------------------------\nRESPONSE 409: 409 Conflict\nERROR CODE: OperationNotAllowed\n--------------------------------------------------------------------------------\n{\n  \"error\": {\n    \"code\": \"OperationNotAllowed\",\n    \"message\": \"Operation could not be completed as it results in exceeding approved standardDLSv5Family Cores quota. Additional details - Deployment Model: Resource Manager, Location: eastus, Current Limit: 100, Current Usage: 56, Additional Required: 64, (Minimum) New Limit Required: 120. Submit a request for Quota increase at https://aka.ms/ProdportalCRP/#blade/Microsoft_Azure_Capacity/UsageAndQuota.ReactView/Parameters/%7B%22subscriptionId%22:%22redacted%22,%22command%22:%22openQuotaApprovalBlade%22,%22quotas%22:[%7B%22location%22:%22eastus%22,%22providerId%22:%22Microsoft.Compute%22,%22resourceName%22:%22standardDLSv5Family%22,%22quotaRequest%22:%7B%22properties%22:%7B%22limit%22:120,%22unit%22:%22Count%22,%22name%22:%7B%22value%22:%22standardDLSv5Family%22%7D%7D%7D%7D]%7D by specifying parameters listed in the ‘Details’ section for deployment to succeed. Please read more about quota limits at https://docs.microsoft.com/en-us/azure/azure-supportability/per-vm-quota-requests\"\n  }\n}\n--------------------------------------------------------------------------------\n","commit":"c32dbc1-dirty","nodeclaim":"general-purpose-hcdgt"}
{"level":"ERROR","time":"2024-03-09T03:30:14.102Z","logger":"controller.nodeclaim.lifecycle","message":"virtualMachine.BeginCreateOrUpdate for VM \"aks-general-purpose-hcdgt\" failed: PUT https://management.azure.com/subscriptions/redacted/resourceGroups/MC_redacted_RG/providers/Microsoft.Compute/virtualMachines/aks-general-purpose-hcdgt\n--------------------------------------------------------------------------------\nRESPONSE 409: 409 Conflict\nERROR CODE: OperationNotAllowed\n--------------------------------------------------------------------------------\n{\n  \"error\": {\n    \"code\": \"OperationNotAllowed\",\n    \"message\": \"Operation could not be completed as it results in exceeding approved standardDLSv5Family Cores quota. Additional details - Deployment Model: Resource Manager, Location: eastus, Current Limit: 100, Current Usage: 56, Additional Required: 64, (Minimum) New Limit Required: 120. Submit a request for Quota increase at https://aka.ms/ProdportalCRP/#blade/Microsoft_Azure_Capacity/UsageAndQuota.ReactView/Parameters/%7B%22subscriptionId%22:%22redacted%22,%22command%22:%22openQuotaApprovalBlade%22,%22quotas%22:[%7B%22location%22:%22eastus%22,%22providerId%22:%22Microsoft.Compute%22,%22resourceName%22:%22standardDLSv5Family%22,%22quotaRequest%22:%7B%22properties%22:%7B%22limit%22:120,%22unit%22:%22Count%22,%22name%22:%7B%22value%22:%22standardDLSv5Family%22%7D%7D%7D%7D]%7D by specifying parameters listed in the ‘Details’ section for deployment to succeed. Please read more about quota limits at https://docs.microsoft.com/en-us/azure/azure-supportability/per-vm-quota-requests\"\n  }\n}\n--------------------------------------------------------------------------------\n","commit":"c32dbc1-dirty","nodeclaim":"general-purpose-hcdgt"}
{"level":"ERROR","time":"2024-03-09T03:30:25.286Z","logger":"controller.nodeclaim.lifecycle","message":"Creating virtual machine \"aks-general-purpose-hcdgt\" failed: PUT https://management.azure.com/subscriptions/redacted/resourceGroups/MC_redacted_RG/providers/Microsoft.Compute/virtualMachines/aks-general-purpose-hcdgt\n--------------------------------------------------------------------------------\nRESPONSE 409: 409 Conflict\nERROR CODE: OperationNotAllowed\n--------------------------------------------------------------------------------\n{\n  \"error\": {\n    \"code\": \"OperationNotAllowed\",\n    \"message\": \"Operation could not be completed as it results in exceeding approved standardDASv5Family Cores quota. Additional details - Deployment Model: Resource Manager, Location: eastus, Current Limit: 100, Current Usage: 96, Additional Required: 64, (Minimum) New Limit Required: 160. Submit a request for Quota increase at https://aka.ms/ProdportalCRP/#blade/Microsoft_Azure_Capacity/UsageAndQuota.ReactView/Parameters/%7B%22subscriptionId%22:%22redacted%22,%22command%22:%22openQuotaApprovalBlade%22,%22quotas%22:[%7B%22location%22:%22eastus%22,%22providerId%22:%22Microsoft.Compute%22,%22resourceName%22:%22standardDASv5Family%22,%22quotaRequest%22:%7B%22properties%22:%7B%22limit%22:160,%22unit%22:%22Count%22,%22name%22:%7B%22value%22:%22standardDASv5Family%22%7D%7D%7D%7D]%7D by specifying parameters listed in the ‘Details’ section for deployment to succeed. Please read more about quota limits at https://docs.microsoft.com/en-us/azure/azure-supportability/per-vm-quota-requests\"\n  }\n}\n--------------------------------------------------------------------------------\n","commit":"c32dbc1-dirty","nodeclaim":"general-purpose-hcdgt"}
{"level":"ERROR","time":"2024-03-09T03:30:25.286Z","logger":"controller.nodeclaim.lifecycle","message":"virtualMachine.BeginCreateOrUpdate for VM \"aks-general-purpose-hcdgt\" failed: PUT https://management.azure.com/subscriptions/redacted/resourceGroups/MC_redacted_RG/providers/Microsoft.Compute/virtualMachines/aks-general-purpose-hcdgt\n--------------------------------------------------------------------------------\nRESPONSE 409: 409 Conflict\nERROR CODE: OperationNotAllowed\n--------------------------------------------------------------------------------\n{\n  \"error\": {\n    \"code\": \"OperationNotAllowed\",\n    \"message\": \"Operation could not be completed as it results in exceeding approved standardDASv5Family Cores quota. Additional details - Deployment Model: Resource Manager, Location: eastus, Current Limit: 100, Current Usage: 96, Additional Required: 64, (Minimum) New Limit Required: 160. Submit a request for Quota increase at https://aka.ms/ProdportalCRP/#blade/Microsoft_Azure_Capacity/UsageAndQuota.ReactView/Parameters/%7B%22subscriptionId%22:%22redacted%22,%22command%22:%22openQuotaApprovalBlade%22,%22quotas%22:[%7B%22location%22:%22eastus%22,%22providerId%22:%22Microsoft.Compute%22,%22resourceName%22:%22standardDASv5Family%22,%22quotaRequest%22:%7B%22properties%22:%7B%22limit%22:160,%22unit%22:%22Count%22,%22name%22:%7B%22value%22:%22standardDASv5Family%22%7D%7D%7D%7D]%7D by specifying parameters listed in the ‘Details’ section for deployment to succeed. Please read more about quota limits at https://docs.microsoft.com/en-us/azure/azure-supportability/per-vm-quota-requests\"\n  }\n}\n--------------------------------------------------------------------------------\n","commit":"c32dbc1-dirty","nodeclaim":"general-purpose-hcdgt"}
{"level":"ERROR","time":"2024-03-09T03:43:37.797Z","logger":"controller.nodeclaim.lifecycle","message":"Creating virtual machine \"aks-general-purpose-hcdgt\" failed: PUT https://management.azure.com/subscriptions/redacted/resourceGroups/MC_redacted_RG/providers/Microsoft.Compute/virtualMachines/aks-general-purpose-hcdgt\n--------------------------------------------------------------------------------\nRESPONSE 409: 409 Conflict\nERROR CODE: OperationNotAllowed\n--------------------------------------------------------------------------------\n{\n  \"error\": {\n    \"code\": \"OperationNotAllowed\",\n    \"message\": \"Operation could not be completed as it results in exceeding approved standardDLSv5Family Cores quota. Additional details - Deployment Model: Resource Manager, Location: eastus, Current Limit: 100, Current Usage: 56, Additional Required: 64, (Minimum) New Limit Required: 120. Submit a request for Quota increase at https://aka.ms/ProdportalCRP/#blade/Microsoft_Azure_Capacity/UsageAndQuota.ReactView/Parameters/%7B%22subscriptionId%22:%22redacted%22,%22command%22:%22openQuotaApprovalBlade%22,%22quotas%22:[%7B%22location%22:%22eastus%22,%22providerId%22:%22Microsoft.Compute%22,%22resourceName%22:%22standardDLSv5Family%22,%22quotaRequest%22:%7B%22properties%22:%7B%22limit%22:120,%22unit%22:%22Count%22,%22name%22:%7B%22value%22:%22standardDLSv5Family%22%7D%7D%7D%7D]%7D by specifying parameters listed in the ‘Details’ section for deployment to succeed. Please read more about quota limits at https://docs.microsoft.com/en-us/azure/azure-supportability/per-vm-quota-requests\"\n  }\n}\n--------------------------------------------------------------------------------\n","commit":"c32dbc1-dirty","nodeclaim":"general-purpose-hcdgt"}
{"level":"ERROR","time":"2024-03-09T03:43:37.797Z","logger":"controller.nodeclaim.lifecycle","message":"virtualMachine.BeginCreateOrUpdate for VM \"aks-general-purpose-hcdgt\" failed: PUT https://management.azure.com/subscriptions/redacted/resourceGroups/MC_redacted_RG/providers/Microsoft.Compute/virtualMachines/aks-general-purpose-hcdgt\n--------------------------------------------------------------------------------\nRESPONSE 409: 409 Conflict\nERROR CODE: OperationNotAllowed\n--------------------------------------------------------------------------------\n{\n  \"error\": {\n    \"code\": \"OperationNotAllowed\",\n    \"message\": \"Operation could not be completed as it results in exceeding approved standardDLSv5Family Cores quota. Additional details - Deployment Model: Resource Manager, Location: eastus, Current Limit: 100, Current Usage: 56, Additional Required: 64, (Minimum) New Limit Required: 120. Submit a request for Quota increase at https://aka.ms/ProdportalCRP/#blade/Microsoft_Azure_Capacity/UsageAndQuota.ReactView/Parameters/%7B%22subscriptionId%22:%22redacted%22,%22command%22:%22openQuotaApprovalBlade%22,%22quotas%22:[%7B%22location%22:%22eastus%22,%22providerId%22:%22Microsoft.Compute%22,%22resourceName%22:%22standardDLSv5Family%22,%22quotaRequest%22:%7B%22properties%22:%7B%22limit%22:120,%22unit%22:%22Count%22,%22name%22:%7B%22value%22:%22standardDLSv5Family%22%7D%7D%7D%7D]%7D by specifying parameters listed in the ‘Details’ section for deployment to succeed. Please read more about quota limits at https://docs.microsoft.com/en-us/azure/azure-supportability/per-vm-quota-requests\"\n  }\n}\n--------------------------------------------------------------------------------\n","commit":"c32dbc1-dirty","nodeclaim":"general-purpose-hcdgt"}
{"level":"ERROR","time":"2024-03-09T03:43:47.142Z","logger":"controller.nodeclaim.lifecycle","message":"Creating virtual machine \"aks-general-purpose-hcdgt\" failed: PUT https://management.azure.com/subscriptions/redacted/resourceGroups/MC_redacted_RG/providers/Microsoft.Compute/virtualMachines/aks-general-purpose-hcdgt\n--------------------------------------------------------------------------------\nRESPONSE 409: 409 Conflict\nERROR CODE: OperationNotAllowed\n--------------------------------------------------------------------------------\n{\n  \"error\": {\n    \"code\": \"OperationNotAllowed\",\n    \"message\": \"Operation could not be completed as it results in exceeding approved standardDASv5Family Cores quota. Additional details - Deployment Model: Resource Manager, Location: eastus, Current Limit: 100, Current Usage: 96, Additional Required: 64, (Minimum) New Limit Required: 160. Submit a request for Quota increase at https://aka.ms/ProdportalCRP/#blade/Microsoft_Azure_Capacity/UsageAndQuota.ReactView/Parameters/%7B%22subscriptionId%22:%22redacted%22,%22command%22:%22openQuotaApprovalBlade%22,%22quotas%22:[%7B%22location%22:%22eastus%22,%22providerId%22:%22Microsoft.Compute%22,%22resourceName%22:%22standardDASv5Family%22,%22quotaRequest%22:%7B%22properties%22:%7B%22limit%22:160,%22unit%22:%22Count%22,%22name%22:%7B%22value%22:%22standardDASv5Family%22%7D%7D%7D%7D]%7D by specifying parameters listed in the ‘Details’ section for deployment to succeed. Please read more about quota limits at https://docs.microsoft.com/en-us/azure/azure-supportability/per-vm-quota-requests\"\n  }\n}\n--------------------------------------------------------------------------------\n","commit":"c32dbc1-dirty","nodeclaim":"general-purpose-hcdgt"}
{"level":"ERROR","time":"2024-03-09T03:43:47.143Z","logger":"controller.nodeclaim.lifecycle","message":"virtualMachine.BeginCreateOrUpdate for VM \"aks-general-purpose-hcdgt\" failed: PUT https://management.azure.com/subscriptions/redacted/resourceGroups/MC_redacted_RG/providers/Microsoft.Compute/virtualMachines/aks-general-purpose-hcdgt\n--------------------------------------------------------------------------------\nRESPONSE 409: 409 Conflict\nERROR CODE: OperationNotAllowed\n--------------------------------------------------------------------------------\n{\n  \"error\": {\n    \"code\": \"OperationNotAllowed\",\n    \"message\": \"Operation could not be completed as it results in exceeding approved standardDASv5Family Cores quota. Additional details - Deployment Model: Resource Manager, Location: eastus, Current Limit: 100, Current Usage: 96, Additional Required: 64, (Minimum) New Limit Required: 160. Submit a request for Quota increase at https://aka.ms/ProdportalCRP/#blade/Microsoft_Azure_Capacity/UsageAndQuota.ReactView/Parameters/%7B%22subscriptionId%22:%22redacted%22,%22command%22:%22openQuotaApprovalBlade%22,%22quotas%22:[%7B%22location%22:%22eastus%22,%22providerId%22:%22Microsoft.Compute%22,%22resourceName%22:%22standardDASv5Family%22,%22quotaRequest%22:%7B%22properties%22:%7B%22limit%22:160,%22unit%22:%22Count%22,%22name%22:%7B%22value%22:%22standardDASv5Family%22%7D%7D%7D%7D]%7D by specifying parameters listed in the ‘Details’ section for deployment to succeed. Please read more about quota limits at https://docs.microsoft.com/en-us/azure/azure-supportability/per-vm-quota-requests\"\n  }\n}\n--------------------------------------------------------------------------------\n","commit":"c32dbc1-dirty","nodeclaim":"general-purpose-hcdgt"}
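
The numbers in these messages are internally consistent: for standardDLSv5Family, 56 cores in use plus 64 requested is 120, above the limit of 100; for standardDASv5Family, 96 plus 64 is 160. A small sketch of pulling those fields out and checking the arithmetic follows. The `QuotaDetails` type and `parse_quota_details` function are hypothetical names, and the regex assumes the message format seen in the logs above.

```python
import re
from dataclasses import dataclass
from typing import Optional

_DETAILS = re.compile(
    r"Current Limit: (\d+), Current Usage: (\d+), "
    r"Additional Required: (\d+), \(Minimum\) New Limit Required: (\d+)"
)


@dataclass(frozen=True)
class QuotaDetails:
    limit: int        # approved cores for the family
    usage: int        # cores already in use
    additional: int   # cores the failed request needed
    new_limit: int    # minimum limit that would let the request succeed

    @property
    def remaining_cores(self) -> int:
        """Cores still free in the family quota."""
        return self.limit - self.usage

    @property
    def shortfall(self) -> int:
        """Cores by which the request exceeds the current limit."""
        return self.usage + self.additional - self.limit


def parse_quota_details(message: str) -> Optional[QuotaDetails]:
    """Extract the quota numbers from an error message, or None if absent."""
    match = _DETAILS.search(message)
    if match is None:
        return None
    limit, usage, additional, new_limit = (int(g) for g in match.groups())
    return QuotaDetails(limit, usage, additional, new_limit)
```

For the standardDLSv5Family error this gives 44 remaining cores and a shortfall of 20; for standardDASv5Family, 4 remaining cores and a shortfall of 60.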

Are you currently working around this issue?

The general workaround is to let Karpenter fail 2-3 times and move on to a different SKU series. But in the case of nodeclaim general-purpose-hcdgt this wasn't enough. Since we hit this error for 2 separate SKU series, we weren't able to remediate in time and the node didn't provision.

Additional Context

No response

Attachments

No response

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
Bryce-Soghigian changed the title from "Unavailable Offerings cache should respond to OperationNotAllowed by caching instances for SKU Series" to "Unavailable Offerings cache should respond to OperationNotAllowed by caching instances by SKU Series" on Mar 11, 2024
Bryce-Soghigian added the labels area/unavailable-offerings, area/instance-types, kind/bug, and needs-triage on Mar 11, 2024