
Priority expander in cluster-autoscaler does not seem to pick the highest priority node #2359

Closed
OvervCW opened this issue May 31, 2021 · 23 comments


@OvervCW

OvervCW commented May 31, 2021

What happened: I initially created a cluster with the random expander and then changed it to the priority expander with the following command:

az aks update --subscription xxx --resource-group xxx --name xxx --cluster-autoscaler-profile expander=priority

The cluster-autoscaler logs seem to indicate that it has processed this change:

I0531 14:44:21.464365       1 autoscaler_builder.go:96] Updating autoscaling options to: {10 0.5 0.5 10m0s 20m0s 1000 320000 0 6871947673600000 0 [] [] binpacking priority false false 600 15m0s 45 3 true /etc/kubernetes/provider/azure.json azure [1:30:Delete:aks-cpu-29141059-vmss:{"dedicated-pool":"cpu-workers","dedicated-pool-group":"workers"}|dedicated-pool=cpu-workers:NoSchedule 0:30:Delete:aks-cpuspot-29141059-vmss:{"dedicated-pool":"cpu-workers","dedicated-pool-group":"workers","kubernetes.azure.com/scalesetpriority":"spot"}|dedicated-pool=cpu-workers:NoSchedule,kubernetes.azure.com/scalesetpriority=spot:NoSchedule 1:30:Delete:aks-gpu-29141059-vmss:{"dedicated-pool":"gpu-workers","dedicated-pool-group":"workers"}|dedicated-pool=gpu-workers:NoSchedule 0:30:Delete:aks-gpuspot-29141059-vmss:{"dedicated-pool":"gpu-workers","dedicated-pool-group":"workers","kubernetes.azure.com/scalesetpriority":"spot"}|dedicated-pool=gpu-workers:NoSchedule,kubernetes.azure.com/scalesetpriority=spot:NoSchedule] true 10m0s 10s 3m0s 30 0.1 50 2m0s true false kube-system  false 15 5m0s -10 false 0s 10 3s [] false /etc/kubernetes/kubeconfig/kubeconfig.yaml}
I0531 14:44:21.464428       1 dynamic_autoscaler.go:80] Dynamic reconfiguration completed: updatedConfig=&{[1:30:Delete:aks-cpu-29141059-vmss 0:30:Delete:aks-cpuspot-29141059-vmss 1:30:Delete:aks-gpu-29141059-vmss 0:30:Delete:aks-gpuspot-29141059-vmss] {10s 10m 10s 3m 10m 20m 0.5 600 false priority 0s 10 3 45 false true 15m}}
I0531 14:44:21.464568       1 node_instances_cache.go:156] Start refreshing cloud provider node instances cache
I0531 14:44:21.464618       1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 7.4µs

I have 4 node pools (CPU spot/non-spot and GPU spot/non-spot) and I set up the priority configmap to prefer scaling the (cheap) CPU spot node pool first:

data:
  priorities: |2
        1:
          - ".*"
        10:
          - ".*gpu.*"
        15:
          - ".*gpuspot.*"
        20:
          - ".*cpu.*"
        25:
          - ".*cpuspot.*"

These patterns should match the VM scale sets (VMSS) associated with the node pools, which are named:

  • aks-cpu-29141059-vmss
  • aks-cpuspot-29141059-vmss
  • aks-gpu-29141059-vmss
  • aks-gpuspot-29141059-vmss
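
For reference, the full ConfigMap I am applying looks roughly like this (as far as I understand, the priority expander expects a ConfigMap with exactly this name in the kube-system namespace, and higher numbers mean higher priority):

apiVersion: v1
kind: ConfigMap
metadata:
  # The priority expander looks up a ConfigMap with exactly this name.
  name: cluster-autoscaler-priority-expander
  namespace: kube-system
data:
  # Higher keys win; each pattern is a regular expression matched against the node group (VMSS) name.
  priorities: |-
    1:
      - ".*"
    10:
      - ".*gpu.*"
    15:
      - ".*gpuspot.*"
    20:
      - ".*cpu.*"
    25:
      - ".*cpuspot.*"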

However, when I create a pod that can run on any of these 4 pools (with a toleration for spot nodes), the cluster autoscaler always chooses to scale the GPU node pool:

I0531 14:22:07.169191       1 scale_up.go:322] Pod xxx/yyy is unschedulable
I0531 14:22:07.169222       1 scale_up.go:322] Pod xxx/yyy is unschedulable
I0531 14:22:07.169230       1 scale_up.go:322] Pod xxx/yyy is unschedulable
I0531 14:22:07.169236       1 scale_up.go:322] Pod xxx/yyy is unschedulable
I0531 14:22:07.173681       1 scale_up.go:452] Best option to resize: aks-gpu-29141059-vmss
I0531 14:22:07.173705       1 scale_up.go:456] Estimated 2 nodes needed in aks-gpu-29141059-vmss
I0531 14:22:07.173734       1 scale_up.go:569] Final scale-up plan: [{aks-gpu-29141059-vmss 3->5 (max: 30)}]
I0531 14:22:07.173759       1 scale_up.go:658] Scale-up: setting group aks-gpu-29141059-vmss size to 5

If I force the pod to schedule on the CPU spot pool by using a nodeSelector, the cluster-autoscaler does choose to scale the cpuspot node pool, so the issue is clearly not that that particular node pool is unavailable.

What you expected to happen: I expect the cluster-autoscaler to always choose the cpuspot pool, since it matches the highest priority (25).

How to reproduce it (as minimally and precisely as possible): See above: create the four node pools, apply the priority ConfigMap, and create a pod that can run on any of them.
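
A trimmed sketch of the pod spec I am testing with (the name, image, and resource request are placeholders; the tolerations mirror the node pool taints visible in the autoscaler options above, and the commented-out nodeSelector is what I add to force the cpuspot pool):

apiVersion: v1
kind: Pod
metadata:
  name: expander-test            # placeholder name
spec:
  containers:
    - name: main
      image: nginx               # placeholder image
      resources:
        requests:
          cpu: "1"               # placeholder request, just enough to trigger a scale-up
  tolerations:
    # Tolerate the dedicated-pool taint carried by all four pools.
    - key: dedicated-pool
      operator: Exists
      effect: NoSchedule
    # Tolerate the spot taint so the two spot pools are candidates as well.
    - key: kubernetes.azure.com/scalesetpriority
      operator: Equal
      value: spot
      effect: NoSchedule
  # Uncomment to force the cpuspot pool (labels taken from the VMSS labels above):
  # nodeSelector:
  #   dedicated-pool: cpu-workers
  #   kubernetes.azure.com/scalesetpriority: "spot"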

Anything else we need to know?: -

Environment:

  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1", GitCommit:"5e58841cce77d4bc13713ad2b91fa0d961e69192", GitTreeState:"clean", BuildDate:"2021-05-12T14:18:45Z", GoVersion:"go1.16.4", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.14", GitCommit:"4d62230f10af41cec2851b5b113874f6d2098297", GitTreeState:"clean", BuildDate:"2021-03-22T23:01:55Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}
  • Size of cluster (how many worker nodes are in the cluster?) N/A
  • General description of workloads in the cluster (e.g. HTTP microservices, Java app, Ruby on Rails, machine learning, etc.)
  • Others:
@ghost ghost added the triage label May 31, 2021
@ghost

ghost commented May 31, 2021

Hi OvervCW, AKS bot here 👋
Thank you for posting on the AKS repo; I'll do my best to get a kind human from the AKS team to assist you.

I might be just a bot, but I'm told my suggestions are normally quite good, as such:

  1. If this case is urgent, please open a Support Request so that our 24/7 support team may help you faster.
  2. Please abide by the AKS repo Guidelines and Code of Conduct.
  3. If you're having an issue, could it be one described in the AKS Troubleshooting guides or AKS Diagnostics?
  4. Make sure you're subscribed to the AKS Release Notes to keep up to date with all that's new on AKS.
  5. Make sure there isn't a duplicate of this issue already reported. If there is, feel free to close this one and '+1' the existing issue.
  6. If you have a question, do take a look at our AKS FAQ. We place the most common ones there!

@marwanad

@OvervCW what happens if you remove the first priority?

        1:
          - ".*"

I suspect that, because you're using a wildcard matcher, you're being hit by an upstream bug that we haven't back-ported to older releases yet (we will this week).
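
In other words, keep only the explicit tiers, so the data section would look something like this:

data:
  # Wildcard tier removed; only the explicit patterns remain.
  priorities: |-
    10:
      - ".*gpu.*"
    15:
      - ".*gpuspot.*"
    20:
      - ".*cpu.*"
    25:
      - ".*cpuspot.*"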

@OvervCW
Author

OvervCW commented Jun 1, 2021

I removed it, and it started out promising: the cpuspot pool was chosen immediately for the next scaling operation. However, when I scaled up the deployment once more, it decided to use the gpu pool again, so the choice still seems somewhat random. I do think this is caused by the upstream bug you mention, since that bug can still cause any of the pools to be picked as the best match.

Is there any way to manually update the cluster-autoscaler? Or would that require upgrading to a new minor Kubernetes version?

@marwanad

marwanad commented Jun 2, 2021

We will prepare a back-port for supported K8s versions, and it should roll out with next week's release. No action is needed on your end. I'll keep you posted on this thread.

@marwanad marwanad self-assigned this Jun 3, 2021
@qpetraroia
Contributor

Hi @OvervCW,

This fix should be rolled out on Monday. Thanks!

@qpetraroia qpetraroia added the resolution/answer-provided Provided answer to issue, question or feedback. label Jun 9, 2021
@ghost

ghost commented Jun 11, 2021

Thanks for reaching out. I'm closing this issue as it was marked with "Answer Provided" and it hasn't had activity for 2 days.

@ghost ghost closed this as completed Jun 11, 2021
@lindhe

lindhe commented Jun 11, 2021

The release has not been rolled out to all zones yet, so I think it should stay open a little while longer.

@OvervCW
Author

OvervCW commented Jun 15, 2021

@lindhe Is it possible to see the rollout status somewhere?

@lindhe

lindhe commented Jun 15, 2021

No, unfortunately not. I spoke to Microsoft support regarding this (I had run into the same issue) and we identified that the patch had not been rolled out to all zones yet. I asked whether the rollout status could be viewed anywhere, but they told me that information is currently only available via an internal system on their end.

@lindhe

lindhe commented Jun 21, 2021

@qpetraroia Can we reopen the issue since it's not fully rolled out yet?

@OvervCW
Author

OvervCW commented Jul 9, 2021

I just confirmed with support that this release is still not fully rolled out.

@marwanad

marwanad commented Jul 9, 2021

@OvervCW which region and which k8s version are you using? This should have completed today.

@lindhe

lindhe commented Jul 9, 2021

Until very recently, it didn't work in West Europe. I have not checked today.

EDIT: I checked just now, and it's still broken. I've got a reasonable workflow set up for debugging it. I'm attaching instructions in case someone wants to try it.

expander-profile.zip

@marwanad marwanad reopened this Jul 9, 2021
@marwanad

marwanad commented Jul 9, 2021

@lindhe thanks for the detailed report. I'll give it a shot and report back here. This could be a different issue from the one addressed by the upstream bug fix mentioned above.

@ghost

ghost commented Jul 11, 2021

Thanks for reaching out. I'm closing this issue as it was marked with "Answer Provided" and it hasn't had activity for 2 days.

@ghost ghost closed this as completed Jul 11, 2021
@marwanad marwanad removed the resolution/answer-provided Provided answer to issue, question or feedback. label Jul 11, 2021
@marwanad marwanad reopened this Jul 11, 2021
@marwanad

@OvervCW @lindhe I just tested with your setup and it's always scaling up the correct pool for me. Can you test again? You should also see a log line from the priority package, but that seems to be missing in your report. Are you creating the configmap properly in the kube-system namespace?

I0712 02:42:51.678873       1 scale_up.go:322] Pod default/nginx-7df96bd847-2xgcc is unschedulable
I0712 02:42:51.678903       1 scale_up.go:322] Pod default/nginx-7df96bd847-h5ts5 is unschedulable
I0712 02:42:51.679020       1 scale_up.go:322] Pod default/nginx-7df96bd847-s6r9h is unschedulable
I0712 02:42:51.679026       1 scale_up.go:322] Pod default/nginx-7df96bd847-nf2f6 is unschedulable
I0712 02:42:51.680873       1 priority.go:166] priority expander: aks-spot-xxx-vmss chosen as the highest available

Using 1.19 in southcentralus fwiw.

We can connect over email to get to the bottom of this.

@OvervCW
Author

OvervCW commented Jul 12, 2021

@marwanad Is it possible that it's still broken on 1.18 but fixed on 1.19? That's the cluster version I'm currently running. I'm about to upgrade to 1.19 and then 1.20 this week, so I'll check whether the problem is fixed then.

@marwanad

marwanad commented Jul 12, 2021

@OvervCW I just verified with 1.18 as well. You can send me your cluster FQDN at the email listed on my GitHub profile or {myGitHubAlias}@microsoft.com and I can take a closer look.

edit: I was able to repro after a few tries. I think I have a clue about what's happening. I'll update later.

@marwanad

marwanad commented Jul 12, 2021

@OvervCW @lindhe I think I found the issue; it was a combination of both problems. The latter had to do with hot-swapping the expander configs. We'll get a fix rolled out. Sorry about that!

@lindhe

lindhe commented Jul 19, 2021

I'm on summer leave right now, so I cannot test again until I'm back at work. I think I'll have to set up another live session with our Microsoft support rep to verify that I'm not doing something wrong.

@marwanad

marwanad commented Aug 3, 2021

@OvervCW @lindhe the fix should be out now. I'll mark this as resolved. Feel free to re-open if you're still facing issues.

@lindhe

lindhe commented Aug 9, 2021

Hi, I'm back again now! Thank you for your patience.

I tried it again, and now it seems to work properly! I tested with both 1.19.11 and 1.20.7; I was able to scale all the way up to the node pool limit, and scaling followed the priority order. I guess the rollout was just slower than expected initially.

We can close this issue as far as I'm concerned. Well done.

@OvervCW
Author

OvervCW commented Aug 9, 2021

Can confirm that it works for us as well, on our existing 1.20.7 cluster!

@OvervCW OvervCW closed this as completed Aug 9, 2021
@Azure Azure locked as resolved and limited conversation to collaborators Sep 8, 2021