
Priority expander in cluster-autoscaler does not seem to pick the highest priority node #2359

Closed
OvervCW opened this issue May 31, 2021 · 23 comments


@OvervCW

OvervCW commented May 31, 2021

What happened: I initially created a cluster with the random expander and then changed it to the priority expander with the following command:

az aks update --subscription xxx --resource-group xxx --name xxx --cluster-autoscaler-profile expander=priority

The cluster-autoscaler logs seem to indicate that it has processed this change:

I0531 14:44:21.464365       1 autoscaler_builder.go:96] Updating autoscaling options to: {10 0.5 0.5 10m0s 20m0s 1000 320000 0 6871947673600000 0 [] [] binpacking priority false false 600 15m0s 45 3 true /etc/kubernetes/provider/azure.json azure [1:30:Delete:aks-cpu-29141059-vmss:{"dedicated-pool":"cpu-workers","dedicated-pool-group":"workers"}|dedicated-pool=cpu-workers:NoSchedule 0:30:Delete:aks-cpuspot-29141059-vmss:{"dedicated-pool":"cpu-workers","dedicated-pool-group":"workers","kubernetes.azure.com/scalesetpriority":"spot"}|dedicated-pool=cpu-workers:NoSchedule,kubernetes.azure.com/scalesetpriority=spot:NoSchedule 1:30:Delete:aks-gpu-29141059-vmss:{"dedicated-pool":"gpu-workers","dedicated-pool-group":"workers"}|dedicated-pool=gpu-workers:NoSchedule 0:30:Delete:aks-gpuspot-29141059-vmss:{"dedicated-pool":"gpu-workers","dedicated-pool-group":"workers","kubernetes.azure.com/scalesetpriority":"spot"}|dedicated-pool=gpu-workers:NoSchedule,kubernetes.azure.com/scalesetpriority=spot:NoSchedule] true 10m0s 10s 3m0s 30 0.1 50 2m0s true false kube-system  false 15 5m0s -10 false 0s 10 3s [] false /etc/kubernetes/kubeconfig/kubeconfig.yaml}
I0531 14:44:21.464428       1 dynamic_autoscaler.go:80] Dynamic reconfiguration completed: updatedConfig=&{[1:30:Delete:aks-cpu-29141059-vmss 0:30:Delete:aks-cpuspot-29141059-vmss 1:30:Delete:aks-gpu-29141059-vmss 0:30:Delete:aks-gpuspot-29141059-vmss] {10s 10m 10s 3m 10m 20m 0.5 600 false priority 0s 10 3 45 false true 15m}}
I0531 14:44:21.464568       1 node_instances_cache.go:156] Start refreshing cloud provider node instances cache
I0531 14:44:21.464618       1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 7.4µs

I have 4 node pools (CPU spot/non-spot and GPU spot/non-spot) and I set up the priority configmap to prefer scaling the (cheap) CPU spot node pool first:

data:
  priorities: |2
        1:
          - ".*"
        10:
          - ".*gpu.*"
        15:
          - ".*gpuspot.*"
        20:
          - ".*cpu.*"
        25:
          - ".*cpuspot.*"

These patterns should match the VM scale sets (VMSS) associated with the node pools, which are named:

  • aks-cpu-29141059-vmss
  • aks-cpuspot-29141059-vmss
  • aks-gpu-29141059-vmss
  • aks-gpuspot-29141059-vmss
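
For reference, the full ConfigMap I am applying looks roughly like this (as far as I understand, the priority expander expects a ConfigMap with exactly this name in the kube-system namespace, and higher numbers mean higher priority):

apiVersion: v1
kind: ConfigMap
metadata:
  # The priority expander looks up a ConfigMap with exactly this name.
  name: cluster-autoscaler-priority-expander
  namespace: kube-system
data:
  # Higher keys win; each pattern is a regular expression matched against the node group (VMSS) name.
  priorities: |-
    1:
      - ".*"
    10:
      - ".*gpu.*"
    15:
      - ".*gpuspot.*"
    20:
      - ".*cpu.*"
    25:
      - ".*cpuspot.*"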

However, when I create a pod that can run on any of these 4 pools (with a toleration for spot nodes), the cluster autoscaler always chooses to scale the GPU node pool:

I0531 14:22:07.169191       1 scale_up.go:322] Pod xxx/yyy is unschedulable
I0531 14:22:07.169222       1 scale_up.go:322] Pod xxx/yyy is unschedulable
I0531 14:22:07.169230       1 scale_up.go:322] Pod xxx/yyy is unschedulable
I0531 14:22:07.169236       1 scale_up.go:322] Pod xxx/yyy is unschedulable
I0531 14:22:07.173681       1 scale_up.go:452] Best option to resize: aks-gpu-29141059-vmss
I0531 14:22:07.173705       1 scale_up.go:456] Estimated 2 nodes needed in aks-gpu-29141059-vmss
I0531 14:22:07.173734       1 scale_up.go:569] Final scale-up plan: [{aks-gpu-29141059-vmss 3->5 (max: 30)}]
I0531 14:22:07.173759       1 scale_up.go:658] Scale-up: setting group aks-gpu-29141059-vmss size to 5

If I force the pod to schedule on the CPU spot pool by using a nodeSelector, the cluster-autoscaler does choose to scale the cpuspot node pool, so the issue is clearly not that that particular node pool is unavailable.

What you expected to happen: I expect the cluster-autoscaler to always choose the cpuspot pool, since it matches the highest priority (25).

How to reproduce it (as minimally and precisely as possible): See above: create the four node pools, apply the priority ConfigMap, and create a pod that can run on any of them.
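
A trimmed sketch of the pod spec I am testing with (the name, image, and resource request are placeholders; the tolerations mirror the node pool taints visible in the autoscaler options above, and the commented-out nodeSelector is what I add to force the cpuspot pool):

apiVersion: v1
kind: Pod
metadata:
  name: expander-test            # placeholder name
spec:
  containers:
    - name: main
      image: nginx               # placeholder image
      resources:
        requests:
          cpu: "1"               # placeholder request, just enough to trigger a scale-up
  tolerations:
    # Tolerate the dedicated-pool taint carried by all four pools.
    - key: dedicated-pool
      operator: Exists
      effect: NoSchedule
    # Tolerate the spot taint so the two spot pools are candidates as well.
    - key: kubernetes.azure.com/scalesetpriority
      operator: Equal
      value: spot
      effect: NoSchedule
  # Uncomment to force the cpuspot pool (labels taken from the VMSS labels above):
  # nodeSelector:
  #   dedicated-pool: cpu-workers
  #   kubernetes.azure.com/scalesetpriority: "spot"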

Anything else we need to know?: -

Environment:

  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1", GitCommit:"5e58841cce77d4bc13713ad2b91fa0d961e69192", GitTreeState:"clean", BuildDate:"2021-05-12T14:18:45Z", GoVersion:"go1.16.4", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.14", GitCommit:"4d62230f10af41cec2851b5b113874f6d2098297", GitTreeState:"clean", BuildDate:"2021-03-22T23:01:55Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}
  • Size of cluster (how many worker nodes are in the cluster?) N/A
  • General description of workloads in the cluster (e.g. HTTP microservices, Java app, Ruby on Rails, machine learning, etc.)
  • Others:
@ghost ghost added the triage label May 31, 2021
@ghost

ghost commented May 31, 2021

Hi OvervCW, AKS bot here 👋
Thank you for posting on the AKS repo; I'll do my best to get a kind human from the AKS team to assist you.

I might be just a bot, but I'm told my suggestions are normally quite good, as such:

  1. If this case is urgent, please open a Support Request so that our 24/7 support team may help you faster.
  2. Please abide by the AKS repo Guidelines and Code of Conduct.
  3. If you're having an issue, could it be one described in the AKS Troubleshooting guides or AKS Diagnostics?
  4. Make sure you're subscribed to the AKS Release Notes to keep up to date with all that's new on AKS.
  5. Make sure there isn't a duplicate of this issue already reported. If there is, feel free to close this one and '+1' the existing issue.
  6. If you have a question, do take a look at our AKS FAQ. We place the most common ones there!

@marwanad

@OvervCW what happens if you remove the first priority?

        1:
          - ".*"

I suspect that, because you're using a wildcard matcher, you're being hit by an upstream bug that we haven't back-ported to older releases yet (we will this week).
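
In other words, keep only the explicit tiers, so the data section would look something like this:

data:
  # Wildcard tier removed; only the explicit patterns remain.
  priorities: |-
    10:
      - ".*gpu.*"
    15:
      - ".*gpuspot.*"
    20:
      - ".*cpu.*"
    25:
      - ".*cpuspot.*"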

@OvervCW
Author

OvervCW commented Jun 1, 2021

I removed it, and it started out promising: the cpuspot pool was chosen immediately for the next scaling operation. However, when I scaled up the deployment once more, it decided to use the gpu pool again, so the choice still seems somewhat random. I do think this is caused by the upstream bug you mention, since that bug can still cause any of the pools to be picked as the best match.

Is there any way to manually update the cluster-autoscaler? Or would that require upgrading to a new minor Kubernetes version?

@marwanad

marwanad commented Jun 2, 2021

We will prepare a back-port for supported K8s versions, and it should roll out with next week's release. No action is needed on your end. I'll keep you posted on this thread.

@marwanad marwanad self-assigned this Jun 3, 2021
@qpetraroia
Contributor

Hi @OvervCW,

This fix should be rolled out on Monday. Thanks!

@qpetraroia qpetraroia added the resolution/answer-provided Provided answer to issue, question or feedback. label Jun 9, 2021
@ghost

ghost commented Jun 11, 2021

Thanks for reaching out. I'm closing this issue as it was marked with "Answer Provided" and it hasn't had activity for 2 days.

@ghost ghost closed this as completed Jun 11, 2021
@lindhe

lindhe commented Jun 11, 2021

The release has not been rolled out to all zones yet, so I think it should stay open a little while longer.

@OvervCW
Author

OvervCW commented Jun 15, 2021

@lindhe Is it possible to see the rollout status somewhere?

@lindhe

lindhe commented Jun 15, 2021

No, unfortunately not. I spoke to Microsoft support regarding this (I had run into the same issue) and we identified that the patch had not been rolled out to all zones yet. I asked whether the rollout status could be viewed anywhere, but they told me that information is currently only available via an internal system on their end.

@lindhe

lindhe commented Jun 21, 2021

@qpetraroia Can we reopen the issue since it's not fully rolled out yet?

@OvervCW
Author

OvervCW commented Jul 9, 2021

I just confirmed with support that this release is still not fully rolled out.

@marwanad

marwanad commented Jul 9, 2021

@OvervCW which region and which k8s version are you using? This should have completed today.

@lindhe

lindhe commented Jul 9, 2021

Until very recently, it didn't work in West Europe. I have not checked today.

EDIT: I checked just now, and it's still broken. I've got a reasonable workflow set up for debugging it. I'm attaching instructions in case someone wants to try it.

expander-profile.zip

@marwanad marwanad reopened this Jul 9, 2021
@marwanad

marwanad commented Jul 9, 2021

@lindhe thanks for the detailed report. I'll give it a shot and report back here. This could be a different issue from the one addressed by the upstream bug fix mentioned above.

@ghost

ghost commented Jul 11, 2021

Thanks for reaching out. I'm closing this issue as it was marked with "Answer Provided" and it hasn't had activity for 2 days.

@ghost ghost closed this as completed Jul 11, 2021
@marwanad marwanad removed the resolution/answer-provided Provided answer to issue, question or feedback. label Jul 11, 2021
@marwanad marwanad reopened this Jul 11, 2021
@marwanad

@OvervCW @lindhe I just tested with your setup and it's always scaling up the correct pool for me. Can you test again? You should also see a log line from the priority package, but that seems to be missing in your report. Are you creating the configmap properly in the kube-system namespace?

I0712 02:42:51.678873       1 scale_up.go:322] Pod default/nginx-7df96bd847-2xgcc is unschedulable
I0712 02:42:51.678903       1 scale_up.go:322] Pod default/nginx-7df96bd847-h5ts5 is unschedulable
I0712 02:42:51.679020       1 scale_up.go:322] Pod default/nginx-7df96bd847-s6r9h is unschedulable
I0712 02:42:51.679026       1 scale_up.go:322] Pod default/nginx-7df96bd847-nf2f6 is unschedulable
I0712 02:42:51.680873       1 priority.go:166] priority expander: aks-spot-xxx-vmss chosen as the highest available

Using 1.19 in southcentralus fwiw.

We can connect over email to get to the bottom of this.

@OvervCW
Author

OvervCW commented Jul 12, 2021

@marwanad Is it possible that it's still broken on 1.18 but fixed on 1.19? That's the cluster version I'm currently running. I'm about to upgrade to 1.19 and then 1.20 this week, so I'll check whether the problem is fixed then.

@marwanad

marwanad commented Jul 12, 2021

@OvervCW I just verified with 1.18 as well. You can send me your cluster FQDN at the email listed on my GitHub profile or {myGitHubAlias}@microsoft.com and I can take a closer look.

edit: I was able to repro after a few tries. I think I have a clue about what's happening. I'll update later.

@marwanad

marwanad commented Jul 12, 2021

@OvervCW @lindhe I think I found the issue; it was a combination of both problems. The latter had to do with hot-swapping the expander configs. We'll get a fix rolled out. Sorry about that!

@lindhe

lindhe commented Jul 19, 2021

I'm on summer leave right now, so I cannot test again until I'm back at work. I think I'll have to set up another live session with our Microsoft support rep to verify that I'm not doing something wrong.

@marwanad

marwanad commented Aug 3, 2021

@OvervCW @lindhe the fix should be out now. I'll mark this as resolved. Feel free to re-open if you're still facing issues.

@lindhe

lindhe commented Aug 9, 2021

Hi, I'm back again now! Thank you for your patience.

I tried it again, and now it seems to work properly! I tested with both 1.19.11 and 1.20.7; I was able to scale all the way up to the node pool limit, and scaling followed the priority order. I guess the rollout was just slower than expected initially.

We can close this issue as far as I'm concerned. Well done.

@OvervCW
Author

OvervCW commented Aug 9, 2021

Can confirm that it works for us as well, on our existing 1.20.7 cluster!

@OvervCW OvervCW closed this as completed Aug 9, 2021
@Azure Azure locked as resolved and limited conversation to collaborators Sep 8, 2021