
azure-wi-webhook-controller-manager breaks AKS upgrade because of PDB #824

Closed
mac2000 opened this issue Apr 5, 2023 · 2 comments · Fixed by #827
Labels: bug (Something isn't working)
Milestone: v1.1.0

Comments


mac2000 commented Apr 5, 2023

Our attempt to upgrade AKS failed after an hour and put the whole cluster into a failed state 🤷‍♂️

Failed to save Kubernetes service 'capybara002-aks-eun-dev'.

Error: Drain of aks-infra002-17497433-vmss00000y did not complete:

Too many req pod azure-wi-webhook-controller-manager-749b9d5cc8-2chl9 on node aks-infra002-17497433-vmss00000y:

azure-workload-identity-system/azure-wi-webhook-controller-manager-749b9d5cc8-2chl9 blocked by pdb azure-wi-webhook-controller-manager with unready pods [].

See http://aka.ms/aks/debugdrainfailures

Imagine how bad and scary it was :) thankfully it was not production but development.

The AKS upgrade docs mention this:

Ensure that any PodDisruptionBudgets (PDBs) allow for at least 1 pod replica to be moved at a time otherwise the drain/evict operation will fail. If the drain operation fails, the upgrade operation will fail by design to ensure that the applications are not disrupted. Please correct what caused the operation to stop (incorrect PDBs, lack of quota, and so on) and re-try the operation.

But this budget is hidden inside the chart and is neither configurable nor mentioned anywhere in the docs (to be fair, I did not even know we had one in our cluster).
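
For anyone else hitting this, the offending PDB can be inspected directly (the namespace and resource names below match the default chart install, adjust if yours differ):

# list PDBs across the cluster and look for ones with ALLOWED DISRUPTIONS 0
kubectl get pdb --all-namespaces

# inspect the one shipped by the chart
kubectl -n azure-workload-identity-system get pdb azure-wi-webhook-controller-manager -o yaml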

As a workaround I manually replaced minAvailable with maxUnavailable and, with the help of the az CLI, re-ran the node pool upgrade and then the whole cluster upgrade - it helped and the cluster is healthy again.
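
For reference, re-running the upgrade via the az CLI looks roughly like this (resource group, cluster, node pool and version below are placeholders, not our exact values):

# re-run the upgrade of the node pool that failed to drain
az aks nodepool upgrade --resource-group <resource-group> --cluster-name <cluster-name> --name <node-pool> --kubernetes-version <kubernetes-version>

# then re-run the upgrade of the whole cluster
az aks upgrade --resource-group <resource-group> --name <cluster-name> --kubernetes-version <kubernetes-version>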

The problem here: if we leave this as is, we cannot upgrade and, even worse, cannot enable auto upgrade, because it will always fail exactly because of this deployment.

So I am wondering if there are other workarounds we could use, or whether this part of the chart could at least be made configurable.

mac2000 added the bug (Something isn't working) label Apr 5, 2023

mac2000 commented Apr 6, 2023

Meanwhile, the workaround is:

before making an upgrade, run

kubectl -n azure-workload-identity-system patch pdb azure-wi-webhook-controller-manager --type=json -p='[{"op": "remove", "path": "/spec/minAvailable"},{"op": "add", "path": "/spec/maxUnavailable", "value": 1}]'

after the upgrade, run

kubectl -n azure-workload-identity-system patch pdb azure-wi-webhook-controller-manager --type=json -p='[{"op": "remove", "path": "/spec/maxUnavailable"},{"op": "add", "path": "/spec/minAvailable", "value": 1}]'

but this is applicable only to manual upgrades


mac2000 commented Apr 7, 2023

Another possible workaround is to configure the rolling update in such a way that it guarantees at least one instance is always running; yes, for a short period there may be two instances while things are moving, but it would allow getting rid of the PDB altogether.

https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#max-unavailable

# ...
spec:
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
    type: RollingUpdate
# ...

which means that the rolling update will try to add a new instance before removing an old one; there will always be at least one instance, and during upgrades there may briefly be two instances

With this in place, we could add a flag to disable the PDB entirely.
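
Something like this in values (the key name below is just a sketch of what such a flag could look like, the chart does not expose it today):

# values.yaml - hypothetical flag, not currently part of the chart
podDisruptionBudget:
  enabled: false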

aramase added this to the v1.1.0 milestone Apr 11, 2023