
azure-wi-webhook-controller-manager breaks AKS upgrade because of PDB #824

Closed
mac2000 opened this issue Apr 5, 2023 · 2 comments · Fixed by #827
Labels: bug (Something isn't working)
Milestone: v1.1.0

Comments


mac2000 commented Apr 5, 2023

Our attempt to upgrade AKS failed after an hour and put the whole cluster into a failed state 🤷‍♂️

Failed to save Kubernetes service 'capybara002-aks-eun-dev'.

Error: Drain of aks-infra002-17497433-vmss00000y did not complete:

Too many req pod azure-wi-webhook-controller-manager-749b9d5cc8-2chl9 on node aks-infra002-17497433-vmss00000y:

azure-workload-identity-system/azure-wi-webhook-controller-manager-749b9d5cc8-2chl9 blocked by pdb azure-wi-webhook-controller-manager with unready pods [].

See http://aka.ms/aks/debugdrainfailures

Imagine how bad and scary it was :) thankfully it was not production but development.

The AKS upgrade docs mention this:

Ensure that any PodDisruptionBudgets (PDBs) allow for at least 1 pod replica to be moved at a time otherwise the drain/evict operation will fail. If the drain operation fails, the upgrade operation will fail by design to ensure that the applications are not disrupted. Please correct what caused the operation to stop (incorrect PDBs, lack of quota, and so on) and re-try the operation.

But this budget is hidden inside the chart and is neither configurable nor mentioned anywhere in the docs (to be fair, I did not even know we had one in our cluster).
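
For anyone else hitting this, the offending PDB can be inspected directly (the namespace and resource names below match the default chart install, adjust if yours differ):

# list PDBs across the cluster and look for ones with ALLOWED DISRUPTIONS 0
kubectl get pdb --all-namespaces

# inspect the one shipped by the chart
kubectl -n azure-workload-identity-system get pdb azure-wi-webhook-controller-manager -o yaml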

As a workaround I manually replaced minAvailable with maxUnavailable and, with the help of the az CLI, re-ran the node pool upgrade and then the whole cluster upgrade - it helped and the cluster is healthy again.
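
For reference, re-running the upgrade via the az CLI looks roughly like this (resource group, cluster, node pool and version below are placeholders, not our exact values):

# re-run the upgrade of the node pool that failed to drain
az aks nodepool upgrade --resource-group <resource-group> --cluster-name <cluster-name> --name <node-pool> --kubernetes-version <kubernetes-version>

# then re-run the upgrade of the whole cluster
az aks upgrade --resource-group <resource-group> --name <cluster-name> --kubernetes-version <kubernetes-version>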

The problem here: if we leave this as is, we cannot upgrade and, even worse, cannot enable auto upgrade, because it will always fail exactly because of this deployment.

So I am wondering if there are other workarounds we could use, or whether this part of the chart could at least be made configurable.

mac2000 added the bug (Something isn't working) label Apr 5, 2023

mac2000 commented Apr 6, 2023

Meanwhile, the workaround is:

before making an upgrade, run

kubectl -n azure-workload-identity-system patch pdb azure-wi-webhook-controller-manager --type=json -p='[{"op": "remove", "path": "/spec/minAvailable"},{"op": "add", "path": "/spec/maxUnavailable", "value": 1}]'

after the upgrade, run

kubectl -n azure-workload-identity-system patch pdb azure-wi-webhook-controller-manager --type=json -p='[{"op": "remove", "path": "/spec/maxUnavailable"},{"op": "add", "path": "/spec/minAvailable", "value": 1}]'

but this is applicable only to manual upgrades


mac2000 commented Apr 7, 2023

Another possible workaround is to configure the rolling update in such a way that it guarantees at least one instance is always running; yes, for a short period there may be two instances while things are moving, but it would allow getting rid of the PDB altogether.

https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#max-unavailable

# ...
spec:
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
    type: RollingUpdate
# ...

which means that the rolling update will try to add a new instance before removing an old one; there will always be at least one instance, and during upgrades there may briefly be two instances

With this in place, we could add a flag to disable the PDB entirely.
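
Something like this in values (the key name below is just a sketch of what such a flag could look like, the chart does not expose it today):

# values.yaml - hypothetical flag, not currently part of the chart
podDisruptionBudget:
  enabled: false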

aramase added this to the v1.1.0 milestone Apr 11, 2023