Our attempt to upgrade AKS failed after an hour and put the whole cluster into a failed state 🤷♂️
Failed to save Kubernetes service 'capybara002-aks-eun-dev'.
Error: Drain of aks-infra002-17497433-vmss00000y did not complete:
Too many req pod azure-wi-webhook-controller-manager-749b9d5cc8-2chl9 on node aks-infra002-17497433-vmss00000y:
azure-workload-identity-system/azure-wi-webhook-controller-manager-749b9d5cc8-2chl9 blocked by pdb azure-wi-webhook-controller-manager with unready pods [].
See http://aka.ms/aks/debugdrainfailures
Imagine how bad and scary it was :) Thankfully it was not production but development.
In the AKS upgrade docs there is a mention of this:
Ensure that any PodDisruptionBudgets (PDBs) allow for at least 1 pod replica to be moved at a time otherwise the drain/evict operation will fail. If the drain operation fails, the upgrade operation will fail by design to ensure that the applications are not disrupted. Please correct what caused the operation to stop (incorrect PDBs, lack of quota, and so on) and re-try the operation.
But these budgets are hidden inside the chart and are not configurable or mentioned anywhere in the docs (to be fair, I did not even know that we had one in our cluster).
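For context, the blocking object is a PodDisruptionBudget shaped roughly like the sketch below (illustrative only; the real manifest is rendered by the chart, and the minAvailable value and selector here are assumptions that match the drain behaviour we saw):

```yaml
# Illustrative sketch - not copied from the chart
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: azure-wi-webhook-controller-manager
  namespace: azure-workload-identity-system
spec:
  minAvailable: 1          # assumption: with a single ready replica this blocks every eviction
  selector:
    matchLabels:
      azure-workload-identity.io/system: "true"   # hypothetical selector
```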
As a workaround I manually replaced minAvailable with maxUnavailable in that PDB and, with the help of the az CLI, re-ran the node pool upgrade and then the whole cluster upgrade - it helped, and the cluster is healthy again.
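Roughly this sequence, in case someone hits the same thing (a sketch: the resource group, node pool name and version are placeholders, and the kubectl patch is just one way to swap the fields - a JSON merge patch with null drops minAvailable so the two fields don't conflict):

```sh
# Swap minAvailable for maxUnavailable on the blocking PDB
kubectl patch pdb azure-wi-webhook-controller-manager \
  -n azure-workload-identity-system \
  --type merge \
  -p '{"spec":{"minAvailable":null,"maxUnavailable":1}}'

# Re-run the failed node pool upgrade, then the cluster upgrade
az aks nodepool upgrade --resource-group <resource-group> \
  --cluster-name capybara002-aks-eun-dev --name <nodepool> --kubernetes-version <version>
az aks upgrade --resource-group <resource-group> \
  --name capybara002-aks-eun-dev --kubernetes-version <version>
```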
The problem here: if we leave this as is, we cannot upgrade and, even worse, cannot enable auto-upgrade, because it will always fail exactly because of this deployment.
So I am wondering if there are some other workarounds we could use, or maybe this part of the chart could at least be made configurable.
Another possible workaround is to configure the rolling update in such a way that it guarantees at least one instance is always running. Yes, for a short period of time there may be two instances while things move, but it would allow getting rid of the PDB altogether.
This means the rolling update will try to add a new instance before removing the old one, so there will always be at least one instance, and during upgrades there may briefly be two instances.
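A minimal sketch of that strategy on the webhook Deployment (the field values are assumptions, not taken from the chart):

```yaml
# Sketch: surge before terminating so one pod is always running
spec:
  replicas: 1
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # never take the existing pod down first
      maxSurge: 1         # briefly run a second pod while the new one comes up
```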
With this in place we could add a flag to disable the PDB entirely.
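If the chart grew such a flag, it could look something like this (the values key and template guard are hypothetical, not part of the current chart):

```yaml
# Hypothetical values.yaml knob (not in the current chart)
podDisruptionBudget:
  enabled: false
```

```yaml
# Hypothetical guard around the chart's PDB template
{{- if .Values.podDisruptionBudget.enabled }}
apiVersion: policy/v1
kind: PodDisruptionBudget
# ... rest of the existing manifest ...
{{- end }}
```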