Need to support rolling update in the ingress controller. #75
@emilverwoerd are you using the latest ingress controller helm chart? There was a bug in AKS because of which the ingress controller would stop receiving pod update events on the order of minutes, resulting in delays updating the backend pools in the application gateway. To comment on your solution: using the cluster IP instead of the pod IP for the backend IP address is a very bad idea. The cluster IP is a VIP (virtual IP address) that is used for layer 4 (TCP) load balancing. Adding the cluster IP as the backend to the application gateway instead of the actual pod IP would break session affinity, if enabled in the application gateway. The correct solution is to observe the deployments associated with a service, and not just the endpoints, and update the backend pool sets. Meanwhile, if you haven't updated to the latest helm chart, could you kindly update and retry the rolling update to observe the behavior?
This is a classic case of supporting blue-green deployments. We should try and support this if it is not already working.
We are currently using version 0.1.4 with the latest helm chart, so that should be the latest. What we are experiencing is that the site is temporarily unavailable: the old pods are terminated and the new pods are running, but it takes some time for Application Gateway to update the backend pools. So for a brief period Application Gateway is still configured with IP addresses of old containers that are no longer running; when Kubernetes is done upgrading the containers, Application Gateway is not ready yet, and this can take a few moments. The old containers should only be terminated once Azure Application Gateway is done updating the backend pools.
@emilverwoerd just checked the commits, and the fix for the AKS event subscription was already present in ingress controller helm chart 0.1.4, so you should already have had that fix. I think the issue will persist for you even after upgrading to 0.1.5. Could you kindly share your deployment spec here (the relevant parts, such as the rolling update strategy, readiness probes and any preStop lifecycle hooks you might have added)? Please read below for possible solutions that others have tried. I dug around a bit to see how other ingress controllers deal with zero downtime and found two blog posts. The problem you are facing seems to be a common issue with other ingress controllers as well (nginx controllers are the ones cited in the articles above). If you follow the articles, achieving zero-downtime upgrades "with ingress" requires two components in your deployment spec. The first is a readiness probe, so Kubernetes only routes traffic to pods that can actually serve it; the second is a preStop hook that delays termination of the old pods until the load balancer has been updated.
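The two components described above can be sketched as a deployment spec. This is an illustration only; the names, image, probe path, port and sleep duration are assumptions, not values taken from this thread:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                    # hypothetical name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: myregistry/my-app:latest   # hypothetical image
          readinessProbe:           # Kubernetes only adds the pod to the
            httpGet:                # endpoints object once this passes
              path: /healthz        # assumed health endpoint
              port: 8080
            periodSeconds: 5
          lifecycle:
            preStop:                # keep the old pod alive and serving
              exec:                 # while the gateway backend pool catches up
                command: ["sleep", "30"]
```

The preStop sleep should be long enough to cover the gateway's update latency, and the pod's termination grace period must exceed it, or Kubernetes will kill the pod mid-sleep.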
Okay, I will post my deployment spec here, but I don't understand how to specify a rolling update if AG isn't updated before the pods are recreated. That was also the reason I thought it would be better to use the cluster IP of the service, since it isn't changing while the IPs of the pods are. But I also understand that AG isn't part of the Kubernetes platform, so that isn't possible. But then, to perform a correct update, you should wait with container termination until AG is ready. I will try the preStop hook as a workaround and will let you know if it works.
@emilverwoerd the AG would be updated only after the new pods are created, since Kubernetes will update the endpoints object only once the new pods pass their readiness probes. The preStop hook then delays termination of the old pods, so they keep serving until AG has picked up the new backend IPs. Hope that explains the proposed solution?
@emilverwoerd Can you please share your findings with this config in your deployment spec?
Also, can you please share your config for these settings in your deployment spec?
This article also has good insights.
I tried adding the sleep command, but when doing so it causes issues with the readiness probe and the pod gets terminated. Also, updating the backend pool really takes some time, so it has no different effect than without the sleep command. We use the following spec for our rollingUpdate.
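For reference, a rollingUpdate strategy along the lines discussed here typically looks like the fragment below. The values are illustrative, not necessarily the poster's actual spec:

```yaml
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1            # bring up one extra pod before taking an old one down
      maxUnavailable: 0      # never drop below the desired replica count
```

With `maxUnavailable: 0`, Kubernetes keeps the old pods serving until a new pod is ready, which is the half of the problem Kubernetes can solve on its own; the gateway-side delay still needs the preStop sleep.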
@emilverwoerd could you provide your subscription ID? We want to make sure you are using Application Gateway v2 and not v1. Did you create the AKS cluster and application gateway through the templates? We will try reproducing this problem at our end as well, but without the readiness probe Kubernetes wouldn't know when to update the endpoints object with the new pods, so the whole rolling update process would be flaky. So I think we need to get the readiness probes working for rolling updates.
@asridharan we are on the following subscription '282d71e4-f66b-4e8f-8e49-4faea8667362', and we are using Gateway v2. We created the cluster through our ARM templates, so I could send you those if you want. Thanks in advance for checking it out.
@emilverwoerd @asridharan I think the
Our environment is connected to Azure by VPN, and the nginx-ingress external IP is on my local network. When I hit nginx-ingress I can access my application without issue, but through Application Gateway I get a 502. I tested the path from Application Gateway to the nginx-ingress IP with the network troubleshooter in Azure, and that also works fine. Just curious to know: in order to route traffic from Azure Application Gateway to an ingress, is application-gateway-kubernetes-ingress mandatory, or can I go with nginx-ingress as well?
@kernelv5 sorry for the late response, but one thing you want to check is whether the AG subnet is able to route to the subnets your pods are connected to. If AG is not able to connect to your pods, then you might be getting a 502 error.
I think this is currently still an issue. I 'fixed' it somewhat by using the preStop hook in combination with a termination grace period. The sleep I have is around 45 seconds, and the termination grace period around 90 seconds, which seems to work for our case. Would be nice if this gets implemented...
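Sketched as a deployment-spec fragment, the workaround above would look roughly like this (the 45s/90s values are the ones mentioned; the container name is hypothetical):

```yaml
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 90   # must be longer than the preStop sleep
      containers:
        - name: my-app                    # hypothetical container name
          lifecycle:
            preStop:
              exec:
                command: ["sleep", "45"]  # keep serving while AG updates its backend pool
```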
We are dealing with the same issues. The workaround suggested by @Baklap4 functions, but is far from ideal. The ingress controller pod (image: mcr.microsoft.com/azure-application-gateway/kubernetes-ingress:tag) initiates a reconfiguration of the appgw as soon as any of the 'connected' resources changes (= expected behaviour). This reconfiguration process is fully completed before a new reconfiguration process is initiated. During a rolling redeployment, several changes (stopping pods, creating pods, deleting pods) happen in quick succession. A redeployment therefore causes a discrepancy between the configuration of the appgw and the actual situation in the cluster. This discrepancy results in 502s and/or 503s, which is not very 'rolling'. An example to illustrate:
Is there a possibility to somehow preempt a running reconfiguration as soon as another change happens within the connected resources in the cluster, so the ingress controller always converges on the latest state?
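The controller itself is written in Go; purely as an illustration of the idea being asked for here, the following is a minimal Python sketch of coalescing a burst of change events so that only the final desired state triggers a reconfiguration. All names are hypothetical:

```python
import threading
import time

class CoalescingReconfigurer:
    """Collects rapid-fire change events and applies only the latest
    desired state once the burst has quieted down, instead of running
    one full reconfiguration per event."""

    def __init__(self, apply_fn, quiet_period=0.2):
        self.apply_fn = apply_fn        # pushes a config to the gateway
        self.quiet_period = quiet_period
        self._lock = threading.Lock()
        self._pending = None
        self._timer = None

    def notify(self, desired_state):
        """Called on every pod/endpoint change event."""
        with self._lock:
            self._pending = desired_state
            if self._timer is not None:
                self._timer.cancel()    # a new event restarts the quiet window
            self._timer = threading.Timer(self.quiet_period, self._flush)
            self._timer.start()

    def _flush(self):
        with self._lock:
            state, self._pending = self._pending, None
        self.apply_fn(state)            # one reconfiguration for the whole burst

applied = []
r = CoalescingReconfigurer(applied.append, quiet_period=0.1)
for i in range(5):                      # five events in quick succession,
    r.notify({"backend_ips": [f"10.0.0.{i}"]})  # e.g. pods churning in a rollout
time.sleep(0.3)                         # wait out the quiet period
print(applied)                          # only the last state was applied
```

Under this scheme a rolling redeployment's stop/create/delete churn collapses into far fewer gateway updates, each reflecting the latest cluster state.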
Closing this issue.
@akshaysngupta Where are the tracking issues for the long-term solution, i.e. making the backend pool pick up changes faster?
Describe the bug
When performing a rolling update of any kind of service, you want the site or service to stay online. But during an update, a 502 Bad Gateway is returned. The problem occurs due to the fact that Application Gateway is using the internal IP addresses of the pods in the backend pool instead of the cluster IP of the specified service.
So what happens is that Kubernetes spins up new pods with new IP addresses, depending on the replica count, and the original IP addresses, which are still used by the Application Gateway, are removed. A couple of minutes later the backend pool is updated with the new IP addresses of the pods. But we want the ClusterIP address to be used in the backend pool, so Kubernetes can perform correct load balancing.
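To see the two kinds of addresses being discussed, assuming a service named my-service (hypothetical name), one can compare the stable virtual IP with the churning pod endpoints:

```shell
# ClusterIP: a stable virtual IP, load-balanced at L4 by kube-proxy
kubectl get service my-service -o jsonpath='{.spec.clusterIP}'

# Pod IPs: the actual endpoints that change on every rolling update;
# these are what the ingress controller writes into the backend pool
kubectl get endpoints my-service -o jsonpath='{.subsets[*].addresses[*].ip}'
```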
To Reproduce
Redeploy a service and check if it is online