Cluster configuration "reverts" to something old #298
@ritazh by any chance, do you know what may be happening here? @seniorquico I'm aware that you have a support ticket open in the portal; it would help if you added its ID here so someone can look at it. To me, it looks like something is wrong on your manager nodes. Did you try deploying a fresh AKS cluster to see if the same problem happens?
Thanks, @galvesribeiro. The Azure support ticket is 118040617949323. We did deploy a brand new AKS cluster at the end of last week. (We wanted to leave this problematic cluster alone/intact in case anyone on the AKS team wants to dig into anything.) The new AKS cluster has been running the same suite of services without any issues.
Hmmm. That is what I suspected... AKS is still in preview, so I wonder if one of those cluster updates along the way messed something up. I mean not just your updates, but the internal ones in the hosted infrastructure. If that somehow affected the quorum on etcd, it may be the source of your problems.
I finally received a response from Microsoft support, and it was simply, "...the infrastructure has been affected and the engineers have solved the situation." Unfortunately, our cluster still wasn't working. In addition, there have been new reports of problems. We decided to just delete the AKS resource and move on. Sorry to anyone following along... there's no explanation or resolution. Overall, this issue, along with many others (#112, #131, #324), shows that AKS isn't near GA/production quality. We've switched our production workloads from ACS/AKS to Google Kubernetes Engine.
Sorry to hear that... Yes, AKS is still in preview, but I've been using it in production since the early private previews and it has been working fine for me.
I've seen comments that the team has recently worked to improve the reliability of the etcd service in AKS. Unfortunately, we're encountering a new issue that I haven't yet read about on this tracker:
In the past couple of weeks we experienced a new issue where the cluster reverted to a configuration state dated in late February. And by reverted, I mean absolutely everything reverted; it was like a time warp to late February. We manually worked through redeploying the latest configurations and components without incident (I had intended to open a GitHub issue sooner to start this dialog... sorry, all).
After things had appeared to stabilize, we upgraded the cluster to the latest available version using the Azure CLI. After running fine for a few days, the cluster again reverted to the same configuration state dated in late February. However, this time, nodes created after the upgrade do not show up in kubectl or the dashboard despite showing as available in the Azure portal.
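For anyone hitting a similar mismatch, a minimal sketch of one way to find VMs the Azure portal reports but the API server never registered: dump both node lists to files and compare them. The resource group name (`MC_myrg_mycluster_eastus`) and node names below are hypothetical placeholders; the sample `printf` lines stand in for real `kubectl`/`az` output so the comparison itself is demonstrable.

```shell
# In a real cluster you would capture the two lists first, e.g.:
#   kubectl get nodes -o name | sed 's|^node/||' | sort > k8s-nodes.txt
#   az vm list -g MC_myrg_mycluster_eastus --query '[].name' -o tsv | sort > azure-vms.txt
# Sample data stands in for that output here (names are hypothetical):
printf 'aks-nodepool1-0\naks-nodepool1-1\n' > k8s-nodes.txt
printf 'aks-nodepool1-0\naks-nodepool1-1\naks-nodepool1-2\n' > azure-vms.txt

# comm -13 prints lines unique to the second file, i.e. VMs that exist
# in Azure but never registered with the Kubernetes API server:
comm -13 k8s-nodes.txt azure-vms.txt
```

Note that `comm` requires both inputs to be sorted, hence the `sort` in the capture commands.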
EDIT: I want to confirm that we looked at the "obvious," ensuring this wasn't a RollingUpdate issue pulling us back. We've had numerous successful updates to the resources in this cluster since that point in late February. We also can't find anything significant about the date in late February. We performed no out-of-the-ordinary system changes (no K8s upgrades, no scaling events, etc.). We do have a CD pipeline connected and generally push several new image tags per day.
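A revert like this can be caught early by periodically snapshotting the image tags your deployments are running and diffing against the expected state from your CD pipeline. This is only a sketch under assumed names: the registry (`myreg.azurecr.io`), deployments, and tags below are hypothetical, and the `printf` lines substitute for real `kubectl` output.

```shell
# In a live cluster, the snapshot could be captured with something like:
#   kubectl get deployments --all-namespaces \
#     -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.containers[0].image}{"\n"}{end}' \
#     > images-live.txt
# Sample data stands in for that output (names/tags are hypothetical):
printf 'api\tmyreg.azurecr.io/api:2018-04-10\nweb\tmyreg.azurecr.io/web:2018-04-09\n' > images-expected.txt
printf 'api\tmyreg.azurecr.io/api:2018-02-25\nweb\tmyreg.azurecr.io/web:2018-02-25\n' > images-live.txt

# diff exits non-zero when the files differ, so the || branch flags drift:
diff images-expected.txt images-live.txt && echo "no drift" || echo "cluster state drifted"
```

Running this from a cron job or CD step would have surfaced the "time warp" as soon as the live tags fell back to February values, instead of waiting for user-visible breakage.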