
Cluster configuration "reverts" to something old #298

Closed
seniorquico opened this issue Apr 11, 2018 · 5 comments

Comments


seniorquico commented Apr 11, 2018

I've seen comments that the team has recently worked to improve the reliability of the etcd service in AKS. Unfortunately, we're encountering a new issue that I haven't yet read about on this tracker:

In the past couple weeks we experienced a new issue where the cluster would revert to a configuration state dated in late February. And by reverted, I mean absolutely everything reverted. It was like a time warp to late February. We manually worked through redeploying the latest configurations and components without incident (I intended to open a GitHub issue sooner to start this dialog... sorry all).

After things had appeared to stabilize, we upgraded the cluster to the latest available version using the Azure CLI. After running fine for a few days, the cluster again reverted to the same configuration state dated in late February. However, this time, nodes that had been created after the upgrade do not show up in kubectl/dashboard despite showing as available in the Azure portal.
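(For context, the upgrade mentioned above used the standard Azure CLI flow; a minimal sketch follows, where the resource group, cluster name, and target version are placeholders, not the actual values from this cluster.)

```sh
# List the Kubernetes versions the cluster can be upgraded to.
az aks get-upgrades --resource-group myResourceGroup --name myCluster --output table

# Upgrade the cluster to a specific version; the version shown is a placeholder.
az aks upgrade --resource-group myResourceGroup --name myCluster --kubernetes-version 1.9.6
```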

EDIT: I want to confirm that we checked the "obvious", ensuring this wasn't a RollingUpdate issue pulling us back. We've had numerous successful updates to the resources in this cluster since that point in late February, and we can't find anything significant about that date. We performed no out-of-the-ordinary system changes (no K8s upgrades, no scaling events, etc.). We do have a CD pipeline connected and generally push several new image tags per day.
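(For anyone hitting something similar: a Deployment rollback can usually be ruled out from the client side with a couple of kubectl commands. This is only a sketch; the deployment name and namespace are placeholders.)

```sh
# Compare the nodes the API server knows about with the VMs shown in the Azure portal.
kubectl get nodes -o wide

# Check whether a Deployment was rolled back to an earlier revision.
kubectl rollout history deployment/my-app -n my-namespace

# Inspect the current revision and recent events for unexpected changes.
kubectl describe deployment my-app -n my-namespace
```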

@galvesribeiro

@ritazh, by any chance do you know what may be happening here?

@seniorquico I'm aware that you have a support ticket open in the portal; it would help if you added its ID here so someone can look at it.

To me, it looks like something is wrong on your master nodes.

Did you try deploying a fresh AKS cluster to see if the same problem happens?
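(For reference, standing up a throwaway cluster to test that is a couple of Azure CLI commands; the resource group, cluster name, and node count below are placeholders.)

```sh
# Create a fresh AKS cluster for comparison.
az aks create --resource-group myResourceGroup --name myTestCluster \
  --node-count 3 --generate-ssh-keys

# Pull its credentials into the local kubeconfig and confirm the nodes register.
az aks get-credentials --resource-group myResourceGroup --name myTestCluster
kubectl get nodes
```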

@seniorquico

Thanks, @galvesribeiro. The Azure support ticket is 118040617949323.

We did deploy a brand new AKS cluster at the end of last week. (We wanted to leave this problematic cluster alone/intact in case anyone on the AKS team wants to dig into anything.) The new AKS cluster has been running the same suite of services without any issues.

@galvesribeiro

> The new AKS cluster has been running the same suite of services without any issues.

Hmmm. That is what I suspected... AKS is still in preview, so I wonder whether any of those cluster updates along the way messed something up. I mean, not just your updates, but the internal ones in the hosted infrastructure. If that somehow affected the quorum on etcd, it may be the source of your problems.
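(On a managed cluster there is no direct access to etcd itself, so the closest rough check from the user side at the time was the component statuses exposed by the API server; this is only a sketch, not something suggested in the thread.)

```sh
# Rough client-side view of control-plane health, including the etcd members
# the API server is configured against (componentstatuses still existed in
# Kubernetes versions of this era).
kubectl get componentstatuses
```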

@seniorquico

I finally received a response from Microsoft support, and it was simply, "...the infrastructure has been affected and the engineers have solved the situation." Unfortunately, our cluster still wasn't working. In addition, there have been new reports of problems. We decided to just delete the AKS resource and move on. Sorry to anyone following along... there's no explanation or resolution.

Overall... this issue, along with many others (#112, #131, #324), shows that AKS isn't near GA/production quality. We've switched our production workloads from ACS/AKS to Google Kubernetes Engine.

@galvesribeiro

Sorry to hear that... Yes, AKS is still in preview, but I've been using it in production since the early private previews and it has been working fine for me.

@ghost locked as resolved and limited conversation to collaborators Aug 9, 2020