Cluster configuration "reverts" to something old #298
@ritazh by any chance, do you know what may be happening here? @seniorquico I'm aware that you have a support ticket open in the portal; it would help if you added its ID here so someone can look at it. To me, it looks like something is wrong on your manager nodes. Did you try deploying a fresh AKS cluster to see if the same problem happens?
Thanks, @galvesribeiro. The Azure support ticket is 118040617949323. We did deploy a brand new AKS cluster at the end of last week. (We wanted to leave this problematic cluster alone/intact in case anyone on the AKS team wants to dig into anything.) The new AKS cluster has been running the same suite of services without any issues.
Hmmm. That is what I suspected... AKS is still in preview, so I wonder if one of those cluster updates along the way messed something up. I mean not just your updates, but the internal ones in the hosted infrastructure. If that somehow affected the quorum on etcd, it may be the source of your problems.
I finally received a response from Microsoft support, and it was simply, "...the infrastructure has been affected and the engineers have solved the situation." Unfortunately, our cluster still wasn't working. In addition, there have been new reports of problems. We decided to just delete the AKS resource and move on. Sorry to anyone following along... there's no explanation or resolution. Overall, this issue, along with many others (#112, #131, #324), shows that AKS isn't near GA/production quality. We've switched our production workloads from ACS/AKS to Google Kubernetes Engine.
Sorry to hear that... Yes, AKS is still in preview, but I've been using it in production since the early private previews and it has been working fine for me.
I've seen comments that the team has recently worked to improve the reliability of the etcd service in AKS. Unfortunately, we're encountering a new issue that I haven't yet read about on this tracker:
In the past couple of weeks we experienced a new issue where the cluster reverted to a configuration state dated in late February. And by reverted, I mean absolutely everything reverted; it was like a time warp to late February. We manually worked through redeploying the latest configurations and components without incident (I had intended to open a GitHub issue sooner to start this dialog... sorry, all).
After things had appeared to stabilize, we upgraded the cluster to the latest available version using the Azure CLI. After running fine for a few days, the cluster again reverted to the same configuration state dated in late February. However, this time, nodes created after the upgrade do not show up in kubectl or the dashboard despite showing as available in the Azure portal.
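For anyone hitting a similar mismatch, a minimal sketch of one way to find VMs the Azure portal reports but the API server never registered: dump both node lists to files and compare them. The resource group name (`MC_myrg_mycluster_eastus`) and node names below are hypothetical placeholders; the sample `printf` lines stand in for real `kubectl`/`az` output so the comparison itself is demonstrable.

```shell
# In a real cluster you would capture the two lists first, e.g.:
#   kubectl get nodes -o name | sed 's|^node/||' | sort > k8s-nodes.txt
#   az vm list -g MC_myrg_mycluster_eastus --query '[].name' -o tsv | sort > azure-vms.txt
# Sample data stands in for that output here (names are hypothetical):
printf 'aks-nodepool1-0\naks-nodepool1-1\n' > k8s-nodes.txt
printf 'aks-nodepool1-0\naks-nodepool1-1\naks-nodepool1-2\n' > azure-vms.txt

# comm -13 prints lines unique to the second file, i.e. VMs that exist
# in Azure but never registered with the Kubernetes API server:
comm -13 k8s-nodes.txt azure-vms.txt
```

Note that `comm` requires both inputs to be sorted, hence the `sort` in the capture commands.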
EDIT: I want to confirm that we looked at the "obvious," ensuring this wasn't a RollingUpdate issue pulling us back. We've had numerous successful updates to the resources in this cluster since that point in late February. We also can't find anything significant about the date in late February. We performed no out-of-the-ordinary system changes (no K8s upgrades, no scaling events, etc.). We do have a CD pipeline connected and generally push several new image tags per day.
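A revert like this can be caught early by periodically snapshotting the image tags your deployments are running and diffing against the expected state from your CD pipeline. This is only a sketch under assumed names: the registry (`myreg.azurecr.io`), deployments, and tags below are hypothetical, and the `printf` lines substitute for real `kubectl` output.

```shell
# In a live cluster, the snapshot could be captured with something like:
#   kubectl get deployments --all-namespaces \
#     -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.containers[0].image}{"\n"}{end}' \
#     > images-live.txt
# Sample data stands in for that output (names/tags are hypothetical):
printf 'api\tmyreg.azurecr.io/api:2018-04-10\nweb\tmyreg.azurecr.io/web:2018-04-09\n' > images-expected.txt
printf 'api\tmyreg.azurecr.io/api:2018-02-25\nweb\tmyreg.azurecr.io/web:2018-02-25\n' > images-live.txt

# diff exits non-zero when the files differ, so the || branch flags drift:
diff images-expected.txt images-live.txt && echo "no drift" || echo "cluster state drifted"
```

Running this from a cron job or CD step would have surfaced the "time warp" as soon as the live tags fell back to February values, instead of waiting for user-visible breakage.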