Warn if a new etcd cluster is seeded, move control files, report the last backup snapshot taken #16416
Comments
Any comment on this 🙏🏻?
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned". In response to this: /close not-planned
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/kind feature
/kind bug
Yesterday we experienced a huge downtime in our preproduction cluster. To be clear, it was completely our fault, but I think we could add various guard rails to kops to prevent similar things from happening.
I will first list my suggestions as bullets and then paste the postmortem I just wrote for our internal docs (there are some hints in there for non-platform engineers, sorry about that), which explains in detail what happened:

- Warn if a new etcd cluster is about to be seeded
- Move the etcd-manager control files out of the backup path
- Report when the last backup snapshot was taken

It could be that some of this is relevant to https://github.com/kubernetes-sigs/etcdadm rather than kops.
If even one of these suggestions is accepted, I am happy to provide a pull request to implement the changes.
That said, here is the postmortem, which explains our mistake:
Summary
In the afternoon, a cluster spec change (mainly related to micro cost reductions) as well as instance group changes were applied via kops v1.28.4. The changes were:
<internal pr>
<internal pr>
These changes were made the usual way using `kops edit cluster` and `kops edit ig <ig-name>`, then applied using `kops update cluster --yes`.
Once this command was applied, the kubernetes-api was unavailable for ~5 minutes. It eventually came back online, but the state was empty, meaning the staging cluster was basically created from scratch as a new cluster. It turned out all three master nodes had initialized a new etcd cluster (the database in which the kubernetes-api server stores all resources).
In other words this update command killed the entire cluster.
This outage lasted for ~3h until we eventually recovered.
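For reference, the change-and-apply workflow looked roughly like this (cluster and instance group names are placeholders, not our real ones):

```bash
kops edit cluster staging.example.com
kops edit ig nodes-eu-central-1a --name staging.example.com
# Applying the above is what restarted etcd-manager on the control plane nodes:
kops update cluster staging.example.com --yes
```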
Trigger
A restart of the etcd-manager (issued via a `kops update cluster` command).
Resolution and root cause
Now, how to resolve this? A backup was required. Naturally, kops ships etcd-manager, which automatically takes etcd snapshots and stores them within the kops S3 state bucket. See https://github.com/kopeio/etcd-manager.
Usually one can simply revert to the last snapshot (taken at a 15-minute interval) and recreate the etcd cluster, which resolves the issue.
However, Murphy hit us there. The last backup had been uploaded in January, and backups had silently stopped ever since. The reason for this was this PR:
<internal pr>
which is also why this outage happened in the first place.
What this PR did was add an S3 lifecycle rule targeting `<cluster-name>/backups/etcd/main` to delete snapshots older than 90 days. Another micro cost saving change.
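For illustration only (this is a reconstruction with placeholder bucket and cluster names, not our actual PR), a rule of roughly this shape expires everything under the backup prefix, including anything stored beneath it:

```bash
aws s3api put-bucket-lifecycle-configuration \
  --bucket kops-state-store-example \
  --lifecycle-configuration '{
    "Rules": [
      {
        "ID": "expire-old-etcd-snapshots",
        "Status": "Enabled",
        "Filter": { "Prefix": "staging.example.com/backups/etcd/main/" },
        "Expiration": { "Days": 90 }
      }
    ]
  }'
```

Because the rule is scoped purely by prefix, it cannot distinguish snapshot directories from the control files that live next to them.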
However, what we were unaware of is that kops/etcd-manager stores control files within the same path, beneath `./control`. The lifecycle rule deleted these files alongside the snapshots we actually wanted to delete. etcd-manager detected that the files were gone during its usual reconcile loop, which was visible in `/var/log/etcd.log`. The relevant part was `no cluster spec set - must seed new cluster`. This check happens before an etcd snapshot is taken, and from that point on no more snapshots were uploaded. We were also unaware of this because there is no monitoring for these snapshots; we only check whether our backups are functioning properly before we roll out a Kubernetes upgrade.
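As an aside, even a check as simple as the following (placeholder bucket and prefix, run periodically or wired into alerting) would have told us that the newest backup object was months old:

```bash
# Print the key and timestamp of the most recently modified object under the
# etcd backup prefix; alert if it is older than a few hours.
aws s3api list-objects-v2 \
  --bucket kops-state-store-example \
  --prefix staging.example.com/backups/etcd/main/ \
  --query 'sort_by(Contents,&LastModified)[-1].[Key,LastModified]' \
  --output text
```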
And as the message says, `must seed new cluster` is exactly what happened when the kops update command restarted etcd-manager three months later. We were left with a new, empty cluster.
Luckily, etcd-manager archives the old etcd data dir to `*-trashcan`. The steps we took to recover were as follows:
1. `ETCDCTL_API=3 etcdctl snapshot restore --skip-hash-check=true` (this step is necessary as the trashcan is not a snapshot but rather a plain copy of the etcd data dir)
2. `ETCDCTL_API=3 etcdctl snapshot save`
3. `etcd-manager-ctl restore-backup` with said backup. This command can be executed from the local machine, as the entire restore process is purely S3 driven.
4. `crictl stop <id>`, which starts the recovery process and spins up an etcd cluster with our state from before the outage.
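Put together, a rough sketch of these steps (bucket, cluster, paths, and container IDs are placeholders; starting a temporary etcd from the restored data dir and uploading the resulting snapshot into the backup store are elided):

```bash
# 1. The trashcan is a plain copy of the etcd data dir, not a snapshot,
#    so skip the hash check when rebuilding a data dir from its db file.
ETCDCTL_API=3 etcdctl snapshot restore /mnt/master-vol/data-trashcan/member/snap/db \
  --skip-hash-check=true --data-dir /tmp/etcd-restore

# 2. Run a temporary etcd on that data dir and take a proper snapshot from it.
ETCDCTL_API=3 etcdctl --endpoints=http://127.0.0.1:2379 snapshot save /tmp/etcd-backup.db

# 3. Queue the restore; this is purely S3 driven and can run from a workstation.
etcd-manager-ctl --backup-store=s3://kops-state-store-example/staging.example.com/backups/etcd/main list-backups
etcd-manager-ctl --backup-store=s3://kops-state-store-example/staging.example.com/backups/etcd/main restore-backup <backup-name>

# 4. Restart etcd-manager so it picks up the queued restore command.
crictl ps | grep etcd-manager-main
crictl stop <container-id>
```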
Alternative method of disaster recovery
In case there was no trashcan archive or any other (old) etcd snapshot available, we also back up our cluster with Velero. However, restoring from Velero means creating a new cluster, which can have various other implications and is definitely more time consuming.
We are also fully declarative (GitOps style) and could reinitialize a new cluster from these specifications. However, there are still some legacy applications which are not yet declarative. Another point I realized after this outage: I would not be able to restore the sealed secrets, as the private encryption keys would be lost forever (these need to be exported separately, just in case both Velero and etcd snapshots fail). Additionally, some apps store state in Kubernetes directly, which is obviously not declarative and would be lost.
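For the sealed secrets point, the controller's private keys can be exported ahead of time; a minimal sketch, assuming the default Bitnami sealed-secrets controller running in kube-system:

```bash
# Export the sealed-secrets encryption keys so encrypted manifests remain
# recoverable even if both Velero and etcd snapshots are lost.
kubectl get secret -n kube-system \
  -l sealedsecrets.bitnami.com/sealed-secrets-key \
  -o yaml > sealed-secrets-keys-backup.yaml
```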
Other clusters
This S3 lifecycle change was also introduced to other clusters in January, meaning that since January our clusters have been a ticking time bomb. Any process interruption, node interruption, or manual restart would have killed them.
On the etcd-manager leader control plane nodes of those clusters, the same reconcile-loop logs are found, warning that the cluster will be reseeded after a restart.
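A quick way to check whether a control plane node is in this state (log path as seen on our nodes):

```bash
# If this matches, the next etcd-manager restart would seed a new, empty cluster.
grep -i "must seed new cluster" /var/log/etcd.log
```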
I created these control files manually in the S3 bucket. However, etcd-manager did not pick up on them: after analyzing the source code, it turns out it caches the control specs and only reloads them on a restart or a leader change. After some tests in another multi-node cluster I came to the conclusion that it would not reseed the cluster once etcd-manager is restarted, since it would find the control files in the bucket again.
However, until we restart etcd-manager we are left without backups, as this state is cached.
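For anyone in the same situation, a quick way to verify whether the control files are still present in the backup path (placeholder bucket and cluster names):

```bash
aws s3 ls --recursive s3://kops-state-store-example/staging.example.com/backups/etcd/main/control/
```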
Conclusion
The lifecycle rule should never have been created; that was definitely our own fault. The retention should rather have been configured natively via kops, see https://kops.sigs.k8s.io/cluster_spec/#etcd-backups-retention.
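If I read the linked docs correctly, the native equivalent is a cluster spec setting along these lines (field name taken from those docs; please double-check it against your kops version):

```yaml
spec:
  etcdClusters:
  - name: main
    manager:
      backupRetentionDays: 90
```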
That said, I also don't expect such important control files to live underneath a backup path. Also, kops did not warn me that the update would (a) kill the cluster and (b) that there was no recent backup.
That these files would be recreated was visible in the kops dry run (which I only noticed while writing this document). But even if I had seen it before applying, I would not have realized that this would kill etcd.
We have been extremely lucky that there has been no process interruption in production so far, and the cluster changes I made today happened to trigger this on the staging cluster first.