
Warn if a new etcd cluster is seeded, move control files, report the last backup snapshot taken #16416

Closed
raffis opened this issue Mar 22, 2024 · 5 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. kind/feature Categorizes issue or PR as related to a new feature. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@raffis
Contributor

raffis commented Mar 22, 2024

/kind feature
/kind bug

Yesterday we experienced a huge downtime in our preproduction cluster. To be clear, it was entirely our own fault, but I think we could add various safeguards to kops to prevent similar incidents from happening.
I will first list my suggestions as bullets and then paste a postmortem from our internal docs that I just wrote (it contains some hints aimed at non-platform engineers, sorry about that), which explains in detail what happened.

Some of this might be relevant to https://github.com/kubernetes-sigs/etcdadm rather than kops.

  • I do not expect important files such as control/etcd-cluster-spec to live beneath a backup path that otherwise only contains snapshots. I strongly suggest moving these files out of any path suggesting it holds "just" backups.
  • kops update should inform the user if these files are not found, and in a more expressive way than just saying the files will be created. In other words, it should state clearly that a new etcd cluster will be seeded.
  • kops update should report the last available and valid etcd snapshot, and ask for confirmation to proceed if no recent snapshot is available (a rough sketch of such a check follows below).
  • etcd-manager should not cache the spec file forever (i.e. until a restart or leader change happens). It already recognizes that the file got deleted, so it should also recognize when it is created again.
  • etcd-manager should expose metrics such as last_snapshot_time and state-related metrics.

If even one of these suggestions is accepted I am happy to provide a pull request to implement the changes.
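To make the snapshot/control-file gate concrete, here is a minimal sketch of what such a pre-flight check could do, written as a plain shell script against the backup store. BUCKET and CLUSTER are placeholders for the kops state bucket and cluster name, and the exact UX inside kops would of course look different:

#!/usr/bin/env bash
# Hypothetical pre-flight check before `kops update cluster --yes` (BUCKET/CLUSTER are placeholders).
set -eu
STORE="s3://${BUCKET}/${CLUSTER}/backups/etcd/main"

# If the control files are gone, etcd-manager will seed a brand new cluster on its next restart.
for f in control/etcd-cluster-spec control/etcd-cluster-created; do
  if ! aws s3 ls "${STORE}/${f}" > /dev/null; then
    echo "WARNING: ${STORE}/${f} is missing - etcd-manager would seed a NEW cluster"
  fi
done

# Show the newest backup folders so the operator can verify there is a recent snapshot
# (etcd-manager normally uploads one every 15 minutes).
echo "Most recent backup prefixes under ${STORE}:"
aws s3 ls "${STORE}/" | grep PRE | grep -v control/ | tail -n 3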

Now that said I'm pasting the postmortem here which explains our mistake:

Summary

In the afternoon a cluster spec change (mainly related to micro cost reductions) as well as instance group changes were applied via kops v1.28.4. The changes were:

  • <internal pr>
  • <internal pr>

These changes were made the usual way using kops edit cluster and kops edit ig <ig-name>. Once edited, the changes were applied using kops update cluster --yes.

Once this command was applied, the kubernetes-api was unavailable for ~5 minutes. It eventually came back online, but the state was empty, meaning the staging cluster had essentially been created from scratch as a new cluster. It turned out all three master nodes had initialized a new etcd cluster (the database in which the kube-apiserver stores all resources).

In other words, this update command killed the entire cluster.
The outage lasted ~3h until we eventually recovered.

Trigger

A restart of the etcd-manager (issued via a kops update cluster command).

Resolution and root cause

Now, how to resolve this? A backup was required. Naturally, kops ships etcd-manager, which automatically takes etcd snapshots and stores them in the kops S3 state bucket. See https://github.com/kopeio/etcd-manager.

Usually one can revert to the last snapshot (taken at a 15 min interval) and recreate the etcd cluster, which resolves the issue.
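For context, that standard restore path is driven entirely through the backup store in S3; it looks roughly like this (bucket, cluster and backup names are placeholders):

# List the available snapshots in the backup store, then request a restore of one of them.
etcd-manager-ctl --backup-store=s3://<bucket-name>/<cluster-name>/backups/etcd/main list-backups
etcd-manager-ctl --backup-store=s3://<bucket-name>/<cluster-name>/backups/etcd/main restore-backup <backup-name>
# In our case the etcd containers on the control plane nodes also had to be restarted
# before the restore was acted upon (see the recovery steps further down).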

However, Murphy hit us there. The last backup had been uploaded in January and backups had silently stopped ever since. The reason for this was this PR:

<internal pr>

which is also the reason this outage happened in the first place.
What this PR did was add an S3 lifecycle rule targeting <cluster-name>/backups/etcd/main to delete snapshots older than 90 days. Another micro cost saving change.
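For illustration, the rule was roughly equivalent to the following (a reconstruction with the AWS CLI; the actual change lives in the internal PR above, so take the exact shape with a grain of salt):

# Reconstructed lifecycle rule (placeholders for bucket and cluster). Note the prefix matches
# everything under backups/etcd/main, not only the timestamped snapshot folders.
aws s3api put-bucket-lifecycle-configuration \
  --bucket <bucket-name> \
  --lifecycle-configuration '{
    "Rules": [{
      "ID": "expire-etcd-snapshots",
      "Status": "Enabled",
      "Filter": { "Prefix": "<cluster-name>/backups/etcd/main" },
      "Expiration": { "Days": 90 }
    }]
  }'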

However, what we were unaware of is that kops/etcd-manager stores control files within the same path, beneath ./control. The lifecycle rule deleted these files alongside the snapshots we actually wanted to delete. etcd-manager detected that the files were gone in its usual reconcile loop, which was visible in /var/log/etcd.log.

I0321 13:02:04.989764    5119 s3fs.go:338] Reading file "s3://<bucket-name>/<cluster-name>/backups/etcd/main/control/etcd-cluster-created"
I0321 13:02:05.382051    5119 controller.go:355] detected that there is no existing cluster
I0321 13:02:05.382063    5119 commands.go:41] refreshing commands
I0321 13:02:05.481669    5119 vfs.go:119] listed commands in s3://<bucket-name>/<cluster-name>/backups/etcd/main/control: 0 commands
I0321 13:02:05.481685    5119 s3fs.go:338] Reading file "s3://<bucket-name>/<cluster-name>/backups/etcd/main/control/etcd-cluster-spec"
I0321 13:02:05.579865    5119 controller.go:388] no cluster spec set - must seed new cluster
I0321 13:02:15.581677    5119 controller.go:185] starting controller iteration

The relevant part was no cluster spec set - must seed new cluster. This check happens before an etcd snapshot is taken, and since that point no more snapshots were uploaded. We were also unaware of it because there is no monitoring for these snapshots; we only check that our backups are functioning properly before we roll out a kubernetes upgrade.

And as the message must seed new cluster says, that is exactly what happened when the kops update command restarted etcd-manager three months later. We were left with a new cluster.

Luckily etcd-manager archives the old etcd data dir to *-trashcan. The recovery steps were as follows (a command-level sketch follows the list):

  1. Download the etcd db from the *-trashcan folder directly from the master node to a local machine using scp.
  2. Start a local etcd instance.
  3. Import the db using ETCDCTL_API=3 etcdctl snapshot restore --skip-hash-check=true (this step is necessary because the trashcan is not a snapshot but a plain copy of the etcd data dir).
  4. Restart the instance with the newly created data directory.
  5. Export a db snapshot using ETCDCTL_API=3 etcdctl snapshot save.
  6. Gzip the created snapshot.
  7. Manually create a new backup folder in the state s3 bucket and upload the gzipped snapshot into it as etcd.backup.gz.
  8. Recreate a _etcd_backup.meta file copied from an old existing backup (or from another cluster).
  9. With both of these files in place in the new backup folder, it was possible to run etcd-manager-ctl restore-backup with said backup. This command can be executed from the local machine, as the entire restore process is driven purely via s3.
  10. The etcd pods need to be manually restarted on all control plane nodes using crictl stop <id>, which starts the recovery process and spins up an etcd cluster with our state from before the outage.
  11. kube-apiserver eventually recovered (if I remember correctly I also killed the processes on all control nodes), meaning we had API responses again and our data back.
  12. A follow-up symptom was that new pods no longer started: the scheduler assigned them but they didn't start, as kubelet apparently did not recover on any node. I simply looped through the node list and issued a systemd restart of kubelet. At that point everything returned to normal after some pods (like cluster-autoscaler) had been restarted.
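Put together as commands, the recovery looked roughly like this (hostnames, paths and backup names are placeholders from memory, not copy-paste ready):

# 1. Copy the archived data dir from a control plane node (the exact trashcan path is an assumption).
scp -r root@<master-node>:/mnt/<etcd-volume>/var/etcd/data-trashcan ./trashcan

# 2-4. Restore from the raw db file; --skip-hash-check is needed because this is a copied
# data dir, not a snapshot produced by `etcdctl snapshot save`. Then run a local etcd on it.
ETCDCTL_API=3 etcdctl snapshot restore ./trashcan/member/snap/db \
  --skip-hash-check=true --data-dir ./restored
etcd --name local --data-dir ./restored &

# 5-6. Take a proper snapshot from the local instance and gzip it.
ETCDCTL_API=3 etcdctl snapshot save ./etcd.backup
gzip -c ./etcd.backup > etcd.backup.gz

# 7-8. Upload it as a new backup folder next to the old ones, together with a
# _etcd_backup.meta copied from an existing backup (or another cluster).
aws s3 cp etcd.backup.gz    s3://<bucket-name>/<cluster-name>/backups/etcd/main/<backup-name>/etcd.backup.gz
aws s3 cp _etcd_backup.meta s3://<bucket-name>/<cluster-name>/backups/etcd/main/<backup-name>/_etcd_backup.meta

# 9. Trigger the restore; the whole process is driven via S3, so this runs from the local machine.
etcd-manager-ctl --backup-store=s3://<bucket-name>/<cluster-name>/backups/etcd/main restore-backup <backup-name>

# 10. On every control plane node, restart the etcd containers so the restore is picked up.
crictl ps | grep etcd        # find the container ids
crictl stop <id>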

Alternative method of disaster recovery

In case there was no trashcan archive or any other (old) etcd snapshot available, we also back up our cluster with velero. However, restoring from velero means creating a new cluster, and that can have various other implications and is definitely more time consuming.

We are also fully declarative (gitops style) and could reinitialize a new cluster from these specifications. However, there are still some legacy applications that are not yet declarative. Another point I realized after this outage is that I would not be able to restore the sealed secrets, as the private encryption keys would be lost forever (these need to be exported separately in case both velero and the etcd snapshots fail). That said, some apps store state directly in kubernetes, which is obviously not declarative and would be lost.
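As a side note on the sealed secrets point, the controller's private keys can be exported ahead of time with plain kubectl, something like the following (namespace and label are assumptions based on a default sealed-secrets install):

# Export the sealed-secrets private keys so encrypted manifests can still be decrypted
# after a full cluster loss. Store the output somewhere safe outside the cluster.
kubectl get secret -n kube-system \
  -l sealedsecrets.bitnami.com/sealed-secrets-key \
  -o yaml > sealed-secrets-keys.backup.yaml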

Other clusters

This s3 lifecycle change was also introduced to other clusters in January, meaning that since January our clusters have been a ticking time bomb. Any process interruption, node interruption or manual restart would have killed them.

On the etcd-manager leader control plane nodes of these clusters, the same logs about reseeding the cluster after a restart show up during the reconcile loop.
I created these control files manually in the s3 bucket. However, etcd-manager did not pick them up: after analyzing the source code, it turns out it caches the control spec and only reloads it on a restart or leader change. After some tests in another multi-node cluster I came to the conclusion that it would not reseed the cluster once etcd-manager is restarted, as it would find the control files in the bucket again.
However, until we restart etcd-manager we are left without backups, as this state is cached.

Conclusion

The lifecycle rule should never have been created. It was definitely our own fault; the retention should rather have been configured natively via kops, see https://kops.sigs.k8s.io/cluster_spec/#etcd-backups-retention.

That said, I also don't expect such important control files to live underneath a backup path. Also, kops did not warn me that the update would A. kill the cluster and B. that there was no recent backup.

That these files would be recreated was visible in the kops dry run (which I only noticed while writing this document). But even if I had seen this before applying, I would not have realized that it would kill etcd.

Will create resources:
  ManagedFile/etcd-cluster-spec-events
  	Base                	s3://<bucket-name>/<cluster-name>/backups/etcd/events
  	Location            	/control/etcd-cluster-spec

  ManagedFile/etcd-cluster-spec-main
  	Base                	s3://<bucket-name>/<cluster-name>/backups/etcd/main
  	Location            	/control/etcd-cluster-spec

We have been extremely lucky that there was no process interruption in production so far, and that the cluster changes I made today triggered this on the staging cluster first.

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. kind/bug Categorizes issue or PR as related to a bug. labels Mar 22, 2024
@raffis
Contributor Author

raffis commented May 16, 2024

Any comment on this 🙏🏻 ?

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 14, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Sep 13, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot k8s-ci-robot closed this as not planned (Won't fix, can't repro, duplicate, stale) Oct 13, 2024
@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
