Disaster Recovery & Backups

Although Kubernetes (and especially managed Kubernetes services like GKE) provides out-of-the-box reliability and resiliency with self-healing capabilities, production systems still require a disaster recovery plan to protect against both human error (e.g., accidentally deleting a namespace or cluster) and failures of infrastructure outside of Kubernetes (e.g., persistent volumes).

Backups are Still Needed in HA Clusters

A key point that is often lost when running services in high availability (HA) mode is that HA (and thus replication) is not the same as having backups. HA protects against zonal failures, but it does not protect against data corruption or accidental deletions. It is very easy to mix up kubectl contexts or namespaces and delete or update the wrong resources. Since Kubernetes is declarative, the resources themselves are easy to restore by reapplying manifests, but the data on attached persistent volumes may not recover as easily.

Backups

The cluster state for Kubernetes is stored in the etcd datastore. Fortunately, managed Kubernetes services take care of etcd maintenance, so a catastrophic failure on the control plane does not wipe out the cluster state. If a managed service is not an option, the Kubernetes documentation has detailed instructions for backing up etcd data.
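
If you do manage your own control plane, a snapshot can be taken with etcdctl. The sketch below is a minimal example assuming the etcd v3 API and kubeadm-style certificate paths; adjust the endpoint and file locations for your environment.

```sh
# Take an etcd snapshot (v3 API); certificate paths assume a kubeadm-style
# install and will differ in other setups.
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  snapshot save /var/backups/etcd-snapshot.db

# Sanity-check the snapshot that was just written.
ETCDCTL_API=3 etcdctl snapshot status /var/backups/etcd-snapshot.db
```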

As for Kubernetes resources, backing up resource state is easily achieved by keeping manifest files in git. For example, if you use Helm, you can store your charts in ChartMuseum and check the values.yaml files into git. Another option is to use a tool like kube-backup to sync Kubernetes resource state to git.
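
As a rough sketch of the chart workflow (the chart name, version, and ChartMuseum URL below are placeholders): package the chart, upload it to ChartMuseum over its HTTP API, and keep the environment-specific values files in git.

```sh
# Package a chart and upload it to a ChartMuseum instance; chart name,
# version, and URL are placeholders for illustration.
helm package ./my-service                      # e.g. my-service-0.1.0.tgz
curl --data-binary "@my-service-0.1.0.tgz" \
  https://chartmuseum.example.com/api/charts

# Check the environment-specific values into git alongside other manifests.
git add values-production.yaml
git commit -m "Update production values for my-service"
```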

This leaves the persistent volumes of stateful workloads to back up. Perhaps you are running Elasticsearch as a StatefulSet or have a custom database for a specific need. For example, Leverege runs a self-managed TimescaleDB as a time-series database within Kubernetes. Since its data lives on an attached persistent disk, we needed to automate backups for it in case of failure.
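
On GKE, one low-level safety net is to snapshot the underlying GCE persistent disk directly; the disk name, zone, and snapshot name below are placeholders. The next section covers Velero, which automates this at the Kubernetes level.

```sh
# One-off snapshot of the GCE persistent disk backing a PVC (names are
# placeholders); Velero, described below, automates this per resource.
gcloud compute disks snapshot my-timescaledb-disk \
  --zone=us-central1-a \
  --snapshot-names=timescaledb-backup-20190101
```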

Velero/Ark

Velero is an open-source backup tool from Heptio that works well with persistent disks. Velero is compatible with popular object storage options (e.g., S3, GCS, Azure Storage) and requires that a storage bucket be set up prior to deployment. There is a Helm chart available for easy installation, but it currently lags behind the official release version (it also still uses the old name, ark, which adds to the confusion). Both the Helm chart and the manual installation steps on the documentation website detail how to create snapshots, restore from backups, and schedule recurring backups.
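
The commands below are a minimal sketch of that workflow on GCP; the bucket name, credentials file, namespace, and cron expression are placeholders, and the exact install flags vary by Velero version.

```sh
# Install Velero against a pre-created GCS bucket (flags vary by version;
# bucket and credentials file are placeholders).
velero install \
  --provider gcp \
  --bucket my-velero-backups \
  --secret-file ./gcp-credentials.json

# One-off backup of a namespace, including persistent volume snapshots.
velero backup create timescaledb-backup --include-namespaces timescaledb

# Recurring backup on a cron schedule (daily at 3 AM here).
velero schedule create timescaledb-daily \
  --schedule="0 3 * * *" \
  --include-namespaces timescaledb

# Restore from a previous backup after an accidental deletion.
velero restore create --from-backup timescaledb-backup
```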

Although Leverege has not yet needed Velero in production to recover from a failure, we have found it to be a good tool for testing and for replaying backed-up data to troubleshoot and debug issues. Velero can also be used to migrate data between projects.