Etcd Backup on Pollux #1197

Open
JamesDoingStuff wants to merge 3 commits into main from jg/etcd-backup

@JamesDoingStuff
Contributor

The dev_resources/build CI is currently failing. I think this is because check_k8s_resources.py does not handle CronJobs well: specifically, a CronJob has a resources field but no replicas field. I'll look into fixing this.

Adds:

  • CronJob that executes daily to take a snapshot of one of the etcd PVs and upload it to Echo. Backups are timestamped and stored under the path dls-workflows-prod/<staging/prod>/etcd-snapshot-<timestamp>.db. The contents are encrypted. The job deletes backups older than 2 days.
  • CronJob to download the snapshot and perform an etcdctl snapshot restore on the provided etcd volume. This job won't run automatically.
  • Script (scripts/restore-etcd.sh) that scales down etcd and the vcluster, runs the above job for each etcd volume, then scales both back up to their initial levels.
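
For readers who want a picture of the restore sequence described in the last bullet, here is a minimal sketch. It is not the contents of scripts/restore-etcd.sh. The StatefulSet and CronJob names, the idea that both etcd and the vcluster are StatefulSets, and the dry-run wrapper are all assumptions made for illustration, and how each job is pointed at a particular etcd volume is not shown in this PR description, so it is left out.

```shell
#!/usr/bin/env bash
# Sketch of the restore sequence; every resource name below is hypothetical.
set -euo pipefail

NAMESPACE="${NAMESPACE:-workflows}"                 # assumed namespace
ETCD_STS="${ETCD_STS:-etcd}"                        # hypothetical StatefulSet name
VCLUSTER_STS="${VCLUSTER_STS:-vcluster}"            # hypothetical StatefulSet name
RESTORE_CRONJOB="${RESTORE_CRONJOB:-etcd-restore}"  # hypothetical CronJob name
DRY_RUN="${DRY_RUN:-1}"                             # default: only print commands

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Usage: restore_etcd <etcd replica count> <vcluster replica count>
restore_etcd() {
  local etcd_replicas="$1" vcluster_replicas="$2"
  local i job

  # 1. Scale the vcluster and etcd down so nothing writes during the restore.
  run kubectl -n "$NAMESPACE" scale statefulset "$VCLUSTER_STS" --replicas=0
  run kubectl -n "$NAMESPACE" scale statefulset "$ETCD_STS" --replicas=0

  # 2. Start one restore job per etcd volume and wait for it to finish.
  for ((i = 0; i < etcd_replicas; i++)); do
    job="${RESTORE_CRONJOB}-${i}"
    run kubectl -n "$NAMESPACE" create job "$job" --from="cronjob/${RESTORE_CRONJOB}"
    run kubectl -n "$NAMESPACE" wait --for=condition=complete "job/${job}" --timeout=600s
  done

  # 3. Return both workloads to their original replica counts.
  run kubectl -n "$NAMESPACE" scale statefulset "$ETCD_STS" --replicas="$etcd_replicas"
  run kubectl -n "$NAMESPACE" scale statefulset "$VCLUSTER_STS" --replicas="$vcluster_replicas"
}
```

Calling `restore_etcd 3 1` with the default `DRY_RUN=1` only prints the kubectl commands, which makes it easy to review the order of operations before running anything against a cluster.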

Contributor

Why is staging being backed up and not prod?

Contributor Author

Probably best to roll it out on staging first and just make sure all is well - I just needed to add something to the Values.yaml for prod

Contributor

Makes sense! I would add a simple ticket to the backlog in case it has not been added yet

namespace: workflows
type: Opaque
{{ else }}
{{- end }}
Contributor

should there be an empty line after all of these?

Contributor Author

Does it do anything? I see that the other templates have one, so I don't mind adding one in if it's convention, just curious


# Delete old backed up objects, with age >= 2 days.
echo "deleting old backups from echo s3"
rclone delete --min-age=2d echo:dls-workflows-prod/${PREFIX}
Collaborator

This is fine for this PR, but we should decide on a retention strategy: how many backups we want to keep, and for how long.
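
As one possible shape for the "how many" half of that decision, here is a small sketch of count-based retention to sit alongside the age-based `rclone delete --min-age`. The `select_expired` helper is hypothetical, and it assumes the timestamp embedded in each snapshot name sorts lexicographically (for example `20240131T020000Z`), which this PR does not confirm.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: read snapshot object names on stdin (one per line) and
# print the ones that fall outside the newest KEEP entries.
# Assumes names sort lexicographically by timestamp (not confirmed by the PR).
select_expired() {
  local keep="$1"
  # Newest first, then skip the first KEEP lines; the rest are candidates
  # for deletion.
  sort -r | tail -n "+$((keep + 1))"
}

# Possible use in the backup job (not executed here):
#   rclone lsf "echo:dls-workflows-prod/${PREFIX}" | select_expired 7
# and then remove each printed object, e.g. with `rclone deletefile`.
```

Keeping the selection logic as a pure filter means it can be checked with plain text input before it is wired up to any real bucket.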

@JamesDoingStuff
Contributor Author

Made a couple of minor changes to get this passing CI (last 2 commits) so if someone could just sanity check those please, that'd be great :)

3 participants