Provide commands to check WAL archive destination is usable #443
56311a0 to c1861ea
from barman.xlog import check_archive_usable

try:
    import argparse
Why does the argparse import need to be wrapped in a try-except?
This is a good question and I think the answer is "there is no longer any good reason".
The first commit to introduce this was 83e9979, where the barman-wal-restore.py script was added. That script would have needed to support Python 2.6 (and presumably older versions of 3.x), which didn't ship with argparse, so there was a genuine possibility it could run without argparse being available. Providing a friendly error message was therefore useful.
Now that we only support Python versions which include argparse, I don't think there is a need to do this at all. I'll remove it from this PR and, while we're at it, I'll update the other scripts too.
After a bit more thinking I think it's probably sufficient to be able to say "no archives or timeline history files on timeline X or any greater timeline". Archives on the same timeline as end-of-recovery don't matter when we're doing a TLI increment. Pg does a TLI increment if restoring from backup (backup label exists) or if started up in archive recovery or streaming replication. It only skips a TLI increment if it thinks the startup is normal crash recovery. The only way I can see that happening here is if someone takes a disk or filesystem snapshot and starts a new Pg from it without any recovery configuration or starting up as a standby. In either case what actually matters is the timeline and LSN at which Pg will write the first WAL seg after end-of-recovery, not the current timeline and LSN of the Pg instance. (Will update soon, thinking about it some more)
@ringerc So we could maybe simplify this to just passing in a
85a6833 to 9b0e116
(just rebased and updated now that the argh->argparse work has landed - everything else remains the same in the PR for now but I'll squash what needs to be squashed before merging)
@ringerc I've just pushed a commit which removes
@jthreefoot-edb I know you've already reviewed this but could you please take another look now the options have changed?
Looks good to me.
dfea625 to 8817d96
Adds commands to barman and barman cloud which check that a barman server
or cloud location is safe to use as an archive destination for a new
PostgreSQL server.

A location is considered safe if either:

1. There are no WAL files at all in the archive.
2. All existing WAL files belong to an older timeline than that
   specified by the --timeline argument.

A file is considered a WAL file if it passes the `is_any_xlog_file` check
in `barman/xlog.py` so this applies to WAL files, history files, partial
WAL files and backup labels.

The commands added are:

* barman check-wal-archive
* barman-cloud-check-wal-archive

The motivation for this patch is to provide a way that external
orchestration tools can validate the WAL archive destination is safe for
a newly provisioned PostgreSQL cluster, given such a cluster may use the
exact same name as an old cluster.

In such scenarios, any WAL files on the same or higher timeline as the
WALs being written by the new cluster will cause any attempt to restore
from a backup to fail.

Reasons why external orchestration tooling may re-use the same cluster
name and archive destination include (but are not limited to):

* A new cluster is created via initdb with the same name as the old one.
  The sysid will be different but this does not affect the archive
  destination so any archived WALs relating to the older cluster will be
  present in the same location.
* A cluster is restored from a base backup and uses the same name as the
  old cluster. The cluster has the same sysid and starts with a segment
  ID > 1 and timeline > 1. The same archive destination used by the old
  cluster will be used for the restored cluster.
* A new cluster is started which happens to re-use the same name and
  archive destination.

All of these cases lead to the situation where WAL archiving and backup
is functioning normally *but* any attempts to restore from those backups
will fail. This is dangerous for anyone relying on the databases managed
by external orchestration/automation.

The commands provided by this patch do not solve the problem alone
because neither Barman nor PostgreSQL have the necessary context. The
commands can, however, be added to external automation in order to
catch archive safety issues at the provisioning stage.

Closes #432
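
For readers following the two safety conditions in the commit message, here is a minimal sketch of the rule in Python. It is not the code from this PR: `timeline_of` and `archive_is_safe` are hypothetical names, and the sketch assumes every WAL-related file name (WAL segment, history file, partial WAL file, backup label) begins with an 8-hex-digit timeline ID, as in standard PostgreSQL naming.

```python
import re

# Assumption: WAL-related file names begin with an 8-hex-digit timeline ID,
# optionally followed by 16 more hex digits (log and segment numbers) and a
# suffix such as ".history", ".partial" or ".<offset>.backup".
_TIMELINE_PREFIX = re.compile(r"^([0-9A-Fa-f]{8})(?:[0-9A-Fa-f]{16})?(?:\.|$)")


def timeline_of(file_name):
    """Return the timeline ID encoded at the start of a WAL-related file name."""
    match = _TIMELINE_PREFIX.match(file_name)
    if match is None:
        raise ValueError("not a WAL-related file name: %r" % file_name)
    return int(match.group(1), 16)


def archive_is_safe(wal_file_names, timeline):
    """Hypothetical check mirroring the two conditions in the commit message.

    The archive is treated as safe when it holds no WAL-related files at all,
    or when every file belongs to a timeline older than `timeline`.
    """
    names = list(wal_file_names)
    if not names:
        return True
    return all(timeline_of(name) < timeline for name in names)
```

With this sketch, an archive holding only timeline 1 and 2 files would be reported safe for a cluster that will start writing on timeline 3, and unsafe for one that will write on timeline 2.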
We only support python versions which ship with argparse now so it is no longer necessary to catch ImportError and print a friendly error when importing argparse.
8817d96 to 875fb4e