Provide commands to check WAL archive destination is usable #443

mikewallace1979 · 2021-11-02T17:23:50Z

Adds commands to barman and barman cloud which check that a barman server
or cloud location is safe to use as an archive destination for a new
PostgreSQL server.

A location is considered safe if either:

1. There are no WAL files at all in the archive.
2. All existing WAL files belong to an older timeline than that
   specified by the --timeline argument.

A file is considered a WAL file if it passes the is_any_xlog_file check
in barman/xlog.py, so the check applies to WAL segments, history files,
partial WAL files and backup labels.
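The acceptance rule above can be sketched in Python as follows. This is a simplified illustration, not barman's own check_archive_usable: the names archive_is_usable and wal_timeline and the regular expression are hypothetical approximations of what is_any_xlog_file matches. The sketch assumes every matched name (segment, .partial, .backup label or .history file) carries its timeline in the first eight hex digits, and it does not handle compressed-file suffixes.

```python
import re
from typing import Iterable, Optional

# Hypothetical approximation of the names is_any_xlog_file accepts:
# plain segments, .partial segments, .backup labels and .history files.
# In every case the first 8 hex digits are the timeline.
_WAL_NAME_RE = re.compile(
    r"^(?P<tli>[0-9A-Fa-f]{8})"
    r"(?:[0-9A-Fa-f]{16}(?:\.partial|\.[0-9A-Fa-f]{8}\.backup)?|\.history)$"
)


def wal_timeline(name: str) -> Optional[int]:
    """Return the timeline encoded in a WAL-related file name, or None."""
    match = _WAL_NAME_RE.match(name)
    return int(match.group("tli"), 16) if match else None


def archive_is_usable(names: Iterable[str], timeline: Optional[int] = None) -> bool:
    """Apply the safety rule from the PR description (hypothetical sketch).

    Without a timeline, the archive must contain no WAL files at all.
    With a timeline, every WAL file must be on a strictly older timeline.
    """
    timelines = [tli for tli in map(wal_timeline, names) if tli is not None]
    if not timelines:
        return True  # rule 1: no WAL files at all
    if timeline is None:
        return False  # WAL files exist and no timeline was given
    return all(tli < timeline for tli in timelines)  # rule 2
```

For example, an archive holding only timeline 1 segments plus a 00000002.history file passes with a timeline of 3 but fails with 2, because the history file is not on an older timeline than 2.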

The commands added are:

* barman check-wal-archive
* barman-cloud-check-wal-archive

The motivation for this patch is to give external orchestration tools a
way to validate that the WAL archive destination is safe for a newly
provisioned PostgreSQL cluster, given that such a cluster may use the
exact same name as an old cluster.

In such scenarios, any WAL files on the same or higher timeline as the
WALs being written by the new cluster will cause any attempt to restore
from a backup to fail.

Reasons why external orchestration tooling may re-use the same cluster
name and archive destination include (but are not limited to):

* A new cluster is created via initdb with the same name as the old one.
  The sysid will be different, but this does not affect the archive
  destination, so any archived WALs relating to the older cluster will be
  present in the same location.
* A cluster is restored from a base backup and uses the same name as the
  old cluster. The cluster has the same sysid and starts with a segment
  ID > 1 and timeline > 1. The same archive destination used by the old
  cluster will be used for the restored cluster.
* A new cluster is started which happens to re-use the same name and
  archive destination.
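For reference when reading these scenarios: a WAL segment file name is 24 hexadecimal digits made up of the timeline, the log number and the segment number, 8 digits each. The helper below is a hypothetical illustration (not part of barman) that splits a name into those parts, which makes the difference between the first and second scenarios visible in the file names themselves.

```python
import string
from typing import NamedTuple


class WalSegmentName(NamedTuple):
    """Components of a WAL segment file name, each 8 hex digits wide."""

    timeline: int
    log: int
    segment: int


def parse_wal_segment_name(name: str) -> WalSegmentName:
    """Split a 24-hex-digit WAL segment name into timeline, log and segment."""
    if len(name) != 24 or any(c not in string.hexdigits for c in name):
        raise ValueError(f"not a WAL segment name: {name!r}")
    return WalSegmentName(*(int(name[i:i + 8], 16) for i in (0, 8, 16)))


# The first segment written by a freshly initdb'd cluster:
print(parse_wal_segment_name("000000010000000000000001"))
# -> WalSegmentName(timeline=1, log=0, segment=1)
# A segment from a cluster further along in WAL on timeline 3:
print(parse_wal_segment_name("00000003000000000000002A"))
# -> WalSegmentName(timeline=3, log=0, segment=42)
```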

All of these cases lead to the situation where WAL archiving and backup
are functioning normally but any attempts to restore from those backups
will fail. This is dangerous for anyone relying on the databases managed
by external orchestration/automation.

The commands provided by this patch do not solve the problem alone
because neither Barman nor PostgreSQL has the necessary context. The
commands can, however, be added to external automation in order to
catch archive safety issues at the provisioning stage.
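As an illustration of such a provisioning-stage check, an orchestration script written in Python might gate cluster creation on the command's exit status. This is a sketch under assumptions: it assumes barman check-wal-archive takes the server name as a positional argument plus the --timeline option described in this PR, and that a non-zero exit status means the archive is not safe; confirm both with barman check-wal-archive --help for your version. The function name wal_archive_is_safe and the injectable run parameter are hypothetical and exist only to keep the sketch testable.

```python
import subprocess
from typing import Callable, List, Optional


def wal_archive_is_safe(
    server: str,
    timeline: Optional[int] = None,
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> bool:
    """Return True if `barman check-wal-archive` exits successfully.

    Assumed invocation: barman check-wal-archive [--timeline N] SERVER
    Assumed contract: exit status 0 means the archive is safe to use.
    """
    argv: List[str] = ["barman", "check-wal-archive"]
    if timeline is not None:
        argv += ["--timeline", str(timeline)]
    argv.append(server)
    # check=False: an unsafe archive is an expected outcome, so inspect the
    # exit status instead of letting subprocess raise CalledProcessError.
    result = run(argv, check=False)
    return result.returncode == 0
```

For example, wal_archive_is_safe("pg-main", timeline=1), with "pg-main" a hypothetical server name, would run barman check-wal-archive --timeline 1 pg-main, and a pipeline could stop before initdb if it returns False.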

Closes #432

jthreefoot-edb · 2021-11-04T16:16:34Z

barman/clients/cloud_check_wal_archive.py

+from barman.xlog import check_archive_usable
+
+try:
+    import argparse


Why does the argparse import need to be wrapped in a try-except?

mikewallace1979 · 2021-11-05

This is a good question and I think the answer is "there is no longer any good reason".

The first commit to introduce this was 83e9979, where the barman-wal-restore.py script was added. That script would have needed to support Python 2.6 (and presumably older versions of 3.x), which didn't ship with argparse, so there was a genuine possibility it could run without argparse being available and providing a friendly error message was useful.

Now that we only support Python versions which include argparse, I don't think there is a need to do this at all. I'll remove it from this PR and, while we're at it, I'll update the other scripts too.
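With argparse available on every supported Python version, a client script can import it at module level with no ImportError handling. The parser below is a hypothetical sketch of what a check-wal-archive style cloud client parser could look like; the positional argument names destination_url and server_name are assumptions, and only the --timeline option is taken from this PR.

```python
import argparse  # always available on supported Python versions; no try/except


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser for a check-wal-archive style cloud client."""
    parser = argparse.ArgumentParser(
        description="Check that a WAL archive destination is safe to use.",
    )
    parser.add_argument("destination_url", help="URL of the cloud archive location")
    parser.add_argument("server_name", help="name of the server in the archive")
    parser.add_argument(
        "--timeline",
        type=int,
        default=None,
        help="only WAL files on timelines older than this are allowed",
    )
    return parser


args = build_parser().parse_args(["s3://bucket/barman", "pg", "--timeline", "2"])
print(args.timeline)  # 2
```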

ringerc · 2021-11-08T06:57:32Z

After a bit more thought, I think it's probably sufficient to be able to say "no archives or timeline history files on timeline X or any greater timeline".

Archives on the same timeline as end-of-recovery don't matter when we're doing a TLI increment. Pg does a TLI increment if restoring from backup (backup label exists) or if started up in archive recovery or streaming replication. It only skips a TLI increment if it thinks the startup is normal crash recovery. The only way I can see that happening here is if someone takes a disk or filesystem snapshot and starts a new Pg from it without any recovery configuration and without starting it as a standby.

In either case, what actually matters is the timeline and LSN at which Pg will write the first WAL segment after end-of-recovery, not the current timeline and LSN of the Pg instance.

(Will update soon, thinking about it some more)
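The rule in this comment can be restated as a small function: what gets compared against the archive is the timeline on which the new instance writes its first WAL after end-of-recovery. The helper first_write_timeline below is hypothetical and encodes the rule only as described in the comment, assuming the increment is exactly one, which is a simplification. Its result is the kind of value one would pass as --timeline.

```python
def first_write_timeline(recovery_timeline: int, crash_recovery_only: bool) -> int:
    """Timeline of the first WAL written after end-of-recovery (hypothetical sketch).

    Per the discussion: Pg increments the timeline after restoring from a
    backup (backup label present) or after archive recovery / streaming
    replication, and keeps it only for what it treats as crash recovery.
    """
    if recovery_timeline < 1:
        raise ValueError("timelines start at 1")
    if crash_recovery_only:
        return recovery_timeline
    # Simplification: assume the new timeline is exactly one above.
    return recovery_timeline + 1


# Restoring a base backup taken on timeline 2: the archive must hold nothing
# on timeline 3 or later.
print(first_write_timeline(2, crash_recovery_only=False))  # 3
# A filesystem snapshot started without recovery configuration stays on 2.
print(first_write_timeline(2, crash_recovery_only=True))  # 2
```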

mikewallace1979 · 2021-11-09T12:45:27Z

@ringerc So we could maybe simplify this to just passing in a --current-timeline and not require a WAL segment at all?

mikewallace1979 · 2021-11-09T14:09:57Z

(just rebased and updated now that the argh->argparse work has landed - everything else remains the same in the PR for now but I'll squash what needs to be squashed before merging)

mikewallace1979 · 2021-11-16T18:32:35Z

@ringerc I've just pushed a commit which removes --current-wal-segment and changes the behaviour so if --timeline is provided then it passes only if all WAL files in the archive relate to an earlier timeline.

@jthreefoot-edb I know you've already reviewed this but could you please take another look now the options have changed?

jthreefoot-edb · 2021-11-16T20:23:11Z

> @jthreefoot-edb I know you've already reviewed this but could you please take another look now the options have changed?

Looks good to me.



edb-sonar-app · 2021-11-16T22:23:24Z

SonarQube Quality Gate:

0 Bugs
0 Vulnerabilities
0 Security Hotspots
0 Code Smells

95.1% Coverage
1.8% Duplication

mikewallace1979 mentioned this pull request Nov 2, 2021

Implement new barman and/or barman-cloud subcommands to test the archive-destination contents #432

Closed

mikewallace1979 force-pushed the dev/432-barman-wal-archive-usable-check branch 2 times, most recently from 56311a0 to c1861ea November 4, 2021 14:04

jthreefoot-edb reviewed Nov 4, 2021

jthreefoot-edb approved these changes Nov 4, 2021

mikewallace1979 force-pushed the dev/432-barman-wal-archive-usable-check branch from 85a6833 to 9b0e116 November 9, 2021 14:09

mikewallace1979 marked this pull request as ready for review November 16, 2021 18:32

mikewallace1979 requested a review from jthreefoot-edb November 16, 2021 20:21

jthreefoot-edb approved these changes Nov 16, 2021

mikewallace1979 force-pushed the dev/432-barman-wal-archive-usable-check branch from dfea625 to 8817d96 November 16, 2021 22:01

mikewallace1979 added 3 commits November 16, 2021 22:17

Update docs for check-wal-archive commands

3003446

Remove friendly argparse import errors

875fb4e

We only support python versions which ship with argparse now so it is no longer necessary to catch ImportError and print a friendly error when importing argparse.

mikewallace1979 force-pushed the dev/432-barman-wal-archive-usable-check branch from 8817d96 to 875fb4e November 16, 2021 22:20

mikewallace1979 merged commit 3bc7505 into master Nov 16, 2021

mikewallace1979 deleted the dev/432-barman-wal-archive-usable-check branch November 16, 2021 22:28
