Provide commands to check WAL archive destination is usable #443

Merged: mikewallace1979 merged 3 commits into master from dev/432-barman-wal-archive-usable-check on Nov 16, 2021

mikewallace1979 (Contributor) commented Nov 2, 2021

Adds commands to barman and barman cloud which check that a barman server
or cloud location is safe to use as an archive destination for a new
PostgreSQL server.

A location is considered safe if either:

  1. There are no WAL files at all in the archive.
  2. All existing WAL files belong to an older timeline than that
    specified by the --timeline argument.

A file is considered a WAL file if it passes the `is_any_xlog_file` check
in `barman/xlog.py`, so this rule applies to WAL files, history files,
partial WAL files and backup labels.
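
The rule above can be sketched in a few lines of Python. This is only an illustration of the described behaviour, not the code in this PR: `timeline_of` and `archive_is_safe` are hypothetical names, and the regular expression is a rough approximation of the file name shapes listed above rather than Barman's `is_any_xlog_file`.

```python
import re

# Assumption: the first 8 hex digits of a WAL-related file name encode the
# timeline. This pattern only approximates the names Barman recognises.
_WAL_NAME = re.compile(
    r"^([0-9A-F]{8})"                          # timeline ID
    r"(?:[0-9A-F]{16}"                         # log + segment number
    r"(?:\.partial|\.[0-9A-F]{8}\.backup)?"    # partial WAL or backup label
    r"|\.history)$"                            # timeline history file
)


def timeline_of(name):
    """Return the timeline encoded in a WAL-like file name, or None."""
    match = _WAL_NAME.match(name)
    return int(match.group(1), 16) if match else None


def archive_is_safe(names, timeline=None):
    """Apply the two safety conditions to a listing of archive file names."""
    timelines = [t for t in map(timeline_of, names) if t is not None]
    if not timelines:
        return True       # condition 1: no WAL files at all
    if timeline is None:
        return False      # WAL files exist and no timeline was given
    return max(timelines) < timeline   # condition 2: all on older timelines
```

For example, an archive holding only `000000010000000000000002` would be rejected without a timeline, accepted with `timeline=2`, and rejected with `timeline=1`.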

The commands added are:

  • barman check-wal-archive
  • barman-cloud-check-wal-archive

The motivation for this patch is to provide a way for external
orchestration tools to validate that the WAL archive destination is safe
for a newly provisioned PostgreSQL cluster, given that such a cluster may
use the exact same name as an old cluster.

In such scenarios, any WAL files on the same or higher timeline as the
WALs being written by the new cluster will cause any attempt to restore
from a backup to fail.

Reasons why external orchestration tooling may re-use the same cluster
name and archive destination include (but are not limited to):

  • A new cluster is created via initdb with the same name as the old one.
    The sysid will be different but this does not affect the archive
    destination so any archived WALs relating to the older cluster will be
    present in the same location.
  • A cluster is restored from a base backup and uses the same name as the
    old cluster. The cluster has the same sysid and starts with a segment
    ID > 1 and timeline > 1. The same archive destination used by the old
    cluster will be used for the restored cluster.
  • A new cluster is started which happens to re-use the same name and
    archive destination.

All of these cases lead to a situation where WAL archiving and backups
are functioning normally but any attempt to restore from those backups
will fail. This is dangerous for anyone relying on the databases managed
by external orchestration/automation.

The commands provided by this patch do not solve the problem alone
because neither Barman nor PostgreSQL has the necessary context. The
commands can, however, be added to external automation in order to
catch archive safety issues at the provisioning stage.
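
As a rough sketch of how provisioning automation might call these commands: the helper names below are hypothetical, the positional argument layout (server name for `barman`, source URL then server name for the cloud variant) is an assumption to verify against the Barman documentation, and so is the convention that a non-zero exit status means the archive is unsafe.

```python
import subprocess


def build_check_command(server_name, timeline=None, cloud_url=None):
    """Build the argv for the archive check (argument layout is assumed)."""
    if cloud_url is None:
        cmd = ["barman", "check-wal-archive", server_name]
    else:
        cmd = ["barman-cloud-check-wal-archive", cloud_url, server_name]
    if timeline is not None:
        cmd += ["--timeline", str(timeline)]
    return cmd


def archive_destination_ok(server_name, timeline=None, cloud_url=None,
                           run=subprocess.run):
    """Return True if the check command exits with status 0.

    `run` is injectable so the function can be exercised without Barman
    installed; by default it is subprocess.run.
    """
    result = run(build_check_command(server_name, timeline, cloud_url))
    return result.returncode == 0
```

A provisioning step could then refuse to enable archiving for the new cluster whenever `archive_destination_ok(...)` returns `False`.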

Closes #432

from barman.xlog import check_archive_usable

try:
    import argparse
Contributor

Why does the argparse import need to be wrapped in a try-except?

Contributor Author

This is a good question and I think the answer is "there is no longer any good reason".

The first commit to introduce this was 83e9979, where the barman-wal-restore.py script was added. That script needed to support Python 2.6 (and presumably early 3.x versions), which didn't ship with argparse, so there was a genuine possibility of it running without argparse available, and a friendly error message was useful.

Now that we only support Python versions which include argparse, I don't think there is a need to do this at all. I'll remove it from this PR and, while we're at it, I'll update the other scripts too.
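
For readers unfamiliar with the pattern under discussion, here is a minimal sketch of a guarded import next to the plain import that replaces it. The error message text is a placeholder, not the wording used in the Barman scripts.

```python
import sys

# Guarded import: only useful on interpreters without argparse in the
# standard library (Python 2.6, 3.0 and 3.1).
try:
    import argparse
except ImportError:
    # Placeholder message, not the one from the Barman scripts.
    sys.exit("Missing required python module: argparse")

# On any interpreter that bundles argparse (2.7 and 3.2 onwards) the
# except branch can never run, so a plain import is equivalent.
import argparse  # noqa: F811
```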

ringerc commented Nov 8, 2021

After a bit more thinking I think it's probably sufficient to be able to say "no archives or timeline history files on timeline X or any greater timeline".

Archives on the same timeline as end-of-recovery don't matter when we're doing a TLI increment. Pg does a TLI increment if restoring from backup (backup label exists) or if started up in archive recovery or streaming replication. It only skips a TLI increment if it thinks the startup is normal crash recovery. The only way I can see that happening here is if someone takes a disk or filesystem snapshot and starts a new Pg from it without any recovery configuration or starting up as a standby.

In either case what actually matters is the timeline and LSN at which Pg will write the first WAL seg after end-of-recovery, not the current timeline and LSN of the Pg instance.
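
The reasoning in this comment can be written down as a small helper. It is only a sketch: the function name is hypothetical, and it assumes a timeline increment moves at least one timeline ahead, which simplifies how PostgreSQL actually picks the new timeline ID.

```python
def earliest_first_write_timeline(current_timeline, crash_recovery_only=False):
    """Lowest timeline on which the first post-recovery WAL could be written.

    crash_recovery_only=True models the one case described above in which
    no timeline increment happens: a plain crash-recovery startup, e.g. from
    a disk snapshot with no recovery configuration.
    """
    if current_timeline < 1:
        raise ValueError("PostgreSQL timelines start at 1")
    if crash_recovery_only:
        return current_timeline
    # Restore from backup, archive recovery or standby promotion: assume
    # the new timeline is at least one higher than the current one.
    return current_timeline + 1
```

Under these assumptions, a cluster restored from a backup taken on timeline 1 would first write WAL on timeline 2 or later, so the archive must hold nothing on timeline 2 or above.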

(Will update soon, thinking about it some more)

@mikewallace1979
Contributor Author

@ringerc So we could maybe simplify this to just passing in a --current-timeline and not require a WAL segment at all?

mikewallace1979 force-pushed the dev/432-barman-wal-archive-usable-check branch from 85a6833 to 9b0e116 on November 9, 2021 14:09
@mikewallace1979
Contributor Author

(just rebased and updated now that the argh->argparse work has landed - everything else remains the same in the PR for now but I'll squash what needs to be squashed before merging)

@mikewallace1979
Contributor Author

@ringerc I've just pushed a commit which removes --current-wal-segment and changes the behaviour so that, if --timeline is provided, the check passes only if all WAL files in the archive belong to an earlier timeline.

@jthreefoot-edb I know you've already reviewed this but could you please take another look now the options have changed?

mikewallace1979 marked this pull request as ready for review on November 16, 2021 18:32
@jthreefoot-edb
Contributor

> @jthreefoot-edb I know you've already reviewed this but could you please take another look now the options have changed?

Looks good to me.

mikewallace1979 force-pushed the dev/432-barman-wal-archive-usable-check branch from dfea625 to 8817d96 on November 16, 2021 22:01
We only support Python versions which ship with argparse now, so it is no
longer necessary to catch ImportError and print a friendly error when
importing argparse.
mikewallace1979 force-pushed the dev/432-barman-wal-archive-usable-check branch from 8817d96 to 875fb4e on November 16, 2021 22:20
mikewallace1979 merged commit 3bc7505 into master on Nov 16, 2021
mikewallace1979 deleted the dev/432-barman-wal-archive-usable-check branch on November 16, 2021 22:28
Successfully merging this pull request may close these issues.

Implement new barman and/or barman-cloud subcommands to test the archive-destination contents
3 participants