Implement new barman and/or barman-cloud subcommands to test the archive-destination contents #432
Comments
I agree this is an important feature: it would give us an easier way to catch the same bucket being shared by different servers. Maybe an option "--require-empty-target"?
This is to prevent a variety of issues where a new cluster or a restored backup of an old cluster can write into an archive directory that was already used by a prior cluster. The obvious hazard here is that WAL segments could be overwritten, or if overwrite is not allowed, WAL segment archive failures. For example, if using k8s and a CNP The bigger issue is actually with failover and recovery, not backup restore. Especially in CNP where failover and rolling update etc is routine. If the CNP Cluster restarts a replica pod, does a failover, or restores a base backup, it will use
When this happens the logs will look something like this:
It is important to understand that there are multiple ways we can land up writing timeline history files and/or WAL segments with an older LSN and/or timeline than what's already in an archive directory, and not all of them begin with segment 1:
All of these are operator error: the operator should ensure that the name used for backup archives is unique for each and every postgres cluster they bring up, either as a restore or by initdb. But in practice people won't do that, especially with deployment automation, cloud management platforms, declarative configurations deeply layered inside other systems, etc. And right now there isn't a good way to detect such a misconfiguration and fail-fast.

Ideally the misconfigured DB wouldn't start up and start writing WAL at all, but that's not something we can do a lot about when dealing with standalone postgres instances where we don't necessarily know if this instance is a new startup of an existing instance or a restored copy from a backup. Really robust solutions probably need to happen in the management layers that are responsible for creating new clusters / restoring backups.

The biggest help barman can be to those layers is to expose an interface that lets deployment automation (CNP, ansible scripts, or whatever else) quickly and easily check if a given archive destination is empty or non-empty. That way they can fail-fast if it's non-empty with an informative error, and we don't have to inefficiently list archive directory contents for each barman wal archive command run.
Proposal

Add a barman and barman-cloud subcommand to check an archive destination to ensure it is empty. This subcommand should return 0 (non-error) on empty, 1 (error) on non-empty, 2 (error) on failure to check / inaccessible destination, etc. If feasible, offer the option to ignore existing WAL segs and timeline history files in the archive that are on the starting server's history but safely in the past, so the same check can be used for safer replica promotions too.

Rely on the higher level agent automation to run it when creating a cluster. Tools would not run this check when promoting a replica during a failover or when starting up an existing cluster after a restart, crash, etc. Only when making a logically distinct new cluster with a new history: restoring a backup, making a staging/dev copy of a database, initdb'ing a new database, etc.

It would be useful to be able to specify additional options to ignore any WAL segs and timeline history files older than the server's current timeline and LSN. We could use this variant during promotion of a replica to make sure that the archive destination doesn't have any WAL from a different promotion, e.g. in split-brain scenarios where fencing and STONITH failed or in cases of operator error where the operator failed to tell us they were creating a new cluster. So if feasible I'd like the option
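To make the proposed exit-code contract concrete, here is a minimal sketch in Python, assuming 0 means empty, 1 means non-empty and 2 means the destination could not be checked. It is not Barman code: `check_archive_destination` and the `list_files` callable are hypothetical names used only for illustration, and the listing backend (local directory, cloud bucket, etc.) is left to the caller.

```python
from typing import Callable, Iterable

EXIT_EMPTY = 0         # destination is empty: safe for a new cluster
EXIT_NON_EMPTY = 1     # destination already holds at least one file
EXIT_CHECK_FAILED = 2  # destination could not be inspected


def check_archive_destination(list_files: Callable[[], Iterable[str]]) -> int:
    """Map the state of an archive destination to an exit code.

    `list_files` is a hypothetical callable that yields the names found in
    the destination and raises OSError if the destination is unreachable.
    """
    try:
        # Only the first entry is needed, so a lazy listing is never
        # enumerated in full.
        first = next(iter(list_files()), None)
    except OSError:
        return EXIT_CHECK_FAILED
    return EXIT_EMPTY if first is None else EXIT_NON_EMPTY
```

Automation that creates a new cluster could call this once at provisioning time and refuse to continue on any non-zero result, rather than paying the listing cost on every archive command.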
Rationale

As established above, the proposal in this ticket won't address the issue as written. Barman's archive command could look-ahead in the archive from whatever WAL seg or timeline history file that it's being asked to store, to make sure there are no existing WAL segs or timeline history files "in the future" in the same archive destination. But that would be a prohibitive performance cost, especially for cloud blob stores where enumeration is expensive.

What barman can do cheaply and efficiently is help the higher level automation (human admin, Ansible/Puppet code, k8s Operator, etc) answer the question "is this archive destination safe to use for a new cluster". Then let that automation ask the question when it knows it's making a new cluster. And for a new cluster, an archive destination is safe if it's empty.

It'd be nice to also be able to sanity check promotion safety as noted above. But again it'd need to be a command exposed to the user or their automation agent, not something done for each archive command invocation.

Background

There's nothing that barman alone can do to prevent this problem, because it needs to be told when a postgres instance is a new and independent cluster vs a continuation of an existing one. At its roots this is a postgres limitation - and one that doesn't have a lot of simple solutions available.

When PostgreSQL is started up it can't possibly know if it's the same postgres instance that just shut down, or if it has been copied somewhere else / restored from backup. It could be a SAN snapshot, cloned VM, or all sorts of things. If it's a continuation of an existing cluster it must write WAL archives to the same location to ensure backup continuity.
If it's a new independent cluster, it must not write WAL archives to the same location as its ancestor instance to avoid intermingling WAL and timeline history files from cluster histories that are diverging - whether due to backup restore from an older point in time, creation of a separate dev/staging instance or whatever else.

PostgreSQL really wants the human operator or their automation agent to tell it whenever it gets restored from a base backup or copied to a new cluster instance. It cannot know it's a truly new and independent cluster based on the existence of a backup label file or based on being a promoted replica, because both of those are normal for HA/failover setups that maintain a single consistent linear history too. Right now the user "tells" PostgreSQL when it is a new and independent PostgreSQL cluster instance by changing the

Barman has exactly the same limitations as postgres itself here. PostgreSQL's Timeline ID helps Pg detect misconfigurations and wrong archives when it is in recovery, but it doesn't do a lot to prevent multiple instances or branches in an instance's history from writing to a single archive destination. It doesn't solve this issue, it's only a partial defense for it.

Arguably PostgreSQL itself could expose a better user interface for this and be more defensive about it - for example, it could have an option to re-generate the cluster sysid on end-of-recovery promotion, and offer a placeholder for passing the cluster sysid to the

What barman can do is expose helper interfaces to users and higher level automation tools like Ansible modules, k8s Operators, etc. Those things do know when they're making a "new" cluster vs bringing up new replicas / restarting an existing cluster. That's what I'm proposing we do here.
@mnencia @gbartolini @amenonsen @mikewallace1979 Should I turn the feature spec of the above into a separate ticket, then we close this one?
Thanks for the detailed write up @ringerc - that is the perfect amount of context. I'm happy continuing on this issue and updating the title to reflect your proposal.
@mnencia Cool. The TL;DR is then: Implement new barman and/or barman-cloud subcommands to test the archive-destination contents. At minimum have a subcommand that returns 0 if the archive destination is empty, or 1 if it is non-empty. Preferably also have the option to pass the server's current timeline and last WAL segment. If specified, ignore any timeline history files <= the current timeline, any WAL segments < the current timeline, and any WAL segments = the current timeline but < the current WAL segment. Possible syntax might be:
Adds commands to barman and barman cloud which check that a barman server or cloud location is safe to use as an archive destination for a new PostgreSQL server.

A location is considered safe if either:

1. There are no WAL files at all in the archive.
2. Any existing WAL files meet one of the following criteria:
   a. They belong to an older timeline than that specified by the current_timeline argument.
   b. They are on the same timeline as the current_timeline argument but the segment is less than that specified in the current_wal_segment argument.

A file is considered a WAL file if it passes the `is_any_xlog_file` check in `barman/xlog.py` so this applies to WAL files, history files, partial WAL files and backup labels.

The commands added are:

* barman check-wal-archive
* barman-cloud-check-wal-archive

The motivation for this patch is to provide a way that external orchestration tools can validate the WAL archive destination is safe for a newly provisioned PostgreSQL cluster, given such a cluster may use the exact same name as an old cluster. In such scenarios, any WAL files which have a higher timeline or segment than the WALs being written by the new cluster will cause any attempt to restore from a backup to fail.

Reasons why external orchestration tooling may re-use the same cluster name and archive destination include (but are not limited to):

* A new cluster is created via initdb with the same name as the old one. The sysid will be different but this does not affect the archive destination so any archived WALs relating to the older cluster will be present in the same location.
* A cluster is restored from a base backup and uses the same name as the old cluster. The cluster has the same sysid and starts with a segment ID > 1 and timeline > 1. The same archive destination used by the old cluster will be used for the restored cluster.
* A new cluster is started which happens to re-use the same name and archive destination.

All of these cases lead to the situation where WAL archiving and backup is functioning normally *but* any attempts to restore from those backups will fail. This is dangerous for anyone relying on the databases managed by external orchestration/automation.

The commands provided by this patch do not solve the problem alone because neither Barman nor PostgreSQL have the necessary context. The commands can, however, be added to external automation in order to catch archive safety issues at the provisioning stage.

Closes #432
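The safety criteria in the commit message above can be sketched as a pure function over file names. This is an illustration only, not the code from the patch: `parse_wal_name` and `archive_is_safe` are hypothetical helpers, the regular expression is a simplified assumption about WAL-related file names (8 hex digits of timeline followed by 16 hex digits of segment, or a `.history` suffix) rather than Barman's `is_any_xlog_file`, and `current_wal_segment` is assumed to be passed as a 24-character segment file name.

```python
import re
from typing import Iterable, Optional, Tuple

# Simplified assumption about WAL-related names, not Barman's own pattern.
_WAL_RE = re.compile(
    r"^([0-9A-Fa-f]{8})"
    r"(?:([0-9A-Fa-f]{16})(?:\.partial|\.[0-9A-Fa-f]{8}\.backup)?|\.history)$"
)


def parse_wal_name(name: str) -> Optional[Tuple[int, Optional[int]]]:
    """Return (timeline, segment) for a WAL-related name, else None.

    `segment` is None for timeline history files.
    """
    m = _WAL_RE.match(name)
    if m is None:
        return None
    timeline = int(m.group(1), 16)
    segment = int(m.group(2), 16) if m.group(2) else None
    return timeline, segment


def archive_is_safe(
    names: Iterable[str],
    current_timeline: Optional[int] = None,
    current_wal_segment: Optional[str] = None,
) -> bool:
    """Apply criteria 1, 2a and 2b to the names found in an archive."""
    current_segment = (
        int(current_wal_segment[8:], 16) if current_wal_segment else None
    )
    for name in names:
        parsed = parse_wal_name(name)
        if parsed is None:
            continue  # not a WAL-related file, so it is ignored
        timeline, segment = parsed
        if current_timeline is None:
            return False  # criterion 1: any WAL file is unsafe
        if timeline < current_timeline:
            continue  # criterion 2a: older timeline
        if (
            timeline == current_timeline
            and segment is not None
            and current_segment is not None
            and segment < current_segment
        ):
            continue  # criterion 2b: same timeline, earlier segment
        # Same-timeline history files and anything newer end up here.
        return False
    return True
```

For example, `archive_is_safe(["000000020000000000000003"], 2, "000000020000000000000004")` is true under criterion 2b, while the same call with a file on timeline 3 is false.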
The proposed PR #443 adds These accept If either I didn't add the If those flags would still be helpful to external automation scripts then let me know and we can still add them. One other thing to clarify - the PR currently assumes |
@mikewallace1979 Thanks. Just saw this (GH not notifying me for some reason). Will look.
Adds commands to barman and barman cloud which check that a barman server or cloud location is safe to use as an archive destination for a new PostgreSQL server.

A location is considered safe if either:

1. There are no WAL files at all in the archive.
2. All existing WAL files belong to an older timeline than that specified by the --timeline argument.

A file is considered a WAL file if it passes the `is_any_xlog_file` check in `barman/xlog.py` so this applies to WAL files, history files, partial WAL files and backup labels.

The commands added are:

* barman check-wal-archive
* barman-cloud-check-wal-archive

The motivation for this patch is to provide a way that external orchestration tools can validate the WAL archive destination is safe for a newly provisioned PostgreSQL cluster, given such a cluster may use the exact same name as an old cluster. In such scenarios, any WAL files on the same or higher timeline as the WALs being written by the new cluster will cause any attempt to restore from a backup to fail.

Reasons why external orchestration tooling may re-use the same cluster name and archive destination include (but are not limited to):

* A new cluster is created via initdb with the same name as the old one. The sysid will be different but this does not affect the archive destination so any archived WALs relating to the older cluster will be present in the same location.
* A cluster is restored from a base backup and uses the same name as the old cluster. The cluster has the same sysid and starts with a segment ID > 1 and timeline > 1. The same archive destination used by the old cluster will be used for the restored cluster.
* A new cluster is started which happens to re-use the same name and archive destination.

All of these cases lead to the situation where WAL archiving and backup is functioning normally *but* any attempts to restore from those backups will fail. This is dangerous for anyone relying on the databases managed by external orchestration/automation.

The commands provided by this patch do not solve the problem alone because neither Barman nor PostgreSQL have the necessary context. The commands can, however, be added to external automation in order to catch archive safety issues at the provisioning stage.

Closes #432
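This revision of the commit message drops the segment comparison, so a file on the same timeline is unsafe regardless of its segment number. A minimal sketch of that timeline-only rule follows, with the same caveats as before: `safe_for_timeline` is a hypothetical name, and the pattern is a loose prefix match chosen for brevity (it accepts any suffix after the segment digits), not Barman's `is_any_xlog_file`.

```python
import re
from typing import Iterable, Optional

# Prefix match only: 8 hex digits of timeline followed by either 16 hex
# digits of segment or ".history". Suffixes such as ".partial" also match.
_TIMELINE_RE = re.compile(r"^([0-9A-Fa-f]{8})(?:[0-9A-Fa-f]{16}|\.history)")


def safe_for_timeline(names: Iterable[str], timeline: Optional[int] = None) -> bool:
    """Safe if no WAL files exist, or all are on a timeline older than `timeline`."""
    for name in names:
        m = _TIMELINE_RE.match(name)
        if m is None:
            continue  # not WAL-related, ignored
        if timeline is None or int(m.group(1), 16) >= timeline:
            return False
    return True
```

With this rule, `safe_for_timeline(["000000020000000000000003"], timeline=2)` is false, because the file shares the timeline being checked.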
We could add an option to `barman-cloud-wal-archive` to enable the following behavior:

* `barman-cloud-wal-archive` should fail if the target position contains any backup or WAL
* `barman-cloud-wal-archive` should fail if the target contains any WAL in that segment

The rationale is to prevent mixing WALs from different instances.