Implementation of backup tree deletion #946
Conversation
barman/server.py
Outdated
deleted = False
backups_to_delete = backup.walk_backups_tree()
for backup in backups_to_delete:
    deleted = self.perform_delete_backup(backup)
I need to take a closer look at this PR.
IAC I wonder whether this deleted variable will hold only the value of the "root" backup deletion.
What should we do if any child fails to be deleted, e.g. if we can't acquire a lock? At a first glance it seems we would have a leftover.
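To make the concern concrete, here is a minimal sketch (hypothetical names, not the PR's actual code) of how aggregating per-backup results would surface a failed child deletion instead of reporting only the outcome of the last, i.e. root, deletion:

# Minimal illustration (hypothetical names, not the PR's actual code): if the
# loop simply overwrites "deleted" on every iteration, only the outcome of the
# last backup (the root) survives; aggregating the results reports them all.
def delete_tree_and_report(backups_to_delete, perform_delete_backup):
    results = [perform_delete_backup(backup) for backup in backups_to_delete]
    # True only if every backup in the tree was actually deleted
    return all(results)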
FWIW by coincidence I hit an issue more or less like the one I brought up above. After that I was never able to run barman delete and had to physically remove the files from the barman_home/barman_server/base directory to recover from that:
(barman) [vagrant@barmandevhost barman]$ barman list-backup pg17
pg17 20240624T215009 'child2.1' - Mon Jun 24 19:50:10 2024 - Size: 35.7 MiB - WAL Size: 0 B
pg17 20240624T214947 'child2' - Mon Jun 24 19:49:48 2024 - Size: 35.7 MiB - WAL Size: 64.0 MiB
pg17 20240624T214834 'child1.1.2' - Mon Jun 24 19:48:35 2024 - Size: 35.7 MiB - WAL Size: 112.0 MiB
pg17 20240624T214828 'child1.1.1' - Mon Jun 24 19:48:29 2024 - Size: 35.7 MiB - WAL Size: 32.0 MiB
pg17 20240624T214819 'child1.1' - Mon Jun 24 19:48:21 2024 - Size: 35.7 MiB - WAL Size: 32.0 MiB
pg17 20240624T214809 'child1' - Mon Jun 24 19:48:10 2024 - Size: 35.7 MiB - WAL Size: 32.0 MiB
pg17 20240624T214757 'root_backup' - Mon Jun 24 19:48:00 2024 - Size: 355.7 MiB - WAL Size: 32.0 MiB
(barman) [vagrant@barmandevhost barman]$ barman delete pg17 root_backup
Backup 20240624T214757 has incremental backups which depent on it. Deleting all backups in the tree
Deleting backup 20240624T214828 for server pg17
Deleted backup 20240624T214828 (start time: Mon Jun 24 21:50:59 2024, elapsed time: less than one second)
Deleting backup 20240624T214834 for server pg17
Deleted backup 20240624T214834 (start time: Mon Jun 24 21:50:59 2024, elapsed time: less than one second)
Deleting backup 20240624T214819 for server pg17
Deleted backup 20240624T214819 (start time: Mon Jun 24 21:50:59 2024, elapsed time: less than one second)
Deleting backup 20240624T214809 for server pg17
Deleted backup 20240624T214809 (start time: Mon Jun 24 21:50:59 2024, elapsed time: less than one second)
Deleting backup 20240624T214843 for server pg17
EXCEPTION: Could not find backup_id 20240624T214843
See log file for more details.
(barman) [vagrant@barmandevhost barman]$ barman list-backup pg17
(barman) [vagrant@barmandevhost barman]$ barman list-backup pg17
pg17 20240624T215009 'child2.1' - Mon Jun 24 19:50:10 2024 - Size: 35.7 MiB - WAL Size: 0 B
pg17 20240624T214947 'child2' - Mon Jun 24 19:49:48 2024 - Size: 35.7 MiB - WAL Size: 64.0 MiB
pg17 20240624T214757 'root_backup' - Mon Jun 24 19:48:00 2024 - Size: 355.7 MiB - WAL Size: 240.0 MiB
(barman) [vagrant@barmandevhost barman]$ barman delete pg17 root_backup
Backup 20240624T214757 has incremental backups which depent on it. Deleting all backups in the tree
Deleting backup 20240624T214809 for server pg17
EXCEPTION: Could not find backup_id 20240624T214809
See log file for more details.
(barman) [vagrant@barmandevhost barman]$ barman delete pg17 20240624T215009
Deleting backup 20240624T215009 for server pg17
Deleted backup 20240624T215009 (start time: Mon Jun 24 21:52:31 2024, elapsed time: less than one second)
(barman) [vagrant@barmandevhost barman]$ barman list-backup all
pg17 20240624T214947 'child2' - Mon Jun 24 19:49:48 2024 - Size: 35.7 MiB - WAL Size: 64.0 MiB
pg17 20240624T214757 'root_backup' - Mon Jun 24 19:48:00 2024 - Size: 355.7 MiB - WAL Size: 240.0 MiB
(barman) [vagrant@barmandevhost barman]$ barman delete pg17 20240624T214947
Backup 20240624T214947 has incremental backups which depent on it. Deleting all backups in the tree
Deleting backup 20240624T214952 for server pg17
EXCEPTION: Could not find backup_id 20240624T214952
See log file for more details.
(barman) [vagrant@barmandevhost barman]$ barman list-backup all
pg17 20240624T214947 'child2' - Mon Jun 24 19:49:48 2024 - Size: 35.7 MiB - WAL Size: 64.0 MiB
pg17 20240624T214757 'root_backup' - Mon Jun 24 19:48:00 2024 - Size: 355.7 MiB - WAL Size: 240.0 MiB
(barman) [vagrant@barmandevhost barman]$ barman delete pg17 20240624T214757
Backup 20240624T214757 has incremental backups which depent on it. Deleting all backups in the tree
Deleting backup 20240624T214809 for server pg17
EXCEPTION: Could not find backup_id 20240624T214809
See log file for more details.
(barman) [vagrant@barmandevhost barman]$ barman list-backup all
pg17 20240624T214947 'child2' - Mon Jun 24 19:49:48 2024 - Size: 35.7 MiB - WAL Size: 64.0 MiB
pg17 20240624T214757 'root_backup' - Mon Jun 24 19:48:00 2024 - Size: 355.7 MiB - WAL Size: 240.0 MiB
Maybe we can treat this whole deletion of a chain as a single unit, something like transactions in postgres, all or nothing.
I need to take a closer look at this PR. IAC I wonder whether this deleted variable will hold only the value of the "root" backup deletion. What should we do if any child fails to be deleted, e.g. if we can't acquire a lock? At a first glance it seems we would have a leftover.
That's something I was thinking as well. I mean, the errors will be logged so the user will know which backups couldn't be deleted and could potentially delete them manually. On the other hand, if we stop the loop in case any backup fails along the way it would also require the user to manually fix whatever is wrong and try again. Both require manual intervention, but thinking more about it, maybe it's better to force the user to delete everything before deleting the parent to avoid leftovers, as you said.
@barthisrael I think this error that you faced is because we are not updating the children_ids in the backup_info as children are deleted. So when you try to run the delete for the second time it tries to find backups that were already deleted but are still referenced in the parent file. That's something important that I was not even remembering: update the references as backups are deleted 😵
Pushed some changes to stop the process if an incremental fails to be deleted, and also a change to update the children_ids as they are deleted.
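For illustration, a rough sketch of those two behaviours follows; the helper names, the get_parent lookup, and the save() call are assumptions, not necessarily the PR's exact code:

# Sketch only: stop the chain as soon as one deletion fails, and drop each
# deleted child's id from its parent's children_ids so a later retry never
# looks up a backup that is already gone. Names are illustrative assumptions.
def delete_backup_tree(backups_to_delete, perform_delete_backup, get_parent):
    for backup in backups_to_delete:
        if not perform_delete_backup(backup):
            # Abort: the remaining backups (including the root) stay intact
            # and the command can simply be retried later.
            return False
        parent = get_parent(backup)  # assumed lookup of the parent backup_info
        if parent is not None and backup.backup_id in parent.children_ids:
            parent.children_ids.remove(backup.backup_id)
            parent.save()  # persist the updated metadata (assumed API)
    return True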
I need to take a closer look at this PR. IAC I wonder whether this deleted variable will hold only the value of the "root" backup deletion. What should we do if any child fails to be deleted, e.g. if we can't acquire a lock? At a first glance it seems we would have a leftover.
The lock issue you raised is interesting, and I think that we should look into the locking sequence and be sure that a lock is acquired and kept for the whole chain-delete action, avoiding other interactions from external processes.
I'm just thinking out loud here btw.
Maybe we can treat this whole deletion of a chain as a single unit, something like transactions in postgres, all or nothing.
This is potentially a good idea if we can implement something like a dry run of the deletion of the whole chain...
Otherwise it can potentially become something that is going to consume a lot of disk space...
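One possible shape of such a dry run, sketched with hypothetical helpers (this is not part of the PR), would be to validate every backup in the chain before touching anything, which gives all-or-nothing behaviour for predictable failures without keeping extra copies on disk:

# Hypothetical "dry run" pass: check every backup in the chain first, delete
# only if all checks pass. The can_delete helper is an assumption used for
# illustration, not barman's actual API.
def delete_chain_all_or_nothing(backups_to_delete, can_delete, perform_delete_backup):
    blockers = [b for b in backups_to_delete if not can_delete(b)]
    if blockers:
        # Nothing was touched; report which backups would block the deletion.
        return False, blockers
    for backup in backups_to_delete:
        perform_delete_backup(backup)
    return True, []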
The lock issue you raised is interesting, and I think that we should look into the locking sequence and be sure that a lock is acquired and kept for the whole chain-delete action, avoiding other interactions from external processes. I'm just thinking out loud here btw.
Do you think we need a lock on every single backup beforehand or should a lock on the server be enough? I was thinking of acquiring a server lock at the start and then proceeding to acquire backup locks on demand as incrementals are being deleted.
As I see in the code, both the backup and backup delete actions require a server lock acquisition, so neither of those actions will be able to run while the chain is being deleted. The server lock will protect the whole chain and the backup lock will protect each backup during its deletion. In the end I think that would serve the same purpose as holding all locks at once, right? I don't know if I'm missing something.
What do you guys think? @gcalacoci @barthisrael @andremagui
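As a sketch of the scheme described above (the _lock helper below is a placeholder; barman's real lock classes, file names and paths may differ): one server-level lock is held around the whole chain, and one backup-level lock is acquired on demand for each individual deletion.

# Sketch of the proposed locking: one server-level lock held for the whole
# chain, plus one per-backup lock acquired on demand. The _lock helper is a
# stand-in for a real file-based lock; names and paths are assumptions.
from contextlib import contextmanager

@contextmanager
def _lock(path):
    # Placeholder: a real implementation would create/acquire a lock file here.
    yield

def delete_tree_with_locks(server_name, backups_to_delete, perform_delete_backup,
                           lock_dir="/var/lib/barman"):
    with _lock(f"{lock_dir}/.{server_name}-backup.lock"):  # protects the whole chain
        for backup in backups_to_delete:
            backup_lock = f"{lock_dir}/.{server_name}-{backup.backup_id}.lock"
            with _lock(backup_lock):  # protects this individual backup
                if not perform_delete_backup(backup):
                    return False
    return True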
I would like to know more about locking to make meaningful suggestions :D but I feel that acquiring just one lock would be enough. As it is, at the start of the delete_backup method in the server there is a check whether it's possible to acquire a server lock, which is then released; but in fact just one lock is created when deleting backups, just before the call to backup_manager. So does it make sense to have one lock for all deletions or a lock for each deletion?
Sorry, instead of helping I just added more questions!!!!
Force-pushed from 9d5eba6 to ce1b93e
The changes introduced here look good and are aligned with what we discussed during our call.
I think we are only missing the "general server lock" when deleting the backups. Other than that we are doing what is expected:
- Delete all backups in the tree, updating the references between nodes along the way;
- If an issue is faced while deleting a backup of the tree, stop and return an error.
Force-pushed from 6a732c0 to 860d7dd
This seems to be working fine:
$ barman list-backups pg17
pg17 20240628T000532 'yet_another_full' - Thu Jun 27 22:05:34 2024 - Size: 251.7 MiB - WAL Size: 0 B - WAITING_FOR_WALS
pg17 20240628T000524 'another_incremental' - Thu Jun 27 22:05:26 2024 - Size: 35.7 MiB - WAL Size: 32.0 MiB
pg17 20240628T000511 'another_full' - Thu Jun 27 22:05:13 2024 - Size: 251.7 MiB - WAL Size: 32.0 MiB
pg17 20240628T000443 'child_2_1' - Thu Jun 27 22:04:44 2024 - Size: 35.7 MiB - WAL Size: 32.0 MiB
pg17 20240628T000436 'child_1_2' - Thu Jun 27 22:04:37 2024 - Size: 35.7 MiB - WAL Size: 32.0 MiB
pg17 20240628T000433 'child_1_1' - Thu Jun 27 22:04:34 2024 - Size: 35.7 MiB - WAL Size: 32.0 MiB
pg17 20240628T000419 'child_2' - Thu Jun 27 22:04:20 2024 - Size: 35.7 MiB - WAL Size: 32.0 MiB
pg17 20240628T000416 'child_1' - Thu Jun 27 22:04:17 2024 - Size: 35.7 MiB - WAL Size: 32.0 MiB
pg17 20240628T000321 'full_1' - Thu Jun 27 22:03:25 2024 - Size: 251.7 MiB - WAL Size: 96.0 MiB
$ barman delete pg17 child_2
WARNING: Backup 20240628T000419 has incremental backups which depend on it. Deleting all backups in the tree
Deleting backup 20240628T000443 for server pg17
Deleted backup 20240628T000443 (start time: Fri Jun 28 00:08:05 2024, elapsed time: less than one second)
Deleting backup 20240628T000419 for server pg17
Deleted backup 20240628T000419 (start time: Fri Jun 28 00:08:05 2024, elapsed time: less than one second)
$ barman list-backups pg17
pg17 20240628T000532 'yet_another_full' - Thu Jun 27 22:05:34 2024 - Size: 251.7 MiB - WAL Size: 0 B - WAITING_FOR_WALS
pg17 20240628T000524 'another_incremental' - Thu Jun 27 22:05:26 2024 - Size: 35.7 MiB - WAL Size: 32.0 MiB
pg17 20240628T000511 'another_full' - Thu Jun 27 22:05:13 2024 - Size: 251.7 MiB - WAL Size: 32.0 MiB
pg17 20240628T000436 'child_1_2' - Thu Jun 27 22:04:37 2024 - Size: 35.7 MiB - WAL Size: 64.0 MiB
pg17 20240628T000433 'child_1_1' - Thu Jun 27 22:04:34 2024 - Size: 35.7 MiB - WAL Size: 32.0 MiB
pg17 20240628T000416 'child_1' - Thu Jun 27 22:04:17 2024 - Size: 35.7 MiB - WAL Size: 64.0 MiB
pg17 20240628T000321 'full_1' - Thu Jun 27 22:03:25 2024 - Size: 251.7 MiB - WAL Size: 96.0 MiB
$ barman delete pg17 child_1_2
Deleting backup 20240628T000436 for server pg17
Deleted backup 20240628T000436 (start time: Fri Jun 28 00:08:58 2024, elapsed time: less than one second)
$ barman list-backups pg17
pg17 20240628T000532 'yet_another_full' - Thu Jun 27 22:05:34 2024 - Size: 251.7 MiB - WAL Size: 0 B - WAITING_FOR_WALS
pg17 20240628T000524 'another_incremental' - Thu Jun 27 22:05:26 2024 - Size: 35.7 MiB - WAL Size: 32.0 MiB
pg17 20240628T000511 'another_full' - Thu Jun 27 22:05:13 2024 - Size: 251.7 MiB - WAL Size: 32.0 MiB
pg17 20240628T000433 'child_1_1' - Thu Jun 27 22:04:34 2024 - Size: 35.7 MiB - WAL Size: 96.0 MiB
pg17 20240628T000416 'child_1' - Thu Jun 27 22:04:17 2024 - Size: 35.7 MiB - WAL Size: 64.0 MiB
pg17 20240628T000321 'full_1' - Thu Jun 27 22:03:25 2024 - Size: 251.7 MiB - WAL Size: 96.0 MiB
$ barman delete pg17 full_1
WARNING: Backup 20240628T000321 has incremental backups which depend on it. Deleting all backups in the tree
Deleting backup 20240628T000433 for server pg17
Deleted backup 20240628T000433 (start time: Fri Jun 28 00:09:34 2024, elapsed time: less than one second)
Deleting backup 20240628T000416 for server pg17
Deleted backup 20240628T000416 (start time: Fri Jun 28 00:09:34 2024, elapsed time: less than one second)
Deleting backup 20240628T000321 for server pg17
Delete associated WAL segments:
000000010000000300000039
00000001000000030000003A
00000001000000030000003B
00000001000000030000003C
00000001000000030000003D
00000001000000030000003E
00000001000000030000003F
000000010000000300000040
000000010000000300000041
000000010000000300000042
000000010000000300000043
000000010000000300000044
000000010000000300000045
000000010000000300000046
000000010000000300000047
000000010000000300000048
Deleted backup 20240628T000321 (start time: Fri Jun 28 00:09:34 2024, elapsed time: less than one second)
$ barman list-backups pg17
pg17 20240628T000532 'yet_another_full' - Thu Jun 27 22:05:34 2024 - Size: 251.7 MiB - WAL Size: 0 B - WAITING_FOR_WALS
pg17 20240628T000524 'another_incremental' - Thu Jun 27 22:05:26 2024 - Size: 35.7 MiB - WAL Size: 32.0 MiB
pg17 20240628T000511 'another_full' - Thu Jun 27 22:05:13 2024 - Size: 251.7 MiB - WAL Size: 32.0 MiB
Good work here!
The implementation is clean, and seems to be working fine.
Now the usual nitpicking -- otherwise I wouldn't be Israel 😆
- I put a couple of suggestions regarding docstrings of delete_backup and perform_delete_backup
- Let's add References: BAR-181 to the PR description
- Let's add References: BAR-181 to the 3 commits
- Let's modify the first commit message a bit, just so we explain the "why" behind the following couple of changes:
1 - Some validations (check keep, check redundancy) were moved from the backup manager to the server as they will now need to happen in the very first stage of the deletion.
3 - The whole chain deletion will happen inside a server lock and every individual backup deletion will happen inside a backup lock.
Point 1 would make it easier for us to understand the code if we don't look at it for a while. Points 2, 3 and 4 are relevant if we need to inspect the commit log some day in the future.
When a postgres full backup with incrementals is deleted we have to make sure that all its descendants are deleted along with it. Similarly, when a child backup is deleted we have to make sure that its reference is removed from its parent. To achieve this, the following changes were made:
1 - Some validations (check keep and check redundancy) were moved from the backup manager to the server as they now need to happen in the very first stage of the deletion, which happens in the server context. This becomes required once we have to make sure that the parent itself can be deleted before proceeding with the deletion of any of its children.
2 - Split the 'delete' method of the server into two separate methods so it gets easier to distinguish the prepare phase (where children are gathered) and the actual delete phase (where the backup manager is called).
3 - The whole chain deletion will happen inside a server lock and every individual backup deletion will happen inside a backup lock. This ensures that the whole process is done consistently, thus avoiding errors in the middle which would require the user to try again.
References: BAR-181
Signed-off-by: Gustavo William <gustavowilliam0980@gmail.com>
As some validations had to be moved from the backup manager to the server, some tests which were making use of those validations also needed to change accordingly. Also, tests involving the lock phase changed as the locking mechanism changed slightly too.
References: BAR-181
Signed-off-by: Gustavo William <gustavowilliam0980@gmail.com>
Adding tests for the logic of deleting child backups when a parent is deleted. A new test was added to ensure that the children are indeed being deleted along with the parent. An existing test was modified to make sure that, once a child has been deleted, its reference is completely removed from its parent.
References: BAR-181
Signed-off-by: Gustavo William <gustavowilliam0980@gmail.com>
@barthisrael Thanks, I completely forgot about the references 😆. Just pushed the changes you asked for.
Looks good!
Let's wait for @martinmarques' review before merging, but in the meantime LGTM.
A random thought that just came up: maybe we need to handle this also in Barman cloud?
IMHO barman cloud and incremental is something that is going to come later. We first focus on local, and then add cloud. IIRC we discussed this with @martinmarques during the design.
I really don't recall this specific detail, sorry. IAC what you said makes sense to me, as we discussed implementing this in incremental steps.
This PR got bigger than I expected so I will try to explain the changes per commit.
In 129c672 is the implementation for deleting descendant backup nodes when a parent is deleted. To achieve this, I divided the current delete method of the Server into two methods: delete, which will now run some validations to make sure the backup is deletable (this used to be done later in the backup manager) as well as pass the whole chain for deletion inside a server-level lock; and perform_delete, which will be doing some of the same work that the previous delete method was doing, i.e. validating file permissions, holding the backup-level lock and calling BackupManager.delete. One thing to notice here is that this code used to acquire a server lock in the beginning, but it was released before the deletion even started. I also think the error message there was not entirely correct, as it would say something like "there's another process for this backup id" when the lock was actually on the server, not on the backup. I "fixed" those things, but would like a check-up just to make sure.
In b0f38be I fixed the tests that were broken due to moving validations and locks around. I had to create new tests in the server logic as well as change some existing ones and remove some code from the backup manager tests.
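A very rough sketch of that split follows; the helper names, the validation stub and the backup manager call are illustrative assumptions based on the description above, not the exact implementation:

# Sketch of the delete / perform_delete split described above. Helper names,
# the validation stub and the backup manager call are illustrative only.
class Server:
    def __init__(self, backup_manager):
        self.backup_manager = backup_manager

    def _is_deletable(self, backup):
        # Stand-in for the keep/redundancy validations that the PR moves from
        # the backup manager into the server.
        return True

    def delete(self, backup):
        # Validate up front, before any backup in the chain is touched, then
        # walk the tree (children before the parent) inside a server-level lock.
        if not self._is_deletable(backup):
            return False
        for node in backup.walk_backups_tree():
            if not self.perform_delete(node):
                return False  # stop on the first failure
        return True

    def perform_delete(self, backup):
        # Per-backup work: file permission checks and the backup-level lock
        # would go here; the actual removal is delegated to the backup manager.
        return self.backup_manager.delete_backup(backup)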
In b41d409 I created new tests for the children deletion logic. It basically tests that the whole chain is being passed for deletion and also that a child deletion removes its reference from its parent.
References: BAR-181