Fix cleanup in ON CLUSTER backups #83835
jkartseva left a comment
How was this change tested? Can the test plan be reproduced and added as a test?
return;
for (auto & [host, host_info] : state.hosts)
{
    if (!host_info.finished && (std::find(unfinished_hosts.begin(), unfinished_hosts.end(), host) == unfinished_hosts.end()))
Would host_info.finished == true be a valid state at this point? It looks like it wouldn't. It's probably worth adding a debug message.
You're right, normally that isn't expected. However, ZooKeeper sometimes performs an operation but reports a connection loss, so it's better to check whether the operation has actually completed before retrying it. I added a debug message and a comment here.
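The "check before retrying" idea from this reply can be illustrated with a minimal, self-contained sketch. `FakeStore` and `createWithRetries` are hypothetical names invented for this illustration; they are not ClickHouse's or ZooKeeper's API, and the PR's actual code may be structured differently.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical in-memory stand-in for a coordination store (an assumption for illustration).
struct FakeStore
{
    std::set<std::string> nodes;
    int failures_left = 0; // upcoming create() calls that take effect but report a lost connection

    // Always creates the node; returns false to simulate a reported connection loss.
    bool create(const std::string & path)
    {
        nodes.insert(path);
        if (failures_left > 0)
        {
            --failures_left;
            return false;
        }
        return true;
    }

    bool exists(const std::string & path) const { return nodes.count(path) != 0; }
};

// Tries to create `path` up to max_attempts times. Before each retry it checks whether
// an earlier attempt already took effect. Returns the number of create() calls made,
// or -1 if the node still does not exist after all attempts.
int createWithRetries(FakeStore & store, const std::string & path, int max_attempts)
{
    assert(max_attempts > 0);
    int calls = 0;
    for (int attempt = 0; attempt < max_attempts; ++attempt)
    {
        if (attempt > 0 && store.exists(path))
            return calls; // the earlier attempt succeeded despite the reported failure
        ++calls;
        if (store.create(path))
            return calls;
    }
    return store.exists(path) ? calls : -1;
}
```

With `failures_left = 1`, the first call creates the node but reports a failure, and the existence check at the start of the second attempt avoids repeating the operation.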
bool waitOtherHostsFinish(bool) const override { return true; }
bool finish(bool) override { return true; }
bool cleanup(bool) override { return true; }
void setError(std::exception_ptr, bool) override { is_error_set = true; }
Inform about which error was set through a debug message?
The function setError() is called only from BackupStarter::onException() and RestoreStarter::onException() after the error is logged, so the error has already been logged by the time setError() is called. I added a comment about that.
}
else if (!get_node_failed_check_for_non_existence().empty())
{
    show_error_before_next_attempt(fmt::format("Node {} exists", get_node_failed_check_for_non_existence()));
(Probably beyond the scope of this PR) The final exception
throw Exception(ErrorCodes::FAILED_TO_SYNC_BACKUP_OR_RESTORE,
"Couldn't create the 'finish' node for {} after {} attempts",
current_host_desc, max_attempts_after_bad_version);
lacks information about the last error.
Good idea! I added this information to the exception.
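One way to carry the last error into the final exception is sketched below. This is an illustration only: `runWithAttempts`, the callback shape, and the use of `std::runtime_error` are assumptions made for this sketch, and the message format is not necessarily what the PR ended up with.

```cpp
#include <cassert>
#include <functional>
#include <optional>
#include <stdexcept>
#include <string>

// Runs `attempt` up to max_attempts times. `attempt` returns std::nullopt on success,
// or a description of why that attempt failed (hypothetical callback shape).
void runWithAttempts(const std::function<std::optional<std::string>()> & attempt, int max_attempts)
{
    assert(max_attempts > 0);
    std::string last_error;
    for (int i = 0; i < max_attempts; ++i)
    {
        std::optional<std::string> error = attempt();
        if (!error)
            return; // success, nothing to report
        last_error = *error; // remember the most recent failure reason
    }
    // Include the last recorded failure so the final exception explains what went wrong.
    throw std::runtime_error(
        "Couldn't create the 'finish' node after " + std::to_string(max_attempts)
        + " attempts. Last error: " + last_error);
}
```

Keeping only the most recent reason keeps the message short; collecting every attempt's reason would be an alternative if the full history matters.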
if (backup && backup_is_corrupted && backups_worker.remove_backup_files_after_failure && backup_coordination
    && backup_coordination->isErrorSet() && backup_coordination->finished()
    && (!backup_coordination->isBackupQuerySentToOtherHosts() || backup_coordination->allHostsFinished()))
This condition is hard to read compared to the previous
bool should_remove_files_in_backup = backup && !is_internal_backup && backups_worker.remove_backup_files_after_failure;
Did I read it correctly that the corrupted files were never removed for internal backups?
Doesn't backup_coordination->allHostsFinished() imply backup_coordination->finished()?
I tried to improve readability, so bool should_remove_files_in_backup is back.
if (backup_coordination && backup_coordination->finished() &&
    (!backup_coordination->isBackupQuerySentToOtherHosts() || backup_coordination->allHostsFinished()))
Same here: doesn't backup_coordination->allHostsFinished() imply backup_coordination->finished()?
Yes, if backup_coordination->allHostsFinished() == true, then backup_coordination->finished() == true.
However, if backup_coordination->isBackupQuerySentToOtherHosts() == false, then we shouldn't check backup_coordination->allHostsFinished() at all.
I changed the code to improve readability.
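The logic described in this reply can be written as early returns, which makes it explicit that allHostsFinished() is consulted only when the query was sent to other hosts. The sketch below is a hypothetical illustration: `CoordinationState` and `canProceed` are invented names, the state is a plain snapshot rather than the real coordination interface, and what the condition gates in the actual code is not shown here.

```cpp
#include <cassert>

// Hypothetical snapshot of the coordination state (field names are assumptions
// loosely mirroring finished(), isBackupQuerySentToOtherHosts(), allHostsFinished()).
struct CoordinationState
{
    bool finished = false;
    bool query_sent_to_other_hosts = false;
    bool all_hosts_finished = false;
};

// Equivalent to: finished && (!query_sent_to_other_hosts || all_hosts_finished)
bool canProceed(const CoordinationState & state)
{
    // Invariant stated in the discussion: all hosts finished implies this host finished.
    assert(!state.all_hosts_finished || state.finished);

    if (!state.finished)
        return false; // the current host is not done yet
    if (!state.query_sent_to_other_hosts)
        return true; // no other hosts are involved, so there is nothing more to wait for
    return state.all_hosts_finished; // other hosts are involved: wait for all of them
}
```

For example, a state with `finished = true` and `query_sent_to_other_hosts = false` passes the check without `all_hosts_finished` ever being read.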
Will this PR fix #81968 by any chance?
CI failures are unrelated:
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Fix cleanup in ON CLUSTER backups. We need to clean up coordination nodes in ZooKeeper correctly after a backup or restore fails or is cancelled.
This PR is required to fix tests for ON CLUSTER backups and restores.