Skip to content

Fix cleanup in ON CLUSTER backups#83835

Merged
vitlibar merged 3 commits intoClickHouse:masterfrom
vitlibar:fix-cleanup-in-on-cluster-backups
Jul 21, 2025
Merged

Fix cleanup in ON CLUSTER backups#83835
vitlibar merged 3 commits intoClickHouse:masterfrom
vitlibar:fix-cleanup-in-on-cluster-backups

Conversation

@vitlibar
Copy link
Copy Markdown
Member

@vitlibar vitlibar commented Jul 16, 2025

Changelog category (leave one):

  • Not for changelog (changelog entry is not required)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Fix cleanup in ON CLUSTER backups. We need to clean coordination nodes in ZooKeeper correctly after a backup or restore failed or got cancelled.

This PR is required to fix tests for ON CLUSTER backups and restores.

@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh Bot commented Jul 16, 2025

Workflow [PR], commit [4c6f324]

Summary:

job_name test_name status info comment
Stateless tests (amd_debug, AsyncInsert, s3 storage, parallel) failure
03352_concurrent_rename_alter FAIL

@clickhouse-gh clickhouse-gh Bot added the pr-not-for-changelog This PR should not be mentioned in the changelog label Jul 16, 2025
@vitlibar vitlibar changed the title Fix cleanup in on cluster backups Fix cleanup in ON CLUSTER backups Jul 16, 2025
@vitlibar vitlibar force-pushed the fix-cleanup-in-on-cluster-backups branch from 4e80808 to 087a4ba Compare July 16, 2025 12:45
Copy link
Copy Markdown
Member

@jkartseva jkartseva left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How was this change tested? Can the test plan be reproduced and added as a test?

return;
for (auto & [host, host_info] : state.hosts)
{
if (!host_info.finished && (std::find(unfinished_hosts.begin(), unfinished_hosts.end(), host) == unfinished_hosts.end()))
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would host_info. finished == true be a valid state at this point? It looks like it wouldn't. Probably it's worth adding a debug message.

Copy link
Copy Markdown
Member Author

@vitlibar vitlibar Jul 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, normally that isn't expected. However sometimes ZooKeeper performs an operation but reports a connection loss - so it's better to check if the operation is actually complete before retrying it again. I added a debug message and a comment here.

Comment thread src/Backups/BackupCoordinationLocal.h Outdated
bool waitOtherHostsFinish(bool) const override { return true; }
bool finish(bool) override { return true; }
bool cleanup(bool) override { return true; }
void setError(std::exception_ptr, bool) override { is_error_set = true; }
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inform about which error was set through a debug message?

Copy link
Copy Markdown
Member Author

@vitlibar vitlibar Jul 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Function setError() is called only from BackupStarter::onException() and RestoreStarter::onException() after logging the error, so the error has been already logged as the point when setError() is called. I added a comment about that.

}
else if (!get_node_failed_check_for_non_existence().empty())
{
show_error_before_next_attempt(fmt::format("Node {} exists", get_node_failed_check_for_non_existence()));
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Probably beyond the scope of this PR) The final exception

throw Exception(ErrorCodes::FAILED_TO_SYNC_BACKUP_OR_RESTORE,
                    "Couldn't create the 'finish' node for {} after {} attempts",
                    current_host_desc, max_attempts_after_bad_version);

lacks of information about the last error.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea! I added this information to the exception.

Comment thread src/Backups/BackupsWorker.cpp Outdated
Comment on lines +494 to +496
if (backup && backup_is_corrupted && backups_worker.remove_backup_files_after_failure && backup_coordination
&& backup_coordination->isErrorSet() && backup_coordination->finished()
&& (!backup_coordination->isBackupQuerySentToOtherHosts() || backup_coordination->allHostsFinished()))
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This condition is hard to read compared to the previous
bool should_remove_files_in_backup = backup && !is_internal_backup && backups_worker.remove_backup_files_after_failure;

Did I read it correctly that the corrupted files were never removed for internal backups?

Isn't backup_coordination->allHostsFinished() imply backup_coordination->finished() ?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to improve readability, so bool should_remove_files_in_backup is back.

Comment thread src/Backups/BackupsWorker.cpp Outdated
Comment on lines +504 to +505
if (backup_coordination && backup_coordination->finished() &&
(!backup_coordination->isBackupQuerySentToOtherHosts() || backup_coordination->allHostsFinished()))
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same, isn't backup_coordination->allHostsFinished() imply backup_coordination->finished() ?

Copy link
Copy Markdown
Member Author

@vitlibar vitlibar Jul 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, if backup_coordination->allHostsFinished() == true then backup_coordination->finished() == true.

However if backup_coordination->isBackupQuerySentToOtherHosts() == false then we shouldn't check backup_coordination->allHostsFinished() at all.

I changed the code to improve readability.

@SaltTan
Copy link
Copy Markdown
Contributor

SaltTan commented Jul 18, 2025

Will this PR fix #81968 by any chance?

@vitlibar
Copy link
Copy Markdown
Member Author

CI failures are unrelated:

@vitlibar vitlibar enabled auto-merge July 21, 2025 07:03
@vitlibar vitlibar added this pull request to the merge queue Jul 21, 2025
Merged via the queue into ClickHouse:master with commit 1f9ca3e Jul 21, 2025
122 of 124 checks passed
@vitlibar vitlibar deleted the fix-cleanup-in-on-cluster-backups branch July 21, 2025 07:31
@robot-ch-test-poll1 robot-ch-test-poll1 added the pr-synced-to-cloud The PR is synced to the cloud repo label Jul 21, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-not-for-changelog This PR should not be mentioned in the changelog pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants