New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Change ZooKeeper path for zero-copy locks for shared data #32061
Conversation
…DROP DETACHED PART
@@ -7515,4 +7560,193 @@ bool StorageReplicatedMergeTree::createEmptyPartInsteadOfLost(zkutil::ZooKeeperP | |||
return true; | |||
} | |||
|
|||
|
|||
void StorageReplicatedMergeTree::convertZeroCopySchema() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, we shouldn't implement compatibility inside clickhouse server for non-production ready features. Let's implement a seperate tool for it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. Separate convertaion tool looks easy to make, but without downtime with a low chance to data corruption during process, when some replicas are updated, some yet not, when part created after convertation process and destroyed before it complete. Looks like this is accepted risk.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move it in script utils/zero_copy/zero_copy_schema_converter.py
@@ -5203,6 +5292,75 @@ PartitionCommandsResultInfo MergeTreeData::unfreezeAll( | |||
return unfreezePartitionsByMatcher([] (const String &) { return true; }, backup_name, local_context); | |||
} | |||
|
|||
bool MergeTreeData::removeSharedDetachedPart(DiskPtr disk, const String & path, const String & part_name) | |||
{ | |||
bool keep_shared = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why we have it as a separate variable? It doesn't change anywhere. removeSharedRecursive
takes it by value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At least const bool keep_shared
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right. This variable is useless after refactoring.
Removed (and wholly method renamed to 'removeDetachedPart' and logic moved to 'StorageReplicatedMergeTree').
FreezeMetaData meta; | ||
if (meta.load(disk, path)) | ||
return removeSharedDetachedPart(disk, path, part_name, meta.zookeeper_name, meta.replica_name, | ||
"", meta.is_replicated); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why zookeeper_path
empty?????
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Freezed part with existing FreezeMetaData can be created only in updated version of ClickHouse. In this case it cannot have zero-copy record on old place.
zkutil::ZooKeeperPtr zookeeper; | ||
if (zookeeper_name == default_zookeeper_name) | ||
{ | ||
zookeeper = getContext()->getZooKeeper(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It turns our not perfect abstractions into trash:
grep 'ZooKeeper' src/Storages/MergeTree/MergeTreeData.{h,cpp} | wc -l
0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Logic moved to StorageReplicatedMergeTree.
return keep_shared; | ||
} | ||
|
||
bool MergeTreeData::removeSharedDetachedPart(DiskPtr disk, const String & path, const String & part_name, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this "shared" code is here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When we unfreeze freezed part for removed table then we have no info in ClickHouse is this part replicated or not. There are only files on disk and may be lock in ZooKeeper, but none MergeTreeData object for this non existing table.
I see no other way except call one universal method with support both variants inside.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May be this code should be in some kind of "MergeTreeGarbageCollector" class. Is it better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But now none logic to unfreeze entry backup, parts unfreezed only for existing tables, so possible to move code to StorageReplicatedMergeTree.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Logic moved to StorageReplicatedMergeTree.
/// Fetch part only if some replica has it on shared storage like S3 | ||
/// Overridden in StorageReplicatedMergeTree | ||
virtual bool tryToFetchIfShared(const IMergeTreeDataPart &, const DiskPtr &, const String &) { return false; } | ||
|
||
virtual String getZooKeeperName() const { return ""; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, MergeTreeData doesn't know anything about zookeeper.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved to StorageReplicatedMergeTree.
} | ||
else if (code != Coordination::Error::ZOK) | ||
{ | ||
throw zkutil::KeeperException(code, zookeeper_table_id_path); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it's better to retry if it's concurrent process?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tryCreate can return only one of ZOK, ZNONODE, ZNODEEXISTS, ZNOCHILDRENFOREPHEMERALS
ZNONODE and ZNOCHILDRENFOREPHEMERALS mean that something wrong with table root node. Concurrent creation can produce only ZNODEEXISTS, in this case one replica wins, other replicas get this error and should read id from zookeeper.
String getUniqueId() const; | ||
|
||
/// Return hardlink count for part. | ||
/// Required for keep data on remote FS when part has shadow copies. | ||
UInt32 getRefCount() const; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to clarify that it's not some general-purpose refcount. The name should be something like getNumberOfReferenecesForZeroCopyReplication
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed to getNumberOfRefereneces
localBackup(part->volume->getDisk(), src_part_path, backup_part_path); | ||
localBackup(disk, src_part_path, backup_part_path); | ||
|
||
freezeMetaData(disk, part, backup_part_path); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a comment -- "Store metadata for replicated tables and do nothing for ordinary merge tree"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
} | ||
|
||
|
||
void StorageReplicatedMergeTree::freezeMetaData(DiskPtr disk, DataPartPtr, String backup_part_path) const |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better to name it createAndStoreFreezeMetadata
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@@ -263,6 +274,10 @@ class StorageReplicatedMergeTree final : public shared_ptr_helper<StorageReplica | |||
|
|||
bool createEmptyPartInsteadOfLost(zkutil::ZooKeeperPtr zookeeper, const String & lost_part_name); | |||
|
|||
String getZooKeeperName() const { return zookeeper_name; } | |||
|
|||
virtual String getTableUniqID() const override; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't need virtual
here. Also need a comment about purpose.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
zkutil::ZooKeeperPtr zookeeper; | ||
if (detached_zookeeper_name == default_zookeeper_name) | ||
{ | ||
zookeeper = getContext()->getZooKeeper(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why we cannot just use table's zookeeper with getZooKeeper()
. ZooKeeper may change for some reason?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I remove it now. Later we will need this for unfreeze parts of removed tables.
fs::path checksums = fs::path(path) / "checksums.txt"; | ||
if (disk->exists(checksums)) | ||
{ | ||
auto ref_count = disk->getRefCount(checksums); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we have getRefCount
in IMergeTreeDataPart
? Why we cannot use it here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At this moment we haven't existing part object, only detached files on disk.
if (ref_count == 0) | ||
{ | ||
String id = disk->getUniqueId(checksums); | ||
keep_shared = !StorageReplicatedMergeTree::unlockSharedDataById(id, table_uuid, part_name, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How is it possible, when ref_count == 0
and unlockSharedDataById
still says keep_shared
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ref_count is local counter for freezed/detached/etc copies of metadata on this replica.
unlockSharedDataById remove lock for this replica in ZooKeeper and check if any locks of other replicas exist.
|
||
String zero_copy = fmt::format("zero_copy_{}", toString(disk_type)); | ||
|
||
String new_path = fs::path(settings.remote_fs_zero_copy_zookeeper_path.toString()) / zero_copy / table_uuid / part_name; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So now we store all zero copy data on a single path in ZK?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It splitted by table_id.
I can't use something like table zookeeper_path, because in this case after drop table and recreate with same zookeeper path will be collisions between old and new part locks.
@@ -2568,6 +2572,8 @@ bool MergeTreeData::renameTempPartAndReplace( | |||
out_covered_parts->emplace_back(std::move(covered_part)); | |||
} | |||
|
|||
part->cleanupOldName(old_part_name); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add something in changelog about it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, fix is important so let's merge it.
@Mergifyio update |
✅ Branch has been successfully updated |
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Change ZooKeeper path for zero-copy marks for shared data.
Fix for remove marks in ZooKeeper for renamed parts.
Detailed description / Documentation draft:
Before locks for shared data was in table node in zookeeper.
As result after removing part in table or entry table was no more info about shared use, and UNFREEZE/DROP DETACHED removes S3-objects required in other replicas (in other freezed backup for example).
Now this info keeps outside table node.
This change breaks backward compatibility.
Upgrade script converts records in ZooKeeper.
Fix for remove "lost" locks leaved after part renames.