Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change ZooKeeper path for zero-copy locks for shared data #32061

Merged
merged 24 commits into from Jan 8, 2022

Conversation

ianton-ru
Copy link
Contributor

@ianton-ru ianton-ru commented Dec 1, 2021

Changelog category (leave one):

  • Backward Incompatible Change

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Change ZooKeeper path for zero-copy marks for shared data.
Fix for remove marks in ZooKeeper for renamed parts.

Detailed description / Documentation draft:
Before locks for shared data was in table node in zookeeper.
As result after removing part in table or entry table was no more info about shared use, and UNFREEZE/DROP DETACHED removes S3-objects required in other replicas (in other freezed backup for example).
Now this info keeps outside table node.
This change breaks backward compatibility.
Upgrade script converts records in ZooKeeper.
Fix for remove "lost" locks leaved after part renames.

@robot-clickhouse robot-clickhouse added the pr-backward-incompatible Pull request with backwards incompatible changes label Dec 1, 2021
@ianton-ru ianton-ru changed the title Mdb 15474 Change ZooKeeper path for zero-copy marks for shared data Dec 1, 2021
src/Storages/MergeTree/IMergeTreeDataPart.cpp Show resolved Hide resolved
@@ -7515,4 +7560,193 @@ bool StorageReplicatedMergeTree::createEmptyPartInsteadOfLost(zkutil::ZooKeeperP
return true;
}


void StorageReplicatedMergeTree::convertZeroCopySchema()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, we shouldn't implement compatibility inside clickhouse server for non-production ready features. Let's implement a seperate tool for it.

Copy link
Contributor Author

@ianton-ru ianton-ru Dec 14, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. Separate convertaion tool looks easy to make, but without downtime with a low chance to data corruption during process, when some replicas are updated, some yet not, when part created after convertation process and destroyed before it complete. Looks like this is accepted risk.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move it in script utils/zero_copy/zero_copy_schema_converter.py

@@ -5203,6 +5292,75 @@ PartitionCommandsResultInfo MergeTreeData::unfreezeAll(
return unfreezePartitionsByMatcher([] (const String &) { return true; }, backup_name, local_context);
}

bool MergeTreeData::removeSharedDetachedPart(DiskPtr disk, const String & path, const String & part_name)
{
bool keep_shared = false;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why we have it as a separate variable? It doesn't change anywhere. removeSharedRecursive takes it by value.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At least const bool keep_shared

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. This variable is useless after refactoring.
Removed (and wholly method renamed to 'removeDetachedPart' and logic moved to 'StorageReplicatedMergeTree').

FreezeMetaData meta;
if (meta.load(disk, path))
return removeSharedDetachedPart(disk, path, part_name, meta.zookeeper_name, meta.replica_name,
"", meta.is_replicated);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why zookeeper_path empty?????

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Freezed part with existing FreezeMetaData can be created only in updated version of ClickHouse. In this case it cannot have zero-copy record on old place.

zkutil::ZooKeeperPtr zookeeper;
if (zookeeper_name == default_zookeeper_name)
{
zookeeper = getContext()->getZooKeeper();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It turns our not perfect abstractions into trash:

grep 'ZooKeeper' src/Storages/MergeTree/MergeTreeData.{h,cpp} | wc -l
0

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Logic moved to StorageReplicatedMergeTree.

return keep_shared;
}

bool MergeTreeData::removeSharedDetachedPart(DiskPtr disk, const String & path, const String & part_name,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this "shared" code is here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we unfreeze freezed part for removed table then we have no info in ClickHouse is this part replicated or not. There are only files on disk and may be lock in ZooKeeper, but none MergeTreeData object for this non existing table.
I see no other way except call one universal method with support both variants inside.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May be this code should be in some kind of "MergeTreeGarbageCollector" class. Is it better?

Copy link
Contributor Author

@ianton-ru ianton-ru Dec 21, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But now none logic to unfreeze entry backup, parts unfreezed only for existing tables, so possible to move code to StorageReplicatedMergeTree.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Logic moved to StorageReplicatedMergeTree.

src/Storages/MergeTree/MergeTreeData.h Show resolved Hide resolved
/// Fetch part only if some replica has it on shared storage like S3
/// Overridden in StorageReplicatedMergeTree
virtual bool tryToFetchIfShared(const IMergeTreeDataPart &, const DiskPtr &, const String &) { return false; }

virtual String getZooKeeperName() const { return ""; }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, MergeTreeData doesn't know anything about zookeeper.

Copy link
Contributor Author

@ianton-ru ianton-ru Dec 21, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to StorageReplicatedMergeTree.

@alesapin alesapin self-assigned this Dec 22, 2021
@alesapin alesapin changed the title Change ZooKeeper path for zero-copy marks for shared data Change ZooKeeper path for zero-copy locks for shared data Dec 24, 2021
}
else if (code != Coordination::Error::ZOK)
{
throw zkutil::KeeperException(code, zookeeper_table_id_path);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it's better to retry if it's concurrent process?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tryCreate can return only one of ZOK, ZNONODE, ZNODEEXISTS, ZNOCHILDRENFOREPHEMERALS
ZNONODE and ZNOCHILDRENFOREPHEMERALS mean that something wrong with table root node. Concurrent creation can produce only ZNODEEXISTS, in this case one replica wins, other replicas get this error and should read id from zookeeper.

String getUniqueId() const;

/// Return hardlink count for part.
/// Required for keep data on remote FS when part has shadow copies.
UInt32 getRefCount() const;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to clarify that it's not some general-purpose refcount. The name should be something like getNumberOfReferenecesForZeroCopyReplication.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed to getNumberOfRefereneces

localBackup(part->volume->getDisk(), src_part_path, backup_part_path);
localBackup(disk, src_part_path, backup_part_path);

freezeMetaData(disk, part, backup_part_path);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a comment -- "Store metadata for replicated tables and do nothing for ordinary merge tree"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

}


void StorageReplicatedMergeTree::freezeMetaData(DiskPtr disk, DataPartPtr, String backup_part_path) const
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to name it createAndStoreFreezeMetadata.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -263,6 +274,10 @@ class StorageReplicatedMergeTree final : public shared_ptr_helper<StorageReplica

bool createEmptyPartInsteadOfLost(zkutil::ZooKeeperPtr zookeeper, const String & lost_part_name);

String getZooKeeperName() const { return zookeeper_name; }

virtual String getTableUniqID() const override;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't need virtual here. Also need a comment about purpose.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

zkutil::ZooKeeperPtr zookeeper;
if (detached_zookeeper_name == default_zookeeper_name)
{
zookeeper = getContext()->getZooKeeper();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why we cannot just use table's zookeeper with getZooKeeper(). ZooKeeper may change for some reason?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I remove it now. Later we will need this for unfreeze parts of removed tables.

fs::path checksums = fs::path(path) / "checksums.txt";
if (disk->exists(checksums))
{
auto ref_count = disk->getRefCount(checksums);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we have getRefCount in IMergeTreeDataPart? Why we cannot use it here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At this moment we haven't existing part object, only detached files on disk.

if (ref_count == 0)
{
String id = disk->getUniqueId(checksums);
keep_shared = !StorageReplicatedMergeTree::unlockSharedDataById(id, table_uuid, part_name,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is it possible, when ref_count == 0 and unlockSharedDataById still says keep_shared?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ref_count is local counter for freezed/detached/etc copies of metadata on this replica.
unlockSharedDataById remove lock for this replica in ZooKeeper and check if any locks of other replicas exist.


String zero_copy = fmt::format("zero_copy_{}", toString(disk_type));

String new_path = fs::path(settings.remote_fs_zero_copy_zookeeper_path.toString()) / zero_copy / table_uuid / part_name;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So now we store all zero copy data on a single path in ZK?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It splitted by table_id.
I can't use something like table zookeeper_path, because in this case after drop table and recreate with same zookeeper path will be collisions between old and new part locks.

@@ -2568,6 +2572,8 @@ bool MergeTreeData::renameTempPartAndReplace(
out_covered_parts->emplace_back(std::move(covered_part));
}

part->cleanupOldName(old_part_name);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add something in changelog about it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Member

@alesapin alesapin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, fix is important so let's merge it.

@alesapin
Copy link
Member

alesapin commented Jan 7, 2022

@Mergifyio update

@mergify
Copy link
Contributor

mergify bot commented Jan 7, 2022

update

✅ Branch has been successfully updated

@alesapin alesapin merged commit 733ed7c into ClickHouse:master Jan 8, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pr-backward-incompatible Pull request with backwards incompatible changes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants