
Support multi zookeeper clusters #15087

Closed · wants to merge 4 commits

Conversation

@fastio (Contributor) commented Sep 21, 2020

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

    1. Add configuration for multiple ZooKeeper clusters.

Now it is possible to create ReplicatedMergeTree tables that use different ZooKeeper clusters.
The auxiliary zookeepers configuration (#14155) is reused to support this feature.
We can create a table as follows:

CREATE TABLE t1 (a String) ENGINE = ReplicatedMergeTree('test:/tables/t1', '{replica}') ORDER BY a;

This means that the metadata of the table named t1 is stored in the ZooKeeper cluster named test, which is configured in auxiliary_zookeepers.

CREATE TABLE t2 (a String) ENGINE = ReplicatedMergeTree('/tables/t2', '{replica}') ORDER BY a;

This means that the metadata of the table named t2 is stored in the default ZooKeeper cluster, which is configured in zookeeper.

<auxiliary_zookeepers>
    <test>
        <node index="1">
            <host>172.19.0.11</host>
            <port>2181</port>
        </node>
    </test>
</auxiliary_zookeepers>

In fact, the implementation is compatible with the old configuration.

Detailed description / Documentation draft:

@robot-clickhouse added the pr-improvement label on Sep 21, 2020
@fastio (Contributor, Author) commented Sep 21, 2020

In a production environment, hundreds of machines hold thousands of ReplicatedMergeTree tables.
The single ZooKeeper cluster is seriously overloaded, even when high-performance machines are used.
In some scenarios, we want to switch from the current ZooKeeper cluster to another one smoothly.

@sundy-li (Contributor) commented Sep 22, 2020

@fastio Good feature. We are also suffering from this problem: there are nearly 20 million znodes in our cluster, and high QPS often leads to zxid overflows.

But

  1. How do we switch from the current ZooKeeper cluster to another one smoothly? We can't use zk copy to migrate the znodes. The only way I can think of is to copy all data from the old tables to new tables.

  2. Introducing more ZooKeeper clusters to ClickHouse may add more stability dependencies to the system.

  3. Maybe it would be a good way to create a new role in ClickHouse: ZooKeeper Federation, ZooKeeper Proxy, or embedded ZooKeeper in ClickHouse ...

@fastio (Contributor, Author) commented Sep 22, 2020

@sundy-li Thanks.

  1. How do we switch from the current ZooKeeper cluster to another one smoothly? We can't use zk copy to migrate the znodes. The only way I can think of is to copy all data from the old tables to new tables.

True, we can't copy the data of one ZooKeeper cluster to another. If multiple ZooKeeper clusters are supported at the ClickHouse cluster level, we can add a new ZooKeeper cluster and migrate the metadata of tables from the old ZooKeeper to the new one. When migrating the metadata of one table, the other tables are not affected.

  2. Introducing more ZooKeeper clusters to ClickHouse may add more stability dependencies to the system.

  3. Maybe it would be a good way to create a new role in ClickHouse: ZooKeeper Federation, ZooKeeper Proxy, or embedded ZooKeeper in ClickHouse ...

I agree with you; e.g., the Raft protocol could be used to implement leader election and data replication for ReplicatedMergeTree.

@qoega (Member) commented Oct 1, 2020

In a production environment, hundreds of machines hold thousands of ReplicatedMergeTree tables.
The single ZooKeeper cluster is seriously overloaded, even when high-performance machines are used.
In some scenarios, we want to switch from the current ZooKeeper cluster to another one smoothly.

I guess it would be very useful if somebody could write a documentation article with the proper steps for moving a table from one ZooKeeper to another.
I believe people start to use another ZK only when their installation is very large, not when they set up their first cluster.

@zhouyao1994 commented

@sundy-li Thanks.

  1. How do we switch from the current ZooKeeper cluster to another one smoothly? We can't use zk copy to migrate the znodes. The only way I can think of is to copy all data from the old tables to new tables.

True, we can't copy the data of one ZooKeeper cluster to another. If multiple ZooKeeper clusters are supported at the ClickHouse cluster level, we can add a new ZooKeeper cluster and migrate the metadata of tables from the old ZooKeeper to the new one. When migrating the metadata of one table, the other tables are not affected.

  2. Introducing more ZooKeeper clusters to ClickHouse may add more stability dependencies to the system.
  3. Maybe it would be a good way to create a new role in ClickHouse: ZooKeeper Federation, ZooKeeper Proxy, or embedded ZooKeeper in ClickHouse ...

I agree with you; e.g., the Raft protocol could be used to implement leader election and data replication for ReplicatedMergeTree.

Hi @sundy-li, in order to avoid copying data when changing the ZooKeeper cluster, is it possible to use the DETACH or ATTACH operation?

@fastio (Contributor, Author) commented Oct 26, 2020

In a production environment, hundreds of machines hold thousands of ReplicatedMergeTree tables.
The single ZooKeeper cluster is seriously overloaded, even when high-performance machines are used.
In some scenarios, we want to switch from the current ZooKeeper cluster to another one smoothly.

I guess it would be very useful if somebody could write a documentation article with the proper steps for moving a table from one ZooKeeper to another.
I believe people start to use another ZK only when their installation is very large, not when they set up their first cluster.

@qoega Thanks for your reply. Yes, you are right.
The ZooKeeper cluster is important to ClickHouse: we do not want ClickHouse to fall into read-only mode in the production environment. Don't put all your eggs into one basket.
To move a ReplicatedMergeTree table's metadata from one ZooKeeper to another, the steps are as follows (a SQL sketch is given after the list):

  1. Create a new ReplicatedMergeTree table, named X_new.
  2. Freeze the original table with ALTER TABLE X FREEZE.
  3. Move the data of X to {path_prefix}/X_new/detached.
  4. Attach the partitions to X_new.
  5. Rename the table X to X_drop.
  6. Rename the table X_new to X, and drop the table X_drop.
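
For illustration, here is a minimal SQL sketch of these steps, assuming the new table X_new is placed on the auxiliary cluster test from the configuration above; the table names, paths, and cluster name are placeholders, and step 3 happens on the filesystem outside of SQL:

-- 1. Create the new table on the auxiliary ZooKeeper cluster `test`
CREATE TABLE X_new (a String)
ENGINE = ReplicatedMergeTree('test:/tables/X_new', '{replica}')
ORDER BY a;

-- 2. Freeze the original table (creates hard links under shadow/)
ALTER TABLE X FREEZE;

-- 3. Move the frozen parts from {path_prefix}/shadow/... into {path_prefix}/X_new/detached/
--    on each replica (done outside of SQL)

-- 4. Attach the moved parts ('all' is the partition ID of an unpartitioned table;
--    attach your real partition IDs otherwise)
ALTER TABLE X_new ATTACH PARTITION ID 'all';

-- 5. and 6. Swap the tables and drop the old one
RENAME TABLE X TO X_drop;
RENAME TABLE X_new TO X;
DROP TABLE X_drop;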

@fastio (Contributor, Author) commented Oct 29, 2020

@Akazz Hi, could you review this PR? Thanks.

@fastio (Contributor, Author) commented Nov 5, 2020

Hi team, maybe we have another solution for ClickHouse to support multiple ZooKeeper clusters.
We can pass the ZooKeeper cluster information by adding extra parameters when creating a ReplicatedMergeTree table.
For instance:

CREATE TABLE test (a String) ENGINE = ReplicatedMergeTree('/clickhouse/table/test', '{replica}', '10.10.0.1:2181,10.10.0.2:2181,10.10.0.3:2181', ...) ORDER BY a

If the ZooKeeper endpoints are not set, we use the zookeeper configuration from config.xml; otherwise, the endpoints given in the engine arguments are used.

What do you think about this?

@amosbird (Collaborator) commented Nov 9, 2020

In a production environment, hundreds of machines hold thousands of ReplicatedMergeTree tables.
The single ZooKeeper cluster is seriously overloaded, even when high-performance machines are used.
In some scenarios, we want to switch from the current ZooKeeper cluster to another one smoothly.

I guess it would be very useful if somebody could write a documentation article with the proper steps for moving a table from one ZooKeeper to another.
I believe people start to use another ZK only when their installation is very large, not when they set up their first cluster.

It seems we can extend our current ALTER PARTITION primitives to support cross-ZooKeeper operations. We already support ALTER ... FETCH PARTITION from an auxiliary ZooKeeper (#14155). We can optimize the implementation to avoid copies, and we might also add ALTER SEND PARTITION/PART if necessary.

@fastio By the way, I feel that adding another configuration section, a.k.a. auxiliary_zookeepers, is more compatible. And we just need to create tables like:

CREATE TABLE test (a String) ENGINE=ReplicatedMergeTree('another_zk:/clickhouse/table/test', '{replica}') ORDER BY a
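
For reference, the existing #14155 primitive that this would extend looks roughly like the following; the table name, partition value, and ZooKeeper path are placeholders:

-- fetch a partition from the table whose metadata lives in the auxiliary cluster `another_zk`,
-- then attach the fetched parts locally
ALTER TABLE visits FETCH PARTITION 201902 FROM 'another_zk:/clickhouse/table/visits';
ALTER TABLE visits ATTACH PARTITION 201902;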

@Akazz (Contributor) commented Nov 10, 2020

@fastio
Your suggestion about extending the parameter list of ReplicatedMergeTree() with an immediate declaration of ZK endpoints in the SQL statement is quite interesting, although we would need an ALTER TABLE to update the ZK endpoints every time any of the ZK node addresses/ports change.

I would say that for now (as long as PR #14155 is in) it is better to integrate with @amosbird's auxiliary_zookeepers interface: zkutil::ZooKeeperPtr Context::getAuxiliaryZooKeeper(const String & name) const. At the same time, I see no problem with keeping the zookeeper_cluster merge tree setting as you suggest.
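
For illustration, if the zookeeper_cluster merge tree setting from this PR's diff (shown below) were kept, creating a table under that approach might look roughly like this sketch; the cluster name test is a placeholder:

CREATE TABLE t (a String)
ENGINE = ReplicatedMergeTree('/tables/t', '{replica}')
ORDER BY a
SETTINGS zookeeper_cluster = 'test';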

@Akazz added the st-waiting-for-fix label on Nov 10, 2020
@@ -112,7 +112,7 @@ struct Settings;
/** Obsolete settings. Kept for backward compatibility only. */ \
M(UInt64, min_relative_delay_to_yield_leadership, 120, "Obsolete setting, does nothing.", 0) \
M(UInt64, check_delay_period, 60, "Obsolete setting, does nothing.", 0) \

M(String, zookeeper_cluster, "default", "Name of zookeeper cluster.", 0) \
Contributor:

You might want to extract all references to the token "default" and expose it as a globally defined constant.

Contributor Author:

True, it's better to make it a globally defined constant. I'll fix it.

@@ -768,6 +771,7 @@ static StoragePtr create(const StorageFactory::Arguments & args)
std::move(storage_settings),
Contributor:

Why do you need a separate field for storing zookeeper_cluster when storage_settings is available?

Contributor Author:

It's not necessary; it should be removed.

@fastio (Contributor, Author) commented Nov 13, 2020

In a production environment, hundreds of machines hold thousands of ReplicatedMergeTree tables.
The single ZooKeeper cluster is seriously overloaded, even when high-performance machines are used.
In some scenarios, we want to switch from the current ZooKeeper cluster to another one smoothly.

I guess it would be very useful if somebody could write a documentation article with the proper steps for moving a table from one ZooKeeper to another.
I believe people start to use another ZK only when their installation is very large, not when they set up their first cluster.

It seems we can extend our current ALTER PARTITION primitives to support cross-ZooKeeper operations. We already support ALTER ... FETCH PARTITION from an auxiliary ZooKeeper (#14155). We can optimize the implementation to avoid copies, and we might also add ALTER SEND PARTITION/PART if necessary.

@fastio By the way, I feel that adding another configuration section, a.k.a. auxiliary_zookeepers, is more compatible. And we just need to create tables like:

CREATE TABLE test (a String) ENGINE=ReplicatedMergeTree('another_zk:/clickhouse/table/test', '{replica}') ORDER BY a

Thanks for your suggestion.

@fastio (Contributor, Author) commented Nov 13, 2020

@fastio
Your suggestion about extending the parameter list of ReplicatedMergeTree() with an immediate declaration of ZK endpoints in the SQL statement is quite interesting, although we would need an ALTER TABLE to update the ZK endpoints every time any of the ZK node addresses/ports change.

I would say that for now (as long as PR #14155 is in) it is better to integrate with @amosbird's auxiliary_zookeepers interface: zkutil::ZooKeeperPtr Context::getAuxiliaryZooKeeper(const String & name) const. At the same time, I see no problem with keeping the zookeeper_cluster merge tree setting as you suggest.

Thanks for your review. I'll fix the issues soon.

@Akazz removed the st-waiting-for-fix label on Nov 16, 2020
@Akazz (Contributor) commented Nov 16, 2020

Work to be continued in #17070

@Akazz closed this on Nov 16, 2020
Labels
pr-improvement Pull request with some product improvements

8 participants