
Support multi zookeeper clusters #15087

Closed · wants to merge 4 commits

Conversation

@fastio (Contributor) commented Sep 21, 2020

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

    1. Add configuration for multiple ZooKeeper clusters.

Now it is possible to create ReplicatedMergeTree tables that use different ZooKeeper clusters.
The auxiliary zookeepers configuration (#14155) is reused to support this feature.
We can create a table as follows:

CREATE TABLE t1 (a String) ENGINE = ReplicatedMergeTree('test:/tables/t1', '{replica}') ORDER BY a;

This means that the metadata of the table named t1 is stored in the ZooKeeper cluster named test, which is configured in auxiliary_zookeepers.

CREATE TABLE t2 (a String) ENGINE = ReplicatedMergeTree('/tables/t2', '{replica}') ORDER BY a;

This means that the metadata of the table named t2 is stored in the default ZooKeeper cluster, which is configured in zookeeper.

<auxiliary_zookeepers>
    <test>
        <node index="1">
            <host>172.19.0.11</host>
            <port>2181</port>
        </node>
    </test>
</auxiliary_zookeepers>

In fact, the implementation is compatible with the old configuration.

Detailed description / Documentation draft:

@robot-clickhouse added the pr-improvement label on Sep 21, 2020
@fastio (Contributor, Author) commented Sep 21, 2020

In a production environment, hundreds of machines hold thousands of ReplicatedMergeTree tables.
The single ZooKeeper cluster is seriously overloaded, even when high-performance machines are used.
In some scenarios, we want to switch from the current ZooKeeper cluster to another one smoothly.

@sundy-li (Contributor) commented Sep 22, 2020

@fastio Good feature. We are also suffering from this problem: there are nearly 20 million znodes in our cluster, and high QPS often leads to zxid overflows.

But

  1. How do we switch from the current ZooKeeper cluster to another one smoothly? We can't use zk copy to migrate the znodes. The only way I can think of is to copy all data from the old tables to new tables.

  2. Introducing more ZooKeeper clusters to ClickHouse may add more stability dependencies to the system.

  3. Maybe it would be a good way to create a new role in ClickHouse: ZooKeeper Federation, ZooKeeper Proxy, or embedded ZooKeeper in ClickHouse ...

@fastio (Contributor, Author) commented Sep 22, 2020

@sundy-li Thanks.

  1. How do we switch from the current ZooKeeper cluster to another one smoothly? We can't use zk copy to migrate the znodes. The only way I can think of is to copy all data from the old tables to new tables.

True, we can't copy the data of one ZooKeeper cluster to another. If multiple ZooKeeper clusters are supported at the ClickHouse cluster level, we can add a new ZooKeeper cluster and migrate the metadata of tables from the old ZooKeeper to the new one. When migrating the metadata of one table, the other tables are not affected.

  2. Introducing more ZooKeeper clusters to ClickHouse may add more stability dependencies to the system.

  3. Maybe it would be a good way to create a new role in ClickHouse: ZooKeeper Federation, ZooKeeper Proxy, or embedded ZooKeeper in ClickHouse ...

I agree with you; e.g., the Raft protocol could be used to implement leader election and data replication for ReplicatedMergeTree.

@qoega (Member) commented Oct 1, 2020

In a production environment, hundreds of machines hold thousands of ReplicatedMergeTree tables.
The single ZooKeeper cluster is seriously overloaded, even when high-performance machines are used.
In some scenarios, we want to switch from the current ZooKeeper cluster to another one smoothly.

I guess it would be very useful if somebody could write a documentation article with the proper steps for moving a table from one ZooKeeper to another.
I believe people start to use another ZK only when their installation is very large, not when they set up their first cluster.

@zhouyao1994 commented

@sundy-li Thanks.

  1. How do we switch from the current ZooKeeper cluster to another one smoothly? We can't use zk copy to migrate the znodes. The only way I can think of is to copy all data from the old tables to new tables.

True, we can't copy the data of one ZooKeeper cluster to another. If multiple ZooKeeper clusters are supported at the ClickHouse cluster level, we can add a new ZooKeeper cluster and migrate the metadata of tables from the old ZooKeeper to the new one. When migrating the metadata of one table, the other tables are not affected.

  2. Introducing more ZooKeeper clusters to ClickHouse may add more stability dependencies to the system.
  3. Maybe it would be a good way to create a new role in ClickHouse: ZooKeeper Federation, ZooKeeper Proxy, or embedded ZooKeeper in ClickHouse ...

I agree with you; e.g., the Raft protocol could be used to implement leader election and data replication for ReplicatedMergeTree.

Hi @sundy-li, in order to avoid copying data when changing the ZooKeeper cluster, is it possible to use the DETACH or ATTACH operation?

@fastio (Contributor, Author) commented Oct 26, 2020

In a production environment, hundreds of machines hold thousands of ReplicatedMergeTree tables.
The single ZooKeeper cluster is seriously overloaded, even when high-performance machines are used.
In some scenarios, we want to switch from the current ZooKeeper cluster to another one smoothly.

I guess it would be very useful if somebody could write a documentation article with the proper steps for moving a table from one ZooKeeper to another.
I believe people start to use another ZK only when their installation is very large, not when they set up their first cluster.

@qoega Thanks for your reply. Yes, you are right.
The ZooKeeper cluster is important to ClickHouse: we do not want ClickHouse to fall into read-only mode in the production environment. Don't put all your eggs into one basket.
To move a ReplicatedMergeTree table's metadata from one ZooKeeper to another, the steps are as follows (a SQL sketch is given after the list):

  1. Create a new ReplicatedMergeTree table, named X_new.
  2. Freeze the original table with ALTER TABLE X FREEZE.
  3. Move the data of X to {path_prefix}/X_new/detached.
  4. Attach the partitions to X_new.
  5. Rename the table X to X_drop.
  6. Rename the table X_new to X, and drop the table X_drop.
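
For illustration, here is a minimal SQL sketch of these steps, assuming the new table X_new is placed on the auxiliary cluster test from the configuration above; the table names, paths, and cluster name are placeholders, and step 3 happens on the filesystem outside of SQL:

-- 1. Create the new table on the auxiliary ZooKeeper cluster `test`
CREATE TABLE X_new (a String)
ENGINE = ReplicatedMergeTree('test:/tables/X_new', '{replica}')
ORDER BY a;

-- 2. Freeze the original table (creates hard links under shadow/)
ALTER TABLE X FREEZE;

-- 3. Move the frozen parts from {path_prefix}/shadow/... into {path_prefix}/X_new/detached/
--    on each replica (done outside of SQL)

-- 4. Attach the moved parts ('all' is the partition ID of an unpartitioned table;
--    attach your real partition IDs otherwise)
ALTER TABLE X_new ATTACH PARTITION ID 'all';

-- 5. and 6. Swap the tables and drop the old one
RENAME TABLE X TO X_drop;
RENAME TABLE X_new TO X;
DROP TABLE X_drop;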

@fastio (Contributor, Author) commented Oct 29, 2020

@Akazz Hi, could you review this PR? Thanks.

@fastio (Contributor, Author) commented Nov 5, 2020

Hi team, maybe we have another solution for ClickHouse to support multiple ZooKeeper clusters.
We can pass the ZooKeeper cluster information by adding extra parameters when creating a ReplicatedMergeTree table.
For instance:

CREATE TABLE test (a String) ENGINE = ReplicatedMergeTree('/clickhouse/table/test', '{replica}', '10.10.0.1:2181,10.10.0.2:2181,10.10.0.3:2181', ...) ORDER BY a

If the ZooKeeper endpoints are not set, we use the zookeeper configuration from config.xml; otherwise, the endpoints given in the engine arguments are used.

What do you think about this?

@amosbird (Collaborator) commented Nov 9, 2020

In a production environment, hundreds of machines hold thousands of ReplicatedMergeTree tables.
The single ZooKeeper cluster is seriously overloaded, even when high-performance machines are used.
In some scenarios, we want to switch from the current ZooKeeper cluster to another one smoothly.

I guess it would be very useful if somebody could write a documentation article with the proper steps for moving a table from one ZooKeeper to another.
I believe people start to use another ZK only when their installation is very large, not when they set up their first cluster.

It seems we can extend our current ALTER PARTITION primitives to support cross-ZooKeeper operations. We already support ALTER ... FETCH PARTITION from an auxiliary ZooKeeper (#14155). We can optimize the implementation to avoid copies, and we might also add ALTER SEND PARTITION/PART if necessary.

@fastio By the way, I feel that adding another configuration section, a.k.a. auxiliary_zookeepers, is more compatible. And we just need to create tables like:

CREATE TABLE test (a String) ENGINE=ReplicatedMergeTree('another_zk:/clickhouse/table/test', '{replica}') ORDER BY a
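
For reference, the existing #14155 primitive that this would extend looks roughly like the following; the table name, partition value, and ZooKeeper path are placeholders:

-- fetch a partition from the table whose metadata lives in the auxiliary cluster `another_zk`,
-- then attach the fetched parts locally
ALTER TABLE visits FETCH PARTITION 201902 FROM 'another_zk:/clickhouse/table/visits';
ALTER TABLE visits ATTACH PARTITION 201902;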

@Akazz (Contributor) commented Nov 10, 2020

@fastio
Your suggestion about extending the parameter list of ReplicatedMergeTree() with an immediate declaration of ZK endpoints in the SQL statement is quite interesting, although we would need an ALTER TABLE to update the ZK endpoints every time any of the ZK node addresses/ports change.

I would say that for now (as long as PR #14155 is in) it is better to integrate with @amosbird's auxiliary_zookeepers interface: zkutil::ZooKeeperPtr Context::getAuxiliaryZooKeeper(const String & name) const. At the same time, I see no problem with keeping the zookeeper_cluster merge tree setting as you suggest.
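
For illustration, if the zookeeper_cluster merge tree setting from this PR's diff (shown below) were kept, creating a table under that approach might look roughly like this sketch; the cluster name test is a placeholder:

CREATE TABLE t (a String)
ENGINE = ReplicatedMergeTree('/tables/t', '{replica}')
ORDER BY a
SETTINGS zookeeper_cluster = 'test';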

@Akazz added the st-waiting-for-fix label on Nov 10, 2020
@@ -112,7 +112,7 @@ struct Settings;
/** Obsolete settings. Kept for backward compatibility only. */ \
M(UInt64, min_relative_delay_to_yield_leadership, 120, "Obsolete setting, does nothing.", 0) \
M(UInt64, check_delay_period, 60, "Obsolete setting, does nothing.", 0) \

M(String, zookeeper_cluster, "default", "Name of zookeeper cluster.", 0) \
Contributor:

You might want to extract all references to the token "default" and expose it as a globally defined constant.

Contributor Author:

True, it's better to make it a globally defined constant. I'll fix it.

@@ -768,6 +771,7 @@ static StoragePtr create(const StorageFactory::Arguments & args)
std::move(storage_settings),
Contributor:

Why do you need a separate field for storing zookeeper_cluster when storage_settings is available?

Contributor Author:

It's not necessary; it should be removed.

@fastio (Contributor, Author) commented Nov 13, 2020

In a production environment, hundreds of machines hold thousands of ReplicatedMergeTree tables.
The single ZooKeeper cluster is seriously overloaded, even when high-performance machines are used.
In some scenarios, we want to switch from the current ZooKeeper cluster to another one smoothly.

I guess it would be very useful if somebody could write a documentation article with the proper steps for moving a table from one ZooKeeper to another.
I believe people start to use another ZK only when their installation is very large, not when they set up their first cluster.

It seems we can extend our current ALTER PARTITION primitives to support cross-ZooKeeper operations. We already support ALTER ... FETCH PARTITION from an auxiliary ZooKeeper (#14155). We can optimize the implementation to avoid copies, and we might also add ALTER SEND PARTITION/PART if necessary.

@fastio By the way, I feel that adding another configuration section, a.k.a. auxiliary_zookeepers, is more compatible. And we just need to create tables like:

CREATE TABLE test (a String) ENGINE=ReplicatedMergeTree('another_zk:/clickhouse/table/test', '{replica}') ORDER BY a

Thanks for your suggestion.

@fastio (Contributor, Author) commented Nov 13, 2020

@fastio
Your suggestion about extending the parameter list of ReplicatedMergeTree() with an immediate declaration of ZK endpoints in the SQL statement is quite interesting, although we would need an ALTER TABLE to update the ZK endpoints every time any of the ZK node addresses/ports change.

I would say that for now (as long as PR #14155 is in) it is better to integrate with @amosbird's auxiliary_zookeepers interface: zkutil::ZooKeeperPtr Context::getAuxiliaryZooKeeper(const String & name) const. At the same time, I see no problem with keeping the zookeeper_cluster merge tree setting as you suggest.

Thanks for your review. I'll fix the issues soon.

@Akazz removed the st-waiting-for-fix label on Nov 16, 2020
@Akazz (Contributor) commented Nov 16, 2020

Work to be continued in #17070

@Akazz closed this on Nov 16, 2020
Labels
pr-improvement Pull request with some product improvements

8 participants