Big cpu/network load increase in HA when one of the nodes is down #50580
Comments
@antonio2368 maybe you could take a look?
I think @tavplubix could have better ideas about whether something changed recently. A similar issue was mentioned by another user who saw high ZK <-> CH network load, also with a 2-replica setup.
Hi @majedrze
Hi @alifirat!
After 1 day of monitoring, we are not so sure that it resolves the issue (still monitoring). |
Well, I can now confirm that by disabling the setting, the data transfer to ZK has been reduced a lot.
Related to #21338 |
Hi,
We set up a cluster with two ClickHouse nodes (23.3.1.2823, Ubuntu 20.04) and a three-node ZooKeeper ensemble.
We have a very serious performance issue with ClickHouse replicated tables. The short story is that whenever a node is lost (e.g. due to a maintenance shutdown), the remaining healthy node experiences a 10-20x increase in CPU usage and inbound network traffic.
While this is happening, there is a massive log-spam of:
```
<Information> default.table_m (ReplicatedMergeTreePartCheckThread): Found parts with the same min block and with the same max block as the missing part <some part> on replica <replica_name>. Hoping that it will eventually appear as a result of a merge. Parts: <huge parts list>
```
which we believe is the cause of the extreme ZooKeeper traffic (1-2 Gbps per ClickHouse node):
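For anyone reproducing this, the ZooKeeper round-trip volume can also be quantified from the ClickHouse side (a minimal sketch using the built-in counters in `system.events`; run it on the healthy node while the peer is down):

```sql
-- ZooKeeper-related counters since server start; a rapidly growing
-- ZooKeeperTransactions / ZooKeeperList count corroborates the traffic spike.
SELECT event, value
FROM system.events
WHERE event LIKE 'ZooKeeper%'
ORDER BY value DESC;
```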
This in turn is probably caused by:
Varying numbers of parts are stuck in the replication_queue while this is happening. Parts being stuck is of course understandable (a needed part may reside on the shut-down node), but num_tries is increasing very, very quickly (seemingly on each query). This is probably the cause of the high network load on ZooKeeper.
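The retry growth can be observed directly (a sketch against ClickHouse's `system.replication_queue` table; repeat the query a few seconds apart and compare the counters):

```sql
-- Entries with rapidly growing num_tries are the ones repeatedly
-- re-checked against ZooKeeper.
SELECT database, table, type, new_part_name, num_tries, last_exception
FROM system.replication_queue
ORDER BY num_tries DESC
LIMIT 10;
```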
There is also a huge, tenfold increase in metrics such as SelectedRows (which might explain the CPU load increase):
We have found that adding a third node and enabling insert_quorum greatly reduced the number of parts listed in system.replication_queue and brought SelectedRows down to negligible levels. This is of course only a temporary workaround until the underlying issue is fixed. Nevertheless, the excessive ZooKeeper traffic remains: if one of the ClickHouse nodes is shut down, each healthy ClickHouse node still receives close to 1 Gbps of traffic from ZooKeeper due to excessive ZooKeeper querying.
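For reference, the workaround amounts to something like the following (a sketch; the setting can also be placed in a settings profile rather than set per session):

```sql
-- Require each INSERT to be acknowledged by 2 replicas before returning.
-- With 3 replicas this still succeeds when one node is down.
SET insert_quorum = 2;
```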
Could we possibly get any insight into what is happening here and, if possible, whether there is any known workaround?
The exact schema is as follows:
The most frequent query that is run:
```sql
SELECT what, value, count FROM table_t WHERE what IN ('a', 'b')
```