Nodes with always_fetch_merge_parts=1 using significant CPU (25 core+) while waiting for remote merges to complete
#51921
Describe the situation
We have a 3-node ClickHouse cluster (running 23.5.2) replicated using ZooKeeper. One node is treated as the writer node (always_fetch_merge_parts=0), and the other two nodes are treated as reader nodes (always_fetch_merge_parts=1). We often run intensive DDL queries and merges, and we don't want customers to notice their impact, so we run all of those intensive queries on the writer node.
When there are active merges on the writer node, we notice that CPU spikes significantly on the reader nodes (where always_fetch_merge_parts=1). We observe the reader nodes using a constant ~25 cores of CPU until the remote merge is completed.
Upon further investigation, it appears that merges on the reader nodes are rapidly spinning while they wait for the writer node to finish, which is causing the substantial CPU usage. We believe this is the case for the following reasons:
- num_tries on system.replication_queue is increasing at a rate of around 40k retries per minute on the reader nodes.
- select * from system.merges on the reader nodes shows that all merges consistently have elapsed times of <1s (with new thread IDs each time).
- The reader nodes log messages such as: <Information> [database]::all_5897_19763_6_8432 (MergeFromLogEntryTask): Code: 234. DB::Exception: No active replica has part all_5897_19763_6_8432 or covering part (cannot execute queue-0000035437: MERGE_PARTS with virtual parts [all_5897_19763_6_8432]). (NO_REPLICA_HAS_PART)
- Running SYSTEM STOP MERGES; on the reader nodes immediately causes the node to drop from ~25 cores of CPU to near-zero CPU usage.
How to reproduce
Configure a ClickHouse cluster with two nodes and a ReplicatedMergeTree shared between them. On one node, set always_fetch_merge_parts=1. On the other node, ensure that a long-running merge begins on the replicated table. Observe that the node with always_fetch_merge_parts=1 has extreme CPU usage until the merge is complete.
Expected performance
I expect that the node with always_fetch_merge_parts=1 should see almost no CPU impact from remote merges, and should back off on retries if a merge is taking a long time instead of running tens of thousands of polls per minute.
Additional context
I believe this is related to #21338 (and #50580, #38944). I decided to open a new ticket instead of adding to the existing ones because the behavior we're seeing is different from #50580 (no nodes are down), and because the solution proposed in #21338 might not be the only way to solve this issue (the number of retries seems unnecessary, regardless of the traffic generated by each retry).
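To make the backoff requested under "Expected performance" concrete, here is a minimal sketch of capped exponential backoff. It is only an illustration of the idea, not how ClickHouse schedules replication queue entries. The function names and the values used (1 s base delay, doubling factor, 60 s cap) are hypothetical and do not correspond to any existing ClickHouse setting.

```python
def backoff_delays(base_seconds=1.0, factor=2.0, cap_seconds=60.0):
    """Yield an endless sequence of retry delays that grow by `factor`
    and never exceed `cap_seconds`. All defaults are assumed values."""
    delay = base_seconds
    while True:
        yield min(delay, cap_seconds)
        delay = min(delay * factor, cap_seconds)


def attempts_within(window_seconds, base_seconds=1.0, factor=2.0, cap_seconds=60.0):
    """Count how many attempts start within `window_seconds`,
    with the first attempt made at t=0."""
    attempts = 0
    elapsed = 0.0
    for delay in backoff_delays(base_seconds, factor, cap_seconds):
        if elapsed > window_seconds:
            break
        attempts += 1
        elapsed += delay  # wait before the next attempt
    return attempts


if __name__ == "__main__":
    # Hypothetical 10-minute remote merge on the writer node.
    print(attempts_within(600.0))
```

With these assumed values, a 10-minute remote merge would result in 15 attempts in total on a reader node, compared with roughly 400,000 at the ~40k retries per minute we observed.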