Big cpu/network load increase in HA when one of the nodes is down #50580

Open
majedrze opened this issue Jun 5, 2023 · 7 comments

Comments

@majedrze

majedrze commented Jun 5, 2023

Hi,

We set up a cluster with two ClickHouse nodes (23.3.1.2823, Ubuntu 20.04) and a three-node ZooKeeper ensemble.

We have a very serious performance issue with ClickHouse replicated tables. The short story is that whenever a node is lost (e.g. due to a maintenance shutdown), we see a 10-20x increase in CPU usage and inbound network traffic on the remaining healthy node.

While this is happening, there is a massive log-spam of:

<Information> default.table_m (ReplicatedMergeTreePartCheckThread): Found parts with the same min block and with the same max block as the missing part <some part> on replica <replica_name>. Hoping that it will eventually appear as a result of a merge. Parts: <huge parts list>

which we believe is the cause of the insane ZooKeeper traffic (1-2 Gbps per ClickHouse node):

SELECT
    event,
    formatReadableSize(value)
FROM system.events
WHERE event IN ('InsertedBytes', 'ZooKeeperBytesSent', 'ZooKeeperBytesReceived')

┌─event──────────────────┬─formatReadableSize(value)─┐
│ InsertedBytes          │ 99.92 GiB                 │
│ ZooKeeperBytesSent     │ 5.52 GiB                  │
│ ZooKeeperBytesReceived │ 539.42 GiB                │
└────────────────────────┴───────────────────────────┘
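
For comparison, the same counters can also be read from every replica at once. This is just a sketch, assuming the clusterAllReplicas table function is available in this version and using the cluster name 'cluster' from the schema below:

-- Read system.events from all replicas to compare ZooKeeper traffic
-- between the healthy node and the rest of the cluster.
SELECT
    hostName() AS host,
    event,
    formatReadableSize(value) AS readable_value
FROM clusterAllReplicas('cluster', system.events)
WHERE event IN ('InsertedBytes', 'ZooKeeperBytesSent', 'ZooKeeperBytesReceived')
ORDER BY host, event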

This in turn is probably caused by:

SELECT
    table,
    type,
    num_tries
FROM system.replication_queue

┌─table────────────┬─type────────┬─num_tries─┐
│ table_m          │ MERGE_PARTS │     12856 │
│ table_m          │ MERGE_PARTS │     12862 │
│ table_m          │ MERGE_PARTS │     12865 │
│ table_m          │ MERGE_PARTS │     12844 │
│ table_m          │ MERGE_PARTS │     12855 │
… truncated … 

A varying number of parts is stuck in the replication_queue while this is happening. Parts being stuck is of course understandable (a needed part may only exist on the node that was shut down), but num_tries is increasing very, very quickly (seemingly on each query). This is probably the cause of the high network load on ZooKeeper.
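
To see why the entries are stuck and how fast the retries grow, a query along these lines can be used (a sketch; the table filter and LIMIT are just for illustration):

-- Inspect stuck replication queue entries for the affected table,
-- including why they are postponed and the last error seen.
SELECT
    table,
    type,
    num_tries,
    num_postponed,
    postpone_reason,
    last_exception
FROM system.replication_queue
WHERE table = 'table_m'
ORDER BY num_tries DESC
LIMIT 10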

There is also a huge, tenfold increase in metrics such as SelectedRows (which might explain the CPU load increase):

[graph: SelectedRows over time]

We have found that adding a third node and enabling insert_quorum reduces the number of parts listed in system.replication_queue and the SelectedRows count to negligible levels. This is of course only a temporary workaround until the underlying issue is fixed. Nevertheless, the issue with excessive ZooKeeper traffic remains: if one of the ClickHouse nodes is shut down, each healthy ClickHouse node still receives close to 1 Gbps of traffic from ZooKeeper due to excessive ZooKeeper querying.
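
For reference, a minimal sketch of the workaround (column names are taken from the table_b definition below; the quorum value of 2 and the session-level SET are assumptions, not our exact configuration):

-- Require acknowledgement from two replicas before an INSERT into the
-- replicated table is considered successful.
SET insert_quorum = 2;

INSERT INTO table_m (when, value, count, what)
VALUES (now(), 1, 1, 'example');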

Could we get some insight into what is happening here, and is there any known workaround?

The exact schema is as follows:

CREATE TABLE IF NOT EXISTS table_b ON CLUSTER 'cluster' (
    when DateTime CODEC(DoubleDelta, ZSTD(4)),
    value Int64 CODEC(DoubleDelta, ZSTD(4)),
    count Int64 CODEC(DoubleDelta, ZSTD(4)),
    what String CODEC(ZSTD(4))
) ENGINE = Buffer('default', 'table_m', 16, 0, 1, 100000, 1000000, 1000000, 10000000);

CREATE TABLE IF NOT EXISTS table_m ON CLUSTER 'cluster' AS metrics
 ENGINE = ReplicatedSummingMergeTree('/clickhouse/tables/{database}/replicated/{table}','{replica}')
 ORDER BY (toStartOfMinute(when), what);

CREATE VIEW IF NOT EXISTS table_m_v ON CLUSTER 'cluster' AS
 SELECT toStartOfMinute(when) as when, value, count, what FROM table_m FINAL;

CREATE TABLE IF NOT EXISTS table_t ON CLUSTER 'cluster' AS metrics
 ENGINE = ReplicatedSummingMergeTree('/clickhouse/tables/{database}/replicated/{table}','{replica}')
 ORDER BY what;
CREATE MATERIALIZED VIEW IF NOT EXISTS table_t_mv ON CLUSTER 'cluster' TO table_t AS SELECT * FROM table_m;

CREATE VIEW IF NOT EXISTS table_t_v ON CLUSTER 'cluster' AS
 SELECT what, value, count FROM table_t FINAL;

The most frequent query that is run:
SELECT what, value, count FROM table_t WHERE what IN ('a', 'b')

@nickitat
Member

nickitat commented Jun 6, 2023

@antonio2368 maybe you could take a look?

@antonio2368
Member

I think @tavplubix could have better ideas about whether something changed recently. A similar issue was mentioned by another user who saw high ZK <-> CH network load, also on a 2-replica setup.

@alifirat

Hi @majedrze
did you have the setting always_fetch_merged_part enabled? It was enabled for us, and disabling it solved our issue.
cc @antonio2368
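
For reference, roughly how this can be checked and turned off (a sketch; it assumes the setting was applied as a table-level MergeTree setting, and uses table_m from this issue as an example):

-- Check whether the setting is currently enabled on this node.
SELECT name, value, changed
FROM system.merge_tree_settings
WHERE name = 'always_fetch_merged_part';

-- Hypothetical: disable it for a single replicated table.
ALTER TABLE table_m MODIFY SETTING always_fetch_merged_part = 0;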

@majedrze
Author

Hi @alifirat !
We never had always_fetch_merged_part enabled.

@alifirat

After 1 day of monitoring, we are not so sure that it resolves the issue (still monitoring).

@alifirat

Well, I can confirm now that after disabling the setting, the data transfer to ZK has been reduced a lot.

@tavplubix
Member

Related to #21338
