Big cpu/network load increase in HA when one of the nodes is down #50580

Open
majedrze opened this issue Jun 5, 2023 · 7 comments

Comments

@majedrze

majedrze commented Jun 5, 2023

Hi,

We set up a cluster with two ClickHouse nodes (23.3.1.2823, Ubuntu 20.04) and a three-node ZooKeeper ensemble.

We have a very serious performance issue with ClickHouse replicated tables. The short story is that whenever a node is lost (e.g. due to a maintenance shutdown), we see a 10-20x increase in CPU usage and inbound network traffic on the remaining healthy node.

While this is happening, there is a massive log-spam of:

<Information> default.table_m (ReplicatedMergeTreePartCheckThread): Found parts with the same min block and with the same max block as the missing part <some part> on replica <replica_name>. Hoping that it will eventually appear as a result of a merge. Parts: <huge parts list>

which we believe is the cause of the insane ZooKeeper traffic (1-2 Gbps per ClickHouse node):

SELECT
    event,
    formatReadableSize(value)
FROM system.events
WHERE event IN ('InsertedBytes', 'ZooKeeperBytesSent', 'ZooKeeperBytesReceived')

┌─event──────────────────┬─formatReadableSize(value)─┐
│ InsertedBytes          │ 99.92 GiB                 │
│ ZooKeeperBytesSent     │ 5.52 GiB                  │
│ ZooKeeperBytesReceived │ 539.42 GiB                │
└────────────────────────┴───────────────────────────┘
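
For comparison, the same counters can also be read from every replica at once. This is just a sketch, assuming the clusterAllReplicas table function is available in this version and using the cluster name 'cluster' from the schema below:

-- Read system.events from all replicas to compare ZooKeeper traffic
-- between the healthy node and the rest of the cluster.
SELECT
    hostName() AS host,
    event,
    formatReadableSize(value) AS readable_value
FROM clusterAllReplicas('cluster', system.events)
WHERE event IN ('InsertedBytes', 'ZooKeeperBytesSent', 'ZooKeeperBytesReceived')
ORDER BY host, event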

This in turn is probably caused by:

SELECT
    table,
    type,
    num_tries
FROM system.replication_queue

┌─table────────────┬─type────────┬─num_tries─┐
│ table_m          │ MERGE_PARTS │     12856 │
│ table_m          │ MERGE_PARTS │     12862 │
│ table_m          │ MERGE_PARTS │     12865 │
│ table_m          │ MERGE_PARTS │     12844 │
│ table_m          │ MERGE_PARTS │     12855 │
… truncated … 

A varying number of parts is stuck in the replication_queue while this is happening. Parts being stuck is of course understandable (a needed part may only exist on the node that was shut down), but num_tries is increasing very, very quickly (seemingly on each query). This is probably the cause of the high network load on ZooKeeper.
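
To see why the entries are stuck and how fast the retries grow, a query along these lines can be used (a sketch; the table filter and LIMIT are just for illustration):

-- Inspect stuck replication queue entries for the affected table,
-- including why they are postponed and the last error seen.
SELECT
    table,
    type,
    num_tries,
    num_postponed,
    postpone_reason,
    last_exception
FROM system.replication_queue
WHERE table = 'table_m'
ORDER BY num_tries DESC
LIMIT 10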

There is also a huge, tenfold increase in metrics such as SelectedRows (which might explain the CPU load increase):

[graph: SelectedRows over time]

We have found that adding a third node and enabling insert_quorum reduces the number of parts listed in system.replication_queue and the SelectedRows count to negligible levels. This is of course only a temporary workaround until the underlying issue is fixed. Nevertheless, the issue with excessive ZooKeeper traffic remains: if one of the ClickHouse nodes is shut down, each healthy ClickHouse node still receives close to 1 Gbps of traffic from ZooKeeper due to excessive ZooKeeper querying.
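
For reference, a minimal sketch of the workaround (column names are taken from the table_b definition below; the quorum value of 2 and the session-level SET are assumptions, not our exact configuration):

-- Require acknowledgement from two replicas before an INSERT into the
-- replicated table is considered successful.
SET insert_quorum = 2;

INSERT INTO table_m (when, value, count, what)
VALUES (now(), 1, 1, 'example');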

Could we get some insight into what is happening here, and is there any known workaround?

The exact schema is as follows:

CREATE TABLE IF NOT EXISTS table_b ON CLUSTER 'cluster' (
    when DateTime CODEC(DoubleDelta, ZSTD(4)),
    value Int64 CODEC(DoubleDelta, ZSTD(4)),
    count Int64 CODEC(DoubleDelta, ZSTD(4)),
    what String CODEC(ZSTD(4))
) ENGINE = Buffer('default', 'table_m', 16, 0, 1, 100000, 1000000, 1000000, 10000000);

CREATE TABLE IF NOT EXISTS table_m ON CLUSTER 'cluster' AS metrics
 ENGINE = ReplicatedSummingMergeTree('/clickhouse/tables/{database}/replicated/{table}','{replica}')
 ORDER BY (toStartOfMinute(when), what);

CREATE VIEW IF NOT EXISTS table_m_v ON CLUSTER 'cluster' AS
 SELECT toStartOfMinute(when) as when, value, count, what FROM table_m FINAL;

CREATE TABLE IF NOT EXISTS table_t ON CLUSTER 'cluster' AS metrics
 ENGINE = ReplicatedSummingMergeTree('/clickhouse/tables/{database}/replicated/{table}','{replica}')
 ORDER BY what;
CREATE MATERIALIZED VIEW IF NOT EXISTS table_t_mv ON CLUSTER 'cluster' TO table_t AS SELECT * FROM table_m;

CREATE VIEW IF NOT EXISTS table_t_v ON CLUSTER 'cluster' AS
 SELECT what, value, count FROM table_t FINAL;

The most frequent query that is run:
SELECT what, value, count FROM table_t WHERE what IN ('a', 'b')

@nickitat
Member

nickitat commented Jun 6, 2023

@antonio2368 maybe you could take a look?

@antonio2368
Member

I think @tavplubix could have better ideas about whether something changed recently. A similar issue was mentioned by another user who saw high ZK <-> CH network load, also on a 2-replica setup.

@alifirat

Hi @majedrze
did you have the setting always_fetch_merged_part enabled? It was enabled for us, and disabling it solved our issue.
cc @antonio2368
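
For reference, roughly how this can be checked and turned off (a sketch; it assumes the setting was applied as a table-level MergeTree setting, and uses table_m from this issue as an example):

-- Check whether the setting is currently enabled on this node.
SELECT name, value, changed
FROM system.merge_tree_settings
WHERE name = 'always_fetch_merged_part';

-- Hypothetical: disable it for a single replicated table.
ALTER TABLE table_m MODIFY SETTING always_fetch_merged_part = 0;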

@majedrze
Author

Hi @alifirat !
We never had always_fetch_merged_part enabled.

@alifirat

After 1 day of monitoring, we are not so sure that it resolves the issue (still monitoring).

@alifirat

Well, I can confirm now that after disabling the setting, the data transfer to ZK has been reduced a lot.

@tavplubix
Member

Related to #21338
