replica stuck in ReadOnly State #58126
Comments
I would like to understand what the problem is. As I mentioned above, I'm able to fix the read-only replica, but some time later, when the node is restarted, the issue occurs again. Also, is there any significant difference between "detach, drop, then attach, restore and sync replica" and your suggestion? @den-crane
You have found it: the reason is that the replica cannot detach the broken part and start up replication. It would be great to find the first log messages about that part, since those logs would explain which part was broken and why. (Also, see ClickHouse/src/Storages/MergeTree/DataPartStorageOnDiskBase.cpp, lines 85 to 95 in 21a17f8.)
So it's worth checking if
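One way to inspect what ClickHouse has moved aside is to query the system.detached_parts table. A minimal sketch; the table name and the LIKE filter are assumptions based on the error messages in this thread:

```sql
-- List detached parts whose directory name indicates they were set aside
-- as covered-by-broken, together with the recorded reason and disk.
SELECT database, table, name, reason, disk
FROM system.detached_parts
WHERE table = 'invoicing'
  AND name LIKE '%covered-by-broken%';
```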
I had this happen to me with 23.12.4.15 after an unexpected restart.
Same issue after a hard reset on 23.8.12.13 with a ReplicatedMergeTree engine. Here is the first mention of the example part in the error log:
Then the following error is logged a bit later (a bunch of other broken-part logs are skipped):
A bit later, an exception is thrown for the example part itself:
The checksums and columns of covered-by-broken_20242002_938800_940504_564_try9 and covered-by-broken_20242002_938800_940504_564 are the same. UPD: it looks like the issue resolved itself after a DB restart.
Hello,
I have a ClickHouse cluster (version 23.8.7) with multiple nodes. There is a table with 12 shards, and each shard has 3 replicas.
The table is fed with data through the following flow:
app -> events_entry(TABLE, ENGINE: Null) --> mv_exchange_stats_entry(MATERIALIZED VIEW, ENGINE = Distributed) --> invoicing(TABLE, ENGINE=ReplicatedSummingMergeTree)
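A rough DDL sketch of that pipeline; the column schemas, the cluster name my_cluster, and the ZooKeeper path are all illustrative assumptions, not taken from the issue:

```sql
-- Entry point: a Null table that stores nothing itself and only feeds
-- the materialized view below.
CREATE TABLE events_entry (ts DateTime, key UInt64, amount UInt64)
ENGINE = Null;

-- Final storage: a replicated, summing table on each shard.
CREATE TABLE invoicing (ts DateTime, key UInt64, amount UInt64)
ENGINE = ReplicatedSummingMergeTree(
    '/clickhouse/tables/{shard}/invoicing', '{replica}')
PARTITION BY toYYYYMM(ts)
ORDER BY (key, ts);

-- The materialized view forwards every insert into events_entry across
-- the cluster through a Distributed engine.
CREATE MATERIALIZED VIEW mv_exchange_stats_entry
ENGINE = Distributed('my_cluster', currentDatabase(), 'invoicing', rand())
AS SELECT ts, key, amount FROM events_entry;
```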
From time to time, when a node which stores a replica is restarted, the replica gets stuck in ReadOnly state with the following error (from the system.replicas table):
Code: 84. DB::Exception: Directory /var/lib/clickhouse/store/1e1/1e16af1b-dcd8-4d1a-9a86-a21f62290d62/detached/covered-by-broken_202311_820132_935806_43436_try9 already exists and is not empty. (DIRECTORY_ALREADY_EXISTS)
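For reference, a minimal query that surfaces this state from system.replicas (a sketch; the column selection is my choice, but is_readonly, zookeeper_exception, and last_queue_update_exception are standard columns of that table):

```sql
-- Find replicas currently stuck in read-only mode, along with the last
-- exceptions recorded for ZooKeeper and for replication-queue updates.
SELECT database, table, replica_name,
       zookeeper_exception, last_queue_update_exception
FROM system.replicas
WHERE is_readonly = 1;
```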
The ZooKeeper nodes are reachable the whole time, and the other 2 replicas of this shard do not report any issues.
In the logs I can see:
but there is no log entry (or I cannot find one) that explains why the replica cannot be activated.
To fix this replica, I need to detach the table, drop the replica, and then attach, restore, and sync it (see the sketch below).
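A minimal sketch of that recovery sequence, assuming the table is db.invoicing, the stuck replica is named replica_1, and the ZooKeeper path shown is a placeholder that must match the table's definition:

```sql
-- 1. On the stuck node: detach the table so its local metadata is released.
DETACH TABLE db.invoicing;

-- 2. Remove the stuck replica's metadata from ZooKeeper
--    (the path and replica name are placeholders).
SYSTEM DROP REPLICA 'replica_1' FROM ZKPATH '/clickhouse/tables/01/invoicing';

-- 3. On the stuck node: attach the table again; it comes back read-only
--    because its ZooKeeper metadata is gone.
ATTACH TABLE db.invoicing;

-- 4. Recreate the replica's ZooKeeper metadata from the local parts.
SYSTEM RESTORE REPLICA db.invoicing;

-- 5. Block until the replica has caught up with the other replicas.
SYSTEM SYNC REPLICA db.invoicing;
```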
Since 2 of the 3 replicas remain available, after a restart the node should be able to restore its replica from the others.
Any suggestions on what else I can check to troubleshoot this issue, or is there a chance that this is a bug?
Thank you in advance!