Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not write retriable errors for Replicated mutate/merge into error log #55944

Merged
merged 5 commits into from Oct 25, 2023
Merged
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
10 changes: 5 additions & 5 deletions src/Storages/MergeTree/ReplicatedMergeMutateTaskBase.cpp
Expand Up @@ -69,11 +69,11 @@ bool ReplicatedMergeMutateTaskBase::executeStep()
else
tryLogCurrentException(log, __PRETTY_FUNCTION__);

/** This exception will be written to the queue element, and it can be looked up using `system.replication_queue` table.
* The thread that performs this action will sleep a few seconds after the exception.
* See `queue.processEntry` function.
*/
throw;
/// This exception will be written to the queue element, and it can be looked up using `system.replication_queue` table.
/// The thread that performs this action will sleep a few seconds after the exception.
/// See `queue.processEntry` function.
if (!retryable_error)
throw;
Copy link
Member

@tavplubix tavplubix Oct 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, it will not work because we have some logic built on top of exceptions... (so we have to rethow it unconditionally, otherwise replication queue may get stuck, as you can see in the failed tests)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw, I was thinking that maybe we have some backoff on retries of failed queue entries, it would help too

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, it will not work because we have some logic built on top of exceptions... (so we have to rethow it unconditionally, otherwise replication queue may get stuck, as you can see in the failed tests)

Yeah, now I see, though it is not about replication queue, but about abort in the WriteBuffer dtor, but maybe there are more problems?

Anyway, I rewrote the patch, introduced IExecutableTask::printExecutionException() to let the MergeTreeBackgroundExecutor decide. Looks a little bit hackish I too much, maybe, but what do you think?

}
catch (...)
{
Expand Down
@@ -0,0 +1 @@
Information 1
32 changes: 32 additions & 0 deletions tests/queries/0_stateless/02903_rmt_retriable_merge_exception.sql
@@ -0,0 +1,32 @@
-- Test that retriable errors during merges/mutations
-- (i.e. "No active replica has part X or covering part")
-- does not appears as errors (level=Error), only as info message (level=Information).

drop table if exists rmt1;
drop table if exists rmt2;

create table rmt1 (key Int) engine=ReplicatedMergeTree('/clickhouse/{database}', '1') order by key settings always_fetch_merged_part=1;
create table rmt2 (key Int) engine=ReplicatedMergeTree('/clickhouse/{database}', '2') order by key settings always_fetch_merged_part=0;

insert into rmt1 values (1);
insert into rmt1 values (2);

system stop pulling replication log rmt2;
optimize table rmt1 final settings alter_sync=0;

select sleep(3) format Null;
system start pulling replication log rmt2;

system flush logs;
with
(select uuid from system.tables where database = currentDatabase() and table = 'rmt1') as uuid_
select
level, count() > 0
from system.text_log
where
event_date >= yesterday() and event_time >= now() - 60 and
(
(logger_name = 'MergeTreeBackgroundExecutor' and message like '%{' || uuid_::String || '::all_0_1_1}%No active replica has part all_0_1_1 or covering part%') or
(logger_name = uuid_::String || '::all_0_1_1 (MergeFromLogEntryTask)' and message like '%No active replica has part all_0_1_1 or covering part%')
)
group by level;