Do not write retriable errors for Replicated mutate/merge into error log #55944
Fixes: e3f892f ("fix gtest with MemoryWriteBuffer, do not mute exception in ReplicatedMergeMutateTaskBase") Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
…r covering part") They should not appear in the error log, only at the Information level. Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
    if (!retryable_error)
        throw;
Unfortunately, it will not work because we have some logic built on top of exceptions... (so we have to rethrow it unconditionally, otherwise the replication queue may get stuck, as you can see in the failed tests)
Btw, I was thinking that maybe we have some backoff on retries of failed queue entries, it would help too
> Unfortunately, it will not work because we have some logic built on top of exceptions... (so we have to rethrow it unconditionally, otherwise the replication queue may get stuck, as you can see in the failed tests)
Yeah, now I see, though it is not about replication queue, but about abort in the WriteBuffer dtor, but maybe there are more problems?
Anyway, I rewrote the patch and introduced IExecutableTask::printExecutionException() to let the MergeTreeBackgroundExecutor decide. It looks a little bit hackish, maybe too much, but what do you think?
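A minimal sketch of the idea described in this comment, not the actual patch: the task still throws unconditionally, and the executor asks the task whether the exception should be logged as an error. `Task`, `RetriableFetchTask`, `BrokenTask` and `runAll` are hypothetical names for illustration; only `printExecutionException()` comes from the comment above, and its real signature in ClickHouse may differ.

```cpp
#include <cassert>
#include <exception>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for IExecutableTask.
struct Task
{
    virtual ~Task() = default;
    virtual void execute() = 0;
    // Whether the executor should log an exception thrown by execute() as an error.
    // Assumed default: log everything.
    virtual bool printExecutionException() const { return true; }
};

// A task whose failure is expected to be retried, so it opts out of error logging.
struct RetriableFetchTask : Task
{
    void execute() override { throw std::runtime_error("No active replica has part X or covering part"); }
    bool printExecutionException() const override { return false; }
};

// A task whose failure is a real error.
struct BrokenTask : Task
{
    void execute() override { throw std::logic_error("unexpected failure"); }
};

// Hypothetical executor loop: returns the messages that would reach the error log.
// The exception always propagates out of execute(); only the logging is conditional.
std::vector<std::string> runAll(const std::vector<Task *> & tasks)
{
    std::vector<std::string> error_log;
    for (Task * task : tasks)
    {
        assert(task != nullptr);
        try
        {
            task->execute();
        }
        catch (const std::exception & e)
        {
            if (task->printExecutionException())
                error_log.push_back(e.what());
        }
    }
    return error_log;
}
```

With this shape the decision about logging lives in the executor, while the task's own control flow on failure stays unchanged.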
… retries CI: https://s3.amazonaws.com/clickhouse-test-reports/55944/bd26f7096a4a3325f7a363c4be919700cdf10ca3/stateless_tests_flaky_check__asan_.html Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
The exception cannot simply be suppressed, since the sanity checks in the WriteBuffer dtor rely on std::uncaught_exceptions(): if the exception is suppressed and the buffer was not finalized, the dtor may abort (though only in debug/sanitizer builds). So instead, IExecutableTask::printExecutionException() has been introduced to distinguish when the exception should be printed and when not.
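To illustrate why suppression is a problem, here is a sketch under stated assumptions. `Buffer` is a hypothetical, simplified class and not ClickHouse's WriteBuffer; the real check may be implemented differently. The sketch records a flag where a debug build might abort. It relies only on the standard `std::uncaught_exceptions()` (C++17).

```cpp
#include <cassert>
#include <exception>
#include <stdexcept>

// Hypothetical simplified buffer with a destructor sanity check.
class Buffer
{
public:
    explicit Buffer(bool & flagged)
        : flagged_unfinalized(flagged), exceptions_at_ctor(std::uncaught_exceptions()) {}

    void finalize() { finalized = true; }

    ~Buffer()
    {
        // If an exception started propagating after construction, skipping finalize() is expected.
        const bool unwinding = std::uncaught_exceptions() > exceptions_at_ctor;
        if (!finalized && !unwinding)
            flagged_unfinalized = true; // a debug/sanitizer build might abort here instead
    }

private:
    bool & flagged_unfinalized;
    int exceptions_at_ctor;
    bool finalized = false;
};

// The exception propagates past the buffer: the destructor runs during stack unwinding.
bool failWithPropagation()
{
    bool flagged = false;
    try
    {
        Buffer buf(flagged);
        throw std::runtime_error("retriable");
    }
    catch (const std::exception &)
    {
    }
    return flagged;
}

// The exception is swallowed inside the scope that owns the buffer:
// the destructor sees an unfinalized buffer and no exception in flight.
bool failWithSuppression()
{
    bool flagged = false;
    {
        Buffer buf(flagged);
        try
        {
            throw std::runtime_error("retriable");
        }
        catch (const std::exception &)
        {
        }
    }
    return flagged;
}
```

In this sketch `failWithPropagation()` leaves the flag unset, while `failWithSuppression()` sets it, which corresponds to the abort case the commit message describes.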
1226828 to da408df
- separate uuid extraction
- add preliminary exit
- disable for ordinary database
- less number of attempts
- add optimize_throw_if_noop and missing sync replica, to fix:

    2023.10.24 15:18:35.925533 [ 640 ] {da7418c6-3d51-45bc-a0d0-4970bb0cdd51} <Debug> test_3kgjgry1.rmt1 (d18afb81-3a4b-4c02-b281-5575dce2f440): Cannot select parts for optimization: Entry for part all_1_1_0 hasn't been read from the replication log yet (in partition all)

- fix in case of ZooKeeper retries

    2023.10.24 11:50:24.792511 [ 1437 ] {c39fd15b-e2e6-4291-9912-39fda75ebcd5} <Trace> test_qxkzmigq.rmt1 (1c086c74-9ebe-495c-bbd2-87ab2d8ec43d): Renaming temporary part tmp_insert_all_1_1_0 to all_1_1_0 with tid (1, 1, 00000000-0000-0000-0000-000000000000).
    2023.10.24 11:50:24.797320 [ 1437 ] {c39fd15b-e2e6-4291-9912-39fda75ebcd5} <Trace> test_qxkzmigq.rmt1 (1c086c74-9ebe-495c-bbd2-87ab2d8ec43d) (Replicated OutputStream): ZooKeeperWithFaultInjection call FAILED: seed=17644626169032325693 func=tryMulti path=/clickhouse/zero_copy code=Session expired message=Fault injection before operation
    2023.10.24 11:50:24.797536 [ 1437 ] {c39fd15b-e2e6-4291-9912-39fda75ebcd5} <Debug> test_qxkzmigq.rmt1 (1c086c74-9ebe-495c-bbd2-87ab2d8ec43d): Undoing transaction. Rollbacking parts state to temporary and removing from working set: all_1_1_0.
    ...
    2023.10.24 11:50:25.000349 [ 1437 ] {c39fd15b-e2e6-4291-9912-39fda75ebcd5} <Trace> test_qxkzmigq.rmt1 (1c086c74-9ebe-495c-bbd2-87ab2d8ec43d): Renaming temporary part tmp_insert_all_1_1_0 to all_2_2_0 with tid (1, 1, 00000000-0000-0000-0000-000000000000).
    2023.10.24 11:50:25.007477 [ 760 ] {} <Trace> test_qxkzmigq.rmt1 (ReplicatedMergeTreeQueue): Insert entry queue-0000000000 to queue with type GET_PART with virtual parts [all_2_2_0]

CI:
- https://s3.amazonaws.com/clickhouse-test-reports/55944/da408df4a7296835897d7cef80d63f252df79b75/stateless_tests__tsan__s3_storage__[2_5].html
- https://s3.amazonaws.com/clickhouse-test-reports/55944/da408df4a7296835897d7cef80d63f252df79b75/stateless_tests_flaky_check__asan_.html
- https://s3.amazonaws.com/clickhouse-test-reports/55944/02fdd0513f7d413ce4ac39a70566855327ebfade/stateless_tests__aarch64_.html

Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
02fdd05 to 66c4a3b
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Do not write retriable errors for Replicated mutate/merge into error log (fixes tons of `No active replica has part X or covering part` in the error log, since they are not actually errors)

Cc: @tavplubix
Fixes: #50395 (cc @CheSema )
Fixes: e3f892f ("fix gtest with MemoryWriteBuffer, do not mute exception in ReplicatedMergeMutateTaskBase")
Note: marked as a bug fix, since it pollutes the logs