Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't report ZK node exists to system.errors when a block was created concurrently by a different replica #46820

Conversation

Algunenano
Copy link
Member

@Algunenano Algunenano commented Feb 24, 2023

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

  • Don't report ZK node exists to system.errors when a block was created concurrently by a different replica

Example (comes from a CI run):

version:            23.3.1.19
name:               KEEPER_EXCEPTION
code:               999
quantity:           1
last_error_message: Transaction failed (Node exists)
traceback:          DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool)
Coordination::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, Coordination::Error, int)
Coordination::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, Coordination::Error)
zkutil::KeeperMultiException::KeeperMultiException(Coordination::Error, std::__1::vector<std::__1::shared_ptr<Coordination::Request>, std::__1::allocator<std::__1::shared_ptr<Coordination::Request>>> const&, std::__1::vector<std::__1::shared_ptr<Coordination::Response>, std::__1::allocator<std::__1::shared_ptr<Coordination::Response>>> const&)
DB::ReplicatedMergeTreeSinkImpl<false>::commitPart(std::__1::shared_ptr<DB::ZooKeeperWithFaultInjection> const&, std::__1::shared_ptr<DB::IMergeTreeDataPart>&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, unsigned long, bool)
DB::ReplicatedMergeTreeSinkImpl<false>::finishDelayedChunk(std::__1::shared_ptr<DB::ZooKeeperWithFaultInjection> const&)
DB::ReplicatedMergeTreeSinkImpl<false>::consume(DB::Chunk)
DB::SinkToStorage::onConsume(DB::Chunk)
DB::ExceptionKeepingTransform::work()
DB::ExecutionThreadContext::executeTask()
DB::PipelineExecutor::executeStepImpl(unsigned long, std::__1::atomic<bool>*)
ThreadPoolImpl<std::__1::thread>::worker(std::__1::__list_iterator<std::__1::thread, void*>)

Logs:

2023.02.24 09:44:39.462647 [ 271 ] {8d6ebeff-b02e-4700-83bf-2b965d241acb} <Trace> d_test_177ab302399e42f5a6f72ac44d8a9990.t_c16d3a255a7f4e4d82422c9337779ae1 (c2bed6e8-68c6-4035-b2cb-ff4ab0281d36): Renaming temporary part tmp_insert_all_2_2_0 to all_1_1_0 with tid (1, 1, 00000000-0000-0000-0000-000000000000).
2023.02.24 09:44:39.463916 [ 271 ] {8d6ebeff-b02e-4700-83bf-2b965d241acb} <Information> d_test_177ab302399e42f5a6f72ac44d8a9990.t_c16d3a255a7f4e4d82422c9337779ae1 (c2bed6e8-68c6-4035-b2cb-ff4ab0281d36) (Replicated OutputStream): Block with ID all_6014107614689632743_16975405373040374011 already exists (it was just appeared). Renaming part all_1_1_0 back to tmp_insert_all_2_2_0. Will retry write.
2023.02.24 09:44:39.463942 [ 271 ] {8d6ebeff-b02e-4700-83bf-2b965d241acb} <Debug> d_test_177ab302399e42f5a6f72ac44d8a9990.t_c16d3a255a7f4e4d82422c9337779ae1 (c2bed6e8-68c6-4035-b2cb-ff4ab0281d36): Undoing transaction. Rollbacking parts state to temporary and removing from working set: all_1_1_0.

The problems comes from using zkutil::KeeperMultiException to extract the path of the first error. Creating a zkutil::KeeperMultiException creates a KeeperException, which is a DB::Exception and thus reports to system.errors. Since the exception was created only for that operation, I've replaced by the equivalent functions instead.

Closes #32997

@robot-clickhouse robot-clickhouse added the pr-improvement Pull request with some product improvements label Feb 24, 2023
@tavplubix tavplubix self-assigned this Feb 24, 2023
@den-crane
Copy link
Contributor

closes #32997

@Algunenano
Copy link
Member Author

closes #32997

Added to the PR description. Thanks!

I've been cleaning up these errors when I've found them. Most of them where fixed back in #38961 and #39016. There might be more, but at least those are the ones that have appeared in our CI.

@tavplubix
Copy link
Member

Stateless tests (ubsan) [2/2] - #46636 (comment)
Stress test (asan) - #37664
Stress test (debug) - #46834
Stress test (tsan) - #37664
Stress test (ubsan) - #45372

@tavplubix tavplubix merged commit ca1c793 into ClickHouse:master Feb 24, 2023
132 of 133 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pr-improvement Pull request with some product improvements
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Transaction failed (Node exists) with no logs
4 participants