Optimize TTL merge, completely expired parts can be removed in time #42869
Conversation
```diff
-parts_to_merge = delete_ttl_selector.select(parts_ranges, max_total_size_to_merge);
+parts_to_merge = delete_ttl_selector.select(
+    parts_ranges,
+    data_settings->ttl_only_drop_parts ? data_settings->max_bytes_to_merge_at_max_space_in_pool : max_total_size_to_merge);
```
Is `max_bytes_to_merge_at_max_space_in_pool` really suitable here? It's the max size for all threads in the background pool; shouldn't we divide it between threads here?
Thank you very much for your reply. My idea is to let completely expired parts be deleted in a timely manner, unaffected by merge pressure and disk space, so I used the maximum value.
I don't quite understand what you mean by dividing it between threads. Could you give me a more detailed hint?
According to the docs, it relates to the number of threads: "Maximum in total size of parts to merge, when there are maximum free threads in background pool (or entries in replication queue)."
And in another place, we spread the value among the workers:
ClickHouse/src/Storages/MergeTree/MergeTreeDataMergerMutator.cpp
Lines 101 to 104 in 74a1de5

```cpp
max_size = static_cast<UInt64>(interpolateExponential(
    data_settings->max_bytes_to_merge_at_min_space_in_pool,
    data_settings->max_bytes_to_merge_at_max_space_in_pool,
    static_cast<double>(free_entries) / data_settings->number_of_free_entries_in_pool_to_lower_max_size_of_merge));
```
But if, as you said, it does not affect disk space, we can leave it as is.
By the way, can we pass another special value here, e.g., `0` or `UINT_MAX`?
LGTM
@tavplubix could you please also take a quick look, just to be sure?
Changes make sense, but
@tavplubix OK, updated. But from my current understanding, making the same change for ReplicatedMergeTree doesn't seem to be a problem; is there something wrong with it?
The test is not flaky, but it looks unrelated; I will try to rerun the Fast tests and investigate.
@vdimir The failed tests don't seem to be related to the changes.
OOM
Hi, when running master with this change alongside a server (pending upgrade) without it, we can see these errors.

Error in the replica running master:

server_shard: 1
version: 22.11.1.836
name: NOT_IMPLEMENTED
code: 48
quantity: 1
last_error_message: There was an error on [clickhouse-02:39000]: Code: 48. DB::Exception: Unknown MergeType 4. (NOT_IMPLEMENTED) (version 22.8.8.3 (official build))
traceback: DB::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, int, bool)
DB::DDLQueryStatusSource::generate()
DB::ISource::tryGenerate()
DB::ISource::work()
DB::ExecutionThreadContext::executeTask()
DB::PipelineExecutor::executeStepImpl(unsigned long, std::__1::atomic<bool>*)
DB::PipelineExecutor::executeImpl(unsigned long)
DB::PipelineExecutor::execute(unsigned long)
DB::CompletedPipelineExecutor::execute()
DB::executeQuery(DB::ReadBuffer&, DB::WriteBuffer&, bool, std::__1::shared_ptr<DB::Context>, std::__1::function<void (std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&)>, std::__1::optional<DB::FormatSettings> const&)
DB::HTTPHandler::processQuery(DB::HTTPServerRequest&, DB::HTMLForm&, DB::HTTPServerResponse&, DB::HTTPHandler::Output&, std::__1::optional<DB::CurrentThread::QueryScope>&)
DB::HTTPHandler::handleRequest(DB::HTTPServerRequest&, DB::HTTPServerResponse&)
DB::HTTPServerConnection::run()
Poco::Net::TCPServerConnection::start()
Poco::Net::TCPServerDispatcher::run()
Poco::PooledThread::run()
Poco::ThreadImpl::runnableEntry(void*)
clone

Many errors in the older replica (running 22.8/HEAD):

server_shard: 2
version: 22.8.8.3
name: NOT_IMPLEMENTED
code: 48
quantity: 160
last_error_message: Unknown MergeType 4
traceback: DB::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int, bool)
DB::Exception::Exception<unsigned long>(int, fmt::v8::basic_format_string<char, fmt::v8::type_identity<unsigned long>::type>, unsigned long&&)
DB::ReplicatedMergeTreeLogEntryData::readText(DB::ReadBuffer&)
DB::ReplicatedMergeTreeQueue::pullLogsToQueue(std::__1::shared_ptr<zkutil::ZooKeeper>, std::__1::function<void (Coordination::WatchResponse const&)>, DB::ReplicatedMergeTreeQueue::PullLogsReason)
DB::StorageReplicatedMergeTree::queueUpdatingTask()
DB::BackgroundSchedulePoolTaskInfo::execute()
DB::BackgroundSchedulePool::threadFunction()
ThreadPoolImpl<std::__1::thread>::worker(std::__1::__list_iterator<std::__1::thread, void*>)

What would be the proper way to handle this? And also, could you include a warning in the changelog about it?
Thanks for reporting this, I've just reverted these changes, because fortunately they were not released. There are two ways to handle this (the first one is probably better):
It makes sense to me to have a path for introducing new features, since it will be necessary in the future, but I guess this path is different depending on the change. In this case, option 2 seems like a good one, as you can turn it on in the settings once you have upgraded the cluster completely.

One of my greatest concerns is how to avoid breaking old servers when running with newer ones, mostly because otherwise it's impossible to do upgrades without strict human supervision, and because sometimes you find in production that something isn't working as before, so having the option to downgrade is always nice (not ideal, but it might save the night). This is why we run our internal tests mixing CH versions in a cluster and monitoring failures and system.errors, and why we happened to detect this change as incompatible. We only check a small part of all the CH features (we didn't detect the serialization change either), but I hope that, between that and further improvements to CH's CI itself, it will become easier to avoid these problems.
There are some general rules about backward compatibility in ClickHouse (just to clarify, the list is probably incomplete):
AFAIK, these principles are not documented anywhere and we don't have clear recommendations about upgrading in the documentation; it would be great to add some (cc: @DanRoscigno).

The problem with this improvement is that it may write a new "merge type" into zk even though the user did not enable any new features. That makes downgrade impossible, and it's a bug (it violates the third rule; the fifth and sixth are not applicable here). The typical path for introducing an incompatible feature is option 2, which I described in the comment above. Sometimes it is possible to write special code that makes the new version compatible with old ones without extra settings, and to remove that compatibility code after one year (we had a lot of such tricks in ReplicatedMergeTree for working with metadata in zk).

Unfortunately, our CI is still quite poor in terms of catching incompatibilities, and sometimes we do not notice that a change can be incompatible, so we really appreciate community help on that. Maybe we can try to introduce another check that is similar to functional tests with the Replicated database (it runs a cluster of 3 nodes) but runs one server of an old version. Or run a 2-node cluster in the BC check in Stress tests with the previous and current versions. The main problem is that it's not clear how to automatically distinguish compatibility issues from expected errors (even now the BC check finds tons of completely irrelevant errors that may normally happen).
This is gold @tavplubix. Thanks a lot for all this information. |
@tavplubix I'm very sorry for not being more thoughtful. I think the second approach is better; I will submit a patch to fix this problem.
The second approach (with setting) is simpler, but definitely not better |
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
When merge tasks are continuously busy and disk space is insufficient, completely expired parts could not be selected and dropped, which made the disk-space shortage even worse. Since dropping a fully expired part requires no additional disk space, such parts are now selected for removal regardless of merge pressure, ensuring that TTL is applied in time.