
Optimize TTL merge, completely expired parts can be removed in time #42869

Merged: 9 commits into ClickHouse:master, Nov 9, 2022

Conversation

zhongyuankai
Contributor

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

When the merge tasks are continuously busy and disk space is insufficient, completely expired parts cannot be selected and dropped, which makes the disk space shortage worse.
My idea is that when an entire part has expired, no additional disk space needs to be reserved, so the TTL drop can proceed normally.

Information about CI checks: https://clickhouse.com/docs/en/development/continuous-integration/

@zhongyuankai
Contributor Author

(screenshot attached)

@robot-ch-test-poll robot-ch-test-poll added the pr-improvement Pull request with some product improvements label Nov 1, 2022
```diff
- parts_to_merge = delete_ttl_selector.select(parts_ranges, max_total_size_to_merge);
+ parts_to_merge = delete_ttl_selector.select(
+     parts_ranges,
+     data_settings->ttl_only_drop_parts ? data_settings->max_bytes_to_merge_at_max_space_in_pool : max_total_size_to_merge);
```
Member


Is max_bytes_to_merge_at_max_space_in_pool really suitable here? It's the maximum size for all threads in the background pool; shouldn't we divide it between threads here?

Contributor Author


Thank you very much for your reply. My idea is to allow completely expired parts to be deleted promptly, unaffected by merge pressure and disk space, so I used the maximum value.
I don't quite understand what you mean by dividing between threads. Could you give me a more detailed hint?

Member


According to the docs, it relates to the number of threads: "Maximum in total size of parts to merge, when there are maximum free threads in background pool (or entries in replication queue)."

And in another place, we spread the value among the workers.

```cpp
max_size = static_cast<UInt64>(interpolateExponential(
    data_settings->max_bytes_to_merge_at_min_space_in_pool,
    data_settings->max_bytes_to_merge_at_max_space_in_pool,
    static_cast<double>(free_entries) / data_settings->number_of_free_entries_in_pool_to_lower_max_size_of_merge));
```

But if, as you said, it does not affect disk space, we can leave it as is.

By the way, can we pass another special value here, e.g., 0 or uint max?

@vdimir vdimir self-assigned this Nov 2, 2022
@vdimir vdimir added the can be tested Allows running workflows for external contributors label Nov 2, 2022
Member

@vdimir vdimir left a comment


LGTM

@tavplubix could you please also take a quick look, just to be sure?

@tavplubix
Member

Changes make sense, but ttl_only_drop_parts is just a setting and we can drop whole parts by TTL even if it's set to false. It would be better to ignore estimated space for merge if we are going to drop a whole part, not only when the setting is true. See TTLDeleteMergeSelector. Also, do we need similar changes for ReplicatedMergeTree?

@zhongyuankai
Contributor Author

> Changes make sense, but ttl_only_drop_parts is just a setting and we can drop whole parts by TTL even if it's set to false. It would be better to ignore estimated space for merge if we are going to drop a whole part, not only when the setting is true. See TTLDeleteMergeSelector. Also, do we need similar changes for ReplicatedMergeTree?

@tavplubix OK, updated. From my current understanding, making the same change in ReplicatedMergeTree doesn't seem to be a problem; is there something I'm missing?

@vdimir
Member

vdimir commented Nov 4, 2022

Fast test — fail: 1, passed: 3801, skipped: 821

https://s3.amazonaws.com/clickhouse-test-reports/42869/1345f7e73da0bfbd4cc9e1c55458bcf10164c6df/fast_test.html

The test is not flaky, but it also doesn't look related; I'll try to rerun the fast tests and investigate.

```
02102_row_binary_with_names_and_types | FAIL | 1.84
2022-11-04 16:54:39 Code: 75. DB::ErrnoException: Cannot write to file (fd = 1), errno: 32, strerror: Broken pipe: (in query: SELECT toUInt32(1) AS x, 'text' as z, toDate('2020-01-01') AS y FORMAT RowBinaryWithNamesAndTypes). (CANNOT_WRITE_TO_FILE_DESCRIPTOR)
```

@zhongyuankai
Contributor Author

@vdimir The failed tests don't seem to be related to the changes.

@vdimir
Member

vdimir commented Nov 9, 2022

Stress test (debug) — Backward compatibility check: Error message in clickhouse-server.log (see bc_check_error_messages.txt)

Stress test (tsan) — Backward compatibility check: Error message in clickhouse-server.log (see bc_check_error_messages.txt)

#41369

Stress test (msan) — OOM killer (or signal 9) in clickhouse-server.log

@vdimir vdimir merged commit 82bf992 into ClickHouse:master Nov 9, 2022
@Algunenano
Member

Hi, when running master with this change alongside a server without it (pending upgrade), we see these errors:

Error in the replica running master:

```
server_shard:       1
version:            22.11.1.836
name:               NOT_IMPLEMENTED
code:               48
quantity:           1
last_error_message: There was an error on [clickhouse-02:39000]: Code: 48. DB::Exception: Unknown MergeType 4. (NOT_IMPLEMENTED) (version 22.8.8.3 (official build))
traceback:          DB::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, int, bool)
DB::DDLQueryStatusSource::generate()
DB::ISource::tryGenerate()
DB::ISource::work()
DB::ExecutionThreadContext::executeTask()
DB::PipelineExecutor::executeStepImpl(unsigned long, std::__1::atomic<bool>*)
DB::PipelineExecutor::executeImpl(unsigned long)
DB::PipelineExecutor::execute(unsigned long)
DB::CompletedPipelineExecutor::execute()
DB::executeQuery(DB::ReadBuffer&, DB::WriteBuffer&, bool, std::__1::shared_ptr<DB::Context>, std::__1::function<void (std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&)>, std::__1::optional<DB::FormatSettings> const&)
DB::HTTPHandler::processQuery(DB::HTTPServerRequest&, DB::HTMLForm&, DB::HTTPServerResponse&, DB::HTTPHandler::Output&, std::__1::optional<DB::CurrentThread::QueryScope>&)
DB::HTTPHandler::handleRequest(DB::HTTPServerRequest&, DB::HTTPServerResponse&)
DB::HTTPServerConnection::run()
Poco::Net::TCPServerConnection::start()
Poco::Net::TCPServerDispatcher::run()
Poco::PooledThread::run()
Poco::ThreadImpl::runnableEntry(void*)
clone
```

Many errors in the older replica (running 22.8/HEAD):

```
server_shard:       2
version:            22.8.8.3
name:               NOT_IMPLEMENTED
code:               48
quantity:           160
last_error_message: Unknown MergeType 4
traceback:          DB::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int, bool)
DB::Exception::Exception<unsigned long>(int, fmt::v8::basic_format_string<char, fmt::v8::type_identity<unsigned long>::type>, unsigned long&&)
DB::ReplicatedMergeTreeLogEntryData::readText(DB::ReadBuffer&)
DB::ReplicatedMergeTreeQueue::pullLogsToQueue(std::__1::shared_ptr<zkutil::ZooKeeper>, std::__1::function<void (Coordination::WatchResponse const&)>, DB::ReplicatedMergeTreeQueue::PullLogsReason)
DB::StorageReplicatedMergeTree::queueUpdatingTask()
DB::BackgroundSchedulePoolTaskInfo::execute()
DB::BackgroundSchedulePool::threadFunction()
ThreadPoolImpl<std::__1::thread>::worker(std::__1::__list_iterator<std::__1::thread, void*>)
```

What would be the proper way to handle this? Also, could you include a warning in the changelog about it?

@tavplubix
Member

Thanks for reporting this. I've just reverted these changes because, fortunately, they were not released yet. There are two ways to handle this (the first is probably better):

  • Avoid introducing new merge type, just ignore disk space requirements for existing TTLDelete type if we are going to drop whole part.
  • Introduce a merge tree setting that enables new behavior, disable it by default to avoid creating log entries with new merge type, enable it by default after one year with a note about backward incompatible change in the changelog.

@Algunenano
Member

It makes sense to me to have a path for introducing new features, since that will be necessary in the future, but I guess the path differs depending on the change. In this case, option 2 seems good, as you can turn it on in the settings once you have upgraded the cluster completely.

One of my greatest concerns is how to avoid breaking old servers when running alongside newer ones. Otherwise it's impossible to do upgrades without strict human supervision, and sometimes you find in production that something isn't working as before, so having the option to downgrade is always nice (not ideal, but it might save the night). This is why we run our internal tests mixing CH versions in a cluster, monitoring failures and system.errors, and how we happened to detect this change as incompatible. We only check a small part of all CH features (we didn't detect the serialization change either), but I hope that, together with further improvements to CH's CI itself, it will become easier to avoid these problems.

@tavplubix
Member

There are some general rules about backward compatibility in ClickHouse (just to clarify, the list is probably incomplete):

  • We put some effort into maintaining a one-year compatibility window (which includes 2 LTS versions). It means that any two versions should be able to work together in a cluster if the difference between them is less than 1 year (or if there are fewer than 2 LTS versions between them). However, it's not recommended to keep a cluster in such a state for a long time, because some minor issues are possible (like slowdown of distributed queries, retriable errors in some background operations in ReplicatedMergeTree, etc.).
  • If the difference between versions is more than 1 year, it's recommended to upgrade with downtime (stop all servers, upgrade all servers, start all servers) or to upgrade through an intermediate version. It's not guaranteed that such versions can work together, but major issues (like data loss/corruption) must not happen if you accidentally run them together. It is expected, though, that all queries may simply fail, for example.
  • It should be possible to downgrade to a compatible version (less than 1 year old) if you have not started using new features yet. But once you have, you cannot go back. New features are usually controlled by settings (at first disabled by default, then enabled after ~1 year). Running a query of some new type or using some new syntax may also enable a new feature.
  • The native protocol is always compatible (you can use any version of the client with any version of the server).
  • Compatibility of experimental features can be broken at any moment in any way.
  • Sometimes these rules can be violated if an incompatibility affects only marginal use cases, or if the only way to fix a major issue (potential security issue, data loss/corruption, incorrect query results, etc.) is to make a backward incompatible change, or if a compatible solution would be too complicated. Such changes must be highlighted in the changelog as a "backward incompatible change".

AFAIK, these principles are not documented anywhere, and we don't have clear recommendations about upgrading in the documentation; it would be great to add some (cc: @DanRoscigno).

The problem with this improvement is that it may write a new "merge type" into ZooKeeper even though the user did not enable any new features. That makes downgrade impossible, and it's a bug (it violates the third rule; the fifth and sixth are not applicable here).

The typical path for introducing an incompatible feature is option 2 described in my comment above. Sometimes it's possible to write special code that makes the new version compatible with old ones without extra settings, and remove that compatibility code after one year (we had a lot of such tricks in ReplicatedMergeTree for working with metadata in ZooKeeper).

Unfortunately, our CI is still quite poor in terms of catching incompatibilities and sometimes we do not notice that some change can be incompatible, so we really appreciate community help on that.

Maybe we can try to introduce another check, similar to the functional tests with the Replicated database (it runs a cluster of 3 nodes), but running one server of an old version. Or run a 2-node cluster in the BC check in stress tests, with the previous and current versions. The main problem is that it's not clear how to automatically distinguish compatibility issues from expected errors (even now the BC check finds tons of completely irrelevant errors that may normally happen).

@Algunenano
Member

This is gold @tavplubix. Thanks a lot for all this information.

@zhongyuankai
Contributor Author

  • Introduce a merge tree setting that enables new behavior, disable it by default to avoid creating log entries with new merge type, enable it by default after one year with a note about backward incompatible change in the changelog.

@tavplubix I'm very sorry for not being more thorough. I think the second approach is better; I will submit a patch to fix this problem.

@tavplubix
Member

The second approach (with a setting) is simpler, but definitely not better.
