
Optimize TTL merge, completely expired parts can be removed in time #42869

Merged: 9 commits into ClickHouse:master, Nov 9, 2022

Conversation

zhongyuankai
Contributor

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

When the merge tasks are continuously busy and disk space is insufficient, completely expired parts cannot be selected and dropped, which makes the disk space shortage worse.
My idea is that when an entire part has expired, no additional disk space needs to be reserved, so the TTL drop can proceed normally.

Information about CI checks: https://clickhouse.com/docs/en/development/continuous-integration/

@zhongyuankai
Contributor Author

(screenshot attached)

@robot-ch-test-poll robot-ch-test-poll added the pr-improvement Pull request with some product improvements label Nov 1, 2022
```diff
- parts_to_merge = delete_ttl_selector.select(parts_ranges, max_total_size_to_merge);
+ parts_to_merge = delete_ttl_selector.select(
+     parts_ranges,
+     data_settings->ttl_only_drop_parts ? data_settings->max_bytes_to_merge_at_max_space_in_pool : max_total_size_to_merge);
```
Member


Is max_bytes_to_merge_at_max_space_in_pool really suitable here? It's the maximum size for all threads in the background pool; shouldn't we divide it between threads here?

Contributor Author


Thank you very much for your reply. My idea is to allow completely expired parts to be deleted promptly, unaffected by merge pressure and disk space, so I used the maximum value.
I don't quite understand what you mean by dividing between threads. Could you give me a more detailed hint?

Member


According to the docs, it relates to the number of threads: "Maximum in total size of parts to merge, when there are maximum free threads in background pool (or entries in replication queue)."

And in another place, we spread the value among the workers.

```cpp
max_size = static_cast<UInt64>(interpolateExponential(
    data_settings->max_bytes_to_merge_at_min_space_in_pool,
    data_settings->max_bytes_to_merge_at_max_space_in_pool,
    static_cast<double>(free_entries) / data_settings->number_of_free_entries_in_pool_to_lower_max_size_of_merge));
```

But if, as you said, it does not affect disk space, we can leave it as is.

By the way, can we pass another special value here, e.g., 0 or uint max?

@vdimir vdimir self-assigned this Nov 2, 2022
@vdimir vdimir added the can be tested Allows running workflows for external contributors label Nov 2, 2022
Member

@vdimir vdimir left a comment


LGTM

@tavplubix could you please also take a quick look, just to be sure?

@tavplubix
Member

Changes make sense, but ttl_only_drop_parts is just a setting and we can drop whole parts by TTL even if it's set to false. It would be better to ignore estimated space for merge if we are going to drop a whole part, not only when the setting is true. See TTLDeleteMergeSelector. Also, do we need similar changes for ReplicatedMergeTree?

@zhongyuankai
Contributor Author

> Changes make sense, but ttl_only_drop_parts is just a setting and we can drop whole parts by TTL even if it's set to false. It would be better to ignore estimated space for merge if we are going to drop a whole part, not only when the setting is true. See TTLDeleteMergeSelector. Also, do we need similar changes for ReplicatedMergeTree?

@tavplubix OK, updated. From my current understanding, making the same change in ReplicatedMergeTree doesn't seem to be a problem; is there something I'm missing?

@vdimir
Member

vdimir commented Nov 4, 2022

Fast test — fail: 1, passed: 3801, skipped: 821

https://s3.amazonaws.com/clickhouse-test-reports/42869/1345f7e73da0bfbd4cc9e1c55458bcf10164c6df/fast_test.html

The test is not flaky, but it also doesn't look related; I'll try to rerun the fast tests and investigate.

```
02102_row_binary_with_names_and_types | FAIL | 1.84
2022-11-04 16:54:39 Code: 75. DB::ErrnoException: Cannot write to file (fd = 1), errno: 32, strerror: Broken pipe: (in query: SELECT toUInt32(1) AS x, 'text' as z, toDate('2020-01-01') AS y FORMAT RowBinaryWithNamesAndTypes). (CANNOT_WRITE_TO_FILE_DESCRIPTOR)
```

@zhongyuankai
Contributor Author

@vdimir The failed tests don't seem to be related to the changes.

@vdimir
Member

vdimir commented Nov 9, 2022

Stress test (debug) — Backward compatibility check: Error message in clickhouse-server.log (see bc_check_error_messages.txt)

Stress test (tsan) — Backward compatibility check: Error message in clickhouse-server.log (see bc_check_error_messages.txt)

#41369

Stress test (msan) — OOM killer (or signal 9) in clickhouse-server.log

@vdimir vdimir merged commit 82bf992 into ClickHouse:master Nov 9, 2022
@Algunenano
Member

Hi, when running master with this change alongside a server without it (pending upgrade), we see these errors:

Error in the replica running master:

```
server_shard:       1
version:            22.11.1.836
name:               NOT_IMPLEMENTED
code:               48
quantity:           1
last_error_message: There was an error on [clickhouse-02:39000]: Code: 48. DB::Exception: Unknown MergeType 4. (NOT_IMPLEMENTED) (version 22.8.8.3 (official build))
traceback:          DB::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, int, bool)
DB::DDLQueryStatusSource::generate()
DB::ISource::tryGenerate()
DB::ISource::work()
DB::ExecutionThreadContext::executeTask()
DB::PipelineExecutor::executeStepImpl(unsigned long, std::__1::atomic<bool>*)
DB::PipelineExecutor::executeImpl(unsigned long)
DB::PipelineExecutor::execute(unsigned long)
DB::CompletedPipelineExecutor::execute()
DB::executeQuery(DB::ReadBuffer&, DB::WriteBuffer&, bool, std::__1::shared_ptr<DB::Context>, std::__1::function<void (std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&)>, std::__1::optional<DB::FormatSettings> const&)
DB::HTTPHandler::processQuery(DB::HTTPServerRequest&, DB::HTMLForm&, DB::HTTPServerResponse&, DB::HTTPHandler::Output&, std::__1::optional<DB::CurrentThread::QueryScope>&)
DB::HTTPHandler::handleRequest(DB::HTTPServerRequest&, DB::HTTPServerResponse&)
DB::HTTPServerConnection::run()
Poco::Net::TCPServerConnection::start()
Poco::Net::TCPServerDispatcher::run()
Poco::PooledThread::run()
Poco::ThreadImpl::runnableEntry(void*)
clone
```

Many errors in the older replica (running 22.8/HEAD):

```
server_shard:       2
version:            22.8.8.3
name:               NOT_IMPLEMENTED
code:               48
quantity:           160
last_error_message: Unknown MergeType 4
traceback:          DB::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int, bool)
DB::Exception::Exception<unsigned long>(int, fmt::v8::basic_format_string<char, fmt::v8::type_identity<unsigned long>::type>, unsigned long&&)
DB::ReplicatedMergeTreeLogEntryData::readText(DB::ReadBuffer&)
DB::ReplicatedMergeTreeQueue::pullLogsToQueue(std::__1::shared_ptr<zkutil::ZooKeeper>, std::__1::function<void (Coordination::WatchResponse const&)>, DB::ReplicatedMergeTreeQueue::PullLogsReason)
DB::StorageReplicatedMergeTree::queueUpdatingTask()
DB::BackgroundSchedulePoolTaskInfo::execute()
DB::BackgroundSchedulePool::threadFunction()
ThreadPoolImpl<std::__1::thread>::worker(std::__1::__list_iterator<std::__1::thread, void*>)
```

What would be the proper way to handle this? Also, could you include a warning in the changelog about it?

@tavplubix
Member

Thanks for reporting this. I've just reverted these changes because, fortunately, they were not released yet. There are two ways to handle this (the first is probably better):

  • Avoid introducing new merge type, just ignore disk space requirements for existing TTLDelete type if we are going to drop whole part.
  • Introduce a merge tree setting that enables new behavior, disable it by default to avoid creating log entries with new merge type, enable it by default after one year with a note about backward incompatible change in the changelog.

@Algunenano
Member

It makes sense to me to have a path for introducing new features, since that will be necessary in the future, but I guess the path differs depending on the change. In this case, option 2 seems good, as you can turn it on in the settings once you have upgraded the cluster completely.

One of my greatest concerns is how to avoid breaking old servers when running alongside newer ones. Otherwise it's impossible to do upgrades without strict human supervision, and sometimes you find in production that something isn't working as before, so having the option to downgrade is always nice (not ideal, but it might save the night). This is why we run our internal tests mixing CH versions in a cluster, monitoring failures and system.errors, and how we happened to detect this change as incompatible. We only check a small part of all CH features (we didn't detect the serialization change either), but I hope that, together with further improvements to CH's CI itself, it will become easier to avoid these problems.

@tavplubix
Member

There are some general rules about backward compatibility in ClickHouse (just to clarify, the list is probably incomplete):

  • We put some effort into maintaining a one-year compatibility window (which includes 2 LTS versions). It means that any two versions should be able to work together in a cluster if the difference between them is less than 1 year (or if there are fewer than 2 LTS versions between them). However, it's not recommended to keep a cluster in such a state for a long time, because some minor issues are possible (like slowdown of distributed queries, retriable errors in some background operations in ReplicatedMergeTree, etc.).
  • If the difference between versions is more than 1 year, it's recommended to upgrade with downtime (stop all servers, upgrade all servers, start all servers) or to upgrade through an intermediate version. It's not guaranteed that such versions can work together, but major issues (like data loss/corruption) must not happen if you accidentally run them together. It is expected, though, that all queries may simply fail, for example.
  • It should be possible to downgrade to a compatible version (less than 1 year old) if you have not started using new features yet. But once you have, you cannot go back. New features are usually controlled by settings (at first disabled by default, then enabled after ~1 year). Running a query of some new type or using some new syntax may also enable a new feature.
  • The native protocol is always compatible (you can use any version of the client with any version of the server).
  • Compatibility of experimental features can be broken at any moment in any way.
  • Sometimes these rules can be violated if an incompatibility affects only marginal use cases, or if the only way to fix a major issue (potential security issue, data loss/corruption, incorrect query results, etc.) is to make a backward incompatible change, or if a compatible solution would be too complicated. Such changes must be highlighted in the changelog as a "backward incompatible change".

AFAIK, these principles are not documented anywhere, and we don't have clear recommendations about upgrading in the documentation; it would be great to add some (cc: @DanRoscigno).

The problem with this improvement is that it may write a new "merge type" into ZooKeeper even though the user did not enable any new features. That makes downgrade impossible, and it's a bug (it violates the third rule; the fifth and sixth are not applicable here).

The typical path for introducing an incompatible feature is option 2 described in my comment above. Sometimes it's possible to write special code that makes the new version compatible with old ones without extra settings, and remove that compatibility code after one year (we had a lot of such tricks in ReplicatedMergeTree for working with metadata in ZooKeeper).

Unfortunately, our CI is still quite poor in terms of catching incompatibilities and sometimes we do not notice that some change can be incompatible, so we really appreciate community help on that.

Maybe we can try to introduce another check, similar to the functional tests with the Replicated database (it runs a cluster of 3 nodes), but running one server of an old version. Or run a 2-node cluster in the BC check in stress tests, with the previous and current versions. The main problem is that it's not clear how to automatically distinguish compatibility issues from expected errors (even now the BC check finds tons of completely irrelevant errors that may normally happen).

@Algunenano
Member

This is gold @tavplubix. Thanks a lot for all this information.

@zhongyuankai
Contributor Author

  • Introduce a merge tree setting that enables new behavior, disable it by default to avoid creating log entries with new merge type, enable it by default after one year with a note about backward incompatible change in the changelog.

@tavplubix I'm very sorry for not being more thorough. I think the second approach is better; I will submit a patch to fix this problem.

@tavplubix
Member

The second approach (with a setting) is simpler, but definitely not better.
