Merge parquet bloom filter and min/max evaluation #71383


Merged

al13n321 merged 35 commits into ClickHouse:master from arthurpassos:merge_parquet_minmax_bloom_filter_evaluation

Jan 29, 2025

Contributor

arthurpassos commented Nov 1, 2024 (edited)

Changelog category (leave one):

Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Evaluate parquet bloom filters and min/max indexes together. Necessary to properly support: x = 3 or x > 5 where data = [1, 2, 4, 5]
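The example in the changelog entry can be traced with a small sketch. The types and function names below are illustrative assumptions, not ClickHouse code, and an exact set stands in for the bloom filter (so it has no false positives). For a row group holding [1, 2, 4, 5], min/max alone cannot rule out `x = 3`, a bloom filter alone cannot evaluate `x > 5`, and only a per-atom combination lets the row group be skipped.

```cpp
#include <cstdint>
#include <unordered_set>

// Hypothetical per-row-group statistics; names are illustrative only.
struct RowGroupStats
{
    int64_t min;
    int64_t max;
    // Stand-in for a bloom filter: an exact set, so no false positives.
    std::unordered_set<int64_t> bloom;
};

// Min/max statistics can answer both equality and range atoms.
bool minMaxMayEqual(const RowGroupStats & s, int64_t v) { return s.min <= v && v <= s.max; }
bool minMaxMayBeGreater(const RowGroupStats & s, int64_t v) { return s.max > v; }

// A bloom filter can only answer equality atoms.
bool bloomMayEqual(const RowGroupStats & s, int64_t v) { return s.bloom.count(v) > 0; }

// Predicate: x = 3 OR x > 5. Each function returns "the row group may contain a match".
bool minMaxOnly(const RowGroupStats & s)
{
    return minMaxMayEqual(s, 3) || minMaxMayBeGreater(s, 5);
}

bool bloomOnly(const RowGroupStats & s)
{
    // "x > 5" is unknown to a bloom filter, so it has to count as "may be true".
    const bool greater_unknown = true;
    return bloomMayEqual(s, 3) || greater_unknown;
}

bool combined(const RowGroupStats & s)
{
    // Each atom consults every index that applies to it before the OR is taken.
    const bool eq = minMaxMayEqual(s, 3) && bloomMayEqual(s, 3);
    const bool gt = minMaxMayBeGreater(s, 5);
    return eq || gt;
}
```

With `RowGroupStats{1, 5, {1, 2, 4, 5}}`, both single-index passes report that the row group may match, so running them one after the other still reads it; the combined evaluation reports that it cannot match.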

Documentation entry for user-facing changes

Documentation is written (mandatory for new features)

Information about CI checks: https://clickhouse.com/docs/en/development/continuous-integration/

CI Settings (Only check the boxes if you know what you are doing):

Allow: All Required Checks
Allow: Stateless tests
Allow: Stateful tests
Allow: Integration Tests
Allow: Performance tests
Allow: All Builds
Allow: batch 1, 2 for multi-batch jobs
Allow: batch 3, 4, 5, 6 for multi-batch jobs

Exclude: Style check
Exclude: Fast test
Exclude: All with ASAN
Exclude: All with TSAN, MSAN, UBSAN, Coverage
Exclude: All with aarch64, release, debug

Run only fuzzers related jobs (libFuzzer fuzzers, AST fuzzers, etc.)
Exclude: AST fuzzers


          draft / poc

1b6c602

Contributor Author

arthurpassos commented Nov 1, 2024

For now, this is just an ugly POC. Done with minimal effort. Ideas for its design are welcome.

The design challenge here is if we should re-use logic in KeyCondition or not. If so, how, and how much.

Contributor Author

arthurpassos commented Nov 1, 2024 (edited)

~~Ok, looks like I messed up when switching branches and managed to lose some code :D~~

arthurpassos commented

src/Processors/Formats/Impl/Parquet/ParquetFilterCondition.cpp Outdated

    
    rpn_stack.emplace_back(intersects, !contains);
    if (rpn_stack.back().can_be_true && element.bloom_filter_data)

Contributor Author

arthurpassos Nov 1, 2024

Just highlighting bf check since diff can't do it

arthurpassos commented

src/Processors/Formats/Impl/Parquet/ParquetFilterCondition.cpp Outdated

    
    {
        rpn_stack.emplace_back(true, true);
        if (element.bloom_filter_data)

Contributor Author

arthurpassos Nov 1, 2024

Just highlighting bf check since diff can't do it

arthurpassos commented

src/Processors/Formats/Impl/Parquet/ParquetFilterCondition.cpp Outdated

    
          * represented by a set of hyperrectangles.
          */
    }
    else if (element.function == ConditionElement::FUNCTION_POINT_IN_POLYGON)

Contributor Author

arthurpassos Nov 1, 2024

.

src/Processors/Formats/Impl/Parquet/ParquetFilterCondition.cpp Outdated

    
    rpn_stack.emplace_back(element.set_index->checkInRange(hyperrectangle, data_types, single_point));
    if (rpn_stack.back().can_be_true && element.bloom_filter_data)

Contributor Author

arthurpassos Nov 1, 2024

Just highlighting bf check since diff can't do it

Contributor Author

arthurpassos commented Nov 1, 2024 (edited)

The first option that comes to my mind

Make KeyCondition::checkInHyperrectangle static and try to re-use its logic. To do that, RPN and other things would be passed in the function arguments. The RPN would contain an optional bloom_filter data field that contains hashes just like we need. Since the field is optional, it wouldn't affect existing usage. Parquet usage would need to do an external conversion like https://github.com/ClickHouse/ClickHouse/pull/71383/files#diff-13d6203d07156cfe09dfc59f2f772758a248d11dc8e96f221040579b9b5e5e34R366

The thing I dislike the most about this approach is the method name... Because it is now doing other things, not only checking the hyperrectangle.


          add a test

55e387d

Contributor Author

arthurpassos commented Nov 4, 2024 (edited)

> The first option that comes to my mind
>
> Make KeyCondition::checkInHyperrectangle static and try to re-use its logic. To do that, RPN and other things would be passed in the function arguments. The RPN would contain an optional bloom_filter data field that contains hashes just like we need. Since the field is optional, it wouldn't affect existing usage. Parquet usage would need to do an external conversion like https://github.com/ClickHouse/ClickHouse/pull/71383/files#diff-13d6203d07156cfe09dfc59f2f772758a248d11dc8e96f221040579b9b5e5e34R366
>
> The thing I dislike the most about this approach is the method name... Because it is now doing other things, not only checking the hyperrectangle.

Another problem with this approach is that we would need some sort of ~~hash~~ find hash callback in order to avoid flooding KeyCondition with parquet stuff.

EDIT

Addressed above with an interface
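A minimal sketch of what such an interface could look like, so the generic condition code never sees Parquet types. `BloomFilterLookup`, `findAnyHash`, `ParquetLikeBloomFilter` and `atomCanBeTrue` are hypothetical names chosen for illustration, and an exact hash set stands in for a real bloom filter; the interface actually added in this PR may differ.

```cpp
#include <cstdint>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical format-agnostic interface that the generic evaluator depends on.
struct BloomFilterLookup
{
    virtual ~BloomFilterLookup() = default;
    // Returns true if at least one of the hashes may be present.
    virtual bool findAnyHash(const std::vector<uint64_t> & hashes) const = 0;
};

// Hypothetical format-side implementation. A real one would wrap the file
// format's bloom filter; here an exact set of hashes stands in for it.
class ParquetLikeBloomFilter : public BloomFilterLookup
{
public:
    explicit ParquetLikeBloomFilter(std::unordered_set<uint64_t> hashes_) : hashes(std::move(hashes_)) {}

    bool findAnyHash(const std::vector<uint64_t> & candidates) const override
    {
        for (uint64_t hash : candidates)
            if (hashes.count(hash) > 0)
                return true;
        return false;
    }

private:
    std::unordered_set<uint64_t> hashes;
};

// Generic code only sees the interface; a null pointer means "no bloom filter available".
bool atomCanBeTrue(bool range_can_be_true, const std::vector<uint64_t> & hashes, const BloomFilterLookup * bloom_filter)
{
    if (!range_can_be_true)
        return false; // already ruled out by min/max, no need to consult the bloom filter
    if (bloom_filter == nullptr)
        return true;  // nothing more to check
    return bloom_filter->findAnyHash(hashes);
}
```

The point of the indirection is that the caller owning the file format supplies the implementation, while the shared evaluation logic stays unaware of where the hashes are looked up.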


          merge minmax and bf eval

679cb6e

arthurpassos changed the title ~~[Draft/ POC] Merge parquet bloom filter and min/max evaluation~~ Merge parquet bloom filter and min/max evaluation

arthurpassos marked this pull request as ready for review

November 5, 2024 14:50

Contributor Author

arthurpassos commented Nov 5, 2024

I have implemented the approach I mentioned above. Let me know if you are ok with this approach. Plus, can you enable CI?

Contributor Author

arthurpassos commented Nov 5, 2024

Basically, a new static method checkRPNAgainstHyperrectangle has been added and it is a copy of checkInHyperrectangle but with bloom filter optional checks. Existing checkInHyperrectangle is still a member function that calls the static checkRPNAgainstHyperrectangle.

Some KeyCondition members had to be made public in order to pass to checkRPNAgainstHyperrectangle.

One thing to note is that in case bf_filter=on and minmax_filter=false the minmax evaluation will happen on an infinite range. Previously, it would not happen at all.
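The shape described above can be sketched as follows. This is a heavily simplified illustration under stated assumptions, not the real KeyCondition: the hyperrectangle is reduced to a single `Range`, only "can be true" is tracked (so there is no NOT), hashes are passed through a hypothetical `HashLookup` callback, and `Condition`, `checkInRange` and `checkRPNAgainstRange` are stand-in names.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <optional>
#include <utility>
#include <vector>

// Simplified one-dimensional "hyperrectangle": the min/max of a single column.
struct Range
{
    int64_t min;
    int64_t max;
};

// Hypothetical RPN element. The optional hash mirrors the idea of an optional
// bloom filter data field: existing callers simply leave it empty.
struct RPNElement
{
    enum Function { EQUALS, GREATER, AND, OR };
    Function function;
    int64_t value;
    std::optional<uint64_t> bloom_hash;
};

// Hypothetical lookup callback: "may this hash be present in the row group?"
using HashLookup = std::function<bool(uint64_t)>;

class Condition
{
public:
    explicit Condition(std::vector<RPNElement> rpn_) : rpn(std::move(rpn_)) {}

    // Existing member entry point: delegates to the static function without a bloom filter.
    bool checkInRange(const Range & range) const { return checkRPNAgainstRange(rpn, range, nullptr); }

    // Static evaluator usable by callers that build their own RPN.
    static bool checkRPNAgainstRange(const std::vector<RPNElement> & rpn, const Range & range, const HashLookup & bloom_filter)
    {
        std::vector<bool> stack;
        for (const RPNElement & element : rpn)
        {
            if (element.function == RPNElement::EQUALS)
            {
                bool can_be_true = range.min <= element.value && element.value <= range.max;
                // Consult the bloom filter only if the range check did not already rule the atom out.
                if (can_be_true && element.bloom_hash && bloom_filter)
                    can_be_true = bloom_filter(*element.bloom_hash);
                stack.push_back(can_be_true);
            }
            else if (element.function == RPNElement::GREATER)
            {
                stack.push_back(range.max > element.value);
            }
            else
            {
                assert(stack.size() >= 2);
                bool rhs = stack.back();
                stack.pop_back();
                bool lhs = stack.back();
                stack.pop_back();
                stack.push_back(element.function == RPNElement::AND ? (lhs && rhs) : (lhs || rhs));
            }
        }
        assert(stack.size() == 1);
        return stack.back();
    }

private:
    std::vector<RPNElement> rpn;
};
```

In this sketch, passing a range that spans the whole `int64_t` domain makes every equality atom pass the range check, so the bloom filter alone decides it, while range atoms such as `x > 5` stay "may be true". That is the sketch's analogue of the infinite-range note above.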

arthurpassos mentioned this pull request

Parquet: merge bloom filter and min/max evaluation Altinity/ClickHouse#474

Closed

vdimir added the can be tested label

robot-clickhouse-ci-2 added the pr-improvement label


          trigger ci

f66be67

Contributor

robot-ch-test-poll3 commented Nov 6, 2024 (edited by robot-clickhouse-ci-1)

This is an automated comment for commit 07cb54f with description of existing statuses. It's updated for the latest CI running

✅ Click here to open a full report in a separate page

Successful checks

Check name	Description	Status
AST fuzzer	Runs randomly generated queries to catch program errors. The build type is optionally given in parenthesis. If it fails, ask a maintainer for help	✅ success
Builds	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
BuzzHouse (asan)	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
BuzzHouse (debug)	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
BuzzHouse (msan)	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
BuzzHouse (tsan)	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
BuzzHouse (ubsan)	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
ClickBench	Runs ClickBench with instant-attach table	✅ success
Compatibility check	Checks that clickhouse binary runs on distributions with old libc versions. If it fails, ask a maintainer for help	✅ success
Docker keeper image	The check to build and optionally push the mentioned image to docker hub	✅ success
Docker server image	The check to build and optionally push the mentioned image to docker hub	✅ success
Docs check	Builds and tests the documentation	✅ success
Fast test	Normally this is the first check that is ran for a PR. It builds ClickHouse and runs most of stateless functional tests, omitting some. If it fails, further checks are not started until it is fixed. Look at the report to see which tests fail, then reproduce the failure locally as described here	✅ success
Flaky tests	Checks if new added or modified tests are flaky by running them repeatedly, in parallel, with more randomization. Functional tests are run 100 times with address sanitizer, and additional randomization of thread scheduling. Integration tests are run up to 10 times. If at least once a new test has failed, or was too long, this check will be red. We don't allow flaky tests, read the doc	✅ success
Install packages	Checks that the built packages are installable in a clear environment	✅ success
Integration tests	The integration tests report. In parenthesis the package type is given, and in square brackets are the optional part/total tests	✅ success
Performance Comparison	Measure changes in query performance. The performance test report is described in detail here. In square brackets are the optional part/total tests	✅ success
Stateless tests	Runs stateless functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc	✅ success
Stress test	Runs stateless functional tests concurrently from several clients to detect concurrency-related errors	✅ success
Style check	Runs a set of checks to keep the code style clean. If some of tests failed, see the related log from the report	✅ success
Unit tests	Runs the unit tests for different release types	✅ success
Upgrade check	Runs stress tests on server version from last release and then tries to upgrade it to the version from the PR. It checks if the new server can successfully startup without any errors, crashes or sanitizer asserts	✅ success

arthurpassos added 4 commits

November 6, 2024 10:40


          extern logical_error

a7c78bd


          update test

b5d7f78


          update test

49923bb


          explicit in constructor

f559037

Contributor

UnamedRus commented Nov 29, 2024

> Necessary to properly support: x = 3 or x > 5 where data = [1, 2, 4, 5]

Correct me if I'm wrong, but this doesn't work even for MergeTree tables right now?

Contributor Author

arthurpassos commented Dec 2, 2024

> Necessary to properly support: x = 3 or x > 5 where data = [1, 2, 4, 5]
>
> Correct me, if i'm wrong, but this doesn't work even for MergeTree tables right now?

I haven't looked through the code, but I assume it doesn't work even for MT. KeyCondition does not handle bloom filters; they are handled separately, so that is most likely the case.

al13n321 requested changes

Member

al13n321 left a comment

Neat, more functionality for less code. Thanks for taking the time to simplify this!

> Correct me, if i'm wrong, but this doesn't work even for MergeTree tables right now?

Yes. I think it would make sense to do a similar refactoring to MergeTreeIndexConditionBloomFilter too (move most of the logic into KeyCondition instead of having a custom RPNElement and tree traversal; then, separately, actually merging bf index checking with primary key analysis would be more tricky, may or may not be worth it).

src/Processors/Formats/Impl/Parquet/keyConditionRPNToParquetRPN.cpp Outdated

Comment on lines 343 to 347

    
    if (found_empty_column)
    {
        condition_elements.emplace_back(Function::ALWAYS_FALSE);
        // todo arthur
        continue;
    }

Member

al13n321 Dec 3, 2024

This should just be handled by KeyCondition independent of bf. I think the line element.set_index->checkInRange already takes care of empty sets?

Contributor Author

arthurpassos Dec 3, 2024

You say that because, if the set is empty, the range check will be false, and thus rpn_stack.back().can_be_true will prevent the bloom filter check?

Member

al13n321 Dec 4, 2024 (edited)

Yes. Maybe it would make sense to also do a redundant explicit check, to not rely on checkInRange carefully handling the kinda-special "infinite range x empty set" case. Either way, this logic belongs in KeyCondition, I feel.

In my mind, empty set is not a special case at all, nothing would break if the normal code path runs on it (unless the code uses empty list as a special value with special meaning (e.g. "skip bloom filter check"), which I think is not a good idea in this case). Except that we may unnecessarily read the bloom filter from file, worth avoiding.

(Or KeyCondition's construction code could do the ALWAYS_FALSE thing, but I like that slightly less because (a) it means a more substantially different code path is taken depending on the data (whether the set happens to be empty, e.g. if it's a subquery), (b) it requires the set to be built (e.g. by running subquery) at KeyCondition construction time; which currently always happens anyway, but in future I can imagine deferring it to a different stage of query execution, with better progress reporting and cancellability than the early query analysis stage.)

Contributor Author

arthurpassos Dec 4, 2024

My mind might be tricking me, but I guess it is ok not to check for empty columns at this stage.

bool mayExistOnBloomFilter(const KeyCondition::BloomFilterData & condition_bloom_filter_data,
                           const KeyCondition::ColumnIndexToBloomFilter & column_index_to_column_bf)

will loop over the columns in the set index, and then call the overload

bool mayExistOnBloomFilter(const std::vector<uint64_t> & hashes, const std::unique_ptr<KeyCondition::BloomFilter> & bloom_filter)

which loops over the hashes, but none shall be found because empty set won't produce any hashes. If that is the case, it'll return false. If it returns false, the row group will be skipped.

That's the goal, right? Unless there is a scenario where min/max range check would return true for empty set, and then bloom filter would affect the end result.
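The empty-set argument above can be sketched with simplified signatures. This is an assumption-laden illustration: the real overloads take `KeyCondition::BloomFilterData` and `KeyCondition::ColumnIndexToBloomFilter`, whereas here plain vectors are used and `FakeBloomFilter` (an exact hash set) is a hypothetical stand-in for a bloom filter.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Stand-in for a bloom filter: an exact set of hashes, so no false positives.
using FakeBloomFilter = std::unordered_set<uint64_t>;

// Per-column check: true if any of the hashes may be present.
bool mayExistOnBloomFilter(const std::vector<uint64_t> & hashes, const FakeBloomFilter & bloom_filter)
{
    for (uint64_t hash : hashes)
        if (bloom_filter.count(hash) > 0)
            return true;
    // Also reached when `hashes` is empty: an empty set produces no hashes,
    // so nothing can be found and the row group can be skipped.
    return false;
}

// Multi-column check: every column of the set needs at least one possible match.
// Assumes filters[i] is the bloom filter for the column behind hashes_per_column[i].
bool mayExistOnBloomFilter(const std::vector<std::vector<uint64_t>> & hashes_per_column,
                           const std::vector<FakeBloomFilter> & filters)
{
    for (std::size_t i = 0; i < hashes_per_column.size(); ++i)
        if (!mayExistOnBloomFilter(hashes_per_column[i], filters[i]))
            return false;
    return true;
}
```

Under these assumptions, a set with one column and zero hashes yields `false`, which matches the expectation that an empty set lets the row group be skipped without any special-casing.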

Member

al13n321 Dec 5, 2024 (edited)

Two goals: (1) skip the row group (which is probably already redundantly accomplished by both set_index->checkInRange and mayExistOnBloomFilter), (2) don't read the bf. For (2) I just realized checkRPNAgainstHyperrectangle's behavior is irrelevant, the empty set check would need to be in getBloomFilterFilteringColumnKeys (and it doesn't seem important; if any special effort is needed to "prepare" the set correctly in there, it's probably not worth it).

(EDIT: So my first comment in this chain was incorrect: the empty set check belongs in getBloomFilterFilteringColumnKeys, not in KeyCondition. Unless KeyCondition does the ALWAYS_FALSE thing for empty set, which is also fine. None of this is important, why am I writing so many words lol.)

arthurpassos added 3 commits

December 3, 2024 09:20


          update tests

649cf0e


          update comment

0155d62


          address some comments

ed0ba4c

arthurpassos added 3 commits

December 6, 2024 07:18


          Update ArrowColumnToCHColumn.cpp


          forgot to include this file

7ec6a56


          perhaps this will wokr

8f57c5a

al13n321 reviewed

src/Storages/MergeTree/KeyCondition.cpp Outdated

    
    hashes_for_column.emplace_back(*hashed_value);
    hashes.emplace_back(std::move(hashes_for_column));
    hashes.emplace_back(static_cast<std::vector<uint64_t>>(std::move(hashes_for_column)));

Member

al13n321 Dec 6, 2024

Would just hashes.emplace_back({*hashed_value}); work?

I'd use uint64_t for the hash everywhere in this PR (changing hash_one and hash_many return types), since that's what the arrow's functions return, and all hashes in this PR come from those functions.
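On the question of whether a braced list can be passed to `emplace_back`: as far as standard C++ goes, `emplace_back` is a forwarding template, and a bare braced list has no type that template argument deduction can pick up, so that exact spelling is normally rejected by the compiler. A minimal sketch of two spellings that do compile, assuming `hashes` is a `std::vector<std::vector<uint64_t>>` (the helper names are hypothetical):

```cpp
#include <cstdint>
#include <initializer_list>
#include <vector>

using Hashes = std::vector<std::vector<uint64_t>>;

// push_back knows its parameter type, so the braced list builds a one-element vector.
Hashes viaPushBack(uint64_t hashed_value)
{
    Hashes hashes;
    hashes.push_back({hashed_value});
    return hashes;
}

// emplace_back forwards deduced arguments, so the initializer_list type has to be spelled out.
Hashes viaEmplaceBack(uint64_t hashed_value)
{
    Hashes hashes;
    // hashes.emplace_back({hashed_value});  // rejected: a bare braced list has no deducible type
    hashes.emplace_back(std::initializer_list<uint64_t>{hashed_value});
    return hashes;
}
```

Both helpers produce a single inner vector holding exactly one hash, without the intermediate `hashes_for_column` variable.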

...

e8c8977

Contributor Author

arthurpassos commented Dec 9, 2024

01086_window_view_cleanup - #72232

Contributor Author

arthurpassos commented Dec 9, 2024

@al13n321 can we merge it?

arthurpassos added 3 commits

December 10, 2024 21:06

lol

a6f4077


          re-trigger ci

b138072


          add missing columndescriptor check

5ed182c

Contributor Author

arthurpassos commented Dec 17, 2024

al13n321 and others added 3 commits

December 17, 2024 23:13


          Merge remote-tracking branch 'origin/master' into merge_parquet_minmax_bloom_filter_evaluation

42635b7


          Merge remote-tracking branch 'origin/master' into merge_parquet_minmax_bloom_filter_evaluation

fdee80d


          Merge branch 'master' into merge_parquet_minmax_bloom_filter_evaluation

6bc1933

arthurpassos mentioned this pull request

Merge parquet bloom filter and min/max evaluation Altinity/ClickHouse#590

Merged

30 tasks

Contributor Author

arthurpassos commented Jan 15, 2025

@al13n321 can we merge it?

s3_cluster - #74202

al13n321 approved these changes

Member

al13n321 commented Jan 16, 2025

IIUC, the workflow is that I should fix these flaky tests before merging, and I don't have the time/energy right now, sorry.


          Merge branch 'master' into merge_parquet_minmax_bloom_filter_evaluation

0d87fcd

devcrafter assigned al13n321

arthurpassos added 2 commits

January 20, 2025 10:09


          Merge branch 'master' into merge_parquet_minmax_bloom_filter_evaluation

f0c4cee


          Merge branch 'master' into merge_parquet_minmax_bloom_filter_evaluation

07cb54f

Contributor Author

arthurpassos commented Jan 28, 2025

@al13n321 CI/CD is green, can you fix the CH sync issue and merge it?

al13n321 added this pull request to the merge queue

Merged via the queue into ClickHouse:master with commit 288ad3e

206 checks passed

robot-clickhouse added the pr-synced-to-cloud label

arthurpassos mentioned this pull request

24.8 Backport of #71383 - Merge parquet bloom filter and min/max evaluation Altinity/ClickHouse#681

Merged

Enmk added a commit to Altinity/ClickHouse that referenced this pull request


          Merge pull request #681 from Altinity/backports/24.8/merge_parquet_bf_minmax_eval

dc9c28a

24.8 Backport of ClickHouse#71383 - Merge parquet bloom filter and min/max evaluation

svb-alt mentioned this pull request

Project Antalya Roadmap 2025 - Real-Time Data Lakes Altinity/ClickHouse#804

Open

37 tasks


Labels

can be tested pr-improvement pr-synced-to-cloud