-
Notifications
You must be signed in to change notification settings - Fork 7.7k
Merge parquet bloom filter and min/max evaluation #71383
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merge parquet bloom filter and min/max evaluation #71383
Conversation
For now, this is just an ugly POC. Done with minimal effort. Ideas for its design are welcome. The design challenge here is if we should re-use logic in |
|
|
||
rpn_stack.emplace_back(intersects, !contains); | ||
|
||
if (rpn_stack.back().can_be_true && element.bloom_filter_data) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just highlighting bf check since diff can't do it
{ | ||
rpn_stack.emplace_back(true, true); | ||
|
||
if (element.bloom_filter_data) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just highlighting bf check since diff can't do it
* represented by a set of hyperrectangles. | ||
*/ | ||
} | ||
else if (element.function == ConditionElement::FUNCTION_POINT_IN_POLYGON) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.
|
||
rpn_stack.emplace_back(element.set_index->checkInRange(hyperrectangle, data_types, single_point)); | ||
|
||
if (rpn_stack.back().can_be_true && element.bloom_filter_data) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just highlighting bf check since diff can't do it
The first option that comes to my mind Make The thing I dislike the most about this approach is the method name... Because it is now doing other things, not only checking the hyperrectangle. |
Another problem with this approach is that we would need some sort of EDIT Addressed above with an interface |
I have implemented the approach I mentioned above. Let me know if you are ok with this approach. Plus, can you enable CI? |
Basically, a new static method Some One thing to note is that in case |
This is an automated comment for commit 07cb54f with description of existing statuses. It's updated for the latest CI running ✅ Click here to open a full report in a separate page Successful checks
|
Correct me, if i'm wrong, but this doesn't work even for MergeTree tables right now? |
I haven't looked through the code, but I assume it doesn't work even for MT. KeyCondition does not handle bloom filters, it is handled separately, so most likely that's the case |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Neat, more functionality for less code. Thanks for taking the time to simplify this!
Correct me, if i'm wrong, but this doesn't work even for MergeTree tables right now?
Yes. I think it would make sense to do a similar refactoring to MergeTreeIndexConditionBloomFilter too (move most of the logic into KeyCondition instead of having a custom RPNElement and tree traversal; then, separately, actually merging bf index checking with primary key analysis would be more tricky, may or may not be worth it).
tests/queries/0_stateless/03261_test_merge_parquet_bloom_filter_minmax_stats.reference
Show resolved
Hide resolved
if (found_empty_column) | ||
{ | ||
condition_elements.emplace_back(Function::ALWAYS_FALSE); | ||
// todo arthur | ||
continue; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should just be handled by KeyCondition independent of bf. I think the line element.set_index->checkInRange
already takes care of empty sets?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you say that because if the set is empty, it'll be false and thus rpn_stack.back().can_be_true
will prevent the bloom filter check??
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. Maybe it would make sense to also do a redundant explicit check, to not rely on checkInRange carefully handling the kinda-special "infinite range x empty set" case. Either way, this logic belongs in KeyCondition, I feel.
In my mind, empty set is not a special case at all, nothing would break if the normal code path runs on it (unless the code uses empty list as a special value with special meaning (e.g. "skip bloom filter check"), which I think is not a good idea in this case). Except that we may unnecessarily read the bloom filter from file, worth avoiding.
(Or KeyCondition's construction code could do the ALWAYS_FALSE thing, but I like that slightly less because (a) it means a more substantially different code path is taken depending on the data (whether the set happens to be empty, e.g. if it's a subquery), (b) it requires the set to be built (e.g. by running subquery) at KeyCondition construction time; which currently always happens anyway, but in future I can imagine deferring it to a different stage of query execution, with better progress reporting and cancellability than the early query analysis stage.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My mind might be tricking me, but I guess it is ok not to check for empty columns at this stage.
bool mayExistOnBloomFilter(const KeyCondition::BloomFilterData & condition_bloom_filter_data,
const KeyCondition::ColumnIndexToBloomFilter & column_index_to_column_bf)
will loop over the columns in the set index, and then call the overload
bool mayExistOnBloomFilter(const std::vector<uint64_t> & hashes, const std::unique_ptr<KeyCondition::BloomFilter> & bloom_filter)
which loops over the hashes, but none shall be found because empty set won't produce any hashes. If that is the case, it'll return false. If it returns false, the row group will be skipped.
That's the goal, right? Unless there is a scenario where min/max range check would return true for empty set, and then bloom filter would affect the end result.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two goals: (1) skip the row group (which probably already redundantly accomplished by both set_index->checkInRange and mayExistOnBloomFilter), (2) don't read the bf. For (2) I just realized checkRPNAgainstHyperrectangle's behavior is irrelevant, the empty set check would need to be in getBloomFilterFilteringColumnKeys (and it doesn't seem important; if any special effort is needed to "prepare" the set correctly in there, it's probably not worth it).
(EDIT: So my first comment in this chain was incorrect: the empty set check belongs in getBloomFilterFilteringColumnKeys, not in KeyCondition. Unless KeyCondition does the ALWAYS_FALSE thing for empty set, which is also fine. None of this is important, why am I writing so many words lol.)
src/Processors/Formats/Impl/Parquet/keyConditionRPNToParquetRPN.cpp
Outdated
Show resolved
Hide resolved
hashes_for_column.emplace_back(*hashed_value); | ||
|
||
hashes.emplace_back(std::move(hashes_for_column)); | ||
hashes.emplace_back(static_cast<std::vector<uint64_t>>(std::move(hashes_for_column))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would just hashes.emplace_back({*hashed_value});
work?
I'd use uint64_t for the hash everywhere in this PR (changing hash_one and hash_many return types), since that's what the arrow's functions return, and all hashes in this PR come from those functions.
01086_window_view_cleanup - #72232 |
@al13n321 can we merge it? |
@al13n321 ping |
…x_bloom_filter_evaluation
…x_bloom_filter_evaluation
IIUC, the workflow is that I should fix these flaky tests before merging, and I don't have the time/energy right now, sorry. |
@al13n321 CI/CD is green, can you fix the CH sync issue and merge it? |
…_minmax_eval 24.8 Backport of ClickHouse#71383 - Merge parquet bloom filter and min/max evaluation
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Evaluate parquet bloom filters and min/max indexes together. Necessary to properly support:
x = 3 or x > 5
where data = [1, 2, 4, 5]Documentation entry for user-facing changes
CI Settings (Only check the boxes if you know what you are doing):