-
Notifications
You must be signed in to change notification settings - Fork 6.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WHERE clauses optimization leveraging Projection feature #63207
base: master
Are you sure you want to change the base?
WHERE clauses optimization leveraging Projection feature #63207
Conversation
This is an automated comment for commit cf38f59 with description of existing statuses. It's updated for the latest CI running ❌ Click here to open a full report in a separate page
Successful checks
|
…ckHouse into proj_optimization_analyzer
Thank you! This looks interesting... I didn't check the code yet. According to my expectations, PREWHERE will scan the column for filtering and then read the remaining data from the positions that passed the filter. |
Hi Alexey, thanks for the response. First, this is a add-on to PREWHERE not a replacement to PREWHERE.
Secondly, on top of PREWHERE, this optimization utilize the projection feature by including a sub select query With
|
Actually, i guess it make sense to include part name as well, because it will allow to filter out parts which doesn't have interesting rows for us. select * from table where indexHint((_part, primary_key) in (select _part, primary_key from test_a where secondary_key='-42')) and secondary_key='-42'; |
Hi @UnamedRus , I think we discussed this in the original thread. Do we still need to re-calculate projection table if we try to include _part and _part_offset? |
_part_offset was different topic, it was more like: what if we will materialize and store _part_offset column from part in projection. But here i'm talking about "virtual" column _part, which just have meaning of current part being queried, it should be always up to date, because it's virtual |
Can you elaborate on this? I dont see any differences by including _part into the query.
|
Difference could be visible Imagine, if your order by is something like:
Because index is sparse, it does meant for select * from table where indexHint((tenant_id, event_id) in (select tenant_id, event_id from test_a where secondary_key='-42')) and secondary_key='-42'; It will read at least 1 granule from each part. (it's still be win from perf improvement, but add unnecessary slowness) |
To begin with code review it requires to add a performance test. |
How to add performance test? |
Please see the documentation: https://clickhouse.com/docs/en/development/tests#performance-tests |
Refer to original PR #57216, porting code to analyzer passes to align with other ClickHouse optimization features.
This is a proposal to optimize select query by leveraging projection feature.
Since PROJECTION can effectively create a new primary key for the table, it can increase the searching speed if using properly.
This implementation inserts a new subquery to the original where condition, in order to use the projection feature in more general cases.
For example,
select * from table where secondary_key='-42';
will now become
select * from table where primary_key in (select primary_key from test_a where secondary_key='-42') and secondary_key='-42';
This implementation is tested and proved increasing query execution speed vastly.
Thanks to @UnamedRus 's suggestion, indexHint() is also added to further optimize the query. Now the query is going to be re-write to
select * from table where indexHint(primary_key in (select primary_key from test_a where secondary_key='-42')) and secondary_key='-42';
It is proved that this will also optimize query execution speed. Tests results are attached below.
Pseudo Code
Restrictions
=
,!=
,in
,notIn
functionFor example,
can be re-written to
However, query like the following will not be recognized, and will not be re-written.
Test result
/* Query using primary key (fastest) */
/* Query does not using projection */
/* Optimized non-projection query now can leveraging projection feature */
/* Optimized query with indexHint */
Query using primary key (fastest) : 0.042 sec, Peak memory usage: 44.66 KiB
Query does not using projection : 2.177 sec, Peak memory usage: 1021.77 KiB.
Optimized non-projection query : 0.101 sec, Peak memory usage: 275.28 KiB
Optimized query with indexHint : 0.094 sec, Peak memory usage: 223.92 KiB.
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Rewrite where clause in order to leverage PROJECTION feature.
Documentation entry for user-facing changes
Modify your CI run:
NOTE: If your merge the PR with modified CI you MUST KNOW what you are doing
NOTE: Checked options will be applied if set before CI RunConfig/PrepareRunConfig step
Include tests (required builds will be added automatically):
Exclude tests:
Extra options:
Only specified batches in multi-batch jobs: