You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
If I join two tables and put a condition on the join key in one table, applying that condition to the corresponding column in the other table could sometimes make the query run much faster than it does now. Given these tables:
create table a (i int) engine=MergeTree primary key i as select * from numbers(100000000);
create table b (i int) engine=MergeTree primary key i as select * from numbers(100000000);
This query hashes all of b and takes several seconds:
select count(*) from a join b using(i) where a.i < 1000;
However, if I explicitly make a copy of the condition, most of b doesn't isn't read/hashed so I get results instantly:
select count(*) from a join b using(i) where a.i < 1000 and b.i < 1000;
(Concretely, I get eight seconds and 100m rows processed for the first, vs .003 seconds and 16 thousand rows processed for the second.)
I recognize the general case of this is more complicated than my example queries. Even a partial implementation could speed up many queries.
If ClickHouse someday takes advantage of two tables having a (prefix of the) sort key in common for joins, the first example might no longer be slow; in that case, maybe changing the join_algorithm could make it slow again.
The text was updated successfully, but these errors were encountered:
@den-crane It's great that this and #45286 are in discussion for the 2023 roadmap! Thinking about CH issue tracker etiquette, should I leave this issue open (to indicate user interest, say) or close it because the feature is already planned?
If I join two tables and put a condition on the join key in one table, applying that condition to the corresponding column in the other table could sometimes make the query run much faster than it does now. Given these tables:
This query hashes all of
b
and takes several seconds:However, if I explicitly make a copy of the condition, most of
b
doesn't isn't read/hashed so I get results instantly:(Concretely, I get eight seconds and 100m rows processed for the first, vs .003 seconds and 16 thousand rows processed for the second.)
I recognize the general case of this is more complicated than my example queries. Even a partial implementation could speed up many queries.
If ClickHouse someday takes advantage of two tables having a (prefix of the) sort key in common for joins, the first example might no longer be slow; in that case, maybe changing the join_algorithm could make it slow again.
The text was updated successfully, but these errors were encountered: