Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Join on external dictionary works slow with grouping #24209

Closed
vkosh opened this issue May 17, 2021 · 4 comments · Fixed by #25618
Closed

Join on external dictionary works slow with grouping #24209

vkosh opened this issue May 17, 2021 · 4 comments · Fixed by #25618
Assignees
Labels
bug Confirmed user-visible misbehaviour in official release performance

Comments

@vkosh
Copy link

vkosh commented May 17, 2021

Select query with aggregation and join on external dictionary works slow when no cast in join expression provided. Table and the dictionary both have UInt64 id's. When using explicit cast like table.dict_id = toUInt64(dict.id) the query works fast.

Does it reproduce on recent release?
Yes

How to reproduce

  • ClickHouse server version to use: 21.4.6.55
create table test (key UInt64, value String) engine=MergeTree order by key;
insert into test select number, '' from numbers(1000000);

create dictionary test_dict (key UInt64, value String)
primary key key source(clickhouse(table test db 'default' user 'default'))
lifetime(min 0 max 0) layout(hashed());

-- Select without GROUP BY
-- 1.1
select test.key, test_dict.value from test join test_dict on test.key = test_dict.key limit 1;
-- Elapsed: 0.038 sec.

-- 1.2
select test.key, test_dict.value from test join test_dict on test.key = toUInt64(test_dict.key) limit 1;
-- Elapsed: 0.316 sec. Processed 1.00 million rows, 17.00 MB (3.17 million rows/s., 53.86 MB/s.)

-- 1.3
select test.key, dictGetString('test_dict', 'value', test.key) from test limit 1;
-- Elapsed: 0.007 sec.

-- Select with GROUP BY

-- 2.1
select test.key from test join test_dict on test.key = test_dict.key group by test.key limit 1;
-- Elapsed: 1286.466 sec. Processed 1.00 million rows, 8.00 MB (777.32 rows/s., 6.22 KB/s.)

-- 2.2
select test.key from test join test_dict on test.key = toUInt64(test_dict.key) group by test.key limit 1;
-- Elapsed: 0.169 sec. Processed 2.00 million rows, 16.00 MB (11.81 million rows/s., 94.49 MB/s.)

-- 2.3
select test.key, dictGetString('test_dict', 'value', test.key) from test group by test.key limit 1;
--  Elapsed: 0.056 sec. Processed 1.00 million rows, 8.00 MB (17.91 million rows/s., 143.32 MB/s.)

Expected behavior
Both queries with explicit cast in join and without it expected to work fast.

@vkosh vkosh added the bug Confirmed user-visible misbehaviour in official release label May 17, 2021
@den-crane den-crane added performance and removed bug Confirmed user-visible misbehaviour in official release labels May 23, 2021
@alexey-milovidov
Copy link
Member

@vkosh You have LIMIT 1 in queries without GROUP BY, they return first record as soon as it has been read.
In contrast, the queries with GROUP BY have to finish reading and aggregating all data before returning result.

Workaround: remove GROUP BY and replace it with DISTINCT.

@alexey-milovidov alexey-milovidov self-assigned this Jun 22, 2021
@vkosh
Copy link
Author

vkosh commented Jun 22, 2021

@alexey-milovidov,
In real queries I use GROUP BY to aggregate some data from MergeTree table, grouped by dictionary key field, e.g.:

select count(), user_dict.email
from click join user_dict on click.user_id = user_dict.id
group by click.user_id;

So I actually use workaround from 2.2 case.
Anyway thanks for the suggestion.

@alexey-milovidov
Copy link
Member

I cannot explain the difference between 2.1 and 2.2, it is strange indeed.

@alexey-milovidov
Copy link
Member

I checked the code and can qualify it as a bug.

KeyGetterForDict::findKey is reading the whole dictionary for every record.
This code simply should not exist.

@alexey-milovidov alexey-milovidov added the bug Confirmed user-visible misbehaviour in official release label Jun 22, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Confirmed user-visible misbehaviour in official release performance
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants