
cache dictionary request the same key lot of times #51762

Closed
filimonov opened this issue Jul 4, 2023 · 2 comments · Fixed by #51853
filimonov (Contributor) commented:
When a single block of data processed by dictGet has many duplicate keys, ClickHouse will construct a suboptimal query against the dictionary source, like

SELECT ... FROM dict_source WHERE `id` IN (1000, 1000, ..., 1000, 1000, 1000);

instead of simple

SELECT ... FROM dict_source WHERE `id` IN (1000);

Repro:

DROP TABLE IF EXISTS dict_source;
DROP DICTIONARY IF EXISTS test_cache_dict;

CREATE TABLE dict_source engine = Log AS SELECT number as id, toString(number) as value FROM numbers(10000);


CREATE DICTIONARY IF NOT EXISTS test_cache_dict (
    id UInt64,
    value String
)
PRIMARY KEY id
SOURCE(
    CLICKHOUSE(
        host '127.0.0.2'
        DB currentDatabase()
        TABLE 'dict_source'
    )
)
LAYOUT(
    CACHE(SIZE_IN_CELLS 10000)
)
LIFETIME(MIN 1 MAX 100);


SELECT dictGet(test_cache_dict, 'value', materialize(toUInt64(1000))) FROM numbers_mt(1000) SETTINGS max_block_size = 50, max_threads = 4 FORMAT Null;

system flush logs;
SELECT event_time, query FROM system.query_log WHERE event_time > now() - 300 and has(tables, currentDatabase() || '.dict_source') and type = 'QueryFinish' and query_kind='Select' ORDER BY event_time DESC LIMIT 10;


SELECT dict_key, count(query_id) number_of_queries_to_source, sum(count_per_query) as sum_key_requests, max(count_per_query) as max_key_requests_per_query FROM (
SELECT  arrayJoin(splitByRegexp(',\s+', extract(query, 'IN \((.*)\);'))) as dict_key, query_id, count() count_per_query FROM system.query_log WHERE event_time > now() - 300 and has(tables, currentDatabase() || '.dict_source') and type = 'QueryFinish' and query_kind='Select' GROUP BY dict_key, query_id) GROUP BY dict_key FORMAT PrettyCompactMonoBlock;
filimonov (Contributor, Author) commented:

It is a regression between 21.3 and 21.4, most probably introduced during the refactoring in #20595.

@filimonov filimonov added the comp-dictionary Dictionaries label Jul 4, 2023
@kitaisreal kitaisreal self-assigned this Jul 4, 2023
diegov commented Jul 4, 2023:

Is the fix going to be applied to the query that is issued against the source of the dictionary? Or would it apply to the queue, e.g. by making tryPushToUpdateQueueOrThrow a no-op if the key is already in the queue to be updated?

We've had issues with dictionary update queue sizes due to a very large number of dictGet operations, which often add the same key to the update queue over and over until the first update task populates the value.
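The queue-side variant diegov describes can be sketched as follows: a push becomes a no-op when the key is already pending, so repeated dictGet misses for one key add at most one update task. This is an illustrative Python sketch; the class and method names are hypothetical, not ClickHouse's actual CacheDictionaryUpdateQueue API.

```python
from collections import deque

class UpdateQueue:
    """Bounded update queue that deduplicates pending keys (sketch)."""

    def __init__(self, max_size):
        self.max_size = max_size
        self.queue = deque()
        self.pending = set()  # keys currently awaiting an update task

    def try_push(self, key):
        if key in self.pending:
            return  # already queued: pushing again is a no-op
        if len(self.queue) >= self.max_size:
            raise RuntimeError("update queue is full")
        self.queue.append(key)
        self.pending.add(key)

    def pop_for_update(self):
        # The update task takes the key; later misses may enqueue it again.
        key = self.queue.popleft()
        self.pending.discard(key)
        return key

q = UpdateQueue(max_size=2)
for _ in range(1000):
    q.try_push(1000)  # repeated dictGet misses for the same key
print(len(q.queue))
# 1
```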
