Performance with functions in primary index #33056
And if toStartOfHour is materialized as a real column:
CREATE TABLE perf1 (`dt` DateTime, `metric_id` Int64, `F` Float64, dth DateTime default toStartOfHour(dt) )
ENGINE = MergeTree PARTITION BY toYYYYMMDD(dt) ORDER BY (dth, metric_id, dt);
CREATE TABLE perf2 (`dt` DateTime, `metric_id` Int64, `F` Float64, dth DateTime default toStartOfHour(dt) )
ENGINE = MergeTree PARTITION BY toYYYYMMDD(dt) ORDER BY (dth, metric_id);
insert into perf1(dt, metric_id, F) select toDateTime('2021-12-21 00:00:00')+ number/20, number%1111, 1 from numbers(100000000) ;
insert into perf2(dt, metric_id, F) select toDateTime('2021-12-21 00:00:00')+ number/20, number%1111, 1 from numbers(100000000) ;
optimize table perf1 final ;
optimize table perf2 final ;
SELECT count() FROM perf1
PREWHERE ((dt >= toDateTime('2021-12-21 00:00:00')) AND (dt < toDateTime('2021-12-21 01:00:00'))) AND (metric_id = 42)
and ((dth >= toDateTime('2021-12-21 00:00:00')) AND (dth < '2021-12-21 01:00:00'))
┌─count()─┐
│ 65 │
└─────────┘
1 rows in set. Elapsed: 0.002 sec. Processed 16.38 thousand rows, 262.14 KB (7.74 million rows/s., 123.79 MB/s.)
SELECT count() FROM perf2
PREWHERE ((dt >= toDateTime('2021-12-21 00:00:00')) AND (dt < toDateTime('2021-12-21 01:00:00'))) AND (metric_id = 42)
and ((dth >= toDateTime('2021-12-21 00:00:00')) AND (dth < toDateTime('2021-12-21 01:00:00')))
┌─count()─┐
│ 65 │
└─────────┘
1 rows in set. Elapsed: 0.002 sec. Processed 16.38 thousand rows, 262.14 KB (7.39 million rows/s., 118.16 MB/s.) |
ah, and CH does not use ah, toStartOfHour is not monotonic, that is expected. |
SELECT count() FROM perf1
PREWHERE ((dt >= toDateTime('2021-12-21 00:00:00')) AND (dt < toDateTime('2021-12-21 01:00:00'))) AND (metric_id = 42)
and toStartOfHour(dt) >= toStartOfHour(toDateTime('2021-12-21 00:00:00'))
and toStartOfHour(dt) < toStartOfHour(toDateTime('2021-12-21 01:00:00'))
┌─count()─┐
│ 65 │
└─────────┘
1 rows in set. Elapsed: 0.003 sec. Processed 16.38 thousand rows, 196.61 KB (6.04 million rows/s., 72.54 MB/s.)
SELECT count() FROM perf1
PREWHERE ((dt >= toDateTime('2021-12-21 00:00:00')) AND (dt < toDateTime('2021-12-21 01:00:00')))
and toStartOfHour(dt) >= toStartOfHour(toDateTime('2021-12-21 00:00:00'))
and toStartOfHour(dt) < toStartOfHour(toDateTime('2021-12-21 01:00:00'))
┌─count()─┐
│ 72000 │
└─────────┘
1 rows in set. Elapsed: 0.003 sec. Processed 73.73 thousand rows, 294.91 KB (23.76 million rows/s., 95.06 MB/s.) |
It makes not much sense to have |
It does make sense.
And a big chunk of your queries don't read more than a couple of hours of data. If you put dt at the start of ORDER BY, it would hurt query 2, because dt is a high-cardinality column. So having toStartOfHour(dt) or toDate(dt) at the beginning of ORDER BY helps colocate related data and reduces the number of rows read for different query patterns. |
Yes, it's useful and shows significant improvement in some of our production use cases.
I'll investigate. |
One more test:
CREATE TABLE perf1 ( `dt` DateTime, `metric_id` Int64, `F` Float64 ) ENGINE = MergeTree
ORDER BY (toDate(dt), metric_id, dt);
CREATE TABLE perf2 ( `dt` DateTime, `metric_id` Int64, `F` Float64 ) ENGINE = MergeTree
ORDER BY (toDate(dt), metric_id);
insert into perf1 select toDateTime('2021-12-21 00:00:00')+ number/20, number%1111, 1 from numbers(100000000) ;
insert into perf2 select toDateTime('2021-12-21 00:00:00')+ number/20, number%1111, 1 from numbers(100000000) ;
optimize table perf1 final ;
optimize table perf2 final ;
SELECT count() FROM perf1 WHERE dt = 1640044800;
Key condition: (column 2 in [1640044800, 1640044800])
1 rows in set. Elapsed: 0.084 sec. Processed 100.00 million rows, 400.00 MB (1.19 billion rows/s., 4.75 GB/s.)
SELECT count() FROM perf2 WHERE dt = 1640044800;
Key condition: (column 0 in [18982, 18982])
1 rows in set. Elapsed: 0.007 sec. Processed 1.73 million rows, 6.91 MB (259.23 million rows/s., 1.04 GB/s.)
So in case It would be nice if CH were able to use both |
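For anyone reproducing the test above: one way to see which key columns the index analysis actually used, without reading the server trace log, is `EXPLAIN indexes = 1`. This is a minimal sketch that assumes the perf1 and perf2 tables from this comment and a ClickHouse version that supports the `indexes` setting of EXPLAIN; the exact output layout may differ between versions.

```sql
-- Show primary key usage (keys, condition, selected parts and granules)
-- for the same query against each table layout.
EXPLAIN indexes = 1
SELECT count() FROM perf1 WHERE dt = 1640044800;

EXPLAIN indexes = 1
SELECT count() FROM perf2 WHERE dt = 1640044800;
```

Comparing the reported condition and granule counts for the two tables should show the same difference as the `Key condition` lines quoted above.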
Workaround for now:
|
@UnamedRus What if I want to create a PRIMARY KEY as for my table metric_id is not unique and can occur multiple times in an hour/day. In that case I can't use |
In order to achieve this, we have to do index analysis for every key column (and there can be a combinatorial explosion when set index is used). Our index analysis is already quite sophisticated. I don't think it's worth extending it for such a minor use case. |
Can we do that without set index?
A primary key isn't a unique constraint in ClickHouse; it's basically a prefix of ORDER BY that is kept in the in-memory index. |
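To illustrate the point about PRIMARY KEY being a prefix of ORDER BY rather than a uniqueness constraint, here is a minimal sketch. The table name perf3 is hypothetical and not from this thread; the columns follow the earlier tests. Whether this layout reads fewer rows for the queries discussed here is not something the sketch demonstrates.

```sql
-- Hypothetical table (perf3 is an illustrative name, not from the thread).
-- The PRIMARY KEY, which defines the sparse in-memory index, is a prefix of ORDER BY.
CREATE TABLE perf3 (`dt` DateTime, `metric_id` Int64, `F` Float64)
ENGINE = MergeTree
PRIMARY KEY (toStartOfHour(dt), metric_id)
ORDER BY (toStartOfHour(dt), metric_id, dt);

-- Rows with identical key values are accepted; MergeTree does not reject or merge them.
INSERT INTO perf3 VALUES ('2021-12-21 00:00:00', 42, 1), ('2021-12-21 00:00:00', 42, 1);

SELECT count() FROM perf3;
```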
The approach with CH doesn't use the last |
23.6
Related #28087 |
I don't really understand why in case
ORDER BY (toStartOfHour(dt), metric_id, dt)
CH reads more rows.
ORDER BY (toStartOfHour(dt), metric_id, dt)
/ Processed 262.14 thousand rows
ORDER BY (toStartOfHour(dt), metric_id )
/ Processed 24.58 thousand rows