Performance with functions in primary index #33056

den-crane · 2021-12-22T18:16:08Z

I don't really understand why in case ORDER BY (toStartOfHour(dt), metric_id, dt) CH reads more rows.

ORDER BY (toStartOfHour(dt), metric_id, dt) / Processed 262.14 thousand rows
ORDER BY (toStartOfHour(dt), metric_id ) / Processed 24.58 thousand rows

18.14.18 / 21.8.10.19 / 21.13.1.9141

CREATE TABLE perf1 (     `dt` DateTime,     `metric_id` Int64,     `F` Float64 ) ENGINE = MergeTree 
PARTITION BY toYYYYMMDD(dt) ORDER BY (toStartOfHour(dt), metric_id, dt);

CREATE TABLE perf2 (     `dt` DateTime,     `metric_id` Int64,     `F` Float64 )  ENGINE = MergeTree 
PARTITION BY toYYYYMMDD(dt) ORDER BY (toStartOfHour(dt), metric_id);

insert into perf1 select toDateTime('2021-12-21 00:00:00')+ number/20, number%1111, 1 from numbers(100000000) ;
insert into perf2 select toDateTime('2021-12-21 00:00:00')+ number/20, number%1111, 1 from numbers(100000000) ;

optimize table perf1 final ;
optimize table perf2 final ;

SELECT count() FROM perf1
PREWHERE ((dt >= toDateTime('2021-12-21 00:00:00')) AND (dt < toDateTime('2021-12-21 01:00:00'))) AND (metric_id = 42);
┌─count()─┐
│      65 │
└─────────┘
1 rows in set. Elapsed: 0.004 sec. Processed 262.14 thousand rows, 3.15 MB (72.95 million rows/s., 875.45 MB/s.)


SELECT count() FROM perf2
PREWHERE ((dt >= toDateTime('2021-12-21 00:00:00')) AND (dt < toDateTime('2021-12-21 01:00:00'))) AND (metric_id = 42);
┌─count()─┐
│      65 │
└─────────┘
1 rows in set. Elapsed: 0.002 sec. Processed 24.58 thousand rows, 294.91 KB (13.64 million rows/s., 163.70 MB/s.)


SELECT count() FROM perf1
PREWHERE ((dt >= toDateTime('2021-12-21 00:00:00')) AND (dt < toDateTime('2021-12-21 01:00:00')));
┌─count()─┐
│   72000 │
└─────────┘
1 rows in set. Elapsed: 0.003 sec. Processed 1.73 million rows, 6.91 MB (584.11 million rows/s., 2.34 GB/s.)

SELECT count() FROM perf2
PREWHERE ((dt >= toDateTime('2021-12-21 00:00:00')) AND (dt < toDateTime('2021-12-21 01:00:00')));
┌─count()─┐
│   72000 │
└─────────┘
1 rows in set. Elapsed: 0.002 sec. Processed 147.46 thousand rows, 589.82 KB (80.60 million rows/s., 322.41 MB/s.)


SELECT count() FROM perf1
PREWHERE metric_id = 42;
┌─count()─┐
│   90009 │
└─────────┘
1 rows in set. Elapsed: 0.009 sec. Processed 15.18 million rows, 121.44 MB (1.74 billion rows/s., 13.94 GB/s.)

SELECT count() FROM perf2
PREWHERE metric_id = 42;
┌─count()─┐
│   90009 │
└─────────┘
1 rows in set. Elapsed: 0.008 sec. Processed 15.18 million rows, 121.44 MB (1.91 billion rows/s., 15.29 GB/s.)

The text was updated successfully, but these errors were encountered:

den-crane · 2021-12-22T18:22:55Z

And if the toStartOfHour is materialized as a real column

CREATE TABLE perf1 (`dt` DateTime, `metric_id` Int64, `F` Float64, dth DateTime default toStartOfHour(dt) ) 
ENGINE = MergeTree  PARTITION BY toYYYYMMDD(dt)  ORDER BY (dth, metric_id, dt);

CREATE TABLE perf2 (`dt` DateTime, `metric_id` Int64, `F` Float64, dth DateTime default toStartOfHour(dt) ) 
ENGINE = MergeTree PARTITION BY toYYYYMMDD(dt)  ORDER BY (dth, metric_id);


insert into perf1(dt, metric_id, F) select toDateTime('2021-12-21 00:00:00')+ number/20, number%1111, 1 from numbers(100000000) ;
insert into perf2(dt, metric_id, F) select toDateTime('2021-12-21 00:00:00')+ number/20, number%1111, 1 from numbers(100000000) ;

optimize table perf1 final ;
optimize table perf2 final ;

SELECT count() FROM perf1
PREWHERE ((dt >= toDateTime('2021-12-21 00:00:00')) AND (dt < toDateTime('2021-12-21 01:00:00'))) AND (metric_id = 42) 
   and ((dth >= toDateTime('2021-12-21 00:00:00')) AND (dth < '2021-12-21 01:00:00'))
┌─count()─┐
│      65 │
└─────────┘
1 rows in set. Elapsed: 0.002 sec. Processed 16.38 thousand rows, 262.14 KB (7.74 million rows/s., 123.79 MB/s.)


SELECT count() FROM perf2
PREWHERE ((dt >= toDateTime('2021-12-21 00:00:00')) AND (dt < toDateTime('2021-12-21 01:00:00'))) AND (metric_id = 42) 
      and ((dth >= toDateTime('2021-12-21 00:00:00')) AND (dth < toDateTime('2021-12-21 01:00:00')))
┌─count()─┐
│      65 │
└─────────┘
1 rows in set. Elapsed: 0.002 sec. Processed 16.38 thousand rows, 262.14 KB (7.39 million rows/s., 118.16 MB/s.)

den-crane · 2021-12-22T19:01:05Z

ah, and CH does not use
ORDER BY (toStartOfHour(dt) for filtering (dt >= toDateTime('2021-12-21 00:00:00')

ah, toStartOfHour is not monotonic, that is expected.
So the question is still why (toStartOfHour(dt), metric_id, dt) / Processed 262.14 thousand rows

den-crane · 2021-12-22T19:06:22Z

SELECT count() FROM perf1
PREWHERE ((dt >= toDateTime('2021-12-21 00:00:00')) AND (dt < toDateTime('2021-12-21 01:00:00'))) AND (metric_id = 42)
         and toStartOfHour(dt) >= toStartOfHour(toDateTime('2021-12-21 00:00:00'))
         and toStartOfHour(dt) < toStartOfHour(toDateTime('2021-12-21 01:00:00'))
┌─count()─┐
│      65 │
└─────────┘
1 rows in set. Elapsed: 0.003 sec. Processed 16.38 thousand rows, 196.61 KB (6.04 million rows/s., 72.54 MB/s.)


SELECT count() FROM perf1
PREWHERE ((dt >= toDateTime('2021-12-21 00:00:00')) AND (dt < toDateTime('2021-12-21 01:00:00'))) 
and toStartOfHour(dt) >= toStartOfHour(toDateTime('2021-12-21 00:00:00'))
and toStartOfHour(dt) < toStartOfHour(toDateTime('2021-12-21 01:00:00'))
┌─count()─┐
│   72000 │
└─────────┘
1 rows in set. Elapsed: 0.003 sec. Processed 73.73 thousand rows, 294.91 KB (23.76 million rows/s., 95.06 MB/s.)

alexey-milovidov · 2021-12-23T07:39:39Z

It makes not much sense to have toStartOfHour(dt) instead of dt in primary key.

UnamedRus · 2021-12-23T12:00:29Z

It does make sense.
Assume you have 2 types of queries:

SELECT * FROM table WHERE ((dt >= toDateTime('2021-12-21 00:00:00')) AND (dt < toDateTime('2021-12-21 01:00:00')))
SELECT * FROM table WHERE ((dt >= toDateTime('2021-12-21 00:00:00')) AND (dt < toDateTime('2021-12-21 01:00:00'))) AND metric_id = xxxx

And big chunk of your queries doesn't read more than couple hours of data.

If you will put dt at start of ORDER BY, it would mess query 2, because dt is high cardinality column.
If you will put dt at the end of ORDER BY, it would mess query 1, because ClickHouse would need to read data from uniq(metric_id) different ranges.

So having toStartOfHour(dt) or toDate(dt) at beginning of ORDER BY helps to colocate data near by and reduce amount of read rows for different query patterns.

amosbird · 2021-12-24T08:47:42Z

It does make sense.

Yes, it's useful and show significant improvement in some of our production use cases.

So the question is still why (toStartOfHour(dt), metric_id, dt) / Processed 262.14 thousand rows

I'll investigate.

den-crane · 2022-01-15T19:34:45Z

one more test

CREATE TABLE perf1 (     `dt` DateTime,     `metric_id` Int64,     `F` Float64 ) ENGINE = MergeTree 
ORDER BY (toDate(dt), metric_id, dt);

CREATE TABLE perf2 (     `dt` DateTime,     `metric_id` Int64,     `F` Float64 )  ENGINE = MergeTree 
ORDER BY (toDate(dt), metric_id);

insert into perf1 select toDateTime('2021-12-21 00:00:00')+ number/20, number%1111, 1 from numbers(100000000) ;
insert into perf2 select toDateTime('2021-12-21 00:00:00')+ number/20, number%1111, 1 from numbers(100000000) ;

optimize table perf1 final ;
optimize table perf2 final ;



SELECT count() FROM perf1 WHERE dt = 1640044800;

Key condition: (column 2 in [1640044800, 1640044800])
1 rows in set. Elapsed: 0.084 sec. Processed 100.00 million rows, 400.00 MB (1.19 billion rows/s., 4.75 GB/s.)


SELECT count() FROM perf2 WHERE dt = 1640044800;

Key condition: (column 0 in [18982, 18982])
1 rows in set. Elapsed: 0.007 sec. Processed 1.73 million rows, 6.91 MB (259.23 million rows/s., 1.04 GB/s.)

So in case ORDER BY (toDate(dt), metric_id, dt) CH uses column 2 / dt for filtering dt =
And in case ORDER BY (toDate(dt), metric_id) CH uses column 0 / toDate(dt) for filtering dt =

At would be nice if CH be able to use both column 0 & column 2 in the case with ORDER BY (toDate(dt), metric_id, dt)

UnamedRus · 2022-01-15T22:00:14Z

Workaround for now:

CREATE TABLE perf1 (     `dt` DateTime,     `metric_id` Int64,     `F` Float64 ) ENGINE = MergeTree
PRIMARY KEY (toDate(dt), metric_id)
ORDER BY (toDate(dt), metric_id, dt);

makeavish · 2022-03-10T02:23:34Z

@UnamedRus What if I want to create a PRIMARY KEY as for my table metric_id is not unique and can occur multiple times in an hour/day. In that case I can't use (toDate(dt), metric_id) as PRIMARY KEY.
Any recommendations in that case?

amosbird · 2022-03-10T15:38:52Z

At would be nice if CH be able to use both column 0 & column 2 in the case with ORDER BY (toDate(dt), metric_id, dt)

In order to achieve this, we have to do index analysis for every key column (and there can be a combinatorial explosion when set index is used) . Our index analysis is already quite sophisticated. I don't think it's worth to extend for such a minor use case.

UnamedRus · 2022-03-11T21:41:43Z

(and there can be a combinatorial explosion when set index is used)

Can we do that without set index?

@UnamedRus What if I want to create a PRIMARY KEY as for my table metric_id is not unique and can occur multiple times in an hour/day. In that case I can't use (toDate(dt), metric_id) as PRIMARY KEY.

Primary key isn't uniq constraint in ClickHouse, it's basically prefix of ORDER BY which being put in memory index.

loyd · 2023-02-17T05:35:16Z

The approach with OK = (toStartOfInterval(dt, ..), metric_id, dt) is similar to how prometheus and loki store data (but in files instead of contiguous parts).

CH doesn't use the last dt in requests, so there is no sense to put it in PK (and put only to OK instead). Am I right?

UnamedRus · 2023-06-13T22:08:49Z

23.6

CREATE TABLE perf1 (     `dt` DateTime,     `metric_id` Int64,     `F` Float64 ) ENGINE = MergeTree 
ORDER BY (toStartOfDay(dt), metric_id, dt);

CREATE TABLE perf2 (     `dt` DateTime,     `metric_id` Int64,     `F` Float64 )  ENGINE = MergeTree 
ORDER BY (toStartOfDay(dt), metric_id);

insert into perf1 select toDateTime('2021-12-21 00:00:00')+ number/20, number%1111, 1 from numbers(100000000) ;
insert into perf2 select toDateTime('2021-12-21 00:00:00')+ number/20, number%1111, 1 from numbers(100000000) ;

optimize table perf1 final ;
optimize table perf2 final ;



SELECT count() FROM perf1 WHERE dt = 1640044800;

_local.perf1 (SelectExecutor): Key condition: (column 2 in [1640044800, 1640044800])
2023.06.14 01:04:21.083043 [ 1424 ] {64409330-a343-4966-9609-9cfb420e33e7} <Trace> _local.perf1 (SelectExecutor): Used generic exclusion search over index for part all_1_96_4 with 16881 steps
2023.06.14 01:04:21.083156 [ 1424 ] {64409330-a343-4966-9609-9cfb420e33e7} <Debug> _local.perf1 (SelectExecutor): Selected 1/1 parts by partition key, 1 parts by primary key, 12207/12208 marks by primary key, 12207 marks to read from 1 ranges


SELECT count() FROM perf2 WHERE dt = 1640044800;

 Key condition: (column 0 in [1640034000, 1640034000])
2023.06.14 01:04:37.411896 [ 1424 ] {60a5de67-ac68-40f2-b2f0-cdfca3198aea} <Trace> _local.perf2 (SelectExecutor): Running binary search on index range for part all_1_96_3 (12209 marks)
2023.06.14 01:04:37.411918 [ 1424 ] {60a5de67-ac68-40f2-b2f0-cdfca3198aea} <Trace> _local.perf2 (SelectExecutor): Found (LEFT) boundary mark: 0
2023.06.14 01:04:37.411935 [ 1424 ] {60a5de67-ac68-40f2-b2f0-cdfca3198aea} <Trace> _local.perf2 (SelectExecutor): Found (RIGHT) boundary mark: 211
2023.06.14 01:04:37.411979 [ 1424 ] {60a5de67-ac68-40f2-b2f0-cdfca3198aea} <Trace> _local.perf2 (SelectExecutor): Found continuous range in 26 steps
2023.06.14 01:04:37.411996 [ 1424 ] {60a5de67-ac68-40f2-b2f0-cdfca3198aea} <Debug> _local.perf2 (SelectExecutor): Selected 1/1 parts by partition key, 1 parts by primary key, 211/12208 marks by primary key, 211 marks to read from 1 ranges

Related #28087

den-crane added question Question? performance labels Dec 22, 2021

den-crane added the comp-optimizers Query optimizations label Dec 22, 2021

den-crane mentioned this issue Jun 16, 2022

Is QuestDB Faster Than ClickHouse? #38109

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance with functions in primary index #33056

Performance with functions in primary index #33056

den-crane commented Dec 22, 2021 •

edited

den-crane commented Dec 22, 2021 •

edited

den-crane commented Dec 22, 2021 •

edited

den-crane commented Dec 22, 2021

alexey-milovidov commented Dec 23, 2021 •

edited

UnamedRus commented Dec 23, 2021

amosbird commented Dec 24, 2021

den-crane commented Jan 15, 2022 •

edited

UnamedRus commented Jan 15, 2022

makeavish commented Mar 10, 2022

amosbird commented Mar 10, 2022 •

edited

UnamedRus commented Mar 11, 2022

loyd commented Feb 17, 2023

UnamedRus commented Jun 13, 2023 •

edited

Performance with functions in primary index #33056

Performance with functions in primary index #33056

Comments

den-crane commented Dec 22, 2021 • edited

den-crane commented Dec 22, 2021 • edited

den-crane commented Dec 22, 2021 • edited

den-crane commented Dec 22, 2021

alexey-milovidov commented Dec 23, 2021 • edited

UnamedRus commented Dec 23, 2021

amosbird commented Dec 24, 2021

den-crane commented Jan 15, 2022 • edited

UnamedRus commented Jan 15, 2022

makeavish commented Mar 10, 2022

amosbird commented Mar 10, 2022 • edited

UnamedRus commented Mar 11, 2022

loyd commented Feb 17, 2023

UnamedRus commented Jun 13, 2023 • edited

den-crane commented Dec 22, 2021 •

edited

den-crane commented Dec 22, 2021 •

edited

den-crane commented Dec 22, 2021 •

edited

alexey-milovidov commented Dec 23, 2021 •

edited

den-crane commented Jan 15, 2022 •

edited

amosbird commented Mar 10, 2022 •

edited

UnamedRus commented Jun 13, 2023 •

edited