Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Partition pruning not working as expected #7948

Closed
nvartolomei opened this issue Nov 27, 2019 · 13 comments
Closed

Partition pruning not working as expected #7948

nvartolomei opened this issue Nov 27, 2019 · 13 comments
Labels
bug Confirmed user-visible misbehaviour in official release performance

Comments

@nvartolomei
Copy link
Contributor

nvartolomei commented Nov 27, 2019

How to reproduce

  • Which ClickHouse server version to use: v19.17.2.4-testing
DROP TABLE IF EXISTS test_partition_filtering;

CREATE TABLE test_partition_filtering (
    `timestamp` DateTime,
    zoneId UInt64
) ENGINE = MergeTree() 
PARTITION BY toYYYYMMDD(timestamp)
ORDER BY (zoneId, timestamp);

INSERT INTO test_partition_filtering
SELECT
  toUInt64(now())-1000*number/1000 as timestamp,
  number/1000 as zone
FROM numbers(1000000);

OPTIMIZE TABLE test_partition_filtering final;

SELECT count() FROM test_partition_filtering WHERE toDate(toStartOfDay(timestamp)) = today() and zoneId=42;
SELECT count() FROM test_partition_filtering WHERE toDate(timestamp) = today() AND zoneId = 42;

Expected behavior
Expect both SELECTS to prune partitions and to read just a single part.

Actual behavior
First query reads one part. Second query reads 12 parts.

Logs

executeQuery: (from 127.0.0.1:48812) SELECT count() FROM test_partition_filtering WHERE (toDate(toStartOfDay(timestamp)) = today()) AND (zoneId = 42)
InterpreterSelectQuery: MergeTreeWhereOptimizer: condition "toDate(toStartOfDay(timestamp)) = today()" moved to PREWHERE
default.test_partition_filtering (SelectExecutor): Key condition: (column 0 in [42, 42]), (toDate(toStartOfDay(column 1)) in [18227, 18227]), and
default.test_partition_filtering (SelectExecutor): MinMax index condition: unknown, (toDate(toStartOfDay(column 0)) in [18227, 18227]), and
default.test_partition_filtering (SelectExecutor): Selected 1 parts by date, 1 parts by key, 1 marks to read from 1 ranges



executeQuery: (from 127.0.0.1:48812) SELECT count() FROM test_partition_filtering WHERE (toDate(timestamp) = today()) AND (zoneId = 42)
InterpreterSelectQuery: MergeTreeWhereOptimizer: condition "toDate(timestamp) = today()" moved to PREWHERE
default.test_partition_filtering (SelectExecutor): Key condition: (column 0 in [42, 42]), (toDate(column 1) in [18227, 18227]), and
default.test_partition_filtering (SelectExecutor): MinMax index condition: unknown, (toDate(column 0) in [18227, 18227]), and
default.test_partition_filtering (SelectExecutor): Selected 12 parts by date, 1 parts by key, 1 marks to read from 1 ranges
@nvartolomei nvartolomei added the bug Confirmed user-visible misbehaviour in official release label Nov 27, 2019
@victor-perov
Copy link

Is it reproducible on a stable version as well?

@nvartolomei
Copy link
Contributor Author

@victor-perov 19.17.2.4-testing is a stable version even though system.build_options reports it as testing.

@den-crane
Copy link
Contributor

den-crane commented Nov 27, 2019

each part has min_max_timestamp.idx file.
This file stores max & min values of timestamp column over this part.

pruning works with where timestamp > = <
pruning does not work with function(timestamp) > = < because function could be one-way_function

imho in these queries pruning does not work at all, only PK.

@den-crane
Copy link
Contributor

den-crane commented Nov 27, 2019

Though maybe I am wrong.

CREATE TABLE test_partition_filtering (
    timestamp DateTime) ENGINE = MergeTree()
PARTITION BY toYYYYMMDD(timestamp)
ORDER BY tuple();

INSERT INTO test_partition_filtering SELECT '2019-01-01 10:00:00' FROM numbers(1000000);
INSERT INTO test_partition_filtering SELECT '2019-01-02 10:00:00' FROM numbers(1000000);
INSERT INTO test_partition_filtering SELECT toDateTime(today())+3600 FROM numbers(1000000);

select count() from test_partition_filtering 
where toDate(toStartOfDay(timestamp)) = today();
Processed 1.00 million rows

select count() from test_partition_filtering 
where toDate((timestamp)) = today();
Processed 1.00 million rows

@victor-perov
Copy link

It would be nice to get an update on the issue.

@hagen1778
Copy link

I'm hitting the same issue. Is there any updates?

@den-crane
Copy link
Contributor

den-crane commented Sep 24, 2020

related: #15255

20.10.1.4704.
SELECT count() FROM test_partition_filtering WHERE toDate(toStartOfDay(timestamp)) = today() and zoneId=42;
Selected 1 parts by date, 1 parts by key, 1 marks by primary key, 1 marks to read from 1 ranges
Processed 8.19 thousand rows

SELECT count() FROM test_partition_filtering WHERE toDate(timestamp) = today() AND zoneId = 42;
Selected 1 parts by date, 1 parts by key, 1 marks by primary key, 1 marks to read from 1 ranges
Processed 8.19 thousand rows

seems fixed starting with 20.8

@amosbird
Copy link
Collaborator

It's fixed in #13497 . And toDate(timestamp_ms / 1000) will also work because of #14513 .

@vitalvi
Copy link

vitalvi commented Apr 29, 2021

Should partition pruning work with functions comming from CTE, for example:

with toDate('2019-01-02') as dt select count() from test_partition_filtering where toDate((timestamp)) = dt;

1 rows in set. Elapsed: 0.025 sec. Processed 3.00 million rows, 12.00 MB (120.75 million rows/s., 483.00 MB/s.)

Is this expected that 3 million rows have been processed in this case?

@amosbird
Copy link
Collaborator

Should partition pruning work with functions comming from CTE, for example:

with toDate('2019-01-02') as dt select count() from test_partition_filtering where toDate((timestamp)) = dt;

1 rows in set. Elapsed: 0.025 sec. Processed 3.00 million rows, 12.00 MB (120.75 million rows/s., 483.00 MB/s.)

Is this expected that 3 million rows have been processed in this case?

This is not CTE but scalar alias. It should work. Please share a minimal reproduceable test case.

@vitalvi
Copy link

vitalvi commented Apr 29, 2021

I use the same table definition and data as provided above:

CREATE TABLE test_partition_filtering (
    timestamp DateTime) ENGINE = MergeTree()
PARTITION BY toYYYYMMDD(timestamp)
ORDER BY tuple();

INSERT INTO test_partition_filtering SELECT '2019-01-01 10:00:00' FROM numbers(1000000);
INSERT INTO test_partition_filtering SELECT '2019-01-02 10:00:00' FROM numbers(1000000);
INSERT INTO test_partition_filtering SELECT toDateTime(today())+3600 FROM numbers(1000000);

with toDate('2019-01-02') as dt select count() from test_partition_filtering where toDate((timestamp)) = dt;
1 rows in set. Elapsed: 0.025 sec. Processed 3.00 million rows, 12.00 MB (120.75 million rows/s., 483.00 MB/s.)

but

select count() from test_partition_filtering where toDate((timestamp)) = toDate('2019-01-02');
1 rows in set. Elapsed: 0.014 sec. Processed 1.00 million rows, 4.00 MB (71.74 million rows/s., 286.97 MB/s.)

@amosbird
Copy link
Collaborator

It's already fixed in #21766

@vitalvi
Copy link

vitalvi commented Apr 29, 2021

ohh, pretty new one. Yes, with the latest version it works as expected. Sorry for the noise and thank you @amosbird

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Confirmed user-visible misbehaviour in official release performance
Projects
None yet
Development

No branches or pull requests

7 participants