Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ReadFromMergeTree use MinMax index before Partition key to filter parts #48093

Closed
zhongwang97 opened this issue Mar 28, 2023 · 5 comments
Closed
Labels
question Question?

Comments

@zhongwang97
Copy link

I have created a table which use DateTime as Partition Key

CREATE TABLE hits_simple
(
    `UserID` UInt32,
    `URL` String,
    `EventTime` DateTime
)
ENGINE = MergeTree
PRIMARY KEY (UserID, URL)
ORDER BY (UserID, URL, EventTime)
PARTITION BY toYYYYMMDD(EventTime)
SETTINGS index_granularity = 8192, index_granularity_bytes = 0;

and try to query one row, just for test

select * from hits_simple where  UserID = 279588 AND EventTime = '2014-03-17 18:08:53'

the explain show ClickHouse use MinMax to filter parts first, then use Partition key to find corresponding parts

image

I would like to know why not use Partition key first as it seems more efficient
thanks

@zhongwang97 zhongwang97 added the question Question? label Mar 28, 2023
@save-my-heart
Copy link
Contributor

Actually, the MinMax here is also partition key.

@zhongwang97
Copy link
Author

Actually, the MinMax here is also partition key.

Yes, I understand that the MinMax index here is based on the minimum and maximum values of each column in the data part, which is determined by the partition key.

However, this approach requires checking all data parts in the first step. Wouldn't it be more natural and efficient to first locate the corresponding data part based on the partition key and then further filter using MinMax within those data parts?

Or is there any special concern here?

@den-crane
Copy link
Contributor

However, this approach requires checking all data parts in the first step. Wouldn't it be more natural and efficient to first locate the corresponding data part based on the partition key and then further filter using MinMax within those data parts?

it's the same. To locate the corresponding data part based on the partition key requires checking all data parts in the first step. Partition pruning checks all parts.

@arloor
Copy link

arloor commented Aug 15, 2023

It looks like that Clickhouse always creates a Minmax index on datatime columns if these columns are part of partition key.
I guess that the purpose is to filter many data parts by time attribute.
OLAP queries usually come with a time attribute.
So the auto-created Minmax index of datetime in partition key will accelerate a lot of queries.

above is my guess. is that right?

@tnhminh
Copy link

tnhminh commented Nov 16, 2023

Actually, the MinMax here is also partition key.

yes, i think so
I tried to make the MinMax use my custom key such as : UserID
The solution is as below :
Partition by (UseID) -> done

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Question?
Projects
None yet
Development

No branches or pull requests

5 participants