@tavplubix Yep, looks like it's the same issue. I was just about to post that I suspected this to be a result of a highly partitioned bucket. The number of objects in the S3 bucket may be leading to the high number of list operations.
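The suspicion above can be illustrated with a small simulation. This is only a sketch under a stated assumption: that the server lists every key under the literal prefix preceding the first glob character and filters the keys afterwards. That behaviour is not confirmed in this thread. The bucket layout, key names, patterns, and page size below are all hypothetical, since the original queries are not shown.

```python
import re

GLOB_CHARS = "*?{"


def list_prefix(key_glob):
    """Return the literal part of the key before the first glob character."""
    positions = [key_glob.find(c) for c in GLOB_CHARS if c in key_glob]
    return key_glob[: min(positions)] if positions else key_glob


def glob_to_regex(key_glob):
    """Translate a tiny glob dialect: '*' and '?' (no '/'), '{a,b}' alternatives."""
    out = []
    i = 0
    while i < len(key_glob):
        ch = key_glob[i]
        if ch == "*":
            out.append("[^/]*")
        elif ch == "?":
            out.append("[^/]")
        elif ch == "{":
            end = key_glob.index("}", i)
            alts = key_glob[i + 1 : end].split(",")
            out.append("(?:" + "|".join(re.escape(a) for a in alts) + ")")
            i = end  # skip to the closing brace
        else:
            out.append(re.escape(ch))
        i += 1
    return re.compile("".join(out))


def simulate(bucket_keys, key_glob, page_size=1000):
    """Count keys listed under the prefix, keys matched, and paged list calls."""
    prefix = list_prefix(key_glob)
    listed = [k for k in bucket_keys if k.startswith(prefix)]
    pattern = glob_to_regex(key_glob)
    matched = [k for k in listed if pattern.fullmatch(k)]
    list_calls = max(1, -(-len(listed) // page_size))  # ceiling division
    return {
        "prefix": prefix,
        "listed": len(listed),
        "matched": len(matched),
        "list_calls": list_calls,
    }


# Hypothetical bucket: 2 partitions x 3 time folders x 500 files each.
bucket = [
    f"Partition{p}/{folder}/part-{n}.parquet"
    for p in (1, 2)
    for folder in ("202308210735", "202308210740", "202308210745")
    for n in range(500)
]

narrow = simulate(bucket, "Partition1/202308210735/*.parquet")
wide = simulate(bucket, "Partition1/{202308210735,DUMMY}/*.parquet")
print(narrow)
print(wide)
```

Both patterns match the same files, but the second one has a much shorter literal prefix, so under this assumption far more keys are listed and then thrown away. The more folders there are under the shared prefix, the larger that gap becomes.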
SELECT toStartOfHour(eventdate) AS ts, max(value) AS value
FROM s3('https://s3.xx.amazonaws.com/xxxx/xxx/xxxxx/xxxxx/day={2023-12-12,2023-12-13,2023-12-14}/*.parquet', 'key', 'secret')
GROUP BY ts
ORDER BY ts
SETTINGS max_threads = 24
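To make the brace pattern in the query above concrete, here is a minimal sketch of what the `{a,b,c}` alternatives stand for: one pattern per listed value. The helper `expand_braces` and the placeholder path are hypothetical and only illustrate the notation. This is not ClickHouse's implementation, and `{N..M}` ranges and nested braces are not handled.

```python
import itertools
import re
from typing import List


def expand_braces(pattern: str) -> List[str]:
    """Expand non-nested {a,b,c} groups into one pattern per combination."""
    # re.split with a capture group alternates: literal, alternatives, literal, ...
    parts = re.split(r"\{([^{}]*)\}", pattern)
    choices = [p.split(",") if i % 2 else [p] for i, p in enumerate(parts)]
    return ["".join(combo) for combo in itertools.product(*choices)]


# Placeholder path: the real bucket path is masked in the thread.
expanded = expand_braces("bucket/path/day={2023-12-12,2023-12-13,2023-12-14}/*.parquet")
for p in expanded:
    print(p)
```

Each expanded pattern has a literal prefix that reaches down to a single `day=` folder, which may be useful when comparing how many objects have to be listed per pattern.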
Describe the situation
Query response time differs hugely depending on whether the S3 URL contains a glob pattern.
How to reproduce
Below is an example query without a glob pattern
Below is an example query with a glob pattern
Effectively, both queries should be reading or parsing the same number of files (the "DUMMY" entry used in the glob pattern does not exist).
The resulting response time is approximately 1.5s for the query without the glob pattern.
But the same query with the glob pattern takes around 140s.
Which ClickHouse server version to use
ClickHouse client version 23.7.3.14 (official build).
Expected performance
The response time in both cases should be more or less the same.
Additional context
Data stored in S3 is in Parquet format. There are multiple files within 202308210735. As is obvious from the pattern, the data is time partitioned and there will be multiple folders like 202308210740, 202308210745, etc.
Additionally, there are multiple top-level folders as well (e.g. Partition1, Partition2, etc.).
I ran query analysis on both queries, and from that it's quite clear that the glob-pattern-based query leads to a substantially higher number of S3ListObject & S3Reads. Below is a snapshot of the comparison