Queries to S3 having glob patterns takes a long time to complete #53643

MaheshGPai · 2023-08-21T13:58:18Z

Describe the situation
There is a huge difference in query response time when querying S3 with and without glob patterns in the S3 URL.

How to reproduce
Below is an example query without a glob pattern in the directory part of the path (only the file name uses `*`)

SELECT Column1, Column2, _path
FROM s3('https://test-s3-bucket.s3.us-west-2.amazonaws.com/Partition1/202308210735/*.parquet', <aws_access_key_id>, <aws_secret_access_key>) LIMIT 2

Below is an example query with glob pattern

SELECT Column1, Column2, _path
FROM s3('https://test-s3-bucket.s3.us-west-2.amazonaws.com/Partition1/{202308210735,DUMMY}/*.parquet', <aws_access_key_id>, <aws_secret_access_key>) LIMIT 2

Effectively, both queries should read and parse the same number of files (the "DUMMY" alternative used in the glob pattern does not exist).
The response time is approximately 1.5s for the query without the glob pattern,
but the same query with the glob pattern takes around 140s.

Which ClickHouse server version to use
ClickHouse client version 23.7.3.14 (official build).

Expected performance
The response time in both cases should be more or less the same.

Additional context
Data stored in S3 is in Parquet format. There are multiple files within 202308210735. As is obvious from the pattern, the data is time partitioned, and there will be multiple folders like 202308210740, 202308210745, etc.
Additionally, there are multiple top-level folders as well (e.g. Partition1, Partition2).

I ran a query analysis of both queries, and from that it's quite clear that the glob-pattern-based query leads to a substantially higher number of S3ListObject & S3Reads. A snapshot of the comparison was attached.

tavplubix · 2023-08-21T15:23:30Z

Probably a duplicate of #49929

MaheshGPai · 2023-08-21T15:37:32Z

@tavplubix Yep, looks like it's the same issue. I was just about to post that I suspected this to be a result of a highly partitioned bucket. The number of objects in the S3 bucket may be leading to the high number of list operations.
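
To illustrate why a highly partitioned bucket could produce so many list operations, here is a minimal Python sketch. It assumes that object listing starts from the literal part of the key that precedes the first glob character; that is an assumption made for illustration, not something confirmed in this thread. The helper `listing_prefix` and the `GLOB_CHARS` set are hypothetical names.

```python
# Characters that start a glob construct in an S3 key (assumed set for this sketch).
GLOB_CHARS = "*?{"


def listing_prefix(key: str) -> str:
    """Return the literal part of `key` that precedes its first glob character."""
    positions = [key.find(c) for c in GLOB_CHARS if c in key]
    return key[: min(positions)] if positions else key


# The two object keys from the queries in this issue.
plain = "Partition1/202308210735/*.parquet"
braced = "Partition1/{202308210735,DUMMY}/*.parquet"

print(listing_prefix(plain))   # Partition1/202308210735/
print(listing_prefix(braced))  # Partition1/
```

Under that assumption, the first query lists only the objects of one time partition, while the second lists every object under Partition1 and filters them afterwards, so the number of list requests grows with the number of partitions.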

dchimeno · 2024-01-29T13:07:56Z

I also ran into this with a query like

SELECT toStartOfHour(eventdate) as ts, max(value) as value
             FROM s3('https://s3.xx.amazonaws.com/xxxx/xxx/xxxxx/xxxxx/day={2023-12-12,2023-12-13,2023-12-14}/*.parquet','key', 'secret')
             GROUP BY ts
             order by ts
             SETTINGS max_threads=24

without string globbing, it works great.
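
As a possible workaround sketch, the brace alternatives can be expanded on the client side so that each resulting path has a concrete `day=` prefix. The Python helper below, `expand_braces`, is a hypothetical name and handles only non-nested, comma-separated `{a,b,c}` groups. Range forms such as `{N..M}` are not covered.

```python
import itertools
import re


def expand_braces(pattern: str) -> list[str]:
    """Expand non-nested {a,b,c} groups into every concrete combination."""
    # re.split with a capture group alternates literal text and group contents.
    parts = re.split(r"\{([^{}]*)\}", pattern)
    # Even indexes are literal text, odd indexes are comma-separated alternatives.
    choices = [p.split(",") if i % 2 else [p] for i, p in enumerate(parts)]
    return ["".join(combo) for combo in itertools.product(*choices)]


pattern = "day={2023-12-12,2023-12-13,2023-12-14}/*.parquet"
for path in expand_braces(pattern):
    print(path)
```

Each expanded path could then be passed to its own `s3(...)` call, for example combined with `UNION ALL`, so that no brace group remains in the URL.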

MaheshGPai added the performance label Aug 21, 2023

filimonov added the comp-s3 label Sep 25, 2023

zvonand mentioned this issue Mar 31, 2024

Improve S3 glob performance #62120

Merged

29 tasks

kssenii closed this as completed in #62120 May 7, 2024
