Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Queries to S3 having glob patterns takes a long time to complete #53643

Closed
MaheshGPai opened this issue Aug 21, 2023 · 3 comments · Fixed by #62120
Closed

Queries to S3 having glob patterns takes a long time to complete #53643

MaheshGPai opened this issue Aug 21, 2023 · 3 comments · Fixed by #62120

Comments

@MaheshGPai
Copy link

MaheshGPai commented Aug 21, 2023

Describe the situation
There is a huge difference wrt query response time when querying S3 with or without glob patterns in the S3 URL

How to reproduce
Below is an example query without glob pattern

SELECT Column1, Column2, _path
FROM s3('https://test-s3-bucket.s3.us-west-2.amazonaws.com/Partition1/202308210735/*.parquet', <aws_access_key_id>, <aws_secret_access_key>) LIMIT 2

Below is an example query with glob pattern

SELECT Column1, Column2, _path
FROM s3('https://test-s3-bucket.s3.us-west-2.amazonaws.com/Partition1/{202308210735,DUMMY}/*.parquet', <aws_access_key_id>, <aws_secret_access_key>) LIMIT 2

Effectively both the queries should be reading or parsing the same number of files ("DUMMY" used the glob pattern is non-existent).
The resulting response time is approximately 1.5s for query without glob pattern.
But the same query with blob pattern is taking around 140s

  • Which ClickHouse server version to use
    ClickHouse client version 23.7.3.14 (official build).

Expected performance
The response time in both cases should be more or less the same.

Additional context
Data stored in S3 is of parquet format. There are multiple files within 202308210735. As is obvious from the pattern, the data is time partitioned and there will be multiple folders like 202308210740, 202308210745 etc..
Additionally, there are multiple top-level folders as well (eg) Partition1, Partition2 etc..

I did run query analysis of both the queries and from that its quite clear that the glob pattern based query is leading a substantially higher S3ListObject & S3Reads. Below is a snapshot of the comparision

image
@tavplubix
Copy link
Member

Probably a duplicate of #49929

@MaheshGPai
Copy link
Author

@tavplubix Yep, looks like its the same issue. I was just about to post that I was suspecting this is to be a result of highly partitioned bucket. The number of objects in the S3 bucket may be leading to the high number of list operations.

@dchimeno
Copy link

also suffered this in a query like

SELECT toStartOfHour(eventdate) as ts, max(value) as value
             FROM s3('https://s3.xx.amazonaws.com/xxxx/xxx/xxxxx/xxxxx/day={2023-12-12,2023-12-13,2023-12-14}/*.parquet','key', 'secret')
             GROUP BY ts
             order by ts
             SETTINGS max_threads=24

without string globbing, it works great.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants