Is there a way to configure S3 upload requests concurrency? #48389

Closed
cricun opened this issue Apr 4, 2023 · 3 comments
Labels
question

Comments

cricun commented Apr 4, 2023

I'm trying to export data to S3.
If the file is bigger than s3_max_single_part_upload_size, it is sent as a multipart upload. Say the file is split into 500 parts of 8 MB each. ClickHouse starts sending all the parts in parallel; the only throttle is the requests-per-second limit (s3_max_put_rps). If a request is not answered within 3 seconds, it is sent again. This leads to a situation where S3 is still processing the previous request but receives a new one for the same part, up to 11 times per part. This happens with a slow network or high load on the S3 side.
In the worst case for this example, 5500 requests may be sent instead of 500, all of them in parallel, and each one consumes resources on the receiving side.

Is there a way to prevent this? Either:

  1. Limit concurrency. For example, send no more than 5 parts at a time and don't continue until at least one of them gets a response (or times out).
  2. Disable concurrency and send a new request only after the previous one has been answered or has timed out.

Query:

INSERT INTO FUNCTION s3('http://localhost:9008/s3bucket/filename', 'access_key', 'secret_key', 'CSV') SELECT number
FROM numbers(500000000)
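
For context, here is a sketch of the same export with the relevant settings spelled out. The values just mirror the example above and are illustrative, not recommendations; s3_min_upload_part_size is my guess at the knob behind the 8 MB part size mentioned above.

INSERT INTO FUNCTION s3('http://localhost:9008/s3bucket/filename', 'access_key', 'secret_key', 'CSV')
SELECT number
FROM numbers(500000000)
SETTINGS
    s3_max_single_part_upload_size = 33554432, -- 32 MiB; anything larger goes multipart
    s3_min_upload_part_size = 8388608,         -- 8 MiB parts, as in the example above
    s3_max_put_rps = 100                       -- caps requests per second, not concurrency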
cricun added the question label Apr 4, 2023
serxa (Member) commented Apr 10, 2023

Unfortunately, at the moment there is no way to properly limit the number of in-flight S3 requests. But you can try a number of other things:

  1. As a temporary (not for production) measure you may limit the number of threads in IOThreadPool. Every S3 request uses its own thread, so this will do the trick. Unfortunately, other processes also use IOThreadPool and they might degrade. The server setting max_io_thread_pool_size controls the number of threads; the default value is 100, so there should be no more than 100 simultaneous requests (not counting retries). You can monitor this with the query select * from system.metrics where metric = 'S3Requests'.
  2. There are settings that limit not the S3 request rate but the network bandwidth. They might be helpful if the bottleneck is the egress network channel rather than S3 itself. Please check the query setting max_remote_write_network_bandwidth and the server setting max_remote_write_network_bandwidth_for_server (see the sketch below the list).
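
For illustration only, the monitoring query and the query-level bandwidth cap from the points above could look like this. The bandwidth value is an arbitrary example; as far as I know max_remote_write_network_bandwidth is specified in bytes per second.

-- how many S3 requests are currently in flight
SELECT * FROM system.metrics WHERE metric = 'S3Requests';

-- cap egress bandwidth for a single export query
INSERT INTO FUNCTION s3('http://localhost:9008/s3bucket/filename', 'access_key', 'secret_key', 'CSV')
SELECT number
FROM numbers(500000000)
SETTINGS max_remote_write_network_bandwidth = 26214400 -- ~25 MiB/s, illustrative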

A proper solution should appear once IO scheduling is implemented. It is a fairly long task; you can track the implementation in #47009 (and possibly some follow-up PRs).

serxa (Member) commented Apr 10, 2023

  3. Also, you can try to split your data into bigger parts, e.g. by lowering s3_upload_part_size_multiply_parts_count_threshold. Check the following settings (see the sketch below):
    a) s3_upload_part_size_multiply_factor, default=2. Multiply s3_min_upload_part_size by this factor each time s3_upload_part_size_multiply_parts_count_threshold parts have been uploaded from a single write to S3.
    b) s3_upload_part_size_multiply_parts_count_threshold, default=500. Each time this number of parts has been uploaded to S3, s3_min_upload_part_size is multiplied by s3_upload_part_size_multiply_factor.
    c) s3_max_single_part_upload_size, default=32 MB. The maximum size of an object to upload to S3 using a single-part upload.
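
A sketch of what tuning these on the export query might look like; the values are purely illustrative and assume your S3 server is fine with larger parts.

INSERT INTO FUNCTION s3('http://localhost:9008/s3bucket/filename', 'access_key', 'secret_key', 'CSV')
SELECT number
FROM numbers(500000000)
SETTINGS
    s3_min_upload_part_size = 33554432,                       -- start with 32 MiB parts (illustrative)
    s3_upload_part_size_multiply_factor = 2,                  -- double the part size...
    s3_upload_part_size_multiply_parts_count_threshold = 100  -- ...every 100 uploaded parts instead of the default 500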

cricun (Author) commented Apr 10, 2023

@serxa Thank you for your willingness to help and for providing several solutions. Unfortunately, none of them completely resolves my issue:

  1. Export to S3 is not a high-priority task for us; keeping the system from degrading is more important, and this functionality has to be used in production.
  2. The bottleneck is the custom S3 server.
  3. This could help. I will increase s3_max_single_part_upload_size as soon as I get information about real-user experience; maybe most of the requests will fit in one hundred. I tried increasing the multipart size, but things got worse, so I fixed it at a constant size.

As a temporary solution, I've limited s3_max_put_rps and s3_max_upload_part_size on the ClickHouse side and the timeouts on the S3 side. Setting these parameters guarantees a limit on connections (= rps × timeout) on the ClickHouse side and on RAM consumption (= connections × request_size) on the S3 side, but at the cost of drastically reduced performance.
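
For the record, a rough sketch of that bound with made-up numbers: with s3_max_put_rps = 10 and a 3-second timeout on the S3 side, roughly 10 × 3 = 30 requests can be in flight at once, and with parts capped at 8 MiB the S3 server buffers at most around 30 × 8 MiB ≈ 240 MiB for this query.

INSERT INTO FUNCTION s3('http://localhost:9008/s3bucket/filename', 'access_key', 'secret_key', 'CSV')
SELECT number
FROM numbers(500000000)
SETTINGS
    s3_max_put_rps = 10,               -- illustrative throttle
    s3_max_upload_part_size = 8388608  -- 8 MiB cap per part, illustrative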

cricun closed this as completed Apr 19, 2023