Is there a way to configure S3 upload requests concurrency? #48389

Closed
cricun opened this issue Apr 4, 2023 · 3 comments
Labels
question

Comments

cricun commented Apr 4, 2023

I'm trying to export data to S3.
If the file is bigger than s3_max_single_part_upload_size, it is sent as a multipart upload. Say the file is split into 500 parts of 8 MB each. ClickHouse starts sending all the parts in parallel; the only throttle is the requests-per-second limit (s3_max_put_rps). If a request is not answered within 3 seconds, it is sent again. This leads to a situation where S3 is still processing the previous request but receives a new one for the same part, up to 11 times per part. This happens with a slow network or high load on the S3 side.
In the worst case for this example, 5500 requests may be sent instead of 500, all of them in parallel, and each one consumes resources on the receiving side.

Is there a way to prevent this? Either:

  1. Limit concurrency. For example, send no more than 5 parts at a time and don't continue until at least one of them gets a response (or times out).
  2. Disable concurrency and send a new request only after the previous one has been answered or has timed out.

Query:

INSERT INTO FUNCTION s3('http://localhost:9008/s3bucket/filename', 'access_key', 'secret_key', 'CSV') SELECT number
FROM numbers(500000000)
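
For context, here is a sketch of the same export with the relevant settings spelled out. The values just mirror the example above and are illustrative, not recommendations; s3_min_upload_part_size is my guess at the knob behind the 8 MB part size mentioned above.

INSERT INTO FUNCTION s3('http://localhost:9008/s3bucket/filename', 'access_key', 'secret_key', 'CSV')
SELECT number
FROM numbers(500000000)
SETTINGS
    s3_max_single_part_upload_size = 33554432, -- 32 MiB; anything larger goes multipart
    s3_min_upload_part_size = 8388608,         -- 8 MiB parts, as in the example above
    s3_max_put_rps = 100                       -- caps requests per second, not concurrency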
cricun added the question label Apr 4, 2023
serxa (Member) commented Apr 10, 2023

Unfortunately, at the moment there is no way to properly limit the number of in-flight S3 requests. But you can try a number of other things:

  1. As a temporary (not for production) measure you may limit the number of threads in IOThreadPool. Every S3 request uses its own thread, so this will do the trick. Unfortunately, other processes also use IOThreadPool and they might degrade. The server setting max_io_thread_pool_size controls the number of threads; the default value is 100, so there should be no more than 100 simultaneous requests (not counting retries). You can monitor this with the query select * from system.metrics where metric = 'S3Requests'.
  2. There are settings that limit not the S3 request rate but the network bandwidth. They might be helpful if the bottleneck is the egress network channel rather than S3 itself. Please check the query setting max_remote_write_network_bandwidth and the server setting max_remote_write_network_bandwidth_for_server (see the sketch below the list).
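
For illustration only, the monitoring query and the query-level bandwidth cap from the points above could look like this. The bandwidth value is an arbitrary example; as far as I know max_remote_write_network_bandwidth is specified in bytes per second.

-- how many S3 requests are currently in flight
SELECT * FROM system.metrics WHERE metric = 'S3Requests';

-- cap egress bandwidth for a single export query
INSERT INTO FUNCTION s3('http://localhost:9008/s3bucket/filename', 'access_key', 'secret_key', 'CSV')
SELECT number
FROM numbers(500000000)
SETTINGS max_remote_write_network_bandwidth = 26214400 -- ~25 MiB/s, illustrative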

A proper solution should appear once IO scheduling is implemented. It is a fairly long task; you can track the implementation in #47009 (and possibly some follow-up PRs).

serxa (Member) commented Apr 10, 2023

  3. Also, you can try to split your data into bigger parts, e.g. by lowering s3_upload_part_size_multiply_parts_count_threshold. Check the following settings (see the sketch below):
    a) s3_upload_part_size_multiply_factor, default=2. Multiply s3_min_upload_part_size by this factor each time s3_upload_part_size_multiply_parts_count_threshold parts have been uploaded from a single write to S3.
    b) s3_upload_part_size_multiply_parts_count_threshold, default=500. Each time this number of parts has been uploaded to S3, s3_min_upload_part_size is multiplied by s3_upload_part_size_multiply_factor.
    c) s3_max_single_part_upload_size, default=32 MB. The maximum size of an object to upload to S3 using a single-part upload.
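
A sketch of what tuning these on the export query might look like; the values are purely illustrative and assume your S3 server is fine with larger parts.

INSERT INTO FUNCTION s3('http://localhost:9008/s3bucket/filename', 'access_key', 'secret_key', 'CSV')
SELECT number
FROM numbers(500000000)
SETTINGS
    s3_min_upload_part_size = 33554432,                       -- start with 32 MiB parts (illustrative)
    s3_upload_part_size_multiply_factor = 2,                  -- double the part size...
    s3_upload_part_size_multiply_parts_count_threshold = 100  -- ...every 100 uploaded parts instead of the default 500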

cricun (Author) commented Apr 10, 2023

@serxa Thank you for your willingness to help and for providing several solutions. Unfortunately, none of them completely resolves my issue:

  1. Export to S3 is not a high-priority task for us; keeping the system from degrading is more important, and this functionality has to be used in production.
  2. The bottleneck is the custom S3 server.
  3. This could help. I will increase s3_max_single_part_upload_size as soon as I get information about real-user experience; maybe most of the requests will fit in one hundred. I tried increasing the multipart size, but things got worse, so I fixed it at a constant size.

As a temporary solution, I've limited s3_max_put_rps and s3_max_upload_part_size on the ClickHouse side and the timeouts on the S3 side. Setting these parameters guarantees a limit on connections (= rps × timeout) on the ClickHouse side and on RAM consumption (= connections × request_size) on the S3 side, but at the cost of drastically reduced performance.
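
For the record, a rough sketch of that bound with made-up numbers: with s3_max_put_rps = 10 and a 3-second timeout on the S3 side, roughly 10 × 3 = 30 requests can be in flight at once, and with parts capped at 8 MiB the S3 server buffers at most around 30 × 8 MiB ≈ 240 MiB for this query.

INSERT INTO FUNCTION s3('http://localhost:9008/s3bucket/filename', 'access_key', 'secret_key', 'CSV')
SELECT number
FROM numbers(500000000)
SETTINGS
    s3_max_put_rps = 10,               -- illustrative throttle
    s3_max_upload_part_size = 8388608  -- 8 MiB cap per part, illustrative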

cricun closed this as completed Apr 19, 2023