Is there a problem of duplicate downloads in S3 segmented downloads? #1673
Comments
Hi, don't worry, it's not a bug but an optimization. As you can see in the following two requests, both use the file size as the right offset. That doesn't mean we read all of the data for the request; we only read what we need, and the remaining data is discarded. In your case, these can be two columns being read concurrently from two threads. We use this approach to reduce the total number of GET requests to S3 when we expect to read a lot of data.
By the way, which version are you using? We have since optimized this to use an accurate right offset rather than the file size.
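For illustration, here is a minimal Python/boto3 sketch of the pattern described above (not the project's actual code; the bucket, key, and sizes are made up): a single ranged GET whose right offset is the file size, from which only the needed prefix is consumed before the body is closed.

```python
# Sketch of "request to end of file, read only what is needed".
# Hypothetical names and sizes; not the project's real implementation.
import boto3

s3 = boto3.client("s3")

def read_prefix_of_range(bucket: str, key: str, offset: int, needed: int, file_size: int) -> bytes:
    """Request [offset, file_size) but consume only `needed` bytes of the stream."""
    resp = s3.get_object(
        Bucket=bucket,
        Key=key,
        Range=f"bytes={offset}-{file_size - 1}",  # right offset = file size
    )
    body = resp["Body"]           # botocore StreamingBody (data arrives as a stream)
    data = body.read(needed)      # read only what the caller needs
    body.close()                  # drop the connection; the rest of the range is not pulled down
    return data

# Hypothetical usage: two column readers sharing one 10 GiB file.
# col_a = read_prefix_of_range("my-bucket", "part-0.parquet", 0, 4 << 20, 10 << 30)
# col_b = read_prefix_of_range("my-bucket", "part-0.parquet", 6 << 30, 4 << 20, 10 << 30)
```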
We are using version 0.4.0. If multiple threads read different columns, the total amount of data read should be less than or equal to the total file size, not 1.5 to 2 times the file size. This has caused an increase in S3 bandwidth.
How did you reach that conclusion?
I determined it from the Content-Length values: the sum of two Content-Length values is already greater than the file size.
We have set these parameters:
That's not true. As I said before, we send the request but do not read all of the data from it, and S3 does not return all of the data either; it behaves like a stream.
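To illustrate this point with a hypothetical example (made-up bucket/key), the Content-Length header reports the size of the *requested* range, while the client may stop reading the streaming body long before that many bytes arrive, so summing Content-Length values over-counts the actual transfer:

```python
# Sketch: Content-Length reflects the requested range, not the bytes actually read.
import boto3

s3 = boto3.client("s3")

resp = s3.get_object(
    Bucket="my-bucket",
    Key="part-0.parquet",
    Range="bytes=0-1073741823",              # request ~1 GiB
)
print(resp["ContentLength"])                 # ~1 GiB: size of the requested range
chunk = resp["Body"].read(4 * 1024 * 1024)   # actually consume only 4 MiB
resp["Body"].close()                         # the rest of the range is not fully transferred
print(len(chunk))                            # 4 MiB read by the client
```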
You can try turning these two off; in that case, the right offset will be accurate.
We can give it a try.
Any updates?
After our adjustment, data merging became slower and the S3 QPS increased significantly, so we rolled it back.
Question
Is there a problem of duplicate downloads in S3 segmented downloads?
When checking the logs of the default worker, I found that the S3 version downloads the same file multiple times in segments, and these segments overlap, which may cause 2-3 copies of the same file to be pulled. I think this is unreasonable; please check it.
Logs of vw-default-1 node:
Logs of vw-default-2 node:
Extracting the Content-Length and Content-Range values:
vw-default-1 node:
vw-default-2 node:
It can be seen that the ranges overlap when multiple segments of the same file are read, and this has a significant impact on S3 bandwidth. A rough way to quantify the overlap from the Content-Range values is sketched below.
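The following is a small helper (assuming response headers of the form `bytes start-end/total`; the sample values are made up) that compares the sum of the requested range lengths with the length of their union. Note that this measures requested ranges, not bytes actually transferred, which is exactly the distinction discussed in the comments above.

```python
# Quantify overlap between Content-Range values observed for one file.
import re

def overlap_stats(content_ranges: list[str]) -> tuple[int, int]:
    """Return (sum of range lengths, length of their union) in bytes."""
    spans = []
    for cr in content_ranges:
        m = re.match(r"bytes (\d+)-(\d+)/(\d+)", cr)
        if not m:
            continue
        start, end = int(m.group(1)), int(m.group(2))
        spans.append((start, end + 1))            # half-open [start, end + 1)
    total = sum(e - s for s, e in spans)
    # Merge sorted intervals to get the number of distinct bytes covered.
    union, cur_s, cur_e = 0, None, None
    for s, e in sorted(spans):
        if cur_e is None or s > cur_e:
            if cur_e is not None:
                union += cur_e - cur_s
            cur_s, cur_e = s, e
        else:
            cur_e = max(cur_e, e)
    if cur_e is not None:
        union += cur_e - cur_s
    return total, union

# Made-up example: 40 requested bytes but only 30 distinct bytes.
print(overlap_stats(["bytes 0-19/100", "bytes 10-19/100", "bytes 90-99/100"]))
```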
Is it reasonable to use segmented reads for data segments exceeding 10 GB?