Is there a problem of duplicate downloads in S3 segmented downloads? #1673
Comments
Hi, don't worry, it's not a bug but an optimization. As you can see in the following two requests, both use the file size as the right offset. That doesn't mean we read all of the data for the request; we only read what we need, and the remaining data is discarded. In your case, these can be two columns being read concurrently from two threads. We use this approach to reduce the total number of GET requests to S3 when we expect to read a lot of data.
By the way, which version are you using? We have since optimized this to use an accurate right offset rather than the file size.
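For illustration, here is a minimal Python/boto3 sketch of the pattern described above (not the project's actual code; the bucket, key, and sizes are made up): a single ranged GET whose right offset is the file size, from which only the needed prefix is consumed before the body is closed.

```python
# Sketch of "request to end of file, read only what is needed".
# Hypothetical names and sizes; not the project's real implementation.
import boto3

s3 = boto3.client("s3")

def read_prefix_of_range(bucket: str, key: str, offset: int, needed: int, file_size: int) -> bytes:
    """Request [offset, file_size) but consume only `needed` bytes of the stream."""
    resp = s3.get_object(
        Bucket=bucket,
        Key=key,
        Range=f"bytes={offset}-{file_size - 1}",  # right offset = file size
    )
    body = resp["Body"]           # botocore StreamingBody (data arrives as a stream)
    data = body.read(needed)      # read only what the caller needs
    body.close()                  # drop the connection; the rest of the range is not pulled down
    return data

# Hypothetical usage: two column readers sharing one 10 GiB file.
# col_a = read_prefix_of_range("my-bucket", "part-0.parquet", 0, 4 << 20, 10 << 30)
# col_b = read_prefix_of_range("my-bucket", "part-0.parquet", 6 << 30, 4 << 20, 10 << 30)
```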
We are using version 0.4.0. If multiple threads read different columns, the total amount of data read should be less than or equal to the total file size, not 1.5 to 2 times the file size. This has caused an increase in S3 bandwidth.
How did you reach that conclusion?
I determined it from the Content-Length values: the sum of two Content-Length values is already greater than the file size.
We have set these parameters:
That's not true. As I said before, we send the request but do not read all of the data from it, and S3 does not return all of the data either; it behaves like a stream.
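To illustrate this point with a hypothetical example (made-up bucket/key), the Content-Length header reports the size of the *requested* range, while the client may stop reading the streaming body long before that many bytes arrive, so summing Content-Length values over-counts the actual transfer:

```python
# Sketch: Content-Length reflects the requested range, not the bytes actually read.
import boto3

s3 = boto3.client("s3")

resp = s3.get_object(
    Bucket="my-bucket",
    Key="part-0.parquet",
    Range="bytes=0-1073741823",              # request ~1 GiB
)
print(resp["ContentLength"])                 # ~1 GiB: size of the requested range
chunk = resp["Body"].read(4 * 1024 * 1024)   # actually consume only 4 MiB
resp["Body"].close()                         # the rest of the range is not fully transferred
print(len(chunk))                            # 4 MiB read by the client
```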
You can try turning these two off; in that case, the right offset will be accurate.
We can give it a try.
Any updates?
After our adjustment, data merging became slower and the S3 QPS increased significantly, so we rolled it back.
Question
Is there a problem of duplicate downloads in S3 segmented downloads?
When checking the logs of the default worker, I found that the S3 version downloads the same file multiple times in segments, and these segments overlap, which may cause 2-3 copies of the same file to be pulled. I think this is unreasonable; please check it.
Logs of vw-default-1 node:
Logs of vw-default-2 node:
Extracting the Content-Length and Content-Range values:
vw-default-1 node:
vw-default-2 node:
It can be seen that the ranges overlap when multiple segments of the same file are read, and this has a significant impact on S3 bandwidth. A rough way to quantify the overlap from the Content-Range values is sketched below.
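The following is a small helper (assuming response headers of the form `bytes start-end/total`; the sample values are made up) that compares the sum of the requested range lengths with the length of their union. Note that this measures requested ranges, not bytes actually transferred, which is exactly the distinction discussed in the comments above.

```python
# Quantify overlap between Content-Range values observed for one file.
import re

def overlap_stats(content_ranges: list[str]) -> tuple[int, int]:
    """Return (sum of range lengths, length of their union) in bytes."""
    spans = []
    for cr in content_ranges:
        m = re.match(r"bytes (\d+)-(\d+)/(\d+)", cr)
        if not m:
            continue
        start, end = int(m.group(1)), int(m.group(2))
        spans.append((start, end + 1))            # half-open [start, end + 1)
    total = sum(e - s for s, e in spans)
    # Merge sorted intervals to get the number of distinct bytes covered.
    union, cur_s, cur_e = 0, None, None
    for s, e in sorted(spans):
        if cur_e is None or s > cur_e:
            if cur_e is not None:
                union += cur_e - cur_s
            cur_s, cur_e = s, e
        else:
            cur_e = max(cur_e, e)
    if cur_e is not None:
        union += cur_e - cur_s
    return total, union

# Made-up example: 40 requested bytes but only 30 distinct bytes.
print(overlap_stats(["bytes 0-19/100", "bytes 10-19/100", "bytes 90-99/100"]))
```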
Is it reasonable to use segmented reads for data segments exceeding 10 GB?