S3 file size issue #126

Closed
HenryCaiHaiying opened this issue Mar 14, 2023 · 11 comments

@HenryCaiHaiying

This is related to #125

From the code here https://github.com/aiven/tiered-storage-for-apache-kafka/blob/main/s3/src/main/java/io/aiven/kafka/tiered/storage/s3/S3ClientWrapper.java#L179, it breaks the Kafka log segment file into multiple part files and uploads them one by one.

This creates multiple files on S3 corresponding to one original log segment file, which leads to a many-small-files problem on S3 and hurts S3 performance (especially for object listing, which you are using).

If the goal is to improve upload performance, you can use S3's multipart upload, which uploads parts with multiple threads while the target on S3 remains a single object (instead of many small part files). On the read/download path, you can also use S3's range API to fetch a chunk of bytes instead of the whole S3 object: https://docs.aws.amazon.com/AmazonS3/latest/userguide/download-objects.html

    // Get a range of bytes from an object and print the bytes.
    GetObjectRequest rangeObjectRequest = new GetObjectRequest(bucketName, key)
            .withRange(0, 9);
    S3Object objectPortion = s3Client.getObject(rangeObjectRequest); // fetches only bytes 0-9 of the object
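
For comparison, a minimal sketch of a multipart upload of a whole segment file using the AWS SDK v1 TransferManager (the s3Client, bucketName, key, and segmentFilePath names are placeholders, not the plugin's actual code):

    import com.amazonaws.services.s3.transfer.TransferManager;
    import com.amazonaws.services.s3.transfer.TransferManagerBuilder;
    import com.amazonaws.services.s3.transfer.Upload;
    import java.io.File;

    // TransferManager splits a large file into parts and uploads them in parallel,
    // but the result on S3 is still a single object under the given key.
    TransferManager transferManager = TransferManagerBuilder.standard()
            .withS3Client(s3Client)
            .build();
    Upload upload = transferManager.upload(bucketName, key, new File(segmentFilePath));
    upload.waitForCompletion();
    transferManager.shutdownNow(false); // shut down the TransferManager but keep the S3 client alive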

@AnatolyPopov
Contributor

Thanks for the issue; we are aware of this and already looking into it. The idea behind splitting into multiple files was related to compression and encryption. We are currently assessing how this should be implemented, but in the current implementation we cannot really request a byte range because of compression and encryption.

@AnatolyPopov
Contributor

Also, the file size is configurable and can be tweaked so that the files are not necessarily small; if the segment file is smaller than the configured file size, the segment will not be split. Some people have really big segment sizes, and together with compression and encryption we do not really want to fetch the whole file just to decrypt/decompress it. But as I mentioned before, we are looking into how to improve this.

@mdedetrich
Contributor

Multipart upload is definitely more beneficial for larger users than the standard file upload, but the elephant in the room here is compression/encryption. As @AnatolyPopov mentioned, while it's possible to make fetching of byte ranges work with encryption (which can preserve the same block sizes as the plaintext data), this doesn't work trivially with compression.

We are trying to solve these issues; however, I don't see any harm in using S3 multipart upload in the specific case of no compression and no encryption.
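
As an aside, the reason byte ranges can work with encryption is that a length-preserving mode such as AES-CTR maps each plaintext byte to exactly one ciphertext byte, so a ciphertext range fetched via the S3 range API can be decrypted on its own by fast-forwarding the counter. A rough, purely illustrative sketch (not the plugin's code; key and IV handling are simplified):

    import javax.crypto.Cipher;
    import javax.crypto.spec.IvParameterSpec;
    import javax.crypto.spec.SecretKeySpec;
    import java.math.BigInteger;
    import java.util.Arrays;

    public class CtrRangeReadSketch {
        static final int BLOCK_SIZE = 16;

        // Decrypts ciphertext fetched from the AES block boundary containing rangeStart
        // and returns the plaintext from rangeStart onwards, without reading the whole object.
        static byte[] decryptFrom(byte[] key, byte[] baseIv, byte[] ciphertextFromBlockStart, long rangeStart)
                throws Exception {
            long blockIndex = rangeStart / BLOCK_SIZE;

            // AES-CTR treats the 16-byte IV as a big-endian counter incremented once per block,
            // so it can be fast-forwarded to the block containing rangeStart.
            byte[] counter = new BigInteger(1, baseIv).add(BigInteger.valueOf(blockIndex)).toByteArray();
            byte[] iv = new byte[BLOCK_SIZE];
            int copy = Math.min(counter.length, BLOCK_SIZE);
            System.arraycopy(counter, counter.length - copy, iv, BLOCK_SIZE - copy, copy);

            Cipher cipher = Cipher.getInstance("AES/CTR/NoPadding");
            cipher.init(Cipher.DECRYPT_MODE, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
            byte[] plaintext = cipher.doFinal(ciphertextFromBlockStart);

            // Drop the leading bytes of the first block that precede rangeStart.
            return Arrays.copyOfRange(plaintext, (int) (rangeStart % BLOCK_SIZE), plaintext.length);
        }
    }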

@AnatolyPopov
Contributor

With no compression and encryption it's certainly possible, but those are not optional as of right now, and this needs to be changed.

@mdedetrich
Contributor

@AnatolyPopov Shall we make an issue for this?

@AnatolyPopov
Contributor

@mdedetrich Yes, we should.

@HenryCaiHaiying
Author

I think the Kafka log segment file is already compressed (snappy, gzip or zstd), not sure whether we need to compress it further.

@mdedetrich
Contributor

> I think the Kafka log segment file is already compressed (snappy, gzip or zstd), not sure whether we need to compress it further.

So we are planning to add a feature where, if Kafka is configured to compress segments, we won't recompress them. For our specific use case, since we are dealing with external users who can configure Kafka themselves, it would be nice if we could compress at the plugin level, but as stated earlier that is problematic.

@ivanyu
Contributor

ivanyu commented Jun 1, 2023

This was changed radically in the new implementation. Now, despite chunking, we still upload a single blob and support range queries.

@ivanyu ivanyu closed this as completed Jun 1, 2023
@HenryCaiHaiying
Author

Are we still uploading many small files onto S3?

@ivanyu
Contributor

ivanyu commented Jun 5, 2023

No. Compression and encryption are performed per chunk, but the results are nevertheless concatenated before upload. Regardless of the log file size, each segment will produce these files on the remote: https://github.com/aiven/tiered-storage-for-apache-kafka/blob/main/core/src/test/java/io/aiven/kafka/tieredstorage/RemoteStorageManagerTest.java#L210-L224
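
To illustrate what that means in practice, here is a rough sketch (not the plugin's actual code) of per-chunk transformation with concatenation into a single blob, plus an index recording where each original chunk landed so that a byte range of the original segment can later be served by fetching only the covering chunks with an S3 range request:

    import java.io.ByteArrayOutputStream;
    import java.util.List;
    import java.util.zip.Deflater;

    // Illustrative only: compress each fixed-size chunk, concatenate the results into one
    // blob (uploaded as a single S3 object), and record an index mapping original offsets
    // to offsets in the transformed blob.
    public class ChunkedTransformSketch {
        record ChunkEntry(long originalStart, long transformedStart, int transformedLength) {}

        static byte[] transform(byte[] segment, int chunkSize, List<ChunkEntry> index) {
            ByteArrayOutputStream blob = new ByteArrayOutputStream();
            for (int offset = 0; offset < segment.length; offset += chunkSize) {
                int length = Math.min(chunkSize, segment.length - offset);
                byte[] transformed = deflate(segment, offset, length);
                index.add(new ChunkEntry(offset, blob.size(), transformed.length));
                blob.write(transformed, 0, transformed.length);
            }
            return blob.toByteArray(); // uploaded to S3 as one object
        }

        static byte[] deflate(byte[] source, int offset, int length) {
            Deflater deflater = new Deflater();
            deflater.setInput(source, offset, length);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            while (!deflater.finished()) {
                out.write(buffer, 0, deflater.deflate(buffer));
            }
            deflater.end();
            return out.toByteArray();
        }
    }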
