S3 file size issue #126

Closed
HenryCaiHaiying opened this issue Mar 14, 2023 · 11 comments

@HenryCaiHaiying

This is related to #125

From the code here https://github.com/aiven/tiered-storage-for-apache-kafka/blob/main/s3/src/main/java/io/aiven/kafka/tiered/storage/s3/S3ClientWrapper.java#L179, it breaks the Kafka log segment file into multiple part files and uploads them one by one.

This creates multiple files on S3 corresponding to one original log segment file, which leads to a many-small-files problem on S3 and hurts S3 performance (especially for object listing, which you are using).

If the goal is to improve upload performance, you can use S3's multipart upload, which uploads parts with multiple threads while the target on S3 remains a single object (instead of many small part files). On the read/download path, you can also use S3's range API to fetch a chunk of bytes instead of the whole S3 object: https://docs.aws.amazon.com/AmazonS3/latest/userguide/download-objects.html

    // Get a range of bytes from an object and print the bytes.
    GetObjectRequest rangeObjectRequest = new GetObjectRequest(bucketName, key)
            .withRange(0, 9);
    S3Object objectPortion = s3Client.getObject(rangeObjectRequest); // fetches only bytes 0-9 of the object
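
For comparison, a minimal sketch of a multipart upload of a whole segment file using the AWS SDK v1 TransferManager (the s3Client, bucketName, key, and segmentFilePath names are placeholders, not the plugin's actual code):

    import com.amazonaws.services.s3.transfer.TransferManager;
    import com.amazonaws.services.s3.transfer.TransferManagerBuilder;
    import com.amazonaws.services.s3.transfer.Upload;
    import java.io.File;

    // TransferManager splits a large file into parts and uploads them in parallel,
    // but the result on S3 is still a single object under the given key.
    TransferManager transferManager = TransferManagerBuilder.standard()
            .withS3Client(s3Client)
            .build();
    Upload upload = transferManager.upload(bucketName, key, new File(segmentFilePath));
    upload.waitForCompletion();
    transferManager.shutdownNow(false); // shut down the TransferManager but keep the S3 client alive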

@AnatolyPopov
Contributor

Thanks for the issue; we are aware of this and already looking into it. The idea behind splitting into multiple files was related to compression and encryption. We are currently assessing how this should be implemented, but in the current implementation we cannot really request a byte range because of compression and encryption.

@AnatolyPopov
Contributor

Also, the file size is configurable and can be tweaked so that the files are not necessarily small; if the segment file is smaller than the configured file size, the segment will not be split. Some people have really big segment sizes, and together with compression and encryption we do not really want to fetch the whole file just to decrypt/decompress it. But as I mentioned before, we are looking into how to improve this.

@mdedetrich
Contributor

Multipart upload is definitely more beneficial for larger users than the standard file upload, but the elephant in the room here is compression/encryption. As @AnatolyPopov mentioned, while it's possible to make fetching of byte ranges work with encryption (which can preserve the same block sizes as the plaintext data), this doesn't work trivially with compression.

We are trying to solve these issues; however, I don't see any harm in using S3 multipart upload in the specific case of no compression and no encryption.
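
As an aside, the reason byte ranges can work with encryption is that a length-preserving mode such as AES-CTR maps each plaintext byte to exactly one ciphertext byte, so a ciphertext range fetched via the S3 range API can be decrypted on its own by fast-forwarding the counter. A rough, purely illustrative sketch (not the plugin's code; key and IV handling are simplified):

    import javax.crypto.Cipher;
    import javax.crypto.spec.IvParameterSpec;
    import javax.crypto.spec.SecretKeySpec;
    import java.math.BigInteger;
    import java.util.Arrays;

    public class CtrRangeReadSketch {
        static final int BLOCK_SIZE = 16;

        // Decrypts ciphertext fetched from the AES block boundary containing rangeStart
        // and returns the plaintext from rangeStart onwards, without reading the whole object.
        static byte[] decryptFrom(byte[] key, byte[] baseIv, byte[] ciphertextFromBlockStart, long rangeStart)
                throws Exception {
            long blockIndex = rangeStart / BLOCK_SIZE;

            // AES-CTR treats the 16-byte IV as a big-endian counter incremented once per block,
            // so it can be fast-forwarded to the block containing rangeStart.
            byte[] counter = new BigInteger(1, baseIv).add(BigInteger.valueOf(blockIndex)).toByteArray();
            byte[] iv = new byte[BLOCK_SIZE];
            int copy = Math.min(counter.length, BLOCK_SIZE);
            System.arraycopy(counter, counter.length - copy, iv, BLOCK_SIZE - copy, copy);

            Cipher cipher = Cipher.getInstance("AES/CTR/NoPadding");
            cipher.init(Cipher.DECRYPT_MODE, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
            byte[] plaintext = cipher.doFinal(ciphertextFromBlockStart);

            // Drop the leading bytes of the first block that precede rangeStart.
            return Arrays.copyOfRange(plaintext, (int) (rangeStart % BLOCK_SIZE), plaintext.length);
        }
    }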

@AnatolyPopov
Contributor

With no compression and encryption it's certainly possible, but those are not optional as of right now, and this needs to be changed.

@mdedetrich
Contributor

@AnatolyPopov Shall we make an issue for this?

@AnatolyPopov
Contributor

@mdedetrich Yes, we should.

@HenryCaiHaiying
Author

I think the Kafka log segment file is already compressed (snappy, gzip or zstd), not sure whether we need to compress it further.

@mdedetrich
Contributor

> I think the Kafka log segment file is already compressed (snappy, gzip or zstd), not sure whether we need to compress it further.

So we are planning to add a feature where, if Kafka is configured to compress segments, we won't recompress them. For our specific use case, since we are dealing with external users who can configure Kafka themselves, it would be nice if we could compress at the plugin level, but as stated earlier that is problematic.

@ivanyu
Contributor

ivanyu commented Jun 1, 2023

This was changed radically in the new implementation. Now, despite chunking, we still upload a single blob and support range queries.

@ivanyu ivanyu closed this as completed Jun 1, 2023
@HenryCaiHaiying
Author

Are we still uploading many small files onto S3?

@ivanyu
Contributor

ivanyu commented Jun 5, 2023

No. Compression and encryption are performed per chunk, but the results are nevertheless concatenated before upload. Regardless of the log file size, each segment will produce these files on the remote: https://github.com/aiven/tiered-storage-for-apache-kafka/blob/main/core/src/test/java/io/aiven/kafka/tieredstorage/RemoteStorageManagerTest.java#L210-L224
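
To illustrate what that means in practice, here is a rough sketch (not the plugin's actual code) of per-chunk transformation with concatenation into a single blob, plus an index recording where each original chunk landed so that a byte range of the original segment can later be served by fetching only the covering chunks with an S3 range request:

    import java.io.ByteArrayOutputStream;
    import java.util.List;
    import java.util.zip.Deflater;

    // Illustrative only: compress each fixed-size chunk, concatenate the results into one
    // blob (uploaded as a single S3 object), and record an index mapping original offsets
    // to offsets in the transformed blob.
    public class ChunkedTransformSketch {
        record ChunkEntry(long originalStart, long transformedStart, int transformedLength) {}

        static byte[] transform(byte[] segment, int chunkSize, List<ChunkEntry> index) {
            ByteArrayOutputStream blob = new ByteArrayOutputStream();
            for (int offset = 0; offset < segment.length; offset += chunkSize) {
                int length = Math.min(chunkSize, segment.length - offset);
                byte[] transformed = deflate(segment, offset, length);
                index.add(new ChunkEntry(offset, blob.size(), transformed.length));
                blob.write(transformed, 0, transformed.length);
            }
            return blob.toByteArray(); // uploaded to S3 as one object
        }

        static byte[] deflate(byte[] source, int offset, int length) {
            Deflater deflater = new Deflater();
            deflater.setInput(source, offset, length);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            while (!deflater.finished()) {
                out.write(buffer, 0, deflater.deflate(buffer));
            }
            deflater.end();
            return out.toByteArray();
        }
    }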
