Default compression method for flat files #36
This reply is related to this question and also to the other issue about glue and parquet conversion. The two main concerns for me have been performance and compatibility. As for csv compression, I've been testing gzip and bzip2 compression, but ended up giving up on both and just writing uncompressed CSVs. The issue with gzip is that it is not 'splittable', meaning a single worker node in Athena/Presto needs to take care of uncompressing, which occasionally leads to a performance bottleneck, especially with large files (and for small files compression is unnecessary). Of course one could split the data prior to compression and upload multiple gzipped CSVs. The issue with bzip2 is that the compression is much slower, which again was a showstopper for me. My solution:
I'll continue about an option how to create parq's and orcs in the other issue (#37). |
Thanks @OssiLehtinen for the extra insight. For the time being I believe the option to create compressed flat files should be available to users; however, for users concerned about performance I might include Top 10 Performance Tuning Tips for Amazon Athena in the readme documentation to help them set up S3 to get the most out of AWS Athena. |
Correct me if I am wrong. In their current states |
I don't think any changes are required. |
Perhaps I'm missing existing functionality, but one way to address the splittability issue with gzipped files would be to split the data into pieces before writing and uploading to S3. Or, I think, this can be achieved with the current setup by splitting the data beforehand into partitions and using copy_to on each, but one could do this for the user automatically (if requested). Something along the lines of having an optional parameter chunk_size in copy_to and, if it is set to a positive integer, splitting the data into pieces of that size and recursively running copy_to on each piece with the same destination but a different file name (e.g., an incrementing suffix). You would end up with a pile of gzipped CSVs containing all the data, and the reading work could be split accordingly. Partitioning would be optional in this case, as Athena is fine with having just a pile of files in the same S3 path. |
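The recursive idea above might be sketched roughly like this in R. Note that `copy_to_chunked` and the `file.suffix` argument are hypothetical names for illustration; the package's actual upload function is stood in for by `copy_to`:

```r
# Hypothetical sketch: split a data.frame into chunks of `chunk_size` rows
# and upload each piece as its own gzipped CSV under the same S3 path.
copy_to_chunked <- function(con, df, name, chunk_size = 1e5, ...) {
  n <- nrow(df)
  if (n <= chunk_size) {
    return(copy_to(con, df, name, ...))
  }
  pieces <- split(df, ceiling(seq_len(n) / chunk_size))
  for (i in seq_along(pieces)) {
    # same destination, different file name via an incrementing suffix
    copy_to(con, pieces[[i]], name, file.suffix = sprintf("_%03d", i), ...)
  }
}
```

Athena would then see one table backed by many gzipped files, each of which can be read by a separate worker.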
chunk_size could be built, but the only problem is defining the partition. However, something like this could be doable:
Only concern would be users setting the chunks too small and losing the benefits of partitioning data. |
One thing to keep in mind is that it is not necessary to define partitions when splitting the data. One can just have something like
and it will work just fine. Having partitions can be useful on top of that. One idea would be to just give the partition column and have R automatically split the data by unique values in said column and do the partitioning. |
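The partition-column idea could be sketched as follows. The function name and the `partition` argument are assumptions for illustration, not the package's confirmed interface:

```r
# Hypothetical sketch: split a data.frame by the unique values of a
# partition column and upload each subset under its own partition prefix,
# e.g. s3://bucket/name/partition_col=value/part.csv.gz
upload_partitioned <- function(con, df, name, partition_col, ...) {
  for (value in unique(df[[partition_col]])) {
    subset_df <- df[df[[partition_col]] == value, ]
    copy_to(con, subset_df, name,
            partition = setNames(value, partition_col), ...)
  }
}
```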
One more note on partitioning: it can be a limitation as well. Let's say one wishes to append data daily to a table. If one uploads the data in a partitioned fashion, things will break down after 100 days, as only 100 partitions are allowed in Athena. Dumping the data to the same path without partitioning will not have such a problem, however. There will probably be performance issues eventually, once the number of files grows exceedingly large. Some tables I've been using have some 500 (smallish) files in them and they work OK. |
Correct me if I am wrong, but it looks like there is a limitation of:
https://docs.aws.amazon.com/general/latest/gr/aws_service_limits.html#limits_glue |
This method looks promising; will investigate a possible implementation. Will have to see if there is an increase in speed with this method. |
Hmm, I think I have mixed things up. Apparently the 100-partition limit applies specifically to CTAS queries. Sorry about the confusion! https://docs.aws.amazon.com/athena/latest/ug/bucketing-vs-partitioning.html |
Not to worry, got a lot of information. Going to test the possible solution for splitting gzip files into chunks. |
Created branch https://github.com/DyfanJones/RAthena/tree/chunksize to investigate the suggestion of a max.batch parameter for gzip compression in S3 folder format:
Initial finding:
AWS Athena performance results from AWS console (query executed:
It appears there is a significant performance enhancement with splitting |
This feels like a really good method for users who prefer to use csv.gzip files instead of |
Will do further testing on a larger data.frame. |
Well, the performance tuning guideline from AWS says individual files shouldn't be much smaller than 128 MB, but I'm not sure what that would mean in terms of line counts. Obviously it also depends linearly on the width of the table. Maybe a default of 100k would be OK. There's some overhead in splitting the data before writing, but maybe not that much? |
Still not sure if the 128 MB limit applies to compressed or uncompressed files... |
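Translating the 128 MB guideline into a row count could be done with a back-of-the-envelope estimate like the one below. It uses `object.size` of a sample as a stand-in for the on-disk (uncompressed) CSV size, so the result is a ballpark only:

```r
# Rough estimate of how many rows of a given data.frame fit in a
# 128 MB file, based on the in-memory size of a 1000-row sample.
rows_per_file <- function(df, target_bytes = 128 * 1024^2) {
  sample_rows <- min(nrow(df), 1000L)
  bytes_per_row <-
    as.numeric(object.size(df[seq_len(sample_rows), , drop = FALSE])) / sample_rows
  floor(target_bytes / bytes_per_row)
}
```

Whether the guideline refers to compressed or uncompressed size would change the target by roughly the compression ratio.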
Ran another test to see what is the possible benefits of this method:
AWS Athena performance results from AWS console (query executed: select count(*) from ....):
I believe there is a clear benefit to doing this method with gzip files, but I don't think a fixed max.batch is good enough. Possibly a minimum batch size can be set and then a dynamic one can be used. |
Interesting! Could there be some optimal(ish) total number of pieces? Probably the number of workers a query gets varies, but these could be related. |
I am not too sure how to get that information, plus I am not sure if it is obtainable through the SDKs. If that is the case, we could look at object size and then determine the batch size. However, I believe this feature should be implemented in its initial state and then future development can happen. There are clear benefits to this, and a semi-smart solution is better than no solution :) |
Yeah, I was thinking more in the direction of making a rough guess at what a typical number of workers could be (let's say ten) and splitting into that many pieces. |
That is fairly easy to implement. Plus, I think a minimum batch size can be set to help prevent files that are too small from being created and causing a large overhead. |
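That heuristic could be as simple as the following sketch: aim for a fixed number of pieces but never let a piece drop below a minimum row count. The function and defaults here are illustrative, not the package's implementation:

```r
# Choose a per-file row count: split into `target_splits` pieces,
# but enforce a minimum rows-per-piece so tiny files don't create
# more overhead than the parallelism is worth.
batch_size <- function(n_rows, target_splits = 20, min_batch = 1e5) {
  max(ceiling(n_rows / target_splits), min_batch)
}

batch_size(1e7)  # 10M rows -> 500000 rows per piece (20 files)
batch_size(1e5)  # small table -> 100000 rows (a single file)
```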
Last speed test:
AWS Athena performance results from AWS console (query executed: select count(*) from ....):
From these findings I will use the 20-way split as the default for compressed flat files: #39. If there are any objections to the default split I am happy to change it. Overall this increase in performance is very promising and should make the user experience a lot smoother when working with AWS Athena. |
@OssiLehtinen noticed that
This will split the example into 20 csv gzip files to help with AWS Athena performance. After updates have been added to |
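Assuming the interface sketched in the chunksize branch, usage might look something like the following; the exact argument names (`compress`, `max.batch`) are taken from the discussion above and may differ in the released package:

```r
library(DBI)

# Hypothetical usage: write a gzip-compressed flat-file table split into
# roughly 20 pieces so Athena can parallelise reads across workers.
con <- dbConnect(RAthena::athena())
dbWriteTable(con, "mtcars_gz", mtcars,
             compress  = TRUE,                        # gzipped CSVs
             max.batch = ceiling(nrow(mtcars) / 20))  # ~20 files
```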
Perfect! And I'm learning here too, so it's a win-win :) |
Currently RAthena and noctua support gzip compression when uploading data to S3 and Athena. Is there a better compression algorithm for flat files? From Top 10 Performance Tuning Tips for Amazon Athena it looks like BZIP2/GZIP are currently recommended. Might need to benchmark the speed of BZIP2 and GZIP files when reading from Athena.