AWS Glue to convert csv to parquet #37
Comments
Note: this may not be implemented; it is more to investigate the feasibility.
This Stack Overflow entry looks promising: How to convert many csv files to parquet using glue
My solution for creating parquets and ORCs to be consumed by Athena (and other AWS services such as Redshift Spectrum) has been to leverage Athena itself for this task. The steps to do this are:

```sql
CREATE TABLE mydb.tab_name
WITH (
    external_location = 's3://my_athena_results/my_orc_stas_table/',
    format = 'ORC'
)
AS SELECT * FROM temp.temp_table; -- temp.temp_table: to be deleted after this process
```

The advantages are that one does not need any new tools for this, the resulting files are pretty much guaranteed to be compatible (I had trouble earlier with externally generated ORCs), and one can outsource the parquet/ORC generation optimizations to Athena/Presto. It is a bit of a kludge, but it has been working well even in close-to-production processes.
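The CTAS pattern above boils down to templating one SQL statement. A minimal sketch in Python (the helper name `build_ctas` and all table names are made up for illustration; the real package would run the resulting SQL through its own Athena connection):

```python
# Hypothetical helper: build an Athena CTAS statement that rewrites a
# staging table into ORC/Parquet files at a chosen S3 location.
def build_ctas(target, source, s3_location, fmt="ORC"):
    """Return a CREATE TABLE AS SELECT statement for Athena."""
    return (
        f"CREATE TABLE {target} "
        f"WITH (external_location = '{s3_location}', format = '{fmt}') "
        f"AS SELECT * FROM {source}"
    )

sql = build_ctas("mydb.tab_name", "temp.temp_table",
                 "s3://my_athena_results/my_orc_stas_table/")
print(sql)
```

The statement would then be submitted like any other query, after which the temporary source table can be dropped.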
@OssiLehtinen the method you have highlighted above is currently supported through dplyr's
I believe if this method is favourable then it should not only be supported through dplyr's. However, the ORC format isn't supported. As Athena is doing the processing, I believe it should be supported in
That's right! I'm still thinking about the copy_to functionality. There the parquet is currently generated using arrow, but one could incorporate the commands you outline there as well, when a parquet or ORC end result is requested.
@OssiLehtinen the good thing about using arrow in R is that there isn't any extra cost from AWS. Maybe a parameter to tell the function where the parquet file should be created, in R or in AWS Athena? Side note: will have to see how AWS Athena creates the parquet file, and to check whether there is a faster method, i.e. AWS Glue. Just thinking about performance when uploading a file to AWS Athena from R.
The cost is a good point to think about. The cost per call will mostly be quite minimal, but obviously not zero either. E.g., doing a conversion of a 2 GB csv with Athena would cost 1 cent. I might be wrong, but Glue is probably not a good option here. First of all, the cost is probably higher. At least if you need to spin up an ETL job of type Apache Spark, you pay for a minimum of 10 minutes at $0.44 per DPU-hour. Also, the wait times for the jobs to start (with creating the Spark cluster and what not) are quite long, at least for interactive work. I don't have hands-on experience with the Python shell type jobs, however. Perhaps those start fast. Also not sure if a Spark type job is required for creating parquets.
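The figures quoted above can be checked with some back-of-the-envelope arithmetic. This sketch uses the prices mentioned in the comment ($5 per TB scanned for Athena, $0.44 per DPU-hour with a 10-minute minimum for Glue Spark jobs); the 2-DPU minimum for a Glue job is an assumption and varies by job type:

```python
# Athena: billed per TB of data scanned.
athena_price_per_tb = 5.00
csv_size_tb = 2 / 1024                # a 2 GB csv
athena_cost = csv_size_tb * athena_price_per_tb

# Glue Spark job: 10-minute minimum at $0.44 per DPU-hour,
# assuming a 2-DPU minimum (an assumption, not from the thread).
glue_dpu_hour = 0.44
glue_min_minutes = 10
glue_min_dpus = 2
glue_cost = glue_min_dpus * glue_dpu_hour * glue_min_minutes / 60

print(f"Athena: ${athena_cost:.3f}, Glue minimum: ${glue_cost:.3f}")
# Athena comes out around a cent; the Glue minimum is roughly 15x that.
```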
Possible solution for creating history tables using AWS Athena method: https://docs.aws.amazon.com/athena/latest/ug/insert-into.html |
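The append pattern from that page is again a single templated statement. A minimal sketch (the helper name and table names are made up; Athena's INSERT INTO ... SELECT syntax is from the linked docs):

```python
# Hypothetical helper: append rows from a staging table into an
# existing history table using Athena's INSERT INTO syntax.
def build_insert_into(target, source):
    """Return an INSERT INTO ... SELECT statement for Athena."""
    return f"INSERT INTO {target} SELECT * FROM {source}"

print(build_insert_into("mydb.history_table", "mydb.staging_table"))
```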
Yeah, the insert functionality is quite useful for any 'append' type of operations! For copy_to with append, one can use almost the same operation as in #37 (comment)
Will extend this suggestion so that it is supported in the base package as well as
Possible function names:
dbConvertTable sounds good to me!
PR #106 implements this feature request.
For whatever reason, if apache arrow can't be installed onto the local machine, should AWS Glue be used to convert the csv file format to parquet to help with AWS Athena performance?
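The fallback idea in the question above amounts to a capability check at runtime: convert locally when the arrow library is available, otherwise hand the conversion to AWS. A minimal sketch, assuming a Python-style environment (an R package would instead test whether the arrow package can be loaded):

```python
def choose_converter():
    """Return 'local' if pyarrow is importable (convert the csv on this
    machine), otherwise 'remote' (fall back to a server-side conversion,
    e.g. an Athena CTAS or a Glue job)."""
    try:
        import pyarrow  # noqa: F401
        return "local"
    except ImportError:
        return "remote"

print(choose_converter())
```

The caller would then branch on the returned value, uploading the raw csv and issuing a conversion query only in the "remote" case.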