Aws Glue to convert csv to parquet #37

Closed
DyfanJones opened this issue Dec 3, 2019 · 12 comments
Labels
enhancement New feature or request question Further information is requested

Comments

@DyfanJones
Owner

For whatever reason, if Apache Arrow can't be installed on the local machine, should AWS Glue be used to convert CSV files to Parquet format to help with AWS Athena performance?

DyfanJones added the question (Further information is requested) label on Dec 3, 2019
@DyfanJones
Owner Author

Note: this may not be implemented; the intent is more to investigate the feasibility.

@DyfanJones
Owner Author

DyfanJones commented Dec 3, 2019

This Stack Overflow entry looks promising: How to convert many CSV files to Parquet using Glue

@OssiLehtinen

OssiLehtinen commented Dec 4, 2019

My solution for creating Parquet and ORC files to be consumed by Athena (and other AWS services such as Redshift Spectrum) has been to leverage Athena itself for this task.

The steps to do this are:

  1. Upload a CSV to a temporary* S3 location
  2. Create a temporary Athena table 'temp.temp_table' pointing to the CSV
  3. Create the final table and files with a CTAS statement pointing to the temp table, as in:
create table mydb.tab_name
with (external_location = 's3://my_athena_results/my_orc_stas_table/',
      format = 'ORC')
as (select * from temp.temp_table)

* as in, to be deleted after this process

The advantages are that one does not need any new tools for this, the resulting files are pretty much guaranteed to be compatible (I had trouble earlier with externally generated ORC files), and one can outsource the Parquet/ORC generation optimizations to Athena/Presto.

It is a bit of a kludge, but it has been working well even in close-to-production processes.
Of course, all this makes sense only for content which is going to be reused multiple times after creation, as the temporary CSV version of the table needs to be read once in any case.
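
A rough DBI-level sketch of driving those three steps from R (the data frame my_df, the temp schema and the S3 paths are placeholders):

library(DBI)

con <- dbConnect(RAthena::athena())

# steps 1-2: upload the data frame; it is registered as a delimited-text Athena table
dbWriteTable(con, "temp.temp_table", my_df)

# step 3: CTAS into ORC (or Parquet) at a chosen S3 location
dbExecute(con, "
  create table mydb.tab_name
  with (external_location = 's3://my_athena_results/my_orc_stas_table/',
        format = 'ORC')
  as (select * from temp.temp_table)
")

# delete the temporary CSV-backed table once the ORC table exists
dbRemoveTable(con, "temp.temp_table")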

@DyfanJones
Owner Author

DyfanJones commented Dec 4, 2019

@OssiLehtinen the method you have highlighted above is currently supported through dplyr's compute method.

library(DBI)
library(dplyr)

con <- dbConnect(RAthena::athena())

dbWriteTable(con, "iris", iris)

tbl(con, "iris") %>% 
  compute("iris_parquet", file_type = "parquet", s3_location = "s3://mybucket/iris_parquet/"))

dbRemoveTable(con, "iris")

I believe that if this method is favourable then it should not only be supported through dplyr's compute but also be callable through the RAthena and noctua sets of functions.

However, the ORC format isn't currently supported there. As Athena is doing the processing, I believe it should be supported in the RAthena and noctua packages.

@OssiLehtinen

That's right!

I'm still thinking about the copy_to functionality. There the Parquet file is currently generated using arrow, but one could incorporate the commands you outline as well, when a Parquet or ORC end result is requested.
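
Purely as an illustration of that idea (the file_type argument here is hypothetical and does not exist yet; it just sketches how a Parquet end result could be requested):

library(DBI)
library(dplyr)

con <- dbConnect(RAthena::athena())

# hypothetical: ask copy_to for a Parquet end result and let the backend decide
# whether to convert locally via arrow or via an Athena CTAS behind the scenes
copy_to(con, iris, name = "iris_parquet", file_type = "parquet")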

@DyfanJones
Owner Author

@OssiLehtinen the good thing about using arrow in R is that there isn't any extra cost from AWS. Maybe add a parameter to tell the function where the Parquet file should be created: in R or in AWS Athena?

Side note: will have to see how AWS Athena creates the Parquet file, and check whether there is a faster method, e.g. AWS Glue. Just thinking about performance when uploading a file to AWS Athena from R.
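
To sketch the two places the conversion could happen (such a parameter doesn't exist yet; "mybucket" is a placeholder):

library(DBI)
library(arrow)

con <- dbConnect(RAthena::athena())

# option A: convert locally with arrow (no Athena scan cost), then upload and register
arrow::write_parquet(iris, "iris.parquet")

# option B: let Athena do the conversion, scanning the uploaded CSV once via CTAS
dbWriteTable(con, "temp_iris", iris)
dbExecute(con, "
  create table iris_parquet
  with (format = 'PARQUET',
        external_location = 's3://mybucket/iris_parquet/')
  as (select * from temp_iris)
")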

@OssiLehtinen

The cost is a good point to think about. The cost per call will mostly be quite minimal, but obviously not zero either. For example, converting a 2 GB CSV with Athena would cost about 1 cent.

I might be wrong, but Glue is probably not a good option here. First of all, the cost is probably higher: at least if you need to spin up an ETL job of type Apache Spark, you pay for a minimum of 10 minutes at $0.44 per DPU-hour. Also, the wait times for the jobs to start (with creating the Spark cluster and what not) are quite long, at least for interactive work.

I don't have hands-on experience with the Python shell type jobs, however. Perhaps those start fast. I'm also not sure whether a Spark-type job is required for creating Parquet files.
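
Rough back-of-the-envelope numbers (using Athena's list price of $5 per TB scanned; the Glue DPU minimum below is an assumption worth double-checking):

# Athena: billed per data scanned during the CTAS
athena_price_per_tb <- 5            # USD per TB scanned
csv_size_tb <- 2 / 1024             # a 2 GB csv
csv_size_tb * athena_price_per_tb   # ~0.01 USD

# Glue Spark ETL job: billed per DPU-hour with a minimum duration
glue_price_per_dpu_hour <- 0.44
min_minutes <- 10
min_dpus <- 2                       # assumed minimum for a Spark job
glue_price_per_dpu_hour * (min_minutes / 60) * min_dpus   # ~0.15 USD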

@DyfanJones
Owner Author

Possible solution for creating history tables using the AWS Athena INSERT INTO method: https://docs.aws.amazon.com/athena/latest/ug/insert-into.html
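
A minimal sketch of how that could be driven from R (the table names are made up; the INSERT INTO syntax is from the linked docs):

library(DBI)

con <- dbConnect(RAthena::athena())

# append new rows from a CSV-backed staging table into an existing Parquet history table
dbExecute(con, "
  insert into mydb.history_parquet
  select * from temp.staging_table
")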

@OssiLehtinen

Yeah, the insert functionality is quite useful for any 'append' type of operation! For copy_to with append one can use almost the same operation as in #37 (comment)

@DyfanJones
Owner Author

Will extend this suggestion so that it is supported in the base package as well as dplyr. @OssiLehtinen, any suggestions for a function name?

Possible function names:

  • dbConvert
  • dbConvertFile

DyfanJones pushed a commit that referenced this issue Apr 20, 2020
@OssiLehtinen

dbConvertTable sounds good to me!

@DyfanJones
Owner Author

PR #106 implements this feature request.
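
For anyone landing here later, a hypothetical sketch of how such a call might look (argument names are illustrative only, not necessarily the final signature from PR #106):

library(DBI)

con <- dbConnect(RAthena::athena())

# convert an existing CSV-backed Athena table to Parquet (a CTAS under the hood)
dbConvertTable(con,
               obj = "temp.temp_table",
               name = "mydb.tab_name",
               file.type = "parquet",
               s3.location = "s3://mybucket/tab_name/")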
