Aws Glue to convert csv to parquet #37

Closed
DyfanJones opened this issue Dec 3, 2019 · 12 comments
Labels
enhancement New feature or request question Further information is requested

Comments

@DyfanJones
Owner

For whatever reason, if Apache Arrow can't be installed on the local machine, should AWS Glue be used to convert CSV files to Parquet format to help with AWS Athena performance?

DyfanJones added the question (Further information is requested) label on Dec 3, 2019
@DyfanJones
Owner Author

Note: this may not be implemented; the intent is more to investigate the feasibility.

@DyfanJones
Owner Author

DyfanJones commented Dec 3, 2019

This Stack Overflow entry looks promising: How to convert many CSV files to Parquet using Glue

@OssiLehtinen

OssiLehtinen commented Dec 4, 2019

My solution for creating Parquet and ORC files to be consumed by Athena (and other AWS services such as Redshift Spectrum) has been to leverage Athena itself for this task.

The steps to do this are:

  1. Upload a CSV to a temporary* S3 location
  2. Create a temporary Athena table 'temp.temp_table' pointing to the CSV
  3. Create the final table and files with a CTAS statement pointing to the temp table, as in:
create table mydb.tab_name
with (external_location = 's3://my_athena_results/my_orc_stas_table/',
      format = 'ORC')
as (select * from temp.temp_table)

* as in, to be deleted after this process

The advantages are that one does not need any new tools for this, the resulting files are pretty much guaranteed to be compatible (I had trouble earlier with externally generated ORC files), and one can outsource the Parquet/ORC generation optimizations to Athena/Presto.

It is a bit of a kludge, but it has been working well even in close-to-production processes.
Of course, all this makes sense only for content which is going to be reused multiple times after creation, as the temporary CSV version of the table needs to be read once in any case.
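
A rough DBI-level sketch of driving those three steps from R (the data frame my_df, the temp schema and the S3 paths are placeholders):

library(DBI)

con <- dbConnect(RAthena::athena())

# steps 1-2: upload the data frame; it is registered as a delimited-text Athena table
dbWriteTable(con, "temp.temp_table", my_df)

# step 3: CTAS into ORC (or Parquet) at a chosen S3 location
dbExecute(con, "
  create table mydb.tab_name
  with (external_location = 's3://my_athena_results/my_orc_stas_table/',
        format = 'ORC')
  as (select * from temp.temp_table)
")

# delete the temporary CSV-backed table once the ORC table exists
dbRemoveTable(con, "temp.temp_table")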

@DyfanJones
Owner Author

DyfanJones commented Dec 4, 2019

@OssiLehtinen the method you have highlighted above is currently supported through dplyr's compute method.

library(DBI)
library(dplyr)

con <- dbConnect(RAthena::athena())

dbWriteTable(con, "iris", iris)

tbl(con, "iris") %>% 
  compute("iris_parquet", file_type = "parquet", s3_location = "s3://mybucket/iris_parquet/"))

dbRemoveTable(con, "iris")

I believe that if this method is favourable then it should not only be supported through dplyr's compute but also be callable through the RAthena and noctua sets of functions.

However, the ORC format isn't currently supported there. As Athena is doing the processing, I believe it should be supported in the RAthena and noctua packages.

@OssiLehtinen

That's right!

I'm still thinking about the copy_to functionality. There the Parquet file is currently generated using arrow, but one could incorporate the commands you outline as well, when a Parquet or ORC end result is requested.
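
Purely as an illustration of that idea (the file_type argument here is hypothetical and does not exist yet; it just sketches how a Parquet end result could be requested):

library(DBI)
library(dplyr)

con <- dbConnect(RAthena::athena())

# hypothetical: ask copy_to for a Parquet end result and let the backend decide
# whether to convert locally via arrow or via an Athena CTAS behind the scenes
copy_to(con, iris, name = "iris_parquet", file_type = "parquet")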

@DyfanJones
Owner Author

@OssiLehtinen the good thing about using arrow in R is that there isn't any extra cost from AWS. Maybe add a parameter to tell the function where the Parquet file should be created: in R or in AWS Athena?

Side note: will have to see how AWS Athena creates the Parquet file, and check whether there is a faster method, e.g. AWS Glue. Just thinking about performance when uploading a file to AWS Athena from R.
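
To sketch the two places the conversion could happen (such a parameter doesn't exist yet; "mybucket" is a placeholder):

library(DBI)
library(arrow)

con <- dbConnect(RAthena::athena())

# option A: convert locally with arrow (no Athena scan cost), then upload and register
arrow::write_parquet(iris, "iris.parquet")

# option B: let Athena do the conversion, scanning the uploaded CSV once via CTAS
dbWriteTable(con, "temp_iris", iris)
dbExecute(con, "
  create table iris_parquet
  with (format = 'PARQUET',
        external_location = 's3://mybucket/iris_parquet/')
  as (select * from temp_iris)
")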

@OssiLehtinen

The cost is a good point to think about. The cost per call will mostly be quite minimal, but obviously not zero either. For example, converting a 2 GB CSV with Athena would cost about 1 cent.

I might be wrong, but Glue is probably not a good option here. First of all, the cost is probably higher: at least if you need to spin up an ETL job of type Apache Spark, you pay for a minimum of 10 minutes at $0.44 per DPU-hour. Also, the wait times for the jobs to start (with creating the Spark cluster and what not) are quite long, at least for interactive work.

I don't have hands-on experience with the Python shell type jobs, however. Perhaps those start fast. I'm also not sure whether a Spark-type job is required for creating Parquet files.
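
Rough back-of-the-envelope numbers (using Athena's list price of $5 per TB scanned; the Glue DPU minimum below is an assumption worth double-checking):

# Athena: billed per data scanned during the CTAS
athena_price_per_tb <- 5            # USD per TB scanned
csv_size_tb <- 2 / 1024             # a 2 GB csv
csv_size_tb * athena_price_per_tb   # ~0.01 USD

# Glue Spark ETL job: billed per DPU-hour with a minimum duration
glue_price_per_dpu_hour <- 0.44
min_minutes <- 10
min_dpus <- 2                       # assumed minimum for a Spark job
glue_price_per_dpu_hour * (min_minutes / 60) * min_dpus   # ~0.15 USD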

@DyfanJones
Owner Author

Possible solution for creating history tables using the AWS Athena INSERT INTO method: https://docs.aws.amazon.com/athena/latest/ug/insert-into.html
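
A minimal sketch of how that could be driven from R (the table names are made up; the INSERT INTO syntax is from the linked docs):

library(DBI)

con <- dbConnect(RAthena::athena())

# append new rows from a CSV-backed staging table into an existing Parquet history table
dbExecute(con, "
  insert into mydb.history_parquet
  select * from temp.staging_table
")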

@OssiLehtinen

Yeah, the insert functionality is quite useful for any 'append' type of operation! For copy_to with append one can use almost the same operation as in #37 (comment)

@DyfanJones
Owner Author

Will extend this suggestion so that it is supported in the base package as well as dplyr. @OssiLehtinen, any suggestions for a function name?

Possible function names:

  • dbConvert
  • dbConvertFile

DyfanJones pushed a commit that referenced this issue Apr 20, 2020
@OssiLehtinen

dbConvertTable sounds good to me!

@DyfanJones
Owner Author

PR #106 implements this feature request.
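
For anyone landing here later, a hypothetical sketch of how such a call might look (argument names are illustrative only, not necessarily the final signature from PR #106):

library(DBI)

con <- dbConnect(RAthena::athena())

# convert an existing CSV-backed Athena table to Parquet (a CTAS under the hood)
dbConvertTable(con,
               obj = "temp.temp_table",
               name = "mydb.tab_name",
               file.type = "parquet",
               s3.location = "s3://mybucket/tab_name/")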
