Commas and quote characters in data mess things up when doing dbWrite #50
Comments
If a user has a special character "," in their data and they are using the default "csv" file type, I would suggest that they have set the wrong file type; csv isn't a good fit for that data. If the basic flat files don't work then setting parquet is always an option. I am reluctant to change the sep to "|" as currently

I wasn't aware that data.table would semi-quote character fields.
I see the problem with "|". Well, I guess any character can cause trouble if no escaping is used. The risk with comma and tab is that one cannot always fully control the contents of incoming data. Maybe some data cleaning step, such as replacing tabs with spaces, would be wise.
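Such a cleaning step could be as simple as a gsub over the character columns before the file is written. A minimal sketch, with a made-up data.frame just for illustration:

```r
df <- data.frame(txt = c("ok", "has\ta tab"), n = 1:2, stringsAsFactors = FALSE)

# Replace tabs with spaces in every character column before writing the file
chr_cols <- names(df)[vapply(df, is.character, logical(1))]
df[chr_cols] <- lapply(df[chr_cols], function(x) gsub("\t", " ", x, fixed = TRUE))
```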
That could possibly be done. In the meantime, have you tried using the parquet file format?
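For reference, the parquet route would look roughly like this; this is only a sketch, where `con` stands for an existing Athena connection, the table name is a placeholder, and parquet writing requires the arrow package to be installed:

```r
library(DBI)

# Parquet avoids the delimiter/quoting problems of flat files entirely
dbWriteTable(con, "test_special_chars", df, file.type = "parquet")
```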
I don't have arrow set up on my box at the moment so couldn't test Parquet with that. Doing the Parquet conversion with Athena, as described above, seems to work fine. Right now on the road again, so can't continue with arrow... But yeah, I guess one option would be to do some automatic replacement as you propose. I think that would be better than the current 'melt down' that can happen.
I will play around with some ideas once I have the
The current issue is that when AWS Athena queries data containing special characters it returns a quoted csv:
This makes it very difficult to split and may result in doubled " characters being returned. I don't think this is avoidable.
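The doubling follows the same convention that fwrite's own documentation describes (quoted in the issue text further down). A quick, self-contained way to see it locally:

```r
library(data.table)

x <- data.frame(txt = 'he said "hi", twice', stringsAsFactors = FALSE)
fwrite(x, "quoted.csv")  # default quote = "auto", qmethod = "double"
cat(readLines("quoted.csv"), sep = "\n")
#> txt
#> "he said ""hi"", twice"
```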
To fix this
However, the user will need to be notified about the changes to character strings. I am still not 100% comfortable with the replacement of
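Purely as an illustration of what such a notification could look like (not the wording that went into the package), something along these lines could run before the write step:

```r
df <- data.frame(txt = c("ok", "has, a comma"), stringsAsFactors = FALSE)

# Hypothetical notification before the upload changes character data
if (any(vapply(df, function(x) is.character(x) && any(grepl(",", x, fixed = TRUE)), logical(1)))) {
  warning('Commas found in character columns; they will be replaced before writing the "csv" file.',
          call. = FALSE)
}
```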
Another issue will have to be addressed. The following special characters will cause issues with delimited files:
Part of the issue is down to the escape:
One quick point: replacing " with ' can be problematic. In my (retail-related) data I might, for example, have some sizes in inches (like 2") and the meaning gets lost with the substitution.
OK cool, I will see if I can get around that :)
@OssiLehtinen
Seems to work with my tests, except for a minor tweak: Line 119 in ad7ab5a should be
I think. Btw, I would set tsv as the default file.type, as replacing tabs with spaces feels much less intrusive than replacing commas with dots. Would using ";" as the substitute be less weird? I mean, if there is some written text in the data, replacing commas with dots can make the text pretty difficult to read, or what do you think? One more thought: the message about replacing tabs with spaces could be clearer, e.g.:
Thanks for the spot, will make that change now. I agree with the change to tsv as the default. Plus it will help with any support for mapping json column types to Athena (a possible future plan).
The problem with making
Would setting file.type = "tsv" in
That's right. I guess problems would arise when appending to a csv-based table, or is there some other situation which would cause trouble? Should there, btw, be a check that the formats match when using append = T? Or perhaps it is there already and I'm just missing it.
Currently no format check has been added. Will have to check if AWS Glue will give this information.
Going to add a file type and compression check when appending to an existing AWS Athena table. Will have to review the code at a later date to tidy it up (if necessary):
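One way such a check could look, sketched here with a hypothetical helper (not necessarily the code that was added to the package): pull the existing table's DDL and compare the declared storage format against the file type about to be appended.

```r
library(DBI)

# Hypothetical helper: refuse to append when file types clearly disagree
check_append_file_type <- function(con, table, file.type) {
  ddl <- paste(dbGetQuery(con, paste("SHOW CREATE TABLE", table))[[1]], collapse = "\n")
  existing <- if (grepl("ParquetHiveSerDe", ddl)) "parquet" else "delimited"
  incoming <- if (file.type == "parquet") "parquet" else "delimited"
  if (existing != incoming) {
    stop("Existing table '", table, "' is stored as ", existing,
         " but file.type = '", file.type, "' was requested.", call. = FALSE)
  }
  invisible(TRUE)
}
```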
@OssiLehtinen just merged PR #51. Closing this issue; if the problem persists please feel free to re-open it. If there are any more issues or features you come across please let me know. I am planning to put these new changes live on CRAN by the end of the week. Many thanks for all your help.
Thanks for your hard work on the package! It's shaping up to be an excellent tool!
Issue Description
Since data.table::fwrite tries to handle special characters in its own way, that is, escaping field separators and quote characters etc., and quoting strings when necessary, things get weird when Athena tries to deal with such source files.
Reproducible Example
(The datetimes and dates are there for later.)
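Something along these lines, with made-up column names and values, shows the problem (assuming `con` is an open Athena connection):

```r
library(DBI)

df <- data.frame(
  txt = c("plain text", "has, a comma", "has\ta tab", 'has a "quote"'),
  dt  = as.POSIXct("2019-11-01 12:34:56", tz = "UTC"),
  d   = as.Date("2019-11-01"),
  stringsAsFactors = FALSE
)

# With the default csv file type, the commas and quotes above end up quoted
# and escaped by fwrite in a way Athena cannot parse back correctly
dbWriteTable(con, "test_special_chars", df)
```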
Dealing with a comma in the data can be done using the tsv file type. However, a tab would then cause problems.
The quote character will be problematic in either case. The default behaviour of fwrite is to '"double" (default, same as write.csv), in which case the double quote is doubled with another one.', and then the whole entry is enclosed in another set of quotes, which Athena has no idea how to deal with. The conditional quoting also takes place when a field has a comma in csv or a tab in tsv.
My (wholly inelegant) solution has been to use `quote=F, row.names=F, col.names=F, sep="|", na=""` when writing the file, and telling Athena to use "|" as the separator. Additionally, I've simply removed all "|"s from the data. The point is that the pipe character is not that common in my data.
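Spelled out, and using the example data.frame from above, the workaround looks roughly like this (the file name and the DDL fragment are illustrative):

```r
library(data.table)

# Strip pipes from character columns so the field separator stays unambiguous
chr_cols <- names(df)[vapply(df, is.character, logical(1))]
df[chr_cols] <- lapply(df[chr_cols], function(x) gsub("|", " ", x, fixed = TRUE))

# Write with no quoting, "|" as the separator and empty strings for NA
fwrite(df, "staging.csv", quote = FALSE, row.names = FALSE,
       col.names = FALSE, sep = "|", na = "")

# The matching Athena table is declared with:
#   ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
```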
This is obviously not at all optimal. I think the nicest thing would be to have the file with escaped special characters, but without the enclosing quotes. However, I don't think fwrite can accomplish this.
The other solution would be to enclose everything in double quotes, using `quote=TRUE, qmethod='escape', col.names=F`, but then one needs to use another SerDe in Athena. Now, parsing dates and datetimes gets complicated and I have not been able to get those to function. The way around this is to let Athena create the final table (a sketch follows below),
or if one wishes to append to an existing table:
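The two variants would look roughly like the following; table names, column names and the target format are illustrative, `con` is an open Athena connection, and the exact CASTs depend on how the dates and datetimes were written to the staging files:

```r
library(DBI)

# Let Athena build the final table from the quoted staging table,
# doing the date/datetime parsing on the way
dbExecute(con, "
  CREATE TABLE final_table
  WITH (format = 'PARQUET') AS
  SELECT txt,
         CAST(dt AS timestamp) AS dt,
         CAST(d  AS date)      AS d
  FROM   staging_table
")

# Or, when appending to a table that already exists
dbExecute(con, "
  INSERT INTO final_table
  SELECT txt,
         CAST(dt AS timestamp) AS dt,
         CAST(d  AS date)      AS d
  FROM   staging_table
")
```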
I know we are back at using Athena to create the final tables, but this is the only way so far I've been able to get everything to work at the same time. Well, an unescaped new line will still mess things up...
I know this is a bit of a horror story, but on the other hand these are situations that at least I have had to deal with when working with 'real data'.