Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fread should un-escape escaped quotes in fields #1109

Open
jangorecki opened this issue Apr 8, 2015 · 7 comments
Open

fread should un-escape escaped quotes in fields #1109

jangorecki opened this issue Apr 8, 2015 · 7 comments

Comments

@jangorecki
Copy link
Member

According to the docs:

A quoted field must start with quote and end with a quote that is also immediately followed by sep or \n. Thus, unescaped quotes may be present in a quoted field (...,2,"Joe, "Bloggs"",3.14,...) as well as escaped quotes (...,2,"Joe \",Bloggs\"",3.14,...).

The following csv should be supported by fread.

library(data.table)
dt <- data.table(a = 1:2, b = c('f(c("a","b"))','sum(1,2)'))

dt
#    a             b
#1: 1 f(c("a","b"))
#2: 2      sum(1,2)

write.table(dt,"tbl1.csv",sep=",",na="",col.names=TRUE,row.names=FALSE,qmethod="escape")
write.table(dt,"tbl2.csv",sep=",",na="",col.names=TRUE,row.names=FALSE,qmethod="double")

system("cat tbl1.csv")
# "a","b"
#1,"f(c(\"a\",\"b\"))"
#2,"sum(1,2)"
system("cat tbl2.csv")
# "a","b"
#1,"f(c(""a"",""b""))"
#2,"sum(1,2)"

# output NA
fread("tbl1.csv",sep=",")
# Error in fread("tbl1.csv", sep = ",") : 
#   Expected sep (',') but new line, EOF (or other non printing character) ends field 1 when detecting types (   first): 2,"sum(1,2)"
# In addition: Warning message:
#   In fread("tbl1.csv", sep = ",") :
#   Starting data input on line 2 and discarded previous non-empty line: "a","b"

# incorrect output
fread("tbl2.csv",sep=",")
# a                 b
#1: 1 f(c(""a"",""b""))
#2: 2          sum(1,2)

# incorrect output
read.table("tbl1.csv",sep=",",header=TRUE)
#   a                 b
#1 1 f(c(\\a\\,\\b\\))
#2 2          sum(1,2)

# correct output
read.table("tbl2.csv",sep=",",header=TRUE)
# a               b
#1 1 f(c("a","b"))
#2 2      sum(1,2)

Findings:
as of now, only writing using qmethod="double" and read.table correctly supports write-read such kind of data.

Latest dev data.table, my locale, etc:

> sessionInfo()
# R version 3.1.3 (2015-03-09)
# Platform: x86_64-pc-linux-gnu (64-bit)
# Running under: Ubuntu 14.04.2 LTS
# 
# locale:
#  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_DK.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C             
#  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
# 
# attached base packages:
# [1] stats     graphics  grDevices utils     datasets  methods   base     
# 
# other attached packages:
# [1] data.table_1.9.5
# 
# loaded via a namespace (and not attached):
#  [1] bitops_1.0-6   chron_2.3-45   devtools_1.7.0 evaluate_0.5.5 formatR_1.0    httr_0.6.1     knitr_1.8      RCurl_1.95-4.5 stringr_0.6.2  tools_3.1.3   
@arunsrinivasan arunsrinivasan changed the title fread double quote in character columns fread double quote in character columns, needs 'allowEscapes' like argument Sep 8, 2015
@arunsrinivasan
Copy link
Member

PS: Remember to also test and credit #1299.

@jan-glx
Copy link
Contributor

jan-glx commented Sep 10, 2015

MOVED from #1299
I do as well have problems with the way fread currently handles quotes and tried to find a solution (don't look at my fork - it's sh*t).
Here are my thoughts on it:

There are two more or less separate issues:

Quoting

Some strings contain control characters (wrt a csv file) like \n, \l or sep (eg. ,).
To allow unambiguous parsing of files containing such characters such strings must be quoted, i.e. surrounded by a quote character(typically ").
Quote charters, however, may be part of a sting as well: He said: "What'up?" or {"id": 34, "character":"\""}
There are two major strategies to escape them:

  • Double: all quote characters inside the string to be quoted get repeated once. e.g. the above would become "He said: ""What'up?""" and "{""id"": 34, ""character"":""\""""}"'
  • Escape character: all quotes inside the string to be quoted get escaped by a preceding escape character (usually \). With this method the escape character must be escaped as well if it was part of the string before (usually now using the Double method). e.g. the above strings would become "He said: \"What'up?\"" and "{\"id\": 34, \"character\":\"\\\"\"}"

The former is easier to implement and uses less extra storage. The latter is the standard for write.table. However write.table fails to implement the second part and does not escape \ if it was part of the string. This leads to problems e.g. in the following case (in the 7th line, the quoted region never ends):

> a = "\\write.table \"killer\"\\"; a
[1] "\\write.table \"killer\"\\"
> cat(a)
\write.table "killer"\
> write.table(a)
"x"
"1" "\write.table \"killer\"\"
> write.table(a, qmethod = "double")
"x"
"1" "\write.table ""killer""\"

Note that quoting always comes with a quote escape method. If the quoted string did not contain any quoting characters, using non escaping quoting would be equivalent to using the double escaping method. If It did also not contain the escape character, escape character escaping quoting would be equivalent as well. Often it is possible to infer the quote escape method used.

I would want fread to correctly parse csv files if the user specifies the used quote method on a per file basis. Where qmethod could be either something like:

  • "notQuoted" allowing to read files that contain fields with arbitrarily many " but no control characters (e.g. this real world example)
  • "quotedDoubleEscaped"
  • "quotedEscapeCharEscaped"
    It would be cool if one could specify the quote and the escape character.
    Further I would want fread to deduce the quote method if possible.

Unquoting

i.e. the removal of the leading and trailing quote charactars from the string and the unescaping of the string

fread should do this only if all fields of one column are exactly one quoted region (necessarily but not sufficiently begin and end with a quote) or if sep2 was implemented if the same holds for all sub-fields of a column.
There should be options to specify unquoting behavior on a per column basis.
At least fread should provide easy ways to do unquoting afterwards.

Summary

Reading files with escaped quotes (as specified by the user) is a must. fread should try to deduce the quote escaping method used.
It would be cool if fread would allow for unquoting and even cooler if it would try to guess what should be unquoted.

For ease of writing/reading these words sound much more confident than I am about them. I don't want to nag but to help.
best,
Jan

PS: some example of currently improperly handled data:

> a = '"gene names" "protein names"\nABL,JTK7 "Abelson murine leukemia viral oncogene homolog 1","Abelson tyrosine-protein kinase 1"'; cat(a)
"gene names" "protein names"
ABL,JTK7 "Abelson murine leukemia viral oncogene homolog 1","Abelson tyrosine-protein kinase 1"
> fread(a)
   gene names                                                                        protein names
1:   ABL,JTK7 Abelson murine leukemia viral oncogene homolog 1","Abelson tyrosine-protein kinase 1
> fread(a,quote = "")
Empty data.table (0 rows) of 3 cols: ABL,JTK7 "Abelson murine leukemia viral oncogene homolog 1","Abelson tyrosine-protein kinase 1"
Warning message:
In fread(a, quote = "") :
  Starting data input on line 2 and discarded previous non-empty line: "gene names" "protein names"

@mattdowle mattdowle added this to the v1.9.12 milestone Nov 28, 2016
@goodmansasha
Copy link

goodmansasha commented Dec 2, 2017

I think my problem should be added to this, as it relates to allowing escapes and a feature request in the Master task for fread bugs. I'm using the 1.10.4-3 of data.table.

When exporting using the default fwrite settings, quotes inside of quotes are doubled to escape. Here is a line of the csv, showing that in the name 'Oller, Thomas "Rico"', the quotes are doubled to escape it and also balanced correctly :

AD04,19992000,"Oller, Thomas ""Rico""",A,Oller,Thomas,Oller,,,Assemblymember,Assemblymember,REP,Y,LEG_ESI,2006-12-15 12:43:44,N

However, when using fread, these are not handled well. The result is double quotes:

Oller, Thomas ""Rico"" 

Its only when I use the read.csv that this is read correctly as:

Oller, Thomas "Rico"

Note that this problem persists when forcing quotes, with quote=TRUE in fwrite. I also get strange triple slashes when setting fwrite to quote=TRUE, qmethod='escape'. In fact, I haven't found a way to simply import and export using the fread/fwrite combination! I only can import it back using data.table(read.csv(...)). Since i haven't found a simple solution using data.table to a simple task, this feels like a big bug!

@goodmansasha
Copy link

Escapes are defined this standard, RFC 4180:
https://tools.ietf.org/html/rfc4180#page-6

Here is a one line example of the issue:
fread('AD04,19992000,"Oller, Thomas ""Rico""",A,Oller,Thomas,Oller,,,Assemblymember,Assemblymember,REP,Y,LEG_ESI,2006-12-15 12:43:44,N\n')

The example shows that fread does not handle escapes the way the standard defines them, as the quotes around "Rico" are not escaped:

     V1       V2                     V3 V4    V5     V6    V7 V8 V9            V10            V11 V12 V13     V14
1: AD04 19992000 Oller, Thomas ""Rico""  A Oller Thomas Oller NA NA Assemblymember Assemblymember REP   Y LEG_ESI
                   V15 V16
1: 2006-12-15 12:43:44   N

@goodmansasha
Copy link

goodmansasha commented Dec 15, 2017

If fread does not comply with RFC 4180 for .csv, then why not escalate that to a 'Bug'?

@st-pasha st-pasha added bug and removed enhancement labels Dec 15, 2017
@st-pasha st-pasha changed the title fread double quote in character columns, needs 'allowEscapes' like argument fread should un-escape escaped quotes in fields Dec 15, 2017
@st-pasha st-pasha modified the milestones: Candidate, v1.10.6 Dec 15, 2017
@mattdowle mattdowle modified the milestones: v1.11.0, v1.11.2 Apr 29, 2018
@mattdowle mattdowle modified the milestones: 1.12.0, 1.12.2 Jan 11, 2019
@MichaelChirico
Copy link
Member

Just got bit by this... a bit unfortunate that fwrite default become hard to invert...

@valentas-kurauskas
Copy link

Escaping " using "" (qmethod="double") is the only option for tables exported in BigQuery.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants