fread: quotes in quoted string fields #1299

Closed
berndbischl opened this issue Aug 28, 2015 · 11 comments

@berndbischl

Hi,

how do I properly encode / import string fields with double quotes in them?
The docs say:

character columns can be quoted (...,2,"Joe Bloggs",3.14,...) or not quoted (...,2,Joe Bloggs,3.14,...).

Due to the restrictions on unquoted char cols, my cols are always quoted (a sep can appear in them)

Thus, unescaped quotes may be present in a quoted field (...,2,"Joe, "Bloggs"",3.14,...) as well as
escaped quotes (...,2,"Joe ",Bloggs"",3.14,...). If an embedded quote is followed by the separator
inside a quoted field, the embedded quotes up to that point in that field must be balanced; e.g.
...,2,"www.blah?x="one",y="two"",3.14,....

Due to the restrictions on "normal dquotes" inside of the string, I have to escape them.

Now, this file is imported correctly, just as I want it:

"a", "b"
"x",  "my name is "joe""

See here

d = fread("test_dt.csv", header = FALSE, sep = ",", stringsAsFactors = FALSE, data.table = FALSE)
  V1                   V2
1  a                  "b"
2  x   "my name is "joe""

But this is the form I have to use, and here the backslashes that escape the embedded dquotes get doubled:

File:

"a", "b"
"x",  "my name is \"joe\""

fread output:

  V1                       V2
1  a                      "b"
2  x   "my name is \\"joe\\""

Note: I have some control over the csv files, as I am already preprocessing them a bit. But I need a routine that works on general files, so in my string columns I have to expect arbitrary input.

@arunsrinivasan
Member

@berndbischl w.r.t. the doubled slashes, I don't see a difference with read.table() behaviour.. Do you?

I'm working on quote = "" argument now.. Would that solve your issue?

@jangorecki
Member

duplicate of #1109 ?

@berndbischl
Author

@arunsrinivasan
Yes, I see a difference. Here is my code
(the CSV file is the one I posted above)

library(data.table)
d1 = fread("test_datatable_quotes.csv", data.table = FALSE)
print(d1)
d2 = read.table("test_datatable_quotes.csv")
print(d2)

Output:

# fread
  a                       "b"
1 x    "my name is \\"joe\\""

# read.table
  V1               V2
1 a,                b
2 x, my name is "joe"

Like I said, the quoting backslashes get doubled for fread.

For ref, sessionInfo

sessionInfo()
R version 3.2.1 (2015-06-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 15.04

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=de_DE.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=de_DE.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] data.table_1.9.4 setwidth_1.0-3   vimcom_1.2-5     stringr_1.0.0    testthat_0.10.0  roxygen2_4.1.1   devtools_1.8.0  
[8] BBmisc_1.9      

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.0     digest_0.6.8    crayon_1.3.0    chron_2.3-47    plyr_1.8.3      git2r_0.10.1    magrittr_1.5    stringi_0.5-5  
 [9] reshape2_1.4.1  curl_0.8        xml2_0.1.1      checkmate_1.6.3 tools_3.2.1     rversions_1.0.1 memoise_0.2.1  

@berndbischl
Author

  1. The above code was crap, sorry.

  2. Maybe I am confused.
    How do I read this file properly into R?

1,"b"
2,"my name is \"joe\""

I tried this better code:

library(data.table)
s = "my name is \"joe\""
d1 = fread("test_datatable_quotes.csv", data.table = FALSE, header = FALSE)
print(d1[2,2])
print(d1[2,2] == s)
d2 = read.table("test_datatable_quotes.csv", sep = ",", header = FALSE,
  colClasses = c("numeric", "character"))
print(d2[2,2])
print(d2[2,2] == s)

Output:

[1] "my name is \\\"joe\\\""
[1] FALSE
[1] "my name is \\joe\\"
[1] FALSE

Neither call does what I want, which is to obtain the content of s.

NB: the two calls still produce different results.
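
A note on reading the printed results above: print() re-escapes backslashes and double quotes for display, so part of the apparent "doubling" is a display effect rather than extra characters in the string. The following is a minimal base-R sketch of that, assuming the fread result is the string shown in the first printed line. The helper unescape_quotes() is hypothetical (it is not part of data.table), and it assumes that backslash-quote is the only escape sequence that needs undoing.

```r
# The field as it sits between the outer quotes in the file:
#   my name is \"joe\"
as_read <- "my name is \\\"joe\\\""
wanted  <- "my name is \"joe\""

print(as_read)       # print() re-escapes backslashes and quotes for display
writeLines(as_read)  # writeLines() shows the raw characters instead
nchar(as_read)       # 18: each backslash-quote pair is two characters
nchar(wanted)        # 16

# Hypothetical post-processing helper: replace each literal
# backslash-quote pair with a plain double quote.
unescape_quotes <- function(x) gsub("\\\"", "\"", x, fixed = TRUE)

identical(unescape_quotes(as_read), wanted)
```

Such a post-processing step would also rewrite any backslash-quote pair that was meant literally, so it is only a stopgap for files whose escaping convention is known.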

@arunsrinivasan
Member

To read it with v1.9.5+ of data.table, you've to quote your columns, not add the escapes..

# 1299_fread.csv
"a","b"
"x","my name is "joe""

fread("1299_fread.csv")
#    V1               V2
# 1:  a                b
# 2:  x my name is "joe"

I'm not particularly fond of the behaviour of read.table() to remove quotes within substrings. I like fread's behaviour in that, if you'd like for , or \n to be read as such, then wrap that column/row(s) with "" and use quote = "\"".

See ?fread quote argument for more special cases.

In short, if you've to use fread(), then you'll have to construct your file without escapes. Else fread() will also read that in.
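
For illustration, here is a rough base-R sketch of constructing a file in that form, with every field wrapped in double quotes and nothing inside the field escaped. The data frame, the helper name wrap_raw(), and the temporary file are all assumptions made up for this example.

```r
# Hypothetical example data; the second value contains embedded double quotes.
df <- data.frame(
  a = c("a", "x"),
  b = c("b", "my name is \"joe\""),
  stringsAsFactors = FALSE
)

# Wrap a value in double quotes without escaping or doubling anything inside it.
wrap_raw <- function(x) paste0("\"", x, "\"")

# One line per row, fields joined by the separator.
lines <- paste(wrap_raw(df$a), wrap_raw(df$b), sep = ",")

path <- tempfile(fileext = ".csv")  # temporary file, for illustration only
writeLines(lines, path)

writeLines(readLines(path))  # show the raw file contents
```

The lines are pasted together by hand because write.table() always alters embedded quotes when quoting is on: qmethod = "escape" adds backslashes and qmethod = "double" doubles the quotes.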

@arunsrinivasan
Member

This'll be solved if/when #1109 is implemented (allowEscapes like argument). I don't think we should be escaping automatically. Perhaps Matt can weigh in as well. Closing this, as it's linked to the other post anyway.

@berndbischl
Author

To read it with v1.9.5+ of data.table, you've to quote your columns, not add the escapes..

I tried to address this in my first post. As far as I understand the docs, that does not work when I have commas and so on in my string after the quotes? And the quotes need to be balanced?

@berndbischl
Author

Because if that is the case, I cannot use this. And this issue is still open.

Like I said, I have no control over the contents of the strings; they can be very general.
I can preprocess them a bit and represent different chars in different ways. But basically any combination of weird char sequences can occur.

Can you please comment on this so I know whether this is solved or not?

@arunsrinivasan
Member

tried to address this in my first post. As far as I understand the docs that does not work, when I now have commas and so on in my string after the quotes?

I don't know what this means.

And the quotes need to be balanced?

From ?fread quote argument:

Single character value: character columns can be quoted by the character specified in quote, e.g., ...,2,"Joe Bloggs",3.14,... or not quoted, e.g., ...,2,Joe Bloggs,3.14,....
Spaces and other whitespace (other than sep and \n) may appear in an unquoted character field. In essence quoting character fields are required only if sep or \n appears in the string value. Quoting may be used to signify that numeric data should be read as text. A quoted field must start with quote and end with a quote that is also immediately followed by sep or \n. Thus, unescaped quotes may be present in a quoted field, e.g., ...,2,"Joe, "Bloggs"",3.14,..., as well as escaped quotes, e.g., ...,2,"Joe ",Bloggs"",3.14,.... If an embedded quote is followed by the separator inside a quoted field, the embedded quotes up to that point in that field must be balanced; e.g. ...,2,"www.blah?x="one",y="two"",3.14,....

Did you go through this? If so, why wouldn't it work in your case? Please show with an example.

I'm not sure why you think it should be open. Your issue is about "" being not escaped in the read file, which would be solved when/if allowEscapes like argument is implemented. Like I said, I don't think that should happen by default without such argument.
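
To make the rule quoted from ?fread more concrete, here is a small base-R sketch of one simplified reading of it. This is not fread's actual implementation. The function split_line() is hypothetical, handles a single line only, and treats a quote as closing a field only when it is followed by sep or end of line and the number of embedded quotes seen so far in that field is even.

```r
# Split one line under a simplified reading of the documented quote rule.
split_line <- function(line, sep = ",", quote = "\"") {
  chars <- strsplit(line, "", fixed = TRUE)[[1]]
  n <- length(chars)
  fields <- character(0)
  i <- 1
  while (i <= n + 1) {
    if (i <= n && chars[i] == quote) {
      # Quoted field: find a quote followed by sep or end of line,
      # with an even number of embedded quotes before it.
      embedded <- 0
      close_at <- NA
      j <- i + 1
      while (j <= n) {
        if (chars[j] == quote) {
          at_boundary <- j == n || chars[j + 1] == sep
          if (at_boundary && embedded %% 2 == 0) {
            close_at <- j
            break
          }
          embedded <- embedded + 1
        }
        j <- j + 1
      }
      if (is.na(close_at)) stop("unterminated quoted field")
      inner <- chars[i + seq_len(close_at - i - 1)]
      fields <- c(fields, paste(inner, collapse = ""))
      i <- close_at + 2  # skip the closing quote and the separator
    } else {
      # Unquoted field: runs up to the next separator or end of line.
      j <- i
      while (j <= n && chars[j] != sep) j <- j + 1
      fields <- c(fields, paste(chars[i - 1 + seq_len(j - i)], collapse = ""))
      i <- j + 1
    }
  }
  fields
}

# The two unescaped examples from the documentation text:
split_line("2,\"Joe, \"Bloggs\"\",3.14")
split_line("2,\"www.blah?x=\"one\",y=\"two\"\",3.14")

# A field whose content is  a",b  (one unbalanced quote followed by sep)
# is split early under this reading: four fields come back instead of three.
split_line("1,\"a\",b\",2")
```

Under this reading, the last call shows the kind of arbitrary field content that cannot be represented without some escaping convention, which is the case discussed in #1109.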

@jan-glx
Contributor

jan-glx commented Sep 9, 2015

MOVED to #1109

@arunsrinivasan
Member

Most of your points are on escaping quotes (which should be handled better). But could you please shift this post to #1109?
