fread should take care of the double quote escaping "" #4779

Open
shrektan opened this issue Oct 27, 2020 · 9 comments
@shrektan
Member

shrektan commented Oct 27, 2020

This example should illustrate what I mean well.

library(data.table)
text <- 'A,B\na,"x1""x2"'
fread(text = text)
#>    A      B
#> 1: a x1""x2
as.data.table(readr::read_csv(text))
#>    A     B
#> 1: a x1"x2
as.data.table(read.csv(text = text))
#>    A     B
#> 1: a x1"x2

Created on 2020-10-27 by the reprex package (v0.3.0)

Session info
devtools::session_info()
#> - Session info ---------------------------------------------------------------
#>  setting  value                         
#>  version  R version 4.0.2 (2020-06-22)  
#>  os       Windows 10 x64                
#>  system   x86_64, mingw32               
#>  ui       RTerm                         
#>  language en                            
#>  collate  Chinese (Simplified)_China.936
#>  ctype    Chinese (Simplified)_China.936
#>  tz       Asia/Taipei                   
#>  date     2020-10-27                    
#> 
#> - Packages -------------------------------------------------------------------
#>  package     * version    date       lib source        
#>  assertthat    0.2.1      2019-03-21 [1] CRAN (R 4.0.0)
#>  backports     1.1.7      2020-05-13 [1] CRAN (R 4.0.0)
#>  callr         3.4.3      2020-03-28 [1] CRAN (R 4.0.0)
#>  cli           2.0.2      2020-02-28 [1] CRAN (R 4.0.0)
#>  crayon        1.3.4      2017-09-16 [1] CRAN (R 4.0.0)
#>  data.table  * 1.13.0     2020-07-24 [1] CRAN (R 4.0.2)
#>  desc          1.2.0      2018-05-01 [1] CRAN (R 4.0.0)
#>  devtools      2.3.0      2020-04-10 [1] CRAN (R 4.0.0)
#>  digest        0.6.25     2020-02-23 [1] CRAN (R 4.0.0)
#>  ellipsis      0.3.1      2020-05-15 [1] CRAN (R 4.0.2)
#>  evaluate      0.14       2019-05-28 [1] CRAN (R 4.0.0)
#>  fansi         0.4.1      2020-01-08 [1] CRAN (R 4.0.0)
#>  fs            1.5.0      2020-07-31 [1] CRAN (R 4.0.2)
#>  glue          1.4.2      2020-08-27 [1] CRAN (R 4.0.2)
#>  highr         0.8        2019-03-20 [1] CRAN (R 4.0.0)
#>  hms           0.5.3      2020-01-08 [1] CRAN (R 4.0.0)
#>  htmltools     0.5.0      2020-06-16 [1] CRAN (R 4.0.0)
#>  knitr         1.29       2020-06-23 [1] CRAN (R 4.0.0)
#>  lifecycle     0.2.0      2020-03-06 [1] CRAN (R 4.0.0)
#>  magrittr      1.5        2014-11-22 [1] CRAN (R 4.0.0)
#>  memoise       1.1.0      2017-04-21 [1] CRAN (R 4.0.0)
#>  pillar        1.4.6      2020-07-10 [1] CRAN (R 4.0.2)
#>  pkgbuild      1.0.7      2020-04-25 [1] CRAN (R 4.0.0)
#>  pkgconfig     2.0.3      2019-09-22 [1] CRAN (R 4.0.0)
#>  pkgload       1.0.2      2018-10-29 [1] CRAN (R 4.0.0)
#>  prettyunits   1.1.1      2020-01-24 [1] CRAN (R 4.0.0)
#>  processx      3.4.3      2020-07-05 [1] CRAN (R 4.0.2)
#>  ps            1.3.4      2020-08-11 [1] CRAN (R 4.0.2)
#>  R6            2.4.1      2019-11-12 [1] CRAN (R 4.0.0)
#>  Rcpp          1.0.5      2020-07-06 [1] CRAN (R 4.0.2)
#>  readr         1.3.1      2018-12-21 [1] CRAN (R 4.0.0)
#>  remotes       2.2.0      2020-07-21 [1] CRAN (R 4.0.2)
#>  rlang         0.4.7      2020-07-09 [1] CRAN (R 4.0.2)
#>  rmarkdown     2.4        2020-09-30 [1] CRAN (R 4.0.2)
#>  rprojroot     1.3-2      2018-01-03 [1] CRAN (R 4.0.0)
#>  sessioninfo   1.1.1      2018-11-05 [1] CRAN (R 4.0.0)
#>  stringi       1.4.6      2020-02-17 [1] CRAN (R 4.0.0)
#>  stringr       1.4.0      2019-02-10 [1] CRAN (R 4.0.0)
#>  testthat      2.3.2.9000 2020-05-09 [1] local         
#>  tibble        3.0.3      2020-07-10 [1] CRAN (R 4.0.2)
#>  usethis       1.6.1      2020-04-29 [1] CRAN (R 4.0.0)
#>  vctrs         0.3.2      2020-07-15 [1] CRAN (R 4.0.2)
#>  withr         2.2.0      2020-04-20 [1] CRAN (R 4.0.0)
#>  xfun          0.18       2020-09-29 [1] CRAN (R 4.0.2)
#>  yaml          2.2.1      2020-02-01 [1] CRAN (R 4.0.0)
#> 
#> [1] D:/app/R_lib/4.0
#> [2] D:/app/R-4.0.0/library
@jangorecki
Member

#1109 could possibly be related

@barryrowlingson

Doubling a double-quote inside a quoted field is how RFC 4180 specifies including a literal double-quote character in that field. See Section 2, rule 7 of the RFC: https://tools.ietf.org/html/rfc4180#page-2

I can't see an option to make fread compliant with this.

@shrektan
Member Author

@barryrowlingson Sorry, but I don't understand what you mean by:

I can't see an option to make fread compliant with this.

@barryrowlingson

@shrektan what I'm saying is that I can't see a way to make fread interpret doubled double-quotes in quoted fields correctly according to RFC4180:

   7.  If double-quotes are used to enclose fields, then a double-quote
       appearing inside a field must be escaped by preceding it with
       another double quote.  For example:

       "aaa","b""bb","ccc"

but fread by default does not interpret two consecutive double-quotes as one escaped double-quote, as the spec quoted above requires:

> fread('A,B,C\n"aaa","b""bb","ccc"')$B
[1] "b\"\"bb"

Here it returns both double-quotes unchanged instead of a single one.

By "I can't see an option to make fread compliant with this." I mean that I've looked at the fread help and I can't see something like fread(csvfile, rfc4180compliant=TRUE) or fread(csv, escapeddoublequotes=TRUE) that would do this.
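
In the absence of such an option, one possible stopgap is to collapse the doubled quotes after reading. The following is only a sketch: `unescape_doubled_quotes` is a hypothetical helper, not part of data.table, and it assumes that every `""` in a character column came from an escaped quote in a quoted field. It works on a plain data.frame, which fread can return when called with data.table = FALSE.

```r
# Hypothetical helper (not part of data.table): collapse each doubled
# double-quote in the character columns of a data.frame into a single one.
# Assumption: every "" in the data is an RFC 4180 style escaped quote.
unescape_doubled_quotes <- function(df) {
  is_chr <- vapply(df, is.character, logical(1))
  df[is_chr] <- lapply(df[is_chr], function(x) gsub('""', '"', x, fixed = TRUE))
  df
}

# Stand-in for the result reported at the top of this issue; with data.table
# installed it could come from fread(text = text, data.table = FALSE) instead.
raw <- data.frame(A = "a", B = 'x1""x2', stringsAsFactors = FALSE)
fixed <- unescape_doubled_quotes(raw)
fixed$B
```

This touches every character column, so it would also alter values that legitimately contain two consecutive quotes for some other reason.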

@gorkang

gorkang commented Apr 10, 2021

Here is an example where fread does not seem to take care of the quotes: https://stackoverflow.com/questions/67026291/reading-files-with-double-double-quotes-in-r?noredirect=1

In brief, when reading {""Q0"":""double double quote""}, the expectation is that it becomes {"Q0":"double double quote"}, but fread keeps the doubled quotes. read.csv works fine.

# Content of csv file
# "numbers", "simple_quote", "double_quote"
# "9", "quoted text", "{""Q0"":""double double quote""}"

library(data.table)

read.csv("test.csv")
#>   numbers simple_quote                  double_quote
#> 1       9  quoted text  {"Q0":"double double quote"}

fread("test.csv")
#>    numbers simple_quote                     double_quote
#> 1:       9  quoted text {""Q0"":""double double quote""}

@kwhkim

kwhkim commented Jul 7, 2021

The basic rule is that if you give a character a special function, you need a way to escape that function.
If the comma is the column separator, you need a way to represent a literal comma, which is achieved with quotation marks that mark the start and the end of a value. Since quotation marks then have a special function of their own, there must also be a way to represent a literal quotation mark. Most functions, including read.csv and read_csv, use doubling (two consecutive quotation marks represent one literal quotation mark). But it is not mandatory. Python's pandas.read_csv has an argument doublequote=, which should be set to True if you want doubled quotation marks to be interpreted this way. You can set it to False if you want the same result as fread(). In addition, since the substring ", is seldom included in a value, not doubling quotation marks and treating ", as the end of a field is not a bad idea; it could lead to a smaller file size when the data contain a lot of quotation marks. But it could break.
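
To make the doubling rule concrete, here is a minimal sketch of a splitter for a single CSV record in base R. `split_csv_line` is a hypothetical name, not a function from base R or data.table, and the sketch assumes one record with no embedded newlines; it is meant to illustrate the rule, not how any of the readers mentioned above are implemented.

```r
# Split one CSV record into fields, treating "" inside a quoted field as a
# single literal double-quote (RFC 4180, section 2, rule 7).
split_csv_line <- function(line, sep = ",") {
  chars <- strsplit(line, "", fixed = TRUE)[[1]]
  fields <- character(0)
  current <- ""
  in_quotes <- FALSE
  i <- 1L
  n <- length(chars)
  while (i <= n) {
    ch <- chars[i]
    if (in_quotes) {
      if (ch == '"') {
        if (i < n && chars[i + 1L] == '"') {
          current <- paste0(current, '"')  # doubled quote: keep one, skip one
          i <- i + 1L
        } else {
          in_quotes <- FALSE               # lone quote: the field is closed
        }
      } else {
        current <- paste0(current, ch)     # separators are literal in here
      }
    } else if (ch == '"') {
      in_quotes <- TRUE                    # opening quote of a field
    } else if (ch == sep) {
      fields <- c(fields, current)         # unquoted separator ends the field
      current <- ""
    } else {
      current <- paste0(current, ch)
    }
    i <- i + 1L
  }
  c(fields, current)
}

parsed <- split_csv_line('"aaa","b""bb","ccc"')
parsed
```

The second field comes back as b"bb, with the doubled quote collapsed into one.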

Anyway, I think it would be much better if fread() could read back files written by fwrite(). Or we could have a doublequote= argument, much like Python's pandas.read_csv(). Last but not least, I would like to thank all the authors and contributors of the data.table package for the wonderful work you've done for the R community.

@dhersz

dhersz commented Oct 19, 2023

Hello, should we expect a fix/workaround for this issue in a future version?

It has been 3 years since the issue was first opened and since then a few other issues have also mentioned this behavior, as we can see above.

As I note in #5088, particularly troubling is that fwrite() and fread() deal with double quote escaping in different ways.

@tdhock
Member

tdhock commented Oct 19, 2023

Hi @dhersz, thanks for your concern. If you would like this to be fixed in a future version, the next step would be to submit a PR that attempts to fix it, so please consider doing that if you have time to volunteer/help. It would also be useful to have your opinion on the community survey #5704 and the new governance #5676.

@dhersz

dhersz commented Oct 19, 2023

Hi @tdhock, thanks for the quick reply.

Unfortunately, I have neither the time nor the expertise to help with a PR here. I'll take a look at the new governance proposal and the community survey, thanks for sharing.
