fread in version 1.9.5 fails on csv that contains json with embedded double quote #1164

Closed
richardtessier opened this issue May 29, 2015 · 3 comments

@richardtessier

fread on this csv

json1, string1
"{""f1"":""value1"",""f2"":""double quote escaped with a backslash [ \"" ]""}", "string field"

results in the following error

Error in fread("data/json.csv", verbose = TRUE, data.table = FALSE, stringsAsFactors = FALSE) : 
  Field 1 on line 2 starts with quote (") but then has a problem. It can contain balanced unescaped quoted subregions but if it does it can't contain embedded \n as well. Check for unbalanced unescaped quotes: "{""f1"":""value1"",""f2"":""double quote escaped with a backslash [ \"" ]""}", "string field"

Verbose fread output

Input contains no \n. Taking this to be a filename to open
File opened, filesize is 0.000000 GB.
Memory mapping ... ok
Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
Positioned on line 1 after skip or autostart
This line is the autostart and not blank so searching up for the last non-blank ... line 1
Detecting sep ... ','
Detected 2 columns. Longest stretch was from line 1 to line 2
Starting data input on line 1 (either column names or first row of data). First 10 characters: json1, str
All the fields on line 1 are character fields. Treating as the column names.
Count of eol: 1 (including 0 at the end)
Count of sep: 2
nrow = MIN( nsep [2] / ncol [2] -1, neol [1] - nblank [0] ) = 1
Type codes (   first 5 rows): 40
Type codes: 40 (after applying colClasses and integer64)
Type codes: 40 (after applying drop or select (if supplied)
Allocating 2 column slots (2 - 0 dropped)

My sessionInfo()

R version 3.1.2 (2014-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)

locale:
[1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.9.5

loaded via a namespace (and not attached):
[1] chron_2.3-45 tools_3.1.2 
@arunsrinivasan
Member

But there is a balanced unescaped quoted region, starting after the \" in \"", which, as the error message says, fread can't handle. So isn't the error message clear?

@richardtessier
Author

I would say the question is more whether fread should support reading this file, considering that it contains valid JSON escaped for CSV. Unless I've misunderstood, a quote inside a JSON string is escaped by adding a \ before it, and escaping for CSV then doubles every double quote, so \" becomes \"" inside the CSV file.

If fread cannot support this, I find that understanding what "balanced unescaped quoted subregions" means is not trivial, though I can't say I have a better formulation. Would it be possible to point to the character where the problematic subregion starts?
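
The two-step escaping described in this comment can be reproduced with a minimal sketch. It uses Python's standard `json` and `csv` modules rather than fread, so it only illustrates the quote-doubling convention and says nothing about how fread itself parses the file. The variable names are hypothetical, and the generated line has no space after the comma, unlike the file in the report.

```python
import csv
import io
import json

# A record whose second value contains a literal double quote.
record = {"f1": "value1", "f2": 'double quote escaped with a backslash [ " ]'}

# Step 1: JSON encoding escapes the inner quote as \" .
json_text = json.dumps(record, separators=(",", ":"))

# Step 2: CSV encoding wraps each field in quotes and doubles every
# double quote inside it, so \" becomes \"" .
buffer = io.StringIO()
writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
writer.writerow([json_text, "string field"])
csv_line = buffer.getvalue()
print(csv_line)

# A reader that treats "" as an escaped quote (and \ as an ordinary
# character) recovers the original JSON text unchanged.
row = next(csv.reader(io.StringIO(csv_line)))
decoded = json.loads(row[0])
```

Under this convention the backslash is just data, and the sequence `\""` is a backslash followed by one escaped quote. A parser that also treats `\"` as an escaped quote would instead see one extra unpaired `"`, which is one plausible reading of why the field is reported as problematic.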

@arunsrinivasan arunsrinivasan self-assigned this Sep 10, 2015
@arunsrinivasan arunsrinivasan added this to the v1.9.6 milestone Sep 10, 2015
@arunsrinivasan
Member

I see what you mean. I will try to see whether we can handle this case without needing to specify the quote argument.

@arunsrinivasan arunsrinivasan modified the milestones: v1.9.8, v1.9.6 Sep 17, 2015
@arunsrinivasan arunsrinivasan modified the milestones: v2.0.0, v1.9.8 Nov 17, 2015
@mattdowle mattdowle modified the milestones: v1.9.8, v1.9.10 Nov 23, 2016