[R-Forge #5358] fread quoted strings not always handled properly #489

Closed
arunsrinivasan opened this Issue Jun 8, 2014 · 1 comment

Member

arunsrinivasan commented Jun 8, 2014

Submitted by: James Sams; Assigned to: Nobody; R-Forge link

I have a file with three fields: two string fields and an integer field. In 99% of cases, the string fields aren't even quoted due to their simplicity. However, I have one line in a file that looks like:

233,"A ""EMBEDDED"" QUOTE FIELD",morechars

And fread fails to read it, treating the second quote character as the end of the string field and then expecting a separator:

# "Expected sep (',') but '"' ends field 2 on line 828 when reading data:". 

(Actual data not used due to confidentiality concerns.)

read.csv properly interprets this as three columns:

1) 233
2) A "EMBEDDED" QUOTE FIELD
3) morechars
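
For comparison outside R, here is a minimal sketch of the same three-column interpretation using Python's standard `csv` module. It is only an illustration of the doubled-quote convention, not a description of how read.csv or fread work internally; the variable names are made up for the example.

```python
import csv
import io

# The example line from the report; the embedded quotes are doubled ("").
line = '233,"A ""EMBEDDED"" QUOTE FIELD",morechars\n'

# csv.reader defaults to doublequote=True, so "" inside a quoted field
# is read as a single literal quote character.
rows = list(csv.reader(io.StringIO(line)))

print(rows[0])
# ['233', 'A "EMBEDDED" QUOTE FIELD', 'morechars']
```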

IME, there are two ways that CSV-type files handle embedded quotes: with a backslash escape (\") and by doubling them up, as is done here (""). Well, at least two unambiguous ways. Note that it isn't uncommon to see such a field without the outer quotes. The reason for this, as I understand it, is that some programs only include the outer quotes if the field contains the designated field separator. Otherwise, these programs rely on the escaping mechanism (either backslash or doubling) to handle single or double quotes, etc. Of course, CSV files aren't standardized, so there may be other cases. Hopefully this is helpful information, though.
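
The backslash convention can be sketched with Python's standard `csv` module, which needs to be told explicitly that a backslash is the escape character. The sample lines and the helper name `parse_backslash` are hypothetical, made up to mirror the example above; the second line shows the variant without outer quotes.

```python
import csv
import io

# Hypothetical sample lines using backslash escapes instead of doubling.
quoted = '233,"A \\"EMBEDDED\\" QUOTE FIELD",morechars\n'
unquoted = '233,A \\"EMBEDDED\\" QUOTE FIELD,morechars\n'

def parse_backslash(text):
    """Parse one CSV line in which embedded quotes are backslash-escaped."""
    reader = csv.reader(io.StringIO(text), doublequote=False, escapechar='\\')
    return next(reader)

fields_quoted = parse_backslash(quoted)
fields_unquoted = parse_backslash(unquoted)

# Both forms should yield the same three fields.
print(fields_quoted)
print(fields_unquoted)
```

A parser has to be told, or has to guess, which of the two conventions a file uses, which is part of why unstandardized CSV input is hard to read robustly.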

I see several other bug reports about fread's handling of quoted fields, but this seems to be a different issue than the others. Thus the separate report. Apologies if you consider it to be a duplicate report.

@mattdowle mattdowle added this to the v1.9.6 milestone Oct 25, 2014

Member

mattdowle commented Oct 25, 2014

Embedded quotes and doubled-up quotes should now be handled in v1.9.4, whether or not they occur inside a quoted field. The report seems to be from much earlier this year. There's still a problem if an embedded newline occurs after a doubled-up quote. To do: check and add more tests on this one, document, add to README and close.
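
The thread does not show the input that triggers the remaining problem, so the following is only a guess at the kind of test fixture meant: a quoted field in which a newline directly follows a doubled-up quote. The data is hypothetical, and Python's standard `csv` module is used here purely to state the expected parse.

```python
import csv
import io

# Hypothetical fixture: in row 1 the quoted field contains a doubled-up
# quote ("") immediately followed by an embedded newline.
text = (
    'id,note,tag\n'
    '1,"ends with ""quote""\nthen continues",x\n'
    '2,plain,y\n'
)

# newline='' leaves line endings untouched so the embedded newline is
# passed through to the parser as-is.
rows = list(csv.reader(io.StringIO(text, newline='')))

for row in rows:
    print(row)
```

The expected result is three records, with the second field of row 1 keeping both the literal quotes and the newline.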
