Parsing confusion in the presence of non-escaping backslashes #17

Closed
dimitri opened this Issue Jul 21, 2014 · 4 comments

@dimitri
dimitri commented Jul 21, 2014

Hi,

As reported in dimitri/pgloader#80 towards the end, cl-csv fails to parse simple input when it contains unexpected escaping characters (not the whole escaping string) in the middle of a text field.

Here's a reduced test case:

"16417153","1401640227","Jun 1 2014","HTML -//W3C//DTD HTML 4.01 Frameset//EN\\",""

And I can reproduce the failure with the following code:

CL-USER> (with-open-file (s "foo.csv")
           (cl-csv:read-csv s :quote #\" :separator #\, :escape "\\\""))
; Evaluation aborted on #<SB-KERNEL:CASE-FAILURE expected-type:
                         (MEMBER :COLLECTING :COLLECTING-QUOTED :WAITING)
                         datum: :WAITING-FOR-NEXT>.
@bobbysmith007
Member

So the bug as I understand it is the need to add an escape for the escape character (in some circumstances).
By default this should be "\\" (i.e. two backslashes in a row). Any suggestions for the name? escape-escape sounds awful, but it is also accurate.

@dimitri
dimitri commented Jul 21, 2014

Well, in the case of that specific input file, HTTPArchive/httparchive#25 hints that the backslash is not there for any real reason (the string was truncated).

So I'm not sure we should reason in terms of escaping the escape character, rather than just allowing for a general escape character. The backslash could be used to escape whatever follows it. In the faulty input we have here, what follows is another backslash; after that comes a free quote, so the quoted section ends. What do you think?

@bobbysmith007
Member

I read that as:
We would like a new parser escaping mode, that rather than replacing all quote-escapes with a quote, replaces {escape-character}{thing} with {thing} regardless of what {thing} is.

I have to imagine that this is partly where the "" escape sequence arose.

I guess a new parameter :escaping-mode that defaults to :quote and accepts one of (:quote :following-char).

@bobbysmith007 bobbysmith007 added a commit that closed this issue Jul 22, 2014
@bobbysmith007 bobbysmith007 Added `*escape-mode*` to control how the escaping process works
the options are:
   :quote which replaces the entire escape sequence with a quote.
   :following reads the character after the escape sequence verbatim

fix AccelerationNet/cl-csv#17
77e024b
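
To make the difference between the two modes concrete, here is a minimal sketch of the replacement rules as described in this thread. It is not the cl-csv implementation: `unescape-field`, its keyword arguments, and their defaults are hypothetical names chosen for illustration, and the function only looks at the text of a single field rather than parsing a whole CSV row.

```lisp
;; Illustrative sketch only: UNESCAPE-FIELD is a hypothetical helper,
;; not part of cl-csv. It works on the text of one already-extracted field.
(defun unescape-field (text &key (escape "\\\"") (quote-char #\") (mode :quote))
  "Return TEXT with every occurrence of the string ESCAPE resolved per MODE."
  (assert (plusp (length escape)) () "ESCAPE must not be empty")
  (with-output-to-string (out)
    (let ((i 0)
          (n (length text))
          (elen (length escape)))
      (loop while (< i n)
            do (let ((end (+ i elen)))
                 (if (and (<= end n)
                          (string= escape text :start2 i :end2 end))
                     (ecase mode
                       ;; :quote mode: the whole escape sequence becomes one quote
                       (:quote
                        (write-char quote-char out)
                        (setf i end))
                       ;; :following mode: drop the escape, copy the next char as-is
                       (:following
                        (when (< end n)
                          (write-char (char text end) out))
                        (setf i (1+ end))))
                     ;; ordinary character: copy it through unchanged
                     (progn
                       (write-char (char text i) out)
                       (incf i))))))))

;; :quote mode, escape sequence is backslash + quote
(unescape-field "say \\\"hi\\\"")
;; => "say \"hi\""

;; :following mode, escape is a lone backslash; the doubled backslash
;; at the end of the reported field collapses to a single one
(unescape-field "Frameset//EN\\\\" :escape "\\" :mode :following)
;; => "Frameset//EN\\"
```

In `:following` mode the quote that comes after the doubled backslash is never looked at as part of an escape sequence, which is why it can close the quoted section in the reported input.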
@bobbysmith007
Member

Please try this out and let me know if it matches what you had in mind / solves your parsing error.

@dimitri dimitri added a commit to dimitri/pgloader that referenced this issue Jun 25, 2015
@dimitri dimitri Expose cl-csv escape mode option, fix #80.
Some CSV files are using the CSV escape character internally in their
fields. In that case we enter a parsing bug in cl-csv where backtracking
from parsing the escape string isn't possible (or at least
unimplemented).

To handle the case, change the quote parameter from \" to just \ and let
cl-csv use its escape-quote mechanism to decide if we're escaping only
separators or just any data.

See AccelerationNet/cl-csv#17 where the escape
mode feature was introduced for pgloader issue #80 already.
d75c100