Handling fallbacks for unreliable data like quote-delim variance #16

Closed
saizai opened this Issue July 11, 2012 · 11 comments

3 participants

Sai, James Edward Gray II, Daniel J. Pritchett
Sai
saizai commented July 11, 2012

I'm trying to fix https://github.com/NYTimes/Fech/blob/master/lib/fech/csv.rb

According to the FEC file spec, fields are supposed to be quote delimited.

However, sometimes you get things like this:

SB23,C00121368,CAN,HAYES FOR CONGRESS,PO BOX 2000,,CONCORD,NC,28026,15,PRIMARY,P2004,,20031014,000001000.00,,,ROBERT CANNON "ROBIN" HAYES,H,NC,08,,,,,,,,CONT TO ROBIN HAYES HOUSE NC 08,,03103111504948000013,,,,

Note the within-field quote marks, which are not allowed by the file format spec; but the robustness principle means that's our problem to handle. It's legacy data, and it can't be changed.
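As a small illustration of the problem (a shortened stand-in for the line above; note that the liberal_parsing option shown here only exists in later versions of Ruby's standard CSV, 2.4+):

```ruby
require "csv"

# Shortened stand-in for the FEC line above; the quotes sit inside an
# unquoted field, which strict CSV parsing rejects.
line = 'SB23,C00121368,ROBERT CANNON "ROBIN" HAYES,NC'

begin
  CSV.parse_line(line)
rescue CSV::MalformedCSVError => e
  puts "strict parse failed: #{e.class}"
end

# liberal_parsing keeps the stray quotes verbatim instead of raising.
p CSV.parse_line(line, liberal_parsing: true)
# => ["SB23", "C00121368", "ROBERT CANNON \"ROBIN\" HAYES", "NC"]
```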

@nytimes' CsvDoctor tries to fix this by manually loading the file, but of course its logic there is wrong (e.g. it breaks if a field has linebreaks within it, because File.open.each thinks that linebreaks are the separator).

However there doesn't seem to be a better place in FasterCSV to patch in a retry; if I try to wrap #shift for instance, retry would break because the io has already moved forward in a way that can't be reversed. And I'd rather not have to replicate a whole lot of FasterCSV's file-parsing code just to make this patch.
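One crude per-line retry is possible outside the parser entirely (a hypothetical sketch, not existing Fech or FasterCSV code, and it shares CsvDoctor's weakness: like File.open.each, it assumes one record per physical line, so embedded newlines break it):

```ruby
require "csv"
require "stringio"

# Hypothetical per-line retry: read each raw line ourselves, try the strict
# parser, and only degrade to a naive comma split when it raises.
def each_row(io)
  io.each_line do |line|
    begin
      yield CSV.parse_line(line)
    rescue CSV::MalformedCSVError
      yield line.chomp.split(",", -1)  # crude fallback; -1 keeps trailing empties
    end
  end
end

data = StringIO.new(<<~RECORDS)
  a,b,c
  123,ROBERT CANNON "ROBIN" HAYES,end
RECORDS
each_row(data) { |row| p row }
```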

Any suggestions?

James Edward Gray II
Owner
JEG2 commented July 11, 2012

Hmm, the truth is that I probably just wouldn't use a CSV parser for this. The data isn't quite valid CSV. Because of that, I'd want a parser more tuned to this data. Does that make sense?

Sai
saizai commented July 12, 2012

The data is at least intended to be CSV — and it would work correctly if there's some way to detect the quoting style being different at runtime, like CsvDoctor tries to do. It's just dirty data; not everyone complied with the format rules, and now we're stuck with having to read it anyway. What else is new, right? :-P

James Edward Gray II
Owner
JEG2 commented July 12, 2012

I think something like this parser is what you need:

# Read one physical line at a time, pulling in additional lines whenever a
# quoted field turns out to contain embedded newlines.
while line = DATA.gets
  fields = [ ]
  unless line == "\n"
    loop do
      # Peel off one field at a time: a quoted field (with "" escapes) or an
      # unquoted field, followed by a comma or the line ending.
      while line.sub!(/\A("(?:""|[^"]*)+"|(?!")[^,]*)(?:,|\n)/, "")
        if $1.empty?
          fields << nil
        elsif $1.start_with? '"'
          fields << $1[1..-2].gsub('""', '"')  # strip outer quotes, unescape ""
        else
          fields << $1
        end
      end
      break if     line.empty?         # consumed the whole record
      break unless more = DATA.gets    # EOF in the middle of a record
      line += more                     # a field spans a newline; keep reading
    end
  end
  p fields
end

__END__
,,,
123,"testing a field with
a newline in it"
SB23,C00121368,CAN,HAYES FOR CONGRESS,PO BOX 2000,,CONCORD,NC,28026,15,PRIMARY,P2004,,20031014,000001000.00,,,ROBERT CANNON "ROBIN" HAYES,H,NC,08,,,,,,,,CONT TO ROBIN HAYES HOUSE NC 08,,03103111504948000013,,,,
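For reference, the field-extraction regex used above can also be exercised in isolation (a standalone trace on an illustrative, well-formed line, before the unquoting step is applied):

```ruby
# Peel raw fields off a well-formed line with the same regex the parser uses.
line = %Q{123,"a ""quoted"" field",plain\n}
raw = []
while line.sub!(/\A("(?:""|[^"]*)+"|(?!")[^,]*)(?:,|\n)/, "")
  raw << $1
end
p raw  # raw captures: outer quotes and "" escapes still intact
# => ["123", "\"a \"\"quoted\"\" field\"", "plain"]
```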
James Edward Gray II JEG2 closed this July 12, 2012
Sai
saizai commented July 30, 2012

@JEG2 Tried yours, and it fails for the following case:

123,"foo","bar "baz" qux","stuff",321

Note the within-field quotes around bar "baz" qux.

Here's a stupid method that I think should work:

QUOTE_CHAR = '"'; REPLACEMENT_CHAR = "'"; ROW_SEP = "\r\n"; COL_SEP = ","

# an opening quote at the start of the line, or a closing quote before the row separator
surrounding_quotes_regex = /(\A\s*#{QUOTE_CHAR}|#{QUOTE_CHAR}\s*#{ROW_SEP}\Z)/
# quote, separator, quote -- i.e. the boundary between two quoted fields
quote_sep_quote_regex = /#{QUOTE_CHAR}\s*#{COL_SEP}\s*#{QUOTE_CHAR}/

line = line.gsub(surrounding_quotes_regex, '')   # strip the outer quotes
           .gsub(quote_sep_quote_regex, "\x1E")  # field boundaries -> 0x1E
           .gsub(QUOTE_CHAR, REPLACEMENT_CHAR)   # leftover quotes -> '

… basically:
1. strip the opening and closing quotes
2. replace internal quote-sep-quote sequences with ASCII 0x1E (record separator)
3. replace any remaining quotes with the replacement character (double -> single quote)

It works if you assume that you're processing a whole line at a time and that all strings are quoted. Not sure how to modify it to handle problems where #gets gives you a partial line, or where you need to process as you go rather than line-at-once.
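Under those assumptions (whole line at once, every field quoted), the pipeline runs end to end; here's a standalone trace on a hypothetical input, with a final split on the 0x1E markers added for illustration:

```ruby
QUOTE_CHAR = '"'; REPLACEMENT_CHAR = "'"; ROW_SEP = "\r\n"; COL_SEP = ","
surrounding_quotes_regex = /(\A\s*#{QUOTE_CHAR}|#{QUOTE_CHAR}\s*#{ROW_SEP}\Z)/
quote_sep_quote_regex = /#{QUOTE_CHAR}\s*#{COL_SEP}\s*#{QUOTE_CHAR}/

# A fully quoted line with stray quotes inside the third field.
line = %Q{"123","foo","bar "baz" qux","stuff","321"\r\n}
line = line.gsub(surrounding_quotes_regex, '')   # strip the outer quotes
           .gsub(quote_sep_quote_regex, "\x1E")  # field boundaries -> 0x1E
           .gsub(QUOTE_CHAR, REPLACEMENT_CHAR)   # leftover quotes -> '
p line.split("\x1E")
# => ["123", "foo", "bar 'baz' qux", "stuff", "321"]
```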

It fails if there is inconsistent quoting, though; e.g. 123,"foo "bar" baz",321, or 123,foo,"bar",baz "qux",321, or 123,"foo "bar" baz",1,2,"1,2,3",4. Hmmm.

James Edward Gray II
Owner
JEG2 commented July 30, 2012

Yeah, that's the problem with this data. Once you throw out proper escaping there are way too many edge cases. :(

Sai
saizai commented July 30, 2012

@JEG2 Unfortunately, the data is what it is. I don't get to control it, and I do have to import it. I can wish that they used \x1E separators to begin with, and used delimiters consistently or did other basic validation; then I'd have more sanity. But they didn't, and I'd like to retain at least some. :-P

James Edward Gray II
Owner
JEG2 commented August 06, 2012

I hear ya. I'm just not sure this leads to the conclusion that CSV should support this kind of content. Know what I mean?

Daniel J. Pritchett

Wouldn't patching FasterCSV lead to interface incompatibilities with 1.9's CSV library anyway? That csv.rb on fech appears to be passing text into its chosen CSV processor one line at a time already; why not drop your pre-processing text cleaner in there instead?
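One hypothetical shape for such a cleaner (the name and helper are illustrative, not Fech code): protect the structural quotes at field boundaries, downgrade whatever quotes remain, then restore the structural ones. It assumes 0x00 never occurs in the data and, like the other line-at-a-time approaches here, that fields contain no embedded newlines.

```ruby
require "csv"

# Hypothetical pre-processing cleaner in the spirit of the suggestion above.
def clean_line(line)
  line = line.dup
  line.gsub!(/(\A|,)"/) { "#{$1}\x00" }        # mark opening structural quotes
  line.gsub!(/"(,|\r?\n|\z)/) { "\x00#{$1}" }  # mark closing structural quotes
  line.gsub!('"', "'")                         # stray in-field quotes -> '
  line.tr("\x00", '"')                         # restore the structural quotes
end

p CSV.parse_line(clean_line('123,"bar "baz" qux",321'))
# => ["123", "bar 'baz' qux", "321"]
```

A well-formed line passes through unchanged, so the cleaner can be dropped in front of the parser unconditionally.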
