
Handling fallbacks for unreliable data like quote-delim variance #16

Closed
saizai opened this Issue Jul 11, 2012 · 11 comments

3 participants

@saizai
saizai commented Jul 11, 2012

I'm trying to fix https://github.com/NYTimes/Fech/blob/master/lib/fech/csv.rb

According to the FEC file spec, fields are supposed to be quote delimited.

However, sometimes you get things like this:

SB23,C00121368,CAN,HAYES FOR CONGRESS,PO BOX 2000,,CONCORD,NC,28026,15,PRIMARY,P2004,,20031014,000001000.00,,,ROBERT CANNON "ROBIN" HAYES,H,NC,08,,,,,,,,CONT TO ROBIN HAYES HOUSE NC 08,,03103111504948000013,,,,

Note the within-field quote marks, which the file format spec does not allow; by the robustness principle, though, that's our problem to handle. It's legacy data and can't be changed.
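
For reference, Ruby's stock CSV parser rejects a line like this outright. A minimal sketch, using a shortened row with the same defect (the row here is illustrative, not the full FEC record):

```ruby
require "csv"

# A shortened row with the same defect as the FEC sample above:
# bare quote marks inside an unquoted field.
row = '15,PRIMARY,ROBERT CANNON "ROBIN" HAYES,H'

begin
  CSV.parse_line(row)
rescue CSV::MalformedCSVError
  puts "rejected: illegal quoting"
end
```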

@nytimes' CsvDoctor tries to fix this by manually loading the file, but its logic there is wrong (e.g. it breaks if a field has linebreaks within it, because File.open.each treats linebreaks as the record separator).

However, there doesn't seem to be a better place in FasterCSV to patch in a retry; if I try to wrap #shift, for instance, the retry breaks because the IO has already advanced in a way that can't be reversed. And I'd rather not replicate a large chunk of FasterCSV's file-parsing code just to make this patch.

Any suggestions?

@JEG2
Owner
JEG2 commented Jul 11, 2012

Hmm, the truth is that I probably just wouldn't use a CSV parser for this. The data isn't quite valid CSV. Because of that, I'd want a parser tuned specifically to this data. Does that make sense?

@saizai
saizai commented Jul 12, 2012

The data is at least intended to be CSV — and it would work correctly if there were some way to detect the divergent quoting style at runtime, like CsvDoctor tries to do. It's just dirty data; not everyone complied with the format rules, and now we're stuck with reading it anyway. What else is new, right? :-P

@JEG2
Owner
JEG2 commented Jul 12, 2012

I think something like this parser is what you need:

while line = DATA.gets
  fields = [ ]
  unless line == "\n"
    loop do
      # peel one field off the front: either a quoted field (with "" escapes)
      # or an unquoted field, terminated by a comma or newline
      while line.sub!(/\A("(?:""|[^"]*)+"|(?!")[^,]*)(?:,|\n)/, "")
        if $1.empty?
          fields << nil
        elsif $1.start_with? '"'
          fields << $1[1..-2].gsub('""', '"')
        else
          fields << $1
        end
      end
      break if     line.empty?
      break unless more = DATA.gets  # a quoted field ran past the line break
      line += more
    end
  end
  p fields
end

__END__
,,,
123,"testing a field with
a newline in it"
SB23,C00121368,CAN,HAYES FOR CONGRESS,PO BOX 2000,,CONCORD,NC,28026,15,PRIMARY,P2004,,20031014,000001000.00,,,ROBERT CANNON "ROBIN" HAYES,H,NC,08,,,,,,,,CONT TO ROBIN HAYES HOUSE NC 08,,03103111504948000013,,,,
@JEG2 JEG2 closed this Jul 12, 2012
@saizai
saizai commented Jul 31, 2012

@JEG2 Tried yours, and it fails for the following case:

123,"foo","bar "baz" qux","stuff",321

Note the within-field quotes in bar "baz" qux.

Here's a stupid method that I think should work:

QUOTE_CHAR = '"'; REPLACEMENT_CHAR = "'"; ROW_SEP = "\r\n"; COL_SEP = ","
surrounding_quotes_regex = /(\A\s*#{QUOTE_CHAR}|#{QUOTE_CHAR}\s*#{ROW_SEP}\Z)/
quote_sep_quote_regex = /#{QUOTE_CHAR}\s*#{COL_SEP}\s*#{QUOTE_CHAR}/
line = line.gsub(surrounding_quotes_regex, '').gsub(quote_sep_quote_regex, "\x1E").gsub(QUOTE_CHAR, REPLACEMENT_CHAR)

… basically:
1. strip the opening and closing quotes
2. replace internal quote-sep-quote sequences with ASCII 0x1E (record separator)
3. replace any remaining quotes with the replacement character (double -> single quote)

It works if you assume that you're processing a whole line at a time and that all strings are quoted. Not sure how to modify it to handle problems where #gets gives you a partial line, or where you need to process as you go rather than line-at-once.
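
Wrapped into a method, the sketch might look like this (clean_fields is an invented name; it inherits the same assumptions: one whole physical line per record, and every field quoted):

```ruby
QUOTE_CHAR = '"'; REPLACEMENT_CHAR = "'"; ROW_SEP = "\r\n"; COL_SEP = ","

# Invented wrapper around the three steps above.
def clean_fields(line)
  surrounding_quotes_regex = /(\A\s*#{QUOTE_CHAR}|#{QUOTE_CHAR}\s*#{ROW_SEP}\Z)/
  quote_sep_quote_regex    = /#{QUOTE_CHAR}\s*#{COL_SEP}\s*#{QUOTE_CHAR}/
  cleaned = line.gsub(surrounding_quotes_regex, "")   # 1. strip outer quotes
                .gsub(quote_sep_quote_regex, "\x1E")  # 2. quote-sep-quote -> RS
                .gsub(QUOTE_CHAR, REPLACEMENT_CHAR)   # 3. leftover quotes -> '
  cleaned.split("\x1E")
end

p clean_fields(%("123","bar "baz" qux","321"\r\n))
# => ["123", "bar 'baz' qux", "321"]
```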

It fails if there is inconsistent quoting, though; e.g.:

123,"foo "bar" baz",321
123,foo,"bar",baz "qux",321
123,"foo "bar" baz",1,2,"1,2,3",4

Hmmm.

@JEG2
Owner
JEG2 commented Jul 31, 2012

Yeah, that's the problem with this data. Once you throw out proper escaping there are way too many edge cases. :(

@saizai
saizai commented Jul 31, 2012

@JEG2 Unfortunately, the data is what it is. I don't get to control it, and I do have to import it. I can wish that they had used \x1E separators to begin with, used delimiters consistently, or done other basic validation; then I'd have more sanity. But they didn't, and I'd like to retain at least some. :-P

@JEG2
Owner
JEG2 commented Aug 6, 2012

I hear ya. I'm just not sure this leads to "CSV should support this kind of content." Know what I mean?

@dpritchett

Wouldn't patching FasterCSV lead to interface incompatibilities with 1.9's CSV library anyway? That csv.rb in Fech appears to already pass text into its chosen CSV processor one line at a time; why not drop your pre-processing text cleaner in there instead?
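
A sketch of that pre-processing approach (preclean is an invented name; the heuristic downgrades any quote not flush against a comma to an apostrophe, and assumes the data never uses legitimate "" escapes inside quoted fields):

```ruby
require "csv"

# Invented per-line cleaner: a quote is treated as a real delimiter only
# when it touches a comma or an end of the line; every other quote is
# downgraded to an apostrophe before the stock parser sees the line.
def preclean(line)
  line.chomp.gsub(/(?<=[^,])"(?=[^,])/, "'")
end

p CSV.parse_line(preclean(%(123,"foo","bar "baz" qux","stuff",321\n)))
# => ["123", "foo", "bar 'baz' qux", "stuff", "321"]
```

Against the FEC sample above it would turn "ROBIN" into 'ROBIN', leaving an unquoted field the stock parser accepts.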

@saizai
saizai commented Aug 7, 2012