Handling fallbacks for unreliable data like quote-delim variance #16

Closed
saizai opened this Issue · 11 comments

3 participants

@saizai

I'm trying to fix https://github.com/NYTimes/Fech/blob/master/lib/fech/csv.rb

According to the FEC file spec, fields are supposed to be quote delimited.

However, sometimes you get things like this:

SB23,C00121368,CAN,HAYES FOR CONGRESS,PO BOX 2000,,CONCORD,NC,28026,15,PRIMARY,P2004,,20031014,000001000.00,,,ROBERT CANNON "ROBIN" HAYES,H,NC,08,,,,,,,,CONT TO ROBIN HAYES HOUSE NC 08,,03103111504948000013,,,,

Note the within-field quote marks around "ROBIN", which are not supposed to be there per the file format spec; but the robustness principle means that's our problem. It's legacy data and can't be changed.
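For context, here's a minimal demonstration of the problem with Ruby's stock parser. (Note: the liberal_parsing option shown is from the modern csv library, Ruby 2.4+, which postdates this thread; FasterCSV had no equivalent at the time. The shortened line is illustrative.)

```ruby
require "csv"

line = 'SB23,C00121368,ROBERT CANNON "ROBIN" HAYES,H,NC'

# Strict parsing rejects the stray quotes in an unquoted field...
begin
  CSV.parse_line(line)
rescue CSV::MalformedCSVError => e
  puts "strict parse failed: #{e.class}"
end

# ...while liberal_parsing keeps them as literal characters.
fields = CSV.parse_line(line, liberal_parsing: true)
puts fields[2]  # => ROBERT CANNON "ROBIN" HAYES
```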

@nytimes' CsvDoctor tries to fix this by loading the file manually, but its logic there is wrong (e.g. it breaks if a field contains linebreaks, because File.open.each treats every linebreak as a record separator).

However, there doesn't seem to be a better place in FasterCSV to patch in a retry; if I try to wrap #shift, for instance, the retry would break because the IO has already moved forward in a way that can't be reversed. And I'd rather not replicate a large chunk of FasterCSV's file-parsing code just to make this patch.
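One way around the IO-position problem, sketched below: buffer physical lines into a string and parse each complete record from that string, so a failed parse can fall back to a cleaner without rewinding the IO. A record is assumed complete when it holds an even number of quote characters; that's a heuristic, not a full CSV state machine, and the comma-split fallback is a hypothetical stand-in.

```ruby
require "csv"
require "stringio"

def each_record(io)
  buffer = ""
  io.each_line do |line|
    buffer << line
    next if buffer.count('"').odd? # a quoted field spans the newline; keep buffering
    begin
      yield CSV.parse_line(buffer)
    rescue CSV::MalformedCSVError
      yield buffer.chomp.split(",", -1) # crude fallback for dirty rows
    end
    buffer = ""
  end
end

data = "a,\"multi\nline\",c\n1,foo \"bar\" baz,2\n"
each_record(StringIO.new(data)) { |row| p row }
```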

Any suggestions?

@JEG2
Owner

Hmm, the truth is that I probably just wouldn't use a CSV parser for this. The data isn't quite valid CSV, so I'd want a parser tuned specifically to this data. Does that make sense?

@saizai

The data is at least intended to be CSV, and it would work correctly if there were some way to detect the differing quoting style at runtime, like CsvDoctor tries to do. It's just dirty data; not everyone complied with the format rules, and now we're stuck with having to read it anyway. What else is new, right? :-P

@JEG2
Owner

I think something like this parser is what you need:

while line = DATA.gets
  fields = [ ]
  unless line == "\n"
    loop do
      # Peel one field off the front: either a quoted field (doubled quotes
      # allowed inside) or an unquoted run up to the next comma or newline.
      while line.sub!(/\A("(?:""|[^"]*)+"|(?!")[^,]*)(?:,|\n)/, "")
        if $1.empty?
          fields << nil
        elsif $1.start_with? '"'
          fields << $1[1..-2].gsub('""', '"')  # strip quotes, unescape ""
        else
          fields << $1
        end
      end
      break if     line.empty?
      break unless more = DATA.gets  # an open quote ran past the line break
      line += more
    end
  end
  p fields
end

__END__
,,,
123,"testing a field with
a newline in it"
SB23,C00121368,CAN,HAYES FOR CONGRESS,PO BOX 2000,,CONCORD,NC,28026,15,PRIMARY,P2004,,20031014,000001000.00,,,ROBERT CANNON "ROBIN" HAYES,H,NC,08,,,,,,,,CONT TO ROBIN HAYES HOUSE NC 08,,03103111504948000013,,,,
@JEG2 closed this
@saizai

@JEG2 Tried yours, and it fails for the following case:

123,"foo","bar "baz" qux","stuff",321

Note the within-field quotes in bar "baz" qux.

Here's a stupid method that I think should work:

QUOTE_CHAR = '"'; REPLACEMENT_CHAR = "'"; ROW_SEP = "\r\n"; COL_SEP = ","
surrounding_quotes_regex = /(\A\s*#{QUOTE_CHAR}|#{QUOTE_CHAR}\s*#{ROW_SEP}\Z)/
quote_sep_quote_regex = /#{QUOTE_CHAR}\s*#{COL_SEP}\s*#{QUOTE_CHAR}/
line = line.gsub(surrounding_quotes_regex, '').gsub(quote_sep_quote_regex, "\x1E").gsub(QUOTE_CHAR, REPLACEMENT_CHAR)

… basically:
1. strip the opening and closing quotes
2. replace internal quote-sep-quote sequences with ASCII 0x1E (record separator)
3. replace any remaining quotes with the replacement character (double -> single quote)

It works if you assume that you're processing a whole line at a time and that all strings are quoted. Not sure how to modify it to handle cases where #gets gives you a partial line, or where you need to process as you go rather than a line at a time.
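The three steps above, made self-contained under those same caveats (whole lines, all fields quoted): after cleaning, the fields can be split directly on the \x1E record separator. The sample line is illustrative.

```ruby
QUOTE_CHAR = '"'; REPLACEMENT_CHAR = "'"; ROW_SEP = "\r\n"; COL_SEP = ","
surrounding_quotes_regex = /(\A\s*#{QUOTE_CHAR}|#{QUOTE_CHAR}\s*#{ROW_SEP}\Z)/
quote_sep_quote_regex = /#{QUOTE_CHAR}\s*#{COL_SEP}\s*#{QUOTE_CHAR}/

line = %Q{"foo","bar "baz" qux","stuff"\r\n}
cleaned = line.gsub(surrounding_quotes_regex, "")    # 1. strip outer quotes
              .gsub(quote_sep_quote_regex, "\x1E")   # 2. quote-sep-quote -> RS
              .gsub(QUOTE_CHAR, REPLACEMENT_CHAR)    # 3. leftover " -> '
p cleaned.split("\x1E")  # => ["foo", "bar 'baz' qux", "stuff"]
```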

It fails if there is inconsistent quoting, though; e.g.:

123,"foo "bar" baz",321
123,foo,"bar",baz "qux",321
123,"foo "bar" baz",1,2,"1,2,3",4

Hmmm.

@JEG2
Owner

Yeah, that's the problem with this data. Once you throw out proper escaping there are way too many edge cases. :(

@saizai

@JEG2 Unfortunately, the data is what it is. I don't get to control it, and I do have to import it. I can wish that they had used \x1E separators to begin with, used delimiters consistently, or done other basic validation; then I'd have more sanity. But they didn't, and I'd like to retain at least some. :-P

@JEG2
Owner

I hear ya. I'm just not sure this leads to the conclusion that CSV should support this kind of content. Know what I mean?

@dpritchett

Wouldn't patching FasterCSV lead to interface incompatibilities with 1.9's CSV library anyway? That csv.rb in Fech appears to be passing text into its chosen CSV processor one line at a time already; why not drop your pre-processing text cleaner in there instead?
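That suggestion can be sketched as a thin wrapper: leave the CSV parser stock and bolt a per-line fallback in front of it. The cleaner here (demoting stray double quotes to single quotes on a failed strict parse) is a hypothetical stand-in for whatever repair strategy Fech settles on.

```ruby
require "csv"

def parse_with_fallback(line)
  CSV.parse_line(line)                   # try strict parsing first
rescue CSV::MalformedCSVError
  CSV.parse_line(line.gsub('"', "'"))    # retry with quotes demoted
end

p parse_with_fallback('a,"b,c",d')          # => ["a", "b,c", "d"]
p parse_with_fallback('1,foo "bar" baz,2')  # => ["1", "foo 'bar' baz", "2"]
```

Note the trade-off: the fallback loses real quoting on dirty lines (quoted commas would split), which is the same edge-case swamp discussed above.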
