Failure to detect correct delimiter #746

yakir12 · 2020-10-05T14:26:23Z

I'd argue that the correct delimiter in the following file:

col1;col2;col3
1;a,b,c;3

is a semicolon. Yet:

julia> CSV.read("tmp.csv")
thread = 1 warning: parsed expected 1 columns, but didn't reach end of line around data row: 2. Ignoring any extra columns on this row
1×1 DataFrame
│ Row │ col1;col2;col3 │
│     │ String         │
├─────┼────────────────┤
│ 1   │ 1;a            │

I admit that the data in the second column should probably be surrounded by quotation marks, since it contains commas, and indeed that fixes this issue. But maybe it makes more sense to require quotation marks if and only if a field contains the actual delimiter character (and not just a comma).


quinnj · 2020-10-05T21:24:44Z

Yes, the issue here is that we're not tracking the possible delimiters per row, but rather over the first N rows used for delimiter detection (usually 10). Once we finish scanning those N rows, we check the most common delimiters, in order, to see which ones have count % N == 0. In this case, it's ',' because there were 2 commas across the 2 scanned lines (2 % 2 == 0), and ',' takes precedence over checking for a possible ';' delimiter.
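To make the heuristic described above concrete, here is a minimal sketch in plain Julia. It is not CSV.jl's actual code: the function name `guess_delim` and the candidate order are assumptions for illustration, and the only behavior taken from the comment is "count each candidate over the first N rows, accept the first one with count % N == 0".

```julia
# Hypothetical sketch of the described heuristic; NOT CSV.jl's implementation.
# `candidates` and its order are assumptions; ',' is checked before ';'.
function guess_delim(lines::Vector{String};
                     candidates = (',', '\t', ' ', '|', ';', ':'))
    n = length(lines)
    n == 0 && return nothing
    text = join(lines, '\n')
    for d in candidates
        c = count(==(d), text)      # total occurrences over all sampled rows
        # Accept the first candidate that occurs at all and divides evenly by N.
        if c > 0 && c % n == 0
            return d
        end
    end
    return nothing                  # no candidate satisfied the check
end

# The two-line file from the report: 2 commas over 2 rows, so ',' wins.
guess_delim(["col1;col2;col3", "1;a,b,c;3"])
```

In this sketch ';' would also pass the check (4 % 2 == 0), but it is never reached because ',' comes earlier in the candidate order.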

One thing to remember is that this auto-delimiter detection happens before we necessarily know how many columns there are or whether a row is a header row or not.

In this case, if you had more rows (3-5), it would most likely detect ';' correctly since there probably wouldn't be a consistent # of commas on subsequent lines.
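As a small illustration of why extra rows help, the snippet below tallies ',' and ';' over a three-row variant of the file. The third row and the helper name `tally` are made up for this example; the point is only that the comma count stops being divisible by N once a row without commas is added.

```julia
# Illustrative only: count each candidate delimiter over the sampled rows.
# `tally` is a hypothetical helper, and the third row is invented data.
tally(rows, candidates = (',', ';')) =
    [(d, sum(count(==(d), r) for r in rows)) for d in candidates]

rows = ["col1;col2;col3", "1;a,b,c;3", "2;d;4"]
n = length(rows)

for (d, c) in tally(rows)
    println(repr(d), ": count = ", c, ", count % N = ", c % n)
end
```

Here ',' has 2 occurrences over 3 rows (remainder 2), while ';' has 6 (remainder 0), so a divisibility check like the one described would skip ',' and accept ';'.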

In the end, auto-delimiter detection will always be a bit of an "art" and with so few data points (2 rows), there's just not much we could really do here that would be consistently better for all cases.

quinnj closed this as completed Oct 5, 2020
