I'd argue that the correct delimiter in the following file:
col1;col2;col3
1;a,b,c;3
is a semicolon. Yet:
julia> CSV.read("tmp.csv")
thread =1 warning: parsed expected 1 columns, but didn't reach end of line around data row:2. Ignoring any extra columns on this row
1×1 DataFrame
│ Row │ col1;col2;col3 │
│ │ String │
├─────┼────────────────┤
│ 1 │ 1;a │
I admit that the data in the second column should probably be surrounded by quotation marks (since it contains commas) -- and indeed that fixes this issue, but maybe it makes more sense that fields should be surrounded with quotation marks if and only if they contain the delimiter character (and not just comma).
Yes, the issue here is that we're not tracking the possible delimiters per row, but rather over the first N rows used for delimiter detection (usually 10). Once we finish scanning those N rows, we check the most common delimiters in order to see which ones have count % N == 0. In this case, it's ',' because there were 2 commas, spread over 2 lines, and the comma takes precedence over checking for a possible ';' delimiter.
One thing to remember is that this auto-delimiter detection happens before we necessarily know how many columns there are or whether a row is a header row or not.
In this case, if you had more rows (3-5), it would most likely detect ';' correctly, since there probably wouldn't be a consistent number of commas on subsequent lines.
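To make the heuristic above concrete, here is a minimal sketch in plain Julia. It is not CSV.jl's actual implementation: the function name `guess_delimiter`, the candidate list, and its priority order are assumptions for illustration. The only parts taken from the description above are the `count % N == 0` check and the fact that ',' is tried before ';'.

```julia
# Hypothetical sketch of the detection heuristic; not CSV.jl's real code.
# Assumed candidate delimiters, in assumed priority order (comma first).
const CANDIDATES = [',', '\t', ' ', '|', ';', ':']

function guess_delimiter(lines::Vector{String}; candidates=CANDIDATES)
    n = length(lines)
    n == 0 && return ','
    text = join(lines)
    for d in candidates
        c = count(==(d), text)
        # Accept the first candidate whose total count divides evenly over the scanned rows.
        if c > 0 && c % n == 0
            return d
        end
    end
    return ','  # assumed fallback when no candidate qualifies
end

two_rows = ["col1;col2;col3", "1;a,b,c;3"]
three_rows = [two_rows; "2;d;4"]

guess_delimiter(two_rows)    # 2 commas over 2 rows pass the check, so ',' wins
guess_delimiter(three_rows)  # 2 commas over 3 rows fail; 6 semicolons over 3 rows pass
```

With only the two rows from the report, both ',' (count 2) and ';' (count 4) pass the check, so whichever is tried first wins. One extra row without commas is enough to rule the comma out in this sketch.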
In the end, auto-delimiter detection will always be a bit of an "art" and with so few data points (2 rows), there's just not much we could really do here that would be consistently better for all cases.
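For small files like this one, a practical way around the detection is to pass the delimiter explicitly with the `delim` keyword. The snippet below is a sketch that assumes both CSV.jl and DataFrames.jl are installed; it writes the example file to a temporary path (the path and variable names are illustrative only) and goes through `CSV.File`, which should behave the same way across CSV.jl versions.

```julia
using CSV, DataFrames

# Recreate the two-line file from the report at a temporary location.
path = tempname()
write(path, "col1;col2;col3\n1;a,b,c;3\n")

# An explicit `delim` skips auto-detection, so the commas stay inside the second field.
df = DataFrame(CSV.File(path; delim=';'))
```

This should yield a single row with three columns, with `a,b,c` kept intact in `col2`, and without quoting the field.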