Failure to detect correct delimiter #746

Closed
yakir12 opened this issue Oct 5, 2020 · 1 comment

Comments

yakir12 commented Oct 5, 2020

I'd argue that the correct delimiter in the following file:

col1;col2;col3
1;a,b,c;3

is a semicolon. Yet:

julia> CSV.read("tmp.csv")
thread = 1 warning: parsed expected 1 columns, but didn't reach end of line around data row: 2. Ignoring any extra columns on this row
1×1 DataFrame
│ Row │ col1;col2;col3 │
│     │ String         │
├─────┼────────────────┤
│ 1   │ 1;a            │

I admit that the data in the second column should probably be surrounded by quotation marks, since it contains commas, and doing so does fix the issue. But maybe it makes more sense to require quotation marks if and only if a field contains the actual delimiter character, not whenever it contains a comma.

quinnj (Member) commented Oct 5, 2020

Yes, the issue here is that we're not tracking the possible delimiters per row, but rather over the first N rows used for delimiter detection (usually 10). Once we finish scanning those N rows, we check the most common delimiters, in order, to see which ones have count % N == 0. In this case, it's ',' because there were 2 commas spread over 2 lines, and ',' takes precedence over checking for a possible ';' delimiter.

One thing to remember is that this auto-delimiter detection happens before we necessarily know how many columns there are or whether a row is a header row or not.

In this case, if you had more rows (3-5), it would most likely detect ';' correctly, since there probably wouldn't be a consistent number of commas on subsequent lines.
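For illustration only, the rule described above can be sketched in a few lines of Julia. This is not CSV.jl's actual implementation: the function name `guess_delim`, the candidate list `CANDIDATES`, and its ordering are assumptions made for the example, and quoted fields are ignored entirely.

```julia
# Candidate delimiters in an ASSUMED priority order (comma checked first).
const CANDIDATES = (',', '\t', ' ', '|', ';', ':')

# Simplified sketch: total each candidate over the first `n` lines
# (not per row) and return the first candidate whose total is a
# nonzero multiple of the number of lines scanned.
function guess_delim(text::String; n::Int = 10)
    lines = collect(Iterators.take(eachline(IOBuffer(text)), n))
    nlines = length(lines)
    nlines == 0 && return nothing
    for d in CANDIDATES
        total = sum(count(c -> c == d, line) for line in lines)
        if total > 0 && total % nlines == 0
            return d
        end
    end
    return nothing  # no candidate matched the rule
end

two_rows   = "col1;col2;col3\n1;a,b,c;3\n"
three_rows = two_rows * "2;d;4\n"

println(guess_delim(two_rows))    # 2 commas over 2 lines, so ',' is picked
println(guess_delim(three_rows))  # 2 commas over 3 lines fails, ';' (6 over 3) is picked
```

With only the two rows from the report, the comma total happens to divide evenly and wins on priority. A third row with no commas in it breaks that coincidence, and the semicolon is chosen instead.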

In the end, auto-delimiter detection will always be a bit of an "art" and with so few data points (2 rows), there's just not much we could really do here that would be consistently better for all cases.
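When the delimiter is known ahead of time, detection can be sidestepped by passing the `delim` keyword. A minimal sketch, assuming CSV.jl is installed and using an in-memory buffer in place of `tmp.csv`:

```julia
using CSV

# Same contents as the file in the report, held in memory for the example.
data = "col1;col2;col3\n1;a,b,c;3\n"

# Passing `delim` explicitly means no delimiter guessing is needed.
f = CSV.File(IOBuffer(data); delim=';')

println(propertynames(f))  # column names
println(f.col2[1])         # second field of the first data row
```

The same keyword can be given when reading from a file path instead of a buffer.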

quinnj closed this as completed Oct 5, 2020