Can't read file with quotes in comments #788

aplavin · 2020-11-22T12:59:41Z

Without quotes everything works as expected:

CSV.File(IOBuffer("""
# 1'2
name
junk
1
"""), comment="#", header=2, datarow=4)

# 1-element CSV.File{false}:
# CSV.Row: (name = 1,)

Adding a quote to the first line results in no rows read:

CSV.File(IOBuffer("""
# 1'2"
name
junk
1
"""), comment="#", header=2, datarow=4)

# 0-element CSV.File{false}

I would guess the issue is somewhere around here: https://github.com/JuliaData/CSV.jl/blob/master/src/detection.jl#L202-L217, but don't know if it's possible to easily fix.

quinnj · 2020-11-24T08:12:23Z

Hmmmm, yeah, I can see why this is a bit confusing. The problem is that the commented row essentially "doesn't count" towards the row counts for header and datarow, so, for example, this works:

julia> CSV.File(IOBuffer("""
       # 1'2"
       name
       junk
       1
       """), comment="#", header=1, datarow=3)
1-element CSV.File{false}:
 CSV.Row: (name = 1,)

Maybe we can make this clearer in the documentation that when specifying row numbers, commented rows won't count and should be ignored.

aplavin · 2020-11-24T08:18:44Z

Then I'm even more confused... Note that the only difference between two examples in the first post is the quotation mark, all comments and row indices are the same. So sometimes the rows are counted including comments, sometimes not.

quinnj · 2020-11-24T08:21:09Z

Hmmm.......you're right; there's still something fishy going on here.

aplavin · 2020-11-24T08:24:39Z

Also, the original example where I noticed such a behaviour had many thousands of rows, and none of those were actually read. So I still think the issue is that (sometimes?) quotes in comments are treated as "real" quotes. E.g., see a longer example:

CSV.File(IOBuffer("""
       # 1'2"
       name
       junk
       1
       2
       3
       1
       2
       3
       1
       2
       3
       1
       2
       3
       """), comment="#", header=2, datarow=3)

# 0-element CSV.File{false}

Improves #788. In the original issue, a quote character on a commented row messes the parsing positioning up because it's looking for a closing quote character. By checking for and skipping commented rows, no matter the characters present, we ensure parsing integrity. One ramification of this, however, is that commented rows now "no longer count" when considering row numbers, i.e. when specifying the `header=2` or `datarow=4` keyword arguments, because the commented rows are literally ignored when parsing. This seems fine to me, but probably warrants some documentation so it's clear.
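The failure mode described above can be illustrated with a toy row skipper. This is only a sketch under stated assumptions, not CSV.jl's actual code: `skip_row` is a hypothetical helper, and the real parser is considerably more involved. It shows how a quote-aware scan treats the newline after an unmatched quote as part of a quoted field, while checking for the comment prefix first ends the row at the first newline.

```julia
# Toy model only; `skip_row` is a hypothetical helper, NOT CSV.jl's implementation.
# Returns the index just past the end of the row that starts at `pos`.
function skip_row(buf::AbstractString, pos::Int; quotechar::Char='"', comment=nothing)
    # Comment-aware path: a commented row ends at the first newline,
    # no matter which characters (including quotes) it contains.
    if comment !== nothing && startswith(SubString(buf, pos), comment)
        nl = findnext('\n', buf, pos)
        return nl === nothing ? lastindex(buf) + 1 : nl + 1
    end
    # Quote-aware path: a newline inside an open quote does not end the row.
    inquote = false
    i = pos
    while i <= lastindex(buf)
        c = buf[i]
        if c == quotechar
            inquote = !inquote
        elseif c == '\n' && !inquote
            return i + 1
        end
        i = nextind(buf, i)
    end
    return lastindex(buf) + 1   # ran off the end of the buffer
end

data = "# 1'2\"\nname\njunk\n1\n"

# Without the comment check, the unmatched quote swallows every remaining row.
naive = skip_row(data, 1)

# With the comment check, the next row starts right after the first newline.
aware = skip_row(data, 1; comment="#")
```

In this toy model `naive` points past the end of the buffer, which mirrors the "0-element" result reported above, while `aware` points at the start of `name`.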

quinnj · 2020-11-24T08:31:27Z

Ok, so with the change/fix in #789, I get consistent results in that commented rows are completely ignored and don't count towards "row number". I'm trying to think through whether that's fine or really confusing for users.

aplavin · 2020-11-24T08:39:51Z

I would say consistency is important, both within the library and with other common implementations. Currently, as I understand it, empty lines are not counted in CSV.jl when ignoreemptylines=true; pandas.read_csv works similarly with regard to empty or commented lines.

quinnj · 2020-11-24T21:10:42Z

Ok, I've updated the PR and actually went the other direction: commented rows and empty rows do count when considering header and datarow, because I think it ends up being more intuitive. If you need to specify header/data row arguments, you've obviously visually looked at the file somehow, so having these arguments correspond more naturally to what you would see visually seems to make the most sense. In the case of the user specifying a header/data row and that row happens to be commented/empty, then we'll skip until the next non-commented/non-empty row to use as the header/data.
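The rule described above (row numbers refer to physical lines, and a commented or empty target row falls through to the next usable row) can be sketched as follows. `resolve_row` and `iscommented_or_empty` are hypothetical names used only for illustration; they are not part of CSV.jl, and the library's real behavior may differ in details.

```julia
# Hypothetical helpers illustrating the rule described above; not part of CSV.jl.
iscommented_or_empty(line, comment) = isempty(line) || startswith(line, comment)

# Given a requested physical row number `n`, return the index of the first
# row at or after `n` that is neither commented nor empty, or `nothing`.
function resolve_row(lines::Vector{<:AbstractString}, n::Int; comment::AbstractString="#")
    for i in n:length(lines)
        iscommented_or_empty(lines[i], comment) || return i
    end
    return nothing
end

lines = ["# 1'2\"", "name", "junk", "1"]

header_row = resolve_row(lines, 2)   # physical row 2 holds the header
data_row   = resolve_row(lines, 4)   # physical row 4 holds the first data row
fallback   = resolve_row(lines, 1)   # row 1 is a comment, so the next usable row is picked
```

Under this rule the `header=2, datarow=4` arguments from the original report line up with what is visible in the file, comment line included.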

quinnj mentioned this issue Nov 24, 2020

Ensure we check for commented rows when skipping rows for header/data #789

Merged

quinnj closed this as completed Nov 24, 2020

aplavin mentioned this issue Jul 9, 2022

Can't skip rows with quote #1012

Open
