Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Can't read file with quotes in comments #788

Closed
aplavin opened this issue Nov 22, 2020 · 7 comments
Closed

Can't read file with quotes in comments #788

aplavin opened this issue Nov 22, 2020 · 7 comments

Comments

@aplavin
Copy link

aplavin commented Nov 22, 2020

Without quotes everything works as expected:

CSV.File(IOBuffer("""
# 1'2
name
junk
1
"""), comment="#", header=2, datarow=4)

# 1-element CSV.File{false}:
# CSV.Row: (name = 1,)

Adding a quote to the first line results in no rows read:

CSV.File(IOBuffer("""
# 1'2"
name
junk
1
"""), comment="#", header=2, datarow=4)

# 0-element CSV.File{false}

I would guess the issue is somewhere around here: https://github.com/JuliaData/CSV.jl/blob/master/src/detection.jl#L202-L217, but don't know if it's possible to easily fix.

@quinnj
Copy link
Member

quinnj commented Nov 24, 2020

Hmmmm, yeah, I can see why this is a bit confusing. The problem is that the commented row essentially "doesn't count" towards the row counts for header and datarow, so, for example, this works:

julia> CSV.File(IOBuffer("""
       # 1'2"
       name
       junk
       1
       """), comment="#", header=1, datarow=3)
1-element CSV.File{false}:
 CSV.Row: (name = 1,)

Maybe we can make this clearer in the documentation that when specifying row numbers, commented rows won't count and should be ignored.

@aplavin
Copy link
Author

aplavin commented Nov 24, 2020

Then I'm even more confused... Note that the only difference between two examples in the first post is the quotation mark, all comments and row indices are the same. So sometimes the rows are counted including comments, sometimes not.

@quinnj
Copy link
Member

quinnj commented Nov 24, 2020

Hmmm.......you're right; there's still something fishy going on here.

@aplavin
Copy link
Author

aplavin commented Nov 24, 2020

Also, the original example where I noticed such a behaviour had many thousands of rows, and none of those were actually read. So I still think the issue is that (sometimes?) quotes in comments are treated as "real" quotes. E.g., see a longer example:

CSV.File(IOBuffer("""
       # 1'2"
       name
       junk
       1
       2
       3
       1
       2
       3
       1
       2
       3
       1
       2
       3
       """), comment="#", header=2, datarow=3)

# 0-element CSV.File{false}

quinnj added a commit that referenced this issue Nov 24, 2020
Improves #788. In the original issue, a quote character on a commented
row messes the parsing positioning up because it's looking for a closing
quote character. By checking for and skipping commented rows, no matter
the characters present, we ensure parsing integrity. One ramification of
this, however, is that commented rows now "no longer count" when
considering row numbers, i.e. when specifying the `header=2` or
`datarow=4` keyword arguments, because the commented rows are literally
ignored when parsing. This seems fine to me, but probably warrants some
documentation so it's clear.
@quinnj
Copy link
Member

quinnj commented Nov 24, 2020

Ok, so with the change/fix in #789, I get consistent results in that commented rows are completely ignored and dont' count towards "row number". I'm trying to think through whether that's fine or really confusing for users.

@aplavin
Copy link
Author

aplavin commented Nov 24, 2020

I would say consistency is important, both within the library and to other common implementations. For now empty lines are not counted in CSV.jl when the ignoreemptylines=true, as I understand; pandas.read_csv works similarly as well, in regards to empty or commented lines.

@quinnj
Copy link
Member

quinnj commented Nov 24, 2020

Ok, I've updated the PR and actually went the other direction: commented rows and empty rows do count when considering header and datarow, because I think it ends up being more intuitive. If you need to specify header/data row arguments, you've obviously visually looked at the file somehow, so having these arguments correspond more naturally to what you would see visually seems to make the most sense. In the case of the user specifying a header/data row and that row happens to be commented/empty, then we'll skip until the next non-commented/non-empty row to use as the header/data.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants