
io.read_csv: ValueError due to long lines #12203

Closed
dacoex opened this issue Feb 2, 2016 · 6 comments
Labels: API Design, IO CSV read_csv, to_csv

@dacoex
Contributor

dacoex commented Feb 2, 2016

I am trying to read a csv file where some lines are longer than the rest.

The pandas C engine throws an error on these lines.
But I do not want to skip them, as discussed in a comment on a similar issue.

I would prefer to "cut" the "bad" columns off using usecols, but I get the following error:

df = pd.read_csv(file_path, sep=',', skiprows=1, header=None,
                 usecols=range(0, 23), nrows=None,
                 engine='python')

Throws:

ValueError: Expected 10 fields in line 100, saw 20

Why is the ValueError raised although I explicitly defined the columns to use?
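
(For reference, a workaround sketch rather than an answer to the question above: declare more column names than the widest line can contain and slice afterwards, which is the same approach the longer example in the next comment arrives at. The upper bound of 30 fields is a made-up placeholder.)

import pandas as pd

# Made-up upper bound on how many fields the widest line can have.
max_fields = 30

# With names covering the widest possible line, the python engine pads
# shorter rows with NaN instead of raising on the longer ones.
df = pd.read_csv(file_path, sep=',', skiprows=1, header=None,
                 names=list(map(str, range(max_fields))),
                 engine='python')
df = df.iloc[:, :23]  # keep only the first 23 columns, as usecols intended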

References:

@dacoex
Contributor Author

dacoex commented Feb 2, 2016

Here's a longer but self-contained example, based on pandas read_csv and filter columns with usecols:


import io

import pandas as pd

# works
csv = r"""dummy,date,loc,x
bar,20090101,a,1
bar,20090102,a,3
bar,20090103,a,5
bar,20090101,b,1
bar,20090102,b,3
bar,20090103,b,5"""

df = pd.read_csv(io.StringIO(csv),
                 header=0,
                 sep=",",
                 index_col=["date", "loc"],
                 usecols=["date", "loc", "x"],
                 parse_dates=["date"])
df2 = pd.read_csv(io.StringIO(csv),
                  header=None,
                  sep=",",
                  usecols=[0, 2, 3])

# throws value error
# ValueError: Expected 4 fields in line 5, saw 6

csv = r"""dummy,date,loc,x
bar,20090101,a,1
bar,20090102,a,3
bar,20090103,a,5
bar,20090101,b,1,2,2
bar,20090102,b,3
bar,20090103,b,5"""

df2 = pd.read_csv(io.StringIO(csv),
                  header=None,
                  sep=",",
                  usecols=[0, 2, 3],
                  engine='python')

# works

csv = r"""dummy,date,loc,x
bar,20090101,a,1
bar,20090102,a,3
bar,20090103,a,5
bar,20090101,b,1,2,2
bar,20090102,b,3
bar,20090103,b,5"""

mycols = ['col_1', 'col_2', 'col_3', 'col_4', 'col_5', 'col_6']
df3 = pd.read_csv(io.StringIO(csv),
                  header=None,
                  sep=",",
                  names=mycols,
                  engine='python')


# works but inefficient

csv = r"""dummy,date,loc,x
bar,20090101,a,1
bar,20090102,a,3
bar,20090103,a,5
bar,20090101,b,1,2,2
bar,20090102,b,3
bar,20090103,b,5"""

mycols = list(map(str, range(6)))
df4 = pd.read_csv(io.StringIO(csv),
                  header=None,
                  sep=",",
                  names=mycols,
                  engine='python')

df4_cut = df4.iloc[:,:4]

# improved parametrised solution:

csv = r"""dummy,date,loc,x
bar,20090101,a,1
bar,20090102,a,3
bar,20090103,a,5
bar,20090101,b,1,2,2
bar,20090102,b,3
bar,20090103,b,5"""

col_expected = 4
col_saw = 6
mycols = list(map(str, range(col_saw)))
df5 = pd.read_csv(io.StringIO(csv),
                  header=None,
                  sep=",",
                  names=mycols,
                  engine='python')

df5_cut = df5.iloc[:,:col_expected]

So why does the version where the columns are selected via names work, while the version where only usecols is given does not?
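
(My reading of the difference, not verified against the parser internals: with explicit names the expected field count is len(names), so shorter rows are simply NaN-padded, whereas with usecols alone the parser still infers the row width from the first rows and raises as soon as a longer row appears. A minimal side-by-side sketch:)

import io

import pandas as pd

csv = """bar,20090101,a,1
bar,20090101,b,1,2,2
bar,20090102,b,3"""

# Six names cover the widest row, so parsing succeeds and short rows are NaN-padded.
wide = pd.read_csv(io.StringIO(csv), header=None,
                   names=list(map(str, range(6))), engine='python')
print(wide.iloc[:, :4])

# With usecols alone the width is still inferred from the first row (4 fields),
# so the 6-field row raises.
try:
    pd.read_csv(io.StringIO(csv), header=None,
                usecols=[0, 2, 3], engine='python')
except ValueError as exc:
    print(exc)  # e.g. "Expected 4 fields in line 2, saw 6"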

@jreback
Contributor

jreback commented Feb 2, 2016

looks like a dupe of this: #9755

pls confirm

@jreback jreback added Duplicate Report Duplicate issue or pull request IO CSV read_csv, to_csv labels Feb 2, 2016
@dacoex
Contributor Author

dacoex commented Feb 3, 2016

I cannot confirm.

The reason: in all the examples in #9755, names and usecols were given together.

In my case, it's data without any column names.

So I would like to be able to specify the columns to be used without using names.

@jreback
Contributor

jreback commented Feb 3, 2016

ok, i'll mark it. though I suspect its a very similar soln. pull-requests welcome.

@jreback jreback added API Design Difficulty Intermediate and removed Duplicate Report Duplicate issue or pull request labels Feb 3, 2016
@jreback jreback added this to the Next Major Release milestone Feb 3, 2016
@dacoex
Contributor Author

dacoex commented Feb 4, 2016

I noted that nrows is even more sensitive

@dacoex
Contributor Author

dacoex commented Feb 4, 2016

The whole problem is that usecols only derives a subset of an already cleanly parsed csv file.

E.g. when reading


1,2,3
1,2,3
1,2,3
1,2,3
1,2,3

usecols=[0, 1] would allow getting a df with column 1 and column 2.

What I, and others on SO, seek is to limit the number of columns that are parsed initially.

e.g.

1,2,3
1,2,3
1,2,3,4,5,6,7,8,9,0
1,2,3
1,2,3,4,5,6,7,8,9,0,,4,5,6,7,8,9,0

In this case, usecols does not apply and does not work.
The different lengths of the two bad rows confuse the reader completely.

But real-world data, often from dataloggers, produces such artefacts on a random basis...
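
(One way to get that effect today is to trim the raw lines before pandas ever sees them; a minimal preprocessing sketch, not built-in pandas behaviour, with a made-up target width of 3 columns:)

import io

import pandas as pd

raw = """1,2,3
1,2,3
1,2,3,4,5,6,7,8,9,0
1,2,3
1,2,3,4,5,6,7,8,9,0,,4,5,6,7,8,9,0"""

n_cols = 3  # made-up: how many leading columns we actually want

# Truncate every line to its first n_cols fields before handing it to pandas.
trimmed = "\n".join(",".join(line.split(",")[:n_cols])
                    for line in raw.splitlines())

df = pd.read_csv(io.StringIO(trimmed), header=None)
print(df)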

gfyoung added a commit to forking-repos/pandas that referenced this issue Mar 18, 2016