
read_csv() & extra trailing comma(s) cause parsing issues. #2886

Closed
dragoljub opened this issue Feb 17, 2013 · 12 comments
Labels
Docs IO Data IO issues that don't fit into a more specific label
Comments

@dragoljub

I have run into a few opportunities to further improve the wonderful read_csv() function. I'm using the latest x64 0.10.1 build from 10-Feb-2013 16:52.

Symptoms:

  1. With extra trailing commas and setting index_col=False read_csv() fails with: IndexError: list index out of range
  2. When one or more CSV rows has additional trailing commas (compared to previous rows) read_csv() fails with: CParserError: Error tokenizing data. C error: Expected 5 fields in line 3, saw 6

I believe fixing these exceptions would simply require ignoring any extra trailing commas. This is how many CSV readers work, such as Excel when opening CSVs. In my case I regularly work with 100K-line CSVs that occasionally have extra trailing columns, causing read_csv() to fail. Perhaps it's possible to have an option to ignore trailing commas, or even better an option to ignore/skip any malformed rows without raising a terminal exception. :)

-Gagi

If a CSV has 'n' matched extra trailing columns and you do not specify any index_col, the parser will correctly assume that the first 'n' columns are the index; if you set index_col=False, it fails with: IndexError: list index out of range

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: import StringIO

In [4]: pd.__version__
Out[4]: '0.10.1'

In [5]: data = 'a,b,c\n4,apple,bat,,\n8,orange,cow,,' <-- Matched extra commas

In [6]: data2 = 'a,b,c\n4,apple,bat,,\n8,orange,cow,,,' <-- Mismatched extra commas

In [7]: print data
a,b,c
4,apple,bat,,
8,orange,cow,,

In [8]: print data2
a,b,c
4,apple,bat,,
8,orange,cow,,,

In [9]: df = pd.read_csv(StringIO.StringIO(data))

In [10]: df
Out[10]:             a   b   c
4 apple   bat NaN NaN
8 orange  cow NaN NaN

In [11]: df.index
Out[11]: MultiIndex
[(4, apple), (8, orange)]

In [12]: df2 = pd.read_csv(StringIO.StringIO(data), index_col=False)
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-12-4d7ece45eef3> in <module>()
----> 1 df2 = pd.read_csv(StringIO.StringIO(data), index_col=False)

C:\Python27\lib\site-packages\pandas\io\parsers.pyc in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, nrows, iterator, chunksize, verbose, encoding, squeeze)
    397                     buffer_lines=buffer_lines)
    398
--> 399         return _read(filepath_or_buffer, kwds)
    400
    401     parser_f.__name__ = name

C:\Python27\lib\site-packages\pandas\io\parsers.pyc in _read(filepath_or_buffer, kwds)
    213         return parser
    214
--> 215     return parser.read()
    216
    217 _parser_defaults = {

C:\Python27\lib\site-packages\pandas\io\parsers.pyc in read(self, nrows)
    629             #     self._engine.set_error_bad_lines(False)
    630
--> 631         ret = self._engine.read(nrows)
    632
    633         if self.options.get('as_recarray'):

C:\Python27\lib\site-packages\pandas\io\parsers.pyc in read(self, nrows)
    952
    953         try:
--> 954             data = self._reader.read(nrows)
    955         except StopIteration:
    956             if nrows is None:

C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader.read (pandas\src\parser.c:5915)()

C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader._read_low_memory (pandas\src\parser.c:6132)()

C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader._read_rows (pandas\src\parser.c:6946)()

C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader._convert_column_data (pandas\src\parser.c:7670)()

C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader._get_column_name (pandas\src\parser.c:10545)()

IndexError: list index out of range

In [13]: df3 = pd.read_csv(StringIO.StringIO(data2), index_col=False)
---------------------------------------------------------------------------
CParserError                              Traceback (most recent call last)
<ipython-input-13-441bfd4aff6e> in <module>()
----> 1 df3 = pd.read_csv(StringIO.StringIO(data2), index_col=False)

C:\Python27\lib\site-packages\pandas\io\parsers.pyc in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, nrows, iterator, chunksize, verbose, encoding, squeeze)
    397                     buffer_lines=buffer_lines)
    398
--> 399         return _read(filepath_or_buffer, kwds)
    400
    401     parser_f.__name__ = name

C:\Python27\lib\site-packages\pandas\io\parsers.pyc in _read(filepath_or_buffer, kwds)
    213         return parser
    214
--> 215     return parser.read()
    216
    217 _parser_defaults = {

C:\Python27\lib\site-packages\pandas\io\parsers.pyc in read(self, nrows)
    629             #     self._engine.set_error_bad_lines(False)
    630
--> 631         ret = self._engine.read(nrows)
    632
    633         if self.options.get('as_recarray'):

C:\Python27\lib\site-packages\pandas\io\parsers.pyc in read(self, nrows)
    952
    953         try:
--> 954             data = self._reader.read(nrows)
    955         except StopIteration:
    956             if nrows is None:

C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader.read (pandas\src\parser.c:5915)()

C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader._read_low_memory (pandas\src\parser.c:6132)()

C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader._read_rows (pandas\src\parser.c:6734)()

C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader._tokenize_rows (pandas\src\parser.c:6619)()

C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.raise_parser_error (pandas\src\parser.c:17023)()

CParserError: Error tokenizing data. C error: Expected 5 fields in line 3, saw 6
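Both failures above can be worked around by normalizing each row to the header's field count before handing the text to read_csv(). A minimal Python 3 pre-processing sketch using only the standard library; normalize_rows is a hypothetical helper, not a pandas API:

```python
import csv
import io

def normalize_rows(text):
    # Hypothetical helper: pad or truncate every row to the header's field
    # count so trailing commas no longer confuse the parser.
    rows = list(csv.reader(io.StringIO(text)))
    width = len(rows[0])  # the header fixes the expected field count
    fixed = [row[:width] + [''] * (width - len(row)) for row in rows]
    out = io.StringIO()
    csv.writer(out, lineterminator='\n').writerows(fixed)
    return out.getvalue()

data2 = 'a,b,c\n4,apple,bat,,\n8,orange,cow,,,'
print(normalize_rows(data2))  # a,b,c / 4,apple,bat / 8,orange,cow
```

The cleaned string can then be passed to pd.read_csv() via StringIO without tripping the field-count check.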
@wesm
Member

wesm commented Mar 12, 2013

This isn't a totally trivial problem. I'll take a look into it, though.

@davidjameshumphreys
Contributor

With the mismatched trailing commas ('a,b,c\n4,apple,bat,,\n8,orange,cow,,,'),
if I call read_csv(..., error_bad_lines=False) and omit index_col=False,
it will keep processing the data but will drop the bad line.

If index_col=False is added, it fails with the error described in 1 above.

I have a similar issue processing files where the last field is freeform text and the separator is sometimes included.

I have two different workarounds depending on the situation:

  1. if it is just trailing commas then I pre-process the data to remove the commas.
  2. For the case where the final field contains the separator I pre-process the file using:

perl -pne '$_ = join("|", split(/,/, $_, 4) )' < input.csv > output.csv

where 4 is the maximum number of fields that I want to keep. This option requires a second delimiter: all valid , up to the 4th field are changed to |, and any , after the 4th field is left unchanged.

I hope this workaround helps.
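The same maxsplit trick can be done in Python without a pass through perl. A small sketch; rejoin is a hypothetical helper named here for illustration:

```python
def rejoin(line, max_fields=4, out_sep='|'):
    # str.split with maxsplit=max_fields-1 keeps any commas beyond the last
    # real field inside that field, mirroring perl's split(/,/, $_, 4).
    return out_sep.join(line.split(',', max_fields - 1))

line = 'id,name,2013-02-17,free text, with, commas'
print(rejoin(line))  # id|name|2013-02-17|free text, with, commas
```

The resulting file can then be read with read_csv(..., sep='|').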

@jreback
Contributor

jreback commented Mar 22, 2013

@davidjameshumphreys if you would like to do a PR to edit http://pandas.pydata.org/pandas-docs/dev/cookbook.html#csv (doc/source/cookbook.rst) and add a link to this issue, that would be great!

@dragoljub
Author

Wow, I had no idea read_csv() had the option error_bad_lines=False. That will save me some headaches in the future for sure! It's documented, but not explicitly listed in the top argument section, which is why I never found it.

davidjameshumphreys added a commit to davidjameshumphreys/pandas that referenced this issue Mar 22, 2013: "Added link to CSV section referencing issue pandas-dev#2886 where read_csv fails with badly aligned data."

@jreback should I expand the text to describe other scenarios?
@jreback
Contributor

jreback commented Mar 22, 2013

@dragoljub given the error_bad_lines option (which @davidjameshumphreys is documenting in the correct place, as well as this recipe), do you think anything needs to be added to read_csv, or should we just close this?

@dragoljub
Author

I think it would be a nice feature for the parser to ignore extra trailing commas so this can be turned into a feature request rather than an issue. Since we can pre-process to drop extra commas or ignore malformed lines we can close this as an issue.

@tbicr

tbicr commented Dec 12, 2013

Does pandas have any ability to process bad lines, or at least to get a list of them?

@jreback
Contributor

jreback commented Dec 12, 2013

@tbicr bad lines can be ignored with the option error_bad_lines=False; I don't think processing them will work, as they are bad! However, it may be possible to return them / their line numbers or such. Pls open a new issue and suggest an API for this (an example would be great)!
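A list of offending line numbers can already be recovered outside pandas with a quick standard-library scan. A minimal sketch; bad_line_numbers is a hypothetical helper, not a pandas API:

```python
import csv
import io

def bad_line_numbers(text, expected=None):
    # Hypothetical helper: return 1-based line numbers whose field count
    # differs from the header's (or from `expected`, if given).
    bad = []
    for i, row in enumerate(csv.reader(io.StringIO(text)), start=1):
        if expected is None:
            expected = len(row)  # the first row fixes the expected width
            continue
        if len(row) != expected:
            bad.append(i)
    return bad

data2 = 'a,b,c\n4,apple,bat,,\n8,orange,cow,,,'
print(bad_line_numbers(data2))  # [2, 3] -- both rows exceed the 3-column header
```

The bad lines themselves can then be inspected or repaired before re-reading the file with read_csv().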

@BastinRobin

I sorted this issue with use of the csv module. My code looks like:

import csv
import pandas as pd
import numpy as np

df = pd.read_csv('filename.csv', parse_dates=True, dtype=object, delimiter="\t", quoting=csv.QUOTE_NONE, encoding='utf-8')

@Ianna1009

@davidjameshumphreys I met this same problem today and tried your second pre-processing method. Changing the valid leading , to | worked perfectly. Thanks!

@ahowl7mx

Have the same "extra comma" problem in my CSVs due to a free-form field. Another way, besides adding error_bad_lines=False, is to read the CSV by columns instead of by rows: replace error_bad_lines with usecols=[0,1,2,3,4,5,...]. The problem is that, for lines with the "extra comma", the extra field falls outside the usecols range. Haven't figured out how to isolate the problem lines yet.

@musicpiano

I sorted this issue with use of the csv module. My code looks like:

import csv
import pandas as pd
import numpy as np

df = pd.read_csv('filename.csv', parse_dates=True, dtype=object, delimiter="\t", quoting=csv.QUOTE_NONE, encoding='utf-8')

I have a table with the same issue, but the following command works!
pd.read_table('filename.txt',encoding='cp1252',sep='\t',quoting=3,low_memory=False)
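The quoting=3 argument above is just the numeric value of csv.QUOTE_NONE, so it is equivalent to the quoting=csv.QUOTE_NONE used earlier in this thread: quote characters are treated as ordinary data rather than as field delimiters. A quick check:

```python
import csv

# quoting=3 passed to read_csv/read_table is the numeric value of
# csv.QUOTE_NONE, which disables quote handling in the tokenizer.
print(csv.QUOTE_NONE)  # 3
```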
