Problems reading TSV files which contain a row error #1840

@danimad

Hi Arun,

We met at the SatRday in Budapest. Here is the problem I was talking to you about.
I attached an anonymized sample of the data with the row that contains the error. The characters in the data are completely replaced with "a"s and "1"s, and the dates are replaced with a random date, but the underlying file structure is the same: it is a tab-separated file with no quotation marks used to delimit the data in the columns, and one row is broken, so its data is placed in two separate rows. There are 11 rows and 51 columns in the sample file, and the error is in the 6th and 7th rows.
I think you were right that the problem is likely with the end-of-line character. The error message reads:

Error in fread("data/dt_anonymized_test.txt") : 
  Expected sep ('   ') but new line or EOF ends field 39 on line 6 when reading data: 1970.03.24    1111111111111   aaaaaaaaaaa aaaaaaaaaa  aaa aaaa aaaa.  aaaa11  1970.03.24  aaaaaa aaaaaa   aa  1970.03.24  1970.03.24  1111-1111111            aaa aaaa    1111111.11  111111.11   111111.11   1111111 111111  111111  1.11    1   1   1   1111111 111111  111111  1111111 111111  111111  1111111 1111111 1111111 111111  111 1111111111111111    1111111111111.11.11

In addition, the sample file gives the following warning:

In addition: Warning message:
In fread("data/dt_anonymized_test.txt") :
  Bumped column 39 to type character on data row 6, field contains '1111111111111.11.11'. Coercing previously read values in this column from logical, integer or numeric back to character which may not be lossless; e.g., if '00' and '000' occurred before they will now be just '0', and there may be inconsistencies with treatment of ',,' and ',NA,' too (if they occurred in this column before the bump). If this matters please rerun and set 'colClasses' to 'character' for this column. Please note that column type detection uses the first 5 rows, the middle 5 rows and the last 5 rows, so hopefully this message should be very rare. If reporting to datatable-help, please rerun and include the output from verbose=TRUE.

I included the latter because the real data also gives this kind of warning, but I think that part is less important for me.
I use data.table version 1.9.6 from CRAN.
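For reference, the broken row can also be located without fread by counting the tab-separated fields on each line. This is only a minimal sketch, assuming base R's `count.fields` from the utils package; `find_broken_rows` is a hypothetical helper name, and the temporary file is a small synthetic stand-in with 3 columns, not the attached sample:

```r
# Hypothetical helper: report physical lines whose field count differs
# from the field count of the first line (assumed to be the header).
find_broken_rows <- function(path, sep = "\t") {
  # quote = "" and comment.char = "" so that quote characters and '#'
  # are treated as ordinary data, matching an unquoted TSV file.
  # blank.lines.skip = FALSE keeps positions aligned with file lines.
  n <- utils::count.fields(path, sep = sep, quote = "",
                           comment.char = "", blank.lines.skip = FALSE)
  expected <- n[1]
  bad <- which(n != expected)
  data.frame(line = bad,
             fields = n[bad],
             expected = rep(expected, length(bad)))
}

# Synthetic example: the second data row is split across two lines.
tmp <- tempfile(fileext = ".txt")
writeLines(c("a\tb\tc",
             "1\t2\t3",
             "4\t5",
             "6",
             "7\t8\t9"), tmp)

broken <- find_broken_rows(tmp)
print(broken)
```

On the real file, calling the same helper with the path to the data should point at the two physical lines that together hold the split record, assuming every intact row has as many fields as the first line.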
Thanks in advance for your help!

dt_anonymized_test.txt
