Parsing fails with long strings #1009

CarlColglazier · 2022-06-20T15:15:47Z

Replication

Using this test file saved as "test.csv"
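
The attached file is not reproduced in this text. As a stand-in, a file with a single cell longer than the 1048575 limit named in the error below could be generated roughly like this. This is only a sketch: the column name and cell contents are made up, and the result may not match the original attachment.

```julia
# Hypothetical stand-in for the attached test file: one column, one oversized cell.
path = "test.csv"
cell = repeat("a", 1_100_000)   # longer than the 1_048_575 limit in the error message

open(path, "w") do io
    println(io, "text")         # header row (made-up column name)
    println(io, cell)           # single data row holding the long cell
end

println(filesize(path))         # size in bytes of the generated file
```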

Run the following to try to read it:

using CSV, DataFrames
CSV.read("test.csv", DataFrame, ntasks=1)

Which gives the following error:

ERROR: ArgumentError: length argument to Parsers.PosLen (1100002) is too large; max length allowed is 1048575
Stacktrace:
  [1] lentoolarge(len::Int64)
    @ Parsers ~/.julia/packages/Parsers/KmPKe/src/utils.jl:302
  [2] PosLen
    @ ~/.julia/packages/Parsers/KmPKe/src/utils.jl:306 [inlined]
  [3] xparse(::Type{String}, source::Vector{UInt8}, pos::Int64, len::Int64, options::Parsers.Options, ::Type{Parsers.PosLen})
    @ Parsers ~/.julia/packages/Parsers/KmPKe/src/strings.jl:289
  [4] xparse
    @ ~/.julia/packages/Parsers/KmPKe/src/strings.jl:3 [inlined]
  [5] detectcell(buf::Vector{UInt8}, pos::Int64, len::Int64, row::Int64, rowoffset::Int64, i::Int64, col::CSV.Column, ctx::CSV.Context, rowsguess::Int64)
    @ CSV ~/.julia/packages/CSV/jFiCn/src/file.jl:739
  [6] parserow
    @ ~/.julia/packages/CSV/jFiCn/src/file.jl:598 [inlined]
  [7] parsefilechunk!(ctx::CSV.Context, pos::Int64, len::Int64, rowsguess::Int64, rowoffset::Int64, columns::Vector{CSV.Column}, #unused#::Type{Tuple{}})
    @ CSV ~/.julia/packages/CSV/jFiCn/src/file.jl:551
  [8] CSV.File(ctx::CSV.Context, chunking::Bool)
    @ CSV ~/.julia/packages/CSV/jFiCn/src/file.jl:291
  [9] File
    @ ~/.julia/packages/CSV/jFiCn/src/file.jl:226 [inlined]
 [10] #File#25
    @ ~/.julia/packages/CSV/jFiCn/src/file.jl:222 [inlined]
 [11] read(source::String, sink::Type; copycols::Bool, kwargs::Base.Pairs{Symbol, Int64, Tuple{Symbol}, NamedTuple{(:ntasks,), Tuple{Int64}}})
    @ CSV ~/.julia/packages/CSV/jFiCn/src/CSV.jl:91
 [12] top-level scope
    @ REPL[4]:1


nickrobinson251 · 2022-06-20T15:23:16Z

yeah, this is a tricky one -- some discussion about it here: #935 and JuliaData/Parsers.jl#98

dlakelan · 2022-08-02T05:00:04Z

Ugh, this is really bad and it happens even without enormously long lines, just big files...

Here's the Census 2020 ACS household data

https://www2.census.gov/programs-surveys/acs/experimental/2020/data/pums/1-Year/csv_hus.zip

unzip it and try to read the second large file:

df = CSV.read("psam_husb.csv", DataFrame)

You'll get one of these parse errors. Lines in this file are like a few thousand characters, not hundreds of thousands of characters. But there are 645744 lines in the file.

Is there a workaround here?

quinnj · 2022-08-02T05:44:39Z

@dlakelan, it sounds to me like there might be some bad quoting in your file. The limits when you would hit this bug are:

Greater than ~1MB for an individual cell value
Greater than ~4.4TB for entire file size
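
Both limits line up with powers of two, which is what one would expect if a cell's position and length are packed into fixed-width bit fields. The sketch below only does the arithmetic. The 20-bit and 42-bit widths are an assumption inferred from the numbers quoted in this thread, not something stated here.

```julia
# Assumed field widths, inferred from the limits quoted in this thread.
LEN_BITS = 20
POS_BITS = 42

max_len = 2^LEN_BITS - 1   # 1048575, the "max length allowed" in the error message
max_pos = 2^POS_BITS       # 4398046511104 bytes, roughly 4.4 TB

println(max_len)
println(max_pos / 1e12)    # size in units of 10^12 bytes
```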

If, however, there was a cell that started with "some text ... but had no terminating " character, then parsing will continue until EOF looking for the closing ", and the resulting "cell" can easily exceed the ~1MB limit.
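
One rough way to look for the kind of unterminated quote described above is to count quote characters on each line and flag lines with an odd count. This is a sketch, not part of CSV.jl: `odd_quote_lines` is a hypothetical helper, and it assumes that " is the quote character and that no quoted field legitimately spans several lines (such fields would show up as false positives).

```julia
# Hypothetical helper (not part of CSV.jl): return the numbers of the lines
# that contain an odd number of quote characters.
function odd_quote_lines(io::IO; quotechar::Char='"')
    suspects = Int[]
    for (n, line) in enumerate(eachline(io))
        # Doubled quotes ("") used as escapes add two, so parity is unaffected.
        isodd(count(==(quotechar), line)) && push!(suspects, n)
    end
    return suspects
end

# Convenience method for a file path.
odd_quote_lines(path::AbstractString; kwargs...) =
    open(io -> odd_quote_lines(io; kwargs...), path)

# Small in-memory example: line 2 opens a quote and never closes it.
sample = IOBuffer("a,b\n1,\"unclosed\n2,\"fine\"\n")
println(odd_quote_lines(sample))
```

For the file discussed in this thread, the call would be odd_quote_lines("psam_husb.csv").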

nilshg · 2022-08-02T07:51:56Z

FWIW I can't reproduce it on that file:

julia> CSV.read(f, DataFrame)
647968×239 DataFrame
    Row │ RT       SERIALNO       DIVISION  PUMA   REGION  ST     ADJHSG   ADJINC   WGTP   NP     TYPEHUGQ  ACCESSINET  ACR      AGS      BATH     BDSP     BLD      BROADBND  COMPOTHX  CONP     DIALUP   ELEFP  ⋯
        │ String1  String15       Int64     Int64  Int64   Int64  Int64    Int64    Int64  Int64  Int64     Int64?      Int64?   Int64?   Int64?   Int64?   Int64?   Int64?    Int64?    Int64?   Int64?   Int64? ⋯
────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
      1 │ H        2020GQ0000022         4    400       2     29  1000000  1006149      0      1         2     missing  missing  missing  missing  missing  missing   missing   missing  missing  missing  missin ⋯
      2 │ H        2020GQ0000086
(...)

(jl_o4iDqb) pkg> st
Status `C:\Users\ngudat\AppData\Local\Temp\jl_o4iDqb\Project.toml`
  [336ed68f] CSV v0.10.4
  [a93c6f00] DataFrames v1.3.4

jd-foster · 2022-08-02T09:46:08Z

I also cannot reproduce this on that file on mac, same versions as the Windows test above.

dlakelan · 2022-08-02T14:42:40Z

Sure enough, on line 14 my version of the file has a quote character at the end:

I'll re-download the zip file and uncompress it from scratch to see if it was just damage in the download.

dlakelan · 2022-08-02T14:51:05Z

Ok, Sure enough, a fresh download and the file loads... Computers are weird. Thanks for you guys helping with this!

nickrobinson251 · 2022-10-07T11:10:49Z

> Ok, Sure enough, a fresh download and the file loads... Computers are weird. Thanks for you guys helping with this!

Glad it's sorted!

There's #935 open for the "really long strings" issue, so will close this one.

nickrobinson251 added new feature and removed new feature labels Oct 7, 2022

nickrobinson251 closed this as completed Oct 7, 2022
