Parsing fails with long strings #1009

Closed
CarlColglazier opened this issue Jun 20, 2022 · 8 comments

Comments

@CarlColglazier

Replication

Using this test file saved as "test.csv"

test.csv
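The attachment's contents are not reproduced in this thread. As a minimal sketch, a comparable file could be generated with plain Julia, assuming a single unquoted cell slightly longer than the 1048575-byte limit named in the error message. The function name `write_long_cell_csv`, the column names, and the default cell length are hypothetical and are not taken from the original attachment.

```julia
# Hypothetical stand-in for the attached test.csv; the real attachment may differ.
# Writes a two-column CSV whose second column holds one very long cell.
function write_long_cell_csv(path::AbstractString; cell_len::Integer = 1_100_000)
    open(path, "w") do io
        println(io, "id,text")                     # assumed header, for illustration only
        println(io, "1,", repeat("a", cell_len))   # one cell longer than 1048575 bytes
    end
    return path
end
```

Calling `write_long_cell_csv("test.csv")` places the file in the working directory.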

Run the following to try to read it:

CSV.read("test.csv", DataFrame, ntasks=1)

Which gives the following error:

ERROR: ArgumentError: length argument to Parsers.PosLen (1100002) is too large; max length allowed is 1048575
Stacktrace:
  [1] lentoolarge(len::Int64)
    @ Parsers ~/.julia/packages/Parsers/KmPKe/src/utils.jl:302
  [2] PosLen
    @ ~/.julia/packages/Parsers/KmPKe/src/utils.jl:306 [inlined]
  [3] xparse(::Type{String}, source::Vector{UInt8}, pos::Int64, len::Int64, options::Parsers.Options, ::Type{Parsers.PosLen})
    @ Parsers ~/.julia/packages/Parsers/KmPKe/src/strings.jl:289
  [4] xparse
    @ ~/.julia/packages/Parsers/KmPKe/src/strings.jl:3 [inlined]
  [5] detectcell(buf::Vector{UInt8}, pos::Int64, len::Int64, row::Int64, rowoffset::Int64, i::Int64, col::CSV.Column, ctx::CSV.Context, rowsguess::Int64)
    @ CSV ~/.julia/packages/CSV/jFiCn/src/file.jl:739
  [6] parserow
    @ ~/.julia/packages/CSV/jFiCn/src/file.jl:598 [inlined]
  [7] parsefilechunk!(ctx::CSV.Context, pos::Int64, len::Int64, rowsguess::Int64, rowoffset::Int64, columns::Vector{CSV.Column}, #unused#::Type{Tuple{}})
    @ CSV ~/.julia/packages/CSV/jFiCn/src/file.jl:551
  [8] CSV.File(ctx::CSV.Context, chunking::Bool)
    @ CSV ~/.julia/packages/CSV/jFiCn/src/file.jl:291
  [9] File
    @ ~/.julia/packages/CSV/jFiCn/src/file.jl:226 [inlined]
 [10] #File#25
    @ ~/.julia/packages/CSV/jFiCn/src/file.jl:222 [inlined]
 [11] read(source::String, sink::Type; copycols::Bool, kwargs::Base.Pairs{Symbol, Int64, Tuple{Symbol}, NamedTuple{(:ntasks,), Tuple{Int64}}})
    @ CSV ~/.julia/packages/CSV/jFiCn/src/CSV.jl:91
 [12] top-level scope
    @ REPL[4]:1
@nickrobinson251
Collaborator

yeah, this is a tricky one -- some discussion about it here: #935 and JuliaData/Parsers.jl#98

@dlakelan

dlakelan commented Aug 2, 2022

Ugh, this is really bad and it happens even without enormously long lines, just big files...

Here's the Census 2020 ACS household data

https://www2.census.gov/programs-surveys/acs/experimental/2020/data/pums/1-Year/csv_hus.zip

unzip it and try to read the second large file:

df = CSV.read("psam_husb.csv",DataFrame)

You'll get one of these parse errors. Lines in this file are like a few thousand characters, not hundreds of thousands of characters. But there are 645744 lines in the file.

Is there a workaround here?

@quinnj
Member

quinnj commented Aug 2, 2022

@dlakelan, it sounds to me like there might be some bad quoting in your file. The limits when you would hit this bug are:

  • Greater than ~1MB for an individual cell value
  • Greater than ~4.4TB for entire file size
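
Both limits are consistent with a position and a length being packed into one 64-bit integer: 20 bits for the length gives a maximum of 1048575, and 42 bits for the position gives roughly 4.4TB. The sketch below only illustrates that arithmetic. The bit widths are inferred from the numbers quoted above, and the names `pack_poslen`, `unpack_pos` and `unpack_len` are hypothetical; this is not the Parsers.jl implementation.

```julia
# Illustrative only: bit widths are an assumption inferred from the quoted limits.
const LEN_BITS = 20
const POS_BITS = 42
const MAX_LEN = (Int64(1) << LEN_BITS) - 1   # 1048575, matches the error message
const MAX_POS = (Int64(1) << POS_BITS) - 1   # 4398046511103 bytes, about 4.4TB

# Pack a byte position and a byte length into a single UInt64 (hypothetical helper).
function pack_poslen(pos::Integer, len::Integer)
    0 <= len <= MAX_LEN ||
        throw(ArgumentError("length ($len) is too large; max length allowed is $MAX_LEN"))
    0 <= pos <= MAX_POS ||
        throw(ArgumentError("position ($pos) is too large; max position allowed is $MAX_POS"))
    return (UInt64(pos) << LEN_BITS) | UInt64(len)
end

# Recover the two fields from the packed value.
unpack_pos(x::UInt64) = Int((x >> LEN_BITS) & UInt64(MAX_POS))
unpack_len(x::UInt64) = Int(x & UInt64(MAX_LEN))
```

Under these assumptions, a cell of 1100002 bytes cannot be represented, which is why the parser raises an error instead of truncating the value.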

If, however, a cell started with "some text ... and had no terminating " character, the parser would keep reading until EOF looking for the closing ", and the resulting cell could easily exceed the ~1MB limit.

@nilshg

nilshg commented Aug 2, 2022

FWIW I can't reproduce it on that file:

julia> CSV.read(f, DataFrame)
647968×239 DataFrame
    Row │ RT       SERIALNO       DIVISION  PUMA   REGION  ST     ADJHSG   ADJINC   WGTP   NP     TYPEHUGQ  ACCESSINET  ACR      AGS      BATH     BDSP     BLD      BROADBND  COMPOTHX  CONP     DIALUP   ELEFP  ⋯
        │ String1  String15       Int64     Int64  Int64   Int64  Int64    Int64    Int64  Int64  Int64     Int64?      Int64?   Int64?   Int64?   Int64?   Int64?   Int64?    Int64?    Int64?   Int64?   Int64? ⋯
────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
      1 │ H        2020GQ0000022         4    400       2     29  1000000  1006149      0      1         2     missing  missing  missing  missing  missing  missing   missing   missing  missing  missing  missin ⋯
      2 │ H        2020GQ0000086
(...)

(jl_o4iDqb) pkg> st
Status `C:\Users\ngudat\AppData\Local\Temp\jl_o4iDqb\Project.toml`
  [336ed68f] CSV v0.10.4
  [a93c6f00] DataFrames v1.3.4

@jd-foster

I also cannot reproduce this on that file on mac, same versions as the Windows test above.

@dlakelan

dlakelan commented Aug 2, 2022

Sure enough, on line 14 my version of the file has a quote character at the end:

[image not preserved]

I'll re-download the zip file and uncompress it from scratch to see if it was just damage in the download.
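
A stray quote like this can also be located without opening the file by hand. The following is a rough sketch in plain Julia, assuming that `"` is the quote character, that embedded quotes are escaped by doubling them, and that no field legitimately spans several lines. The function name `lines_with_open_quote` is hypothetical.

```julia
# Report line numbers that contain an odd number of quote characters.
# Assumptions: quotes inside fields are doubled (""), so a well-formed line
# always has an even count, and no quoted field contains a newline.
function lines_with_open_quote(io::IO; quotechar::Char = '"')
    suspects = Int[]
    for (lineno, line) in enumerate(eachline(io))
        # An odd count means a quote was opened (or closed) without a partner.
        isodd(count(==(quotechar), line)) && push!(suspects, lineno)
    end
    return suspects
end
```

For a file on disk this could be called as `open(lines_with_open_quote, "psam_husb.csv")`. Files that do contain multi-line quoted fields would produce false positives with this check.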

@dlakelan

dlakelan commented Aug 2, 2022

Ok, Sure enough, a fresh download and the file loads... Computers are weird. Thanks for you guys helping with this!

@nickrobinson251
Collaborator

> Ok, Sure enough, a fresh download and the file loads... Computers are weird. Thanks for you guys helping with this!

Glad it's sorted!

There's #935 open for the "really long strings" issue, so will close this one.

6 participants