Error when combining single row with multiple row CSV file into a DataFrame with pooling on.  #1130

@snovum

Hi!

When combining .csv files that have the same columns and data types, the order of the files matters if pooling is on and one of the files has only a single row.

For example:

using CSV, DataFrames

single_row_df = DataFrame(Name=["Alice"], Age=[30])
multiple_row_df = DataFrame(Name=["Bob", "Charlie", "David"], Age=[25, 28, 22])
CSV.write("filepath/single_row_data.csv", single_row_df)
CSV.write("filepath/multiple_row_data.csv", multiple_row_df)

If I try to combine the CSV files into a single DataFrame with the single-row file first, I receive the following error:

CSV.read(["filepath/single_row_data.csv", "filepath/multiple_row_data.csv"], DataFrame)


ERROR: UndefVarError: `A` not defined
Stacktrace:
 [1] (::CSV.var"#3#4")(x::PooledArrays.PooledVector{String7, UInt32, Vector{UInt32}})
   @ CSV ./none:0
 [2] iterate
   @ ./generator.jl:47 [inlined]
 [3] collect(itr::Base.Generator{Vector{PooledArrays.PooledVector{String7, UInt32, Vector{UInt32}}}, CSV.var"#3#4"})
   @ Base ./array.jl:834
 [4] chaincolumns!(a::Any, b::Any)
   @ CSV ~/.julia/packages/CSV/tmZyn/src/utils.jl:240
 [5] CSV.File(sources::Vector{String}; source::Nothing, kw::@Kwargs{})
   @ CSV ~/.julia/packages/CSV/tmZyn/src/file.jl:930
 [6] File
   @ ~/.julia/packages/CSV/tmZyn/src/file.jl:901 [inlined]
 [7] read(source::Vector{String}, sink::Type; copycols::Bool, kwargs::@Kwargs{})
   @ CSV ~/.julia/packages/CSV/tmZyn/src/CSV.jl:117
 [8] read(source::Vector{String}, sink::Type)
   @ CSV ~/.julia/packages/CSV/tmZyn/src/CSV.jl:113
 [9] top-level scope
   @ REPL[152]:1

Similarly

DataFrame!(CSV.File(["filepath/single_row_data.csv","filepath/multiple_row_data.csv"]))

and

DataFrame!(CSV.File(["filepath/single_row_data.csv","filepath/multiple_row_data.csv"]))

both return the following error

ERROR: UndefVarError: `A` not defined
Stacktrace:
 [1] (::CSV.var"#3#4")(x::PooledArrays.PooledVector{String7, UInt32, Vector{UInt32}})
   @ CSV ./none:0
 [2] iterate
   @ ./generator.jl:47 [inlined]
 [3] collect(itr::Base.Generator{Vector{PooledArrays.PooledVector{String7, UInt32, Vector{UInt32}}}, CSV.var"#3#4"})
   @ Base ./array.jl:834
 [4] chaincolumns!(a::Any, b::Any)
   @ CSV ~/.julia/packages/CSV/tmZyn/src/utils.jl:240
 [5] CSV.File(sources::Vector{String}; source::Nothing, kw::@Kwargs{})
   @ CSV ~/.julia/packages/CSV/tmZyn/src/file.jl:930
 [6] CSV.File(sources::Vector{String})
   @ CSV ~/.julia/packages/CSV/tmZyn/src/file.jl:901
 [7] top-level scope
   @ REPL[164]:1

Reversing the order of the files, i.e. putting the multiple-row file first, works in all cases:

julia> DataFrame(CSV.File(["filepath/multiple_row_data.csv","filepath/single_row_data.csv"]))
4×2 DataFrame
 Row │ Name     Age   
     │ String7  Int64 
─────┼────────────────
   1 │ Bob         25
   2 │ Charlie     28
   3 │ David       22
   4 │ Alice       30

julia> CSV.read(["filepath/multiple_row_data.csv","filepath/single_row_data.csv"],DataFrame)
4×2 DataFrame
 Row │ Name     Age   
     │ String7  Int64 
─────┼────────────────
   1 │ Bob         25
   2 │ Charlie     28
   3 │ David       22
   4 │ Alice       30

multiple_row_data.csv
single_row_data.csv

As pointed out by @nilshg, this comes from pooled arrays, and turning pooling off by setting pool = false fixes the problem.
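
A minimal sketch of that workaround, assuming the same two example files. The temporary directory and the variable names (dir, single_path, multiple_path, files) are only illustrative. The second variant, reading each file on its own and concatenating with vcat, is not from the original report; it is an assumption that this sidesteps the error because the files are never combined inside CSV.File.

```julia
using CSV, DataFrames

# Write the two example files to a temporary directory (illustrative location).
dir = mktempdir()
single_path = joinpath(dir, "single_row_data.csv")
multiple_path = joinpath(dir, "multiple_row_data.csv")
CSV.write(single_path, DataFrame(Name=["Alice"], Age=[30]))
CSV.write(multiple_path, DataFrame(Name=["Bob", "Charlie", "David"], Age=[25, 28, 22]))

# Single-row file first: the ordering that fails with the default pooling.
files = [single_path, multiple_path]

# Workaround reported in this issue: disable pooling when reading.
df = CSV.read(files, DataFrame; pool=false)

# Possible alternative (assumption, not confirmed in the issue):
# read each file separately and concatenate the resulting DataFrames.
df_vcat = reduce(vcat, [CSV.read(f, DataFrame) for f in files])
```

With pooling disabled, string columns are not stored as pooled arrays, so this trades the memory savings of pooling for a read that does not depend on file order.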
