Error when combining single row with multiple row CSV file into a DataFrame with pooling on. #1130

snovum · 2024-03-15T20:46:39Z

Hi!

When combining .csv files with the same number of columns and the same column types, the order of the files seems to matter if one of the files has only a single row and pooling is on.

For example:

using CSV, DataFrames

single_row_df = DataFrame(Name=["Alice"], Age=[30])
multiple_row_df = DataFrame(Name=["Bob", "Charlie", "David"], Age=[25, 28, 22])
CSV.write("filepath/single_row_data.csv", single_row_df)
CSV.write("filepath/multiple_row_data.csv", multiple_row_df)

If I try to combine the CSV files into a single DataFrame with the single-row file first, I get the following error:

CSV.read(["filepath/single_row_data.csv", "filepath/multiple_row_data.csv"],DataFrame)


ERROR: UndefVarError: `A` not defined
Stacktrace:
 [1] (::CSV.var"#3#4")(x::PooledArrays.PooledVector{String7, UInt32, Vector{UInt32}})
   @ CSV ./none:0
 [2] iterate
   @ ./generator.jl:47 [inlined]
 [3] collect(itr::Base.Generator{Vector{PooledArrays.PooledVector{String7, UInt32, Vector{UInt32}}}, CSV.var"#3#4"})
   @ Base ./array.jl:834
 [4] chaincolumns!(a::Any, b::Any)
   @ CSV ~/.julia/packages/CSV/tmZyn/src/utils.jl:240
 [5] CSV.File(sources::Vector{String}; source::Nothing, kw::@Kwargs{})
   @ CSV ~/.julia/packages/CSV/tmZyn/src/file.jl:930
 [6] File
   @ ~/.julia/packages/CSV/tmZyn/src/file.jl:901 [inlined]
 [7] read(source::Vector{String}, sink::Type; copycols::Bool, kwargs::@Kwargs{})
   @ CSV ~/.julia/packages/CSV/tmZyn/src/CSV.jl:117
 [8] read(source::Vector{String}, sink::Type)
   @ CSV ~/.julia/packages/CSV/tmZyn/src/CSV.jl:113
 [9] top-level scope
   @ REPL[152]:1

Similarly

DataFrame!(CSV.File(["filepath/single_row_data.csv","filepath/multiple_row_data.csv"]))

and

DataFrame!(CSV.File(["filepath/single_row_data.csv","filepath/multiple_row_data.csv"]))

both throw the following error:

ERROR: UndefVarError: `A` not defined
Stacktrace:
 [1] (::CSV.var"#3#4")(x::PooledArrays.PooledVector{String7, UInt32, Vector{UInt32}})
   @ CSV ./none:0
 [2] iterate
   @ ./generator.jl:47 [inlined]
 [3] collect(itr::Base.Generator{Vector{PooledArrays.PooledVector{String7, UInt32, Vector{UInt32}}}, CSV.var"#3#4"})
   @ Base ./array.jl:834
 [4] chaincolumns!(a::Any, b::Any)
   @ CSV ~/.julia/packages/CSV/tmZyn/src/utils.jl:240
 [5] CSV.File(sources::Vector{String}; source::Nothing, kw::@Kwargs{})
   @ CSV ~/.julia/packages/CSV/tmZyn/src/file.jl:930
 [6] CSV.File(sources::Vector{String})
   @ CSV ~/.julia/packages/CSV/tmZyn/src/file.jl:901
 [7] top-level scope
   @ REPL[164]:1

Reversing the order of the files, i.e. putting the multiple-row file first, works in all cases:

julia> DataFrame(CSV.File(["filepath/multiple_row_data.csv","filepath/single_row_data.csv"]))
4×2 DataFrame
 Row │ Name     Age   
     │ String7  Int64 
─────┼────────────────
   1 │ Bob         25
   2 │ Charlie     28
   3 │ David       22
   4 │ Alice       30

julia> CSV.read(["filepath/multiple_row_data.csv","filepath/single_row_data.csv"],DataFrame)
4×2 DataFrame
 Row │ Name     Age   
     │ String7  Int64 
─────┼────────────────
   1 │ Bob         25
   2 │ Charlie     28
   3 │ David       22
   4 │ Alice       30

multiple_row_data.csv
single_row_data.csv

As pointed out by @nilshg, this comes from pooled arrays, and turning pooling off by setting pool = false works around the problem.
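For anyone landing here, the pool = false workaround can be sketched as below. This is only a minimal reproduction under assumptions: the temporary directory and the variable names (dir, single_path, multiple_path) are made up for illustration, and the files are the same two as in the report.

```julia
using CSV, DataFrames

# Recreate the two files from the report in a temporary directory
# (the directory and variable names here are illustrative only).
dir = mktempdir()
single_path = joinpath(dir, "single_row_data.csv")
multiple_path = joinpath(dir, "multiple_row_data.csv")
CSV.write(single_path, DataFrame(Name=["Alice"], Age=[30]))
CSV.write(multiple_path, DataFrame(Name=["Bob", "Charlie", "David"], Age=[25, 28, 22]))

# Single-row file first, which is the order that fails with default pooling.
# pool=false disables pooling, so no column is parsed as a PooledVector.
df = CSV.read([single_path, multiple_path], DataFrame; pool=false)
```

The same keyword can be passed to CSV.File if you build the DataFrame from that instead.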


longemen3000 · 2024-04-09T04:27:20Z

Found the error.

CSV.jl/src/utils.jl

Lines 234 to 242 in acd36a6

    
        elseif c isa Vector && b isa Vector
            # two vectors, but we know eltype doesn't match, so try to promote
            A = Vector{promote_types(eltype(c), eltype(b))}
        elseif c isa SentinelVector && b isa SentinelVector
            A = vectype(promote_types(Base.nonmissingtype(eltype(c)), Base.nonmissingtype(eltype(b))))
        end
        x = ChainedVector([_promote(A, x) for x in a.arrays])
        y = _promote(A, b)
        return append!(x, y)

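There is a missing case where c isa PooledVector. When no branch matches, A is never assigned, so the comprehension that calls _promote(A, x) throws the UndefVarError from the stack trace.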
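To illustrate why a missing branch shows up as an UndefVarError, here is a toy function with the same if/elseif shape. It is not CSV.jl code: pick_container and its branches are hypothetical stand-ins, and it only shows the general Julia behaviour of a local variable that is assigned in some branches and read afterwards.

```julia
# Toy stand-in for the branch structure above; not the CSV.jl implementation.
function pick_container(c, b)
    if c isa Vector && b isa Vector
        A = Vector{promote_type(eltype(c), eltype(b))}
    elseif c isa Tuple && b isa Tuple
        A = Tuple
    end
    # If neither branch ran, A was never assigned and reading it throws.
    return A
end

# Both arguments are Vectors, so the first branch assigns A.
ok = pick_container([1], [2.0])

# A range is an AbstractVector but not a Vector, so no branch matches.
err = try
    pick_container(1:3, [1, 2, 3])
catch e
    e
end
```

Here err ends up holding the UndefVarError for A, which is the same failure mode as in the report, where the unmatched input is a PooledVector.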
