Error when combining single row with multiple row CSV file into a DataFrame with pooling on. #1130

Open
snovum opened this issue Mar 15, 2024 · 1 comment

snovum commented Mar 15, 2024

Hi!

When combining .csv files that have the same columns and data types, the order of the files matters if one of the files has only a single row and pooling is on.

For example:

using CSV, DataFrames

# One file with a single row, one with several rows; same columns and types
single_row_df = DataFrame(Name=["Alice"], Age=[30])
multiple_row_df = DataFrame(Name=["Bob", "Charlie", "David"], Age=[25, 28, 22])
CSV.write("filepath/single_row_data.csv", single_row_df)
CSV.write("filepath/multiple_row_data.csv", multiple_row_df)

If I try to combine the CSV files into a single DataFrame with the single-row file first, I get the following error:

CSV.read(["filepath/single_row_data.csv", "filepath/multiple_row_data.csv"],DataFrame)


ERROR: UndefVarError: `A` not defined
Stacktrace:
 [1] (::CSV.var"#3#4")(x::PooledArrays.PooledVector{String7, UInt32, Vector{UInt32}})
   @ CSV ./none:0
 [2] iterate
   @ ./generator.jl:47 [inlined]
 [3] collect(itr::Base.Generator{Vector{PooledArrays.PooledVector{String7, UInt32, Vector{UInt32}}}, CSV.var"#3#4"})
   @ Base ./array.jl:834
 [4] chaincolumns!(a::Any, b::Any)
   @ CSV ~/.julia/packages/CSV/tmZyn/src/utils.jl:240
 [5] CSV.File(sources::Vector{String}; source::Nothing, kw::@Kwargs{})
   @ CSV ~/.julia/packages/CSV/tmZyn/src/file.jl:930
 [6] File
   @ ~/.julia/packages/CSV/tmZyn/src/file.jl:901 [inlined]
 [7] read(source::Vector{String}, sink::Type; copycols::Bool, kwargs::@Kwargs{})
   @ CSV ~/.julia/packages/CSV/tmZyn/src/CSV.jl:117
 [8] read(source::Vector{String}, sink::Type)
   @ CSV ~/.julia/packages/CSV/tmZyn/src/CSV.jl:113
 [9] top-level scope
   @ REPL[152]:1

Similarly

DataFrame!(CSV.File(["filepath/single_row_data.csv","filepath/multiple_row_data.csv"]))

and

DataFrame!(CSV.File(["filepath/single_row_data.csv","filepath/multiple_row_data.csv"]))

both return the following error:

ERROR: UndefVarError: `A` not defined
Stacktrace:
 [1] (::CSV.var"#3#4")(x::PooledArrays.PooledVector{String7, UInt32, Vector{UInt32}})
   @ CSV ./none:0
 [2] iterate
   @ ./generator.jl:47 [inlined]
 [3] collect(itr::Base.Generator{Vector{PooledArrays.PooledVector{String7, UInt32, Vector{UInt32}}}, CSV.var"#3#4"})
   @ Base ./array.jl:834
 [4] chaincolumns!(a::Any, b::Any)
   @ CSV ~/.julia/packages/CSV/tmZyn/src/utils.jl:240
 [5] CSV.File(sources::Vector{String}; source::Nothing, kw::@Kwargs{})
   @ CSV ~/.julia/packages/CSV/tmZyn/src/file.jl:930
 [6] CSV.File(sources::Vector{String})
   @ CSV ~/.julia/packages/CSV/tmZyn/src/file.jl:901
 [7] top-level scope
   @ REPL[164]:1

Reversing the order of the files, i.e. putting the multiple-row file first, works in all cases:

julia> DataFrame(CSV.File(["filepath/multiple_row_data.csv","filepath/single_row_data.csv"]))
4×2 DataFrame
 Row │ Name     Age   
     │ String7  Int64 
─────┼────────────────
   1 │ Bob         25
   2 │ Charlie     28
   3 │ David       22
   4 │ Alice       30

julia> CSV.read(["filepath/multiple_row_data.csv","filepath/single_row_data.csv"],DataFrame)
4×2 DataFrame
 Row │ Name     Age   
     │ String7  Int64 
─────┼────────────────
   1 │ Bob         25
   2 │ Charlie     28
   3 │ David       22
   4 │ Alice       30

Attached files: multiple_row_data.csv, single_row_data.csv

As pointed out by @nilshg, this comes from pooled arrays, and turning pooling off by setting pool = false fixes the problem.
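A minimal sketch of that workaround, assuming the same two files as above. The temporary directory and the variable names `dir`, `single_path` and `multiple_path` are assumptions made so the snippet is self-contained; they stand in for "filepath" and are not part of the original report.

```julia
using CSV, DataFrames

# Recreate the two files from the report in a temporary directory
# (assumption: any writable directory works in place of "filepath").
dir = mktempdir()
single_path = joinpath(dir, "single_row_data.csv")
multiple_path = joinpath(dir, "multiple_row_data.csv")
CSV.write(single_path, DataFrame(Name=["Alice"], Age=[30]))
CSV.write(multiple_path, DataFrame(Name=["Bob", "Charlie", "David"], Age=[25, 28, 22]))

# Single-row file first, with pooling disabled via the pool keyword.
df = CSV.read([single_path, multiple_path], DataFrame; pool=false)
```

The keyword is forwarded to the reader for every file in the vector, so no column is pooled. This only sidesteps the failing code path; it does not fix the underlying issue.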

longemen3000 (Contributor) commented

Found the error.

CSV.jl/src/utils.jl

Lines 234 to 242 in acd36a6

    elseif c isa Vector && b isa Vector
        # two vectors, but we know eltype doesn't match, so try to promote
        A = Vector{promote_types(eltype(c), eltype(b))}
    elseif c isa SentinelVector && b isa SentinelVector
        A = vectype(promote_types(Base.nonmissingtype(eltype(c)), Base.nonmissingtype(eltype(b))))
    end
    x = ChainedVector([_promote(A, x) for x in a.arrays])
    y = _promote(A, b)
    return append!(x, y)
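There is a missing case where c isa PooledVector. In that case A is never assigned, which is why the comprehension that calls _promote(A, x) raises the UndefVarError seen in the stack trace.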
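The failure mode can be reproduced without CSV.jl. The sketch below is a hypothetical stand-in, not the library's code: `promote_pair` is an invented name, and a `SubArray` plays the role of the `PooledVector` that none of the branches handles.

```julia
# Hypothetical stand-in for the branch structure quoted above; not CSV.jl code.
function promote_pair(c::AbstractVector, b::AbstractVector)
    if c isa Vector && b isa Vector
        A = Vector{promote_type(eltype(c), eltype(b))}
    elseif c isa AbstractRange && b isa AbstractRange
        A = Vector{promote_type(eltype(c), eltype(b))}
    end
    # If no branch ran, A was never assigned and the comprehension throws.
    return [convert(A, x) for x in (c, b)]
end

# Both arguments are Vector, so A is assigned and the call succeeds.
handled = promote_pair([1, 2], [3.0])

# A SubArray is an AbstractVector but matches neither branch.
err = try
    promote_pair(view([1, 2, 3], 1:2), [4])
    nothing
catch e
    e
end
```

Here `err` ends up holding the caught exception for the unhandled combination. An `else` branch that either picks a fallback container type or throws a descriptive error would avoid the undefined variable; which of the two is right for CSV.jl is a decision for the maintainers.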