Inconsistent handling of Tuple and NamedTuple #204

bkamins · 2020-09-30T12:35:07Z

Currently we have this:

julia> Tables.columns(NamedTuple{(:a, :b),Tuple{Int,Int}}[])
Tables.CopiedColumns{NamedTuple{(:a, :b),Tuple{Array{Int64,1},Array{Int64,1}}}}: (a = Int64[], b = Int64[])

julia> Tables.columntable(Tuple{Int,Int}[])
NamedTuple()

Is this intended?

(I know it is a corner case but I test against it in DataFrames.jl)

quinnj · 2020-10-06T22:45:42Z

Somewhat intentional; i.e. we have explicit code in namedtuples.jl that treats any AbstractVector{<:NamedTuple} as a rowtable, and we don't have similar code for AbstractVector{<:Tuple}. We could add that I guess; I don't think it would mess anything else up.

bkamins · 2020-10-07T06:49:31Z

> we don't have similar code for AbstractVector{<:Tuple}

This is how things currently work for non-empty vectors of respective types:

julia> Tables.columntable([(1,2), (3,4)])
(1 = [1, 3], 2 = [2, 4])

julia> Tables.columntable([(a=1,b=2), (a=3,b=4)])
(a = [1, 3], b = [2, 4])

quinnj · 2020-10-08T22:33:34Z

Yeah, the difference is we have:

schema(x::AbstractVector{NamedTuple{names, types}}) where {names, types} = Schema(names, types)

defined for vectors of NamedTuples, but no similar definition for vectors of Tuples. In the Tuple case, Tables.columns goes to the schemaless fallback column-building routine, but the vector is empty, so there are no rows to learn column names from and we end up with an empty table.
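To illustrate why an empty input comes back as an empty table, here is a minimal sketch of a schemaless column builder. It is not the Tables.jl implementation: `schemaless_columns` is a hypothetical name, it uses only Base, and it assumes rows are Tuples or NamedTuples so that `keys` and integer indexing work.

```julia
# Hypothetical stand-in for a schemaless column builder; NOT the Tables.jl source.
# Assumes every row is a Tuple or NamedTuple of the same length.
function schemaless_columns(rows)
    first_step = iterate(rows)
    # No first row means there is nothing to learn names or types from.
    first_step === nothing && return NamedTuple()
    row = first_step[1]
    # Tuples yield integer keys (1, 2, ...); NamedTuples yield their field names.
    names = Tuple(Symbol(k) for k in keys(row))
    cols = Tuple([r[i] for r in rows] for i in 1:length(names))
    return NamedTuple{names}(cols)
end

schemaless_columns([(a = 1, b = 2), (a = 3, b = 4)])  # (a = [1, 3], b = [2, 4])
schemaless_columns(Tuple{Int,Int}[])                  # NamedTuple()
```

The empty case never looks at `eltype(rows)`, which is the gap this issue is about.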

Marking this as "help wanted" if someone wanted to take a stab at implementing this, since it's not too hard and could give someone a taste of Tables.jl internals.

bkamins · 2020-10-09T06:40:19Z

Yes, that is exactly the difference. Actually this is a more general issue, as a schema could also be inferred from the collection's eltype for structs (which currently does not happen):

julia> struct A
       x
       y
       end

julia> Tables.columntable([A(1,2)])
(x = [1], y = [2])

julia> Tables.columntable(A[])
NamedTuple()

quinnj · 2020-10-09T17:21:05Z

Good point. So instead of just special-casing AbstractVector{<:Tuple}, we could instead enhance the buildcolumns routines if the input is empty to check if there's an eltype we can use to generate a schema.

bkamins · 2020-10-09T17:52:56Z

This is what I assumed would be best (hopefully someone will be willing to grab it as a hacktoberfest challenge - maybe tomorrow? - I will post on #data).

(if this is not resolved by someone else till DataFrames.jl 0.22 is released I can try to propose a PR)

quinnj · 2020-10-09T18:18:36Z

I can leave a few bread crumbs for anyone who wants to take a stab at this:

* Here is where we hit this code path (i.e. calling Tables.columns(empty_vector)); as you can see, we just return NamedTuple().
* Instead, we want to check if the rowitr input has an eltype and, if so, generate a smarter NamedTuple than just an empty one.
* We have the allocatecolumns function that will return a "smart" NamedTuple given a Tables.Schema{names, types}, so it's really a matter of generating a Tables.Schema{names, types} from our empty rowitr.
* We do know that rowitr will be of type IteratorWrapper and iterates IteratorRow{T}, so I think what we'd need is some inspection code to generate names and types from the T of IteratorRow{T}, whether that be a Tuple or a struct Foo.
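The inspection step described above could look roughly like the following sketch. It uses only Base reflection (`isconcretetype`, `fieldnames`, `fieldtypes`, `fieldcount`) and does not touch Tables.jl internals. The helper names `infer_schema` and `empty_columns` are hypothetical, and the `Column$i` naming for Tuple rows is taken from the description of #206 quoted later in this thread.

```julia
# Hypothetical helpers; NOT the Tables.jl source. They only show how names and
# types can be read off an element type T without having any rows.
function infer_schema(::Type{T}) where {T}
    # Only concrete types have a fixed field layout we can query (rules out Any).
    isconcretetype(T) || return nothing
    names = if T <: Tuple
        # Plain tuples have no field names, so generate Column1, Column2, ...
        ntuple(i -> Symbol("Column", i), fieldcount(T))
    else
        # NamedTuples and structs both expose their names via fieldnames.
        fieldnames(T)
    end
    return names, fieldtypes(T)
end

function empty_columns(rows)
    sch = infer_schema(eltype(rows))
    sch === nothing && return NamedTuple()   # no usable eltype: old behavior
    names, types = sch
    # One empty, correctly typed vector per column.
    return NamedTuple{names}(map(S -> S[], types))
end

empty_columns(Tuple{Int,Float64}[])  # (Column1 = Int64[], Column2 = Float64[])
```

Falling back to `NamedTuple()` for non-concrete element types keeps the previous result for inputs such as `Any[]`, where no schema can be recovered.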

quinnj · 2020-10-09T18:20:35Z

@bkamins, on a separate note, even with a potential fix to this issue, we still have issues in DataFrame because we hit this line and all(col -> col isa AbstractVector, x) is true because x is empty. I wonder if we should also check that !isempty(x) so that if it is empty, we fall back to Tables.jl code?
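The behavior in question is the usual vacuous truth of `all` on an empty collection. A small sketch, using a hypothetical predicate name rather than the actual DataFrames.jl constructor code:

```julia
# Hypothetical predicates for illustration; NOT the DataFrames.jl source.
all_vectors(x) = all(col -> col isa AbstractVector, x)

# Variant with the extra emptiness check suggested above.
all_vectors_nonempty(x) = !isempty(x) && all_vectors(x)

empty_rows = Tuple{Int,Int}[]

all_vectors(empty_rows)            # true: no element can fail the test
all_vectors_nonempty(empty_rows)   # false: empty input takes the other branch
all_vectors_nonempty([[1, 2], [3, 4]])  # true: a real vector of vectors
```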

bkamins · 2020-10-09T18:25:28Z

Right - I will review that constructor and make a PR (as there might be some more logic to add below also). In particular I would change all(col -> isa(col, AbstractVector), x) to eltype(x) <: AbstractVector || eltype(x) === Any I think (it will be faster and should catch the cases when we potentially can have a vector of vectors)

EDIT: it is not that simple - I will propose what I think is OK in the PR

bkamins · 2020-10-09T19:11:35Z

Fixed in a commit to JuliaData/DataFrames.jl#2464

quinnj · 2020-10-14T23:07:57Z

Ok, I went ahead and did this: #206

…ts (#206)

* For Tables.columns fallback, attempt to preserve schema on empty inputs

Fixes #204. For empty row iterator inputs, we were just returning an empty `NamedTuple`. This has the disadvantage of not preserving an input's schema in a case like `NamedTuple{(:a, :b), Tuple{Int64, Float64}}[]`. Sometimes the input may be empty but have a queryable schema, so we should try to preserve that.

For `Tuple` rows, we generate column names like `Column$i`. I think this is fine because we're in the fallback `buildcolumns` code where we just want to return a standard "table" object, so returning a `NamedTuple` instead of a `Tuple` of vectors seems more appropriate; i.e. it's Tables.jl's job here to "build the columns", so we have latitude and control over what we return.

bkamins · 2020-10-15T05:18:26Z

Thank you!

bkamins mentioned this issue Oct 7, 2020

Allow multicolumn transformations for AbstractDataFrame JuliaData/DataFrames.jl#2461

Merged

quinnj added the "good first issue" and "help wanted" labels Oct 8, 2020

bkamins mentioned this issue Oct 9, 2020

[BREAKING] deprecate DataFrame constructors JuliaData/DataFrames.jl#2464

Merged

quinnj mentioned this issue Oct 14, 2020

For Tables.columns fallback, attempt to preserve schema on empty inputs #206

Merged

quinnj closed this as completed in #206 Oct 15, 2020
