Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inconsistent handling of Tuple and NamedTuple #204

Closed
bkamins opened this issue Sep 30, 2020 · 12 comments · Fixed by #206
Closed

Inconsistent handling of Tuple and NamedTuple #204

bkamins opened this issue Sep 30, 2020 · 12 comments · Fixed by #206
Labels
good first issue Good for newcomers help wanted Extra attention is needed

Comments

@bkamins
Copy link
Member

bkamins commented Sep 30, 2020

Currently we have this:

julia> Tables.columns(NamedTuple{(:a, :b),Tuple{Int,Int}}[])
Tables.CopiedColumns{NamedTuple{(:a, :b),Tuple{Array{Int64,1},Array{Int64,1}}}}: (a = Int64[], b = Int64[])

julia> Tables.columntable(Tuple{Int,Int}[])
NamedTuple()

Is this intended?

(I know it is a corner case but I test against it in DataFrames.jl)

@quinnj
Copy link
Member

quinnj commented Oct 6, 2020

Somewhat intentional; i.e. we have explicit code in namedtuples.jl that treats any AbstractVector{<:NamedTuple} as a rowtable, and we don't have similar code for AbstractVector{<:Tuple}. We could add that I guess; I don't think it would mess anything else up.

@bkamins
Copy link
Member Author

bkamins commented Oct 7, 2020

we don't have similar code for AbstractVector{<:Tuple}

This is how things currently work for non-empty vectors of respective types:

julia> Tables.columntable([(1,2), (3,4)])
(1 = [1, 3], 2 = [2, 4])

julia> Tables.columntable([(a=1,b=2), (a=3,b=4)])
(a = [1, 3], b = [2, 4])

@quinnj
Copy link
Member

quinnj commented Oct 8, 2020

Yeah, the difference is we have:

schema(x::AbstractVector{NamedTuple{names, types}}) where {names, types} = Schema(names, types)

defined for vector of namedtuples, but no similar definition for vector of tuple. In this case, Tables.columns goes to the schemaless fallback column building routine, but the vector is empty, so we end up with an empty table.

Marking this as "help wanted" if someone wanted to take a stab at implementing this, since it's not too hard and could give someone a taste of Tables.jl internals.

@quinnj quinnj added good first issue Good for newcomers help wanted Extra attention is needed labels Oct 8, 2020
@bkamins
Copy link
Member Author

bkamins commented Oct 9, 2020

Yes - this is exactly this difference. Actually this is a more general thing as schema can be inferred from collection eltype for structs also (which currently does not happen):

julia> struct A
       x
       y
       end

julia> Tables.columntable([A(1,2)])
(x = [1], y = [2])

julia> Tables.columntable(A[])
NamedTuple()

@quinnj
Copy link
Member

quinnj commented Oct 9, 2020

Good point. So instead of just special-casing AbstractVector{<:Tuple}, we could instead enhance the buildcolumns routines if the input is empty to check if there's an eltype we can use to generate a schema.

@bkamins
Copy link
Member Author

bkamins commented Oct 9, 2020

This is what I assumed would be best (hopefully someone will be willing to grab it as a hacktoberfest challenge - maybe tomorrow? - I will post on #data).

(if this is not resolved by someone else till DataFrames.jl 0.22 is released I can try to propose a PR)

@quinnj
Copy link
Member

quinnj commented Oct 9, 2020

I can leave a few bread crumbs for anyone who wants to take a stab at this:

  • Here is where we hit this code path (i.e. calling Tables.columns(empty_vector)): as you can see, we just return NamedTuple()
  • Instead, we want to check if the rowitr input has an eltype and if so, generate a smarter NamedTuple than just empty.
  • We have the allocatecolumns function that will return a "smart" NamedTuple given a Tables.Schema{names, types}, so it's really a matter of generating a Tables.Schema{names, types} from our empty rowitr
  • We do know that rowitr will be of type IteratorWrapper and iterates IteratorRow{T}, so I think what we'd need is some inspection code to generate names and types from the T of IteratorRow{T}; whether that be a Tuple, or struct Foo

@quinnj
Copy link
Member

quinnj commented Oct 9, 2020

@bkamins , on a separate note, even with a potential fix to this issue, we still have issues in DataFrame because we hit this line and all(col isa AbstractVector, x) is true because x is empty. I wonder if we should also check that !isempty(x) so that if it is empty, we fallback to Tables.jl code?

@bkamins
Copy link
Member Author

bkamins commented Oct 9, 2020

Right - I will review that constructor and make a PR (as there might be some more logic to add below also). In particular I would change all(col -> isa(col, AbstractVector), x) to eltype(x) <: AbstractVector || eltype(x) === Any I think (it will be faster and should catch the cases when we potentially can have a vector of vectors)

EDIT: it is not that simple - I will propose what I think is OK in the PR

@bkamins
Copy link
Member Author

bkamins commented Oct 9, 2020

Fixed in a commit to JuliaData/DataFrames.jl#2464

quinnj added a commit that referenced this issue Oct 14, 2020
Fixes #204. For empty row interator inputs, we were just returning an
empty `NamedTuple`. This has the disadvantage of not preserving an
input's schema in the case like `NamedTuple{(:a, :b), Tuple{Int64,
Float64}}[]`. Sometimes the input may be empty, but have a queryable
schema, so we should try to preserve that. For `Tuple` rows, we generate
column names like `Column$i`. I think this is fine because we're in the
fallback `buildcolumns` code where we just want to return a standard
"table" object, so returning a NamedTuple instead of `Tuple` of vectors
seems more appropriate; i.e. it's Tables.jl job here to "build the
columns", so we have latitude and control over what we return.
@quinnj
Copy link
Member

quinnj commented Oct 14, 2020

Ok, I went ahead and did this: #206

quinnj added a commit that referenced this issue Oct 15, 2020
…ts (#206)

* For Tables.columns fallback, attempt to preserve schema on empty inputs

Fixes #204. For empty row interator inputs, we were just returning an
empty `NamedTuple`. This has the disadvantage of not preserving an
input's schema in the case like `NamedTuple{(:a, :b), Tuple{Int64,
Float64}}[]`. Sometimes the input may be empty, but have a queryable
schema, so we should try to preserve that. For `Tuple` rows, we generate
column names like `Column$i`. I think this is fine because we're in the
fallback `buildcolumns` code where we just want to return a standard
"table" object, so returning a NamedTuple instead of `Tuple` of vectors
seems more appropriate; i.e. it's Tables.jl job here to "build the
columns", so we have latitude and control over what we return.
@bkamins
Copy link
Member Author

bkamins commented Oct 15, 2020

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Good for newcomers help wanted Extra attention is needed
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants