-
Notifications
You must be signed in to change notification settings - Fork 52
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support unknown schema from sources #10
Conversation
This is getting in pretty good shape; there are still some stray allocations coming from somewhere, perhaps the getproperty isn't getting inlined or something. It's about a factor of ~3x of the known schema case. |
@davidanthoff @piever @bkamins @andyferris @nalimilan , This PR now represents a slightly larger refactoring than anticipated of the Tables.jl interfaces. In particular, I'm proposing the following changes:
|
(I've updated the DataFrames PR and put up a CSV PR that both conform to these changes for reference) |
src/Tables.jl
Outdated
# when Tables.types(x) === nothing | ||
function buildcolumns(sch::Schema{names, nothing}, rowitr::T) where {names, T} | ||
state = iterate(rowitr) | ||
state === nothing && return NamedTuple{names}(Tuple(Missing[] for _ in names)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IndexedTables uses inference in this case to guess the type, though returning Missing
type like you do here is probably more principled. We should be consistent, but maybe it's the IndexedTables side that needs to change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why Missing[]
rather than Union{}
? Anyway it would make sense to use inference just like map
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no "inference" here, we're not applying a function, we're just getting an empty iterator. I considered Union{}
, but it felt to Base-collect-y to me: here we're talking tables in the data world, where Missing
seems a more natural default, like an all NULL
column in a database.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I'm worried about with Missing
is that if you later combine the resulting table with another (non-empty) one, you'll get a Union{T,Missing}
column, which wouldn't really be appropriate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmmm, that's a good point; that certainly makes a case for Union{}
instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You possibly want an iterator (edit: for the columns) that has eltype Union{}
and zero elements. You could e.g. create an struct EmptyVector <: AbstractVector{Union{}}; end
(or something similar) for this, which will probably behave nicely under vcat
and so-on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIK the eltype of the iterator shouldn't be Union{}
, it should be a NamedTuple
with the specified fields; they just hold empty vectors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I probably wasn't clear, I mean the columns should be like a Vector{Union{}}()
instead of a Vector{Missing}()
.
But it's also ever-so-slightly wasteful to allocate Vector
here for containers you know are empty (this is only important in the case you have many, many small or empty tables... sorry, I'm the StaticArrays guy so I tend to think of these cases and try to minimize overhead for small containers...)
src/Tables.jl
Outdated
cols = length(names) | ||
L = Base.IteratorSize(T) | ||
len = haslength(L) ? length(rowitr) : 0 | ||
columns = Vector{AbstractVector}(undef, cols) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One of the things that got me stuck when trying to implement this in this framework is that it's not completely clear what structure should contain the columns. NamedTuple
is out of the question because they can change type, but doesn't a Vector{AbstractVector}
cause performance issues?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was even thinking, maybe the only performant way to do this is to use generated functions that create code where we use a different variable per each column, but maybe that's overkill.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. Indeed, since the loop over columns is applied for each row, it's essential to avoid any type uncertainty.
Maybe we can avoid a generated function though, and instead use the same strategy as map
: store columns as a named tuple, and have the function call itself with the updated named tuple each time the types change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's the strategy in IndexedTables. However here it's a bit tricky to implement for technical reasons as in IndexedTables you are iterating over NamedTuples
so you can tell "statically" whether you need to expand or not, just by looking at the type of the row. One strategy I was proposing here is to store columns as a NamedTuple
, and change add!
so that if the types are incompatible, it would return nothing
and not do anything. Then in turn eachcolumns
would know that if a function call returns nothing
, it has to exit and signal that it failed, so the main loop would just call eachcolumns
and only call itself with the updated named tuple when eachcolumn
signals that it failed.
src/utils.jl
Outdated
This is useful for sinks iterating rows who wish to provide a type-stable mechanism for their "inner loops". Typically, | ||
such inner loops suffer from dynamic dispath due to the varying types of columns. | ||
Takes a function `f`, column names `names`, column types `types`, a `row` type (that satisfies the Row interface), and any other `args...`; | ||
it generates calls to get the value for each column in the row (`getproperty(row, nm)`) and then calls `f(val, col, name)`, where `f` is the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
f(val, col, name, args...)
, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes
Looks really nice. I've left a couple of comments, but I need to think a bit more carefully how this relates to what we have in StructArrays and IndexedTables. In particular I'm curious about nesting. Say I have an iterator of |
|
||
* `Tables.rows(src)` returns a `Row` iterator, where `Row` is any object that supports value access via `getproperty(row, nm::Symbol)`. For example, if I have a NamedTuple like `row = (a=1, b=2, c=3)`, its values can be accessed like `row.a`, which is desugared to a call to `getproperty(row, :a)`. Thus a NamedTuple implicitly satisfies the `Row` interface. A `Row` should support `getproperty(row, name)` for any column name returned in a table's `Tables.schema(src)`. This allows an end-user access pattern like: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The concrete example in this paragraph was useful IMHO.
README.md
Outdated
|
||
So how does one go about satisfying these interface functions as a table type developer? It mainly depends on the `Tables.AccessStyle(T)` of your table: | ||
In addition, it can be helpful to define the following: | ||
* `Tables.istable(::Type{MyTable}) = true`: there are runtime checks to see if an object implements the interface, but this can provide an explicit affirmation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd go as far as making this mandatory. Currently, checking whether a type implements a method is slow AFAIK, and it's cleaner to clearly require types to implement this trait. Else nobody can really rely on it, which makes it not that useful.
src/Tables.jl
Outdated
"Tables.schema(s) => NamedTuple{names, types}" | ||
function schema end | ||
# default definitions | ||
rowaccess(x) = false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe limit this to ::Type
?
src/Tables.jl
Outdated
end | ||
|
||
function schema(x::T) where {T} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are fallbacks like this really useful? I can see a few reasons not to provide such a function:
- some types may not allow calling
first(x)
more than once, making this fallback incorrect - it's quite trivial to implement for a specific type
- better get a
MethodError
than an empty schema for types that don't implement the interface
EDIT: see also JuliaData/DataFrames.jl#1495 (comment)
src/Tables.jl
Outdated
end | ||
end | ||
|
||
haslength(T) = T === Base.HasLength() || T === Base.HasShape{1}() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's weird to use T
here since it's an instance rather than a type.
src/Tables.jl
Outdated
columns = Vector{AbstractVector}(undef, cols) | ||
eachcolumn(add_or_widen!, sch, row, L, columns, 1, true, len) | ||
rownbr = 2 | ||
while true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should be able to use a for
loop here, maybe with outer rownbr
.
src/Tables.jl
Outdated
cols = length(names) | ||
L = Base.IteratorSize(T) | ||
len = haslength(L) ? length(rowitr) : 0 | ||
columns = Vector{AbstractVector}(undef, cols) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. Indeed, since the loop over columns is applied for each row, it's essential to avoid any type uncertainty.
Maybe we can avoid a generated function though, and instead use the same strategy as map
: store columns as a named tuple, and have the function call itself with the updated named tuple each time the types change.
src/namedtuples.jl
Outdated
TT = Tuple{Any[ _eltype(fieldtype(NT, i)) for i = 1:fieldcount(NT) ]...} | ||
return NamedTuple{names, TT} | ||
Base.@pure function types(::Type{NT}) where {NT <: NamedTuple{names, T}} where {names, T <: NTuple{N, AbstractVector{S} where S}} where {N} | ||
return Tuple{Any[ _eltype(fieldtype(NT, i)) for i = 1:fieldcount(NT) ]...} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use a generator instead of an array comprehension to avoid an allocation? Could also use ntuple
, which often gives cleaner code IIRC.
src/utils.jl
Outdated
return rle | ||
end | ||
|
||
# generic fallback from getproperty w/ type information to basic symbol lookup | ||
Base.getproperty(x, ::Type{T}, i::Int, nm::Symbol) where {T} = getproperty(x, nm) | ||
|
||
""" | ||
Tables.unroll(f, schema, row, args...) | ||
Tables.eachcolumn(f, names, types, row, args...) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAICT the actual signature is eachcolumn(f, schema, row, args...)
.
README.md
Outdated
for row in rows | ||
# a convenience function provided in Tables.jl for "unrolling" access to each column/property of a `Row` | ||
# it works by applying a provided function to each value; see `?Tables.eachcolumn` for more details | ||
Tables.eachcolumn(sch, row, mytbl) do val, col, name, mytbl |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's actually not needed to pass mytbl
, and I find it confusing.
src/Tables.jl
Outdated
Encoding the names & types as type parameters allows convenient use of the type in generated functions | ||
and other optimization use-cases. | ||
""" | ||
struct Schema{names, types} end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A small comment: current implementation of Schema
differs from NamedTuple
also because it allows duplicate column names (as opposed to NamedTuple
). Not sure if it is a significant thing.
I didn't look through the whole PR (I'll try to find some time, but who knows right now). But this caught me eye:
Does that also support the case that the column names are not known? That is what we need to handle for the Query.jl situation. |
After some interface soul-searching over the last few days, I think I boiled it down to this core idea: Tables.jl is really the union of two interfaces in Base: Iterable and what I'm calling PropertyAccessible (i.e. implements Now, I realize that this isn't quite what's implemented here as you pointed out. We shouldn't ask developers to implement But, I think we still need to allow a user to call |
Thinking about it even further, I think we do still allow users to call I'll try to iron this out some more tomorrow morning by tweaking the implementations. |
I feel that maybe we shouldn't remove this completely... an array's
Yes! I've been wondering about this for a little while, too. In fact, I was speculating to myself if we could simplify the entire thing and say this: a "table" supports both: you can iterate things that are property accessible (rows), AND you can get properties which are iterable (columns), and these are equivalent views of the content. Anyway, that's what I went with for the new TypedTables. I realize this intersection is a much narrower scope than intended here. I just wonder if we should imagine complementary sets of interfaces (iterable rows, property-accessible columns, and both) and call them by three different names? Does that make sense? |
Ok, I just pushed a commit that further refines the unknown schema case according to feedback from review and my comments the other day. In particular:
The fallbacks are now in a separate @andyferris, in response to removing |
src/fallbacks.jl
Outdated
end | ||
end | ||
|
||
haslength(x) = x === Base.HasLength() || x === Base.HasShape{1}() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this actually needed? For example, Base.IteratorSize(Matrix) isa Base.HasShape{2}
, yet you can call length
and use linear indexing just like with a vector.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could expand the check to be x === Base.HasLength() || x isa Base.HasShape
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, would you look at that
state = iterate(rowitr, st) | ||
state === nothing && break | ||
row, st = state | ||
columns !== updated[] && return _buildcolumns(rowitr, row, st, sch, L, updated[], rownbr, len, updated) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@piever, this is similar to the idea you proposed for making the iteration loop more type-stable. Here I'm passing in a Ref{Any}(columns)
when we set/push each row value, and if a column needs to widen, it will update the ref w/ merge(updated[], NamedTuple{(nm,)}((new_column,))
. A tricky thing that makes this work is the fact that all of the non-widened columns are identical objects between the columns
and updated[]
arguments; i.e. the merge
call makes a new namedtuple that points to all the original columns except the one new widened column. And since we only update each column once per row, we can just check at the end of a row loop if the updated[]
changed and re-dispatch on that.
This approach seems to perform really well on some larger datasets I tested, comparing the type-stable columntable(f)
vs. Tables.buildcolumns(nothing, f)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good trick! How about adding a comment? ;-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes! Definitely a "best of both worlds" implementation. I agree with @nalimilan that this needs comments in the code as it's a really smart strategy.
That's a very interesting question. There are containers that already support (or could reasonably be expected to support) both access styles, with different speeds - things like |
src/fallbacks.jl
Outdated
L = Base.IteratorSize(T) | ||
len = haslength(L) ? length(rowitr) : 0 | ||
sch = Schema(names, nothing) | ||
columns = NamedTuple{names}(Tuple(Union{}[] for _ = 1:length(names))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@piever @davidanthoff @nalimilan, I've thought of one more thing we should probably discuss here before merging. In this line, I'm defaulting the arrays to Union{}[]
, which will then widen to actual encountered types while iterating. Reading back over this monster thread however, I remembered that people have this paralyzing fear of this "collect-style" table building losing their original table's missingness. For example, with the current implementation here, we have:
julia> f = CSV.File(joinpath(dir, "test_basic.csv"))
CSV.File(/Users/jacobquinn/.julia/dev/CSV/test/testfiles/test_basic.csv, rows=3):
Tables.Schema:
:col1 Union{Missing, Int64}
:col2 Union{Missing, Int64}
:col3 Union{Missing, Int64}
julia> f |> columntable
(col1 = Union{Missing, Int64}[1, 4, 7], col2 = Union{Missing, Int64}[2, 5, 8], col3 = Union{Missing, Int64}[3, 6, 9])
julia> Tables.buildcolumns(nothing, f)
(col1 = [1, 4, 7], col2 = [2, 5, 8], col3 = [3, 6, 9])
So the collect-strategy drops the missingness, though it's notable that in the "inferred" case (or rather when the source knows its schema, which should be the vast majority of cases), the sink can pass that schema on just fine. Now, I think we have a few different options of things to do here:
- Provide at least a keyword argument, so that people could do, essentially,
f |> x->DataFrame(x; allowmissing=:all)
, (to use the CSV.jl keyword argallowmissing
, which can be:auto
,:none
, or:all
). That would essentially control whether we doUnion{}[]
orMissing[]
columns by default when collecting. - What should be the default?
Union{}[]
? OrMissing[]
; I'm inclined to sayUnion{}[]
, because I think people over-worry about this case, whereas in practical cases, it will be very much controlled by individual sinks. Sure a DataFrame might lose missingness, but that's a very easy case to handle (i.e. vcat will auto-promote columns for you). For a CSV.write operation, it couldn't care less what the column types are in the first place! Similarly for a database sink, the table will have been pre-defined for columns to allow null values or not, which is completely independent of uninferrable iterators that might come along.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the problem here that there is no equivalent to fieldtype
for properties in base? Really if there was a propertytypes
, this wouldn't be an issue at all, right? You would look at the first row you get, you could tell that the type of a given property is say Union{T,Missing}
and you could allocate the appropriate array.
Maybe just define that function, provide a default fallback that uses fieldtype
, and tell folks that if they use a type that adds a getproperty
, they also need to add a method for propertytypes
?
Or alternatively, catch the special case of a NamedTuple
and use fieldtype
to do the same thing, and lobby that base adds a propertytypes
function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that would really solve the problem though of sometimes losing missingness; if my row happens to be NamedTuple
and none of the individual iterated NamedTuples had a missing value for a column, the resulting column wouldn't allow missingness, even though that might be desirable. That's why I think it might be useful for a user to at least pass something like DataFrame(x; defaultmissing=true)
and the columns would all pop out like Union{T, Missing}
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not? The field types in the NamedTuple
would be Union{T,Missing}
, even if there might never be a missing
value, and if in the sink you could query the field type, you can pick the right column type. At least for tables that come from Query.jl that should definitely work, as long as the translator from DataValue
makes sure that if a field in the named tuple is of type DataValue
, the field in the named tuple that is returned by the iterator from Tables is of type Union{T,Missing}
. I don't see why that wouldn't work.
And in general, I think this problem would only show up in a projection case a la Query.jl, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, since NamedTuple
is a standard parametric type (contrary to Tuple
), its instances can encode the information about whether a value could have been missing in its parameters. Having a propertytype
function in Base would make sense.
In the meantime, we could define propertytype
internally without exporting it, have a method for NamedTuple
(based on fieldtype
), and require custom row types to define it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I commented more extensively below (in particular I do think there will always be cases where strictification happens), but I agree here that the result of columns([NamedTuple{(:a, :b), Tuple{Int, Union{Int, Missing}}}((1, 2))])
should allow missing values in the second column and this could be done via propertytype
Concerning the discussion about "strictifying things", here's my take. For In particular I think we should provide users with a set of functions that make this easier. Instead of @inline function buildcolumns(schema, rowitr::T, iT=nothing) where {T}
L = Base.IteratorSize(T)
len = Base.haslength(L) ? length(rowitr) : 0
nt = allocatecolumns(schema, len)
for (i, row) in enumerate(rowitr)
eachcolumn(add!, schema, row, L, nt, i)
end
return nt
end into the part which preallocates We should also have an equivalent of The case of |
@davidanthoff @piever