Support unknown schema from sources #10

Merged: quinnj merged 9 commits into master from jq/unknowneltype on Sep 5, 2018

Conversation

@quinnj (Member Author) commented Aug 31, 2018

This is getting into pretty good shape; there are still some stray allocations coming from somewhere; perhaps the getproperty call isn't getting inlined. It's about a factor of ~3x slower than the known-schema case.

@quinnj changed the title from "Support missing schema from sources" to "Support unknown schema from sources" on Sep 1, 2018
@quinnj (Member Author) commented Sep 1, 2018

@davidanthoff @piever @bkamins @andyferris @nalimilan ,

This PR now represents a slightly larger refactoring of the Tables.jl interfaces than anticipated. In particular, I'm proposing the following changes (a brief illustrative sketch follows the list):

  • Instead of AccessStyle traits, use Tables.rowaccess(::Type{MyType}) = true overloads to affirm that a table supports a certain interface access pattern; I think this overall simplifies things because tables can just signal/opt in to supporting a certain interface that generic fallbacks or other users can check later on. In the same vein as Tables.istable(T), these live at the individual interface function level.
  • Introduce a new Tables.Schema{names, T} type; this actually isn't drastically different from the Tables.schema(x) pattern before, but we need this custom type to better handle the case of unknown column types, since a NamedTuple{names, types} requires names and types to be the same length. Now a table w/ an unknown schema will have a schema like Tables.Schema{names, nothing}, with nothing indicating that column types are not known. I opted to use nothing instead of missing since we don't want any propagation of missingness here, and users/developers will need to handle that case specially with explicit checks (though convenience functions like Tables.eachcolumn are provided that will adjust for users).
  • The other change also tries to further reduce the "schema" role in the Tables.jl story by only having Tables.schema(x) be called on the result of Tables.rows(x) or Tables.columns(x); this makes the call more likely to succeed if a table type returns a separate object w/ more name/type information, and also allows for easier use where users/developers don't actually need to ever use the schema.
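For illustration, here is a minimal sketch of how a table type might opt in under this proposal. MyTable and MyRows are hypothetical names; only Tables.istable, Tables.rowaccess, Tables.rows, and Tables.Schema come from the changes described above, and the exact signatures are whatever this PR ends up defining:

using Tables

# A toy source whose column types are not known up front.
struct MyRows
    data::Vector{Any}          # each element is a NamedTuple row
end
Base.iterate(r::MyRows, st...) = iterate(r.data, st...)
Base.length(r::MyRows) = length(r.data)

struct MyTable
    rows::MyRows
end

# Opt in at the individual interface-function level:
Tables.istable(::Type{MyTable}) = true
Tables.rowaccess(::Type{MyTable}) = true
Tables.rows(t::MyTable) = t.rows

# Schema is asked of the *result* of Tables.rows / Tables.columns;
# `nothing` in the types position signals that column types are unknown.
Tables.schema(r::MyRows) = Tables.Schema((:a, :b), nothing)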

@quinnj (Member Author) commented Sep 1, 2018

(I've updated the DataFrames PR and put up a CSV PR that both conform to these changes for reference)

src/Tables.jl Outdated
# when Tables.types(x) === nothing
function buildcolumns(sch::Schema{names, nothing}, rowitr::T) where {names, T}
    state = iterate(rowitr)
    state === nothing && return NamedTuple{names}(Tuple(Missing[] for _ in names))

IndexedTables uses inference in this case to guess the type, though returning Missing type like you do here is probably more principled. We should be consistent, but maybe it's the IndexedTables side that needs to change?

Member:

Why Missing[] rather than Union{}? Anyway it would make sense to use inference just like map.

Member Author:

There is no "inference" here, we're not applying a function, we're just getting an empty iterator. I considered Union{}, but it felt too Base-collect-y to me: here we're talking about tables in the data world, where Missing seems a more natural default, like an all-NULL column in a database.

Member:

What I'm worried about with Missing is that if you later combine the resulting table with another (non-empty) one, you'll get a Union{T,Missing} column, which wouldn't really be appropriate.

Member Author:

Hmmmm, that's a good point; that certainly makes a case for Union{} instead.

@andyferris (Member), Sep 3, 2018:

You possibly want an iterator (edit: for the columns) that has eltype Union{} and zero elements. You could e.g. create a struct EmptyVector <: AbstractVector{Union{}}; end (or something similar) for this, which will probably behave nicely under vcat and so on.
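A minimal sketch of that EmptyVector idea (a hypothetical type, not something this PR adds):

# zero elements, eltype Union{}, no per-column allocation
struct EmptyVector <: AbstractVector{Union{}} end

Base.size(::EmptyVector) = (0,)
Base.getindex(v::EmptyVector, i::Int) = throw(BoundsError(v, i))

# vcat with a real column should widen to that column's eltype:
# vcat(EmptyVector(), [1, 2, 3]) isa Vector{Int}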

Member:

AFAIK the eltype of the iterator shouldn't be Union{}, it should be a NamedTuple with the specified fields; they just hold empty vectors.

Member:

Sorry, I probably wasn't clear, I mean the columns should be like a Vector{Union{}}() instead of a Vector{Missing}().

But it's also ever-so-slightly wasteful to allocate Vector here for containers you know are empty (this is only important in the case you have many, many small or empty tables... sorry, I'm the StaticArrays guy so I tend to think of these cases and try to minimize overhead for small containers...)

src/Tables.jl Outdated
cols = length(names)
L = Base.IteratorSize(T)
len = haslength(L) ? length(rowitr) : 0
columns = Vector{AbstractVector}(undef, cols)

One of the things that got me stuck when trying to implement this in this framework is that it's not completely clear what structure should contain the columns. A NamedTuple is out of the question because the column types can change, but doesn't a Vector{AbstractVector} cause performance issues?

@piever, Sep 1, 2018:

I was even thinking that maybe the only performant way to do this is to use generated functions that create code with a different variable for each column, but maybe that's overkill.

Member:

Good point. Indeed, since the loop over columns is applied for each row, it's essential to avoid any type uncertainty.

Maybe we can avoid a generated function though, and instead use the same strategy as map: store columns as a named tuple, and have the function call itself with the updated named tuple each time the types change.


That's the strategy in IndexedTables. However, here it's a bit tricky to implement, because in IndexedTables you are iterating over NamedTuples, so you can tell "statically" whether you need to expand or not, just by looking at the type of the row. One strategy I was proposing here is to store columns as a NamedTuple and change add! so that, if the types are incompatible, it returns nothing and does not do anything. Then eachcolumn would know that if a function call returns nothing it has to exit and signal that it failed, so the main loop would just call eachcolumn and only call itself with the updated named tuple when eachcolumn signals that it failed.

src/utils.jl Outdated
This is useful for sinks iterating rows who wish to provide a type-stable mechanism for their "inner loops". Typically,
such inner loops suffer from dynamic dispatch due to the varying types of columns.
Takes a function `f`, column names `names`, column types `types`, a `row` type (that satisfies the Row interface), and any other `args...`;
it generates calls to get the value for each column in the row (`getproperty(row, nm)`) and then calls `f(val, col, name)`, where `f` is the

f(val, col, name, args...), right?

Member Author:

yes

@piever commented Sep 1, 2018

Looks really nice. I've left a couple of comments, but I need to think a bit more carefully about how this relates to what we have in StructArrays and IndexedTables. In particular I'm curious about nesting. Say I have an iterator of Pair{NamedTuple, NamedTuple} like (a = 1, b = 2) => (c = 3, d = 4) (this is actually quite useful in IndexedTables: the first NamedTuple holds the primary keys and the other the non-primary keys). Would there be a way to use this code / framework to collect into a Pair(NamedTuple{Vector...}, NamedTuple{Vector...}) (syntax is not quite right but I hope it's clear)?


* `Tables.rows(src)` returns a `Row` iterator, where `Row` is any object that supports value access via `getproperty(row, nm::Symbol)`. For example, if I have a NamedTuple like `row = (a=1, b=2, c=3)`, its values can be accessed like `row.a`, which is desugared to a call to `getproperty(row, :a)`. Thus a NamedTuple implicitly satisfies the `Row` interface. A `Row` should support `getproperty(row, name)` for any column name returned in a table's `Tables.schema(src)`. This allows an end-user access pattern like:
Member:

The concrete example in this paragraph was useful IMHO.
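For reference, a small self-contained illustration of the access pattern that paragraph describes, assuming (as the fallbacks in this PR intend) that a Vector of NamedTuples is accepted as a source:

using Tables

src = [(a = 1, b = 2.0), (a = 2, b = 4.0)]

for row in Tables.rows(src)
    # each row satisfies the Row interface: values are looked up by property name
    println(row.a, ", ", getproperty(row, :b))
end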

README.md Outdated

So how does one go about satisfying these interface functions as a table type developer? It mainly depends on the `Tables.AccessStyle(T)` of your table:
In addition, it can be helpful to define the following:
* `Tables.istable(::Type{MyTable}) = true`: there are runtime checks to see if an object implements the interface, but this can provide an explicit affirmation
Member:

I'd go as far as making this mandatory. Currently, checking whether a type implements a method is slow AFAIK, and it's cleaner to clearly require types to implement this trait. Else nobody can really rely on it, which makes it not that useful.

src/Tables.jl Outdated
"Tables.schema(s) => NamedTuple{names, types}"
function schema end
# default definitions
rowaccess(x) = false
Member:

Maybe limit this to ::Type?

src/Tables.jl Outdated
end

function schema(x::T) where {T}
@nalimilan (Member), Sep 1, 2018:

Are fallbacks like this really useful? I can see a few reasons not to provide such a function:

  • some types may not allow calling first(x) more than once, making this fallback incorrect
  • it's quite trivial to implement for a specific type
  • better get a MethodError than an empty schema for types that don't implement the interface

EDIT: see also JuliaData/DataFrames.jl#1495 (comment)

src/Tables.jl Outdated
end
end

haslength(T) = T === Base.HasLength() || T === Base.HasShape{1}()
Member:

It's weird to use T here since it's an instance rather than a type.

src/Tables.jl Outdated
columns = Vector{AbstractVector}(undef, cols)
eachcolumn(add_or_widen!, sch, row, L, columns, 1, true, len)
rownbr = 2
while true
Member:

You should be able to use a for loop here, maybe with outer rownbr.

src/Tables.jl Outdated

TT = Tuple{Any[ _eltype(fieldtype(NT, i)) for i = 1:fieldcount(NT) ]...}
return NamedTuple{names, TT}
Base.@pure function types(::Type{NT}) where {NT <: NamedTuple{names, T}} where {names, T <: NTuple{N, AbstractVector{S} where S}} where {N}
return Tuple{Any[ _eltype(fieldtype(NT, i)) for i = 1:fieldcount(NT) ]...}
Member:

Use a generator instead of an array comprehension to avoid an allocation? Could also use ntuple, which often gives cleaner code IIRC.

src/utils.jl Outdated
return rle
end

# generic fallback from getproperty w/ type information to basic symbol lookup
Base.getproperty(x, ::Type{T}, i::Int, nm::Symbol) where {T} = getproperty(x, nm)

"""
Tables.unroll(f, schema, row, args...)
Tables.eachcolumn(f, names, types, row, args...)
Member:

AFAICT the actual signature is eachcolumn(f, schema, row, args...).

README.md Outdated
for row in rows
    # a convenience function provided in Tables.jl for "unrolling" access to each column/property of a `Row`
    # it works by applying a provided function to each value; see `?Tables.eachcolumn` for more details
    Tables.eachcolumn(sch, row, mytbl) do val, col, name, mytbl
Member:

It's actually not needed to pass mytbl, and I find it confusing.

src/Tables.jl Outdated
Encoding the names & types as type parameters allows convenient use of the type in generated functions
and other optimization use-cases.
"""
struct Schema{names, types} end
Member:

A small comment: the current implementation of Schema also differs from NamedTuple in that it allows duplicate column names. Not sure if that is significant.

@davidanthoff (Collaborator):

I didn't look through the whole PR (I'll try to find some time, but who knows right now). But this caught my eye:

Now a table w/ an unknown schema will have a schema like Tables.Schema{names, nothing}, with nothing indicating that column types are not known.

Does that also support the case that the column names are not known? That is what we need to handle for the Query.jl situation.

@quinnj (Member Author) commented Sep 2, 2018

After some interface soul-searching over the last few days, I think I boiled it down to this core idea: Tables.jl is really the union of two interfaces in Base: Iterable and what I'm calling PropertyAccessible (i.e. implements propertynames and getproperty). As mentioned in the docs, Tables.rows and Tables.columns are really the duals of each other with regards to these two interfaces: Tables.columns(x) returns a PropertyAccessible object of iterators, and Tables.rows(x) returns an Iterator of PropertyAccessible objects. So being able to get the "column names" (i.e. propertynames) of the results of Tables.columns and individual Row-iterations from Tables.rows is fundamental.
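A tiny illustration of that duality, using plain NamedTuples and Vectors (the values are made up):

# Tables.rows-style: an iterator of property-accessible objects
rowform = [(a = 1, b = 2.0), (a = 2, b = 4.0)]
rowform[2].a           # 2, via getproperty on a row

# Tables.columns-style: a property-accessible object of iterators
colform = (a = [1, 2], b = [2.0, 4.0])
colform.a              # [1, 2], via getproperty on the table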

Now, I realize that this isn't quite what's implemented here as you pointed out. We shouldn't ask developers to implement Tables.schema on the direct result of Tables.rows(x), but rather on the iterated values of Tables.rows (i.e. each "row" instance should allow calling Tables.schema on it). I can make that update and I don't think it changes things too much (looking over CSV.write, for example, we would just do an initial iteration, then call Tables.schema, then handle things appropriately).

But, I think we still need to allow a user to call Tables.schema (or perhaps just Tables.names) on the table type, because a table type might have a known schema, but not have any values. I.e. DataFrame(a=Int[], b=Float64[]). In this case, the DataFrame knows its column names and types, yet Tables.rows(df) would produce an empty iterator. So I think we need to somehow allow the user to call Tables.schema on whatever is returned by Tables.rows or Tables.columns, but somehow also signal that they should only call that in the "empty" case. This is somewhat our dual of the "what's the default type of an empty Generator for collect" problem. If a table is empty, we should have a way to say, "hey, do you happen to know your column names & types anyway?" and if so, the sink can use those, otherwise, return the equivalent of a DataFrame().

@quinnj (Member Author) commented Sep 2, 2018

Thinking about it even further, I think we should still allow users to call Tables.schema on the direct Tables.rows or Tables.columns objects, but also allow them to return nothing directly instead of a Tables.Schema object. The difference here is that some tables know their column names/types statically, and sinks should be able to use that information in the nice case. But if a table doesn't know them, the sink can still rely on calling Tables.schema on the individual rows, while realizing that the Tables.Schema may change from iteration to iteration (i.e. (a=1, b=1.0), then (a=2, b=missing)).

I'll try to iron this out some more tomorrow morning by tweaking the implementations.

@andyferris (Member) commented Sep 3, 2018

Instead of AccessStyle traits

I feel that maybe we shouldn't remove this completely... an array's IndexStyle isn't about what interface it supports (all arrays support both linear and Cartesian indexing) but about which is faster. Is there some way to reflect upon this in this new design?

the union of two interfaces in Base: Iterable and what I'm calling PropertyAccessible

Yes! I've been wondering about this for a little while, too. In fact, I was speculating to myself if we could simplify the entire thing and say this: a "table" supports both: you can iterate things that are property accessible (rows), AND you can get properties which are iterable (columns), and these are equivalent views of the content. Anyway, that's what I went with for the new TypedTables.

I realize this intersection is a much narrower scope than intended here. I just wonder if we should imagine complementary sets of interfaces (iterable rows, property-accessible columns, and both) and call them by three different names? Does that make sense?

@quinnj (Member Author) commented Sep 4, 2018

Ok, I just pushed a commit that further refines the unknown schema case according to feedback from review and my comments the other day. In particular:

  • tables implement Tables.rows and/or Tables.columns as before
  • tables should implement Tables.schema(rows_or_columns) on the result of Tables.rows or Tables.columns, returning either
    • a Tables.Schema object, or
    • nothing, indicating an unknown schema (column names and types)

The fallbacks are now in a separate fallbacks.jl file and have been updated, along with the reference implementations, to accurately reflect the unknown schema case. I've also updated the CSV.write implementation and the DataFrames implementation to reflect these changes.
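A rough sketch of how a generic sink might consume a source under this refined contract; mysink_write is hypothetical, and the Tables.eachcolumn call follows the signature discussed earlier in this thread, which may still change:

using Tables

function mysink_write(io::IO, table)
    rows = Tables.rows(table)
    sch = Tables.schema(rows)
    if sch === nothing
        # unknown schema: discover the names from each row as we go
        for row in rows
            println(io, join((getproperty(row, nm) for nm in propertynames(row)), ","))
        end
    else
        # known schema: use the eachcolumn convenience for a type-stable inner loop
        for row in rows
            Tables.eachcolumn(sch, row) do val, col, name
                col > 1 && print(io, ',')
                print(io, val)
            end
            println(io)
        end
    end
    return io
end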

@andyferris, in response to removing AccessStyle, I guess I wonder where it would really be used? Sinks (at least in my implementations so far) always have a "preferred" access style: they either want rows or they want columns, so they just call what is most natural for them and run w/ it.

src/fallbacks.jl Outdated
end
end

haslength(x) = x === Base.HasLength() || x === Base.HasShape{1}()
Member:

Is this actually needed? For example, Base.IteratorSize(Matrix) isa Base.HasShape{2}, yet you can call length and use linear indexing just like with a vector.

Member Author:

We could expand the check to be x === Base.HasLength() || x isa Base.HasShape

Member Author:

Well, would you look at that

state = iterate(rowitr, st)
state === nothing && break
row, st = state
columns !== updated[] && return _buildcolumns(rowitr, row, st, sch, L, updated[], rownbr, len, updated)
Member Author:

@piever, this is similar to the idea you proposed for making the iteration loop more type-stable. Here I'm passing in a Ref{Any}(columns) when we set/push each row value, and if a column needs to widen, it will update the ref w/ merge(updated[], NamedTuple{(nm,)}((new_column,))). A tricky thing that makes this work is the fact that all of the non-widened columns are identical objects between the columns and updated[] arguments; i.e. the merge call makes a new namedtuple that points to all the original columns except the one new widened column. And since we only update each column once per row, we can just check at the end of a row loop whether updated[] changed and re-dispatch on that.

This approach seems to perform really well on some larger datasets I tested, comparing the type-stable columntable(f) vs. Tables.buildcolumns(nothing, f).
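A simplified, standalone sketch of the trick described above (not the actual Tables.jl code; the names and signatures here are illustrative):

function add_or_widen!(val, nm::Symbol, columns::NamedTuple, updated::Ref{Any})
    col = getproperty(columns, nm)
    if val isa eltype(col)
        push!(col, val)
    else
        # widen: build a new column; every other column object is shared between
        # `columns` and `updated[]`, so their pushes for this row are not lost
        newcol = collect(Base.promote_typejoin(eltype(col), typeof(val)), col)
        push!(newcol, val)
        updated[] = merge(updated[], NamedTuple{(nm,)}((newcol,)))
    end
    return nothing
end

function collectcolumns(columns::NamedTuple, rowitr, state=())
    updated = Ref{Any}(columns)
    while true
        next = iterate(rowitr, state...)
        next === nothing && return columns
        row, st = next
        state = (st,)
        for nm in propertynames(row)
            add_or_widen!(getproperty(row, nm), nm, columns, updated)
        end
        # each column widens at most once per row, so one check per row suffices:
        columns !== updated[] && return collectcolumns(updated[], rowitr, state)
    end
end

# e.g. collectcolumns((a = Union{}[], b = Union{}[]), [(a = 1, b = 2.0), (a = 2, b = missing)])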

Member:

Good trick! How about adding a comment? ;-)


Yes! Definitely a "best of both worlds" implementation. I agree with @nalimilan that this needs comments in the code as it's a really smart strategy.

@andyferris (Member):

in response to removing AccessStyle, I guess I wonder where it would really be used?

That's a very interesting question.

There are containers that already support (or could reasonably be expected to support) both access styles, with different speeds - things like DataFrames, TypedTables Table and FlexTable, etc - I could even create a view of a column of a CSV file with relatively decent performance given you've already identified the offsets for all the line beginnings (and it would be really cool if you implemented getproperty to do exactly this). If I've got a (generic) function which happens to transform a Table to a Table, which style should it pick, when both access predicates return true? Depending on the operation, it might be faster to deal with rows or with columns from a Table, but in this case generally operations on columns might be a better default, similar to how linear indexing is generally faster for arrays that support it (but linear indexing is not suitable/ergonomic for all operations - we don't use it for multidimensional broadcasting, for instance).

src/fallbacks.jl Outdated
L = Base.IteratorSize(T)
len = haslength(L) ? length(rowitr) : 0
sch = Schema(names, nothing)
columns = NamedTuple{names}(Tuple(Union{}[] for _ = 1:length(names)))
Member Author:

@piever @davidanthoff @nalimilan, I've thought of one more thing we should probably discuss here before merging. In this line, I'm defaulting the arrays to Union{}[], which will then widen to actual encountered types while iterating. Reading back over this monster thread however, I remembered that people have this paralyzing fear of this "collect-style" table building losing their original table's missingness. For example, with the current implementation here, we have:

julia> f = CSV.File(joinpath(dir, "test_basic.csv"))
CSV.File(/Users/jacobquinn/.julia/dev/CSV/test/testfiles/test_basic.csv, rows=3):
Tables.Schema:
 :col1  Union{Missing, Int64}
 :col2  Union{Missing, Int64}
 :col3  Union{Missing, Int64}

julia> f |> columntable
(col1 = Union{Missing, Int64}[1, 4, 7], col2 = Union{Missing, Int64}[2, 5, 8], col3 = Union{Missing, Int64}[3, 6, 9])

julia> Tables.buildcolumns(nothing, f)
(col1 = [1, 4, 7], col2 = [2, 5, 8], col3 = [3, 6, 9])

So the collect-strategy drops the missingness, though it's notable that in the "inferred" case (or rather when the source knows its schema, which should be the vast majority of cases), the sink can pass that schema on just fine. Now, I think we have a few different options of things to do here:

  • Provide at least a keyword argument, so that people could do, essentially, f |> x->DataFrame(x; allowmissing=:all) (to use the CSV.jl keyword arg allowmissing, which can be :auto, :none, or :all). That would essentially control whether we do Union{}[] or Missing[] columns by default when collecting.
  • What should be the default? Union{}[] or Missing[]? I'm inclined to say Union{}[], because I think people over-worry about this case, whereas in practical cases it will be very much controlled by individual sinks. Sure, a DataFrame might lose missingness, but that's a very easy case to handle (i.e. vcat will auto-promote columns for you; see the small example after this list). For a CSV.write operation, it couldn't care less what the column types are in the first place! Similarly for a database sink, the table will have been pre-defined for columns to allow null values or not, which is completely independent of uninferrable iterators that might come along.
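A tiny example of the vcat auto-promotion referred to in the second bullet:

a = [1, 2, 3]            # Vector{Int}
b = [4, missing]         # Vector{Union{Int, Missing}}
c = vcat(a, b)           # eltype(c) == Union{Int, Missing}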

Collaborator:

Is the problem here that there is no equivalent to fieldtype for properties in base? Really if there was a propertytypes, this wouldn't be an issue at all, right? You would look at the first row you get, you could tell that the type of a given property is say Union{T,Missing} and you could allocate the appropriate array.

Maybe just define that function, provide a default fallback that uses fieldtype, and tell folks that if they use a type that adds a getproperty, they also need to add a method for propertytypes?

Or alternatively, catch the special case of a NamedTuple and use fieldtype to do the same thing, and lobby that base adds a propertytypes function?

Member Author:

I don't think that would really solve the problem of sometimes losing missingness, though; if my row happens to be a NamedTuple and none of the individual iterated NamedTuples had a missing value for a column, the resulting column wouldn't allow missingness, even though that might be desirable. That's why I think it might be useful for a user to at least pass something like DataFrame(x; defaultmissing=true) and have the columns all pop out like Union{T, Missing}.

Collaborator:

Why not? The field types in the NamedTuple would be Union{T,Missing}, even if there might never be a missing value, and if in the sink you could query the field type, you can pick the right column type. At least for tables that come from Query.jl that should definitely work, as long as the translator from DataValue makes sure that if a field in the named tuple is of type DataValue, the field in the named tuple that is returned by the iterator from Tables is of type Union{T,Missing}. I don't see why that wouldn't work.

And in general, I think this problem would only show up in a projection case a la Query.jl, right?

Member:

Indeed, since NamedTuple is a standard parametric type (contrary to Tuple), its instances can encode the information about whether a value could have been missing in its parameters. Having a propertytype function in Base would make sense.

In the meantime, we could define propertytype internally without exporting it, have a method for NamedTuple (based on fieldtype), and require custom row types to define it.
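A hedged sketch of that idea; propertytype here is hypothetical, neither in Base nor in this PR:

# a per-property analogue of fieldtype
propertytype(::Type, nm::Symbol) = Any    # conservative fallback
propertytype(::Type{NT}, nm::Symbol) where {NT <: NamedTuple} = fieldtype(NT, nm)

row = NamedTuple{(:a, :b), Tuple{Int, Union{Int, Missing}}}((1, 2))
propertytype(typeof(row), :b)   # Union{Int, Missing}, even though the value isn't missing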


I commented more extensively below (in particular, I do think there will always be cases where strictification happens), but I agree here that the result of columns([NamedTuple{(:a, :b), Tuple{Int, Union{Int, Missing}}}((1, 2))]) should allow missing values in the second column, and this could be done via propertytype.

@piever commented Sep 5, 2018

Concerning the discussion about "strictifying things", here's my take. For NamedTuple iterators of unknown type, it will be inevitable to strictify things at times: (f(x) for x in rows(df)) with f type unstable or non-inferrable for example. I think we should accept this (esp. in the case of Missing) and encourage a programming style that's compatible with the type of the output occasionally strictifying depending on runtime properties.

In particular I think we should provide users with a set of functions that make this easier. Instead of append!(df, rows) that would fail if somehow df became stricter than expected, we should have a version of buildcolumns that starts with an already initialized sink, so one would write df = append_or_widen!(df, rows) (obviously with a better name). It should be easy to get this function just by splitting:

@inline function buildcolumns(schema, rowitr::T, iT=nothing) where {T}
    L = Base.IteratorSize(T)
    len = Base.haslength(L) ? length(rowitr) : 0
    nt = allocatecolumns(schema, len)
    for (i, row) in enumerate(rowitr)
        eachcolumn(add!, schema, row, L, nt, i)
    end
    return nt
end

into the part which preallocates nt and the part that appends to it.
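A hedged, self-contained sketch of that split; the names are illustrative and the widening machinery discussed above is omitted:

# allocate a NamedTuple of empty typed vectors from (names, types)
allocatecols(names::NTuple{N, Symbol}, types) where {N} =
    NamedTuple{names}(map(T -> Vector{T}(), types))

# append one row (any property-accessible object) into the column store
function addrow!(nt::NamedTuple{names}, row) where {names}
    for nm in names
        push!(getproperty(nt, nm), getproperty(row, nm))
    end
    return nt
end

# the appending half: works on a freshly allocated or a pre-existing sink
function appendrows!(nt::NamedTuple, rowitr)
    for row in rowitr
        addrow!(nt, row)
    end
    return nt
end

# buildcolumns is then just allocation followed by appending
buildcols(names, types, rowitr) = appendrows!(allocatecols(names, types), rowitr)

# e.g. appendrows!((a = [1], b = [2.0]), [(a = 3, b = 4.0)])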

We should also have an equivalent of collect_columns_flattened (see here) to handle the by case where different groups could return columns with different levels of strictification and we want to concatenate them all together (to be honest, it may be enough to optimize columns(Iterators.flatten(rows))).

The case of setindex! is trickier but setting a value to be Missing seems like a strange thing to do anyway.

@quinnj (Member Author) commented Sep 5, 2018

Ah, great idea @piever on splitting Tables.buildcolumns into the pre-allocation part & the append_or_widen! part. Let me play around w/ that and also find a way to incorporate #12, which is related.

@quinnj merged commit 3bae472 into master on Sep 5, 2018
@quinnj deleted the jq/unknowneltype branch on September 5, 2018 at 11:29