add compatibility to a general table type and remove DataFrames dependency #82

piever · 2017-08-31T13:58:01Z

This is work in progress to move from a DataFrames-based implementation of @df to an IterableTables-based implementation. The following commands now give the desired plots (this should work for everything that IterableTables supports):

using IterableTables, DataFrames, IndexedTables, RDatasets, CSV, StatPlots
iris = dataset("datasets","iris")
writetable("/tmp/iris.csv", iris)
@df CSV.Source("/tmp/iris.csv") scatter(:SepalLength, :SepalWidth)
iris[:SepalLength][2] = NA
iris_table = IndexedTable(iris)
@df iris_table scatter(:SepalLength, :SepalWidth)

The implementation of the @df macro no longer requires a DataFrames dependency. It works in the following way: IterableTables provides a getiterator function which, given a table, creates a row-based iterator that yields every row as a named tuple. What I do here is replace every symbol in the plot call with a new variable. The macro then generates two statements: one assigns the respective columns to the new variables (using the function StatPlots.compute_all) and the other does the plot:

julia> @macroexpand @df iris_table scatter(:SepalLength, :SepalWidth)
quote 
    (##SepalLength#730, ##SepalWidth#731) = StatPlots.compute_all(iris_table, :SepalLength, :SepalWidth)
    scatter(##SepalLength#730, ##SepalWidth#731)
end
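For readers who want to see the rewriting step in isolation, here is a minimal sketch of the same idea on current Julia, using only Base. It is not the StatPlots code: `extract_columns`, `replace_symbols!`, and `@with_columns` are hypothetical names, the "table" is just a vector of named tuples, and unlike @df the sketch replaces every quoted symbol without checking whether it names a column. On current Julia a quoted symbol inside a call parses as a `QuoteNode`, which is what the sketch dispatches on.

```julia
# Illustrative names only: `extract_columns`, `replace_symbols!`, and
# `@with_columns` are hypothetical stand-ins, not the StatPlots API.

# Gather the requested columns in one pass over a row iterator.
function extract_columns(rows, names::Symbol...)
    cols = map(_ -> Any[], names)      # one empty vector per requested column
    for row in rows                    # the source is iterated exactly once
        for (col, name) in zip(cols, names)
            push!(col, getproperty(row, name))
        end
    end
    return cols
end

# Leaves that are not quoted symbols (numbers, function names, ...) stay as they are.
replace_symbols!(x, syms, vars) = x

# A quoted symbol such as `:a` becomes a fresh variable, reused on repeats.
function replace_symbols!(x::QuoteNode, syms, vars)
    x.value isa Symbol || return x
    i = findfirst(isequal(x.value), syms)
    if i === nothing
        push!(syms, x.value)
        push!(vars, gensym(x.value))
        i = length(syms)
    end
    return vars[i]
end

# Recurse into nested expressions, rewriting arguments in place.
function replace_symbols!(x::Expr, syms, vars)
    for i in eachindex(x.args)
        x.args[i] = replace_symbols!(x.args[i], syms, vars)
    end
    return x
end

macro with_columns(table, call)
    syms, vars = Symbol[], Symbol[]
    body = replace_symbols!(call, syms, vars)
    lhs = Expr(:tuple, vars...)
    names = map(QuoteNode, syms)
    # Two statements: assign all columns at once, then run the rewritten call.
    return esc(quote
        $lhs = extract_columns($table, $(names...))
        $body
    end)
end

rows = [(a = 1, b = 2.0, c = "x"), (a = 3, b = 4.0, c = "y")]
total = @with_columns rows sum(:a) + sum(:b)
```

Column `c` is never touched here, because only the symbols that appear in the call are collected.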

The advantage of doing all variables at once (and knowing exactly which they are) is twofold:

1) As they are all replaced together, we can see which rows have unsupported missing data and potentially remove them with a warning.
2) The getiterator has to be iterated only once, which is better especially in the case of CSV or remote databases.
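Both points can be sketched together in plain Julia. This is an illustration under assumptions, not the PR's compute_all: `columns_without_missing` is a hypothetical helper, and Base's `missing` stands in for the NA / DataValue representations discussed in this thread.

```julia
# Hypothetical helper, not part of StatPlots: collect the selected columns in a
# single pass and skip rows where any *selected* value is missing.
function columns_without_missing(rows, names::Symbol...)
    cols = map(_ -> Any[], names)          # one vector per selected column
    dropped = 0
    for row in rows                        # the source is iterated only once
        vals = map(n -> getproperty(row, n), names)
        if any(ismissing, vals)
            dropped += 1                   # the whole row is skipped
            continue
        end
        foreach(push!, cols, vals)         # push each value onto its column
    end
    dropped > 0 && @warn "Dropped $dropped row(s) with missing values"
    return cols, dropped
end

rows = [(a = 1, b = 2.0, c = missing),
        (a = missing, b = 3.0, c = "y"),
        (a = 4, b = 5.0, c = "z")]
cols, dropped = columns_without_missing(rows, :a, :b)
```

Because the columns are gathered together, a missing value in `:a` removes the matching entry of `:b` as well, so the resulting vectors stay aligned. The missing value in `c` is ignored because `c` was not requested.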

I still need to polish a bit and further optimize StatPlots.compute_all but I wanted to check if this is a direction we're happy to take.

mkborregaard

👍 from me, this looks like a nice clean way to get a lot of functionality

mkborregaard · 2017-08-31T16:16:17Z

src/StatPlots.jl

@@ -7,6 +7,9 @@ import Plots: _cycle
 using StatsBase
 using Distributions
 using DataFrames
+using IterableTables
+import DataValues: DataValue


I think you could also import these from IterableTables - wouldn't that be cleaner?

I gotta say your macro-fu is pretty on point :-)

Actually, I'd leave it like it is, see my other comments.

mkborregaard · 2017-08-31T17:33:19Z

I think this could be argued to also close JuliaPlots/Plots.jl#53

mkborregaard · 2017-08-31T21:40:40Z

It'd be nice to have some more opinions here, as the table support is really important for StatPlots. The idea is that StatPlots could use IterableTables to 1) support a very general interface to julia's table ecosystem, and 2) drop the bloated DataFrames dep. I like it. Thoughts from other JuliaPlots members? In particular @daschw @tbreloff @oschulz @ChrisRackauckas ?

davidanthoff · 2017-09-01T00:14:03Z

Really excited to see this!

I'll try to look at the code in more detail tomorrow, but I should probably say a word about the relation of TableTraits.jl, TableTraitsUtils.jl and IterableTables.jl because that is not yet described in the documentation.

The high-level story is that I moved a bunch of things out of IterableTables.jl, so that packages need to take on fewer dependencies if they want to integrate with this stack. Here is the new split:

TableTraits.jl has the very minimal definition of the functions that make up the iterable tables trait. Any package that wants to integrate with iterable tables needs to take a dependency on this one.
TableTraitsUtils.jl is a package that provides a default implementation of the iterable table interface. It essentially provides a couple of functions that might make it easier for a given source or sink to integrate with iterable tables. One can use it or not, up to each individual package. That package exports two functions right now. create_tableiterator: you pass it a tuple of arrays and the names of the columns, and it will return an efficient row iterator. This is useful for packages that are sources (so not relevant here). create_columns_from_iterabletable: this one is for sinks like this PR. You pass it an iterable table, and it returns a tuple with the first element being a tuple of arrays (the column data) and the second element being the names of the columns. Here is an example of how it can be used. So any sink that ends up materializing an iterable table into a vector of vectors can use this implementation, which is quite optimized. Not sure whether that would be helpful here or not.
IterableTables.jl has all the integrations with various packages that have not moved into those packages. My hope is that eventually it will go away. For example, we very recently moved the integration for IndexedTables.jl out of IterableTables.jl and into IndexedTables.jl. My hope is that we can do that for all the sinks and sources. BUT, my sense is that will not happen anytime soon for DataFrames.jl, because there is some mega-refactoring/roadmapping going on over there and my sense is that the maintainers over there don't want to take a dependency on DataValues.jl (they are moving things over to Nulls.jl right now, which is not an option for my stack because it is too slow on julia 0.6 and doesn't work with Query.jl as of right now). So, for right now, someone has to load IterableTables.jl so that the DataFrames.jl integration just works... On my end I have packages like Query.jl, CSVFiles.jl etc. load IterableTables.jl by default, and I think I'll keep it that way, just to make sure users have a smooth experience... Probably makes sense to do the same here. But the way I would do this is that your complete implementation should only depend on stuff in TableTraits.jl (and maybe TableTraitsUtils.jl), and then you just put one import IterableTables somewhere so that the integration with DataFrames.jl is loaded, but don't use anything from IterableTables.jl explicitly.

Sorry for the long text, might well be that this was all clear to you guys already :)

davidanthoff · 2017-09-01T00:14:59Z

> I think this could be argued to also close JuliaPlots/Plots.jl#53

Yes, you should be able to consume any DataStreams.jl source via this PR here.

ChrisRackauckas · 2017-09-01T00:26:23Z

I think IterableTables.jl is great and setting StatPlots.jl to work with this ecosystem will only be a good idea in the long run.

davidanthoff

This all looks great to me. I think the only thing that probably should be added is a call to isiterabletable somewhere. The other suggestions might make things a bit faster, but aren't strictly necessary.

I don't fully understand the macro foo or the details of the Plots.jl architecture, so really glad you took a stab at this, I could not have done it.

davidanthoff · 2017-09-01T05:08:28Z

src/StatPlots.jl

@@ -7,6 +7,9 @@ import Plots: _cycle
 using StatsBase
 using Distributions
 using DataFrames
+using IterableTables


I would make this import IterableTables. You really just want the package to be loaded, but you don't need anything from it directly.

davidanthoff · 2017-09-01T05:08:56Z

src/StatPlots.jl

@@ -7,6 +7,9 @@ import Plots: _cycle
 using StatsBase
 using Distributions
 using DataFrames
+using IterableTables
+import DataValues: DataValue


Actually, I'd leave it like it is, see my other comments.

davidanthoff · 2017-09-01T05:12:46Z

src/df.jl

-    catch
-        error("Missing data of type $T is not supported")
+function compute_all(df, syms...)
+    iter = IterableTables.getiterator(df)


You probably want to check whether you actually got an iterable table by calling TableTraits.isiterabletable(df), right? Someone could have passed you something that is something else entirely.

davidanthoff · 2017-09-01T05:16:30Z

src/df.jl

+    iter = IterableTables.getiterator(df)
+    type_info = Dict(zip(column_names(iter), column_types(iter)))
+    cols = Tuple(isa(s, Integer) ? Array{column_types(iter)[s]}(0) :
+        s in column_names(iter) ? Array{type_info[s]}(0) : s for s in syms)


Ah, so here you actually only materialize some columns, right? I think that means you should not use TableTraitsUtils.jl, because that will always materialize all columns.

One thing you could add later is to check whether Base.iteratorsize(typeof(iter))==Base.HasLength(), in which case you can call length(iter) to get the number of rows you will eventually get and then you can either pre-allocate the array or maybe just use sizehint! on the arrays that you allocate here. In general that should speed up the for loop later where you push into the array.
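A minimal sketch of that suggestion, written for current Julia, where the trait function is spelled `Base.IteratorSize` (the `Base.iteratorsize` spelling above is the Julia 0.6 one). `collect_column` is a hypothetical helper, and accepting `HasShape` as well as `HasLength` is my own assumption, since both support `length`.

```julia
# Hypothetical helper: materialize one column from a row iterator, reserving
# capacity up front when the iterator knows its length.
function collect_column(rows, name::Symbol)
    col = Any[]
    # HasLength and HasShape iterators both support `length`.
    if Base.IteratorSize(typeof(rows)) isa Union{Base.HasLength, Base.HasShape}
        sizehint!(col, length(rows))   # avoids repeated regrowth during push!
    end
    for row in rows
        push!(col, getproperty(row, name))
    end
    return col
end

rows = [(a = 1, b = 2.0), (a = 3, b = 4.0)]
known = collect_column(rows, :a)                      # length known in advance

filtered = Iterators.filter(r -> r.a > 1, rows)       # length not known
unknown = collect_column(filtered, :b)
```

The filtered iterator reports `Base.SizeUnknown()`, so the hint is skipped and the loop falls back to plain `push!`.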

Yes, essentially the idea is for the df macro to be a with-style macro that only extracts the columns that are actually needed (and only tries to do this on symbols that are names in the table).

@davidanthoff: I should probably check all the techniques you use in TableTraitsUtils.jl to speed up this part (it seems to me that mainly I'm missing unrolling the for loop over column names with generated functions and preallocating wherever possible). Still, this feels like unnecessary code duplication. By looking at your code, it seems like it'd be very easy to generalize it to only return some of the columns, with an appropriate keyword argument (if you want I can try and make a PR there). In this case, I don't think this would go against the spirit of keeping IterableTables as tiny as possible, as developers of other packages wouldn't need to do anything extra to support this new feature, whereas the StatPlots use case (select a few columns from an IterableTable) seems common enough and I'm not sure the code to do that efficiently belongs in a statistical plotting package. What do you think?

Yes, I thought a bit more about this and actually came to the same conclusion :) If we just keep that as an option in TableTraitsUtils.jl it is all cool as it won't complicate the core interface in TableTraits.jl at all.

@davidanthoff This is the first time I'm taking a deeper look at IterableTables (has been on my to-do list for a while though ;-) ). Maybe I didn't look in the right place in the docs - what is the recommended way to iterate over a subset of columns without loading all of them?

I don't want to change the iterable tables interface, i.e. you would still always get an iterator that returns a named tuple that has all the columns in it. But I want to now provide a helper function in TableTraitsUtils.jl that materializes only a subset of the columns into arrays.

There are two reasons: one is that it keeps the iterable tables interface simple. Adding the ability to tell the source which columns it should return adds additional complexity requirements on the source, and it is a bit difficult to see where to stop. Plus, I think I have a vague idea how one can have a similar story based on something in Query.jl. Not fleshed out yet, but stay tuned ;)

The other reason is that I think the compiler might actually be able to optimize all the tuple access to columns that aren't used away. I'm not 100% sure about that, but when I talked with Keno about that in July I got the impression that he might think that feasible. In general, if one is careful, a lot of things in the iterable tables interface get inlined and one typically ends up with just one tight loop with no function calls at all. If, inside that loop, one accesses a column, but then that column is never used inside the loop, it should in theory be feasible for the compiler to detect that and just remove that access. If that is true and works, there is actually no need to add an option to the interface itself to select columns. I think for now it is probably best to at some point check out that option carefully, and if that can't be pulled off we can revisit an option in the core iterable tables interface.

> The other reason is that I think the compiler might actually be able to optimize all the tuple access to columns that aren't used away.

I had hoped that might be the case, but at least in a very simple toy example I tried it wasn't. Inlining is, of course, sometimes hard to predict/control.

But I was also thinking about out-of-core data sources (e.g. databases, column-store file-formats, etc.), where one may need up-front knowledge about which columns to load, as loading columns may be expensive. I think in Query.jl, that's currently hard-coded for SQLite, but a generic solution would be very cool.

I do understand the need to keep things simple, of course.

> typically ends up with just one tight loop with no function calls at all

Might that even SIMD-vectorize? Would depend on not having any branch-expressions, of course (so no on-demand loading, etc.).

Sorry for getting a bit off-topic here.

> I had hoped that might be the case, but at least in a very simple toy example I tried it wasn't. Inlining is, of course, sometimes hard to predict/control.

I tried that way, way back and a lot of even quite involved Query.jl queries ended up being inlined into one tight loop. I'm not sure whether that is enough for the compiler to ignore access to columns that are not needed, though...

> I think in Query.jl, that's currently hard-coded for SQLite, but a generic solution would be very cool.

Yes, totally agree. My current thinking is that it would be great if a source could implement a partial Query.jl backend. For example, say you have this code load("file.csv") |> @select({_.a, _.b}) |> DataFrame (*), it would first go to a CSVFiles.jl Query.jl backend that realizes that this query implies that only column a and b are needed, and then skips all other columns at CSV parse time already. The same logic could apply to lots of other sources. But for this to work it would be really key that CSVFiles.jl doesn't have to implement a full Query.jl backend, but some lighter version of it. I've been mulling this whole question for a while, and it is slowly clearing up in my mind, but will probably still take a while.

> Might that even SIMD-vectorize? Would depend on not having any branch-expressions, of course (so no on-demand loading, etc.).

Yeah, I think that could probably be made to work.

Sounds exciting - I look forward to seeing what you'll come up with. I know how it is, sometimes concepts just need some time to mature in one's mind ...

daschw · 2017-09-01T06:02:22Z

This is great! Nice work @piever! I'm definitely in favour of this change and dropping the DataFrames dependency.
Thanks also @davidanthoff for providing these valuable insights! It was not at all clear to me already 😄

mkborregaard · 2017-09-01T06:35:50Z

Yes, thanks @davidanthoff and once again to @piever for taking the lead on this!

mkborregaard · 2017-09-01T06:37:13Z

With regards to potential conflicts with the Nulls.jl-based architecture I think we can cross that bridge when we come to it.

piever · 2017-09-01T14:18:16Z

Yep, thanks @davidanthoff for the explanation, it was very helpful and at least to me not at all obvious.

piever · 2017-09-04T15:17:49Z

I've incorporated @davidanthoff's feedback and used the method from queryverse/TableTraitsUtils.jl#2 to efficiently materialize a subset of columns from an iterable table. Will merge tomorrow if everybody's ok.

mkborregaard

Apart from the concern about _df I think this looks great!

mkborregaard · 2017-09-04T19:13:52Z

src/df.jl


-function _df(d, x::Expr)
-    (x.head == :quote) && return :(StatPlots.select_column($d, $x))
+function _df(d, x::Expr, syms, vars)


Shouldn't this be _df! seeing as it modifies the input?

Good point!

Could the names also be more informative? compute_all seems very broad (should it perhaps be extract_columns_from_table?) and _df a little sparse (parse_table_call or something like that)?

Good choice of names, I'll just replace table with iterabletable to be consistent with the TableTraitsUtils functions.

mkborregaard · 2017-09-05T10:17:01Z

🎉 time for a new StatPlots release? Or do we want to play around with it a little first?

piever · 2017-09-05T10:45:34Z

I'd be in favor of releasing quickly, as I was planning to start making potentially breaking changes to groupapply and I think it's better to have a release first.

mkborregaard · 2017-09-05T13:18:10Z

I've made the release.

Pietro Vertechi added 9 commits August 30, 2017 16:26

added support to non dataframes

fca39ac

tested for csvsource

a6201ae

simplified missing data

a6fdd7c

clean code

64c4a9e

remove excessive try catch block

dd23d74

wip change vars together

7a02029

compute together arrays

f11fd2f

updated cols

b81027a

fixed argnames, IterableTables bound

51732b9

mkborregaard reviewed Aug 31, 2017

davidanthoff reviewed Sep 1, 2017

piever mentioned this pull request Sep 1, 2017

allow to create only some columns from iterator queryverse/TableTraitsUtils.jl#2

Merged

switched to tabletraitsutils

f5500d8

mkborregaard approved these changes Sep 4, 2017

renaming

07518ee

piever merged commit 0d10166 into JuliaPlots:master Sep 5, 2017

This was referenced Sep 5, 2017

group= plots don't work with DataFrames master #33

Closed

DataStreams JuliaPlots/Plots.jl#53

Closed

piever deleted the iter branch September 5, 2017 09:12

mkborregaard changed the title ~~WIP: add compatibility to a general table type and remove DataFrames dependency~~ add compatibility to a general table type and remove DataFrames dependency Sep 5, 2017

This was referenced Sep 5, 2017

Explicitly require tabletraits and Datavalues #87

Merged

WIP Add Plots sink support queryverse/IterableTables.jl#12

Closed
