add compatibility to a general table type and remove DataFrames dependency #82

Merged
merged 11 commits into from Sep 5, 2017

Member

@piever piever commented Aug 31, 2017

This is work in progress to move from a DataFrames-based implementation of @df to an IterableTables-based implementation. The following commands now give the desired plots (this should work for everything that IterableTables supports):

using IterableTables, DataFrames, IndexedTables, RDatasets, CSV, StatPlots
iris = dataset("datasets","iris")
writetable("/tmp/iris.csv", iris)
@df CSV.Source("/tmp/iris.csv") scatter(:SepalLength, :SepalWidth)
iris[:SepalLength][2] = NA
iris_table = IndexedTable(iris)
@df iris_table scatter(:SepalLength, :SepalWidth)

The implementation of the @df macro no longer requires a DataFrames dependency. It works as follows: IterableTables provides getiterator, which turns a table into a row-based iterator that yields every row as a named tuple. The macro replaces every symbol in the plot call with a new variable and then generates two commands: one assigns the respective columns to the new variables (using the function StatPlots.compute_all), and the other makes the plot:

julia> @macroexpand @df iris_table scatter(:SepalLength, :SepalWidth)
quote 
    (##SepalLength#730, ##SepalWidth#731) = StatPlots.compute_all(iris_table, :SepalLength, :SepalWidth)
    scatter(##SepalLength#730, ##SepalWidth#731)
end

The advantage of doing all variables at once (and knowing exactly which they are) is twofold:

  • As they are all replaced together, we can see which rows have unsupported missing data and potentially remove them with a warning
  • The iterator returned by getiterator has to be traversed only once, which matters especially for CSV files or remote databases
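
For readers following along, the symbol-replacement step described above can be sketched in plain Julia. This is only an illustration under stated assumptions, not the StatPlots code: `@with_columns`, `replace_symbols!` and `extract_columns` are hypothetical names, the table is assumed to be a `NamedTuple` of vectors, and the syntax is that of current Julia (where a quoted symbol inside an expression is a `QuoteNode`). Unlike the real macro, it replaces every quoted symbol rather than only those that name a column.

```julia
# Hypothetical stand-in for StatPlots.compute_all: pull the requested
# columns out of a NamedTuple of vectors, returned as a tuple.
extract_columns(table::NamedTuple, names::Symbol...) =
    map(n -> getfield(table, n), names)

# Walk an expression and replace each quoted symbol (e.g. :a) with a fresh
# variable. `syms` collects the symbols seen, `vars` the matching variables.
function replace_symbols!(x, syms, vars)
    if x isa QuoteNode && x.value isa Symbol
        i = findfirst(isequal(x.value), syms)
        i === nothing || return vars[i]      # reuse the variable for a repeated symbol
        push!(syms, x.value)
        push!(vars, gensym(x.value))
        return vars[end]
    elseif x isa Expr
        for i in eachindex(x.args)
            x.args[i] = replace_symbols!(x.args[i], syms, vars)
        end
    end
    return x
end

# Generate two statements: one that fetches all columns in a single call,
# and one that runs the rewritten expression.
macro with_columns(table, call)
    syms, vars = Symbol[], Symbol[]
    body = replace_symbols!(call, syms, vars)
    isempty(vars) && return esc(body)
    assignment = :($(Expr(:tuple, vars...)) = extract_columns($table, $(QuoteNode.(syms)...)))
    return esc(Expr(:block, assignment, body))
end

table = (a = [1, 2, 3], b = [10, 20, 30])
total = @with_columns table sum(:a) + sum(:b) - first(:a)
```

Because all symbols are collected before any column is fetched, the single `extract_columns` call is the natural place to drop rows with missing data or to iterate the source only once.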

I still need to polish a bit and further optimize StatPlots.compute_all but I wanted to check if this is a direction we're happy to take.

Member

@mkborregaard mkborregaard left a comment

👍 from me, this looks like a nice clean way to get a lot of functionality

src/StatPlots.jl Outdated
@@ -7,6 +7,9 @@ import Plots: _cycle
using StatsBase
using Distributions
using DataFrames
using IterableTables
import DataValues: DataValue
Member

I think you could also import these from IterableTables - wouldn't that be cleaner?

Member Author

good point

Member

I gotta say your macro-fu is pretty on point :-)

Actually, I'd leave it like it is, see my other comments.

@mkborregaard
Member

I think this could be argued to also close JuliaPlots/Plots.jl#53

@mkborregaard
Member

It'd be nice to have some more opinions here, as the table support is really important for StatPlots. The idea is that StatPlots could use IterableTables to 1) support a very general interface to julia's table ecosystem, and 2) drop the bloated DataFrames dep. I like it. Thoughts from other JuliaPlots members? In particular @daschw @tbreloff @oschulz @ChrisRackauckas ?

@davidanthoff

Really excited to see this!

I'll try to look at the code in more detail tomorrow, but I should probably say a word about the relation of TableTraits.jl, TableTraitsUtils.jl and IterableTables.jl because that is not yet described in the documentation.

The high-level story is that I moved a bunch of things out of IterableTables.jl, so that packages need to take on fewer dependencies if they want to integrate with this stack. Here is the new split:

  • TableTraits.jl has the very minimal definition of the functions that make up the iterable tables trait. Any package that wants to integrate with iterable tables needs to take a dependency on this one.
  • TableTraitsUtils.jl is a package that provides a default implementation of the iterable table interface. It essentially provides a couple of functions that might make it easier for a given source or sink to integrate with iterable tables. One can use it or not; that is up to each individual package. The package exports two functions right now. create_tableiterator: you pass it a tuple of arrays and the names of the columns, and it returns an efficient row iterator. This is useful for packages that are sources (so not relevant here). create_columns_from_iterabletable is for sinks like this one: you pass it an iterable table, and it returns a tuple whose first element is a tuple of arrays (the column data) and whose second element is the names of the columns. Here is an example of how it can be used. So any sink that ends up materializing an iterable table into a vector of vectors can use this implementation, which is quite optimized. Not sure whether that would be helpful here or not.
  • IterableTables.jl has all the integrations with various packages that have not moved into those packages. My hope is that eventually it will go away. For example, we very recently moved the integration for IndexedTables.jl out of IterableTables.jl and into IndexedTables.jl. My hope is that we can do that for all the sinks and sources. BUT, my sense is that will not happen anytime soon for DataFrames.jl, because there is some mega-refactoring/roadmapping going on over there and my sense is that the maintainers over there don't want to take a dependency on DataValues.jl (they are moving things over to Nulls.jl right now, which is not an option for my stack because it is too slow on julia 0.6 and doesn't work with Query.jl as of right now). So, for right now, someone has to load IterableTables.jl so that the DataFrames.jl integration just works... On my end I have packages like Query.jl, CSVFiles.jl etc. load IterableTables.jl by default, and I think I'll keep it that way, just to make sure users have a smooth experience... Probably makes sense to do the same here. But the way I would do this is that your complete implementation should only depend on stuff in TableTraits.jl (and maybe TableTraitsUtils.jl), and then you just put one import IterableTables somewhere so that the integration with DataFrames.jl is loaded, but don't use anything from IterableTables.jl explicitly.

Sorry for the long text, might well be that this was all clear to you guys already :)

@davidanthoff

> I think this could be argued to also close JuliaPlots/Plots.jl#53

Yes, you should be able to consume any DataStreams.jl source via this PR here.

@ChrisRackauckas
Member

I think IterableTables.jl is great and setting StatPlots.jl to work with this ecosystem will only be a good idea in the long run.


@davidanthoff davidanthoff left a comment

This all looks great to me. I think the only thing that probably should be added is a call to isiterabletable somewhere. The other suggestions might make things a bit faster, but aren't strictly necessary.

I don't fully understand the macro foo or the details of the Plots.jl architecture, so really glad you took a stab at this, I could not have done it.

src/StatPlots.jl Outdated
@@ -7,6 +7,9 @@ import Plots: _cycle
using StatsBase
using Distributions
using DataFrames
using IterableTables

I would make this import IterableTables. You really just want the package to be loaded, but you don't need anything from it directly.

src/StatPlots.jl Outdated
@@ -7,6 +7,9 @@ import Plots: _cycle
using StatsBase
using Distributions
using DataFrames
using IterableTables
import DataValues: DataValue

Actually, I'd leave it like it is, see my other comments.

src/df.jl Outdated
catch
error("Missing data of type $T is not supported")
function compute_all(df, syms...)
iter = IterableTables.getiterator(df)

You probably want to check whether you actually got an iterable table by calling TableTraits.isiterabletable(df), right? Someone could have passed you something that is something else entirely.

src/df.jl Outdated
iter = IterableTables.getiterator(df)
type_info = Dict(zip(column_names(iter), column_types(iter)))
cols = Tuple(isa(s, Integer) ? Array{column_types(iter)[s]}(0) :
s in column_names(iter) ? Array{type_info[s]}(0) : s for s in syms)

Ah, so here you actually only materialize some columns, right? I think that means you should not use TableTraitsUtils.jl, because that will always materialize all columns.

One thing you could add later is to check whether Base.iteratorsize(typeof(iter))==Base.HasLength(), in which case you can call length(iter) to get the number of rows you will eventually get and then you can either pre-allocate the array or maybe just use sizehint! on the arrays that you allocate here. In general that should speed up the for loop later where you push into the array.
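
A minimal sketch of that preallocation idea, written for current Julia, where the trait is spelled `Base.IteratorSize` (the comment above uses the Julia 0.6 spelling `Base.iteratorsize`). The function name `collect_field` is hypothetical and rows are assumed to be named tuples; this is not the code from the PR.

```julia
# Collect one field from a row iterator, reserving capacity up front
# when the iterator advertises its length.
function collect_field(rows, name::Symbol, ::Type{T}) where {T}
    out = T[]
    if Base.IteratorSize(typeof(rows)) isa Union{Base.HasLength, Base.HasShape}
        sizehint!(out, length(rows))   # avoids repeated reallocation in the loop
    end
    for row in rows
        push!(out, getproperty(row, name))
    end
    return out
end

rows = [(x = 1, y = 2.0), (x = 3, y = 4.0)]       # a Vector knows its length
ys = collect_field(rows, :y, Float64)

lazy = Iterators.filter(r -> r.x > 1, rows)       # length unknown: no size hint
xs = collect_field(lazy, :x, Int)
```

Using `sizehint!` rather than allocating a full-length array keeps the `push!` loop identical for both the known-length and unknown-length cases.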

Member

Yes, essentially the idea is for the df macro to be a with-style macro that only extracts the columns that are actually needed (and only tries to do this on symbols that are names in the table).

Member Author

@piever piever Sep 1, 2017

@davidanthoff: I should probably check all the techniques you use in TableTraitsUtils.jl to speed up this part (it seems to me that mainly I'm missing unrolling the for loop over column names with generated functions and preallocating wherever possible). Still, this feels like unnecessary code duplication. Looking at your code, it seems like it'd be very easy to generalize it to return only some of the columns, with an appropriate keyword argument (if you want, I can try to make a PR there). In this case, I don't think this would go against the spirit of keeping IterableTables as tiny as possible, as developers of other packages wouldn't need to do anything extra to support this new feature, whereas the StatPlots use case (select a few columns from an IterableTable) seems common enough, and I'm not sure the code to do that efficiently belongs in a statistical plotting package. What do you think?

Yes, I thought a bit more about this and actually came to the same conclusion :) If we just keep that as an option in TableTraitsUtils.jl it is all cool as it won't complicate the core interface in TableTraits.jl at all.

Contributor

@oschulz oschulz Sep 1, 2017

@davidanthoff This is the first time I'm taking a deeper look at IterableTables (has been on my to-do list for a while though ;-) ). Maybe I didn't look in the right place in the docs - what is the recommended way to iterate over a subset of columns without loading all of them?

I don't want to change the iterable tables interface, i.e. you would still always get an iterator that returns a named tuple that has all the columns in it. But I want to now provide a helper function in TableTraitsUtils.jl that materializes only a subset of the columns into arrays.
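
To make the idea concrete, here is a rough sketch of such a helper in current Julia, using Base named tuples as rows. The name `materialize_columns` is hypothetical and this is not the TableTraitsUtils.jl API; it also assumes the row iterator has a concrete `NamedTuple` element type, so that the column types can be read off before iterating.

```julia
# Materialize only the requested columns of a row iterator into typed vectors.
# Each row still carries every column; the unused ones are simply never stored.
function materialize_columns(rows, names::Symbol...)
    R = eltype(rows)                                   # assumed: concrete NamedTuple type
    cols = map(n -> Vector{fieldtype(R, n)}(), names)  # one empty vector per requested column
    for row in rows
        for (col, n) in zip(cols, names)
            push!(col, getfield(row, n))
        end
    end
    return cols
end

rows = [(a = 1, b = "x", c = 0.5), (a = 2, b = "y", c = 1.5)]
a, c = materialize_columns(rows, :a, :c)               # column b is skipped
```

The source is traversed exactly once regardless of how many columns are requested, which is the property the @df macro relies on.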

There are two reasons: one is that it keeps the iterable tables interface simple. Adding the ability to tell the source which columns it should return adds additional complexity requirements on the source, and it is a bit difficult to see where to stop. Plus, I think I have a vague idea how one can have a similar story based on something in Query.jl. Not fleshed out yet, but stay tuned ;)

The other reason is that I think the compiler might actually be able to optimize away all the tuple accesses to columns that aren't used. I'm not 100% sure about that, but when I talked with Keno about it in July I got the impression that he thought it might be feasible. In general, if one is careful, a lot of things in the iterable tables interface get inlined and one typically ends up with just one tight loop with no function calls at all. If, inside that loop, one accesses a column, but then that column is never used inside the loop, it should in theory be feasible for the compiler to detect that and just remove that access. If that is true and works, there is actually no need to add an option to the interface itself to select columns. I think for now it is probably best to check out that option carefully at some point, and if that can't be pulled off we can revisit an option in the core iterable tables interface.

Contributor

@oschulz oschulz Sep 1, 2017

> The other reason is that I think the compiler might actually be able to optimize all the tuple access to columns that aren't used away.

I had hoped that might be the case, but at least in a very simple toy example I tried it wasn't. Inlining is, of course, sometimes hard to predict/control.

But I was also thinking about out-of-core data sources (e.g. databases, column-store file-formats, etc.), where one may need up-front knowledge about which columns to load, as loading columns may be expensive. I think in Query.jl, that's currently hard-coded for SQLite, but a generic solution would be very cool.

I do understand the need to keep things simple, of course.

> typically ends up with just one tight loop with no function calls at all

Might that even SIMD-vectorize? Would depend on not having any branch-expressions, of course (so no on-demand loading, etc.).

Contributor

Sorry for getting a bit off-topic here.

> I had hoped that might be the case, but at least in a very simple toy example I tried it wasn't. Inlining is, of course, sometimes hard to predict/control.

I tried that way, way back and a lot of even quite involved Query.jl queries ended up being inlined into one tight loop. I'm not sure whether that is enough for the compiler to ignore access to columns that are not needed, though...

> I think in Query.jl, that's currently hard-coded for SQLite, but a generic solution would be very cool.

Yes, totally agree. My current thinking is that it would be great if a source could implement a partial Query.jl backend. For example, say you have this code load("file.csv") |> @select({_.a, _.b}) |> DataFrame (*), it would first go to a CSVFiles.jl Query.jl backend that realizes that this query implies that only column a and b are needed, and then skips all other columns at CSV parse time already. The same logic could apply to lots of other sources. But for this to work it would be really key that CSVFiles.jl doesn't have to implement a full Query.jl backend, but some lighter version of it. I've been mulling this whole question for a while, and it is slowly clearing up in my mind, but will probably still take a while.

> Might that even SIMD-vectorize? Would depend on not having any branch-expressions, of course (so no on-demand loading, etc.).

Yeah, I think that could probably be made to work.

Contributor

Sounds exciting - I look forward to seeing what you'll come up with. I know how it is, sometimes concepts just need some time to mature in one's mind ...

@daschw
Member

daschw commented Sep 1, 2017

This is great! Nice work @piever! I'm definitely in favour of this change and dropping the DataFrames dependency.
Thanks also @davidanthoff for providing these valuable insights! It was not at all clear to me already 😄

@mkborregaard
Member

Yes, thanks @davidanthoff and once again to @piever for taking the lead on this!

@mkborregaard
Member

With regards to potential conflicts with the Nulls.jl-based architecture I think we can cross that bridge when we come to it.

@piever
Member Author

piever commented Sep 1, 2017

Yep, thanks @davidanthoff for the explanation, it was very helpful and at least to me not at all obvious.

@piever
Member Author

piever commented Sep 4, 2017

I've incorporated @davidanthoff's feedback and used the method from queryverse/TableTraitsUtils.jl#2 to efficiently materialize a subset of columns from an iterable table. Will merge tomorrow if everybody's ok.

Member

@mkborregaard mkborregaard left a comment

Apart from the concern about _df I think this looks great!

src/df.jl Outdated

function _df(d, x::Expr)
(x.head == :quote) && return :(StatPlots.select_column($d, $x))
function _df(d, x::Expr, syms, vars)
Member

Shouldn't this be _df! seeing as it modifies the input?

Member Author

Good point!

Member

Could the names also be more informative? compute_all seems very broad (should it perhaps be extract_columns_from_table?) and _df a little sparse (parse_table_call or something like that)?

Member Author

Good choice of names, I'll just replace table with iterabletable to be consistent with the TableTraitsUtils functions.

@piever piever merged commit 0d10166 into JuliaPlots:master Sep 5, 2017
@piever piever deleted the iter branch September 5, 2017 09:12
@mkborregaard
Member

🎉 time for a new StatPlots release? Or do we want to play around with it a little first?

@piever
Member Author

piever commented Sep 5, 2017

I'd be in favor of releasing quickly, as I was planning to start making potentially breaking changes to groupapply and I think it's better to have a release first.

@mkborregaard mkborregaard changed the title WIP: add compatibility to a general table type and remove DataFrames dependency add compatibility to a general table type and remove DataFrames dependency Sep 5, 2017
@mkborregaard
Member

I've made the release.
