New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add compatibility to a general table type and remove DataFrames dependency #82
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 from me, this looks like a nice clean way to get a lot of functionality
src/StatPlots.jl
Outdated
@@ -7,6 +7,9 @@ import Plots: _cycle | |||
using StatsBase | |||
using Distributions | |||
using DataFrames | |||
using IterableTables | |||
import DataValues: DataValue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you could also import these from IterableTables - wouldn't that be cleaner?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I gotta say your macro-fu is pretty on point :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I'd leave it like it is, see my other comments.
I think this could be argued to also close JuliaPlots/Plots.jl#53 |
It'd be nice to have some more opinions here, as the table support is really important for StatPlots. The idea is that StatPlots could use IterableTables to 1) support a very general interface to julia's table ecosystem, and 2) drop the bloated DataFrames dep. I like it. Thoughts from other JuliaPlots members? In particular @daschw @tbreloff @oschulz @ChrisRackauckas ? |
Really excited to see this! I'll try to look at the code in more detail tomorrow, but I should probably say a word about the relation of TableTraits.jl, TableTraitsUtils.jl and IterableTables.jl because that is not yet described in the documentation. The high level story is that I moved a bunch of things out of IterableTables.jl, so that packages need to take on less dependencies if they want to integrate with this stack. Here is the new split:
Sorry for the long text, might well be that this was all clear to you guys already :) |
Yes, you should be able to consume any DataStreams.jl source via this PR here. |
I think IterableTables.jl is great and setting StatPlots.jl to work with this ecosystem will only be a good idea in the long run. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This all looks great to me. I think the only thing that probably should be added is a call to isiterabletable
somewhere. The other suggestions might make things a bit faster, but aren't strictly necessary.
I don't fully understand the macro foo or the details of the Plots.jl architecture, so really glad you took a stab at this, I could not have done it.
src/StatPlots.jl
Outdated
@@ -7,6 +7,9 @@ import Plots: _cycle | |||
using StatsBase | |||
using Distributions | |||
using DataFrames | |||
using IterableTables |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would make this import IterableTables
. You really just want the package to be loaded, but you don't need anything from it directly.
src/StatPlots.jl
Outdated
@@ -7,6 +7,9 @@ import Plots: _cycle | |||
using StatsBase | |||
using Distributions | |||
using DataFrames | |||
using IterableTables | |||
import DataValues: DataValue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I'd leave it like it is, see my other comments.
src/df.jl
Outdated
catch | ||
error("Missing data of type $T is not supported") | ||
function compute_all(df, syms...) | ||
iter = IterableTables.getiterator(df) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You probably want to check whether you actually got an iterable table by calling TableTraits.isiterabletable(df)
, right? Someone could have passed you something that is something else entirely.
src/df.jl
Outdated
iter = IterableTables.getiterator(df) | ||
type_info = Dict(zip(column_names(iter), column_types(iter))) | ||
cols = Tuple(isa(s, Integer) ? Array{column_types(iter)[s]}(0) : | ||
s in column_names(iter) ? Array{type_info[s]}(0) : s for s in syms) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, so here you actually only materialize some columns, right? I think that means you should not use TableTraitsUtils.jl, because that will always materialize all columns.
One thing you could add later is to check whether Base.iteratorsize(typeof(iter))==Base.HasLength()
, in which case you can call length(iter)
to get the number of rows you will eventually get and then you can either pre-allocate the array or maybe just use sizehint!
on the arrays that you allocate here. In general that should speed up the for
loop later where you push into the array.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, essentially the idea is for the df macro to be a with-style macro that only extracts the columns that are actually needed (and only tries to do this on symbols that are names in the table).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@davidantoff: I should probably check all the techniques you use in TableTraitsUtils.jl to speed up this part (it seems to me that mainly I'm missing unlooping the for
loop over column names with generated functions and preallocating wherever possible). Still, this feels like unnecessary code duplication. By looking at your code, it seems like it'd be very easy to generalize it to only return some of the columns, with an appropriate keyword argument (if you want I can try and make a PR there). In this case, I don't think this would go against the spirit of keeping IterableTables as tiny as possible, as developers of other packages wouldn't need to do anything extra to support this new feature, whereas the StatPlots use case (select a few columns from an IterableTable) seems common enough and I'm not sure the code to do that efficiently belongs to a statistical plotting package. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I thought a bit more about this and actually came to the same conclusion :) If we just keep that as an option in TableTraitsUtils.jl it is all cool as it won't complicate the core interface in TableTraits.jl at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@davidanthoff This is the first time I'm taking a deeper look at IterableTables (has been on my to-do list for a while though ;-) ). Maybe I didn't look in the right place in the docs - what is the recommended way to iterate over a subset of columns without loading all of them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't want to change the iterable tables interface, i.e. you would still always get an iterator that returns a named tuple that has all the columns in it. But I want to now provide a helper function in TableTraitsUtils.jl that materializes only a subset of the columns into arrays.
There are two reasons: one is that it keeps the iterable tables interface simple. Adding the ability to tell the source which columns it should return adds additional complexity requirements on the source, and it is a bit difficult to see where to stop. Plus, I think I have a vague idea how one can have a similar story based on something in Query.jl. Not fleshed out yet, but stay tuned ;)
The other reason is that I think the compiler might actually be able to optimize all the tuple access to columns that aren't used away. I'm not a 100% sure about that, but when I talked with Keno about that in July I got the impression that he might think that feasible. In general, if one is careful, a lot of things in the iterable tables interface get inlined and one typically ends up with just one tight loop with no function calls at all. If, inside that loop, one accesses a column, but then that column is never used inside the loop, it should in theory be feasible for the compiler to detect that and just remove that access. If that is true and works, there is actually no need to add an option to the interface itself to select columns. I think for now it is probably best to at some point check out that option carefully, and if that can't be pulled of we can revisit an option in the core iterable tables interface.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The other reason is that I think the compiler might actually be able to optimize all the tuple access to columns that aren't used away.
I had hoped that might be the case, but at least in a very simple toy example I tried it wasn't. Inlining is, of course, sometimes hard to predict/control.
But I was also thinking about out-of-core data sources (e.g. databases, column-store file-formats, etc.), where one may need up-front knowledge about which columns to load, as loading columns may be expensive. I think in Query.jl, that's currently hard-coded for SQLite, but a generic solution would be very cool.
I do understand the need to keep things simple, of course.
typically ends up with just one tight loop with no function calls at all
Might that even SIMD-vectorize? Would depend on not having any branch-expressions, of course (so no on-demand loading, etc.).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for getting out-of-topic a bit, here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had hoped that might be the case, but at least in a very simple toy example I tried it wasn't. Inlining is, of course, sometimes hard to predict/control.
I tried that way, way back and a lot of even quite involved Query.jl queries ended up being inlined into one tight loop. I'm not sure whether that is enough for the compiler to ignore access to columns that are not needed, though...
I think in Query.jl, that's currently hard-coded for SQLite, but a generic solution would be very cool.
Yes, totally agree. My current thinking is that it would be great if a source could implement a partial Query.jl backend. For example, say you have this code load("file.csv") |> @select({_.a, _.b}) |> DataFrame
(*), it would first go to a CSVFiles.jl Query.jl backend that realizes that this query implies that only column a
and b
are needed, and then skips all other columns at CSV parse time already. The same logic could apply to lots of other sources. But for this to work it would be really key that CSVFiles.jl doesn't have to implement a full Query.jl backend, but some lighter version of it. I've been mulling this whole question for a while, and it is slowly clearing up in my mind, but will probably still take a while.
Might that even SIMD-vectorize? Would depend on not having any branch-expressions, of course (so no on-demand loading, etc.).
Yeah, I think that could probably be made to work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds exciting - I look forward to see what you'll come up with. I know how it is, sometimes concepts just need some time to mature in ones mind ...
This is great! Nice work @piever! I'm definitely in favour of this change and dropping the DataFrames dependency. |
Yes, thanks @davidanthoff and once again to @piever for taking the lead on this! |
With regards to potential conflicts with the Nulls.jl-based architecture I think we can cross that bridge when we come to it. |
Yep, thanks @davidanthoff for the explanation, it was very helpful and at least to me not at all obvious. |
I've incorporated @davidanthoff feedback and used the method from queryverse/TableTraitsUtils.jl#2 to efficiently materialize a subset of columns from an iterable table. Will merge tomorrow if everybody's ok. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apart from the concern about _df
I think this looks great!
src/df.jl
Outdated
|
||
function _df(d, x::Expr) | ||
(x.head == :quote) && return :(StatPlots.select_column($d, $x)) | ||
function _df(d, x::Expr, syms, vars) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be _df!
seeing as it modifies the input?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could the names also be more informative? compute_all seems very broad (should it perhaps be extract_columns_from_table?) and _df a little sparse (parse_table_call or something like that)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good choice of names, I'll just replace table
to iterabletable
to be consistent with the TraitsUtils functions.
🎉 time for a new StatPlots release? Or do we want to play around with it a little first? |
I'd be in favor of releasing quickly, as I was planning to start making potentially breaking changes to |
I've made the release. |
This is work in progress to pass from a DataFrames based implementation of
@df
to a IterableTables base implementation. The following commands now give the desired plots (should work for everything that is supported by IterableTables):The implementation of the
@df
macro no longer requires a DataFrames dependency. It works in the following way: IterableTables provides agetiterator
which, from a table, creates a row based iterator which spits out every row as a named tuple. What I do here is to replace every symbol in the plot call with a new variable. Then the macro generates 2 commands: one to assign to the new variables the respective columns (using the functionStatPlots.compute_all
) and the other to do the plot:The advantage of doing all variables at once (and knowing exactly which they are) is twofold:
getiterator
has to be iterated only once, which is better especially in the case of CSV or remote databasesI still need to polish a bit and further optimize
StatPlots.compute_all
but I wanted to check if this is a direction we're happy to take.