added given macro to plot dataframes #74

piever · 2017-08-24T11:38:47Z

A possible solution to #72 using metaprogramming.

mkborregaard

Great, thanks for this, nice and succinct.

mkborregaard · 2017-08-24T12:21:12Z

src/given.jl

+    try
+        convert(Array, col)
+    catch
+        error("Missing data of type $T is not supported")


A possible, more featured, alternative, would be to tally NAs in undefined columns and then remove those from all input columns. We don't need that for now though.

mkborregaard · 2017-08-24T12:25:42Z

src/given.jl

+
+convert_column(col::AbstractDataArray{String})  = convert(Array, col, "")
+convert_column(col::AbstractDataArray{Symbol})  = convert(Array, col, Symbol())
+convert_column(col::AbstractDataArray{Float64}) = convert(Array, col, NaN)


I think this could be convert_column(col::AbstractDataArray{<:Real}) = convert(Array, convert(DataArray{Float64},col), NaN) for consistency with other number types.

mkborregaard · 2017-08-24T12:28:10Z

README.md

@@ -32,10 +32,12 @@ df = DataFrame(a = 1:10, b = 10*rand(10), c = 10 * rand(10))
 plot(df, :a, [:b :c])
 scatter(df, :a, :b, markersize = :(4 * log(:c + 0.1)))
 ```
-If you find an operation not supported by DataFrames, please open an issue. An alternative approach to the `StatPlots` syntax is to use the [DataFramesMeta](https://github.com/JuliaStats/DataFramesMeta.jl) macro `@with`. Symbols not referring to DataFrame columns must be escaped by `^()` e.g.


I think we should highlight this as the preferred approach, and note the current approach as a 'legacy' approach further down. No need to break the functionality but it is less powerful.

mkborregaard · 2017-08-24T12:28:53Z

src/given.jl

@@ -0,0 +1,39 @@
+"""
+    `@given d x`


an alternative could be @use? Need to think about names.

Regarding the name, I think that brevity is essential as this will be typed pretty often. I like @in to denote that you are working within a given DataFrame, but @use also makes sense.

Also one of my labmates and I came to @df independently, I think it's a winner.

mkborregaard · 2017-08-24T12:29:08Z

src/given.jl

+
+
+select_column(df, s) = haskey(df, s) ? convert_column(df[s]) : s
+


single-spaced

mkborregaard · 2017-08-24T12:29:19Z

src/given.jl

+    end
+end
+
+convert_column(col::AbstractDataArray{String})  = convert(Array, col, "")


AbstractString

mkborregaard · 2017-08-24T12:30:02Z

src/StatPlots.jl

@@ -17,7 +17,9 @@ import Loess

 export groupapply


could the groupapply functions be rewritten to use this format instead?

Well, at the very least the macro can certainly translate:
@given df groupapply(func, x,....) to groupapply(func, df, x,....)
But potentially it could be done in a smarter way so that it would also add support a general table and not only DataFrames and allow for some data manipulation (i.e. group by sth other than a column, for example :SepalLength .> 3).
One thing that worries me is the order of the arguments. As of now there are several ways to call it (example after the arrow):

groupapplay(func, df, x) ~> groupapplay(:density, iris, :SepalLength) groupapplay(func, df, x, y) ~> groupapplay(:locreg, iris, :SepalLength, :SepalWidth)

And 2 convenience methods for plotting y as a fct of x:

groupapplay(y, df, x) == groupapplay(:locreg, df, x, y) groupapplay(df, x, y) == groupapplay(:locreg, df, x, y)

The rational behind the first convenience method is that, when called in the standard way, the first variable represents what's on the y axis. Without df in the function call, these two methods would be ambiguous and we'd need to decide which one to keep.

Scrap groupapplay(y, df, x) == groupapplay(:locreg, df, x, y) IMHO.

It'd be elegant to reinterpret groupapply in the light of given. Intuitively I'd think it could be simplified a lot, but I may be wrong.

Agreed on both.
I also think groupapply can benefit from this (and also from the adjustments to the group keyword in #1033 ) but I need to think carefully about it. It probably deserves its own issue.

In terms of syntax, one idea would be to also allow:

@given df groupapply(...) plot(; kwargs....)

to mean the same as

plot(@given df groupapply(...); kwargs....)

But I guess we should open a separate issue to discuss changes to groupapply

Good idea with a separate issue.

mkborregaard · 2017-08-24T19:23:04Z

@tkelman we're considering adding this PR or something similar that adds a macro (called @given, @in or @use, which implements a behaviour similar to @with for DataFrames (and possibly a future abstract table type), i.e. it looks in the args and replaces symbols with DataFrame columns, removes missing values etc. Would you accept that in a release of StatPlots or do you see clashes with other functionality in the ecosystem?
I should add that it appears to be the best way to support DataFrame plotting in this package.

tkelman · 2017-08-24T20:43:53Z

Worth maybe searching through if that's a common export by many other packages, but that seems fine? I'd recommend having people try it out on master for a while with various other packages and see if "both ... and StatPlots export ..." warnings are something a lot of people see. They can always switch from using to import but not everyone likes doing that.

mkborregaard · 2017-08-24T20:54:02Z

Perfect, thanks a lot!

piever · 2017-08-27T18:21:29Z

Addressed all feedback (I think) and renamed the macro to @df. I'd suggest maybe merging this soonish to see if there are any complaints and then move on with groupapply changes (it needs a few, I'd like to make it compatible with the new @df syntax as well as the new grouping in Plots)

mkborregaard

Nice, thanks for doing this! Let me just give it a spin tomorrow before we decide to merge?

mkborregaard · 2017-08-27T19:09:55Z

Before merging we should update the entire readme (up until groupapply) to use the new syntax too - we could keep the old call commented out as an alternative.

piever · 2017-08-27T19:38:47Z

Nice, thanks for doing this! Let me just give it a spin tomorrow before we decide to merge?

Sure, I can update the README in the meantime.

piever · 2017-08-28T09:35:40Z

And another thing that deserves consideration is how to add the axis labels automatically in the macro framework. It's not completely trivial (and it was not trivial even with the old syntax: scatter(iris, :SepalLength) puts the label on the x axis when it would refer to the y axis).

To get the labels, now the user would have to type:
@df iris scatter(:SepalLength, :SepalWidth, xlabel = ^(:SepalLength), ylabel = ^(:SepalWidth))
which is really not ideal and should be automatised by the macro.
My suggestion is to add a kw argument to the plot call, called default_label where the macro would write all the relevant information (namely the list of args as it found it). For example this
@df iris scatter(:SepalLength, :SepalWidth)
would actually compile to:
scatter(iris[:SepalLength], iris[:SepalWidth], default_label = (:SepalLength, :SepalWidth))
and then plot would know that when there are 2 default_label one is for xlabel and one for ylabel unless they are specified by the user etcetera...
I'd rather do it with a separate kw because the translation is not obvious, as sometimes it refers to xlabel, ylabel and zlabel and sometimes it refers to label directly, such as in:

corrplot(M, label = ["x$i" for i=1:4])

But for this we'd first need to update Plots.jl so that if given a default_label = (args...), it knows how to use it.

mkborregaard · 2017-08-28T09:38:27Z

src/df.jl

+
+_df(d, x) = x
+
+function _df(d, x::Expr)


Perhaps here we could have some convenience functions here that are useful for plotting with DataFrames. Like allowing the symbol :nrow to be replaced by size(x,1).
Would it also be nice to be able to specify columns by number, like cols(1:4) and get reshape(names(d)[x],1,:)?

That's an interesting idea, especially for cols(1:4) (for let's say a corrplot). Two questions:

what's an example use of size(x,1) in a plot call?

should we go for something like: df gets replaced with the DataFrame? In that case maybe :nrow can be replaced by nrow(df) (which would work for free) and we'd get access to all the DataFrame functions (though, to be honest, I'm not sure how many would actually be useful)

marker_z = 1:size(x,1)

No, I'm not sure that would be necessary or clean

Agreed on point 2. About point 1, there's the issue that marker_z = 1:(:nrow) is not the prettiest syntax, plus we create even more ambiguity with the symbols: what if a column is called :nrow? Maybe simply nrow and ncol are used (normal function names act on the DataFrame and symbols are for columns). Now cols(1:ncol) does the full corrplot.

Are these extra features compatible to the plan to support a variety of data table types?
If not: Do we even still pursue this plan?

My take - we must pursue that long-term, but it seems early to me to start second-guessing what the interface will be.

@piever you have a point on nrow and friends, let's keep that out of it. When you say "now cols(1:ncol) does" do you mean that you've already added the functionality but not pushed it?

cols yes, nrow it depends (or at least it would require some thinking, depending on which plan we choose). I think the plan to support multiple data source has at least two possible implementations:

@df converts the first argument to a DataFrame (using IterableTables) and then proceeds as usual: anything that'd work on a DataFrame keeps working. It could be inefficient to convert the whole thing though (plotting a big DataFrame is rare, but I often use groupapply on pretty large data)

the only thing that is DataFrame specific, gets done with IterableTables: df[col] is changed to [getindex(s, col) for s in df] (I need to see how it'd work with missing data)

Getting plan 2 to work with groupapply may require some thinking, so I thought it's better to postpone this decision, until adjusting groupapply to the new syntax.

When you say "now cols(1:ncol) does

It's more like wishful thinking and poor phrasing... I meant with the hypothetical implementations. Still, if we decide that nrow and ncol are left out for the time being, implementing cols is pretty easy, I should be able to add it and push today.

mkborregaard · 2017-08-28T09:42:30Z

Why not just have the macro check if any of the kw xlabel, ylabel, zlabel or label (or aliases) appear in the call, and if they don't then put in the default labels?

To get the labels, now the user would have to type: @df iris scatter(:SepalLength, :SepalWidth, xlabel = ^(:SepalLength), ylabel = ^(:SepalWidth))

or @df iris scatter(:SepalLength, :SepalWidth, xlabel = "SepalLength", ylabel = "SepalWidth")

piever · 2017-08-28T10:21:38Z

Why not just have the macro check if any of the kw xlabel, ylabel, zlabel or label (or aliases) appear in the call, and if they don't then put in the default labels?

I was afraid the presence of aliases/magic attributes could make this a nightmare, for example imagine the user gives:

plot(y, xaxis = ("my label", (0,10), 0:0.5:10, :log, :flip, font(20, "Courier")))

from the plots docs with magic attributes. The macro will never find out that xlabel is already assigned. I think it's best if the macro passes on the information about the symbols it replaced, as they may be used later on, in a kw with that purpose (I like default_labels, but maybe arg_names makes more sense) and Plots.jl will take care of using them.

mkborregaard · 2017-08-28T11:53:35Z

Good point on the magic arguments but I'm not keen on adding an extra kw. I'd like to think about this more - perhaps we can somehow plug this into the recipe system.

piever · 2017-08-28T13:06:22Z

Maybe I was too pessimistic. If both explicit argument and magic argument are given, only the magic argument is taken into account. Your solution is probably best then

plot(rand(10), xaxis = ("my label"), xguide = "my new label")

piever · 2017-08-29T15:23:57Z

Actually it's even easier than that, as the rightmost keyword takes precedence, so it's enough to add the automatic keyword in the leftmost position to give it low priority.

piever · 2017-08-29T16:09:34Z

I think it has all the feature it needs (if we agree to wait for :nrow and :ncol). It only has one bug on marginalhist but I think the bug is on marginalhist side, I've reported it here: #79. Feel free to play a bit with it and let me know if you have more improvements, example usage is in the README.

mkborregaard · 2017-08-29T17:32:15Z

No, this is because my suggestion of adding xlabel to the plot call wasn't fully thought through. Doing it this way will override any labels set in a recipe with --> (which is commonly used for labels). That's unfortunate, but I think it will only be an issue with user recipes. One way to deal with this might be to only change it if the the plotting command is one of :scatter, :plot, and perhaps the :histogram types?

mkborregaard · 2017-08-29T17:33:53Z

Also, if you have

df = DataFrame(a = randn(1000))
@df df plot(:a)

You'll get the marker wrongly on the x axis.

piever · 2017-08-29T18:09:25Z

I'm also a bit unsure about adding xlabel (it feels hacky) and would overall prefer the extra keyword argnames, to allow more flexibility.

Still, in this particular case, I really think marginalhist is at fault, because even this command:

marginalhist(randn(1000), randn(1000), xlabel = :x , ylabel = :y)
gives what I believe is the wrong plot:

I think that this line was supposed to force both axis labels to be "", but only works on the x axis label.

In general, I can not think of many recipes that rely on the user not giving x and y labels.

As for the second point,

df = DataFrame(a = randn(1000))
@df df plot(:a)

gives the same as the old syntax:

plot(df, :a)

The big issue is that, if there is only one argument, it is really unclear where the label should go: here it should go on the y axis whereas on @df df density(:a) it should go on the x axis. I think it's hard to guess from the function call where it should go.

Possible solutions:

Implement the extra keyword argnames where to store all the argument names and then use runtime information to figure out the best label placement
recommend @df plot(1:nrow, :a) instead

mkborregaard · 2017-08-29T21:57:29Z

I really think marginalhist is at fault

Sure, there's also the problem that guide and label are aliases, but both are first-class keywords. IMHO we should make guide just an alias of label, that would necessitate shifting guide to label in all recipes.

would overall prefer the extra keyword argnames

That's cleaner here , but how would it be implemented inside Plots? We already have the recipes system for overloading keywords. If you can come up with a really clean way of implementing default_names then maybe that's the way to go. But it's tricky, e.g. how does it handle plot!?

it is really unclear where the label should go

By convention, it should go to the y axis, except for the case of the histogram-like methods (which are unique in plotting y on the x axis).

I'm also a bit unsure about adding xlabel (it feels hacky)

It does. The nice thing about is that we don't have to embed the StatPlots functionality deep within Plots - everything could be self-contained within this macro. I think only doing it for a call of plot or scatter should be safe, and feature-wise comparable to what we had.

recommend @df plot(1:nrow, :a) instead

No. That would be against the core philosophy of Plots.

In general, I can not think of many recipes that rely on the user not giving x and y labels.

Most multi-panel ones. And e.g. andrewsplot in PlotRecipes.

I am really sorry for disagreeing on a number of points here, considering that you've done such great work on this. One possibility would be to remove the automatic labels for now, and merge the rest as is? And we can continue working on the labels.

piever · 2017-08-29T23:30:29Z

I am really sorry for disagreeing on a number of points here

You are raising several good points, I'm also not fully convinced by this design. I can probably keep the helper function to extract column names as it will be useful in the future, take out the labelling part and merge as is.

Of all the concerns, plot! is probably the most catastrophic: one certainly doesn't want to choose the labels carefully on the first plot call and then get all of them overridden accidentally.

I think that the default_names approach may be the only way forward for the automated labelling. It does require a bit of work on the Plots side, so it's better to postpone it. My idea is that here the axis guide default would not be "" but rather it would look into the default_names and would choose it from there. This default value never overrides anything else, so it should be pretty safe. Maybe we should clean and merge this and then I can try to have branches here and in Plots in parallel to see if this idea works.

mkborregaard · 2017-08-30T04:52:54Z

Good idea 👍

piever · 2017-08-30T11:47:04Z

I've taken out the controversial labelling part (while keeping the unexported auxiliary function to collect the default_names from a plot call). I'm planning to merge later today and then proceed to:

See if I can get a default_names kw to make sense in Plots.jl using _axis_defaults[:guide] (I think it's especially important for plots like a corrplot where it may be hard to remember what each column is)
Figure out how to update groupapply (in a separate PR/issue)

Regarding 2, I think the consistent way to go is to write symbols in the groupapply call which would be then translated to arrays by the @df macro and then the get_groupederror function would take care of the splitting, averaging etcetera. I have a slightly off-topic question: at least at a first analysis, it would seem that the data manipulations inside get_groupederror are both easier to express and faster to execute using IndexedTables rather than DataFrames (these manipulations are invisible to the user, it is a matter of internally handling the data). Should that be the case, would it be possible to add an IndexedTables dependency or is it outside the scope of this package?

mkborregaard · 2017-08-30T12:17:56Z

Sounds good. This seems good to merge when you feel it's ready.

WRT to the IndexedTables dep, I think in the long run we should support whatever infrastructure is needed to support tables, but it seems a little strange to support DataFrames outwardly and then use indexedTables on the inside.

Do you think we could use the DataStreams.Sink type to accept input from any kind of table that implements the DataStreams.Source interface?

piever · 2017-08-30T12:35:11Z

WRT to the IndexedTables dep, I think in the long run we should support whatever infrastructure is needed to support tables, but it seems a little strange to support DataFrames outwardly and then use indexedTables on the inside.

It does, but the idea is that (modulo missing data issues which need to be clarified) with the tiny IterableTables dependency it's possible to extract columns of more or less anything, without adding a dependency on it:

The package currently has support for the following data sources: DataFrames, DataStreams (including CSV, Feather, SQLite, ODBC), DataTables, IndexedTables, TimeSeries, Temporal, TypedTables, JuliaDB, DifferentialEquations (any DESolution), CSVFiles, ExcelFiles, FeatherFiles, StatFiles and any iterator who produces elements of type NamedTuple

julia> using IterableTables

julia> [i.Species for i in getiterator(iris)]
150-element Array{String,1}:
 "setosa"   
 "setosa"   
 "setosa"   
 "setosa"   
 "setosa"   
 "setosa"   
 "setosa"   
 "setosa"   
 "setosa"   
 "setosa"   
 "setosa"   
 ⋮          
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"

And the same code would work with more or less anything instead of iris (IndexedTable, CSV/Excel file, SQLite database, I'm investigating on Discourse if one can also do it with MySQL databases). We would actually not even need to depend on DataFrames.

What is different in the case of groupapply is that, once we collect the arrays with the data, there is a fair bit of data shuffling left to do the plot. My intuition is that the native DataFrames API is not the most efficient at doing that, in that IndexedTables can use a very smart indexing system to optimize this data shuffling.

mkborregaard · 2017-08-30T12:52:56Z

If you think we could replace the DataFrames dep with an IterableTables dep and still support DataFrame inputs I'm all for it. But could we still support missing values? That's a real issue.

piever · 2017-08-30T13:28:44Z

I'm pretty sure you can (modulo a bug in IterableTables with PooledDataArrays), I'll make a PR as soon as I have a completely fleshed out solution.

The only real concern is what happens to the old DataFrames syntax if we drop the DataFrames dependency.

mkborregaard · 2017-08-30T13:44:14Z

We could keep it for a transition period until the interface crystallizes more.

mkborregaard · 2017-08-30T15:01:03Z

🎉

added given macro to plot dataframes

46e5768

mkborregaard requested changes Aug 24, 2017

View reviewed changes

rename to df

5bdd2a2

mkborregaard approved these changes Aug 27, 2017

View reviewed changes

mkborregaard reviewed Aug 28, 2017

View reviewed changes

Pietro Vertechi added 2 commits August 29, 2017 15:52

added cols updated readme

a63fcf2

set labels

27d8347

Pietro Vertechi added 3 commits August 29, 2017 16:25

wip add labels

ddd792b

code cleanup

7e03043

updated README

d13dbbe

removed labelling

973dc72

fix cols

0f0ed37

piever merged commit 491ca34 into JuliaPlots:master Aug 30, 2017

piever mentioned this pull request Aug 30, 2017

CorrPlot does not work on DataFrames #59

Closed

piever deleted the given branch August 30, 2017 15:27

This was referenced Sep 5, 2017

A Macro for automatic labelling of plots JuliaPlots/Plots.jl#1072

Open

Discussion: new groupapply syntax #88

Closed

piever mentioned this pull request Jan 5, 2018

Feature request: default names to axis MakieOrg/Makie.jl#41

Closed



		select_column(df, s) = haskey(df, s) ? convert_column(df[s]) : s

added given macro to plot dataframes #74

added given macro to plot dataframes #74

Conversation

piever commented Aug 24, 2017

mkborregaard left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mkborregaard commented Aug 24, 2017

tkelman commented Aug 24, 2017

mkborregaard commented Aug 24, 2017

piever commented Aug 27, 2017

mkborregaard left a comment

Choose a reason for hiding this comment

mkborregaard commented Aug 27, 2017

piever commented Aug 27, 2017

piever commented Aug 28, 2017 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

daschw Aug 28, 2017 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mkborregaard commented Aug 28, 2017 • edited

piever commented Aug 28, 2017

mkborregaard commented Aug 28, 2017 • edited

piever commented Aug 28, 2017

piever commented Aug 29, 2017

piever commented Aug 29, 2017

mkborregaard commented Aug 29, 2017

mkborregaard commented Aug 29, 2017

piever commented Aug 29, 2017

mkborregaard commented Aug 29, 2017

piever commented Aug 29, 2017

mkborregaard commented Aug 30, 2017

piever commented Aug 30, 2017

mkborregaard commented Aug 30, 2017

piever commented Aug 30, 2017 • edited

mkborregaard commented Aug 30, 2017 • edited

piever commented Aug 30, 2017 • edited

mkborregaard commented Aug 30, 2017

mkborregaard commented Aug 30, 2017

piever commented Aug 28, 2017 •

edited

daschw Aug 28, 2017 •

edited

mkborregaard commented Aug 28, 2017 •

edited

mkborregaard commented Aug 28, 2017 •

edited

piever commented Aug 30, 2017 •

edited

mkborregaard commented Aug 30, 2017 •

edited

piever commented Aug 30, 2017 •

edited