Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

added given macro to plot dataframes #74

Merged
merged 9 commits into from Aug 30, 2017
Merged

added given macro to plot dataframes #74

merged 9 commits into from Aug 30, 2017

Conversation

piever
Copy link
Member

@piever piever commented Aug 24, 2017

A possible solution to #72 using metaprogramming.

Copy link
Member

@mkborregaard mkborregaard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, thanks for this, nice and succinct.

src/given.jl Outdated
try
convert(Array, col)
catch
error("Missing data of type $T is not supported")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A possible, more featured, alternative, would be to tally NAs in undefined columns and then remove those from all input columns. We don't need that for now though.

src/given.jl Outdated

convert_column(col::AbstractDataArray{String}) = convert(Array, col, "")
convert_column(col::AbstractDataArray{Symbol}) = convert(Array, col, Symbol())
convert_column(col::AbstractDataArray{Float64}) = convert(Array, col, NaN)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this could be convert_column(col::AbstractDataArray{<:Real}) = convert(Array, convert(DataArray{Float64},col), NaN) for consistency with other number types.

README.md Outdated
@@ -32,10 +32,12 @@ df = DataFrame(a = 1:10, b = 10*rand(10), c = 10 * rand(10))
plot(df, :a, [:b :c])
scatter(df, :a, :b, markersize = :(4 * log(:c + 0.1)))
```
If you find an operation not supported by DataFrames, please open an issue. An alternative approach to the `StatPlots` syntax is to use the [DataFramesMeta](https://github.com/JuliaStats/DataFramesMeta.jl) macro `@with`. Symbols not referring to DataFrame columns must be escaped by `^()` e.g.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should highlight this as the preferred approach, and note the current approach as a 'legacy' approach further down. No need to break the functionality but it is less powerful.

src/given.jl Outdated
@@ -0,0 +1,39 @@
"""
`@given d x`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

an alternative could be @use? Need to think about names.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding the name, I think that brevity is essential as this will be typed pretty often. I like @in to denote that you are working within a given DataFrame, but @use also makes sense.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or @df?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also one of my labmates and I came to @df independently, I think it's a winner.

src/given.jl Outdated


select_column(df, s) = haskey(df, s) ? convert_column(df[s]) : s

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

single-spaced

src/given.jl Outdated
end
end

convert_column(col::AbstractDataArray{String}) = convert(Array, col, "")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AbstractString

@@ -17,7 +17,9 @@ import Loess

export groupapply
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could the groupapply functions be rewritten to use this format instead?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, at the very least the macro can certainly translate:
@given df groupapply(func, x,....) to groupapply(func, df, x,....)
But potentially it could be done in a smarter way so that it would also add support a general table and not only DataFrames and allow for some data manipulation (i.e. group by sth other than a column, for example :SepalLength .> 3).
One thing that worries me is the order of the arguments. As of now there are several ways to call it (example after the arrow):

groupapplay(func, df, x) ~> groupapplay(:density, iris, :SepalLength)
groupapplay(func, df, x, y) ~> groupapplay(:locreg, iris, :SepalLength, :SepalWidth)

And 2 convenience methods for plotting y as a fct of x:

groupapplay(y, df, x) == groupapplay(:locreg, df, x, y)
groupapplay(df, x, y) == groupapplay(:locreg, df, x, y)

The rational behind the first convenience method is that, when called in the standard way, the first variable represents what's on the y axis. Without df in the function call, these two methods would be ambiguous and we'd need to decide which one to keep.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Scrap groupapplay(y, df, x) == groupapplay(:locreg, df, x, y) IMHO.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd be elegant to reinterpret groupapply in the light of given. Intuitively I'd think it could be simplified a lot, but I may be wrong.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed on both.
I also think groupapply can benefit from this (and also from the adjustments to the group keyword in #1033 ) but I need to think carefully about it. It probably deserves its own issue.

In terms of syntax, one idea would be to also allow:

@given df groupapply(...) plot(; kwargs....)

to mean the same as

plot(@given df groupapply(...); kwargs....)

But I guess we should open a separate issue to discuss changes to groupapply

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea with a separate issue.

@mkborregaard
Copy link
Member

@tkelman we're considering adding this PR or something similar that adds a macro (called @given, @in or @use, which implements a behaviour similar to @with for DataFrames (and possibly a future abstract table type), i.e. it looks in the args and replaces symbols with DataFrame columns, removes missing values etc. Would you accept that in a release of StatPlots or do you see clashes with other functionality in the ecosystem?
I should add that it appears to be the best way to support DataFrame plotting in this package.

@tkelman
Copy link

tkelman commented Aug 24, 2017

Worth maybe searching through if that's a common export by many other packages, but that seems fine? I'd recommend having people try it out on master for a while with various other packages and see if "both ... and StatPlots export ..." warnings are something a lot of people see. They can always switch from using to import but not everyone likes doing that.

@mkborregaard
Copy link
Member

Perfect, thanks a lot!

@piever
Copy link
Member Author

piever commented Aug 27, 2017

Addressed all feedback (I think) and renamed the macro to @df. I'd suggest maybe merging this soonish to see if there are any complaints and then move on with groupapply changes (it needs a few, I'd like to make it compatible with the new @df syntax as well as the new grouping in Plots)

Copy link
Member

@mkborregaard mkborregaard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, thanks for doing this! Let me just give it a spin tomorrow before we decide to merge?

@mkborregaard
Copy link
Member

Before merging we should update the entire readme (up until groupapply) to use the new syntax too - we could keep the old call commented out as an alternative.

@piever
Copy link
Member Author

piever commented Aug 27, 2017

Nice, thanks for doing this! Let me just give it a spin tomorrow before we decide to merge?

Sure, I can update the README in the meantime.

@piever
Copy link
Member Author

piever commented Aug 28, 2017

And another thing that deserves consideration is how to add the axis labels automatically in the macro framework. It's not completely trivial (and it was not trivial even with the old syntax: scatter(iris, :SepalLength) puts the label on the x axis when it would refer to the y axis).

To get the labels, now the user would have to type:
@df iris scatter(:SepalLength, :SepalWidth, xlabel = ^(:SepalLength), ylabel = ^(:SepalWidth))
which is really not ideal and should be automatised by the macro.
My suggestion is to add a kw argument to the plot call, called default_label where the macro would write all the relevant information (namely the list of args as it found it). For example this
@df iris scatter(:SepalLength, :SepalWidth)
would actually compile to:
scatter(iris[:SepalLength], iris[:SepalWidth], default_label = (:SepalLength, :SepalWidth))
and then plot would know that when there are 2 default_label one is for xlabel and one for ylabel unless they are specified by the user etcetera...
I'd rather do it with a separate kw because the translation is not obvious, as sometimes it refers to xlabel, ylabel and zlabel and sometimes it refers to label directly, such as in:

corrplot(M, label = ["x$i" for i=1:4])

But for this we'd first need to update Plots.jl so that if given a default_label = (args...), it knows how to use it.


_df(d, x) = x

function _df(d, x::Expr)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps here we could have some convenience functions here that are useful for plotting with DataFrames. Like allowing the symbol :nrow to be replaced by size(x,1).
Would it also be nice to be able to specify columns by number, like cols(1:4) and get reshape(names(d)[x],1,:)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's an interesting idea, especially for cols(1:4) (for let's say a corrplot). Two questions:

  • what's an example use of size(x,1) in a plot call?
  • should we go for something like: df gets replaced with the DataFrame? In that case maybe :nrow can be replaced by nrow(df) (which would work for free) and we'd get access to all the DataFrame functions (though, to be honest, I'm not sure how many would actually be useful)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • marker_z = 1:size(x,1)
  • No, I'm not sure that would be necessary or clean

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed on point 2. About point 1, there's the issue that marker_z = 1:(:nrow) is not the prettiest syntax, plus we create even more ambiguity with the symbols: what if a column is called :nrow? Maybe simply nrow and ncol are used (normal function names act on the DataFrame and symbols are for columns). Now cols(1:ncol) does the full corrplot.

Copy link
Member

@daschw daschw Aug 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these extra features compatible to the plan to support a variety of data table types?
If not: Do we even still pursue this plan?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My take - we must pursue that long-term, but it seems early to me to start second-guessing what the interface will be.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@piever you have a point on nrow and friends, let's keep that out of it. When you say "now cols(1:ncol) does" do you mean that you've already added the functionality but not pushed it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cols yes, nrow it depends (or at least it would require some thinking, depending on which plan we choose). I think the plan to support multiple data source has at least two possible implementations:

  • @df converts the first argument to a DataFrame (using IterableTables) and then proceeds as usual: anything that'd work on a DataFrame keeps working. It could be inefficient to convert the whole thing though (plotting a big DataFrame is rare, but I often use groupapply on pretty large data)
  • the only thing that is DataFrame specific, gets done with IterableTables: df[col] is changed to [getindex(s, col) for s in df] (I need to see how it'd work with missing data)

Getting plan 2 to work with groupapply may require some thinking, so I thought it's better to postpone this decision, until adjusting groupapply to the new syntax.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like 2.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When you say "now cols(1:ncol) does

It's more like wishful thinking and poor phrasing... I meant with the hypothetical implementations. Still, if we decide that nrow and ncol are left out for the time being, implementing cols is pretty easy, I should be able to add it and push today.

@mkborregaard
Copy link
Member

mkborregaard commented Aug 28, 2017

Why not just have the macro check if any of the kw xlabel, ylabel, zlabel or label (or aliases) appear in the call, and if they don't then put in the default labels?

To get the labels, now the user would have to type: @df iris scatter(:SepalLength, :SepalWidth, xlabel = ^(:SepalLength), ylabel = ^(:SepalWidth))

or @df iris scatter(:SepalLength, :SepalWidth, xlabel = "SepalLength", ylabel = "SepalWidth")

@piever
Copy link
Member Author

piever commented Aug 28, 2017

Why not just have the macro check if any of the kw xlabel, ylabel, zlabel or label (or aliases) appear in the call, and if they don't then put in the default labels?

I was afraid the presence of aliases/magic attributes could make this a nightmare, for example imagine the user gives:

plot(y, xaxis = ("my label", (0,10), 0:0.5:10, :log, :flip, font(20, "Courier")))

from the plots docs with magic attributes. The macro will never find out that xlabel is already assigned. I think it's best if the macro passes on the information about the symbols it replaced, as they may be used later on, in a kw with that purpose (I like default_labels, but maybe arg_names makes more sense) and Plots.jl will take care of using them.

@mkborregaard
Copy link
Member

mkborregaard commented Aug 28, 2017

Good point on the magic arguments but I'm not keen on adding an extra kw. I'd like to think about this more - perhaps we can somehow plug this into the recipe system.

@piever
Copy link
Member Author

piever commented Aug 28, 2017

Maybe I was too pessimistic. If both explicit argument and magic argument are given, only the magic argument is taken into account. Your solution is probably best then

plot(rand(10), xaxis = ("my label"), xguide = "my new label")

newplot-7

@piever
Copy link
Member Author

piever commented Aug 29, 2017

Actually it's even easier than that, as the rightmost keyword takes precedence, so it's enough to add the automatic keyword in the leftmost position to give it low priority.

@piever
Copy link
Member Author

piever commented Aug 29, 2017

I think it has all the feature it needs (if we agree to wait for :nrow and :ncol). It only has one bug on marginalhist but I think the bug is on marginalhist side, I've reported it here: #79. Feel free to play a bit with it and let me know if you have more improvements, example usage is in the README.

@mkborregaard
Copy link
Member

No, this is because my suggestion of adding xlabel to the plot call wasn't fully thought through. Doing it this way will override any labels set in a recipe with --> (which is commonly used for labels). That's unfortunate, but I think it will only be an issue with user recipes. One way to deal with this might be to only change it if the the plotting command is one of :scatter, :plot, and perhaps the :histogram types?

@mkborregaard
Copy link
Member

Also, if you have

df = DataFrame(a = randn(1000))
@df df plot(:a)

You'll get the marker wrongly on the x axis.

@piever
Copy link
Member Author

piever commented Aug 29, 2017

I'm also a bit unsure about adding xlabel (it feels hacky) and would overall prefer the extra keyword argnames, to allow more flexibility.

Still, in this particular case, I really think marginalhist is at fault, because even this command:

marginalhist(randn(1000), randn(1000), xlabel = :x , ylabel = :y)
gives what I believe is the wrong plot:

marginal

I think that this line was supposed to force both axis labels to be "", but only works on the x axis label.

In general, I can not think of many recipes that rely on the user not giving x and y labels.

As for the second point,

df = DataFrame(a = randn(1000))
@df df plot(:a)

gives the same as the old syntax:

plot(df, :a)

The big issue is that, if there is only one argument, it is really unclear where the label should go: here it should go on the y axis whereas on @df df density(:a) it should go on the x axis. I think it's hard to guess from the function call where it should go.

Possible solutions:

  • Implement the extra keyword argnames where to store all the argument names and then use runtime information to figure out the best label placement
  • recommend @df plot(1:nrow, :a) instead

@mkborregaard
Copy link
Member

I really think marginalhist is at fault

Sure, there's also the problem that guide and label are aliases, but both are first-class keywords. IMHO we should make guide just an alias of label, that would necessitate shifting guide to label in all recipes.

would overall prefer the extra keyword argnames

That's cleaner here , but how would it be implemented inside Plots? We already have the recipes system for overloading keywords. If you can come up with a really clean way of implementing default_names then maybe that's the way to go. But it's tricky, e.g. how does it handle plot!?

it is really unclear where the label should go

By convention, it should go to the y axis, except for the case of the histogram-like methods (which are unique in plotting y on the x axis).

I'm also a bit unsure about adding xlabel (it feels hacky)

It does. The nice thing about is that we don't have to embed the StatPlots functionality deep within Plots - everything could be self-contained within this macro. I think only doing it for a call of plot or scatter should be safe, and feature-wise comparable to what we had.

recommend @df plot(1:nrow, :a) instead

No. That would be against the core philosophy of Plots.

In general, I can not think of many recipes that rely on the user not giving x and y labels.

Most multi-panel ones. And e.g. andrewsplot in PlotRecipes.

I am really sorry for disagreeing on a number of points here, considering that you've done such great work on this. One possibility would be to remove the automatic labels for now, and merge the rest as is? And we can continue working on the labels.

@piever
Copy link
Member Author

piever commented Aug 29, 2017

I am really sorry for disagreeing on a number of points here

You are raising several good points, I'm also not fully convinced by this design. I can probably keep the helper function to extract column names as it will be useful in the future, take out the labelling part and merge as is.

Of all the concerns, plot! is probably the most catastrophic: one certainly doesn't want to choose the labels carefully on the first plot call and then get all of them overridden accidentally.

I think that the default_names approach may be the only way forward for the automated labelling. It does require a bit of work on the Plots side, so it's better to postpone it. My idea is that here the axis guide default would not be "" but rather it would look into the default_names and would choose it from there. This default value never overrides anything else, so it should be pretty safe. Maybe we should clean and merge this and then I can try to have branches here and in Plots in parallel to see if this idea works.

@mkborregaard
Copy link
Member

Good idea 👍

@piever
Copy link
Member Author

piever commented Aug 30, 2017

I've taken out the controversial labelling part (while keeping the unexported auxiliary function to collect the default_names from a plot call). I'm planning to merge later today and then proceed to:

  • See if I can get a default_names kw to make sense in Plots.jl using _axis_defaults[:guide] (I think it's especially important for plots like a corrplot where it may be hard to remember what each column is)
  • Figure out how to update groupapply (in a separate PR/issue)

Regarding 2, I think the consistent way to go is to write symbols in the groupapply call which would be then translated to arrays by the @df macro and then the get_groupederror function would take care of the splitting, averaging etcetera. I have a slightly off-topic question: at least at a first analysis, it would seem that the data manipulations inside get_groupederror are both easier to express and faster to execute using IndexedTables rather than DataFrames (these manipulations are invisible to the user, it is a matter of internally handling the data). Should that be the case, would it be possible to add an IndexedTables dependency or is it outside the scope of this package?

@mkborregaard
Copy link
Member

Sounds good. This seems good to merge when you feel it's ready.

WRT to the IndexedTables dep, I think in the long run we should support whatever infrastructure is needed to support tables, but it seems a little strange to support DataFrames outwardly and then use indexedTables on the inside.

Do you think we could use the DataStreams.Sink type to accept input from any kind of table that implements the DataStreams.Source interface?

@piever
Copy link
Member Author

piever commented Aug 30, 2017

WRT to the IndexedTables dep, I think in the long run we should support whatever infrastructure is needed to support tables, but it seems a little strange to support DataFrames outwardly and then use indexedTables on the inside.

It does, but the idea is that (modulo missing data issues which need to be clarified) with the tiny IterableTables dependency it's possible to extract columns of more or less anything, without adding a dependency on it:

The package currently has support for the following data sources: DataFrames, DataStreams (including CSV, Feather, SQLite, ODBC), DataTables, IndexedTables, TimeSeries, Temporal, TypedTables, JuliaDB, DifferentialEquations (any DESolution), CSVFiles, ExcelFiles, FeatherFiles, StatFiles and any iterator who produces elements of type NamedTuple

julia> using IterableTables

julia> [i.Species for i in getiterator(iris)]
150-element Array{String,1}:
 "setosa"   
 "setosa"   
 "setosa"   
 "setosa"   
 "setosa"   
 "setosa"   
 "setosa"   
 "setosa"   
 "setosa"   
 "setosa"   
 "setosa"   
           
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"

And the same code would work with more or less anything instead of iris (IndexedTable, CSV/Excel file, SQLite database, I'm investigating on Discourse if one can also do it with MySQL databases). We would actually not even need to depend on DataFrames.

What is different in the case of groupapply is that, once we collect the arrays with the data, there is a fair bit of data shuffling left to do the plot. My intuition is that the native DataFrames API is not the most efficient at doing that, in that IndexedTables can use a very smart indexing system to optimize this data shuffling.

@mkborregaard
Copy link
Member

mkborregaard commented Aug 30, 2017

If you think we could replace the DataFrames dep with an IterableTables dep and still support DataFrame inputs I'm all for it. But could we still support missing values? That's a real issue.

@piever
Copy link
Member Author

piever commented Aug 30, 2017

I'm pretty sure you can (modulo a bug in IterableTables with PooledDataArrays), I'll make a PR as soon as I have a completely fleshed out solution.

The only real concern is what happens to the old DataFrames syntax if we drop the DataFrames dependency.

@mkborregaard
Copy link
Member

We could keep it for a transition period until the interface crystallizes more.

@piever piever merged commit 491ca34 into JuliaPlots:master Aug 30, 2017
@mkborregaard
Copy link
Member

🎉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants