-
-
Notifications
You must be signed in to change notification settings - Fork 87
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
added given macro to plot dataframes #74
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great, thanks for this, nice and succinct.
src/given.jl
Outdated
try | ||
convert(Array, col) | ||
catch | ||
error("Missing data of type $T is not supported") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A possible, more featured, alternative, would be to tally NAs in undefined columns and then remove those from all input columns. We don't need that for now though.
src/given.jl
Outdated
|
||
convert_column(col::AbstractDataArray{String}) = convert(Array, col, "") | ||
convert_column(col::AbstractDataArray{Symbol}) = convert(Array, col, Symbol()) | ||
convert_column(col::AbstractDataArray{Float64}) = convert(Array, col, NaN) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this could be convert_column(col::AbstractDataArray{<:Real}) = convert(Array, convert(DataArray{Float64},col), NaN)
for consistency with other number types.
README.md
Outdated
@@ -32,10 +32,12 @@ df = DataFrame(a = 1:10, b = 10*rand(10), c = 10 * rand(10)) | |||
plot(df, :a, [:b :c]) | |||
scatter(df, :a, :b, markersize = :(4 * log(:c + 0.1))) | |||
``` | |||
If you find an operation not supported by DataFrames, please open an issue. An alternative approach to the `StatPlots` syntax is to use the [DataFramesMeta](https://github.com/JuliaStats/DataFramesMeta.jl) macro `@with`. Symbols not referring to DataFrame columns must be escaped by `^()` e.g. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should highlight this as the preferred approach, and note the current approach as a 'legacy' approach further down. No need to break the functionality but it is less powerful.
src/given.jl
Outdated
@@ -0,0 +1,39 @@ | |||
""" | |||
`@given d x` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
an alternative could be @use
? Need to think about names.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding the name, I think that brevity is essential as this will be typed pretty often. I like @in
to denote that you are working within a given DataFrame, but @use
also makes sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or @df
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also one of my labmates and I came to @df
independently, I think it's a winner.
src/given.jl
Outdated
|
||
|
||
select_column(df, s) = haskey(df, s) ? convert_column(df[s]) : s | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
single-spaced
src/given.jl
Outdated
end | ||
end | ||
|
||
convert_column(col::AbstractDataArray{String}) = convert(Array, col, "") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AbstractString
@@ -17,7 +17,9 @@ import Loess | |||
|
|||
export groupapply |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could the groupapply functions be rewritten to use this format instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, at the very least the macro can certainly translate:
@given df groupapply(func, x,....)
to groupapply(func, df, x,....)
But potentially it could be done in a smarter way so that it would also add support a general table and not only DataFrames and allow for some data manipulation (i.e. group by sth other than a column, for example :SepalLength .> 3
).
One thing that worries me is the order of the arguments. As of now there are several ways to call it (example after the arrow):
groupapplay(func, df, x) ~> groupapplay(:density, iris, :SepalLength)
groupapplay(func, df, x, y) ~> groupapplay(:locreg, iris, :SepalLength, :SepalWidth)
And 2 convenience methods for plotting y
as a fct of x
:
groupapplay(y, df, x) == groupapplay(:locreg, df, x, y)
groupapplay(df, x, y) == groupapplay(:locreg, df, x, y)
The rational behind the first convenience method is that, when called in the standard way, the first variable represents what's on the y
axis. Without df
in the function call, these two methods would be ambiguous and we'd need to decide which one to keep.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Scrap groupapplay(y, df, x) == groupapplay(:locreg, df, x, y)
IMHO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It'd be elegant to reinterpret groupapply in the light of given
. Intuitively I'd think it could be simplified a lot, but I may be wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed on both.
I also think groupapply
can benefit from this (and also from the adjustments to the group
keyword in #1033 ) but I need to think carefully about it. It probably deserves its own issue.
In terms of syntax, one idea would be to also allow:
@given df groupapply(...) plot(; kwargs....)
to mean the same as
plot(@given df groupapply(...); kwargs....)
But I guess we should open a separate issue to discuss changes to groupapply
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea with a separate issue.
@tkelman we're considering adding this PR or something similar that adds a macro (called |
Worth maybe searching through if that's a common export by many other packages, but that seems fine? I'd recommend having people try it out on master for a while with various other packages and see if "both ... and StatPlots export ..." warnings are something a lot of people see. They can always switch from |
Perfect, thanks a lot! |
Addressed all feedback (I think) and renamed the macro to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, thanks for doing this! Let me just give it a spin tomorrow before we decide to merge?
Before merging we should update the entire readme (up until groupapply) to use the new syntax too - we could keep the old call commented out as an alternative. |
Sure, I can update the README in the meantime. |
And another thing that deserves consideration is how to add the axis labels automatically in the macro framework. It's not completely trivial (and it was not trivial even with the old syntax: To get the labels, now the user would have to type:
But for this we'd first need to update Plots.jl so that if given a |
|
||
_df(d, x) = x | ||
|
||
function _df(d, x::Expr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps here we could have some convenience functions here that are useful for plotting with DataFrames. Like allowing the symbol :nrow
to be replaced by size(x,1)
.
Would it also be nice to be able to specify columns by number, like cols(1:4)
and get reshape(names(d)[x],1,:)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's an interesting idea, especially for cols(1:4)
(for let's say a corrplot
). Two questions:
- what's an example use of
size(x,1)
in a plot call? - should we go for something like:
df
gets replaced with theDataFrame
? In that case maybe:nrow
can be replaced bynrow(df)
(which would work for free) and we'd get access to all theDataFrame
functions (though, to be honest, I'm not sure how many would actually be useful)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
marker_z = 1:size(x,1)
- No, I'm not sure that would be necessary or clean
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed on point 2. About point 1, there's the issue that marker_z = 1:(:nrow)
is not the prettiest syntax, plus we create even more ambiguity with the symbols: what if a column is called :nrow
? Maybe simply nrow
and ncol
are used (normal function names act on the DataFrame and symbols are for columns). Now cols(1:ncol)
does the full corrplot
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these extra features compatible to the plan to support a variety of data table types?
If not: Do we even still pursue this plan?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My take - we must pursue that long-term, but it seems early to me to start second-guessing what the interface will be.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@piever you have a point on nrow and friends, let's keep that out of it. When you say "now cols(1:ncol) does" do you mean that you've already added the functionality but not pushed it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cols
yes, nrow
it depends (or at least it would require some thinking, depending on which plan we choose). I think the plan to support multiple data source has at least two possible implementations:
@df
converts the first argument to a DataFrame (using IterableTables) and then proceeds as usual: anything that'd work on a DataFrame keeps working. It could be inefficient to convert the whole thing though (plotting a big DataFrame is rare, but I often usegroupapply
on pretty large data)- the only thing that is DataFrame specific, gets done with IterableTables:
df[col]
is changed to[getindex(s, col) for s in df]
(I need to see how it'd work with missing data)
Getting plan 2 to work with groupapply
may require some thinking, so I thought it's better to postpone this decision, until adjusting groupapply
to the new syntax.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like 2.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When you say "now cols(1:ncol) does
It's more like wishful thinking and poor phrasing... I meant with the hypothetical implementations. Still, if we decide that nrow
and ncol
are left out for the time being, implementing cols
is pretty easy, I should be able to add it and push today.
Why not just have the macro check if any of the kw
or |
I was afraid the presence of aliases/magic attributes could make this a nightmare, for example imagine the user gives:
from the plots docs with magic attributes. The macro will never find out that |
Good point on the magic arguments but I'm not keen on adding an extra kw. I'd like to think about this more - perhaps we can somehow plug this into the recipe system. |
Actually it's even easier than that, as the rightmost keyword takes precedence, so it's enough to add the automatic keyword in the leftmost position to give it low priority. |
I think it has all the feature it needs (if we agree to wait for |
No, this is because my suggestion of adding xlabel to the plot call wasn't fully thought through. Doing it this way will override any labels set in a recipe with |
Also, if you have df = DataFrame(a = randn(1000))
@df df plot(:a) You'll get the marker wrongly on the x axis. |
I'm also a bit unsure about adding Still, in this particular case, I really think
I think that this line was supposed to force both axis labels to be In general, I can not think of many recipes that rely on the user not giving x and y labels. As for the second point, df = DataFrame(a = randn(1000))
@df df plot(:a) gives the same as the old syntax:
The big issue is that, if there is only one argument, it is really unclear where the label should go: here it should go on the y axis whereas on Possible solutions:
|
Sure, there's also the problem that
That's cleaner here , but how would it be implemented inside Plots? We already have the recipes system for overloading keywords. If you can come up with a really clean way of implementing
By convention, it should go to the y axis, except for the case of the histogram-like methods (which are unique in plotting y on the x axis).
It does. The nice thing about is that we don't have to embed the StatPlots functionality deep within Plots - everything could be self-contained within this macro. I think only doing it for a call of
No. That would be against the core philosophy of Plots.
Most multi-panel ones. And e.g. andrewsplot in PlotRecipes. I am really sorry for disagreeing on a number of points here, considering that you've done such great work on this. One possibility would be to remove the automatic labels for now, and merge the rest as is? And we can continue working on the labels. |
You are raising several good points, I'm also not fully convinced by this design. I can probably keep the helper function to extract column names as it will be useful in the future, take out the labelling part and merge as is. Of all the concerns, I think that the |
Good idea 👍 |
I've taken out the controversial labelling part (while keeping the unexported auxiliary function to collect the
Regarding 2, I think the consistent way to go is to write symbols in the |
Sounds good. This seems good to merge when you feel it's ready. WRT to the IndexedTables dep, I think in the long run we should support whatever infrastructure is needed to support tables, but it seems a little strange to support DataFrames outwardly and then use indexedTables on the inside. Do you think we could use the DataStreams.Sink type to accept input from any kind of table that implements the DataStreams.Source interface? |
It does, but the idea is that (modulo missing data issues which need to be clarified) with the tiny IterableTables dependency it's possible to extract columns of more or less anything, without adding a dependency on it:
julia> using IterableTables
julia> [i.Species for i in getiterator(iris)]
150-element Array{String,1}:
"setosa"
"setosa"
"setosa"
"setosa"
"setosa"
"setosa"
"setosa"
"setosa"
"setosa"
"setosa"
"setosa"
⋮
"virginica"
"virginica"
"virginica"
"virginica"
"virginica"
"virginica"
"virginica"
"virginica"
"virginica"
"virginica"
"virginica"
And the same code would work with more or less anything instead of What is different in the case of |
If you think we could replace the DataFrames dep with an IterableTables dep and still support DataFrame inputs I'm all for it. But could we still support missing values? That's a real issue. |
I'm pretty sure you can (modulo a bug in IterableTables with PooledDataArrays), I'll make a PR as soon as I have a completely fleshed out solution. The only real concern is what happens to the old DataFrames syntax if we drop the DataFrames dependency. |
We could keep it for a transition period until the interface crystallizes more. |
🎉 |
A possible solution to #72 using metaprogramming.