Discussion: new groupapply syntax #88
I'm also taking the liberty to ping @davidanthoff to get a more experienced opinion as to what a query-like syntax for |
I can't help thinking that this is starting to look like a whole analytical framework in itself: groupapply would have a special bespoke linq-like syntax that isn't used anywhere else in StatPlots, and it doesn't actually draw very much on StatPlots either, except for the plotting. It also sounds like you may develop this into a whole data-analysis framework focused on splitting, calculating error and combining. Could it be an idea to make a package "GroupedErrors.jl" instead? It sounds to me as if all it would need from StatPlots was the user recipe on
I can understand if you're concerned about discoverability: many people know StatPlots (though my impression is the vast majority just use Plots). But we could keep a referral in the readme: "the awesome groupapply recipes have been moved to a standalone package". Let me know your thoughts. |
Actually, one of the reasons why I developed this functionality is to be able to systematize the routine data analysis that I generally end up doing (a lot of select + split/apply/combine). In this way it's easier for inexperienced users to make somewhat complicated plots with simpler commands:
I am also working on a GUI to simplify the process even more (you can load a csv, select, split and plot with a few clicks: see https://github.com/piever/PlugAndPlot.jl/). Ideally it has other applications that can be generalized to other plots than
Then maybe GroupedErrors could take care of all the data manipulation dependencies (thanks to the macro, we no longer need DataFrames on StatPlots) and take care of the data shuffling, while the plots are actually drawn in StatPlots. The recipe for the
I'll try to develop this separately and when it's mature enough we can add a link to it (as well as perhaps to the GUI if that also develops into a mature package). I'm unsure why the thought hasn't occurred to me before you mentioned it, it really makes a lot of sense... |
On second thought, there may be technical reasons why the some |
Great, I think that will be really useful to people, and your PlugAndPlot package also looks pretty sweet. BTW, do you know CrossfilterCharts.jl? http://nbviewer.jupyter.org/github/tawheeler/CrossfilterCharts.jl/blob/master/docs/CrossfilterCharts.ipynb |
Looks interesting, didn't know about that one! |
It would certainly be great if we could integrate this with the Query.jl story! I published a new version recently that brings a pipe syntax to Query.jl, so this seems like good timing to try to figure out how we can make these things match and work together.
I have to admit I don't fully understand the original thing that is happening in the graph at the top. I think we could create the beginning of the whole data analysis in Query with
school |> @where(_.Minrty=="Yes") |> @groupby(_.Sx)
This would give you two groups. Or should things be split by
school |> @where(_.Minrty=="Yes") |> @groupby({_.Sx, _.School})
But then I don't understand what the next step in the original analysis is. |
I really like the new Query syntax and I'm actually trying to implement something similar. The way Then the
There are some examples at the end of the README in the In pure Query.jl terms I actually thought that this
s = @from i in school begin
@where i.Minrty == "Yes"
@group i by i.Sx into j
@select {j..MAch, j..School, } into k
@let shared_axis = compute_axis(k..MAch)
@group k.MAch by k.School into l
@select {axis = shared_axis, density = compute_density(l, shared_axis) } into m
@group m by m.axis into n
@select {mean(n..density), sem(n..density)}
@collect DataFrame
end
I've started a pipeline based implementation in https://github.com/piever/GroupedErrors.jl but it's still not finalized/documented: I'll start adding docs and clarify syntax and open an issue to discuss Query integration as soon as I'm back from holidays (in a week or so).
What I think is a better way to implement this stuff, both in terms of code clarity and performance (and what I have implemented so far) is to use IterableTables (or maybe even Query - so far I have reimplemented the bits of Query that I needed but hopefully there is an approach with less code duplication) to:
Then I create an IndexedTable with grouping and compute_error variables as index columns and data variables as data column(s). Once in that format it seems to me that all the data manipulations I need seem easy enough ( |
Actually, after looking more closely at the new "pipeline style" Query (once again, I think the new syntax really is a big improvement) I should actually be able to use it to a large extent. There only is one last thing that is bugging me: I tend to need to use arbitrary sets of columns (maybe found programmatically) which is really not compatible with NamedTuples, at least now. It'd be hugely helpful to be able to
using DataFrames, Query, IndexedTables
df = DataFrame(name=["John", "Sally", "Kirk"], age=[23., 42., 59.], children=[3,5,2])
x = df |>
@where(_.age>40) |>
@select(( _.name, _.children)) |>
Columns
As |
This can be closed now that the whole |
More bikeshedding! After fixing the syntax to plot DataFrames in #74, we need to update the groupapply syntax to match it.
Example call (as it is now on master):
What this does is take the :MAch column from the dataframe school, split by :Sx, split again by :School, compute the kernel density with bandwidth = 0.5, put all the schools back together using the functions provided in summarize to compute mean and error, and then plot the traces corresponding to each :Sx.
Issues:
groupapply call gets messy pretty quickly and the result is a bit magical
A simple solution of problem 1 would be the obvious translation:
grp_error = @df school groupapply(:density, :MAch, group = :Sx, compute_error = (:across, :School), axis_type = :continuous, summarize = (mean, sem), bandwidth = 0.5)
but problem 2 remains.
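For readers trying to follow what the call computes, here is a rough sketch of the split/apply/combine steps described above, written in plain Julia (1.x) with only the Statistics standard library. It is not the groupapply implementation: the toy `rows` data, the hand-rolled `kde` and `sem` helpers and the `grouped_density` function are all hypothetical names made up for illustration, and the real package may compute the shared axis and the density differently.

```julia
using Statistics

# Hypothetical toy data standing in for the `school` table (not the real dataset).
rows = [
    (Sx = "Male",   School = "A", MAch = 10.0),
    (Sx = "Male",   School = "A", MAch = 12.5),
    (Sx = "Male",   School = "B", MAch = 8.0),
    (Sx = "Male",   School = "B", MAch = 11.0),
    (Sx = "Female", School = "A", MAch = 9.0),
    (Sx = "Female", School = "A", MAch = 13.0),
    (Sx = "Female", School = "B", MAch = 7.5),
    (Sx = "Female", School = "B", MAch = 10.5),
]

# Hand-rolled Gaussian kernel density estimate, evaluated on a fixed axis.
kde(xs, axis; bandwidth = 0.5) =
    [mean(exp.(-0.5 .* ((a .- xs) ./ bandwidth) .^ 2)) / (bandwidth * sqrt(2pi))
     for a in axis]

# Standard error of the mean (defined here to avoid extra dependencies).
sem(v) = std(v) / sqrt(length(v))

function grouped_density(rows; bandwidth = 0.5, npoints = 50)
    mach = [r.MAch for r in rows]
    # One axis shared by every group, so the traces are comparable.
    axis = collect(range(minimum(mach), stop = maximum(mach), length = npoints))
    result = Dict{String, Any}()
    for sx in unique(r.Sx for r in rows)                  # split by :Sx
        in_group = [r for r in rows if r.Sx == sx]
        schools = unique(r.School for r in in_group)      # split again by :School
        traces = [kde([r.MAch for r in in_group if r.School == s], axis;
                      bandwidth = bandwidth) for s in schools]
        m = reduce(hcat, traces)                          # npoints x nschools
        # Combine across schools: mean and error at each axis point.
        result[sx] = (axis = axis,
                      mean = vec(mean(m, dims = 2)),
                      sem = [sem(m[i, :]) for i in 1:size(m, 1)])
    end
    return result
end

res = grouped_density(rows)
```

Each entry of `res` holds the shared axis plus one mean trace and one error trace per :Sx value, which is roughly what ends up being plotted as a line with an error band.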
A possible proposal is to draw inspiration from the DataFramesMeta syntax (as we did with @df), in particular the @linq macro to concatenate operators. What I'd propose is something along the lines of:
Thoughts? In particular I'd like to understand if we're happy with a more pipeline syntax, what considerations we have about the order of the functions and whether they should be together or separate.