
Discussion: new groupapply syntax #88

Closed
piever opened this issue Sep 7, 2017 · 10 comments
Comments
piever (Member) commented Sep 7, 2017

More bikeshedding! After fixing the syntax to plot DataFrames in #74, we need to update the groupapply syntax to match it.

Example call (as it is now on master):

grp_error = groupapply(:density, school, :MAch, group = :Sx, compute_error = (:across, :School), axis_type = :continuous, summarize = (mean, sem), bandwidth = 0.5)
plot(grp_error, linewidth = 2)


This takes the :MAch column from the dataframe school, splits it by :Sx, splits again by :School, computes the kernel density with bandwidth = 0.5, puts the schools back together using the functions provided in summarize (to compute mean and error), and then plots the traces corresponding to each :Sx.
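To make the stages concrete, here is a toy, self-contained sketch of the grouping structure described above (not the actual groupapply implementation: `gauss_kde` and `grouped_density` are hypothetical stand-ins for the real kernel density machinery, and plain vectors stand in for the dataframe columns):

```julia
using Statistics

# Toy Gaussian KDE standing in for the real kernel density estimator (assumption)
gauss_kde(xs, data, bw) =
    [mean(exp.(-((x .- data) ./ bw) .^ 2 ./ 2)) / (bw * sqrt(2π)) for x in xs]

# Split by :Sx, then again by :School, evaluating each school's density on a
# shared grid; returns one (grid points × schools) matrix of curves per :Sx level.
function grouped_density(sx, school, mach; bw = 0.5)
    xs = range(-3, 3, length = 50)            # shared x-axis discretization
    traces = Dict{String, Matrix{Float64}}()
    for s in unique(sx)                       # outer split (group = :Sx)
        idx = sx .== s
        traces[s] = hcat([gauss_kde(xs, mach[idx .& (school .== k)], bw)
                          for k in unique(school[idx])]...)  # inner split by :School
    end
    traces
end
```

The summarize = (mean, sem) step then collapses each matrix of per-school curves into a single estimate-plus-error trace, column-wise.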

Issues:

  • not compatible with the new syntax
  • the groupapply call gets messy pretty quickly and the result is a bit magical

A simple solution to problem 1 would be the obvious translation:

grp_error = @df school groupapply(:density, :MAch, group = :Sx, compute_error = (:across, :School), axis_type = :continuous, summarize = (mean, sem), bandwidth = 0.5)

but problem 2 remains.

A possible proposal is to draw inspiration from the DataFramesMeta syntax (as we did with @df), in particular the @linq macro for concatenating operations. What I'd propose is something along the lines of:

@groupapply school |>
    where(:Minrty .== "Yes") |> # data selection (currently not possible, but I think it can be nice to have)
    group(:Sx) |> # split (corresponds to group = :Sx in a plot call)
    compute_error(:across, :School) |> # how to split to compute the error (:bootstrap or :none also possible)
    summarize(mean, sem) |> # how to summarize the traces from the previous step to get estimate and error
    axis_type(:continuous) |> # define how to treat the x axis (:binned, :discrete or :continuous)
    density(:MAch, bandwidth = 0.5) |> # analysis function (in this case kernel density)
    plot(linewidth = 2) # plot command (if omitted, the statistical object is returned instead)

Thoughts? In particular I'd like to understand if we're happy with a more pipeline syntax, what considerations we have about the order of the functions and whether they should be together or separate.

piever (Member, Author) commented Sep 7, 2017

I'm also taking the liberty to ping @davidanthoff to get a more experienced opinion as to what a query-like syntax for groupapply should look like (and whether it makes sense in the first place), or more generally what would be the preferred way of combining some basic data manipulation with a plot in terms of syntax.

mkborregaard (Member) commented Sep 7, 2017

I can't help thinking that this is starting to look like a whole analytical framework in itself - groupapply would have a special bespoke linq-like syntax that isn't used anywhere else in StatPlots, and it doesn't actually draw very much on StatPlots either, except for the plotting. It also sounds like you may develop this into a whole data-analytical framework focusing on splitting, calculating error and combining.

Could it be an idea to make a package "GroupedErrors.jl" instead? It sounds to me as if all it would need from StatPlots were the user recipe on groupederror. That would give you complete freedom to develop it in any direction you wanted, and maybe create an even more Query-compatible data analysis framework. There seem to be good arguments for making it a separate package, and no strong need to keep it together.

I can understand if you're concerned about discoverability - many people know StatPlots (though my impression is the vast majority just use Plots). But we could keep a referral in the readme "the awesome groupapply recipes have been moved to a standalone package".

Let me know your thoughts.

piever (Member, Author) commented Sep 7, 2017

Actually, one of the reasons why I developed this functionality is to systematize the routine data analysis I generally end up doing (a lot of select + split/apply/combine). This way it's easier for inexperienced users to make somewhat complicated plots with simpler commands. I am also working on a GUI to simplify the process even more (you can load a csv, select, split and plot with a few clicks: see https://github.com/piever/PlugAndPlot.jl/). Ideally it has other applications that can be generalized to other plots than groupapply, but all of this doesn't need to live in StatPlots: it's actually easier for me to develop it if it doesn't.

Then maybe GroupedErrors could take on all the data manipulation dependencies (thanks to the macro, we no longer need DataFrames in StatPlots) and handle the data shuffling, while the plots are actually drawn in StatPlots. The recipe for the groupederror is something that I'm reimplementing to use IndexedTables, so in terms of dependencies it's probably better if it lives in GroupedErrors (and I don't need a StatPlots dependency to have it; I'd just need RecipesBase). The Loess dependency can also be dropped, as it's only really used by groupapply.

I'll try to develop this separately and when it's mature enough we can add a link to it (as well as perhaps to the GUI if that also develops into a mature package). I'm unsure why the thought hasn't occurred to me before you mentioned it, it really makes a lot of sense...

piever (Member, Author) commented Sep 7, 2017

On second thought, there may be technical reasons why some groupederror-like recipe should live in StatPlots, but I think we can worry about that when it's time to register GroupedErrors.jl.

mkborregaard (Member) commented:
Great, I think that will be really useful to people, and your PlugAndPlot package also looks pretty sweet. BTW, do you know CrossfilterCharts.jl? http://nbviewer.jupyter.org/github/tawheeler/CrossfilterCharts.jl/blob/master/docs/CrossfilterCharts.ipynb

piever (Member, Author) commented Sep 7, 2017

Looks interesting, didn't know about that one!

davidanthoff commented:
It would certainly be great if we could integrate this with the Query.jl story! I published a new version recently, that brings a pipe syntax to Query.jl, so this seems good timing to try to figure out how we can make these things match and work together.

I have to admit I don't fully understand the original thing that is happening in the graph at the top. I think we could create the beginning of the whole data analysis in Query with

school |> @where(_.Minrty=="Yes") |> @groupby(_.Sx)

this would give you two groups. Or should things be split by Sx and School in this stage? That would be

school |> @where(_.Minrty=="Yes") |> @groupby({_.Sx, _.School})

But then I don't understand what the next step in the original analysis is.

piever (Member, Author) commented Sep 12, 2017

I really like the new Query syntax and I'm actually trying to implement something similar. The way groupapply works is actually a bit convoluted. The idea is that there is a first phase of selecting and grouping, which really is just:
school |> @where(_.Minrty=="Yes") |> @groupby(_.Sx)

Then the compute_error(:across, _.School) means that the following happens:

  • a common discretization of the x axis is chosen
  • the data is further split across :School and the desired curve is drawn for each school (in this case :density of the :MAch column, whatever that is). However, given that there are many curves like that, only the mean and standard error across all of those curves are shown, to give an idea of the average shape of :density of :MAch and how much variability there is from one school to another.
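The pointwise summary across schools can be illustrated with dummy data (current Julia syntax; `sem` is defined by hand here, since the standard error is not in the Statistics stdlib):

```julia
using Statistics

sem(v) = std(v) / sqrt(length(v))      # standard error of the mean

# dummy stand-in for the per-school density traces:
# 100 shared grid points × 20 schools
curves = rand(100, 20)

estimate = vec(mean(curves, dims = 2))            # pointwise mean across schools
err      = vec(mapslices(sem, curves, dims = 2))  # pointwise error across schools
```

The plotted line is `estimate` and the ribbon is `err`, one pair per level of the grouping variable.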

There are some examples at the end of the README in the groupapply section; maybe they are helpful.

In pure Query.jl terms, I actually thought that this compute_error part was hard to express; the whole thing would probably look something like this (I may have gotten some of Query's syntax wrong, but I hope you get the idea):

s = @from i in school begin
    @where i.Minrty == "Yes"
    @groupby i by i.Sx into j
    @select {j..MAch, j..School} into k
    @let shared_axis = compute_axis(k..MAch)
    @group k.MAch by k.School into l
    @select {axis = shared_axis, density = compute_density(l, shared_axis) } into m
    @group m by m.axis into n
    @select {mean(n..density), sem(n..density)}
    @collect DataFrame
end

I've started a pipeline based implementation in https://github.com/piever/GroupedErrors.jl but it's still not finalized/documented: I'll start adding docs and clarify syntax and open an issue to discuss Query integration as soon as I'm back from holidays (in a week or so).

What I think is a better way to implement this stuff, both in terms of code clarity and performance (and what I have implemented so far) is to use IterableTables (or maybe even Query - so far I have reimplemented the bits of Query that I needed but hopefully there is an approach with less code duplication) to:

  • filter the data
  • extract columns corresponding to grouping variables (:Sx, but I accept an arbitrary number of grouping variables), compute_error variables (:School) and data variables (:MAch)

Then I create an IndexedTable with grouping and compute_error variables as index columns and data variables as data column(s). Once in that format, all the data manipulations I need seem easy enough (mapslices and reducedim, essentially). Still, I feel we should move the discussion to that package as soon as the code is clean/documented enough that it makes sense for you to look into it.
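A hedged sketch of the proposed layout (the constructor shown reflects the IndexedTables API of that era; exact names may differ, and the column vectors are dummies). Grouping (:Sx) and compute_error (:School) variables become index columns and :MAch the data column:

```julia
using IndexedTables

# dummy columns standing in for the extracted dataframe columns
sx     = ["M", "F", "M"]
school = [1, 1, 2]
mach   = [0.3, 0.5, 0.1]

# index columns = grouping + compute_error variables; data column = :MAch
t = IndexedTable(Columns(Sx = sx, School = school), mach)
# With this layout, the per-(Sx, School) manipulations reduce to
# mapslices / reducedim over the index dimensions.
```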

piever (Member, Author) commented Sep 12, 2017

Actually, after looking more closely at the new "pipeline style" Query (once again, I think the new syntax really is a big improvement), I should be able to use it to a large extent. There is only one last thing bugging me: I tend to need arbitrary sets of columns (maybe found programmatically), which is really not compatible with NamedTuples, at least for now. It'd be hugely helpful to be able to @collect also an iterator that returns tuples (and not named tuples). For example, I believe the following could be made to work pretty easily:

using DataFrames, Query, IndexedTables
df = DataFrame(name=["John", "Sally", "Kirk"], age=[23., 42., 59.], children=[3,5,2])
x = df |>
    @where(_.age>40) |>
    @select((_.name, _.children)) |>
    Columns

As Columns has both a named and an unnamed version, this would solve all the issues related to working with NamedTuples and programmatically found sets of columns.

piever (Member, Author) commented Oct 13, 2017

This can be closed now that the whole groupapply functionality has been moved to GroupedErrors.

@piever piever closed this as completed Oct 13, 2017