
Discussion: new groupapply syntax #88

Closed
piever opened this issue Sep 7, 2017 · 10 comments
Comments
piever (Member) commented Sep 7, 2017

More bikeshedding! After fixing the syntax to plot DataFrames in #74, we need to update the groupapply syntax to match it.

Example call (as it is now on master):

grp_error = groupapply(:density, school, :MAch, group = :Sx, compute_error = (:across, :School), axis_type = :continuous, summarize = (mean, sem), bandwidth = 0.5)
plot(grp_error, linewidth = 2)


This takes the :MAch column from the dataframe school, splits it by :Sx, splits again by :School, computes the kernel density with bandwidth = 0.5, puts the schools back together using the functions provided in summarize (to compute mean and error), and then plots the traces corresponding to each :Sx.
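To make the stages concrete, here is a toy, self-contained sketch of the grouping structure described above (not the actual groupapply implementation: `gauss_kde` and `grouped_density` are hypothetical stand-ins for the real kernel density machinery, and plain vectors stand in for the dataframe columns):

```julia
using Statistics

# Toy Gaussian KDE standing in for the real kernel density estimator (assumption)
gauss_kde(xs, data, bw) =
    [mean(exp.(-((x .- data) ./ bw) .^ 2 ./ 2)) / (bw * sqrt(2π)) for x in xs]

# Split by :Sx, then again by :School, evaluating each school's density on a
# shared grid; returns one (grid points × schools) matrix of curves per :Sx level.
function grouped_density(sx, school, mach; bw = 0.5)
    xs = range(-3, 3, length = 50)            # shared x-axis discretization
    traces = Dict{String, Matrix{Float64}}()
    for s in unique(sx)                       # outer split (group = :Sx)
        idx = sx .== s
        traces[s] = hcat([gauss_kde(xs, mach[idx .& (school .== k)], bw)
                          for k in unique(school[idx])]...)  # inner split by :School
    end
    traces
end
```

The summarize = (mean, sem) step then collapses each matrix of per-school curves into a single estimate-plus-error trace, column-wise.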

Issues:

  • not compatible with the new syntax
  • the groupapply call gets messy pretty quickly and the result is a bit magical

A simple solution to problem 1 would be the obvious translation:

grp_error = @df school groupapply(:density, :MAch, group = :Sx, compute_error = (:across, :School), axis_type = :continuous, summarize = (mean, sem), bandwidth = 0.5)

but problem 2 remains.

A possible proposal is to draw inspiration from the DataFramesMeta syntax (as we did with @df), in particular the @linq macro for concatenating operations. What I'd propose is something along the lines of:

@groupapply school |>
    where(:Minrty .== "Yes") |> # data selection (currently not possible, but I think it can be nice to have)
    group(:Sx) |> # split (corresponds to group = :Sx in a plot call)
    compute_error(:across, :School) |> # how to split to compute the error (:bootstrap or :none also possible)
    summarize(mean, sem) |> # how to summarize the traces from the previous step to get estimate and error
    axis_type(:continuous) |> # define how to treat the x axis (:binned, :discrete or :continuous)
    density(:MAch, bandwidth = 0.5) |> # analysis function (in this case kernel density)
    plot(linewidth = 2) # plot command (if omitted, the statistical object is returned instead)

Thoughts? In particular I'd like to understand if we're happy with a more pipeline syntax, what considerations we have about the order of the functions and whether they should be together or separate.

piever (Member, Author) commented Sep 7, 2017

I'm also taking the liberty to ping @davidanthoff to get a more experienced opinion as to what a query-like syntax for groupapply should look like (and whether it makes sense in the first place), or more generally what would be the preferred way of combining some basic data manipulation with a plot in terms of syntax.

mkborregaard (Member) commented Sep 7, 2017

I can't help thinking that this is starting to look like a whole analytical framework in itself - groupapply would have a special bespoke linq-like syntax that isn't used anywhere else in StatPlots, and it doesn't actually draw very much on StatPlots either, except for the plotting. It also sounds like you may develop this into a whole data-analytical framework focusing on splitting, calculating error and combining.

Could it be an idea to make a package "GroupedErrors.jl" instead? It sounds to me as if all it would need from StatPlots were the user recipe on groupederror. That would give you complete freedom to develop it in any direction you wanted, and maybe create an even more Query-compatible data analysis framework. There seem to be good arguments for making it a separate package, and no strong need to keep it together.

I can understand if you're concerned about discoverability - many people know StatPlots (though my impression is the vast majority just use Plots). But we could keep a referral in the readme "the awesome groupapply recipes have been moved to a standalone package".

Let me know your thoughts.

piever (Member, Author) commented Sep 7, 2017

Actually, one of the reasons why I developed this functionality is to systematize the routine data analysis I generally end up doing (a lot of select + split/apply/combine). This way it's easier for inexperienced users to make somewhat complicated plots with simpler commands. I am also working on a GUI to simplify the process even more (you can load a csv, select, split and plot with a few clicks: see https://github.com/piever/PlugAndPlot.jl/). Ideally it has other applications that can be generalized to other plots than groupapply, but all of this doesn't need to live in StatPlots: it's actually easier for me to develop it if it doesn't.

Then maybe GroupedErrors could take on all the data manipulation dependencies (thanks to the macro, we no longer need DataFrames in StatPlots) and handle the data shuffling, while the plots are actually drawn in StatPlots. The recipe for the groupederror is something that I'm reimplementing to use IndexedTables, so in terms of dependencies it's probably better if it lives in GroupedErrors (and I don't need a StatPlots dependency to have it; I'd just need RecipesBase). The Loess dependency can also be dropped, as it's only really used by groupapply.

I'll try to develop this separately and when it's mature enough we can add a link to it (as well as perhaps to the GUI if that also develops into a mature package). I'm unsure why the thought hasn't occurred to me before you mentioned it, it really makes a lot of sense...

piever (Member, Author) commented Sep 7, 2017

On second thought, there may be technical reasons why some groupederror-like recipe should live in StatPlots, but I think we can worry about that when it's time to register GroupedErrors.jl.

mkborregaard (Member) commented:
Great, I think that will be really useful to people, and your PlugAndPlot package also looks pretty sweet. BTW, do you know CrossfilterCharts.jl? http://nbviewer.jupyter.org/github/tawheeler/CrossfilterCharts.jl/blob/master/docs/CrossfilterCharts.ipynb

piever (Member, Author) commented Sep 7, 2017

Looks interesting, didn't know about that one!

davidanthoff commented:
It would certainly be great if we could integrate this with the Query.jl story! I published a new version recently, that brings a pipe syntax to Query.jl, so this seems good timing to try to figure out how we can make these things match and work together.

I have to admit I don't fully understand the original thing that is happening in the graph at the top. I think we could create the beginning of the whole data analysis in Query with

school |> @where(_.Minrty=="Yes") |> @groupby(_.Sx)

this would give you two groups. Or should things be split by Sx and School in this stage? That would be

school |> @where(_.Minrty=="Yes") |> @groupby({_.Sx, _.School})

But then I don't understand what the next step in the original analysis is.

piever (Member, Author) commented Sep 12, 2017

I really like the new Query syntax and I'm actually trying to implement something similar. The way groupapply works is actually a bit convoluted. The idea is that there is a first phase of selecting and grouping, which really is just:
school |> @where(_.Minrty=="Yes") |> @groupby(_.Sx)

Then the compute_error(:across, _.School) means that the following happens:

  • a common discretization of the x axis is chosen
  • the data is further split across :School and the desired curve is drawn for each school (in this case :density of the :MAch column, whatever that is). However, given that there are many curves like that, only the mean and standard error across all of those curves are shown, to give an idea of the average shape of :density of :MAch and how much variability there is from one school to another.
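The pointwise summary across schools can be illustrated with dummy data (current Julia syntax; `sem` is defined by hand here, since the standard error is not in the Statistics stdlib):

```julia
using Statistics

sem(v) = std(v) / sqrt(length(v))      # standard error of the mean

# dummy stand-in for the per-school density traces:
# 100 shared grid points × 20 schools
curves = rand(100, 20)

estimate = vec(mean(curves, dims = 2))            # pointwise mean across schools
err      = vec(mapslices(sem, curves, dims = 2))  # pointwise error across schools
```

The plotted line is `estimate` and the ribbon is `err`, one pair per level of the grouping variable.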

There are some examples at the end of the README in the groupapply section; maybe they are helpful.

In pure Query.jl terms, I actually thought that this compute_error part was hard to express; the whole thing would probably look something like this (I may have gotten some of Query's syntax wrong, but I hope you get the idea):

s = @from i in school begin
    @where i.Minrty == "Yes"
    @groupby i by i.Sx into j
    @select {j..MAch, j..School} into k
    @let shared_axis = compute_axis(k..MAch)
    @group k.MAch by k.School into l
    @select {axis = shared_axis, density = compute_density(l, shared_axis) } into m
    @group m by m.axis into n
    @select {mean(n..density), sem(n..density)}
    @collect DataFrame
end

I've started a pipeline based implementation in https://github.com/piever/GroupedErrors.jl but it's still not finalized/documented: I'll start adding docs and clarify syntax and open an issue to discuss Query integration as soon as I'm back from holidays (in a week or so).

What I think is a better way to implement this stuff, both in terms of code clarity and performance (and what I have implemented so far) is to use IterableTables (or maybe even Query - so far I have reimplemented the bits of Query that I needed but hopefully there is an approach with less code duplication) to:

  • filter the data
  • extract columns corresponding to grouping variables (:Sx, but I accept an arbitrary number of grouping variables), compute_error variables (:School) and data variables (:MAch)

Then I create an IndexedTable with grouping and compute_error variables as index columns and data variables as data column(s). Once in that format, all the data manipulations I need seem easy enough (mapslices and reducedim, essentially). Still, I feel we should move the discussion to that package as soon as the code is clean/documented enough that it makes sense for you to look into it.
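A hedged sketch of the proposed layout (the constructor shown reflects the IndexedTables API of that era; exact names may differ, and the column vectors are dummies). Grouping (:Sx) and compute_error (:School) variables become index columns and :MAch the data column:

```julia
using IndexedTables

# dummy columns standing in for the extracted dataframe columns
sx     = ["M", "F", "M"]
school = [1, 1, 2]
mach   = [0.3, 0.5, 0.1]

# index columns = grouping + compute_error variables; data column = :MAch
t = IndexedTable(Columns(Sx = sx, School = school), mach)
# With this layout, the per-(Sx, School) manipulations reduce to
# mapslices / reducedim over the index dimensions.
```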

piever (Member, Author) commented Sep 12, 2017

Actually, after looking more closely at the new "pipeline style" Query (once again, I think the new syntax really is a big improvement), I should be able to use it to a large extent. There is only one last thing bugging me: I tend to need arbitrary sets of columns (maybe found programmatically), which is really not compatible with NamedTuples, at least for now. It'd be hugely helpful to be able to @collect also an iterator that returns tuples (and not named tuples). For example, I believe the following could be made to work pretty easily:

using DataFrames, Query, IndexedTables
df = DataFrame(name=["John", "Sally", "Kirk"], age=[23., 42., 59.], children=[3,5,2])
x = df |>
    @where(_.age>40) |>
    @select((_.name, _.children)) |>
    Columns

As Columns has both a named and an unnamed version, this would solve all the issues related to working with NamedTuples and programmatically found sets of columns.

piever (Member, Author) commented Oct 13, 2017

This can be closed now that the whole groupapply functionality has been moved to GroupedErrors.

@piever piever closed this as completed Oct 13, 2017