Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Add population analysis with error bars across population for continuous plots #30

Merged
merged 25 commits into from
Jan 18, 2017
Merged

Conversation

piever
Copy link
Member

@piever piever commented Dec 10, 2016

After the discussion in #27 I've tried to come up with a consistent way to insert error bars for continuous plots representing standard error across a population. I've chosen the dataset

school = dataset("mlmRev", "Hsb82")

where a bunch of school are sampled and in each school one can get a distribution of values. Then it is natural to run a given analysis per school (e.g. cumulative of a value, a density plot, regress y from x possibly nonparametrically) and then plot the mean of this analysis across schools with a shaded standard error across schools. I've added the possibility of doing that (for shaded error plots, bar plots and scatter plots) giving the analysis function, the dataset and the arguments necessary for the analysis function. I've added the possibility to split data with the keyword group. The keyword "across" means which variable represents the population to compute s.e. The others keywords are transmitted to the plot. The legend is computed automatically, but can be modified (a vectorial keyword will be cycled across the different split data). I've also added a few functions that could be useful to run this analysis (one that computes density, one for the cumulative and one for locally-linear regression), both for a continuous and categorical x axis.

Example:

screenshot 2016-12-10 19 23 21

The same type of analysis can be done with bar plots (or also scatter plots):

screenshot 2016-12-10 18 55 32

I've added a different package (KernelEstimator) because it allows you to compute these statistical analysis specifying what points to have on the x-axis (when I split the data across the population I need the same x-values to compare y-values and get a standard error) and it has an implementation of nonparametric regression, which I couldn't find on KernelDensity. As I mentioned in #27 , the alternative is to not require that and, should different subjects have different x-values, one could just choose common x-values at the end and interpolate what the values of the various subjects would be at the common x-values. I'm working on that right now.

I'd be very happy to get suggestions on what is the best syntax and what is the best way to incorporate this stuff in the plots.jl ecosystem: my intuition is that, if one adds the interpolation option, it should be reasonably easy to get this kind of error-plots using Plots.jl features in a possibly smarter way (e.g. a keyword that does the same as "group" to split the data but instead of plotting separately finds mean and s.e.m. and plots correspondingly). However I don't think I understand the Plots.jl machinery well-enough to implement that kind of solution, so I'd be happy to get directions/help/code.

@mkborregaard
Copy link
Member

I am happy to have a look at this as soon as I have a moment, but consider this: would it not be nice to think of the errors as simply another series to add to a plot, or something that could be lifted from a Fit object? I have wanted to do a PR on that for a long time, so I really hope we can consider this possibility before moving StatPlots off in the direction of this PR.

I'll open an issue this week detailing my ideas for support of statistical objects and error bars in StatPlots, so that everyone can review it.

@piever
Copy link
Member Author

piever commented Dec 12, 2016

That's why I put the "WIP" label: so we can consider all the different options. I though the best way to start a discussion was to propose one way and then see what other people think. I do believe that if one knows Plots.jl better than I do, maybe there's an even easier way of doing this and I'm looking forward to seeing it :)

@mkborregaard
Copy link
Member

mkborregaard commented Dec 15, 2016

Hi @piever I haven't forgotten I promised you feedback on this :-) It is a nice job on the code, but I have been wrecking my brain on how these recipes could fall naturally into the plots ecosystem. Here are some things to consider, I am happy to discuss any of this:

  1. I agree with what you said in Added shaded error plots #27 that density(quakes,:Mag; population = :Stations, split = :Deep) is better than shadederror(f,:quakes,:Mag; population = :Stations, split = :Deep). The density plot is what the user wants, whereas the mean-and-error is just a way to show the variation. I think it would be better to make density and cumulative the user plots (though I think my suggestion under 3 is better than this).

  2. I am not so keen on importing bar and scatter from Plots and giving them new keywords (it becomes extra confusing as group is already a keyword in Plots).

  3. My main worry is that it becomes a little too magical for someone not very clear on what is going on. Might you consider separating the analysis from the plot? This would make it a two-step process: using a function to create a grouped_summary object, and plotting that object. For the first step you could have a function that splits the dataset, applies some function and summarizes the variation using eg s.e.m. or confidence intervals, and gives the result as a custom type. For the second step you could have a type recipe for that type of object - then the series type (line, bar or scatter) and error type (error bars, shaded polygons) could be specified in the plotting or automatically based on attributes of the object. I think that would make the code more modular and extensible, and more transparent and intuitive for the user.

@piever
Copy link
Member Author

piever commented Dec 16, 2016

Actually, 3 seems like the best solution (also, ideally this wouldn't be the only case case where a model has some uncertainty and one wants to visualise it with a shade: simple linear regression with one predictor is another obvious example). I'll be playing a bit with recipes to see if I can get it to work, but actually it should lead to simpler code than what I wrote so far (as I had to basically reimplement group and things like that).

@mkborregaard
Copy link
Member

mkborregaard commented Dec 16, 2016

simple linear regression with one predictor is another obvious example

Exactly! I have in fact recently opened a PR on GLM to allow predict to generate confidence intervals (JuliaStats/GLM.jl#171) that can be plotted as shaded errors. I see some nice further possibilities in this - eg regressions with a categorical and a continous variable can be plotted as several colored lines with shaded errors, a categorical anova as a bar plot with contrasts and significance levels etc.
The idea is to keep the statistical analysis and the plotting separate - to put the analysis completely in the hands of the user, but allow extensive support for intuitive plotting of statistical objects.

@piever
Copy link
Member Author

piever commented Dec 16, 2016

The only trick seems to be that one has to pass the analysis function as a keyword argument. As far as I understand, what you tried on gitter can't really work:

plot(quakes, kde(:Mag), group = :Deep)

because plots.jl will try to split the second argument (i.e. kde(quakes[:Mag])) according to quakes[:Deep], but now that array doesn't even have the same length, it is whatever kde outputs, whereas instead you should be splitting the dataset. So one should try a specific type series (called for example population analysis) with a signature like

populationanalysis(quakes, :Mag, group = :Deep, analysisfunction = density)

and now I believe that the first thing happening will be the splitting of the column quakes[:Mag] and then you can run whatever analysis you want, based on keyword analysisfunction.

With this (plus generalizing group to work with more than one label) I think one could get the same functionality as I have now but in a smarter way, I'll try it as soon as I have enough time and will keep you posted.

@mkborregaard
Copy link
Member

mkborregaard commented Dec 17, 2016

No, that's right, the group approach I tried on Gitter cannot work, because the kde function is not elementwise on the elements of x. That is how the function interface to Plots works by convention (the implementation is map(f,x) in fact). kde is whole-vector, along with a few other methods such as generating histograms, and because of that I don't think it should use the function-passing syntax of Plots.

Instead, I think most users would find it intuitive to have a densityplot and a cumdensityplot function, that follows the same syntax formula as histogram (which should be easy to do with parts of your code). IMHO, though, they should not take 3-4 arguments that are specialized for one particular way of generating error bars. That should be taken care of in an additional step.

The challenge with specifying your suggested error function correctly is that both the mean line and the errors are created from the subsetting, i.e. the values for the line itself is not in the dataset, contrasting with Plots' dichotomy between values and errors. That is why I think it could be nice to have two steps (non-working pseudocode):

by_Sx = groupapply(cumulative, school, on = :Mach, by = :Sx, summarize = sem) #no problem with having the function first here
plot(by_Sx, error = :ribbon)

So in principle what we have is a function for a statistical analysis (which I have called groupapply in the example) that creates an object that can be used in any further analysis, and a plot function that follows the generic syntax. I think this, though requiring two lines, will give a more intuitive workflow for most people with less requirement for looking into documentation.
Especially as I would like to expand StatPlots to:

reg = lm(y ~ x + x^2, df)
plot(reg, error = :ribbon)

@piever
Copy link
Member Author

piever commented Dec 17, 2016

Actually I guess I could easily get that pseudocode to work with code that I have already. I still have one doubt however as to what should be the type of outcome from groupapply.

Option 1)
A dataframe, with one column for x values, one for y values, one for error, one group membership. In this case the plot call would be:

by_Sx = groupapply(cumulative, school, on = :Mach, by = :Sx, summarize = sem)
shadederror(by_Sx, :x, :y; shade = :error, group = :group)

Disadvantage: wouldn't work with groupedbar! But maybe one could also fix groupedbar to work with group.

Option 2)
A specific type where by_Sx.x is the vector of x points, by_Sx.y is the 2D array of y points (one axis is x coordinate, the other the group), by_Sx.error is the 2D array of errors, by_Sx.label gives some lable to the different groups. The call could be:

by_Sx = groupapply(cumulative, school, on = :Mach, by = :Sx, summarize = sem)
shadederror(by_Sx.x, by_Sx.y; shade = :by_Sx.error, label = by_Sx.label)

or maybe one could add a type recipe for that.
Disadvantage: x axis for different groups may not have same number of points, so it would be a bit artificial to force them to. Also maybe the dataframe format is more handy if one wants and continue the analysis.

What do you think?

@piever
Copy link
Member Author

piever commented Dec 18, 2016

Extra issue with giving a dataframe: shadederror doesn't play with grouping as well as say ribbon or err keywords, so that this:

by_Sx = groupapply(cumulative, school, on = :Mach, by = :Sx, summarize = sem)
shadederror(by_Sx, :x, :y; shade = :error, group = :group)

wouldn't work whereas this would (but not in plotlyjs):

by_Sx = groupapply(cumulative, school, on = :Mach, by = :Sx, summarize = sem)
plot(by_Sx, :x, :y; ribbon = :error, group = :group, fillalpha = 0.5)

If you know how to make shadederror behave just like ribbon, that'd be nice :)

Also, I have a technical question. If we go for the specialized type, say by_Sx::GroupedError then one would want plot(by_Sx, linetype = :line) to return shadederror(by_Sx.x, by_Sx.y; shade = :by_Sx.error, label = by_Sx.label) whereas plot(by_Sx, linetype = :bar) would return groupedbar(by_Sx.x, by_Sx.y; err = :by_Sx.error, label = by_Sx.label) and so on. However while it's straightforward to do it importing the function plot and expanding it with the new type by_Sx, I'm not quite sure what's the elegant way to do this with recipes in a way that plays well with the grouping of the data (because, as far as I understand, the type recipes happen before the data splitting). Could you give some example code that would do that?

@mkborregaard
Copy link
Member

Thanks for all the considerations - I'll answer on Tuesday

@tbreloff
Copy link
Member

tbreloff commented Dec 18, 2016 via email

@mkborregaard
Copy link
Member

mkborregaard commented Dec 18, 2016

Glad to hear you agree @tbreloff :-)
In my opinion, a statistical plotting image is not one that does the statistical analysis and plot it - it is one that helps the user visualize the statistics that she is doing! And specialized type recipes for statistical objects is the way of doing that.

@piever about your first question, I meant number 2 - create a specialized object, then a type recipe for that (i.e. call it using plot - in general, I think shadederror isn't the best name for this functionality).

The reason I hesitated instead of responding immediately - was the thought "could this be done with a type recipe for GroupedDataFrame that split each group into series?". You could by_Sx = groupby(schools, :Sx), then plot(by_Sx, KernelDensity.kde, :MAch), which should create a density plot with two series. There could then be an extra argument called summarize that takes either :none or a tuple of functions for the central trend and variation, e.g. (mean, sem). Just a thought, I think that would be both modular and intuitive (but I haven't thought it all the way through).

@piever
Copy link
Member Author

piever commented Dec 18, 2016

Thanks for the feedback. I changed it to work in the most general scenario (with mean and sem as standard analysis, but it can be changed). I've added a recipe that takes this group error type and depending on the seriestype does line plot with shadederror or scatter with errorbar or groupedbar with errorbar. The only thing that remain to be decided is what type this groupederror should be. I'm of the opinion that the most sensible would be for it to be a dataframe (either grouped or with a column indicating group membership) and I could add a small function that reshapes a dataframe so that it can work with groupedbar.

@mkborregaard
Copy link
Member

Can you update the PR to show your changes?

@piever
Copy link
Member Author

piever commented Dec 19, 2016

Still need to fix this decision of a type that works easily both for line plots and groupedbar, but I should be able to push by today.

@mkborregaard
Copy link
Member

Just make your own.

@piever
Copy link
Member Author

piever commented Dec 19, 2016

This one should be working, and I think I've incorporated more or less all the feedback. There is a groupapply function that splits the data across kw "group", then applies "summarize" (default is (mean,sem), but you can put any pair of functions. Also you have the option shared_xaxis whether you want a common x axis for all the split data (which is required for groupedbar but not recommended for shaded error). Also, for the local regression I'm using Loess (like Gadfly does) instead of KernelEstimator because it seems better maintained.

Example use:

using DataFrames
using RDatasets
using Plots
using StatPlots
gr()
school = dataset("mlmRev","Hsb82");
grp_error = StatPlots.groupapply(:cumulative, school, :MAch; across = :School, group = :Sx)
plot(grp_error, line = :line)

screenshot 2016-12-19 12 28 27

Keywords for loess or kerneldensity can be given to groupapply:

df = StatPlots.groupapply(:density, school, :CSES; bandwidth = 1., across = :School, group = :Minrty)
plot(df, line = :line)

screenshot 2016-12-19 12 34 43

The bar plot

pool!(school, :Sx)
grp_error = StatPlots.groupapply(school, :Sx, :MAch; across = :School, group = :Minrty, shared_xaxis = true)
plot(grp_error, line = :bar)

screenshot 2016-12-19 12 29 52

Only problem left (but independent of this PR): with plotlyjs() groupedbar doesn't work well with errorbar, I'll open an issue for that.

@mkborregaard
Copy link
Member

Great to hear! I think the across parameter should not be so complicated. across can take a variable name and act as it does now. If not specified it works over all observations.

@piever
Copy link
Member Author

piever commented Jan 6, 2017

That'd be ideal, but the main issue is that for some of the functions it doesn't make sense. For example if the x axis is continuous you can't draw the curve with one observation, so you can't really split by observation (hence the suggestion to use bootstrap). For density and cumulative, in the discrete case, I guess you can do it individually per observation (getting density and cumulative of delta functions) and the average would be correct, and the sem would also be informative. Still in general I think that there could be some analysis where splitting by observation and taking the mean has little to do with doing the analysis on all the observations together (or, is the same but is way less efficient) so I would be in favor of giving the option of not splitting and simply not getting the error. I'm trying to come up with a way of letting the user do that that is not too error-prone.

@piever
Copy link
Member Author

piever commented Jan 6, 2017

By thinking a bit more, given that we plan on adding built-in errors (see linreg, but also for locreg) it would maybe more sense to have a more general keyword:
compute_error

It could have as options: compute_error = (:across, column_name) to work with the across keyword as this PR is currently doing, compute_error =:across would be across all observations, compute_error = :none to have a faster computation, or compute_error = :built_in if the user prefers to compute its own version - or for linreg we could give one ourselves- or compute_error = :bootstrap for example to run the bootstrap analysis (we don't need to implement all of them right now). In this case I would maybe propose compute_error =:none as default, but I guess that's up for debate.

What do you think?

@recipe f(k::KernelDensity.UnivariateKDE) = k.x, k.density
@recipe f(k::KernelDensity.BivariateKDE) = k.x, k.y, k.density

@shorthands cdensity

export groupapply
export get_summary
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to export get_summary?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have strong opinions on that. The actual plotting is done as in the README:

grp_error = groupapply(:cumulative, school, :MAch; compute_error = (:across,:School), group = :Sx)
plot(grp_error, line = :path)

and in practice I never really use it. Is it bad to export it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case it doesn't seem necessary, but I admit I don't actually know when you'd call it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was proposed by @mkborregaard. The idea is that in the normal scenario, you would have a dataframe that you'd split according to a column and the summarize the split data (for example getting mean and sem). The use case would be somebody having obtained a grouped dataframe already by some other means and wanting to get the summary, which is not possible with only groupapply.
The pipeline is: groupapply calls get_summary who calls the analysis function on the subdataframes. groupapply, get_summary and the built-in analysis function have docstrings. The docstrings of groupapply refer to get_summary, so maybe the user is expecting that it should be exported.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, on a second thought I'm in favour of keeping it also due to performance reasons: for large datasets it may be inconvenient to have to group it every time. In the implementation of compute_error = :bootstrap I see a seizable performance improvement by working with GroupedDataFrames directly rather than putting everything in a big DataFrame and then applying groupby.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes exatly, that was the idea. I think it should be renamed to something more specific - "get_summary" is too generic a name to put in the user's workspace for a very specific function used for one type of plot only

@piever
Copy link
Member Author

piever commented Jan 10, 2017

@mkborregaard I've managed to add everything I wanted. Now compute_error = :none doesn't compute any error. compute_error = :across is across all observations. compute_error = (:across, col_name) is across the variable col_name. compute_error = (:bootstrap, n_samples), generates n_samples fake datasets using nonparametric bootstrapping and then summarizes them. Everything is explained in the README. Addressed your concerned about the naming of get_mean_sem. As far as I'm concerned, I think it's ready to go.

@piever
Copy link
Member Author

piever commented Jan 11, 2017

Slight off-topic, but I'm not sure where to ask: I'm making a jupyter notebook with many examples from this library (mainly for my lab-mates). In particular I'll show some examples similar to the README but in more detail and I'll also show how to use the functionality of groupederror together with the Interact.jl package. Basically you can do a simple @manipulate adding a couple of sliders for smoothing paramaters, and 3 or 4 more symbol variables representing the choice of the variable on the x axis, the choice of the analysis (i.e. cumulative, density or the name of a variable to plot on the y axis as fct of x), some grouping variable and how to compute the error.
With a few lines of code you can actually produce some sliders and toggle buttons that will allow you to do interactive exploratory data analysis (i.e. look at more or less all the variables and their relationships) on your dataset with a few clicks (I think R has a GUI that does that, or so I'm told).
I was wondering where I should upload this notebook? Should I make a separate repo and add the link or there is an appropriate place? I was thinking maybe ExamplePlots.jl, even if it's not a Plots example but a StatPlots example.

@mkborregaard
Copy link
Member

Thanks for all your work on this - I am at a conference in the US currently but will look it over as soon as I get a chance.

@tbreloff
Copy link
Member

ExamplePlots would probably be the right home for something like this.

@piever
Copy link
Member Author

piever commented Jan 12, 2017

Ok, I've added a PR there.

@piever
Copy link
Member Author

piever commented Jan 16, 2017

@mkborregaard I've been testing the framework more extensively and 2 minor issues came up:

  1. So far I compute the axis separately for the different groups. Sometimes, if the ranges are very different, it looks a bit silly to have for example two cdfs ending in different places. What I can do is to keep this behaviour as default and add the option of having a common x axis / giving the x axis by hand. EDIT= after playing with it, when given an out of range axis KernelDensity does weird things, maybe it should be assumed that the axis are separate, but one can give it by hand , so if the user insists it can simply give the same to all groups. What do you think? 2nd EDIT = This other option also doesn't seem like a fantastic user experience, I'm getting convinced that maybe it's better to leave it as it is and if the user wants to extend the axis, he/she can do so by end directly on the GroupedError object (possibly using the provided extend_axis function).
  2. Sometimes (in the dataset that I'm testing, but it never happened in my own data), after splitting the dataset there's not enough data to run some slightly more sophisticated analysis (i.e. Loess or KernelDensity) so it gives an error. Should I add some try/catch to just give NaNs for the subjects not having enough data to run the analysis or should I just let it error and let the user figure it out? My feeling is that if each subject doesn't have enough data, this approach is ill-advised anyway and one should try hierarchical models.

@mkborregaard
Copy link
Member

  1. Agree, leave as is.
  2. Let it error and let the user figure it out. Or alternatively catch the exception, write a suitable error describing what went wrong, and re-throw the exception.

@mkborregaard
Copy link
Member

Other than the name change of get_summary (and possibly the suggested error handling), I am ready to merge if you still think it is ready  @piever . Thanks for the great and focused work on this!

@piever
Copy link
Member Author

piever commented Jan 17, 2017

I have a very minor restructuring of the code (to allow the user to give a axis to get_summary) that I can push later today and I think it's good to go. What name do you propose for get_summary ? Maybe get_groupederror?

@mkborregaard
Copy link
Member

Yes that could work

@piever
Copy link
Member Author

piever commented Jan 17, 2017

Changed the name of get_summary, I think it's ready now.

@mkborregaard
Copy link
Member

Nice, great work!

@mkborregaard mkborregaard merged commit b420720 into JuliaPlots:master Jan 18, 2017
tbreloff added a commit to JuliaPlots/ExamplePlots.jl that referenced this pull request Jan 18, 2017
@piever piever deleted the pull-request/1e6921bf branch September 5, 2017 09:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants