[WIP] Add population analysis with error bars across population for continuous plots #30
I am happy to have a look at this as soon as I have a moment, but consider this: would it not be nice to think of the errors as simply another series to add to a plot, or something that could be lifted from a Fit object? I have wanted to do a PR on that for a long time, so I really hope we can consider this possibility before moving StatPlots off in the direction of this PR. I'll open an issue this week detailing my ideas for support of statistical objects and error bars in StatPlots, so that everyone can review it. |
That's why I put the "WIP" label: so we can consider all the different options. I thought the best way to start a discussion was to propose one way and then see what other people think. I do believe that someone who knows Plots.jl better than I do may find an even easier way of doing this, and I'm looking forward to seeing it :)
Hi @piever, I haven't forgotten I promised you feedback on this :-) It is a nice job on the code, but I have been racking my brain on how these recipes could fall naturally into the Plots ecosystem. Here are some things to consider, I am happy to discuss any of this:
Actually, 3 seems like the best solution (also, ideally this wouldn't be the only case where a model has some uncertainty and one wants to visualise it with a shade: simple linear regression with one predictor is another obvious example). I'll be playing a bit with recipes to see if I can get it to work, but it should actually lead to simpler code than what I wrote so far (as I basically had to reimplement group and things like that).
Exactly! I have in fact recently opened a PR on GLM to allow |
The only trick seems to be that one has to pass the analysis function as a keyword argument. As far as I understand, what you tried on gitter can't really work:
because Plots.jl will try to split the second argument (i.e. kde(quakes[:Mag])) according to quakes[:Deep], but that array no longer has the same length as the data: it is whatever kde outputs. What should be split is the dataset itself. So one should try a specific series type (called, for example, population analysis) with a signature like
and then I believe the first thing to happen will be the splitting of the column quakes[:Mag], after which you can run whatever analysis you want, based on the keyword analysisfunction. With this (plus generalizing group to work with more than one label) I think one could get the same functionality as I have now, but in a smarter way. I'll try it as soon as I have enough time and will keep you posted.
No, that's right, the Instead, I think most users would find it intuitive to have a The challenge with specifying your suggested error function correctly is that both the mean line and the errors are created from the subsetting, i.e. the values for the line itself are not in the dataset, in contrast with Plots' dichotomy between values and errors. That is why I think it could be nice to have two steps (non-working pseudocode):
by_Sx = groupapply(cumulative, school, on = :Mach, by = :Sx, summarize = sem) # no problem with having the function first here
plot(by_Sx, error = :ribbon)
So in principle what we have is a function for a statistical analysis (which I have called
reg = lm(y ~ x + x^2, df)
plot(reg, error = :ribbon)
Actually I guess I could easily get that pseudocode to work with code that I have already. I still have one doubt however as to what should be the type of outcome from groupapply. Option 1)
Disadvantage: wouldn't work with groupedbar! But maybe one could also fix groupedbar to work with group. Option 2)
or maybe one could add a type recipe for that. What do you think? |
Extra issue with giving a dataframe: shadederror doesn't play with grouping as well as say ribbon or err keywords, so that this:
wouldn't work whereas this would (but not in plotlyjs):
If you know how to make shadederror behave just like ribbon, that'd be nice :) Also, I have a technical question. If we go for the specialized type, say by_Sx::GroupedError then one would want |
Thanks for all the considerations - I'll answer on Tuesday |
If you create a user recipe for a new type GroupedError, then you can do all that logic in the recipe for GroupedError, i.e.:
bar(grperr) # return GroupedBar object from recipe
User recipes are processed recursively, so you can return another plottable object, with whatever logic you want. Just check the seriestype inside the recipe and change it however you want.
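To make the suggestion concrete, here is a minimal sketch of such a user recipe, assuming the RecipesBase package is available. The `GroupedError` struct and its field names are hypothetical, not what the PR defines, and `plotattributes` is the name current RecipesBase versions use for the attribute dictionary. The recipe checks the requested seriestype and attaches the error either as a ribbon or as y error bars.

```julia
using RecipesBase

# Hypothetical container for a summarized analysis: x values, central
# estimate and error. The real type in the PR may look different.
struct GroupedError
    x::Vector{Float64}
    y::Vector{Float64}
    err::Vector{Float64}
end

@recipe function f(g::GroupedError)
    # Look at the seriestype the user asked for (default to a line plot).
    st = get(plotattributes, :seriestype, :path)
    if st == :path
        ribbon := g.err      # shaded band around the line
    else
        yerror := g.err      # error bars for scatter, bar, ...
    end
    g.x, g.y                 # the data handed on to the plotting pipeline
end
```

With Plots loaded, `plot(g)` would then draw a line with a ribbon and `scatter(g)` points with error bars, without any extra keywords on the user side.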
And in general I much prefer option #3 for something like this: create some custom object for complex statistics on a dataset, then make a user recipe for that object. The functionality isn't generic enough to be adding all these keywords. (IMO)
…On Sun, Dec 18, 2016 at 6:27 AM Michael Krabbe Borregaard < ***@***.***> wrote:
Thanks for all the considerations - I'll answer on Tuesday
Glad to hear you agree @tbreloff :-) @piever about your first question, I meant number 2 - create a specialized object, then a type recipe for that (i.e. call it using The reason I hesitated instead of responding immediately - was the thought "could this be done with a type recipe for GroupedDataFrame that split each group into series?". You could |
Thanks for the feedback. I changed it to work in the most general scenario (with mean and sem as the standard analysis, but this can be changed). I've added a recipe that takes this grouped error type and, depending on the seriestype, draws a line plot with shadederror, a scatter with errorbar or a groupedbar with errorbar. The only thing that remains to be decided is what type this groupederror should be. I'm of the opinion that the most sensible choice would be a dataframe (either grouped or with a column indicating group membership), and I could add a small function that reshapes a dataframe so that it can work with groupedbar.
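For readers following along, the split-then-summarize idea described above can be sketched in plain Julia with only the Statistics standard library. This is not the PR's code: `summarize_across`, its arguments and the toy data are hypothetical names for illustration. The analysis is run once per population unit within each group, and the per-unit results are then reduced with a pair of summary functions, (mean, sem) by default.

```julia
using Statistics

# Standard error of the mean.
sem(v) = std(v) / sqrt(length(v))

# Hypothetical helper: for every group, run `analysis` on each population
# unit (e.g. each school) and summarize the per-unit results.
function summarize_across(values, across, group;
                          analysis = mean, summarize = (mean, sem))
    center, spread = summarize
    out = Dict{eltype(group), Tuple{Float64, Float64}}()
    for g in unique(group)
        in_group = group .== g
        per_unit = [analysis(values[in_group .& (across .== u)])
                    for u in unique(across[in_group])]
        out[g] = (center(per_unit), spread(per_unit))
    end
    return out
end

# Toy data: two schools, two groups.
score  = [1.0, 3.0, 5.0, 7.0, 2.0, 4.0, 6.0, 8.0]
school = [1, 1, 2, 2, 1, 1, 2, 2]
sex    = ["F", "F", "F", "F", "M", "M", "M", "M"]

stats = summarize_across(score, school, sex)
```

Each entry of `stats` holds the central value and the error for one group, which is the pair a plotting recipe would need for a line with a shaded band or a bar with an error bar.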
Can you update the PR to show your changes? |
Still need to fix this decision of a type that works easily both for line plots and groupedbar, but I should be able to push by today. |
Just make your own. |
This one should be working, and I think I've incorporated more or less all the feedback. There is a groupapply function that splits the data according to the keyword "group", then applies "summarize" (the default is (mean, sem), but you can pass any pair of functions). There is also the option shared_xaxis, which controls whether all the split data share a common x axis (required for groupedbar but not recommended for shaded error). Also, for the local regression I'm using Loess (like Gadfly does) instead of KernelEstimator because it seems better maintained. Example use:
Keywords for loess or kerneldensity can be given to groupapply:
The bar plot
Only problem left (but independent of this PR): with plotlyjs() groupedbar doesn't work well with errorbar, I'll open an issue for that. |
Great to hear! I think the across parameter should not be so complicated. across can take a variable name and act as it does now. If not specified it works over all observations. |
That'd be ideal, but the main issue is that for some of the functions it doesn't make sense. For example, if the x axis is continuous you can't draw the curve from a single observation, so you can't really split by observation (hence the suggestion to use bootstrap). For density and cumulative, in the discrete case, I guess you can do it individually per observation (getting the density and cumulative of delta functions): the average would be correct, and the sem would also be informative. Still, in general I think there could be analyses where splitting by observation and taking the mean has little to do with running the analysis on all the observations together (or gives the same result but far less efficiently), so I would be in favor of giving the option of not splitting and simply not getting the error. I'm trying to come up with a way of letting the user do that which is not too error-prone.
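As a rough illustration of the bootstrap alternative mentioned above: instead of splitting by observation, one can resample the observations with replacement, rerun the analysis on each resample, and take the spread of the results as the error. The sketch below uses only the standard library; `bootstrap_error` and its keywords are hypothetical names and not the implementation in this PR.

```julia
using Random, Statistics

# Hypothetical helper: bootstrap estimate and error of `analysis(data)`.
function bootstrap_error(analysis, data; nboot = 1000, rng = Random.default_rng())
    n = length(data)
    # Resample n observations with replacement and rerun the analysis each time.
    estimates = [analysis(data[rand(rng, 1:n, n)]) for _ in 1:nboot]
    return mean(estimates), std(estimates)
end

rng = MersenneTwister(1)
data = randn(rng, 200)

# Works for analyses that make no sense on a single observation, e.g. a median.
center, err = bootstrap_error(median, data; rng = rng)
```

The same pattern would apply when the analysis returns a whole curve on a fixed x grid, with the mean and standard deviation taken pointwise.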
Thinking a bit more: given that we plan on adding built-in errors (see linreg, but also locreg), it would maybe make more sense to have a more general keyword: It could have as options: What do you think?
@recipe f(k::KernelDensity.UnivariateKDE) = k.x, k.density
@recipe f(k::KernelDensity.BivariateKDE) = k.x, k.y, k.density

@shorthands cdensity

export groupapply
export get_summary
Do we need to export get_summary?
I don't have strong opinions on that. The actual plotting is done as in the README:
grp_error = groupapply(:cumulative, school, :MAch; compute_error = (:across,:School), group = :Sx)
plot(grp_error, line = :path)
and in practice I never really use it. Is it bad to export it?
In this case it doesn't seem necessary, but I admit I don't actually know when you'd call it.
It was proposed by @mkborregaard. The idea is that in the normal scenario, you would have a dataframe that you'd split according to a column and then summarize the split data (for example getting mean and sem). The use case would be somebody who has already obtained a grouped dataframe by some other means and wants to get the summary, which is not possible with groupapply alone.
The pipeline is: groupapply calls get_summary, which calls the analysis function on the subdataframes. groupapply, get_summary and the built-in analysis functions have docstrings. The docstrings of groupapply refer to get_summary, so maybe the user would expect it to be exported.
Actually, on second thought I'm in favour of keeping it also for performance reasons: for large datasets it may be inconvenient to have to group the data every time. In the implementation of compute_error = :bootstrap I see a sizable performance improvement by working with GroupedDataFrames directly rather than putting everything in a big DataFrame and then applying groupby.
Yes exactly, that was the idea. I think it should be renamed to something more specific: "get_summary" is too generic a name to put in the user's workspace for a very specific function used for one type of plot only.
@mkborregaard I've managed to add everything I wanted. Now |
Slight off-topic, but I'm not sure where to ask: I'm making a jupyter notebook with many examples from this library (mainly for my lab-mates). In particular I'll show some examples similar to the README but in more detail and I'll also show how to use the functionality of |
Thanks for all your work on this - I am at a conference in the US currently but will look it over as soon as I get a chance. |
ExamplePlots would probably be the right home for something like this. |
Ok, I've added a PR there. |
@mkborregaard I've been testing the framework more extensively and 2 minor issues came up:
|
|
Other than the name change of |
I have a very minor restructuring of the code (to allow the user to give an axis to get_summary) that I can push later today, and then I think it's good to go. What name do you propose for get_summary? Maybe get_groupederror?
Yes that could work |
Changed the name of get_summary, I think it's ready now. |
Nice, great work! |
Example of JuliaPlots/StatsPlots.jl#30, to be merged after that
After the discussion in #27 I've tried to come up with a consistent way to insert error bars for continuous plots representing standard error across a population. I've chosen the dataset
school = dataset("mlmRev", "Hsb82")
where a number of schools are sampled and in each school one can get a distribution of values. It is then natural to run a given analysis per school (e.g. the cumulative of a value, a density plot, or a regression of y on x, possibly nonparametric) and plot the mean of this analysis across schools with a shaded standard error across schools. I've added the possibility of doing that (for shaded error plots, bar plots and scatter plots) by giving the analysis function, the dataset and the arguments necessary for the analysis function. I've added the possibility to split data with the keyword group. The keyword "across" specifies which variable represents the population over which to compute the s.e. The other keywords are passed on to the plot. The legend is computed automatically, but can be modified (a vector-valued keyword will be cycled across the different split data). I've also added a few functions that could be useful to run this analysis (one that computes the density, one for the cumulative and one for locally-linear regression), both for a continuous and a categorical x axis.
Example:
The same type of analysis can be done with bar plots (or also scatter plots):
I've added a different package (KernelEstimator) because it allows you to compute these statistical analyses while specifying which points to use on the x-axis (when I split the data across the population I need the same x-values to compare y-values and get a standard error), and it has an implementation of nonparametric regression, which I couldn't find in KernelDensity. As I mentioned in #27, the alternative is to not require that: should different subjects have different x-values, one could just choose common x-values at the end and interpolate what the values of the various subjects would be at those common x-values. I'm working on that right now.
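The interpolation alternative could look roughly like the sketch below, which uses only the Statistics standard library. It is an illustration under stated assumptions rather than the code I am working on: `interp` and `mean_and_sem_on_grid` are hypothetical names, x-values are assumed sorted and strictly increasing, and values outside a subject's range are clamped to the nearest endpoint.

```julia
using Statistics

# Standard error of the mean.
sem(v) = std(v) / sqrt(length(v))

# Linear interpolation of the curve (xs, ys) at x.
# Assumes xs is sorted and strictly increasing; clamps outside the range.
function interp(xs, ys, x)
    x <= xs[1] && return float(ys[1])
    x >= xs[end] && return float(ys[end])
    i = searchsortedlast(xs, x)            # xs[i] <= x < xs[i + 1]
    t = (x - xs[i]) / (xs[i + 1] - xs[i])
    return (1 - t) * ys[i] + t * ys[i + 1]
end

# Evaluate every subject's curve on the common grid, then take the
# mean and s.e.m. across subjects at each grid point.
function mean_and_sem_on_grid(curves, grid)
    vals = [interp(c[1], c[2], x) for x in grid, c in curves]  # grid × subjects
    m = vec(mean(vals, dims = 2))
    s = [sem(vals[i, :]) for i in 1:length(grid)]
    return m, s
end

# Two subjects measured at different x-values.
curves = [([0.0, 2.0], [0.0, 2.0]),
          ([0.0, 1.0, 2.0], [0.0, 2.0, 4.0])]
grid = [0.0, 1.0, 2.0]

m, s = mean_and_sem_on_grid(curves, grid)
```

The pair `(m, s)` is what a shaded-error line plot needs; the choice of grid (for instance the union or a regular range of the observed x-values) is left open here.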
I'd be very happy to get suggestions on what the best syntax is and what the best way is to incorporate this in the Plots.jl ecosystem: my intuition is that, if one adds the interpolation option, it should be reasonably easy to get this kind of error plot using Plots.jl features in a possibly smarter way (e.g. a keyword that does the same as "group" to split the data but, instead of plotting each split separately, finds the mean and s.e.m. and plots those). However, I don't think I understand the Plots.jl machinery well enough to implement that kind of solution, so I'd be happy to get directions/help/code.