-
-
Notifications
You must be signed in to change notification settings - Fork 7.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Violin Plots #2996
Violin Plots #2996
Conversation
Wow, awesome (again)! @phobson Can you take a look at this (as who I have labeled as the current box-plot expert)? From skimming the code it looks like you dealt with the scipy dependency by just ripping out the bits you needed? |
Yep, |
I had never even heard of violin plots. Very interesting...good stuff! One thing I noticed from your example plot is that the top of the violin has some space between the error bar tick mark and the distribution while the bottom does not. |
max_val = kde.dataset.max() | ||
mean = np.mean(kde.dataset) | ||
median = np.median(kde.dataset) | ||
coords = np.arange(min_val, max_val, (max_val - min_val)/points) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right. Looks like an off by "1" error. I think changing this line to
coords = np.linspace(min_val, max_val, points)
would probably do the trick.
I haven't gone through the code, but just wanted to say: 👏! |
I glanced at the code last night and everything seemed in order and well thought out. Tests, docs, and examples look good. Since Tom and Mike seem excited about this, it's indoibtedly more appropriate for matplotlib than I initially thought. That said, here are some things I think we should consider: My main concern is that violin plots are currently available in both seaborn and statsmodels. While seaborn does required scipy, it is pip-installable from source on a basic Windows set up. So that barrier to entry is pretty low. I'm concerned about fragmentation here, both with the plotting code and the KDE implementation. Off the top of my head, this means that a new python user will be able to choose between scipy, sklearn, statsmodels, and this stripped down version in matplotlib for KDE. While I'm sure the KDE implementation is fine, what if the scipy team finds a bug? Keeping on top of scipy's implementation and tests is something we'll need to be diligent about. Both statmodels and pandas now have a policy of directing PRs of more advanced visualizations to seaborn and python-ggplot. That might not apply to us, but I think that policy is conducive to making a more seamless and cohesive environment for new users. TL;DR |
re Would it be possible to re-factor this the way that boxplot is now? ie separate the code that draws the pretty lines and the code that turns raw data -> something that the drawing code can deal with. It looks like L6888 - L6902 (https://github.com/matplotlib/matplotlib/pull/2996/files#diff-7e79bc5b4cdd21de353697e9ada248b7R6888) could be extracted into a separate function which generates a list/dict of the parameters that the This is another facet of a simmering discussion about how much computation |
How squashed should it be? I have it at 28 by collapsing sequential commits from the same user, but I could make it into one big commit if that would be better. |
@phobson I went with 22 commits after squashing, let me know if I should do them all. Even written from scratch, the kde code is another thing that could require maintenance down the road, but it seems like an alright fit in mlab (and of course, we needed it for violin plots). Maybe it will end up in a dependency down the road. @tacaswell I took a stab at re-factoring violinplot, as you suggested. It uses axes.vplot to produce the plot, cbook.vp_stats to arrange the data, and axes.violinplot to put the two together. I dealt with the dependency between the data arranging code and mlab by using a method parameter in cbook.vp_stats. I'm wondering if vp_stats could be generalized to do what bxp_stats does too, given a different method, but I didn't want to make this PR any bigger. |
That is exactly what I had in mind. That looks like a reasonable number of commits, squash much more and you would start dropping people from the work log which is not cool. I think |
I would do Does it make sense to allow people to specify a KDE computation method/function/lambda, provided it returns suitable output? The default ( |
I'm happy with those names. Specifying a method is supported through a parameter to Also, I separated some code that converts 1d objects into 2d objects from |
Do watch out with np.atleast_2d(). IIRC, older version of numpy had that On Tue, Apr 22, 2014 at 11:04 PM, Per Parker notifications@github.comwrote:
|
@solvents Yeah. Well, I was thinking that users could supply a lambda for function that called say, statsmodels KDE to compute the values. But right when I went to bed last night I realized that's a pretty dumb suggestion. As long as the "schema" of the output of |
Ah, I didn't know atleast_2d could do that. Will leave it as is. My last thought on the current specification for the returned dictionary is by adding a key called, say, extra_lines, it might be possible to push |
Renamed methods, updated whats_new and CHANGELOG |
Why did you merge master into your branch? Our typical procedure is to rebase feature branches on top of current master. |
Ah, alright. Should I just drop the merge commit, or do I do the rebase? |
Unless there is a conflict you need to address, yes. |
There were a couple of conflicts but they were trivial, not sure if that counts. |
For the github website merge to work the merge needs to be clean. That probably means you need to rebase (which if the conflicts are trivial and I assume in CHANGELOG/whats_new.rst) will be easy. |
OK, sorry about that. Used a rebase this time. |
No worries, it's your first PR. Also, please don't be discouraged by the amount of feed back you are getting, it means people are excited about your work ;). |
@solvents Could you rebase this again? I hope your exams went well. |
Rebased. I'm not sure if the Travis results are related to anything we did. Exams went great, thanks for asking! |
if vert: | ||
fill = self.fill_betweenx | ||
rlines = self.hlines | ||
blines = self.vlines |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do 'r' and 'b' stand for? I think this code is clear, I am just curious.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think they are supposed to be right-angle and base... perp_lines and par_lines would probably make more sense.
Sorry, miss-typed @JanSchulz |
I don't think we have any API consideration. As we do our own API we fiddle the data until it fits matplotlib. As long as it has the same signature as boxplots we are probably fine. CC: @has2k1 to correct me :-) One requirement we have for faceting (and which I think our code is currently broken) is that we have a way to specify "gaps", meaning that a violin (or boxplot/bar) is not present in one facet/axis (but is in the other facets/axes). So, it is possible to specify a "gap" in the violins? Either by saying |
…unction to accept new KDE. Added basic violinplot demo in examples
…density is now referred to as gaussian_kde and exists as a class in mlab. Fixed list comp position bug and updated examples
Fixed several style issues.
Added comments for test cases referencing the origins.
Conflicts: lib/matplotlib/tests/test_mlab.py
Some course related tests were added. These are removed later.
…linplot and updated demo
…st_mlab.py Fixed a syntax error in python 3 and fixed up some violinplot tests. Fixed some style problems. Removed course-related test from list of tests.
…uncated. Updated test images and reran boilerplate.py
…lot data for drawing), axes.violin (draws pre-arranged violin plot data), and axes.violinplot (uses cbook.violin_stats to draw violin plots via axes.violin) Updated whats_new.rst. Updated CHANGELOG.
Gaps can be specified by using the positions argument, the way you've written it there. I suspect having an empty data column would raise an error. |
I think the computation separation will work for python-ggplot. Time permitting, I will go a head and implement it on top of this branch just to make sure. |
This PR implements violin plots.
Violin plots can be used to represent the distribution of sample data. They are similar to box plots, but use a kernel density estimation function to present a smooth approximation of the data sample used.
We wrote this in response to issue #2873 as part of a software engineering course project (the same project as PR #2961's). It's designed to have a call signature similar to boxplot, but some customization options remain to be added.