Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Violin Plots #2996

Merged
merged 22 commits into from May 26, 2014
Merged

Violin Plots #2996

merged 22 commits into from May 26, 2014

Conversation

solvents
Copy link
Contributor

This PR implements violin plots.

Violin plots can be used to represent the distribution of sample data. They are similar to box plots, but use a kernel density estimation function to present a smooth approximation of the data sample used.

We wrote this in response to issue #2873 as part of a software engineering course project (the same project as PR #2961's). It's designed to have a call signature similar to boxplot, but some customization options remain to be added.

violinplot_vert_showextrema

@tacaswell tacaswell added this to the v1.4.0 milestone Apr 20, 2014
@tacaswell
Copy link
Member

Wow, awesome (again)!

@phobson Can you take a look at this (as who I have labeled as the current box-plot expert)?

From skimming the code it looks like you dealt with the scipy dependency by just ripping out the bits you needed?

@solvents
Copy link
Contributor Author

Yep,
We had a couple members talking to some stats experts about how they might implement kde from the ground up, but I believe they decided it was either too much work or would be sub-par for the amount of time we had to work on it.

@cimarronm
Copy link
Contributor

I had never even heard of violin plots. Very interesting...good stuff!

One thing I noticed from your example plot is that the top of the violin has some space between the error bar tick mark and the distribution while the bottom does not.

max_val = kde.dataset.max()
mean = np.mean(kde.dataset)
median = np.median(kde.dataset)
coords = np.arange(min_val, max_val, (max_val - min_val)/points)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. Looks like an off by "1" error. I think changing this line to

coords = np.linspace(min_val, max_val, points)

would probably do the trick.

@mdboom
Copy link
Member

mdboom commented Apr 20, 2014

I haven't gone through the code, but just wanted to say: 👏!

@phobson
Copy link
Member

phobson commented Apr 21, 2014

I glanced at the code last night and everything seemed in order and well thought out. Tests, docs, and examples look good. 

Since Tom and Mike seem excited about this, it's indoibtedly more appropriate for matplotlib than I initially thought. 

That said, here are some things I think we should consider:

My main concern is that violin plots are currently available in both seaborn and statsmodels. While seaborn does required scipy, it is pip-installable from source on a basic Windows set up. So that barrier to entry is pretty low. I'm concerned about fragmentation here, both with the plotting code and the KDE implementation. Off the top of my head, this means that a new python user will be able to choose between scipy, sklearn, statsmodels, and this stripped down version in matplotlib for KDE. 

While I'm sure the KDE implementation is fine, what if the scipy team finds a bug? Keeping on top of scipy's implementation and tests is something we'll need to be diligent about. 

Both statmodels and pandas now have a policy of directing PRs of more advanced visualizations to seaborn and python-ggplot. That might not apply to us, but I think that policy is conducive to making a more seamless and cohesive environment for new users. 

TL;DR
This looks like a great PR. I'd squash the commits if possible before merging. 

@tacaswell
Copy link
Member

re seaborn and python-ggplot (and @olgabot prettyplot), I think we need to work with them to start pulling more of the 'basic' advanced plots back down into matplotlib.

Would it be possible to re-factor this the way that boxplot is now? ie separate the code that draws the pretty lines and the code that turns raw data -> something that the drawing code can deal with.

It looks like L6888 - L6902 (https://github.com/matplotlib/matplotlib/pull/2996/files#diff-7e79bc5b4cdd21de353697e9ada248b7R6888) could be extracted into a separate function which generates a list/dict of the parameters that the fill call on L6904 takes + the mean/max/min... scalars used farther down. This would then give the higher level libraries (seaborn/ggplot) a more powerful building block and hopefully cut down on duplicated effort.

This is another facet of a simmering discussion about how much computation matplotlib should be doing in the plotting functions.

@solvents
Copy link
Contributor Author

How squashed should it be? I have it at 28 by collapsing sequential commits from the same user, but I could make it into one big commit if that would be better.

@solvents
Copy link
Contributor Author

@phobson I went with 22 commits after squashing, let me know if I should do them all. Even written from scratch, the kde code is another thing that could require maintenance down the road, but it seems like an alright fit in mlab (and of course, we needed it for violin plots). Maybe it will end up in a dependency down the road.

@tacaswell I took a stab at re-factoring violinplot, as you suggested. It uses axes.vplot to produce the plot, cbook.vp_stats to arrange the data, and axes.violinplot to put the two together. I dealt with the dependency between the data arranging code and mlab by using a method parameter in cbook.vp_stats. I'm wondering if vp_stats could be generalized to do what bxp_stats does too, given a different method, but I didn't want to make this PR any bigger.

@tacaswell
Copy link
Member

That is exactly what I had in mind.

That looks like a reasonable number of commits, squash much more and you would start dropping people from the work log which is not cool.

I think vplot is a bad name (but do not have a better suggestion).

@phobson
Copy link
Member

phobson commented Apr 23, 2014

I would do violin_stats, violin, violinplot.

Does it make sense to allow people to specify a KDE computation method/function/lambda, provided it returns suitable output? The default (None) would of course fallback to the KDE implementation in this PR, but it might be nice to allow someone complete control over the kernal shape and size and just whichever implementation they prefer (e.g., sklearn, scipy, statsmodels).

@solvents
Copy link
Contributor Author

I'm happy with those names.

Specifying a method is supported through a parameter to vp_stats (which I'll rename violin_stats). Are you thinking as a parameter to violinplot as well?

Also, I separated some code that converts 1d objects into 2d objects from bxp_stats as it was useful in violin_stats as well, but now I'm wondering if np.atleast_2d would work instead?

@WeatherGod
Copy link
Member

Do watch out with np.atleast_2d(). IIRC, older version of numpy had that
function turning masked arrays into regular arrays.

On Tue, Apr 22, 2014 at 11:04 PM, Per Parker notifications@github.comwrote:

I'm happy with those names.

Specifying a method is supported through a parameter to vp_stats (which
I'll rename violin_stats). Are you thinking as a parameter to violinplotas well?

Also, I separated some code that converts 1d objects into 2d objects from
bxp_stats as it was useful in violin_stats as well, but now I'm wondering
if np.atleast_2d would work instead?


Reply to this email directly or view it on GitHubhttps://github.com//pull/2996#issuecomment-41119982
.

@phobson
Copy link
Member

phobson commented Apr 23, 2014

@solvents Yeah. Well, I was thinking that users could supply a lambda for function that called say, statsmodels KDE to compute the values. But right when I went to bed last night I realized that's a pretty dumb suggestion. As long as the "schema" of the output of violin_stats is well documented, people can generate that themselves however they wish. Problem solved.

@solvents
Copy link
Contributor Author

Ah, I didn't know atleast_2d could do that. Will leave it as is.

My last thought on the current specification for the returned dictionary is by adding a key called, say, extra_lines, it might be possible to push violin into rendering some more advanced plots (eg. bean plots). It's easy enough to add optional parameters to the dictionary though so it could be left as a possible extension.

@solvents
Copy link
Contributor Author

Renamed methods, updated whats_new and CHANGELOG

@tacaswell
Copy link
Member

Why did you merge master into your branch? Our typical procedure is to rebase feature branches on top of current master.

@solvents
Copy link
Contributor Author

Ah, alright. Should I just drop the merge commit, or do I do the rebase?

@tacaswell
Copy link
Member

Unless there is a conflict you need to address, yes.

@solvents
Copy link
Contributor Author

There were a couple of conflicts but they were trivial, not sure if that counts.

@tacaswell
Copy link
Member

For the github website merge to work the merge needs to be clean. That probably means you need to rebase (which if the conflicts are trivial and I assume in CHANGELOG/whats_new.rst) will be easy.

@solvents
Copy link
Contributor Author

OK, sorry about that. Used a rebase this time.

@tacaswell
Copy link
Member

No worries, it's your first PR.

Also, please don't be discouraged by the amount of feed back you are getting, it means people are excited about your work ;).

@tacaswell
Copy link
Member

@solvents Could you rebase this again?

I hope your exams went well.

@solvents
Copy link
Contributor Author

Rebased. I'm not sure if the Travis results are related to anything we did.

Exams went great, thanks for asking!

if vert:
fill = self.fill_betweenx
rlines = self.hlines
blines = self.vlines
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do 'r' and 'b' stand for? I think this code is clear, I am just curious.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think they are supposed to be right-angle and base... perp_lines and par_lines would probably make more sense.

@tacaswell
Copy link
Member

Sorry, miss-typed @JanSchulz

@jankatins
Copy link
Contributor

I don't think we have any API consideration. As we do our own API we fiddle the data until it fits matplotlib. As long as it has the same signature as boxplots we are probably fine.

CC: @has2k1 to correct me :-)

One requirement we have for faceting (and which I think our code is currently broken) is that we have a way to specify "gaps", meaning that a violin (or boxplot/bar) is not present in one facet/axis (but is in the other facets/axes).

So, it is possible to specify a "gap" in the violins? Either by saying position=[1,2,3,5,6] (gap at 4) or by specifying position=[1,2,3,4,5,6], data=[[...],[...],[...],[],[...],[...] (empty list at 4) or whatever you feel fine :-))

solvents and others added 22 commits May 24, 2014 16:47
…unction to accept new KDE.

Added basic violinplot demo in examples
…density is now referred to as gaussian_kde and exists as a class in mlab.

Fixed list comp position bug and updated examples
Fixed several style issues.
Added comments for test cases referencing the origins.
Conflicts:
	lib/matplotlib/tests/test_mlab.py
Some course related tests were added. These are removed later.
…st_mlab.py

Fixed a syntax error in python 3 and fixed up some violinplot tests.
Fixed some style problems.
Removed course-related test from list of tests.
…uncated.

Updated test images and reran boilerplate.py
…lot data for drawing), axes.violin (draws pre-arranged violin plot data), and axes.violinplot (uses cbook.violin_stats to draw violin plots via axes.violin)

Updated whats_new.rst.
Updated CHANGELOG.
@solvents
Copy link
Contributor Author

Gaps can be specified by using the positions argument, the way you've written it there. I suspect having an empty data column would raise an error.
@tacaswell I think the last commit addresses the changes you suggested.

@has2k1
Copy link
Contributor

has2k1 commented May 24, 2014

I think the computation separation will work for python-ggplot. Time permitting, I will go a head and implement it on top of this branch just to make sure.

tacaswell added a commit that referenced this pull request May 26, 2014
@tacaswell tacaswell merged commit 687286a into matplotlib:master May 26, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet