Boxplot with zero IQR sets whiskers to max and min and leaves no outliers #5331

Closed
lmcinnes opened this issue Oct 27, 2015 · 9 comments
@lmcinnes

I believe the behaviour in matplotlib is deliberate (given lines 2018-2019 of cbook). However, I don't believe this is the expected behaviour for a boxplot (for example, R leaves the whiskers at the median and draws outlier points). I was hoping for clarification or a reference on the choice taken in matplotlib.

@tacaswell
Member

@phobson

Do you have a reference that this is not the correct behavior?

@tacaswell tacaswell added this to the unassigned milestone Oct 28, 2015
@phobson
Member

phobson commented Oct 28, 2015

For reference here's a link to the section of cbook in question:
https://github.com/matplotlib/matplotlib/blob/master/lib/matplotlib/cbook.py#L2017

@phobson
Member

phobson commented Oct 28, 2015

Looks like the (my) decision to do so was fairly intentional
7cf8b35#diff-81b1adafd906d2736868056d906c0a29R1976

@phobson
Member

phobson commented Oct 29, 2015

I'm looking through all my references (including the R docs) and I can't find any guidance on this specific scenario.

At the time, I can only imagine myself thinking that:

  1. It's pretty messed up data where Q1 = Q2 = Q3
  2. defining (potential) outliers when IQR is zero might be meaningless.

I stand by the first thought. But now I'm on the fence about the second. The omission of guidance that you should do something different in an edge case like this makes me think that we shouldn't.

However, I do think that a horizontal line with whiskers looks better than a horizontal line with dots floating around. But matters of the heart matter little in matters such as these.

@jc-healy

Since boxplots have been tweaked and modified many times over the years, I won't say your decision is wrong, but it's definitely not one I've run into before. The relevant text from this paper by Cox (http://www.stata-journal.com/sjpdf.html?articlenum=gr0039) is:

... Lines, often called whiskers, are drawn to span all data points within 1.5 IQR of
the nearer quartile. That is, one whisker extends to include all data points within
1.5 IQR of the upper quartile and stops at the largest such value, while the other
whisker extends to include all data within 1.5 IQR of the lower quartile and stops
at the smallest such value. Tukey called the outer limits of the whiskers adjacent
values. The whiskers also explain his alternative term, box-and-whiskers plots.
Note that either whisker could be of zero length. In practice, that will occur only
with very small datasets or heavily tied data.
2. Any data points beyond the whiskers are shown individually and often labeled
informatively.

He notes the zero-length whisker possibility but doesn't worry about it. My assumption from this, other general reading, and working with R over the years is that the expected behaviour for a box-and-whisker plot is that the whiskers are drawn out to 1.5 IQR. As such, the choice of 1.5 IQR unless the IQR is zero, in which case the whiskers go to the 0th and 100th percentiles, throws me off a bit. I feel that it's a bit misleading, especially in a plot with many other features that have a nonzero IQR.

That being said, there is no canonical box-and-whisker plot. I've seen a number of alternate variations in the literature, such as drawing whiskers at 2% and 98%. But I feel that the interpretation truest to the original description by Tukey (who himself played with variations) would be to have zero-length whiskers, with any remaining points that differ from the median drawn as individual points.

In the end it's definitely your decision. I'd at the very least recommend documenting this behaviour very explicitly in the boxplot documentation and perhaps even making it an option.
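
To make the two conventions concrete, here is a minimal pure-Python sketch. It is not matplotlib's implementation: the helper names `percentile` and `tukey_whiskers` and the `minmax_if_zero_iqr` flag are hypothetical, and the quartiles use simple linear interpolation, which may differ from the method a given library uses.

```python
def percentile(sorted_vals, p):
    """Linear-interpolation percentile (p in [0, 100]) of pre-sorted data."""
    k = (len(sorted_vals) - 1) * p / 100.0
    lo = int(k)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (k - lo)


def tukey_whiskers(data, whis=1.5, minmax_if_zero_iqr=False):
    """Return (whislo, whishi, fliers) for one dataset.

    minmax_if_zero_iqr is a hypothetical flag: when True and IQR == 0,
    the whiskers span the whole data range (the behaviour questioned in
    this issue) instead of collapsing onto the box.
    """
    vals = sorted(data)
    q1, q3 = percentile(vals, 25), percentile(vals, 75)
    iqr = q3 - q1
    if iqr == 0 and minmax_if_zero_iqr:
        lo_fence, hi_fence = vals[0], vals[-1]
    else:
        lo_fence, hi_fence = q1 - whis * iqr, q3 + whis * iqr
    # Adjacent values: the most extreme data points still inside the fences.
    whislo = min(v for v in vals if v >= lo_fence)
    whishi = max(v for v in vals if v <= hi_fence)
    # Everything beyond the whiskers is drawn as an individual point.
    fliers = [v for v in vals if v < whislo or v > whishi]
    return whislo, whishi, fliers


# Heavily tied data: Q1 == Q3 == 5, so IQR == 0.
data = [5] * 8 + [1, 9]
tukey_result = tukey_whiskers(data)
minmax_result = tukey_whiskers(data, minmax_if_zero_iqr=True)
```

With this data, the Tukey-style call gives zero-length whiskers at 5 and reports 1 and 9 as individual points, while the min/max variant stretches the whiskers to 1 and 9 and reports no outliers.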

@phobson
Member

phobson commented Oct 29, 2015

But I feel that the interpretation which is truest to the original description by Tukey (who himself played with variations) would be to have zero length whiskers with any remaining points differing from the median as points.

@jc-healy I agree with this. Thanks very much for the well-considered input. It's very valuable.

See the PR references above.

@jduc

jduc commented Jan 26, 2016

Just to bump the conversation as I encountered this issue today and am curious to know what the plans are for next updates.

When plotting multiple boxplots on the same plot, it can happen that one of the datasets has an IQR of 0, and I think it causes a lot of confusion to have 90% of the boxplots using 1.5 IQR whiskers and 10% of them using the "min/max" whiskers. Adding the possibility to choose what should happen to the whiskers when IQR = 0 would add a bit more flexibility to the function and solve the issue. Thanks

@phobson
Member

phobson commented Jan 26, 2016

@jduc see #5343

That PR standardizes the behavior as you describe. When the IQR is 0, the whiskers only go out to min/max when the autorange parameter is set to True (the default is False).

I need to refresh that PR. Hopefully I can get to that soon.
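
For readers on a Matplotlib release that includes that PR, the behaviour described above can be checked without drawing anything. This is a sketch, assuming `matplotlib.cbook.boxplot_stats` accepts the `autorange` keyword as it does in current releases; the sample data is made up for illustration.

```python
import numpy as np
from matplotlib import cbook

# Made-up, heavily tied data: Q1 == Q3 == 5, so the IQR is zero.
data = np.array([5.0] * 8 + [1.0, 9.0])

# Default (autorange=False): whiskers stay on the box, extremes become fliers.
default_stats = cbook.boxplot_stats(data)[0]

# autorange=True: with a zero IQR the whiskers extend to the data min/max.
autorange_stats = cbook.boxplot_stats(data, autorange=True)[0]

for name, stats in [("default", default_stats), ("autorange", autorange_stats)]:
    print(name, stats["whislo"], stats["whishi"], stats["fliers"])
```

The same keyword can be passed to `Axes.boxplot`, so a figure with several datasets can keep a single whisker convention unless min/max whiskers are requested explicitly.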

@tacaswell tacaswell modified the milestones: 1.5.2 (Critical bug fix release), unassigned Mar 9, 2016
@jenshnielsen
Member

Fixed in #5343
