qcut can fail for highly discontinuous data distributions #15069

wesm · 2017-01-05T20:16:07Z

Code Sample, a copy-pastable example if possible

This code fails for any K:

import pandas as pd

K = 100

pd.qcut([0] * K + [1] * (K + 1), 2)

Problem description

With pandas 0.19.2, I have:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-8-782385490865> in <module>()
----> 1 pd.qcut([0] * K + [1] * (K + 1), 2)

pandas/tools/tile.py in qcut(x, q, labels, retbins, precision)
    173     bins = algos.quantile(x, quantiles)
    174     return _bins_to_cuts(x, bins, labels=labels, retbins=retbins,
--> 175                          precision=precision, include_lowest=True)
    176 
    177 

pandas/tools/tile.py in _bins_to_cuts(x, bins, right, labels, retbins, precision, name, include_lowest)
    192 
    193     if len(algos.unique(bins)) < len(bins):
--> 194         raise ValueError('Bin edges must be unique: %s' % repr(bins))
    195 
    196     if include_lowest:

ValueError: Bin edges must be unique: array([0, 1, 1])

Expected Output

We need some kind of option to decide how to assign values to a quantile bucket in the event that two quantiles have the same value prior to the searchsorted call. In this case, the appropriate behavior may be to assign all 1 values to the 50% quantile bucket.
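The duplicate edge can be reproduced outside of qcut. A minimal sketch, assuming NumPy's np.quantile with its default linear interpolation behaves like the algos.quantile call shown in the traceback (that equivalence is an assumption, not something the traceback confirms):

```python
import numpy as np

K = 100
data = [0] * K + [1] * (K + 1)

# Edges for 2 quantile bins: the 0%, 50% and 100% sample quantiles.
# np.quantile is an assumed stand-in for pandas' internal quantile call.
edges = np.quantile(data, [0.0, 0.5, 1.0])
print(edges)

# The median and the maximum coincide, so the edges are not unique.
has_duplicate_edges = len(np.unique(edges)) < len(edges)
print(has_duplicate_edges)
```

Because there is one more 1 than 0, the median falls on a 1 for any K, which is why the failure does not depend on K.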

jorisvandenbossche · 2017-01-05T21:08:55Z

There has recently been some improvement regarding this. With master:

In [101]: pd.qcut([0] * K + [1] * (K + 1), 2)
...
ValueError: Bin edges must be unique: array([0, 1, 1]).
You can drop duplicate edges by setting the 'duplicates' kwarg

In [102]: pd.qcut([0] * K + [1] * (K + 1), 2, duplicates='drop')
Out[102]: 
[[0, 1], [0, 1], [0, 1], [0, 1], [0, 1], ..., [0, 1], [0, 1], [0, 1], [0, 1], [0, 1]]
Length: 201
Categories (1, object): [[0, 1]]

So there is an option to deal with duplicate edges, but the option chosen here is to use fewer bins instead of assigning all values to one of the bins.

jorisvandenbossche · 2017-01-05T21:11:10Z

See issue #7751 and the final PR, #15000, which was merged a couple of days ago.

ashishsingal1 · 2017-01-06T21:49:38Z

Effectively, duplicates='drop' will assign all of the clumped values to a single bin. In my experience with data, this happens most often when most observations are zero and there is a tail of nonzero values. For example, take 'snowfall_in_inches'. On most days, this will be zero. If we want to split into quantiles, we'll need to group all of the zero values into one bucket. duplicates='drop' should do this. Happy to improve this if there's a better way, though.
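The snowfall case can be tried directly. A small sketch with made-up numbers (the 80/20 split and the 1..20 inch tail are hypothetical, chosen only for illustration), assuming a pandas version that has the duplicates keyword:

```python
import numpy as np
import pandas as pd

# Hypothetical data: 80 days without snow, then a tail of 1..20 inches.
snowfall_in_inches = [0] * 80 + list(range(1, 21))

# Ask for deciles; most of the requested edges collapse onto 0.
binned = pd.qcut(snowfall_in_inches, 10, duplicates="drop")

# Count how many observations landed in each surviving bin.
n_bins = len(binned.categories)
counts = np.bincount(binned.codes, minlength=n_bins).tolist()
print(n_bins, counts)
```

With this data, the zeros should end up together in the first bin and the tail is split across the remaining bins, so fewer than the 10 requested bins come back.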

wesm · 2017-01-25T16:09:17Z

I looked at the behavior after #15000 -- I'm going to leave this issue open for now. We should look into the quantile algorithms in other statistical packages. For example, in R we have:

> quantile(data, c(0, 0.5, 1), type=1)
  0%  50% 100% 
   0    1    1 
> quantile(data, c(0, 0.5, 1), type=2)
  0%  50% 100% 
   0    1    1 
> quantile(data, c(0, 0.5, 1), type=3)
  0%  50% 100% 
   0    0    1

I think having duplicate bin edges is fine as long as we have a convention about which bin to assign the data to. I would argue that in this case the correct sample quantile bins are [0, 1) and [1, 1], so we have two well-defined bins to assign the data to. In the degenerate case where multiple bins have the same start and end point, we would want to assign values to the leftmost (or rightmost) bin.
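One way such a convention could look, as a rough sketch only: assign_bins and its prefer argument are hypothetical names, not pandas API, and the sketch assumes left-closed bins, a closed last bin, and sorted edges.

```python
import numpy as np

def assign_bins(values, edges, prefer="left"):
    """Map each value to a bin index, tolerating duplicate edges.

    Hypothetical helper: bins are [e[i], e[i+1]), the last bin is closed,
    and values outside the edges are clipped into the first or last bin.
    """
    values = np.asarray(values)
    edges = np.asarray(edges)
    n_bins = len(edges) - 1
    # Index of the last edge <= value: the rightmost bin starting at or below it.
    right = np.searchsorted(edges, values, side="right") - 1
    if prefer == "right":
        idx = right
    else:
        # Index of the first edge >= value: the leftmost bin starting at it.
        left = np.searchsorted(edges, values, side="left")
        on_edge = edges[np.minimum(left, n_bins)] == values
        idx = np.where(on_edge, left, right)
    return np.clip(idx, 0, n_bins - 1)

# Edges from the example in this issue: bins [0, 1) and [1, 1].
print(assign_bins([0, 0, 1, 1], [0, 1, 1]))
# Two degenerate [1, 1] bins: the convention decides where the 1s go.
print(assign_bins([0, 1], [0, 1, 1, 1], prefer="left"))
print(assign_bins([0, 1], [0, 1, 1, 1], prefer="right"))
```

With edges [0, 1, 1] both conventions agree; they only differ once more than one bin shares the same start and end point.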

puneet29 · 2020-02-02T11:46:40Z

Can someone review the commit I just made? This is the first time I am contributing to pandas. Do I need to write a test for the same? Also, I need to know what to do with the duplicates parameter. Thank you!

alimcmaster1 · 2020-02-04T19:24:14Z

> Can someone review the commit I just made? This is the first time I am contributing to pandas. Do I need to write a test for the same? Also, I need to know what to do with the duplicates parameter. Thank you!

Yes please to writing a test - this is often a good first step @puneet29

mroeschke added the Numeric Operations (arithmetic, comparison, and logical operations) and Algos (non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff) labels Oct 21, 2018

jbrockmendel added the quantile (quantile method) label Nov 1, 2019

puneet29 added a commit to puneet29/pandas that referenced this issue Feb 2, 2020

Fixed pandas-dev#15069

0cbeae5

Needs refactoring

puneet29 mentioned this issue Feb 3, 2020

BUG: qcut can fail for highly discontinuous data distributions #31626

Closed

5 tasks

mroeschke added the Bug and cut (cut, qcut) labels and removed the Algos, Numeric Operations, and quantile labels Apr 5, 2020
