qcut can fail for highly discontinuous data distributions #15069

Open
wesm opened this issue Jan 5, 2017 · 6 comments
Labels
Bug cut cut, qcut

Comments

@wesm
Member

wesm commented Jan 5, 2017

Code Sample, a copy-pastable example if possible

This code fails for any K:

import pandas as pd

K = 100

pd.qcut([0] * K + [1] * (K + 1), 2)

Problem description

With pandas 0.19.2, I have:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-8-782385490865> in <module>()
----> 1 pd.qcut([0] * K + [1] * (K + 1), 2)

pandas/tools/tile.py in qcut(x, q, labels, retbins, precision)
    173     bins = algos.quantile(x, quantiles)
    174     return _bins_to_cuts(x, bins, labels=labels, retbins=retbins,
--> 175                          precision=precision, include_lowest=True)
    176 
    177 

pandas/tools/tile.py in _bins_to_cuts(x, bins, right, labels, retbins, precision, name, include_lowest)
    192 
    193     if len(algos.unique(bins)) < len(bins):
--> 194         raise ValueError('Bin edges must be unique: %s' % repr(bins))
    195 
    196     if include_lowest:

ValueError: Bin edges must be unique: array([0, 1, 1])

Expected Output

We need some kind of option to decide how to assign values to a quantile bucket in the event that two quantiles have the same value prior to the searchsorted call. In this case, the appropriate behavior may be to assign all 1 values to the 50% quantile bucket.
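To make the failure concrete, here is a minimal sketch in plain Python of how the duplicate edge arises. It assumes the quantiles are computed by linear interpolation between order statistics; `linear_quantile` is a hypothetical helper written for this illustration, not the pandas implementation.

```python
import math


def linear_quantile(sorted_values, q):
    """Quantile by linear interpolation between adjacent order statistics."""
    pos = q * (len(sorted_values) - 1)
    lo = math.floor(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] + frac * (sorted_values[hi] - sorted_values[lo])


K = 100
data = sorted([0] * K + [1] * (K + 1))

# Edges for two quantile buckets: the 0%, 50% and 100% quantiles.
edges = [linear_quantile(data, q) for q in (0.0, 0.5, 1.0)]

# The median already equals the maximum, so the last two edges coincide.
has_duplicate_edges = len(set(edges)) < len(edges)
print(edges, has_duplicate_edges)
```

Because there is one more 1 than 0, the median is 1 for every K, which is why the uniqueness check in the traceback always trips.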

@jorisvandenbossche
Member

There recently has been some improvement regarding this. With master:

In [101]: pd.qcut([0] * K + [1] * (K + 1), 2)
...
ValueError: Bin edges must be unique: array([0, 1, 1]).
You can drop duplicate edges by setting the 'duplicates' kwarg

In [102]: pd.qcut([0] * K + [1] * (K + 1), 2, duplicates='drop')
Out[102]: 
[[0, 1], [0, 1], [0, 1], [0, 1], [0, 1], ..., [0, 1], [0, 1], [0, 1], [0, 1], [0, 1]]
Length: 201
Categories (1, object): [[0, 1]]

So there is an option to deal with duplicate edges, but the approach chosen here is to use fewer bins rather than to assign all of the tied values to one of the bins.

@jorisvandenbossche
Member

See issue #7751 and the final PR, #15000, which was merged a couple of days ago.

@ashishsingal1
Contributor

Effectively, duplicates='drop' will assign all of the clumped values to a single bin. In my experience with data, this happens most often when there's a zero value for most observations and a tail of nonzero values. For example, take 'snowfall_in_inches'. For most days, this will be zero. If we want to split into quantiles, we'll need to group all of the zero values into one bucket. duplicates='drop' should do this. Happy to improve it if there's a better way, though.
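The effect of dropping duplicate edges can be sketched in plain Python. This is only an illustration under stated assumptions: quantiles use linear interpolation, bins are right-closed with the lowest edge included, and `linear_quantile` and `qcut_drop_duplicates` are hypothetical helpers, not pandas functions. The snowfall values are made up for the example.

```python
import math
from bisect import bisect_left


def linear_quantile(sorted_values, q):
    """Quantile by linear interpolation between adjacent order statistics."""
    pos = q * (len(sorted_values) - 1)
    lo = math.floor(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] + frac * (sorted_values[hi] - sorted_values[lo])


def qcut_drop_duplicates(values, q):
    """Compute q + 1 quantile edges, drop repeats, and assign bin codes."""
    s = sorted(values)
    edges = [linear_quantile(s, i / q) for i in range(q + 1)]
    unique_edges = sorted(set(edges))
    # Right-closed bins (lo, hi]; the lowest edge is included in bin 0.
    codes = [max(bisect_left(unique_edges, v) - 1, 0) for v in values]
    return unique_edges, codes


# Made-up snowfall-like data: mostly zeros with a short nonzero tail.
snowfall = [0] * 6 + [1, 2, 3, 4]
snow_edges, snow_codes = qcut_drop_duplicates(snowfall, 4)

# The data from the original report, split into two requested buckets.
K = 100
issue_edges, issue_codes = qcut_drop_duplicates([0] * K + [1] * (K + 1), 2)

print(snow_edges, snow_codes)
print(issue_edges, set(issue_codes))
```

In this sketch the four requested quartiles collapse to two bins for the snowfall data, with all zeros sharing the first bin, while the data from the report collapses to a single bin. That is the trade-off discussed above: no error, but fewer buckets than were asked for.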

@wesm
Member Author

wesm commented Jan 25, 2017

I looked at the behavior after #15000 -- I'm going to leave this issue open for now. We should look into the quantile algorithms in other statistical packages. For example, in R we have:

> quantile(data, c(0, 0.5, 1), type=1)
  0%  50% 100% 
   0    1    1 
> quantile(data, c(0, 0.5, 1), type=2)
  0%  50% 100% 
   0    1    1 
> quantile(data, c(0, 0.5, 1), type=3)
  0%  50% 100% 
   0    0    1 
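For readers without R at hand, the three discontinuous sample quantile definitions used above (types 1 to 3 in the Hyndman and Fan numbering that R follows) can be sketched in plain Python. This assumes `data` is the same 100 zeros and 101 ones as in the report; `r_quantile` is a hypothetical helper, and it omits the small floating-point tolerance that R applies.

```python
import math


def r_quantile(data, p, qtype):
    """Discontinuous sample quantiles, types 1-3 (Hyndman and Fan)."""
    x = sorted(data)
    n = len(x)
    m = -0.5 if qtype == 3 else 0.0
    j = math.floor(n * p + m)
    g = n * p + m - j
    if qtype == 1:    # inverse of the empirical CDF
        gamma = 1.0 if g > 0 else 0.0
    elif qtype == 2:  # like type 1, but averaging at discontinuities
        gamma = 1.0 if g > 0 else 0.5
    elif qtype == 3:  # nearest even order statistic
        gamma = 0.0 if (g == 0 and j % 2 == 0) else 1.0
    else:
        raise ValueError("only types 1, 2 and 3 are sketched here")

    def at(k):
        # 1-indexed order statistic, clamped to the ends of the sample.
        return x[min(max(k, 1), n) - 1]

    return (1 - gamma) * at(j) + gamma * at(j + 1)


K = 100
data = [0] * K + [1] * (K + 1)
results = {t: [r_quantile(data, p, t) for p in (0, 0.5, 1)] for t in (1, 2, 3)}
print(results)
```

Only type 3 places the median at 0 here, which is what yields three distinct-looking edges in the R output; types 1 and 2 give the same repeated edge that qcut trips over.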

I think having duplicate bin edges is fine as long as we have a convention about which bin to assign the data to. I would argue that in this case the correct sample quantiles are [0, 1) and [1, 1], so we have two well-defined bins to assign the data to. In the degenerate case where multiple bins have the same start and end point, we would want to assign values to the leftmost (or rightmost) bin.
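One possible reading of this convention, sketched in plain Python: keep the duplicate edges, treat bins as left-closed with the final bin closed on both sides, and break ties among zero-width bins by a `tie` argument. The function `assign_bin` and its `tie` parameter are hypothetical and exist only for this illustration; nothing like them is part of pandas.

```python
from bisect import bisect_left, bisect_right
from collections import Counter


def assign_bin(edges, value, tie="left"):
    """Return the bin index for value, allowing repeated edges.

    Bin i spans [edges[i], edges[i + 1]); the last bin is closed on the right.
    """
    nbins = len(edges) - 1
    if not edges[0] <= value <= edges[-1]:
        raise ValueError("value lies outside the bin edges")
    lo = bisect_left(edges, value)
    hi = bisect_right(edges, value)
    if lo == hi:
        # Strictly between two edges: exactly one bin contains the value.
        return lo - 1
    # The value equals one or more edges; bins lo .. hi - 1 start at it.
    idx = lo if tie == "left" else hi - 1
    return min(idx, nbins - 1)


K = 100
data = [0] * K + [1] * (K + 1)
edges = [0, 1, 1]  # the duplicate edges from the traceback above

counts = Counter(assign_bin(edges, v) for v in data)
print(counts)

# With several zero-width bins, the tie rule decides where the 1s go.
left_choice = assign_bin([0, 1, 1, 1], 1, tie="left")
right_choice = assign_bin([0, 1, 1, 1], 1, tie="right")
print(left_choice, right_choice)
```

Under these assumptions the zeros land in [0, 1) and the ones in [1, 1], so both requested buckets survive, in contrast to dropping the duplicate edge.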

@mroeschke mroeschke added Numeric Operations Arithmetic, Comparison, and Logical operations Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff labels Oct 21, 2018
@jbrockmendel jbrockmendel added the quantile quantile method label Nov 1, 2019
puneet29 added a commit to puneet29/pandas that referenced this issue Feb 2, 2020
Needs refactoring
@puneet29

puneet29 commented Feb 2, 2020

Can someone review the commit I just made? This is the first time I am contributing to pandas. Do I need to write a test for the same? Also, I need to know what to do with the duplicates parameter. Thank you!

@alimcmaster1
Member

> Can someone review the commit I just made? This is the first time I am contributing to pandas. Do I need to write a test for the same? Also, I need to know what to do with the duplicates parameter. Thank you!

Yes please to writing a test - this is often a good first step @puneet29

@mroeschke mroeschke added Bug cut cut, qcut and removed Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Numeric Operations Arithmetic, Comparison, and Logical operations quantile quantile method labels Apr 5, 2020
7 participants