ERR: qcut uniqueness checking (try 2) #15000

Closed
wants to merge 12 commits into from
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.20.0.txt
@@ -105,6 +105,7 @@ Other enhancements
of sorting or an incorrect key. See :ref:`here <advanced.unsorted>`

- ``pd.cut`` and ``pd.qcut`` now support datetime64 and timedelta64 dtypes (:issue:`14714`, :issue:`14798`)
- ``pd.qcut`` can optionally remove duplicate edges instead of throwing an error (:issue:`7751`)
Contributor

pd.cut/qcut have gained the duplicates kw to control whether to raise on duplicated edges.

Contributor Author

Is this relevant for cut or just qcut?

- ``Series`` provides a ``to_excel`` method to output Excel files (:issue:`8825`)
- The ``usecols`` argument in ``pd.read_csv`` now accepts a callable function as a value (:issue:`14154`)
- ``pd.DataFrame.plot`` now prints a title above each subplot if ``subplots=True`` and ``title`` is a list of strings (:issue:`14753`)
12 changes: 12 additions & 0 deletions pandas/tools/tests/test_tile.py
@@ -272,6 +272,18 @@ def test_series_retbins(self):
np.array([0, 0, 1, 1], dtype=np.int8))
tm.assert_numpy_array_equal(bins, np.array([0, 1.5, 3]))

def test_qcut_duplicates_drop(self):
    # GH 7751
    values = [0, 0, 0, 0, 1, 2, 3]
    cats = qcut(values, 3, duplicates='drop')
    ex_levels = ['[0, 1]', '(1, 3]']
    self.assertTrue((cats.categories == ex_levels).all())
Member

Can you add a test for duplicates='raise' (the default) as well? (assert that it raises a ValueError)


def test_qcut_duplicates_raise(self):
    # GH 7751
    values = [0, 0, 0, 0, 1, 2, 3]
    self.assertRaises(ValueError, qcut, values, 3, duplicates='raise')

def test_single_bin(self):
    # issue 14652
    expected = Series([0, 0])
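
To see why the `values` list used in the two duplicate tests above yields non-unique edges, the sketch below computes the four tercile edges by hand. It is a plain-Python illustration rather than pandas code: `quantile_edges` is a hypothetical helper, and it assumes linear interpolation between order statistics, which may differ in detail from what `algos.quantile` does internally.

```python
def quantile_edges(values, q):
    """Hypothetical helper: q + 1 edges at evenly spaced quantiles.

    Assumes linear interpolation between neighbouring sorted values.
    """
    data = sorted(values)
    n = len(data)
    edges = []
    for i in range(q + 1):
        pos = (n - 1) * i / q        # fractional index into the sorted data
        lo = int(pos)                # floor, since pos is never negative
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        edges.append(data[lo] + (data[hi] - data[lo]) * frac)
    return edges


edges = quantile_edges([0, 0, 0, 0, 1, 2, 3], 3)
print(edges)  # [0.0, 0.0, 1.0, 3.0] -> the edge 0 appears twice
```

Once the repeated edge is dropped, the remaining edges 0, 1 and 3 define two intervals, consistent with the `ex_levels` expected by `test_qcut_duplicates_drop`.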
25 changes: 19 additions & 6 deletions pandas/tools/tile.py
@@ -129,7 +129,7 @@ def cut(x, bins, right=True, labels=None, retbins=False, precision=3,
series_index, name)


def qcut(x, q, labels=None, retbins=False, precision=3):
def qcut(x, q, labels=None, retbins=False, precision=3, duplicates='raise'):
"""
Quantile-based discretization function. Discretize variable into
equal-sized buckets based on rank or based on sample quantiles. For example
@@ -151,6 +151,9 @@ def qcut(x, q, labels=None, retbins=False, precision=3):
as a scalar.
precision : int
The precision at which to store and display the bins labels
duplicates : {'raise', 'drop'}, optional
Member

Can you add "default 'raise'"?

Member

Or you can add it to the description, like "Default is to raise an error" (either one is fine).

If bin edges are not unique, raise ValueError or drop non-uniques.
.. versionadded:: 0.20.0
Member

There needs to be a blank line above this one (rst specifics ...)

Member

There needs to be a blank line above .. versionadded ... (rst specifics ..)


Returns
-------
@@ -187,22 +190,32 @@ def qcut(x, q, labels=None, retbins=False, precision=3):
bins = algos.quantile(x, quantiles)
fac, bins = _bins_to_cuts(x, bins, labels=labels,
precision=precision, include_lowest=True,
dtype=dtype)
dtype=dtype, duplicates=duplicates)

return _postprocess_for_cut(fac, bins, retbins, x_is_series,
series_index, name)


def _bins_to_cuts(x, bins, right=True, labels=None,
precision=3, include_lowest=False,
dtype=None):
dtype=None, duplicates='raise'):

if duplicates not in ['raise', 'drop']:
    raise ValueError("invalid value for 'duplicates' parameter, "
                     "valid options are: raise, drop")

unique_bins = algos.unique(bins)
if len(unique_bins) < len(bins):
    if duplicates == 'raise':
        raise ValueError("Bin edges must be unique: {}. You "
                         "can drop duplicate edges by setting "
                         "'duplicates' param".format(repr(bins)))
    else:
        bins = unique_bins

side = 'left' if right else 'right'
ids = bins.searchsorted(x, side=side)

if len(algos.unique(bins)) < len(bins):
    raise ValueError('Bin edges must be unique: %s' % repr(bins))

if include_lowest:
    ids[x == bins[0]] = 1
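
The control flow that this hunk gives `_bins_to_cuts` can be tried out without pandas. The following is a minimal sketch under stated assumptions: `resolve_bin_edges` and `assign_bins` are hypothetical names, plain lists stand in for NumPy arrays, `dict.fromkeys` stands in for `algos.unique`, and `bisect_left` stands in for `searchsorted(..., side='left')`, which is the `right=True` case.

```python
from bisect import bisect_left


def resolve_bin_edges(bins, duplicates="raise"):
    """Hypothetical helper: validate `duplicates`, then raise on or drop repeats."""
    if duplicates not in ("raise", "drop"):
        raise ValueError("invalid value for 'duplicates' parameter, "
                         "valid options are: raise, drop")
    # Order-preserving de-duplication (assumed stand-in for algos.unique).
    unique_bins = list(dict.fromkeys(bins))
    if len(unique_bins) < len(bins):
        if duplicates == "raise":
            raise ValueError("Bin edges must be unique: {!r}".format(bins))
        return unique_bins
    return list(bins)


def assign_bins(values, bins, include_lowest=True):
    """Hypothetical helper: 1-based bin ids for right-closed intervals."""
    ids = [bisect_left(bins, v) for v in values]
    if include_lowest:
        # A value equal to the lowest edge is placed in the first bin.
        ids = [1 if v == bins[0] else i for v, i in zip(values, ids)]
    return ids


edges = resolve_bin_edges([0, 0, 1, 3], duplicates="drop")
print(edges)                                      # [0, 1, 3]
print(assign_bins([0, 0, 0, 0, 1, 2, 3], edges))  # [1, 1, 1, 1, 1, 2, 2]
```

Running the duplicate check before the search, as the new code does, means that dropped edges never reach `searchsorted`.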
