Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ERR: qcut uniquess checking (try 2) #15000

Closed
wants to merge 12 commits into from

Conversation

ashishsingal1
Copy link
Contributor

@ashishsingal1 ashishsingal1 commented Dec 27, 2016

Add option to drop non-unique bins.

@ashishsingal1
Copy link
Contributor Author

My second try on #14455 -- I wasn't able to rebase this one.

@codecov-io
Copy link

codecov-io commented Dec 28, 2016

Current coverage is 84.78% (diff: 87.50%)

Merging #15000 into master will decrease coverage by <.01%

@@             master     #15000   diff @@
==========================================
  Files           145        145          
  Lines         51090      51095     +5   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43315      43319     +4   
- Misses         7775       7776     +1   
  Partials          0          0          

Powered by Codecov. Last update 7f0eefc...698b4ec

@@ -151,6 +151,8 @@ def qcut(x, q, labels=None, retbins=False, precision=3):
as a scalar.
precision : int
The precision at which to store and display the bins labels
duplicates : {'raise', 'drop'}, optional
If binned edges are not unique, raise ValueError or drop non-uniques.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a versionadded tag (0.20.0)

raise ValueError("invalid value for 'duplicates' parameter, "
+ "valid options are: raise, drop")

if len(algos.unique(bins)) < len(bins):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do

bins = algos.unique(bins)

first (so you have it already for below)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I understand -- if I do that before line 213, how would I compare len of bins before and after in line 207 and how would I output the original non unique edges in 209 in case duplicates='raise'?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

call it something else so it doesn't overwrite this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Is this for performance reasons?

@jreback
Copy link
Contributor

jreback commented Dec 28, 2016

pls add a whatsnew in 0.20.0 (other enhancements)

@jreback jreback added Error Reporting Incorrect or improved errors from pandas Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Dec 28, 2016
@jorisvandenbossche
Copy link
Member

@ashishsingal1 For future reference, it is not needed to start a new PR, you can always clean up locally and just force push to the same branch in your fork, and this PR will be updated automatically. In fact, we rather much prefer that instead of opening a new PR, to keep discussions gathered at one place. If you need help with those git aspects, don't hesitate to ask!

@jorisvandenbossche jorisvandenbossche added this to the 0.20.0 milestone Dec 28, 2016
Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good already! Added some more comments.

@@ -151,6 +151,9 @@ def qcut(x, q, labels=None, retbins=False, precision=3):
as a scalar.
precision : int
The precision at which to store and display the bins labels
duplicates : {'raise', 'drop'}, optional
If binned edges are not unique, raise ValueError or drop non-uniques.
.. versionadded:: 0.20.0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There needs to be a whiteline above this one (rst specifics ...)

@@ -151,6 +151,9 @@ def qcut(x, q, labels=None, retbins=False, precision=3):
as a scalar.
precision : int
The precision at which to store and display the bins labels
duplicates : {'raise', 'drop'}, optional
If binned edges are not unique, raise ValueError or drop non-uniques.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"binned edges" -> bin edges ?

values = [0, 0, 0, 0, 1, 2, 3]
cats = qcut(values, 3, duplicates='drop')
ex_levels = ['[0, 1]', '(1, 3]']
self.assertTrue((cats.categories == ex_levels).all())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a test for duplicates='raise' (the default) as well? (assert that it raises a ValueError)


if duplicates not in ['raise', 'drop']:
raise ValueError("invalid value for 'duplicates' parameter, "
+ "valid options are: raise, drop")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the + is not needed here. The strings will automatically be concatenated due to Python's implied line continuation inside parentheses

if duplicates == 'raise':
raise ValueError('Bin edges must be unique: %s' % repr(bins) +
' You can drop duplicate edges ' +
'by setting \'duplicates\' param')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here about the +'s. Further, if you use " for the string, you don not need to escape the ' inside the string.
Lastly, can you put the injection of "repr(bins)" at the end of the string (and maybe also use .format instead of %)

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ashishsingal1 Thanks for the update. Looking good! Just added some small comments on the docs

@@ -151,6 +151,9 @@ def qcut(x, q, labels=None, retbins=False, precision=3):
as a scalar.
precision : int
The precision at which to store and display the bins labels
duplicates : {'raise', 'drop'}, optional
If bin edges are not unique, raise ValueError or drop non-uniques.
.. versionadded:: 0.20.0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There needs to be a blank line above .. versionadded ... (rst specifics ..)

@@ -151,6 +151,9 @@ def qcut(x, q, labels=None, retbins=False, precision=3):
as a scalar.
precision : int
The precision at which to store and display the bins labels
duplicates : {'raise', 'drop'}, optional
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add "default 'raise'"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or you can also add it in the description like "Default is to raise an error" (one of both is good)

Docstring update
@@ -105,6 +105,7 @@ Other enhancements
of sorting or an incorrect key. See :ref:`here <advanced.unsorted>`

- ``pd.cut`` and ``pd.qcut`` now support datetime64 and timedelta64 dtypes (:issue:`14714`, :issue:`14798`)
- ``pd.qcut`` can optionally remove duplicate edges instead of throwing an error (:issue:`7751`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pd.cut/qcut have gained the duplicates kw to control whether to raise on duplicated edges.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this relevant for cut or just qcut?

Deleting whitespace.
@ashishsingal1
Copy link
Contributor Author

@jreback @jorisvandenbossche anything left for this PR on my side?

@jreback jreback closed this in 8051d61 Dec 30, 2016
@jreback
Copy link
Contributor

jreback commented Dec 30, 2016

thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Error Reporting Incorrect or improved errors from pandas Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

qcut() should make sure the bins bounderies are unique before passing them to _bins_to_cuts
4 participants