Wrong ordering of ordered pandas categoricals #137

Closed
arogozhnikov opened this issue Oct 9, 2021 · 10 comments

@arogozhnikov

Describe the bug

pandas has a separate dtype, Categorical, which can optionally be made ordered with an arbitrary, user-defined order of elements. This is very convenient when the order of elements must be fixed and can't be described by any rule. natsorted ignores this ordering and sorts the values as plain strings.

Expected behavior

When dealing with ordered categorical columns, natsorted should just use the comparison of elements defined by the ordering.

Environment (please complete the following information):

  • Python Version: 3.7
  • OS [e.g. Windows, Fedora] ubuntu, macos

To Reproduce

In [1]: import pandas as pd

# Create a categorical array. Note the special flag 'ordered':
# most categoricals are unordered and natsort handles those fine, but not ordered ones.
In [2]: x = pd.Categorical(['a', 'b', 'c', 'b', 'a', 'c'], categories=['c', 'b', 'a'], ordered=True)

In [3]: x.sort_values()
Out[3]:
['c', 'c', 'b', 'b', 'a', 'a']
Categories (3, object): ['c' < 'b' < 'a']

In [4]: from natsort import natsorted

In [5]: natsorted(x)
Out[5]: ['a', 'a', 'b', 'b', 'c', 'c']

# expected output is ['c', 'c', 'b', 'b', 'a', 'a']

@SethMMorton
Owner

Out of curiosity, why use natsorted on a column that already has a categorical ordering? E.g. if x.sort_values() already gives the correct answer, why choose to sort with a different method?

@arogozhnikov
Author

It is part of generic code that handles different cases.

In my case, I have an auto-generated report table: a pandas DataFrame that can have multiple index levels.
E.g. one level can be a string and a second an integer. natsorted works well in those cases, so I've been applying it everywhere.

A useful case has recently emerged where some levels are categorical with a strict predefined sort order.

Example: level1 is a string, level2 is categorical. Sorting should be done on both.

df.sort_index() will properly sort by level2 (the categorical), but not level1; natsort can sort by level1 (the string) but not level2.

@SethMMorton
Owner

Could you encapsulate this in a function?

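from natsort import natsorted

def mysort(x):
    try:
        # Ordered categoricals carry their own ordering; use it.
        if x.ordered:
            return x.sort_values()
    except AttributeError:
        # Not a categorical: fall through to natural sorting.
        pass
    return natsorted(x)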

This way you can pass any data collection, and it will be sorted with natsorted unless the input is an ordered categorical, in which case it uses the categorical's internal sorting method. That may be preferable to natsorted anyway, since natsorted always returns a list.

@arogozhnikov
Author

arogozhnikov commented Oct 11, 2021

I'll try to provide a more complete example:

import pandas as pd

# create a dataframe with three columns
df = pd.DataFrame(dict(col1=[1, 2, 3, 4, 5], col2=['third', 'fourth', 'first', 'first', 'third'], col3=['type 10', 'type 10', 'type 12', 'type 2', 'type 2']))
# convert 'col2' to categorical.
# Only values from the provided list are allowed, and comparison between them is defined by the list order
df['col2'] = pd.Categorical(df['col2'], ['first', 'second', 'third', 'fourth'], ordered=True)
# move two columns into the index
df = df.set_index(['col2', 'col3'])
df

                col1
col2   col3         
third  type 10     1
fourth type 10     2
first  type 12     3
       type 2      4
third  type 2      5

Expected order:

df.iloc[[3, 2, 4, 0, 1]]

                col1
col2   col3         
first  type 2      4
       type 12     3
third  type 2      5
       type 10     1
fourth type 10     2

Normal sorting can't deal with col3: it sorts those values as plain strings, but it does use the given ordering for col2.

In [14]: df.sort_index()

                col1
col2   col3         
first  type 12     3
       type 2      4
third  type 10     1
       type 2      5
fourth type 10     2

natsort doesn't properly handle the first level, col2:

# index_natsorted returns the sorting order; df.iloc reorders rows using integer positions
# natsort can't tell that col2 is not really a string and sorts its values as strings
In [18]: df.iloc[natsort.index_natsorted(df.index)]

                col1
col2   col3         
first  type 2      4
       type 12     3
fourth type 10     2
third  type 2      5
       type 10     1

If that's too abstract, imagine a setting like col2 -> department name, col3 -> task (a number with a description).
Departments just have a historically given order, while tasks are natsortable.

In my case the structure of the index (the number of levels and their particular types) is different each time.
Some levels are categorical, some are not.

@arogozhnikov
Author

After a bit of digging I see that it is challenging because of auto-conversion to string type. I wonder if there is some other string-like type with custom ordering that both natsort and pandas can handle.

@SethMMorton
Owner

I'm sorry, but I am still not clear what you want to achieve. Can you annotate your examples a bit more?

@arogozhnikov
Author

I tried to add more details and some comments to the example above.

@SethMMorton
Owner

Would it be possible to enforce all index columns to be categorical?

For example, for the column with the "non obvious" order you did

df['col2'] = pd.Categorical(df['col2'], ['first', 'second', 'third', 'fourth'], ordered=True)

For the other index column you could do

df['col3'] = pd.Categorical(df['col3'], natsorted(set(df['col3'])), ordered=True)

Then sort_index() will work as expected.

@SethMMorton
Owner

@arogozhnikov Did this solution work for you?

@arogozhnikov
Author

@SethMMorton sorry for dropping the ball, yes I believe this should cover my needs
