
Document "gotchas" when dealing with Categorical dtype #25

Closed
jarq6c opened this issue Jan 28, 2021 · 4 comments · Fixed by #32

jarq6c commented Jan 28, 2021

@aaraney @hellkite500
I think it's a good idea to explicitly document some peculiarities of dealing with pandas.Categorical, which is quite common in the canonical pandas.DataFrame produced by evaluation_tools. At a bare minimum, I'll add something like this to README.md. Thoughts?

Note about pandas.Categorical data types

evaluation_tools uses pandas.DataFrame objects that contain pandas.Categorical values to increase memory efficiency. Depending upon your use case, these values may require special consideration. To see whether a DataFrame returned by evaluation_tools contains pandas.Categorical columns, use pandas.DataFrame.info like so:

print(my_dataframe.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5706954 entries, 0 to 5706953
Data columns (total 7 columns):
 #   Column            Dtype         
---  ------            -----         
 0   value_date        datetime64[ns]
 1   variable_name     category      
 2   usgs_site_code    category      
 3   measurement_unit  category      
 4   value             float32       
 5   qualifiers        category      
 6   series            category      
dtypes: category(5), datetime64[ns](1), float32(1)
memory usage: 141.5 MB
None
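Alternatively, the categorical columns can be listed programmatically. A short sketch, assuming the column names from the output above:

# Select the columns stored as pandas.Categorical
categorical_columns = my_dataframe.select_dtypes(include='category').columns
print(list(categorical_columns))
# ['variable_name', 'usgs_site_code', 'measurement_unit', 'qualifiers', 'series']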

Columns with Dtype category are pandas.Categorical. It's important to note that these categories persist even if your DataFrame does not contain corresponding values. A possible consequence of this can be found in this Stack Overflow question.

Three possible solutions to this issue include:

Casting to string

my_dataframe['usgs_site_code'] = my_dataframe['usgs_site_code'].astype(str)

Remove unused categories

my_dataframe['usgs_site_code'] = my_dataframe['usgs_site_code'].cat.remove_unused_categories()

Use observed option with groupby

mean_flow = my_dataframe.groupby('usgs_site_code', observed=True).mean()
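To see these options in context, here is a minimal sketch (not part of the proposal above; the site codes are made up) of what happens when rows are filtered out of a DataFrame with a categorical column:

import pandas as pd

# Two example sites with a categorical site-code column
df = pd.DataFrame({
    'usgs_site_code': pd.Categorical(['01013500', '01013500', '02146470']),
    'value': [1.0, 2.0, 3.0],
})

# Filtering rows does not drop the corresponding category
subset = df[df['usgs_site_code'] != '02146470'].copy()
print(subset['usgs_site_code'].cat.categories)  # '02146470' is still listed

# Solution 2: drop the stale category before grouping
tidy = subset.copy()
tidy['usgs_site_code'] = tidy['usgs_site_code'].cat.remove_unused_categories()
print(tidy.groupby('usgs_site_code').mean())  # no all-NaN row for '02146470'

# Solution 3: leave the categories alone and aggregate only observed values
print(subset.groupby('usgs_site_code', observed=True).mean())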
jarq6c added the documentation label Jan 28, 2021
jarq6c self-assigned this Jan 28, 2021
aaraney commented Jan 28, 2021

It's important to note that these categories persist even if your DataFrame does not contain corresponding values.

@jarq6c, having read the Stack Overflow post and with regard to the above quote, I don't think I completely understand the issue. Can you elaborate?

jarq6c commented Jan 28, 2021

@aaraney Here's a minimal example to reproduce the problem.

import pandas as pd

# Generate data
df = pd.DataFrame({
    'treatment': ['A' for i in range(2)] + ['B' for i in range(2)],
    'value': [float(i) for i in range(4)]
})
print(df)

  treatment  value
0         A    0.0
1         A    1.0
2         B    2.0
3         B    3.0

# Cast treatments to category
df['treatment'] = df['treatment'].astype('category')

# Add a category for which there is no data
df['treatment'] = df['treatment'].cat.add_categories('C')

# Note: adding the new category doesn't result in more data
print(df)
  treatment  value
0         A    0.0
1         A    1.0
2         B    2.0
3         B    3.0

# Find the mean for each category
#  Note the 'C' treatment
mean_values = df.groupby('treatment').mean()

print(mean_values)

           value
treatment       
A           0.5
B           2.5
C            NaN

# Find the mean for each category
#  Use observed=True to avoid expanding categories without associated values
mean_values = df.groupby('treatment', observed=True).mean()

print(mean_values)

           value
treatment       
A           0.5
B           2.5

jarq6c commented Jan 28, 2021

I edited the original example to use a smaller DataFrame. Basically, the problem is that groupby will explode with NaN rows if your categorical columns are not tidy.

aaraney commented Jan 28, 2021

Ah, I see, thanks for posting the example. To your original point, yeah, I think a mention of this behavior in the README will be sufficient. It also wouldn't hurt to note this and give advice in the docstrings of methods that return columns with dtype category. I prefer df.groupby('treatment', observed=True).mean() as the way to avoid the issue, or:

my_dataframe['usgs_site_code'] = my_dataframe['usgs_site_code'].apply(str)

I've found that using apply(str) is much faster than an astype(str) cast.
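If anyone wants to check that on their own data, here is a rough timing sketch (the column contents and sizes are arbitrary, and results will vary by pandas version and machine):

import time
import pandas as pd

# Build a large categorical column to compare the two conversions
s = pd.Series(['01013500', '02146470', '03339000'] * 1_000_000, dtype='category')

start = time.perf_counter()
via_astype = s.astype(str)
print(f"astype(str): {time.perf_counter() - start:.3f} s")

start = time.perf_counter()
via_apply = s.apply(str)
print(f"apply(str):  {time.perf_counter() - start:.3f} s")

# Both produce object-dtype columns with identical string values
assert via_astype.equals(via_apply)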
