
Document "gotchas" when dealing with Categorical dtype #25

Closed
jarq6c opened this issue Jan 28, 2021 · 4 comments · Fixed by #32

jarq6c commented Jan 28, 2021

@aaraney @hellkite500
I think it's a good idea to explicitly document some peculiarities of dealing with pandas.Categorical, which is quite common in the canonical pandas.DataFrame produced by evaluation_tools. At a bare minimum, I'll add something like this to README.md. Thoughts?

Note about pandas.Categorical data types

evaluation_tools uses pandas.DataFrame objects that contain pandas.Categorical values to increase memory efficiency. Depending upon your use case, these values may require special consideration. To see whether a DataFrame returned by evaluation_tools contains pandas.Categorical columns, use pandas.DataFrame.info like so:

print(my_dataframe.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5706954 entries, 0 to 5706953
Data columns (total 7 columns):
 #   Column            Dtype         
---  ------            -----         
 0   value_date        datetime64[ns]
 1   variable_name     category      
 2   usgs_site_code    category      
 3   measurement_unit  category      
 4   value             float32       
 5   qualifiers        category      
 6   series            category      
dtypes: category(5), datetime64[ns](1), float32(1)
memory usage: 141.5 MB
None
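Alternatively, the categorical columns can be listed programmatically. A short sketch, assuming the column names from the output above:

# Select the columns stored as pandas.Categorical
categorical_columns = my_dataframe.select_dtypes(include='category').columns
print(list(categorical_columns))
# ['variable_name', 'usgs_site_code', 'measurement_unit', 'qualifiers', 'series']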

Columns with Dtype category are pandas.Categorical. It's important to note that these categories persist even if your DataFrame does not contain corresponding values. A possible consequence of this can be found in this Stack Overflow question.

Three possible solutions to this issue include:

Casting to string

my_dataframe['usgs_site_code'] = my_dataframe['usgs_site_code'].astype(str)

Remove unused categories

my_dataframe['usgs_site_code'] = my_dataframe['usgs_site_code'].cat.remove_unused_categories()

Use observed option with groupby

mean_flow = my_dataframe.groupby('usgs_site_code', observed=True).mean()
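To see these options in context, here is a minimal sketch (not part of the proposal above; the site codes are made up) of what happens when rows are filtered out of a DataFrame with a categorical column:

import pandas as pd

# Two example sites with a categorical site-code column
df = pd.DataFrame({
    'usgs_site_code': pd.Categorical(['01013500', '01013500', '02146470']),
    'value': [1.0, 2.0, 3.0],
})

# Filtering rows does not drop the corresponding category
subset = df[df['usgs_site_code'] != '02146470'].copy()
print(subset['usgs_site_code'].cat.categories)  # '02146470' is still listed

# Solution 2: drop the stale category before grouping
tidy = subset.copy()
tidy['usgs_site_code'] = tidy['usgs_site_code'].cat.remove_unused_categories()
print(tidy.groupby('usgs_site_code').mean())  # no all-NaN row for '02146470'

# Solution 3: leave the categories alone and aggregate only observed values
print(subset.groupby('usgs_site_code', observed=True).mean())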
jarq6c added the documentation label Jan 28, 2021
jarq6c self-assigned this Jan 28, 2021
aaraney commented Jan 28, 2021

It's important to note that these categories persist even if your DataFrame does not contain corresponding values.

@jarq6c, having read the Stack Overflow post and with regard to the above quote, I don't think I completely understand the issue. Can you elaborate?

jarq6c commented Jan 28, 2021

@aaraney Here's a minimal example to reproduce the problem.

import pandas as pd

# Generate data
df = pd.DataFrame({
    'treatment': ['A' for i in range(2)] + ['B' for i in range(2)],
    'value': [float(i) for i in range(4)]
})
print(df)

  treatment  value
0         A    0.0
1         A    1.0
2         B    2.0
3         B    3.0

# Cast treatments to category
df['treatment'] = df['treatment'].astype('category')

# Add a category for which there is no data
df['treatment'] = df['treatment'].cat.add_categories('C')

# Note: adding the new category doesn't result in more data
print(df)
  treatment  value
0         A    0.0
1         A    1.0
2         B    2.0
3         B    3.0

# Find the mean for each category
#  Note the 'C' treatment
mean_values = df.groupby('treatment').mean()

print(mean_values)

           value
treatment       
A           0.5
B           2.5
C            NaN

# Find the mean for each category
#  Use observed=True to avoid expanding categories without associated values
mean_values = df.groupby('treatment', observed=True).mean()

print(mean_values)

           value
treatment       
A           0.5
B           2.5

jarq6c commented Jan 28, 2021

I edited the original example to use a smaller DataFrame. Basically, the problem is that groupby will explode with NaN rows if your categorical columns are not tidy.

aaraney commented Jan 28, 2021

Ah, I see, thanks for posting the example. To your original point, yeah, I think a mention of this behavior in the README will be sufficient. It also wouldn't hurt to note this and give advice in the docstrings of methods that return columns with dtype category. I prefer df.groupby('treatment', observed=True).mean() as the way to avoid the issue, or:

my_dataframe['usgs_site_code'] = my_dataframe['usgs_site_code'].apply(str)

I've found that using apply(str) is much faster than an astype(str) cast.
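If anyone wants to check that on their own data, here is a rough timing sketch (the column contents and sizes are arbitrary, and results will vary by pandas version and machine):

import time
import pandas as pd

# Build a large categorical column to compare the two conversions
s = pd.Series(['01013500', '02146470', '03339000'] * 1_000_000, dtype='category')

start = time.perf_counter()
via_astype = s.astype(str)
print(f"astype(str): {time.perf_counter() - start:.3f} s")

start = time.perf_counter()
via_apply = s.apply(str)
print(f"apply(str):  {time.perf_counter() - start:.3f} s")

# Both produce object-dtype columns with identical string values
assert via_astype.equals(via_apply)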
