Document "gotchas" when dealing with Categorical dtype #25
Comments
@jarq6c, having read the Stack Overflow post and with regard to the above quote, I don't think I completely understand the issue. Can you elaborate?
@aaraney Here's a minimal example to duplicate the problem.
import pandas as pd
# Generate data
df = pd.DataFrame({
'treatment': ['A' for i in range(2)] + ['B' for i in range(2)],
'value': [float(i) for i in range(4)]
})
print(df)
  treatment  value
0         A    0.0
1         A    1.0
2         B    2.0
3         B    3.0
# Cast treatments to category
df['treatment'] = df['treatment'].astype('category')
# Add a category for which there is no data
df['treatment'] = df['treatment'].cat.add_categories('C')
# Note: adding the new category doesn't result in more data
print(df)
  treatment  value
0         A    0.0
1         A    1.0
2         B    2.0
3         B    3.0
# Find the mean for each category
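# Note the 'C' treatment: the unused category still appears in the result
# observed=False is passed explicitly because the default differs between pandas versions
mean_values = df.groupby('treatment', observed=False).mean()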
print(mean_values)
           value
treatment
A            0.5
B            2.5
C            NaN
# Find the mean for each category
# Use observed=True to avoid expanding categories without associated values
mean_values = df.groupby('treatment', observed=True).mean()
print(mean_values)
           value
treatment
A            0.5
B            2.5
I edited the original example to use a smaller dataframe. Basically, the problem is
Ah I see, thanks for posting the example. To your original point, yeah, I think a mention of this behavior in the README will be sufficient. It would not hurt to make note of this and give advice in the docstrings of methods that return columns with categorical dtype. I prefer my_dataframe['usgs_site_code'] = my_dataframe['usgs_site_code'].apply(str) I've found that using
@aaraney @hellkite500
I think it's a good idea to explicitly document some peculiarities of dealing with pandas.Categorical, which are quite common in evaluation_tools' canonical pandas.DataFrame. Bare minimum, I'll add something like this to README.md. Thoughts?

Note about pandas.Categorical data types

evaluation_tools uses pandas.DataFrame objects that contain pandas.Categorical values to increase memory efficiency. Depending upon your use-case, these values may require special consideration. To see if a DataFrame returned by evaluation_tools contains pandas.Categorical values, you can use pandas.DataFrame.info, which reports each column's Dtype. Columns with Dtype category are pandas.Categorical. It's important to note that these categories persist even if your DataFrame does not contain corresponding values. A possible consequence of this can be found on this Stack Overflow question.

Three possible solutions to this issue include:
- Cast to string
- Remove unused categories
- Use the observed option with groupby
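
For reference, here is a minimal sketch of the check and the three workarounds listed above. The DataFrame is a hypothetical stand-in for one returned by evaluation_tools, and the site codes and values are made up for illustration. Only standard pandas calls are used.

```python
import pandas as pd

# Hypothetical stand-in for a DataFrame returned by evaluation_tools.
# Category "03" is declared but has no rows.
df = pd.DataFrame({
    "usgs_site_code": pd.Categorical(
        ["01", "01", "02"], categories=["01", "02", "03"]
    ),
    "value": [1.0, 2.0, 3.0],
})

# Check for categorical columns: info() prints a Dtype per column,
# and select_dtypes gives the same answer programmatically.
df.info()
categorical_columns = df.select_dtypes(include="category").columns.tolist()

# The gotcha: with observed=False the unused category "03" gets a NaN row.
expanded = df.groupby("usgs_site_code", observed=False)["value"].mean()

# Solution 1: cast the column to string, which drops the category metadata.
as_string = df.assign(usgs_site_code=df["usgs_site_code"].astype(str))
by_string = as_string.groupby("usgs_site_code")["value"].mean()

# Solution 2: remove unused categories but keep the categorical dtype.
trimmed = df.assign(
    usgs_site_code=df["usgs_site_code"].cat.remove_unused_categories()
)
by_trimmed = trimmed.groupby("usgs_site_code", observed=False)["value"].mean()

# Solution 3: leave the data alone and ask groupby for observed values only.
by_observed = df.groupby("usgs_site_code", observed=True)["value"].mean()
```

Casting to string gives up the memory savings of the categorical dtype, while the other two options keep it. Which one fits probably depends on whether the unused categories are still needed elsewhere.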