Categoricals are a pandas data type corresponding to categorical variables in statistics. A categorical variable takes on a limited, and usually fixed, number of possible values (categories; levels in R). Examples are gender, social class, blood type, country affiliation, observation time or rating via Likert scales.

In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.DataFrame({"id": [1, 2, 3, 4, 5],
                    "raw_grade": ['a', 'b', 'c', 'd', 'e']})

In [4]:
# Convert the raw grades to a categorical data type
df["grade"] = df["raw_grade"].astype("category")
df["grade"]

0    a
1    b
2    c
3    d
4    e
Name: grade, dtype: category
Categories (5, object): ['a', 'b', 'c', 'd', 'e']
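
Internally, a categorical Series stores each value as an integer code that points into its list of categories. The following is a minimal sketch of that idea; the Series `s` and its values are hypothetical and not part of the example above.

```python
import pandas as pd

# Hypothetical data, for illustration only
s = pd.Series(["a", "b", "b", "a", "e"], dtype="category")

# Inferred categories are the unique values in sorted order
categories = list(s.cat.categories)

# Each code is the position of the value within `categories`
codes = list(s.cat.codes)

print(categories)
print(codes)
```

Repeated values share one category entry, which is why categoricals can save memory on columns with few distinct values.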

In [5]:
# Rename the categories to more meaningful names. Direct assignment to
# .cat.categories is deprecated and no longer works in pandas 2.0+;
# rename_categories returns a new Series instead.
df["grade"] = df["grade"].cat.rename_categories(["very bad","very good","better","good","bad"])


In [6]:
# set_categories can reorder the categories and add missing ones in one step
# (methods under Series.cat return a new Series by default). Here the list
# matches the existing categories, so the result is unchanged.
df["grade"] = df["grade"].cat.set_categories(["very bad","very good","better","good","bad"])
df["grade"]

0     very bad
1    very good
2       better
3         good
4          bad
Name: grade, dtype: category
Categories (5, object): ['very bad', 'very good', 'better', 'good', 'bad']
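
To see `set_categories` actually add or drop a category, a small sketch with a hypothetical Series `s` (separate from `df` above) may help. It assumes only the documented behavior that values whose category is missing from the new list become NaN.

```python
import pandas as pd

# Hypothetical data, for illustration only
s = pd.Series(["good", "bad", "good"], dtype="category")

# Returns a new Series: "medium" is added as an unused category and
# the category order follows the list that was passed in.
s2 = s.cat.set_categories(["bad", "medium", "good"])

# Values whose category is left out of the new list become NaN.
s3 = s.cat.set_categories(["good"])

print(list(s2.cat.categories))
print(s3.isna().tolist())
```

The original `s` is left untouched in both calls, so the result has to be assigned back to keep it.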

In [7]:
# Sorting follows the order of the categories, not lexical order:
df.sort_values(by='grade')

   id raw_grade      grade
0   1         a   very bad
1   2         b  very good
2   3         c     better
3   4         d       good
4   5         e        bad
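
The difference between category order and lexical order is easier to see when the two disagree. This sketch uses a hypothetical list `levels` and Series `s`, neither of which appears in the example above.

```python
import pandas as pd

# Hypothetical category order, from worst to best
levels = ["very bad", "bad", "good", "very good"]
s = pd.Series(["good", "very bad", "very good", "bad"])

# Plain strings sort alphabetically
lexical = s.sort_values().tolist()

# A categorical sorts by the position of each value in `levels`
dtype = pd.CategoricalDtype(categories=levels, ordered=True)
by_category = s.astype(dtype).sort_values().tolist()

print(lexical)
print(by_category)
```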


In [8]:
# Grouping by a categorical column with observed=False also shows empty categories
# (there are none here, so every category has a count of 1):
df.groupby("grade", observed=False).size()

grade
very bad     1
very good    1
better       1
good         1
bad          1
dtype: int64
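
Because every category in `df` is in use, the output above cannot show an empty group. A sketch with a hypothetical frame `df_demo`, which has an unused "medium" category, illustrates the effect of the `observed` parameter.

```python
import pandas as pd

# Hypothetical data: "medium" is a declared category that never occurs
df_demo = pd.DataFrame({
    "id": [1, 2, 3],
    "grade": pd.Categorical(["good", "bad", "good"],
                            categories=["bad", "medium", "good"]),
})

# observed=False keeps every declared category, including empty ones
all_levels = df_demo.groupby("grade", observed=False).size()

# observed=True keeps only the categories that actually occur
seen_only = df_demo.groupby("grade", observed=True).size()

print(all_levels)
print(seen_only)
```

Passing `observed` explicitly also avoids relying on the default, which differs between pandas versions.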