# [Expressions: Categorical data and enums](https://docs.pola.rs/user-guide/expressions/categorical-data-and-enums/)

Polars recommends using `Enum` when the set of categories is fixed and known in advance, and `Categorical` when it is not. That seems a bit odd at first.

In [1]:
import polars as pl

## Data type Enum

In [3]:
bears_enum = pl.Enum(["Polar", "Panda", "Brown"])
bears = pl.Series(["Polar", "Panda", "Brown", "Brown", "Polar"], dtype=bears_enum)
bears

"""Polar"""
"""Panda"""
"""Brown"""
"""Brown"""
"""Polar"""


In [9]:
from polars.exceptions import InvalidOperationError
try:
    pl.Series(["Polar", "Panda", "Brown", "Polar", "Shark"], dtype=bears_enum)
except InvalidOperationError as exc:
    print("InvalidOperationError:", exc)

InvalidOperationError: conversion from `str` to `enum` failed in column '' for 1 out of 5 values: ["Shark"]

Ensure that all values in the input column are present in the categories of the enum datatype.


### Category ordering and comparison

In [11]:
log_levels = pl.Enum(["debug", "info", "warning", "error"])

logs = pl.DataFrame(
    {
        "level": ["debug", "info", "debug", "error"],
        "message": [
            "process id: 525",
            "Service started correctly",
            "startup time: 67ms",
            "Cannot connect to DB!",
        ],
    },
    schema_overrides={
        "level": log_levels,
    },
)

non_debug_logs = logs.filter(
    pl.col("level") > "debug",
)
non_debug_logs

level,message
enum,str
"""info""","""Service started correctly"""
"""error""","""Cannot connect to DB!"""


## Data type Categorical

In [14]:
bears_cat = pl.Series(
    ["Polar", "Panda", "Brown", "Brown", "Polar"], dtype=pl.Categorical
)
bears_cat

"""Polar"""
"""Panda"""
"""Brown"""
"""Brown"""
"""Polar"""


Here is why Polars prefers `Enum`. Comparing a `Categorical` with a string, or with a string column, works as expected:

In [16]:
bears_cat < "Cat"

false
false
true
true
false


In [17]:
bears_str = pl.Series(
    ["Panda", "Brown", "Brown", "Polar", "Polar"],
)
print(bears_cat == bears_str)

shape: (5,)
Series: '' [bool]
[
	false
	false
	true
	false
	true
]


In [18]:
from polars.exceptions import StringCacheMismatchError

bears_cat2 = pl.Series(
    ["Panda", "Brown", "Brown", "Polar", "Polar"],
    dtype=pl.Categorical,
)

try:
    print(bears_cat == bears_cat2)
except StringCacheMismatchError as exc:
    exc_str = str(exc).splitlines()[0]
    print("StringCacheMismatchError:", exc_str)

StringCacheMismatchError: cannot compare categoricals coming from different sources, consider setting a global StringCache.


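Polars encodes each categorical value as an integer, assigned in the order in which the strings are first seen. Two categorical columns built independently can therefore use different encodings for the same strings, and in the Polars version used here comparing them raises an error instead of silently re-encoding. To make the encodings agree, create both columns under a shared `pl.StringCache()`: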

In [19]:
with pl.StringCache():
    bears_cat = pl.Series(
        ["Polar", "Panda", "Brown", "Brown", "Polar"], dtype=pl.Categorical
    )
    bears_cat2 = pl.Series(
        ["Panda", "Brown", "Brown", "Polar", "Polar"], dtype=pl.Categorical
    )

print(bears_cat == bears_cat2)

shape: (5,)
Series: '' [bool]
[
	false
	false
	true
	false
	true
]
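
To see why a shared cache matters, here is a plain-Python model of the encoding. It is only a sketch of the idea, not Polars internals; the `encode` helper is hypothetical.

```python
def encode(values, cache=None):
    """Map each string to an integer code in order of first appearance."""
    mapping = {} if cache is None else cache
    codes = [mapping.setdefault(v, len(mapping)) for v in values]
    return codes, mapping


a = ["Polar", "Panda", "Brown", "Brown", "Polar"]
b = ["Panda", "Brown", "Brown", "Polar", "Polar"]

# Independent encodings: the same string gets different codes.
codes_a, map_a = encode(a)
codes_b, map_b = encode(b)
naive = [x == y for x, y in zip(codes_a, codes_b)]

# Shared cache: both lists use one mapping, so codes are comparable.
shared = {}
shared_a, _ = encode(a, shared)
shared_b, _ = encode(b, shared)
with_cache = [x == y for x, y in zip(shared_a, shared_b)]

# Ground truth from comparing the strings directly.
truth = [x == y for x, y in zip(a, b)]

print(naive)       # disagrees with truth
print(with_cache)  # agrees with truth
```

Comparing the independently assigned codes gives a wrong answer, which is why Polars refuses to do it. With one shared mapping the integer comparison matches the string comparison.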


## Combining Categorical columns

In [22]:
import warnings

from polars.exceptions import CategoricalRemappingWarning

male_bears = pl.DataFrame(
    {
        "species": ["Polar", "Brown", "Panda"],
        "weight": [450, 500, 110],  # kg
    },
    schema_overrides={"species": pl.Categorical},
)
female_bears = pl.DataFrame(
    {
        "species": ["Brown", "Polar", "Panda"],
        "weight": [340, 200, 90],  # kg
    },
    schema_overrides={"species": pl.Categorical},
)

with warnings.catch_warnings():
    warnings.filterwarnings("ignore", category=CategoricalRemappingWarning)
    bears = pl.concat([male_bears, female_bears], how="vertical")
bears

species,weight
cat,i64
"""Polar""",450
"""Brown""",500
"""Panda""",110
"""Brown""",340
"""Polar""",200
"""Panda""",90


### Comparison between Categorical columns is not lexical

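By default, comparisons between Categorical columns follow the physical order of the categories, not alphabetical order. To compare lexically instead, pass the `ordering="lexical"` argument: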

In [24]:
with pl.StringCache():
    bears_cat = pl.Series(
        ["Polar", "Panda", "Brown", "Brown", "Polar"],
        dtype=pl.Categorical(ordering="lexical"),
    )
    bears_cat2 = pl.Series(
        ["Panda", "Brown", "Brown", "Polar", "Polar"], dtype=pl.Categorical
    )

print(bears_cat > bears_cat2)

shape: (5,)
Series: '' [bool]
[
	true
	true
	false
	false
	false
]


Otherwise, the order is inferred from the order in which the values first appear:

In [25]:
with pl.StringCache():
    bears_cat = pl.Series(
        # Polar < Panda < Brown
        ["Polar", "Panda", "Brown", "Brown", "Polar"],
        dtype=pl.Categorical,
    )
    bears_cat2 = pl.Series(
        ["Panda", "Brown", "Brown", "Polar", "Polar"], dtype=pl.Categorical
    )

print(bears_cat > bears_cat2)

shape: (5,)
Series: '' [bool]
[
	false
	false
	false
	true
	false
]
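
The two results can be reproduced with a plain-Python model of the two orderings. This is a sketch of the idea only, not how Polars implements it; the `codes` mapping stands in for the shared string cache.

```python
a = ["Polar", "Panda", "Brown", "Brown", "Polar"]
b = ["Panda", "Brown", "Brown", "Polar", "Polar"]

# Shared mapping: codes assigned in order of first appearance,
# giving Polar=0, Panda=1, Brown=2.
codes = {}
for value in a + b:
    codes.setdefault(value, len(codes))

# Physical ordering compares the integer codes.
physical = [codes[x] > codes[y] for x, y in zip(a, b)]

# Lexical ordering compares the strings themselves.
lexical = [x > y for x, y in zip(a, b)]

print(physical)
print(lexical)
```

`physical` matches the output of the default ordering above, and `lexical` matches the output obtained with `ordering="lexical"`.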
