Skip to content

Enum encoding fails on UTF-8 byte-string arrays from HDF5 datasets #487

@anth-volk

Description

@anth-volk

Context

While comparing a staged PolicyEngine US data publication, Microsimulation(dataset="US.h5") failed before any calculations ran. The failure happened while loading enum-valued dataset inputs from HDF5.

The staged US.h5 contained county/2024 as a fixed-width byte-string array (|S31). One valid county enum member name was stored as UTF-8 bytes:

b"DO\xc3\x91A_ANA_COUNTY_NM" -> "DOÑA_ANA_COUNTY_NM"

That value appeared 15 times. The policyengine-us county enum includes the corresponding member, so the value is semantically valid.

Failure

During dataset-backed simulation construction, Core calls Enum.encode() for enum variables. For fixed-width byte strings, the current path uses array.astype(str), which decodes bytes as ASCII and raises:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2: ordinal not in range(128)

The path observed was:

Simulation.build_from_dataset()
  -> set_input("county", ...)
  -> Holder._to_array()
  -> variable.possible_values.encode(...)
  -> Enum.encode()
  -> array.astype(str)

Expected behavior

For HDF5 byte-string arrays, Core should decode bytes as UTF-8 before matching enum member names. The intended conversion is:

HDF5 UTF-8 bytes -> Unicode enum member name -> EnumArray integer code

For example:

b"DO\xc3\x91A_ANA_COUNTY_NM" -> "DOÑA_ANA_COUNTY_NM" -> enum index

ASCII-only byte strings should continue to work because ASCII is valid UTF-8.

Notes

Normal situation input with the Unicode enum member name already works:

"county": {"2024": "DOÑA_ANA_COUNTY_NM"}

The failure is specific to byte-string arrays returned by HDF5/h5py dataset loading.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions