Context
While comparing a staged PolicyEngine US data publication, Microsimulation(dataset="US.h5") failed before any calculations ran. The failure happened while loading enum-valued dataset inputs from HDF5.
The staged US.h5 contained county/2024 as a fixed-width byte-string array (|S31). One valid county enum member name was stored as UTF-8 bytes:
b"DO\xc3\x91A_ANA_COUNTY_NM" -> "DOÑA_ANA_COUNTY_NM"
That value appeared 15 times. The policyengine-us county enum includes the corresponding member, so the value is semantically valid.
Failure
During dataset-backed simulation construction, Core calls Enum.encode() for enum variables. For fixed-width byte strings, the current path uses array.astype(str), which decodes bytes as ASCII and raises:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2: ordinal not in range(128)
The path observed was:
Simulation.build_from_dataset()
-> set_input("county", ...)
-> Holder._to_array()
-> variable.possible_values.encode(...)
-> Enum.encode()
-> array.astype(str)
Expected behavior
For HDF5 byte-string arrays, Core should decode bytes as UTF-8 before matching enum member names. The intended conversion is:
HDF5 UTF-8 bytes -> Unicode enum member name -> EnumArray integer code
For example:
b"DO\xc3\x91A_ANA_COUNTY_NM" -> "DOÑA_ANA_COUNTY_NM" -> enum index
ASCII-only byte strings should continue to work because ASCII is valid UTF-8.
Notes
Normal situation input with the Unicode enum member name already works:
"county": {"2024": "DOÑA_ANA_COUNTY_NM"}
The failure is specific to byte-string arrays returned by HDF5/h5py dataset loading.
Context
While comparing a staged PolicyEngine US data publication,
Microsimulation(dataset="US.h5")failed before any calculations ran. The failure happened while loading enum-valued dataset inputs from HDF5.The staged
US.h5containedcounty/2024as a fixed-width byte-string array (|S31). One valid county enum member name was stored as UTF-8 bytes:That value appeared 15 times. The
policyengine-uscounty enum includes the corresponding member, so the value is semantically valid.Failure
During dataset-backed simulation construction, Core calls
Enum.encode()for enum variables. For fixed-width byte strings, the current path usesarray.astype(str), which decodes bytes as ASCII and raises:The path observed was:
Expected behavior
For HDF5 byte-string arrays, Core should decode bytes as UTF-8 before matching enum member names. The intended conversion is:
For example:
ASCII-only byte strings should continue to work because ASCII is valid UTF-8.
Notes
Normal situation input with the Unicode enum member name already works:
The failure is specific to byte-string arrays returned by HDF5/h5py dataset loading.