Enum encoding fails on UTF-8 byte-string arrays from HDF5 datasets

## Context

While comparing a staged PolicyEngine US data publication, `Microsimulation(dataset="US.h5")` failed before any calculations ran. The failure happened while loading enum-valued dataset inputs from HDF5.

The staged `US.h5` contained `county/2024` as a fixed-width byte-string array (`|S31`). One valid county enum member name was stored as UTF-8 bytes:

```text
b"DO\xc3\x91A_ANA_COUNTY_NM" -> "DOÑA_ANA_COUNTY_NM"
```

That value appeared 15 times. The `policyengine-us` county enum includes the corresponding member, so the value is semantically valid.

## Failure

During dataset-backed simulation construction, Core calls `Enum.encode()` for enum variables. For fixed-width byte strings, the current path uses `array.astype(str)`, which decodes bytes as ASCII and raises:

```text
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2: ordinal not in range(128)
```

The path observed was:

```text
Simulation.build_from_dataset()
  -> set_input("county", ...)
  -> Holder._to_array()
  -> variable.possible_values.encode(...)
  -> Enum.encode()
  -> array.astype(str)
```

## Expected behavior

For HDF5 byte-string arrays, Core should decode bytes as UTF-8 before matching enum member names. The intended conversion is:

```text
HDF5 UTF-8 bytes -> Unicode enum member name -> EnumArray integer code
```

For example:

```text
b"DO\xc3\x91A_ANA_COUNTY_NM" -> "DOÑA_ANA_COUNTY_NM" -> enum index
```

ASCII-only byte strings should continue to work because ASCII is valid UTF-8.

## Notes

Normal situation input with the Unicode enum member name already works:

```python
"county": {"2024": "DOÑA_ANA_COUNTY_NM"}
```

The failure is specific to byte-string arrays returned by HDF5/h5py dataset loading.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Enum encoding fails on UTF-8 byte-string arrays from HDF5 datasets #487

Context

Failure

Expected behavior

Notes

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Enum encoding fails on UTF-8 byte-string arrays from HDF5 datasets #487

Description

Context

Failure

Expected behavior

Notes

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions