Skip to content

Fix enum encoding for UTF-8 byte strings#488

Merged
anth-volk merged 2 commits intomasterfrom
fix/utf8-byte-enum-encoding
May 7, 2026
Merged

Fix enum encoding for UTF-8 byte strings#488
anth-volk merged 2 commits intomasterfrom
fix/utf8-byte-enum-encoding

Conversation

@anth-volk
Copy link
Copy Markdown
Collaborator

@anth-volk anth-volk commented May 7, 2026

Fixes #487

Context

This came up while comparing a staged PolicyEngine US data publication. The staged US.h5 file stored county/2024 as an HDF5 fixed-width byte-string array, and 15 rows contained the valid enum member name DOÑA_ANA_COUNTY_NM as UTF-8 bytes:

b"DO\xc3\x91A_ANA_COUNTY_NM"

policyengine-us already has DOÑA_ANA_COUNTY_NM in its county enum, and normal situation input with the Unicode string works. The failure was specific to dataset-backed simulation loading: h5py returns HDF5 string data as bytes, then Core's enum encoding path used array.astype(str), which attempts ASCII decoding and raises before Microsimulation(dataset=...) can finish constructing.

The intended conversion is:

HDF5 UTF-8 bytes -> Unicode enum member name -> EnumArray integer code

Summary

  • Decode fixed-width byte-string arrays as UTF-8 before enum-name lookup.
  • Add a regression test for HDF5-style UTF-8 byte strings with a non-ASCII enum member name.
  • Document in the test that the ñ character is intentional because it exercises the non-ASCII byte-string decode failure.
  • Add provider-neutral AI guidance for tests, documentation review, PR creation, and changelog fragments so agents check and add towncrier fragments before opening PRs.
  • Add towncrier changelog fragments under changelog.d/.

Testing

  • uv run pytest tests/core/enums/test_enum.py
  • uv run ruff format policyengine_core/enums/enum.py tests/core/enums/test_enum.py
  • uv run ruff check policyengine_core/enums/enum.py tests/core/enums/test_enum.py
  • uv run --no-sync --with towncrier towncrier build --draft --version 0.0.0

@anth-volk anth-volk force-pushed the fix/utf8-byte-enum-encoding branch from f88de77 to d6f0183 Compare May 7, 2026 22:25
@anth-volk anth-volk force-pushed the fix/utf8-byte-enum-encoding branch from d6f0183 to a658c1f Compare May 7, 2026 22:30
@anth-volk anth-volk marked this pull request as ready for review May 7, 2026 23:45
@anth-volk anth-volk merged commit 0f93efd into master May 7, 2026
23 checks passed
@anth-volk anth-volk deleted the fix/utf8-byte-enum-encoding branch May 7, 2026 23:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Enum encoding fails on UTF-8 byte-string arrays from HDF5 datasets

1 participant