Add entity-level HDFStore output format alongside h5py #568

Draft
anth-volk wants to merge 1 commit into main from add-hdfstore-output

@anth-volk
Collaborator

Fixes #567

Summary

  • stacked_dataset_builder.py now produces a pandas HDFStore file (.hdfstore.h5) alongside the existing h5py file, with one table per entity and an embedded uprating manifest
  • The upload pipeline sends HDFStore files to dedicated subdirectories (states_hdfstore/, districts_hdfstore/, cities_hdfstore/)
  • A comparison test validates that both formats contain identical data for all ~183 variables

Test plan

  • Run stacked_dataset_builder on a single CD/state and confirm both .h5 and .hdfstore.h5 files are created
  • Run pytest test_format_comparison.py --h5py-path STATE.h5 --hdfstore-path STATE.hdfstore.h5 and confirm all variables match
  • Verify HDFStore contains _variable_metadata manifest with correct entity and uprating columns
  • Verify all 6 entity tables are present with correct row counts
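
The last two checks could be scripted roughly as follows. This is a minimal sketch, not part of the PR: `missing_keys` is a hypothetical helper, and it assumes the key list comes from `pd.HDFStore(path).keys()`, which returns keys with a leading slash. The entity names are the six listed in the commit message.

```python
# Entity table names as listed in the commit message.
EXPECTED_TABLES = (
    "person",
    "household",
    "tax_unit",
    "spm_unit",
    "family",
    "marital_unit",
)
# Manifest key as named in the test plan.
MANIFEST_KEY = "_variable_metadata"


def missing_keys(store_keys):
    """Return the expected keys that are absent from store_keys.

    Hypothetical helper: store_keys is a list such as the one returned by
    pandas' HDFStore.keys(), e.g. ["/person", "/household"].
    """
    present = {key.lstrip("/") for key in store_keys}
    expected = EXPECTED_TABLES + (MANIFEST_KEY,)
    return [name for name in expected if name not in present]
```

An empty result means all six entity tables and the manifest are present; row counts would still need a separate check against the h5py file.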

🤖 Generated with Claude Code

The stacked_dataset_builder now produces a pandas HDFStore file
(.hdfstore.h5) in addition to the existing h5py file. The HDFStore
contains one table per entity (person, household, tax_unit, spm_unit,
family, marital_unit) plus an embedded _variable_metadata manifest
recording each variable's entity and uprating parameter path.
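
To illustrate the layout described above, here is a minimal sketch that writes and reads back a toy file with pandas. It is not the builder's code: the toy data, the manifest column names, and the example uprating path are assumptions, only two of the six entities are shown, and running it requires the PyTables package that pandas uses for HDF5.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy data: two of the six entities, a few rows each.
entity_frames = {
    "person": pd.DataFrame({"person_id": [1, 2, 3], "age": [34, 36, 5]}),
    "household": pd.DataFrame(
        {"household_id": [10], "household_weight": [1250.0]}
    ),
}

# Manifest with one row per variable. The column names and the uprating
# path shown here are assumptions made for illustration.
manifest = pd.DataFrame(
    {
        "variable": ["age", "household_weight"],
        "entity": ["person", "household"],
        "uprating": ["none", "example.uprating.path"],
    }
)

path = os.path.join(tempfile.mkdtemp(), "example.hdfstore.h5")

# Write one table per entity, then the manifest under its own key.
with pd.HDFStore(path, mode="w") as store:
    for entity, frame in entity_frames.items():
        store.put(entity, frame, format="table")
    store.put("_variable_metadata", manifest, format="table")

# Read back: list the keys, count rows without loading the table, and
# load the manifest.
with pd.HDFStore(path, mode="r") as store:
    keys = sorted(store.keys())
    person_rows = store.get_storer("person").nrows
    manifest_back = store["_variable_metadata"]
```

Keeping the manifest inside the same file means a reader can recover each variable's entity and uprating path without any external lookup.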

The upload pipeline uploads HDFStore files to dedicated subdirectories
(states_hdfstore/, districts_hdfstore/, cities_hdfstore/).
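
The destination naming can be sketched as below. This is an illustration under assumptions: `hdfstore_upload_path` and the `area_type` keys are hypothetical names, and only the three subdirectory names and the `.hdfstore.h5` suffix come from this PR.

```python
from pathlib import PurePosixPath

# Subdirectory names are from the PR description; the dictionary keys
# ("state", "district", "city") are hypothetical labels.
HDFSTORE_SUBDIRS = {
    "state": "states_hdfstore",
    "district": "districts_hdfstore",
    "city": "cities_hdfstore",
}


def hdfstore_upload_path(area_type, h5_filename):
    """Map an h5py filename such as 'STATE.h5' to its HDFStore upload path."""
    if h5_filename.endswith(".hdfstore.h5"):
        raise ValueError(f"already an HDFStore filename: {h5_filename}")
    if not h5_filename.endswith(".h5"):
        raise ValueError(f"expected a .h5 filename: {h5_filename}")
    stem = h5_filename[: -len(".h5")]
    subdir = PurePosixPath(HDFSTORE_SUBDIRS[area_type])
    return str(subdir / f"{stem}.hdfstore.h5")
```

For example, `hdfstore_upload_path("state", "STATE.h5")` gives `states_hdfstore/STATE.hdfstore.h5`.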

A comparison test (test_format_comparison.py) validates that both
formats contain identical data for all variables.
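
The idea behind such a comparison can be sketched in plain Python. This is not the contents of test_format_comparison.py: `compare_variables` is a hypothetical helper, and it assumes both files have already been loaded into dictionaries mapping variable names to sequences of values.

```python
import math


def _same(a, b, rel_tol, abs_tol):
    """Compare two scalars, treating a pair of NaNs as equal."""
    if isinstance(a, float) and isinstance(b, float):
        if math.isnan(a) and math.isnan(b):
            return True
        return math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)
    return a == b


def compare_variables(h5py_values, hdfstore_values, rel_tol=0.0, abs_tol=0.0):
    """Return the sorted names of variables that differ between two mappings.

    A variable counts as mismatched if it is missing from either mapping,
    if the lengths differ, or if any pair of values differs.
    """
    mismatched = []
    for name in sorted(set(h5py_values) | set(hdfstore_values)):
        left = h5py_values.get(name)
        right = hdfstore_values.get(name)
        if left is None or right is None or len(left) != len(right):
            mismatched.append(name)
            continue
        if not all(_same(a, b, rel_tol, abs_tol) for a, b in zip(left, right)):
            mismatched.append(name)
    return mismatched
```

With the default tolerances of zero the check is exact, which matches the stated goal of identical data; a nonzero tolerance is only useful if a format round trip is expected to change floating-point precision.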

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>