Skip to content

Add public transfer artifact manifest#377

Draft
MaxGhenis wants to merge 6 commits intomainfrom
codex/public-transfer-artifact-manifest
Draft

Add public transfer artifact manifest#377
MaxGhenis wants to merge 6 commits intomainfrom
codex/public-transfer-artifact-manifest

Conversation

@MaxGhenis
Copy link
Copy Markdown
Contributor

@MaxGhenis MaxGhenis commented Apr 26, 2026

Summary

  • add a committed enhanced_cps_manifest_2025.json with row counts, checksums, git blob SHAs, exchange-rate/build assumptions, loss diagnostics, and weight diagnostics
  • add fast contract tests that compare the manifest, README, source CSV, H5 row counts, checksums, and weight diagnostics
  • document public transfer methodology/limitations and clarify that policybench_transfer_* files are legacy 1k artifacts while Python aliases point to the current enhanced_cps builder
  • add a separate manual workflow and Make target for uploading only the public transfer artifacts to the public Hugging Face repo
  • move disability-benefit survey amount interpretation into UK-data by emitting category leaves for PIP, DLA, and Attendance Allowance instead of synthetic/reported amount model inputs

Dependency

The disability-benefit category schema change depends on PolicyEngine UK PR #1656, which makes PIP, DLA, and Attendance Allowance category variables direct model input leaves and removes their reported amount variables from the model. The disability-benefit tests below were run with that local PE-UK branch installed.

Why

The public transfer artifact is now part of the PolicyBench path, but its provenance should not depend on the private eFRS deployment workflow. This keeps public artifact publication separate from private UKDS-derived artifacts and makes misuse harder. Disability benefit reported amounts are survey/data-prep signals, not policy model inputs, so the model should consume category leaves.

Validation

  • UV_FROZEN=0 uv run --python 3.13 ruff check policyengine_uk_data/datasets/enhanced_cps.py policyengine_uk_data/datasets/__init__.py policyengine_uk_data/datasets/policybench_transfer.py policyengine_uk_data/utils/enhanced_cps_manifest.py policyengine_uk_data/storage/write_enhanced_cps_manifest.py policyengine_uk_data/storage/upload_public_transfer_dataset.py policyengine_uk_data/tests/test_enhanced_cps_artifact_manifest.py
  • git diff --check
  • UV_FROZEN=0 uv run --python 3.13 pytest policyengine_uk_data/tests/test_enhanced_cps_artifact_manifest.py policyengine_uk_data/tests/test_policybench_transfer.py policyengine_uk_data/tests/test_release_manifest.py policyengine_uk_data/tests/test_hf_destinations.py
  • uv run --no-sync ruff check policyengine_uk_data/datasets/enhanced_cps.py policyengine_uk_data/datasets/frs.py policyengine_uk_data/datasets/imputations/frs_only.py policyengine_uk_data/tests/test_policybench_transfer.py policyengine_uk_data/tests/test_frs_only_imputation.py
  • uv run --no-sync pytest -q policyengine_uk_data/tests/test_policybench_transfer.py policyengine_uk_data/tests/test_frs_only_imputation.py policyengine_uk_data/tests/test_legacy_benefit_proxies.py

@vahid-ahmadi
Copy link
Copy Markdown
Collaborator

Safe to merge from a data-protection standpoint. This PR creates a parallel public upload path that touches none of the FRS-derived plumbing CLAUDE.md protects. Code quality is solid; tests are unusually thorough.

A few things worth a look before merge:

  1. Token scope. The workflow uses secrets.HUGGING_FACE_TOKEN. If that token has write access to -private, a future swap of PUBLIC_REPOPRIVATE_REPO could leak. Two follow-ups worth considering: un-xfail test_every_hf_upload_routes_through_guard_constants once remaining literal call sites are migrated, and/or provision a dedicated public-only token.

  2. POLICYBENCH_TRANSFER_SOURCE_FILE = ENHANCED_CPS_SOURCE_FILE is a footgun. Anyone importing the legacy alias silently gets the 28k CSV rather than the 1k file. Documented in the docstring, but either a louder warning or a runtime DeprecationWarning on import would surface this to consumers earlier.

  3. Legacy policybench_transfer_2025.h5 provenance. Was the 1,000-row legacy artifact ever FRS-derived in an earlier commit? Worth a one-liner confirmation from whoever produced it. Not a regression here.

  4. Repo size. enhanced_cps_2025.h5 is committed to git. If it's >50MB, git lfs would be friendlier; if it's small, ignore.

  5. Minor: _pick_region uses household_id * 2654435761 — a # Knuth multiplicative hash comment would help future readers.

@MaxGhenis
Copy link
Copy Markdown
Contributor Author

Pushed a fix for the uploader runtime error: upload_public_transfer_dataset now passes the required version argument through to upload_files_to_hf, defaulting to the installed policyengine-uk-data package version, and the public-upload test now uses an autospecced mock and asserts the version.

Verification:

  • ruff check policyengine_uk_data/storage/upload_public_transfer_dataset.py policyengine_uk_data/tests/test_enhanced_cps_artifact_manifest.py
  • ruff format --check policyengine_uk_data/storage/upload_public_transfer_dataset.py policyengine_uk_data/tests/test_enhanced_cps_artifact_manifest.py
  • pytest policyengine_uk_data/tests/test_enhanced_cps_artifact_manifest.py::test_public_transfer_upload_targets_public_hf_repo -q

I also tried the full artifact-manifest test file locally, but the HDF-reading tests fail in this Python 3.14 environment because PyTables cannot load libblosc2.dylib; the targeted public-upload test passes.

Current GitHub state after the push is still mergeable=CONFLICTING / mergeStateStatus=DIRTY, so I am leaving this draft until it is rebased/resolved.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants