[docs] Design sketch: content-addressed microdata publishing#311

Merged
MaxGhenis merged 3 commits into main from data-publishing-design
Apr 20, 2026
Conversation

@MaxGhenis
Contributor

Design sketch for what we'd build if publishing PolicyEngine microdata from scratch. Explicitly a sketch, not an accepted plan — written to capture the shape while today's session has the context loaded.

Motivation

Today's stack layers PyPI (country models), PyPI and HF (country data), GitHub repos (build code), and policyengine.py release manifests (sha256 pins). Refreshing any link touches six files across three repos and mixes identity ("what dataset is this?") with distribution ("where do I get it?") and discovery ("what's current?").

Two pain points from today's session that the sketch resolves:

  1. HF forces every repo to be either a model repo or a dataset repo. us-data publishes as a model repo, uk-data is private-gated; Bump us-data 1.73.0 → 1.78.2 + fix HF model/dataset repo detection (#310) hit this. A plain object-store path has one shape.
  2. "Is our data out of date?" required spelunking, because `inputs.model_package.sha256` isn't stored anywhere queryable. Under the sketch, CI diffs it against the active channel's manifest.
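The staleness check in point 2 could be a 50-line CI job at most; the core of it is a one-field diff. A minimal sketch (the bucket layout, channel path, and manifest field names are illustrative, not an agreed schema):

```python
import json


def is_stale(pinned_sha256: str, channel_manifest: dict) -> bool:
    """Compare the sha256 pinned in policyengine.py's release manifest
    against the one advertised by the active channel's manifest."""
    current = channel_manifest["inputs"]["model_package"]["sha256"]
    return current != pinned_sha256


# In CI the manifest would be fetched from the channel; inlined here.
manifest = {"inputs": {"model_package": {"sha256": "abc123"}}}
assert not is_stale("abc123", manifest)  # up to date
assert is_stale("000000", manifest)      # data drifted
```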

Shape

s3://policyengine-data/
  us/{build_id}/
    enhanced_cps_2024.parquet     # content-addressed payload
    manifest.json                 # 2 KB: schema, inputs, digests
  channels/us/
    latest.json                   # { "build_id": "..." }
    stable.json
  • build_id = sha256 of inputs (data vintage + model wheel + calibration sha + seed + lockfile). Deterministic; retagging impossible.
  • Channels are tiny JSON pointers. Bumping stable is one S3 PUT.
  • manifest.json carries schema + entity map, so agents learn column types without downloading 100 MB.
  • pe.py's consumer resolves channel → build_id → manifest → payload with sha256 verification.
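The resolver in the last bullet is small enough to sketch end to end. Here `fetch` stands in for an S3/HTTP GET, the key layout mirrors the tree above, and the manifest is assumed to carry a `digests` map with a single payload (all names are illustrative):

```python
import hashlib
import json


def resolve(fetch, country: str, channel: str) -> bytes:
    """channel -> build_id -> manifest -> payload, with sha256 verification."""
    pointer = json.loads(fetch(f"channels/{country}/{channel}.json"))
    build_id = pointer["build_id"]
    manifest = json.loads(fetch(f"{country}/{build_id}/manifest.json"))
    # This sketch assumes a single payload per build.
    name, expected = next(iter(manifest["digests"].items()))
    payload = fetch(f"{country}/{build_id}/{name}")
    if hashlib.sha256(payload).hexdigest() != expected:
        raise ValueError(f"sha256 mismatch for {name}")
    return payload


# Exercise against an in-memory object store.
payload = b"parquet bytes"
store = {
    "channels/us/latest.json": json.dumps({"build_id": "b1"}).encode(),
    "us/b1/manifest.json": json.dumps(
        {"digests": {"enhanced_cps_2024.parquet": hashlib.sha256(payload).hexdigest()}}
    ).encode(),
    "us/b1/enhanced_cps_2024.parquet": payload,
}
assert resolve(store.__getitem__, "us", "latest") == payload
```

Bumping a channel is then literally one PUT of the two-line pointer JSON, and verification fails closed if the payload and manifest ever disagree.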

What stays

  • Country data repos still own construction (boundary from release-bundles.md intact)
  • TRACE TRO sidecars still ship with policyengine.py, same trov: / pe: vocabulary
  • UK privacy boundary: bucket can be org-gated

Migration cost

~3 engineer-weeks if genuinely pursued.

Explicit non-goals

  • Doesn't solve model drift (pe-us 1.653 and 1.700 will still produce different enhanced CPS)
  • Doesn't solve calibration (still an open research problem)

Open questions (in the doc)

  • GCS vs S3
  • Parquet vs HDF5 for new builds
  • Channel-history signing (transparency log)
  • Single bucket or per-country namespaces

This PR

Adds docs/data-publishing-design.md and wires it into the Quarto nav under Reference. No code change. Merge-neutral for v4.x — this is a research doc.

Companion to docs/release-bundles.md. Sketches what we'd build if we
started over: content-addressed object storage (build_id = sha256 of
inputs), release channels (latest/stable/lts-* pointing at build_ids),
a one-command builder CLI, and a pe.py consumer that resolves
channel -> manifest -> payload with sha256 verification.

Motivation: today's stack layers PyPI + HF + GitHub + release
manifests, and refreshing any link touches six files across three
repos. Today's session hit it twice — the us-data HF path was a
model repo not a dataset repo (broke refresh_release_bundle until we
fixed it), and answering "is our data out of date?" required
spelunking because inputs.model_package.sha256 isn't stored anywhere.

Keeps intact:
- Country data repos own construction (release-bundles.md boundary)
- TRACE TRO sidecars with the same trov:/pe: vocabulary
- UK privacy boundary (bucket can be org-gated)

Explicitly a design sketch, not an accepted plan. Open questions
section flags GCS vs S3, parquet vs HDF5, channel-history signing,
and single-bucket-vs-namespaces.
Subagent stress-test surfaced five scenarios the sketch handled
clumsily or not at all. Rewritten to:

- State upfront that the motivating pains (#310 HF repo-type bug,
  "is data stale?") do NOT require this architecture — they're
  solvable with a one-time HF repo-type fix + a 50-line CI job.
  The sketch is a "where do we want to be in a year?" question,
  not an immediate-fix proposal.

- Replace the single "Open questions" section with an explicit
  "Unresolved risks" section covering:
  * UK Data Service audit trail (today HF logs downloader identity;
    sketch loses this unless we explicitly gate manifests + log
    resolver hits)
  * Silent-promote attack (channel JSON has no signature; sketch
    is strictly weaker than PyPI/HF platform auth until
    channel-history signing ships)
  * Non-deterministic builds (today's Enhanced CPS pipeline uses
    torch+pandas imputation; v1 needs If-None-Match conditional
    writes or explicit first-writer-wins semantics)
  * Licence revocation vs immutability (tombstone build_ids with
    status=revoked, explicit licence-continuity qualifier on the
    replicability guarantee)
  * Cross-cloud replication (mirror story is payload-only; channels
    require proxy or consumer multi-mirror config)

- Revised cost estimate: the earlier "3 engineer-weeks" was ~3x
  optimistic. Realistic range is 8-12 engineer-weeks.
  Recategorised as v5 scope.

No change to the core three-concepts model (identity /
distribution / discovery separation). That part held up.
@MaxGhenis
Contributor Author

Stress-test review folded in (commit 10fa058)

Subagent stress-tested the sketch against five scenarios. Four needed explicit design answers; one meta-flag affected framing.

Core architecture held up. The identity/distribution/discovery separation and the channels+manifest layer survive the review intact. No change to the shape.

What changed in the doc:

  1. Honest framing up top. The motivating pains (#310's HF repo-type bug and "is our data stale?") are fixable with a one-time HF migration plus a 50-line CI job. The sketch is a cleaner architecture, but adopting it is a "where do we want to be in a year?" question, not an immediate fix. Readers should pick it for architectural reasons, not because they think it solves those two bugs.

  2. New "Unresolved risks" section covering all five scenarios with concrete "decision needed" notes:

    • UK Data Service audit trail (today HF logs downloader identity; sketch needs explicit gating on manifest fetches + resolver-hit logging)
    • Silent-promote attack (channel JSON is unsigned → sketch is strictly weaker than PyPI/HF until channel signing ships → must promote signing to v1)
    • Non-deterministic builds (today's Enhanced CPS pipeline has known non-determinism → need If-None-Match conditional writes or explicit first-writer-wins semantics)
    • Licence revocation vs immutability (tombstones + qualified replicability guarantee)
    • Cross-cloud replication (payloads mirror trivially; channels don't)
  3. Revised cost estimate: 8–12 engineer-weeks, recategorised as v5 scope. The earlier "3 weeks" was ~3x optimistic.

The core three-concepts model is unchanged — just the operational realism around it.
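The first-writer-wins semantics flagged under non-deterministic builds can be simulated locally with an exclusive-create open; on S3 the analogue would be a conditional write (`If-None-Match: *` on PutObject). The filesystem version below is only an illustration of the semantics, not the proposed implementation:

```python
import tempfile
from pathlib import Path


def put_if_absent(path: Path, payload: bytes) -> bool:
    """Write payload only if nothing exists at path (first writer wins).
    Returns True iff this call created the object."""
    try:
        with open(path, "xb") as f:  # 'x' mode fails if the file exists
            f.write(payload)
        return True
    except FileExistsError:
        return False


with tempfile.TemporaryDirectory() as d:
    target = Path(d) / "manifest.json"
    assert put_if_absent(target, b"first build") is True
    assert put_if_absent(target, b"second build") is False  # losing writer
    assert target.read_bytes() == b"first build"            # no silent overwrite
```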


Codex review caught the main architectural overclaim in the earlier
sketch: it slid between recipe-addressed ("sha256 of inputs" — an
identifier derived from declared inputs) and content-addressed
("sha256 of output bytes" — an identifier derived from the bytes
themselves), and framed the whole design as a replacement for
release-bundles.md when release-bundles is load-bearing for the
scientific citation and certification surface.

Rewritten to:

- Scope the sketch explicitly to a storage substrate. release-bundles
  remains the authoritative citation + certification surface; the
  substrate is infrastructure underneath.

- Switch the primary identifier from `build_id = sha256(inputs)` to
  `artifact_sha256 = sha256(output bytes)`. Input digest becomes a
  derived queryable field in the manifest, not the primary key.
  This is how OCI/Nix actually work in the parts that deliver.

- Drop the `stable` and `lts-{quarter}` channel names. Their
  semantics for microdata are ambiguous (four meanings per codex:
  "latest official source vintage" vs "methodologically preferred
  reconstruction" vs "legally redistributable build" vs
  "paper-citation freeze"). Keep only `latest` (operational) and
  `next` (staging, feeds certification). Authoritative / stable
  stays on the release-bundle side.

- Drop claims of org-independent identity. `data_vintage:
  "cps_asec_2024"` is a label, not a raw-bytes hash; `built_at` /
  `built_by` break bitwise identity across orgs anyway. The current
  release-bundles schema records raw-input hashes, so regressing
  on that would be real.

- New section: "The release-bundle boundary (what doesn't change)"
  spelling out that certification, staged promotion, compatibility
  rules, `*.trace.tro.jsonld` sidecars, and the replicability
  guarantee all remain in release-bundles.md.

- Revised "whether to pursue" section leads with the honest
  conclusion: keep the storage idea, drop the "replace release
  bundles" framing, don't build it to fix #310 or "is our data
  stale?" (which have cheap targeted fixes), and revisit if the UK
  Data Service relationship gets stricter.

- Honest migration cost table (7-11 engineer-weeks, independent
  tracks), explicitly v5 scope.

Both review findings (general-purpose + codex) carried forward
under "Unresolved risks"; that section barely changed except that
"non-deterministic builds" is now actually *cleaner* under output-
hash identity — two runs produce two different sha256s, they don't
silently collide.
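The recipe-addressed vs content-addressed distinction fits in a few lines. Everything below (the input fields, the manifest shape) is illustrative; the point is only that the two digests are computed from different things:

```python
import hashlib
import json

# Recipe-addressed: a digest of the *declared inputs*, canonicalised.
inputs = {"data_vintage": "cps_asec_2024", "seed": 42}
input_digest = hashlib.sha256(
    json.dumps(inputs, sort_keys=True).encode()
).hexdigest()

# Content-addressed: a digest of the *output bytes* themselves.
output_bytes = b"...parquet payload..."
artifact_sha256 = hashlib.sha256(output_bytes).hexdigest()

# Under output-hash identity, two non-deterministic runs of the same recipe
# yield two distinct artifact_sha256 values rather than silently colliding;
# the input digest becomes a derived, queryable manifest field.
manifest = {"artifact_sha256": artifact_sha256, "input_digest": input_digest}
```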

Structure now: scope / motivating pains (and what's actually on the
critical path) / what the substrate provides / output-hash identity
/ narrow channel semantics / release-bundle boundary preservation /
consumer resolver changes / unresolved risks / what this fixes
vs. what it doesn't / honest cost / whether to pursue / open
questions.
@MaxGhenis
Contributor Author

Codex review folded in — structural rewrite (commit f951f5a)

Codex caught a load-bearing overclaim the first review missed: the sketch slid between recipe-addressed (sha256 of declared inputs) and content-addressed (sha256 of output bytes). Those are meaningfully different, and the earlier draft used "content-addressed" to claim properties only true of the latter.

Second, Codex flagged that the earlier framing ("replace PyPI + HF + GitHub + release manifests") was actively regressive: it proposed collapsing a scientific-citation-and-certification surface into a storage primitive. That's the pattern where a separate certification layer reappears on top within 2 years (codex: 60–85% probability), and we'd end up with two systems.

What changed in the rewrite

  • Scoped to storage substrate only. release-bundles.md remains authoritative for certification + citation. The substrate is infrastructure underneath, not a replacement.
  • Primary identifier is artifact_sha256 (sha256 of output bytes), not build_id (sha256 of inputs). Matches what OCI / Nix actually do in the parts that work.
  • Dropped stable and lts-* channels. Codex: "stable" for microdata can mean four different things. Kept only latest (operational) and next (staging → feeds certification). "Authoritative" stays on the release-bundle side.
  • Dropped the "org-independent identity" claim. data_vintage: "cps_asec_2024" is a label, not a hash; built_at/built_by break bitwise identity anyway. The current release-bundles schema actually records raw-input hashes, so regressing on that would be real.
  • New section: "The release-bundle boundary (what doesn't change)" makes the preservation explicit.
  • Honest cost table: 7–11 engineer-weeks, independent tracks, v5 scope at most.
  • Honest "whether to pursue": keep the storage idea, drop the replace-release-bundles framing, don't build it to fix #310 (Bump us-data 1.73.0 → 1.78.2 + fix HF model/dataset repo detection) or "is data stale" (cheap targeted fixes exist), revisit if the UK Data Service relationship gets stricter.

The doc's load-bearing sentence, now explicit: "When a release bundle is certified, it promotes an artifact from channels/next to a concrete artifact_sha256 pin in the country release manifest. After that, the release manifest is what papers cite; the storage channel is just the cache."
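The promotion step in that sentence could be as small as copying one pin; a sketch under the assumption that both the channel pointer and the release manifest are plain JSON documents (all field names hypothetical):

```python
def promote(next_channel: dict, release_manifest: dict) -> dict:
    """Certification promotes channels/next's artifact into a concrete
    artifact_sha256 pin in the country release manifest. The release
    manifest is what papers cite; the storage channel is just the cache."""
    pinned = dict(release_manifest)  # leave the input manifest untouched
    pinned["artifact_sha256"] = next_channel["artifact_sha256"]
    return pinned


release = promote(
    {"artifact_sha256": "deadbeef"},
    {"country": "us", "version": "1.78.2"},
)
assert release["artifact_sha256"] == "deadbeef"
```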

Both prior review's risks (UK audit, silent-promote, non-determinism, licence revocation, cross-cloud mirroring) carried forward. "Non-deterministic builds" is actually cleaner under output-hash identity — two runs produce two distinct artifact_sha256 values rather than silently colliding.

@MaxGhenis MaxGhenis merged commit 4e7a602 into main Apr 20, 2026
1 check passed