Description
Motivation
Pipeline artifacts (datasets, calibration weights, logs, geographic H5 files) are currently uploaded to the production HuggingFace repo (policyengine/policyengine-us-data) with no stage organization. Calibration artifacts overwrite in place at fixed paths with no versioning. There is no unified view of what a given pipeline run produced.
This is the first phase of a broader set of pipeline improvements. Later phases will build on this foundation to add a calibration dashboard with past-run browsing, per-stage sanity checks, content-addressable caching, and variable lifecycle tracking.
Approach
Mirror existing build artifacts to a new HuggingFace repo (policyengine/policyengine-us-data-pipeline) with a stage-organized folder structure. This is purely additive: existing uploads to the production repo are untouched.
Folder structure
{run_id}/                          # UTC timestamp, e.g. "20260317T143000Z"
  stage_0_raw/
    manifest.json
    policy_data.db
  stage_1_base/
    manifest.json
    cps_2024.h5
    enhanced_cps_2024.h5
    small_enhanced_cps_2024.h5
  stage_4_source_imputed/
    manifest.json
    source_imputed_stratified_extended_cps.h5
  stage_6_weights/
    manifest.json
    calibration_weights.npy
    geography.npz
    calibration_log.csv
    unified_diagnostics.csv
    unified_run_config.json
  stage_7_local_area/
    manifest.json                  # checksums only, no files (~50GB, too large to double-upload)
Each manifest.json records stage name, run ID, timestamp, git provenance, and per-file SHA256 checksums.
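As an illustration of what such a manifest might contain, here is a minimal sketch that builds the run ID and per-file SHA256 checksums. The helper names (sha256_of, make_run_id, build_manifest) and the JSON key names are assumptions made for this example, not the schema used by the actual module.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large H5 files never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_run_id(now=None):
    """Format a UTC timestamp as a run ID, e.g. "20260317T143000Z"."""
    now = now or datetime.now(timezone.utc)
    return now.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")


def build_manifest(stage, run_id, files, git_commit=None):
    """Assemble a manifest dict; the key names here are hypothetical."""
    return {
        "stage": stage,
        "run_id": run_id,
        "created_at": datetime.now(timezone.utc).isoformat(),
        "git": {"commit": git_commit},
        "files": {
            Path(p).name: {
                "sha256": sha256_of(p),
                "size_bytes": Path(p).stat().st_size,
            }
            for p in files
        },
    }
```

Because stage_7_local_area stores checksums only, a manifest of this shape could be written for that stage without uploading the files themselves.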
Implementation
- New utility module policyengine_us_data/utils/pipeline_artifacts.py with mirror_to_pipeline() as a single-call interface
- Hook calls added at 4 existing upload points (additive only, no changes to existing behavior)
- Failures are logged as warnings and never block the main pipeline
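The non-blocking behavior described above could look roughly like the sketch below. The function name mirror_stage_safely, its signature, and the injected upload callable are all hypothetical; the real mirror_to_pipeline() may be shaped differently. The uploader is passed in so the sketch stays self-contained; in practice it would wrap whatever HuggingFace upload call the pipeline already uses.

```python
import logging
from pathlib import Path

logger = logging.getLogger("pipeline_artifacts_sketch")


def mirror_stage_safely(run_id, stage, files, upload):
    """Mirror files to {run_id}/{stage}/ and never raise.

    `upload` is a hypothetical callable taking (local_path, path_in_repo).
    Returns True on success and False if any upload failed.
    """
    try:
        for local_path in files:
            path_in_repo = f"{run_id}/{stage}/{Path(local_path).name}"
            upload(local_path, path_in_repo)
        return True
    except Exception as exc:
        # A mirror failure is logged as a warning; the caller continues.
        logger.warning("Mirroring %s failed (non-fatal): %s", stage, exc)
        return False
```

Catching the broad Exception is deliberate in this sketch: the mirror is a secondary copy, so any failure in it should degrade to a log line and leave the production upload path unaffected.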