Add stage-organized pipeline artifact uploads to HuggingFace #616

@anth-volk

Motivation

Pipeline artifacts (datasets, calibration weights, logs, geographic H5 files) are currently uploaded to the production HuggingFace repo (policyengine/policyengine-us-data) with no stage organization. Calibration artifacts overwrite in place at fixed paths with no versioning. There is no unified view of what a given pipeline run produced.

This is the first phase of a broader set of pipeline improvements. Later phases will build on this foundation to add a calibration dashboard with past-run browsing, per-stage sanity checks, content-addressable caching, and variable lifecycle tracking.

Approach

Mirror existing build artifacts to a new HuggingFace repo (policyengine/policyengine-us-data-pipeline) with a stage-organized folder structure. This is purely additive: existing uploads to the production repo are untouched.

Folder structure

{run_id}/                              # UTC timestamp, e.g. "20260317T143000Z"
  stage_0_raw/
    manifest.json
    policy_data.db
  stage_1_base/
    manifest.json
    cps_2024.h5
    enhanced_cps_2024.h5
    small_enhanced_cps_2024.h5
  stage_4_source_imputed/
    manifest.json
    source_imputed_stratified_extended_cps.h5
  stage_6_weights/
    manifest.json
    calibration_weights.npy
    geography.npz
    calibration_log.csv
    unified_diagnostics.csv
    unified_run_config.json
  stage_7_local_area/
    manifest.json                      # checksums only; the ~50GB of files are too large to upload twice
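
A minimal sketch of how a local artifact could be mapped into the layout above. The helper name `repo_path` is hypothetical and not part of the proposed module; it only illustrates the `{run_id}/{stage}/{filename}` convention.

```python
from pathlib import PurePosixPath


def repo_path(run_id: str, stage: str, filename: str) -> str:
    """Build the in-repo path for one artifact (hypothetical helper).

    PurePosixPath keeps forward slashes on every OS, which is what a
    remote repo path needs.
    """
    return str(PurePosixPath(run_id) / stage / PurePosixPath(filename).name)


print(repo_path("20260317T143000Z", "stage_1_base", "cps_2024.h5"))
```

Taking only the final path component of `filename` means a local path such as `build/out/cps_2024.h5` still lands directly under its stage folder.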

Each manifest.json records stage name, run ID, timestamp, git provenance, and per-file SHA256 checksums.
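
As a rough illustration of what producing such a manifest might involve, the sketch below generates a run ID in the timestamp format shown above and hashes each file with SHA256. The function names and the JSON field names (`stage`, `run_id`, `created_at`, `git`, `files`) are assumptions made for this example, not the final schema.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def make_run_id(now=None):
    """Format a timezone-aware datetime as e.g. "20260317T143000Z" (UTC)."""
    now = now or datetime.now(timezone.utc)
    return now.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large H5 files are never fully loaded."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(stage, run_id, files, git_commit=None):
    """Assemble a manifest dict; field names here are illustrative only."""
    return {
        "stage": stage,
        "run_id": run_id,
        "created_at": datetime.now(timezone.utc).isoformat(),
        "git": {"commit": git_commit},
        "files": {
            Path(p).name: {
                "sha256": sha256_of(p),
                "size_bytes": Path(p).stat().st_size,
            }
            for p in files
        },
    }


def write_manifest(manifest, out_dir):
    """Write manifest.json into out_dir and return its path."""
    out_path = Path(out_dir) / "manifest.json"
    out_path.write_text(json.dumps(manifest, indent=2, sort_keys=True))
    return out_path
```

Because the manifest holds only checksums and metadata, the same function would also cover stage_7_local_area, where the files themselves are not mirrored.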

Implementation

  • New utility module policyengine_us_data/utils/pipeline_artifacts.py with mirror_to_pipeline() as a single-call interface
  • Hook calls added at 4 existing upload points (additive only, no changes to existing behavior)
  • Failures are logged as warnings and never block the main pipeline
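
The non-blocking behavior in the last bullet could be sketched as a thin wrapper like the one below. This is not the actual `mirror_to_pipeline()` signature, which is not specified here; `mirror_safely` and `upload_fn` are hypothetical names standing in for whatever performs the upload.

```python
import logging

logger = logging.getLogger(__name__)


def mirror_safely(upload_fn, *args, **kwargs):
    """Call upload_fn and report success without ever raising.

    Any exception is logged as a warning so that a failed mirror
    cannot interrupt the main pipeline.
    """
    try:
        upload_fn(*args, **kwargs)
        return True
    except Exception as exc:  # deliberately broad: mirroring is best-effort
        logger.warning("Pipeline artifact mirror failed, continuing: %s", exc)
        return False


def failing_upload():
    """Stand-in for an upload that hits a network error."""
    raise ConnectionError("simulated network failure")


# The main pipeline proceeds whether or not the mirror succeeded.
mirrored = mirror_safely(failing_upload)
```

Returning a boolean rather than raising lets each of the existing upload points record whether its mirror succeeded without changing its own control flow.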
